---
title: Slurm cluster 'merlin7' configuration
keywords: configuration, partitions, node definition
summary: "This document describes a summary of the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This documentation describes the basic Slurm configuration and the options needed to run jobs in the Merlin7 cluster.
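As a point of reference, the sketch below shows a minimal batch script of the kind these options apply to. The partition name is a placeholder rather than an actual Merlin7 partition, and the resource values are arbitrary examples.

```bash
#!/bin/bash
#SBATCH --job-name=example        # job name shown in squeue
#SBATCH --partition=<partition>   # placeholder: replace with a Merlin7 partition
#SBATCH --ntasks=1                # number of tasks
#SBATCH --cpus-per-task=1         # cores per task
#SBATCH --mem=4G                  # memory per node (example value)
#SBATCH --time=00:10:00           # wall-time limit

# Report where the job ran
srun hostname
```

Submit the script with `sbatch job.sh` and check its state with `squeue -u $USER`.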

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 GPU A100 nodes
* 8 GPU Grace Hopper nodes

The specification of the node types is:

| Node           | #Nodes | CPU                                                              | RAM                      | GRES                 |
|:---------------|:------:|:-----------------------------------------------------------------|:-------------------------|:---------------------|
| Login Nodes    | 2      | 2x AMD EPYC 7742 (x86_64 Rome, 64 cores, 2.25 GHz)               | 512 GB DDR4 3200 MHz     |                      |
| CPU Nodes      | 77     | 2x AMD EPYC 7742 (x86_64 Rome, 64 cores, 2.25 GHz)               | 512 GB DDR4 3200 MHz     |                      |
| A100 GPU Nodes | 8      | 2x AMD EPYC 7713 (x86_64 Milan, 64 cores, 3.2 GHz)               | 512 GB DDR4 3200 MHz     | 4x NV_A100 (80 GB)   |
| GH GPU Nodes   | 5      | 2x NVIDIA Grace Neoverse-V2 (SBSA ARM 64-bit, 144 cores, 3.1 GHz) | 2x 480 GB DDR5X (CPU+GPU) | 4x NV_GH200 (120 GB) |
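To cross-check these definitions against what Slurm itself reports, the standard Slurm query commands can be used; the node name and the GRES request below are placeholders rather than confirmed Merlin7 values.

```bash
# List every node with its CPU count, memory and generic resources (GRES)
sinfo --Node --Format=nodelist,cpus,memory,gres

# Show the full Slurm definition of a single node (placeholder name)
scontrol show node <nodename>

# Request one GPU for a job; the exact GRES string depends on the cluster setup
sbatch --gres=gpu:1 --wrap="nvidia-smi"
```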

### Network

The Merlin7 cluster builds on HPE/Cray technologies, including the high-performance Slingshot network fabric, which provides up to 200 Gbit/s of throughput between nodes. Further information on Slingshot is available from HPE and at https://www.glennklockwood.com/garden/slingshot.

Through software interfaces such as libFabric (which is available on Merlin7), applications can leverage the network seamlessly.
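As an illustration, the `fi_info` utility shipped with libfabric can be used to inspect which providers are available on a node. The `cxi` provider is the one generally associated with Slingshot; whether it is the provider exposed on Merlin7 is an assumption here, not something stated above.

```bash
# List the libfabric providers available on this node
fi_info -l

# Show detailed capabilities of the Slingshot (CXI) provider, if it is present
fi_info -p cxi
```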

### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 does not have any local storage. Instead, storage for the entire cluster is provided through a dedicated HPE/Cray storage appliance called ClusterStor.

The appliance is built from several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID 10)
* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (GridRAID/RAID 6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID 10)

This yields an effective storage capacity of:

* 10 PB of HDD storage
  * value visible on Linux: 9302.4 TiB
* 162 TB of SSD storage
  * value visible on Linux: 151.6 TiB
* 23.6 TiB of metadata storage

The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
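ClusterStor is a Lustre-based appliance. Assuming the Merlin7 file systems are mounted as regular Lustre clients (the mount point below is a placeholder), the Linux-visible capacities listed above correspond to what the standard Lustre client tools report:

```bash
# Aggregate and per-target (MDT/OST) capacity of a Lustre file system
lfs df -h /data/<filesystem>

# Per-user usage and quota on the same file system
lfs quota -u $USER /data/<filesystem>
```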