---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document gives an overview of the Merlin7 cluster configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This page describes the basic Slurm configuration and the options needed to run jobs on the Merlin7 cluster.

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 GPU A100 nodes
* 8 GPU Grace Hopper nodes

The specification of the node types is:

| Node | #Nodes | CPU | RAM | GRES |
| ----: | ------ | --- | --- | ---- |
| Login Nodes | 2 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| CPU Nodes | 77 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| A100 GPU Nodes | 8 | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | 4 x NV_A100 (80GB) |
| GH GPU Nodes | 5 | _2x_ NVIDIA Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB LPDDR5X (CPU+GPU) | 4 x NV_GH200 (120GB) |

### Network

The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric provides up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN).

Through software interfaces such as [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.

### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided by a dedicated HPE/Cray storage appliance called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance consists of several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (GridRAID / RAID6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)

This gives an effective storage capacity of:

* 10 PB HDD
  * value visible on Linux: 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: 151.6 TiB
* 23.6 TiB for metadata

The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
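
ClusterStor is a Lustre-based appliance, so the file systems and capacities listed above can be cross-checked from a login node with the standard Lustre client tools. The following is a minimal sketch, assuming the Lustre client (`lfs`) is available on the nodes; the directory path in the second command is a placeholder, not an actual Merlin7 mount point:

```bash
# List all mounted Lustre file systems with per-target (MDT/OST) usage.
lfs df -h

# Show the stripe layout of a directory, e.g. to see which OSTs a path uses.
# The path below is a placeholder.
lfs getstripe -d /path/to/your/merlin7/directory
```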
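
The libFabric interface mentioned in the Network section can be inspected with the `fi_info` utility that ships with libfabric. The sketch below assumes `fi_info` is in the default `PATH` on the login nodes, and that the Slingshot NIC is exposed through the `cxi` provider (the usual provider name for Slingshot, but confirm it on the system itself):

```bash
# List the libfabric providers visible on this node.
fi_info -l

# Show details for the Slingshot provider (provider name assumed to be "cxi").
fi_info -p cxi
```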
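
The node definitions from the hardware table can also be queried directly from Slurm. The commands below are standard Slurm tooling and make no assumptions about Merlin7-specific partition names; the node name in the second command is a placeholder:

```bash
# Per-node summary: hostname, CPU count, memory (MB) and generic resources (GRES).
sinfo -N -o "%N %c %m %G"

# Full definition of a single node as Slurm sees it (CPUs, RealMemory, Gres, ...).
scontrol show node <nodename>
```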
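
As a starting point for running jobs, the sketch below shows a minimal batch script requesting a single GPU via GRES. The partition name `gpu` and the GRES label `gpu` are assumptions for illustration; the actual partition and GRES names configured on Merlin7 should be taken from `sinfo`:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=gpu          # assumed partition name; check `sinfo`
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1             # request one GPU; GRES label assumed to be "gpu"
#SBATCH --time=00:10:00

# Print the GPUs allocated to the job.
nvidia-smi
```

Submit the script with `sbatch <scriptname>` and check its state with `squeue -u $USER`.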