---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document summarizes the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This documentation describes the basic Slurm configuration and the options needed to run jobs on the Merlin7 cluster.

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 A100 GPU nodes
* 8 Grace Hopper GPU nodes

The specification of the node types is as follows:

| Node | #Nodes | CPU | RAM | GRES |
| ----: | ------ | --- | --- | ---- |
| Login Nodes | 2 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| CPU Nodes | 77 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| A100 GPU Nodes | 5 | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | 4 x NV_A100 (80GB) |
| GH GPU Nodes | 8 | _2x_ NVIDIA Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU+GPU) | 4 x NV_GH200 (120GB) |

### Network

The Merlin7 cluster is built on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot, which can provide up to 200 Gbit/s of throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and at <https://www.glennklockwood.com/garden/slingshot>.

Through software interfaces such as [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
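
As a minimal sketch of how an application can discover the fabric through this interface, the following C program enumerates the libfabric providers visible on a node (the file name, requested API version, and output format are illustrative; the provider names printed depend on the local installation):

```c
/* list_providers.c — minimal sketch: enumerate the libfabric providers
 * visible on the current node.
 * Build, assuming libfabric development headers are installed:
 *   cc list_providers.c -lfabric -o list_providers */
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *info, *cur;

    /* NULL hints: ask for every provider/endpoint this build knows about. */
    int ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    for (cur = info; cur != NULL; cur = cur->next)
        printf("provider: %-10s fabric: %s\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name);

    fi_freeinfo(info);
    return 0;
}
```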

### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through a dedicated Lustre-based storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance is built from several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9TiB (RAID10)
* 8 OSS-D servers, 106 drives per server, 14.5TB HDDs (GridRAID/RAID6)
* 4 OSS-F servers, 12 drives per server, 7TiB SSDs (RAID10)
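
Because ClusterStor exposes a Lustre file system, large files can be striped across multiple object storage targets to aggregate bandwidth. The following is a minimal sketch, assuming the Lustre client library (`liblustreapi`) is installed; the path and striping parameters are illustrative, not Merlin7 defaults:

```c
/* striped_create.c — hedged sketch: create a file striped across 4 OSTs
 * via the Lustre API. Build: cc striped_create.c -llustreapi
 * The path and striping values below are illustrative only. */
#include <stdio.h>
#include <lustre/lustreapi.h>

int main(void)
{
    int rc = llapi_file_create("/lustre/scratch/example.dat",
                               1 << 20,  /* stripe_size: 1 MiB            */
                               -1,       /* stripe_offset: Lustre chooses  */
                               4,        /* stripe_count: 4 OSTs           */
                               0);       /* stripe_pattern: RAID0 default  */
    if (rc < 0) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }
    puts("created striped file");
    return 0;
}
```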

With an effective storage capacity of:

* 10 PB HDD
  * value visible on Linux: 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: 151.6 TiB
* 23.6 TiB of metadata storage

Note that Linux tools report sizes in binary units (TiB) rather than the decimal units (TB, PB) used for the nominal capacities; the sketch below shows how to read the reported values.
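
A minimal sketch for checking what a node reports, assuming a mounted file system path (the default mount point below is illustrative):

```c
/* fs_size.c — print a mounted file system's total size in TB and TiB,
 * illustrating how decimal and binary units differ.
 * Build: cc fs_size.c -o fs_size ; run: ./fs_size /path/to/mount */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/";
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    /* Total bytes = fragment size * total number of blocks. */
    double bytes = (double)vfs.f_frsize * (double)vfs.f_blocks;
    printf("%s: %.1f TB (decimal), %.1f TiB (binary)\n",
           path, bytes / 1e12, bytes / (1024.0 * 1024 * 1024 * 1024));
    return 0;
}
```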

The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.