---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document summarizes the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This page describes the basic Slurm configuration and the options needed to run jobs on the Merlin7 cluster.

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 GPU A100 nodes
* 8 GPU Grace Hopper nodes

The node types have the following specifications:

| Node | #Nodes | CPU | RAM | GRES |
| ----: | ------ | --- | --- | ---- |
| Login Nodes | 2 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25 GHz) | 512 GB DDR4 3200 MHz | |
| CPU Nodes | 77 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25 GHz) | 512 GB DDR4 3200 MHz | |
| A100 GPU Nodes | 8 | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2 GHz) | 512 GB DDR4 3200 MHz | 4 x NV_A100 (80 GB) |
| GH GPU Nodes | 5 | _2x_ NVIDIA Grace Neoverse-V2 (SBSA ARM 64-bit, 144 Cores, 3.1 GHz) | _2x_ 480 GB DDR5X (CPU+GPU) | 4 x NV_GH200 (120 GB) |

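As a rough orientation, the CPU-node row of the table can be turned into aggregate figures. The sketch below assumes that the "64 Cores" entry is the per-socket core count (as it is for the EPYC 7742) and that all CPU nodes are configured identically; the variable names are illustrative only.

```python
# Aggregate capacity of the CPU-only compute nodes, using the table values.
# Assumption: "64 Cores" is per socket, and every CPU node is identical.
CPU_NODES = 77
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 64
RAM_PER_NODE_GB = 512

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = CPU_NODES * cores_per_node
total_ram_gb = CPU_NODES * RAM_PER_NODE_GB

print(f"cores per node: {cores_per_node}")
print(f"total cores:    {total_cores}")
print(f"total RAM:      {total_ram_gb} GB")
```

These are hardware totals; the cores and memory actually available to jobs depend on the Slurm configuration.
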
### Network

The Merlin7 cluster is built on HPE/Cray technologies, including a high-performance network fabric called Slingshot. This fabric provides
up to 200 Gbit/s of throughput between nodes. Further information on Slingshot is available from [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
at <https://www.glennklockwood.com/garden/slingshot>.

Through software interfaces such as [libfabric](https://ofiwg.github.io/libfabric/), which is available on Merlin7, applications can use the network seamlessly.

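To put the 200 Gbit/s figure in perspective, the sketch below converts it to bytes per second and estimates a lower bound for the time needed to move a payload between two nodes. It assumes the full nominal rate is reached and ignores protocol overhead, so real transfers will take longer; the function name and the payload size are illustrative.

```python
LINK_GBIT_PER_S = 200  # nominal peak throughput between nodes


def min_transfer_seconds(payload_bytes: float,
                         link_gbit_per_s: float = LINK_GBIT_PER_S) -> float:
    """Lower bound on transfer time at the nominal link rate."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8  # 8 bits per byte
    return payload_bytes / bytes_per_second


peak_gbyte_per_s = LINK_GBIT_PER_S / 8
one_tb = 1e12  # example payload: 1 TB (decimal)

print(f"peak rate: {peak_gbyte_per_s:.0f} GB/s")
print(f"1 TB takes at least {min_transfer_seconds(one_tb):.0f} s")
```
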
### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided by
a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance consists of several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (GridRAID / RAID6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)

The effective storage capacity is:

* 10 PB HDD
  * value visible on Linux: HDD 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: SSD 151.6 TiB
* 23.6 TiB for metadata

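The list mixes decimal units (TB, PB) for the nominal sizes with binary units (TiB) for the values reported by Linux, so the two figures in each pair are not directly comparable without conversion. As a minimal sketch, the helper below (its name is illustrative) converts the reported TiB values to decimal terabytes.

```python
TIB = 2**40   # bytes in one tebibyte (binary unit)
TB = 10**12   # bytes in one terabyte (decimal unit)


def tib_to_tb(tib: float) -> float:
    """Convert tebibytes (binary) to terabytes (decimal)."""
    return tib * TIB / TB


# Values reported by Linux, as listed above.
reported_tib = {"HDD": 9302.4, "SSD": 151.6, "Metadata": 23.6}

for name, tib in reported_tib.items():
    print(f"{name}: {tib} TiB = {tib_to_tb(tib):.1f} TB")
```

One TiB is roughly 10% larger than one TB, so a value expressed in TiB is always the smaller number for the same capacity.
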
The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.