2024-12-18 17:11:12 +01:00
parent f24a644c8e
commit 08c999f97a
4 changed files with 218 additions and 47 deletions

View File

@@ -47,6 +47,8 @@ entries:
- title: Slurm General Documentation
folderitems:
- title: Merlin7 Infrastructure
url: /merlin7/merlin7-configuration.html
- title: Slurm Configuration
url: /merlin7/slurm-configuration.html
- title: Running Slurm Interactive Jobs
url: /merlin7/interactive-jobs.html

View File

@@ -15,15 +15,16 @@ initiate the transfer from either merlin or the other system, depending on the n
visibility.
- Merlin login nodes are visible from the PSI network, so direct data transfer
  (rsync/WinSCP/sftp) is generally preferable; a minimal transfer sketch follows this list.
  - Protocols from Merlin7 to PSI may require special firewall rules.
- Merlin login nodes can access the internet using a limited set of protocols:
  - HTTP-based protocols using ports 80 or 445 (https, WebDAV, etc.)
  - Protocols using other ports require admin configuration and may only work with
    specific hosts, and may require new firewall rules (ssh, ftp, rsync daemons, etc.).
- Systems on the internet can access the [PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer) service
  `datatransfer.psi.ch`, using SSH-based protocols and [Globus](https://www.globus.org/).
  SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc.) to most other PSI servers are, in general, not permitted.
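As a minimal sketch of such a transfer towards the PSI Data Transfer service (the local directory, the PSI username, and the destination path are placeholders, not actual Merlin7 paths), an rsync-over-ssh copy could look like:

```bash
# Sketch only: 'psi_username' and the destination path are placeholders;
# adjust them to your PSI account and the directory agreed for the transfer.
rsync -avP ./my_dataset/ psi_username@datatransfer.psi.ch:/path/to/destination/
```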
## Direct transfer via Merlin7 login nodes

View File

@@ -0,0 +1,68 @@
---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---
![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
{{site.data.alerts.warning}}The Merlin7 documentation is <b>Work In Progress</b>.
Please do not use or rely on this documentation until this becomes official.
This applies to any page under <b><a href="https://lsm-hpce.gitpages.psi.ch/merlin7/">https://lsm-hpce.gitpages.psi.ch/merlin7/</a></b>
{{site.data.alerts.end}}
This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
## Infrastructure
### Hardware
The current configuration for the _preproduction_ phase (and likely the production phase) is made up as follows:
* 92 nodes in total for Merlin7:
  * 2 CPU-only login nodes
  * 77 CPU-only compute nodes
  * 5 GPU A100 nodes
  * 8 GPU Grace Hopper nodes
The specification of the node types is:
| Node | CPU | RAM | GRES | Notes |
| ---- | --- | --- | ---- | ----- |
| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | | For both the login and CPU-only compute nodes |
| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | _4x_ NVidia A100 (Ampere, 80GB) | |
| GH Node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |
### Network
The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
at <https://www.glennklockwood.com/garden/slingshot>.
Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
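As a quick sanity check of the network stack, libFabric ships the `fi_info` utility, which lists the fabric providers visible on a node. A short sketch, assuming `fi_info` is available in the environment and that the Slingshot provider is exposed under the name `cxi` (an assumption, not confirmed here):

```bash
# List all libfabric providers visible on this node
fi_info -l

# Show details for the Slingshot provider ('cxi' is assumed to be the provider name)
fi_info -p cxi
```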
### Storage
Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).
The appliance is built of several storage servers:
* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
* 8 OSS-D servers, 106 drives per server, 14.5 TB HDDs (GridRAID / RAID6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)
With an effective storage capacity of:
* 10 PB HDD
  * value visible on Linux: HDD 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: SSD 151.6 TiB
* 23.6 TiB for metadata
The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
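The usable capacity can also be checked from any node. A sketch, assuming the ClusterStor appliance is mounted as a Lustre client and that `/data` is one of its mount points (the path is illustrative only):

```bash
# Aggregate and per-target (MDT/OST) usage of the filesystem, human readable
lfs df -h /data

# Inode usage on the metadata targets
lfs df -i /data
```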

View File

@@ -1,68 +1,168 @@
---
title: Slurm merlin7 Configuration
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 Slurm CPU-based configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-configuration.html
---
![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
{{site.data.alerts.warning}}The Merlin7 documentation is <b>Work In Progress</b>.
Please do not use or rely on this documentation until this becomes official.
This applies to any page under <b><a href="https://lsm-hpce.gitpages.psi.ch/merlin7/">https://lsm-hpce.gitpages.psi.ch/merlin7/</a></b>
{{site.data.alerts.end}}
This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
## General configuration
The **Merlin7 cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.
* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
to fulfill a job's resource requirements.
By default, Slurm will allocate one task per core, which means:
* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
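As an illustration of this accounting (the task count and application name are placeholders), consider the following submission header:

```bash
#!/bin/bash
#SBATCH --ntasks=4        # one task per core: Slurm reserves 4 full cores
# With hyper-threading enabled, each core counts as 2 CPUs, so this job is
# accounted 8 CPUs even if each task only ever runs a single thread.
srun ./my_application     # './my_application' is a placeholder
```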
### Default cluster
By default, jobs will be submitted to **`merlin7`**, as it is the primary cluster configured on the login nodes.
Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
However, when necessary, one can specify the cluster as follows:
```bash
#SBATCH --cluster=merlin7
```
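The same selection can also be made on the command line when submitting or querying jobs; a brief sketch (the job script name is a placeholder):

```bash
# Submit explicitly to the merlin7 cluster and query its queue
sbatch --clusters=merlin7 my_job.sh
squeue --clusters=merlin7
```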
## Slurm nodes definition
The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.
| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
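These values can be cross-checked directly against the running Slurm configuration; a short sketch using standard Slurm queries (the node name is one of the compute nodes listed above):

```bash
# Per-node CPU count, memory and feature summary as configured in Slurm
sinfo -N -o "%N %c %m %f"

# Full Slurm definition of a single compute node
scontrol show node cn001
```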
Notes on memory configuration:
* **Memory allocation options:** To request additional memory, use the following options in your submission script:
* **`--mem=<mem_in_MB>`**: Allocates memory per node.
* **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).
The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
adjusted.
For MPI-based jobs, where performance generally improves when each core runs a single thread, this option is recommended.
In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads, as sketched in the example below.
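A sketch of a submission header following this recommendation (task count, memory value, and application name are illustrative only):

```bash
#!/bin/bash
#SBATCH --ntasks=128
#SBATCH --hint=nomultithread   # one thread per core: halves the CPUs seen by the job
#SBATCH --mem-per-cpu=3840M    # doubled from the 1920M default to keep the same memory per core
srun ./my_mpi_application      # placeholder application
```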
{{site.data.alerts.tip}}
Always verify the Slurm <b>'/var/spool/slurmd/conf-cache/slurm.conf'</b> configuration file for potential changes.
{{site.data.alerts.end}}
### User and job limits with QoS
In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
efficiency. However, applying limits can occasionally impact the clusters utilization. For example, user-specific
limits may result in pending jobs even when many nodes are idle due to low activity.
On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.
Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
effectively.
To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
various QoS definitions applicable to the merlin7 CPU-based cluster. Here:
* `MaxTRES` specifies resource limits per job.
* `MaxTRESPU` specifies resource limits per user.
| Name | MaxTRES | MaxTRESPU | Scope |
| --------------: | -----------------: | -----------------: | ---------------------: |
| **normal** | | | partition |
| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | <u>user</u>, partition |
| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition |
| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition |
Where:
* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
restrictions.
* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
overriding user-level QoS.
* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
with higher resource needs.
* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
which caters to very short-duration jobs.
For additional details, refer to the [Partitions](/merlin7/slurm-configuration.html#partitions) section.
{{site.data.alerts.tip}}
Always verify QoS definitions for potential changes using the <b>'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"'</b> command.
{{site.data.alerts.end}}
## Partitions
This section provides a summary of the partitions available in the `merlin7` CPU cluster.
Key concepts:
* **`PriorityJobFactor`**: This value is added to a job's priority (visible in the `PARTITION` column of the `sprio -l` command).
Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.
{{site.data.alerts.tip}}
Always verify partition configurations for potential changes using the <b>'scontrol show partition'</b> command.
{{site.data.alerts.end}}
### Public partitions
| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | 50 | 1 | 1 | cpu_general | <u>merlin</u> |
| **daily** | 0-01:00:00 | 1-00:00:00 | 63 | 500 | 1 | cpu_daily | <u>merlin</u> |
| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | <u>merlin</u> |
All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.
{{site.data.alerts.tip}}
For jobs running less than one day, submit them to the <b>daily</b> partition.
For jobs running less than one hour, use the <b>hourly</b> partition.
These partitions provide higher priority and ensure quicker scheduling compared to <b>general</b>, which has limited node availability.
{{site.data.alerts.end}}
The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first. As a result, access to the
**`hourly`** partition might experience delays in such scenarios.
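As an example, a minimal job script targeting the `daily` partition could look like the following sketch (the time limit, task count, and application are placeholders):

```bash
#!/bin/bash
#SBATCH --partition=daily      # higher priority than 'general' for jobs shorter than one day
#SBATCH --time=0-06:00:00      # must stay below the 1-00:00:00 MaxTime of 'daily'
#SBATCH --ntasks=64
srun ./my_application          # placeholder application
```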
### Private partitions
#### CAS / ASA
| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **asa-general** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |
| **asa-daily** | 0-01:00:00 | 1-00:00:00 | 10 | 1000 | 2 | normal | asa |
#### CNM / Mu3e
| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |
#### CNM / MeG
| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |
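Access to these private partitions is restricted through the `AllowAccounts` column; a sketch of how a member of one of the listed accounts would target such a partition (membership in the account is a prerequisite, and the application name is a placeholder):

```bash
#!/bin/bash
#SBATCH --partition=mu3e   # private partition, see the table above
#SBATCH --account=mu3e     # must match an account listed in AllowAccounts
srun ./my_application
```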