Update
@ -15,15 +15,16 @@ initiate the transfer from either merlin or the other system, depending on the n
visibility.

- Merlin login nodes are visible from the PSI network, so direct data transfer
  (rsync/WinSCP/sftp) is generally preferable.
  - Protocols from Merlin7 to PSI may require special firewall rules.
- Merlin login nodes can access the internet using a limited set of protocols:
  - HTTP-based protocols using ports 80 or 445 (https, WebDAV, etc.)
  - Protocols using other ports require admin configuration and may only work with
    specific hosts, and may require new firewall rules (ssh, ftp, rsync daemons, etc.).
- Systems on the internet can access the [PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer) service
  `datatransfer.psi.ch`, using ssh-based protocols and [Globus](https://www.globus.org/).

SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc.) to most PSI servers are, in general, not permitted.
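For the common case of pushing or pulling data with rsync over ssh from inside the PSI network, a minimal sketch is shown below; the placeholders `<username>` and `<login-node>` must be replaced with your PSI account and an actual Merlin7 login node name.

```bash
# Push a local directory to your Merlin7 home over ssh (port 22).
rsync -avz --progress ./my_dataset/ <username>@<login-node>:~/my_dataset/

# Pull results back from Merlin7 to the local machine.
rsync -avz --progress <username>@<login-node>:~/results/ ./results/
```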
## Direct transfer via Merlin7 login nodes
@ -0,0 +1,68 @@
---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

{:style="display:block; margin-left:auto; margin-right:auto"}

{{site.data.alerts.warning}}The Merlin7 documentation is <b>Work In Progress</b>.
Please do not use or rely on this documentation until this becomes official.
This applies to any page under <b><a href="https://lsm-hpce.gitpages.psi.ch/merlin7/">https://lsm-hpce.gitpages.psi.ch/merlin7/</a></b>
{{site.data.alerts.end}}

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Infrastructure

### Hardware

The current configuration for the _preproduction_ phase (and likely the production phase) is made up as follows:

* 92 nodes in total for Merlin7:
  * 2 CPU-only login nodes
  * 77 CPU-only compute nodes
  * 5 GPU A100 nodes
  * 8 GPU Grace Hopper nodes

The specification of the node types is:

| Node | CPU | RAM | GRES | Notes |
| ---- | --- | --- | ---- | ----- |
| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | | For both the login and CPU-only compute nodes |
| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | _4x_ NVidia A100 (Ampere, 80GB) | |
| GH node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |

### Network

The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
at <https://www.glennklockwood.com/garden/slingshot>.

Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
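To see which libfabric providers are actually exposed on a node, the `fi_info` utility shipped with libfabric can be used. This is only a quick sketch; it assumes the tool is available in your environment and that the Slingshot provider is exposed as `cxi` (the usual name on HPE Slingshot systems).

```bash
# List all libfabric providers visible on the current node.
fi_info

# Restrict the output to the assumed Slingshot provider.
fi_info -p cxi
```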
### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance is built of several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
* 8 OSS-D servers, 106 drives per server, 14.5 TB HDDs (GridRAID / RAID6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)

With an effective storage capacity of:

* 10 PB HDD
  * value visible on Linux: HDD 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: SSD 151.6 TiB
* 23.6 TiB for metadata
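ClusterStor appliances are Lustre-based, so the capacity actually visible from a node can be checked with the standard Lustre client tools once the filesystem is mounted; a minimal sketch (exact mount points and numbers will differ):

```bash
# Per-target capacity and usage of all mounted Lustre filesystems.
lfs df -h

# Aggregate view with plain df, limited to Lustre mounts.
df -hT -t lustre
```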
The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
@ -1,68 +1,168 @@
---
title: Slurm merlin7 Configuration
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 Slurm CPU-based configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-configuration.html
---

{:style="display:block; margin-left:auto; margin-right:auto"}

{{site.data.alerts.warning}}The Merlin7 documentation is <b>Work In Progress</b>.
Please do not use or rely on this documentation until this becomes official.
This applies to any page under <b><a href="https://lsm-hpce.gitpages.psi.ch/merlin7/">https://lsm-hpce.gitpages.psi.ch/merlin7/</a></b>
{{site.data.alerts.end}}

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## General configuration

The **Merlin7 cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.
* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
  to fulfill a job's resource requirements.

By default, Slurm will allocate one task per core, which means:
* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
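As an illustration of this accounting, a minimal job header is sketched below; the task count is arbitrary and the executable name is a placeholder.

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --ntasks=4     # one task per physical core (CR_ONE_TASK_PER_CORE)
# With hyper-threading enabled, every core provides 2 CPUs (threads),
# so this request is accounted as 4 cores = 8 CPUs, even if only one
# thread per core does useful work.

srun ./my_app          # placeholder executable
```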
### Default cluster

By default, jobs will be submitted to **`merlin7`**, as it is the primary cluster configured on the login nodes.
Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
However, when necessary, one can specify the cluster as follows:
```bash
#SBATCH --cluster=merlin7
```
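The cluster can also be selected at submission time rather than inside the script; a small usage sketch, where the job script name is a placeholder:

```bash
# Equivalent selection on the command line.
sbatch --clusters=merlin7 my_job.sh

# Query the queue of the merlin7 cluster explicitly.
squeue --clusters=merlin7
```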
## Slurm nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.

| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
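The values in this table can be cross-checked at any time with the standard Slurm query tools; `cn001` below is simply one node from the range listed above.

```bash
# Full Slurm definition of a single compute node (sockets, threads, memory, features).
scontrol show node cn001

# Compact per-node overview of the whole cluster.
sinfo -N -l
```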
Notes on memory configuration:
* **Memory allocation options:** To request additional memory, use the following options in your submission script:
  * **`--mem=<mem_in_MB>`**: Allocates memory per node.
  * **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
  effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
  adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads. A sketch
  of both memory options is shown after this list.
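A submission sketch combining these memory options is shown below; the values are purely illustrative and assume the node limits from the table above (`DefMemPerCPU=1920M`, `MaxMemPerNode=480G`).

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --ntasks=64
#SBATCH --hint=nomultithread   # one thread per core: available CPUs are halved
#SBATCH --mem-per-cpu=3840     # doubled from the 1920M default to compensate
##SBATCH --mem=240G            # alternative: memory per node (must stay below MaxMemPerNode)

srun ./my_mpi_app              # placeholder MPI executable
```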
{{site.data.alerts.tip}}
Always verify the Slurm <b>'/var/spool/slurmd/conf-cache/slurm.conf'</b> configuration file for potential changes.
{{site.data.alerts.end}}

### User and job limits with QoS

In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
effectively.

To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
various QoS definitions applicable to the merlin7 CPU-based cluster. Here:
* `MaxTRES` specifies resource limits per job.
* `MaxTRESPU` specifies resource limits per user.

| Name | MaxTRES | MaxTRESPU | Scope |
| --------------: | -----------------: | -----------------: | ---------------------: |
| **normal** | | | partition |
| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | <u>user</u>, partition |
| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition |
| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition |

Where:
* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
  restrictions.
* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
  user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
  overriding user-level QoS.
* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
  with higher resource needs.
* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
  which caters to very short-duration jobs.

For additional details, refer to the [Partitions](/merlin7/slurm-configuration.html#partitions) section.

{{site.data.alerts.tip}}
Always verify QoS definitions for potential changes using the <b>'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"'</b> command.
{{site.data.alerts.end}}

## Partitions

This section provides a summary of the partitions available in the `merlin7` CPU cluster.

Key concepts:
* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
  and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
  partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.

{{site.data.alerts.tip}}
Always verify partition configurations for potential changes using the <b>'scontrol show partition'</b> command.
{{site.data.alerts.end}}

### Public partitions

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | 50 | 1 | 1 | cpu_general | <u>merlin</u> |
| **daily** | 0-01:00:00 | 1-00:00:00 | 63 | 500 | 1 | cpu_daily | <u>merlin</u> |
| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | <u>merlin</u> |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

{{site.data.alerts.tip}}
For jobs running less than one day, submit them to the <b>daily</b> partition.
For jobs running less than one hour, use the <b>hourly</b> partition.
These partitions provide higher priority and ensure quicker scheduling compared to <b>general</b>, which has limited node availability.
{{site.data.alerts.end}}

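Putting the defaults above together, a short job that fits in the **hourly** partition could be submitted with a header like the following; the time limit and executable are placeholders.

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --partition=hourly   # short jobs get higher priority here
#SBATCH --time=00:45:00      # must stay within the partition MaxTime of 0-01:00:00
#SBATCH --account=merlin     # default account for Merlin users; usually optional

srun ./my_app                # placeholder executable
```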
The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first. As a result, access to the
**`hourly`** partition might experience delays in such scenarios.

### Private partitions

#### CAS / ASA

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **asa-general** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |
| **asa-daily** | 0-01:00:00 | 1-00:00:00 | 10 | 1000 | 2 | normal | asa |

#### CNM / Mu3e

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |

#### CNM / MeG

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |