From 08c999f97a4d5520ba8e0e9b26098ed5a6e640db Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Wed, 18 Dec 2024 17:11:12 +0100
Subject: [PATCH] Update

---
 _data/sidebars/merlin7_sidebar.yml            |   2 +
 .../02-How-To-Use-Merlin/transfer-data.md     |   9 +-
 .../merlin7-configuration.md                  |  68 +++++++
 .../slurm-configuration.md                    | 186 ++++++++++++++----
 4 files changed, 218 insertions(+), 47 deletions(-)
 create mode 100644 pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md

diff --git a/_data/sidebars/merlin7_sidebar.yml b/_data/sidebars/merlin7_sidebar.yml
index 5c965e8..df6f9f7 100644
--- a/_data/sidebars/merlin7_sidebar.yml
+++ b/_data/sidebars/merlin7_sidebar.yml
@@ -47,6 +47,8 @@ entries:
   - title: Slurm General Documentation
     folderitems:
     - title: Merlin7 Infrastructure
+      url: /merlin7/merlin7-configuration.html
+    - title: Slurm Configuration
       url: /merlin7/slurm-configuration.html
     - title: Running Slurm Interactive Jobs
      url: /merlin7/interactive-jobs.html
diff --git a/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md b/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
index dd17171..09230a8 100644
--- a/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
+++ b/pages/merlin7/02-How-To-Use-Merlin/transfer-data.md
@@ -15,15 +15,16 @@ initiate the transfer from either merlin or the other system, depending on the network
 visibility.
 
 - Merlin login nodes are visible from the PSI network, so direct data transfer
-  (rsync/WinSCP) is generally preferable. This can be initiated from either endpoint.
-- Merlin login nodes can access the internet using a limited set of protocols
-  - SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc)
+  (rsync/WinSCP/sftp) is generally preferable.
+  - Protocols from Merlin7 to PSI may require special firewall rules.
+- Merlin login nodes can access the internet using a limited set of protocols:
   - HTTP-based protocols using ports 80 or 445 (https, WebDav, etc)
   - Protocols using other ports require admin configuration and may only work with
-    specific hosts (ftp, rsync daemons, etc)
+    specific hosts, and may require new firewall rules (ssh, ftp, rsync daemons, etc).
 - Systems on the internet can access the [PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer)
   service `datatransfer.psi.ch`, using ssh-based protocols and [Globus](https://www.globus.org/)
+SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc.) to most PSI servers are, in general, not permitted.
 
 ## Direct transfer via Merlin7 login nodes
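
As a usage sketch for the options described above (host name, paths and user name are placeholders to adapt; only `datatransfer.psi.ch` is taken from the text):

```bash
# From a workstation inside the PSI network, push a local directory to a
# Merlin7 login node with rsync over ssh (login node name and target path
# are placeholders):
rsync -avz --progress ./my_dataset/ $USER@<merlin7-login-node>:/some/target/directory/

# From a system outside PSI, route the transfer through the PSI Data Transfer
# service instead, e.g. interactively with sftp:
sftp $USER@datatransfer.psi.ch
```

diff --git a/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md b/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md
new file mode 100644
index 0000000..da80902
--- /dev/null
+++ b/pages/merlin7/03-Slurm-General-Documentation/merlin7-configuration.md
@@ -0,0 +1,68 @@
+---
+title: Slurm cluster 'merlin7'
+#tags:
+keywords: configuration, partitions, node definition
+#last_updated: 24 May 2023
+summary: "This document describes a summary of the Merlin7 configuration."
+sidebar: merlin7_sidebar
+permalink: /merlin7/merlin7-configuration.html
+---
+
+![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
+
+{{site.data.alerts.warning}}The Merlin7 documentation is Work In Progress.
+Please do not use or rely on this documentation until this becomes official.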
+This applies to any page under https://lsm-hpce.gitpages.psi.ch/merlin7/
+{{site.data.alerts.end}}
+
+This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
+
+## Infrastructure
+
+### Hardware
+
+The current configuration for the _preproduction_ phase (and likely the production phase) is made up as follows:
+
+* 92 nodes in total for Merlin7:
+  * 2 CPU-only login nodes
+  * 77 CPU-only compute nodes
+  * 5 GPU A100 nodes
+  * 8 GPU Grace Hopper nodes
+
+The specification of the node types is:
+
+| Node | CPU | RAM | GRES | Notes |
+| ---- | --- | --- | ---- | ----- |
+| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | | For both the login and CPU-only compute nodes |
+| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | _4x_ NVidia A100 (Ampere, 80GB) | |
+| GH Node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |
+
+### Network
+
+The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
+to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at
+[HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN).
+
+Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
+
+### Storage
+
+Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
+a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).
+
+The appliance is built of several storage servers:
+
+* 2 management nodes
+* 2 MDS servers, 12 drives per server, 2.9 TiB (RAID10)
+* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (GridRAID / RAID6)
+* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (RAID10)
+
+With an effective storage capacity of:
+
+* 10 PB HDD
+  * value visible on Linux: HDD 9302.4 TiB
+* 162 TB SSD
+  * value visible on Linux: SSD 151.6 TiB
+* 23.6 TiB for metadata
+
+The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
diff --git a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
index f25f279..c939bd3 100644
--- a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
+++ b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md
@@ -1,68 +1,168 @@
 ---
-title: Slurm cluster 'merlin7'
+title: Slurm merlin7 Configuration
 #tags:
 keywords: configuration, partitions, node definition
 #last_updated: 24 Mai 2023
-summary: "This document describes a summary of the Merlin7 configuration."
+summary: "This document describes a summary of the Merlin7 Slurm CPU-based configuration."
 sidebar: merlin7_sidebar
 permalink: /merlin7/slurm-configuration.html
 ---
 
-![Work In Progress](/images/WIP/WIP1.webp){:style="display:block; margin-left:auto; margin-right:auto"}
-
-{{site.data.alerts.warning}}The Merlin7 documentation is Work In Progress.
-Please do not use or rely on this documentation until this becomes official.
-This applies to any page under https://lsm-hpce.gitpages.psi.ch/merlin7/
-{{site.data.alerts.end}}
-
 This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.
 
-## Infrastructure
+## General configuration
 
-### Hardware
+The **Merlin7 cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.
+* This configuration treats both cores and memory as consumable resources.
+* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
+  to fulfill a job's resource requirements.
 
-The current configuration for the _preproduction_ phase (and likely the production phase) is made up as:
+By default, Slurm will allocate one task per core, which means:
+* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
 
-* 92 nodes in total for Merlin7:
-  * 2 CPU-only login nodes
-  * 77 CPU-only compute nodes
-  * 5 GPU A100 nodes
-  * 8 GPU Grace Hopper nodes
+This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
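
As an illustration of this accounting (a sketch using standard Slurm commands; the task count is an arbitrary example), a 64-task allocation on these nodes should be granted 128 CPUs:

```bash
# Request an interactive allocation with 64 tasks; with one task per core and
# 2 threads per core, Slurm should report NumCPUs=128 for the job.
salloc --cluster=merlin7 --ntasks=64
scontrol show job $SLURM_JOB_ID | grep -o 'NumCPUs=[0-9]*'
```

 
-The specification of the node types is:
+### Default cluster
 
-| Node | CPU | RAM | GRES | Notes |
-| ---- | --- | --- | ---- | ----- |
-| Multi-core node | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200Mhz | | For both the login and CPU-only compute nodes |
-| A100 node | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200Mhz | _4x_ NVidia A100 (Ampere, 80GB) | |
-| GH Node | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU + GPU) | _4x_ NVidia GH200 (Hopper, 120GB) | |
+By default, jobs will be submitted to **`merlin7`**, as it is the primary cluster configured on the login nodes.
+Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
+However, when necessary, one can specify the cluster as follows:
+```bash
+#SBATCH --cluster=merlin7
+```
 
-### Network
+## Slurm nodes definition
 
-The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
-to provide up to 200 Gbit/s throughput between nodes. Further information on Slignshot can be found on at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
-at .
+The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
+This information is essential for understanding how resources are allocated, enabling users to tailor their submission
+scripts accordingly.
 
-Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which available on Merlin7), application can leverage the network seamlessly.
+| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
+| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
+| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
+| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
 
-### Storage
-
-Unlike previous iteration of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead storage for the entire cluster is provided through
-a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).
+Notes on memory configuration:
+* **Memory allocation options:** To request additional memory, use the following options in your submission script:
+  * **`--mem=`**: Allocates memory per node.
+  * **`--mem-per-cpu=`**: Allocates memory per CPU (equivalent to a core thread).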
-Unlike previous iteration of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead storage for the entire cluster is provided through -a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf). + The total memory requested cannot exceed the **`MaxMemPerNode`** value. +* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core, +effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly +adjusted. -The appliance is built of several storage servers: + For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended. + In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads. -* 2 management nodes -* 2 MDS servers, 12 drives per server, 2.9TiB (Raid10) -* 8 OSS-D servers, 106 drives per server, 14.5 T.B HDDs (Gridraid / Raid6) -* 4 OSS-F servers, 12 drives per server 7TiB SSDs (Raid10) +{{site.data.alerts.tip}} +Always verify the Slurm '/var/spool/slurmd/conf-cache/slurm.conf' configuration file for potential changes. +{{site.data.alerts.end}} -With effective storage capacity of: +### User and job limits with QoS -* 10 PB HDD - * value visible on linux: HDD 9302.4 TiB -* 162 TB SSD - * value visible on linux: SSD 151.6 TiB -* 23.6 TiB on Metadata +In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent +overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster +efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific +limits may result in pending jobs even when many nodes are idle due to low activity. -The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC. +On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing +all available resources, which could block other jobs from running. Without job size limits, for instance, a large job +might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable. + +Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These +limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist +effectively. + +To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied +**to specific partitions** in line with the established resource allocation policies. The table below outlines the +various QoS definitions applicable to the merlin7 CPU-based cluster. Here: +* `MaxTRES` specifies resource limits per job. +* `MaxTRESPU` specifies resource limits per user. + +| Name | MaxTRES | MaxTRESPU | Scope | +| --------------: | -----------------: | -----------------: | ---------------------: | +| **normal** | | | partition | +| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | user, partition | +| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition | +| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition | + +Where: +* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job + restrictions. 
+* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
+  user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
+  overriding user-level QoS.
+* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
+  with higher resource needs.
+* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
+  which caters to very short-duration jobs.
+
+For additional details, refer to the [Partitions](/merlin7/slurm-configuration.html#Partitions) section.
+
+{{site.data.alerts.tip}}
+Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command.
+{{site.data.alerts.end}}
+
+## Partitions
+
+This section provides a summary of the partitions available in the `merlin7` CPU cluster.
+
+Key concepts:
+* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
+  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
+  and especially *fair share* can also influence scheduling.
+* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
+  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
+  partitions, where applicable.
+* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
+  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
+  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.
+
+{{site.data.alerts.tip}}
+Always verify partition configurations for potential changes using the 'scontrol show partition' command.
+{{site.data.alerts.end}}
+
+### Public partitions
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **general** | 1-00:00:00 | 7-00:00:00 | 50 | 1 | 1 | cpu_general | merlin |
+| **daily** | 0-01:00:00 | 1-00:00:00 | 63 | 500 | 1 | cpu_daily | merlin |
+| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | merlin |
+
+All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
+Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.
+
+{{site.data.alerts.tip}}
+For jobs running less than one day, submit them to the daily partition.
+For jobs running less than one hour, use the hourly partition.
+These partitions provide higher priority and ensure quicker scheduling compared to general, which has limited node availability.
+{{site.data.alerts.end}}
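
For instance, a short test job that takes advantage of the `hourly` partition could be sketched like this (resource values are arbitrary examples):

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --partition=hourly     # highest scheduling priority for jobs of up to 1 hour
#SBATCH --time=00:30:00
#SBATCH --ntasks=8

# 'hostname' stands in for a real short-running workload
srun hostname
```

+
+The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
+by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first.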
+As a result, access to the **`hourly`** partition might experience delays in such scenarios.
+
+### Private partitions
+
+#### CAS / ASA
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **asa-general** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |
+| **asa-daily** | 0-01:00:00 | 1-00:00:00 | 10 | 1000 | 2 | normal | asa |
+
+#### CNM / Mu3e
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |
+
+#### CNM / MeG
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
+| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
+| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
+| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |
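
Finally, to check which partitions, QoS limits and accounts actually apply to your user, the standard Slurm client commands referenced throughout this page can be combined as follows (the account and job script in the last command are placeholders for illustration):

```bash
# Partitions and their limits as currently configured
scontrol show partition

# QoS definitions (per-job and per-user TRES limits)
sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"

# Accounts and QoS your user is associated with
sacctmgr show associations where user=$USER format="Cluster,Account,User,QOS%40"

# Submitting to a private partition only works for members of the matching
# account, e.g. an mu3e member submitting my_job.sh:
sbatch --cluster=merlin7 --account=mu3e --partition=mu3e my_job.sh
```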