diff --git a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md index c939bd3..12b03ae 100644 --- a/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md +++ b/pages/merlin7/03-Slurm-General-Documentation/slurm-configuration.md @@ -10,9 +10,18 @@ permalink: /merlin7/slurm-configuration.html This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster. -## General configuration +## CPU cluster: merlin7 -The **Merlin7 cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options. +**By default, jobs will be submitted to `merlin7`**, as it is the primary cluster configured on the login nodes. +Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name. +However, when necessary, one can specify the cluster as follows: +```bash +#SBATCH --cluster=merlin7 +``` + +### CPU general configuration + +The **Merlin7 CPU cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options. * This configuration treats both cores and memory as consumable resources. * Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU to fulfill a job's resource requirements. @@ -22,16 +31,7 @@ By default, Slurm will allocate one task per core, which means: This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases. -### Default cluster - -By default, jobs will be submitted to **`merlin7`**, as it is the primary cluster configured on the login nodes. -Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name. -However, when necessary, one can specify the cluster as follows: -```bash -#SBATCH --cluster=merlin7 -``` - -## Slurm nodes definition +### CPU nodes definition The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster. This information is essential for understanding how resources are allocated, enabling users to tailor their submission @@ -98,13 +98,13 @@ Where: * **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition, which caters to very short-duration jobs. -For additional details, refer to the [Partitions](/merlin7/slurm-configuration.html#Partitions) section. +For additional details, refer to the [CPU partitions](/merlin7/slurm-configuration.html#CPU-partitions) section. {{site.data.alerts.tip}} Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command. {{site.data.alerts.end}} -## Partitions +### CPU partitions This section provides a summary of the partitions available in the `merlin7` CPU cluster. @@ -123,12 +123,12 @@ Key concepts: Always verify partition configurations for potential changes using the 'scontrol show partition' command. 
 {{site.data.alerts.end}}
 
-### Public partitions
+#### CPU public partitions
 
 | PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
 | -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
 | **general** | 1-00:00:00 | 7-00:00:00 | 50 | 1 | 1 | cpu_general | merlin |
-| **daily** | 0-01:00:00 | 1-00:00:00 | 63 | 500 | 1 | cpu_daily | merlin |
+| **daily** | 0-01:00:00 | 1-00:00:00 | 62 | 500 | 1 | cpu_daily | merlin |
 | **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | merlin |
 
 All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
@@ -144,25 +144,179 @@ The **`hourly`** partition may include private nodes as an additional buffer. Ho
 by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first.
 As a result, access to the **`hourly`** partition might experience delays in such scenarios.
 
-### Private partitions
+#### CPU private partitions
 
-#### CAS / ASA
+##### CAS / ASA
 
 | PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
 | -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
 | **asa-general** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |
 | **asa-daily** | 0-01:00:00 | 1-00:00:00 | 10 | 1000 | 2 | normal | asa |
 
-#### CNM / Mu3e
+##### CNM / Mu3e
 
 | PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
 | -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
 | **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |
 
-#### CNM / MeG
+##### CNM / MeG
 
 | PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
 | -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
 | **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
 | **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
 | **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |
+
+## GPU cluster: gmerlin7
+
+As mentioned in previous sections, by default, jobs will be submitted to `merlin7`, as it is the primary cluster configured on the login nodes.
+For submitting jobs to the GPU cluster, **the cluster name `gmerlin7` must be specified**, as follows:
+```bash
+#SBATCH --cluster=gmerlin7
+```
+
+### GPU general configuration
+
+The **Merlin7 GPU cluster** is configured with the **`CR_CORE_MEMORY`**, **`CR_ONE_TASK_PER_CORE`**, and **`ENFORCE_BINDING_GRES`** options.
+* This configuration treats both cores and memory as consumable resources.
+* On nodes running with **hyper-threading** enabled (the NVIDIA A100-based nodes), each core thread is counted as a CPU
+  to fulfill a job's resource requirements.
+* With GRES binding enforced, the CPUs allocated to a job are those bound to the selected GPU(s).
+
+By default, Slurm will allocate one task per core, which means:
+* For hyper-threaded nodes (NVIDIA A100-based nodes), each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
+* For the NVIDIA Grace Hopper-based nodes, each task will consume 1 **CPU**.
+
+This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
+
+### GPU nodes definition
+
+The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 GPU cluster.
+This information is essential for understanding how resources are allocated, enabling users to tailor their submission
+scripts accordingly.
+
+| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Gres | Features |
+| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | --------------------------: | ---------------------: |
+| gpu[001-007] | 4 | 72 | 288 | 1 | 288 | 828G | 2944M | gpu:gh200:4 | GH200, NV_H100 |
+| gpu[101-105] | 1 | 64 | 64 | 2 | 128 | 480G | 3840M | gpu:nvidia_a100-sxm4-80gb:4 | AMD_EPYC_7713, NV_A100 |
+
+Notes on memory configuration:
+* **Memory allocation options:** To request additional memory, use the following options in your submission script:
+  * **`--mem=`**: Allocates memory per node.
+  * **`--mem-per-cpu=`**: Allocates memory per CPU (equivalent to a core thread).
+
+  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
+* **Impact of disabling Hyper-Threading:** On the hyper-threaded A100-based nodes, using the **`--hint=nomultithread`** option
+disables one thread per core, effectively halving the number of available CPUs. Consequently, memory allocation will also be
+halved unless explicitly adjusted.
+
+  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
+  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads.
+
+{{site.data.alerts.tip}}
+Always verify the Slurm '/var/spool/slurmd/conf-cache/slurm.conf' configuration file for potential changes.
+{{site.data.alerts.end}}
+
+### User and job limits with QoS
+
+In the `gmerlin7` GPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
+overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
+efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
+limits may result in pending jobs even when many nodes are idle due to low activity.

+On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
+all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
+might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.
+
+Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
+limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
+effectively.
+
+To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
+**to specific partitions** in line with the established resource allocation policies. The table below outlines the
+various QoS definitions applicable to the gmerlin7 GPU cluster. Here:
+* `MaxTRES` specifies resource limits per job.
+* `MaxTRESPU` specifies resource limits per user.
+
+| Name | MaxTRES | MaxTRESPU | Scope |
+| -----------------------: | -------------------------------: | -------------------------------: | ---------------------: |
+| **normal** | | | partition |
+| **gpu_general** | gres/gpu=4 | gres/gpu=8 | user, partition |
+| **gpu_daily** | gres/gpu=8 | gres/gpu=16 | partition |
+| **gpu_hourly** | gres/gpu=8 | gres/gpu=16 | partition |
+| **gpu_gh_interactive** | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 | partition |
+| **gpu_a100_interactive** | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 | partition |
+
+Where:
+* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
+  restrictions.
+* **`gpu_general` QoS:** This is the **default QoS** for `gmerlin7` _users_. It limits the total resources available to each
+  user. Additionally, this QoS is applied to the `[a100|gh]-general` partitions, enforcing restrictions at the partition level and
+  overriding user-level QoS.
+* **`gpu_daily` QoS:** Guarantees increased resources for the `[a100|gh]-daily` partitions, accommodating shorter-duration jobs
+  with higher resource needs.
+* **`gpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `[a100|gh]-hourly` partitions,
+  which cater to very short-duration jobs.
+* **`gpu_a100_interactive` & `gpu_gh_interactive` QoS:** Guarantee interactive access to GPU nodes for software compilation and
+  small-scale testing.
+
+For additional details, refer to the [GPU partitions](/merlin7/slurm-configuration.html#GPU-partitions) section.
+
+{{site.data.alerts.tip}}
+Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command.
+{{site.data.alerts.end}}
+
+### GPU partitions
+
+This section provides a summary of the partitions available in the `gmerlin7` GPU cluster.
+
+Key concepts:
+* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
+  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
+  and especially *fair share* can also influence scheduling.
+* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
+  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
+  partitions, where applicable.
+* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
+  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
+  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.
+
+{{site.data.alerts.tip}}
+Always verify partition configurations for potential changes using the 'scontrol show partition' command.
+{{site.data.alerts.end}}
+
+#### A100-based partitions
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -------------------: | -----------: | ----------: | --------: | ----------------: | -----------: | -------------------: | -------------: |
+| **a100-general** | 1-00:00:00 | 7-00:00:00 | 3 | 1 | 1 | gpu_general | merlin |
+| **a100-daily** | 0-01:00:00 | 1-00:00:00 | 4 | 500 | 1 | gpu_daily | merlin |
+| **a100-hourly** | 0-00:30:00 | 0-01:00:00 | 5 | 1000 | 1 | gpu_hourly | merlin |
+| **a100-interactive** | 0-01:00:00 | 0-12:00:00 | 5 | 1 | 2 | gpu_a100_interactive | merlin |
+
+All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
+Similarly, if no partition is specified, jobs are automatically submitted to the cluster's default partition.
+
+{{site.data.alerts.tip}}
+For jobs running less than one day, submit them to the a100-daily partition.
+For jobs running less than one hour, use the a100-hourly partition.
+These partitions provide higher priority and ensure quicker scheduling compared to a100-general, which has limited node availability.
+{{site.data.alerts.end}}
+
+#### GH-based partitions
+
+| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
+| -------------------: | -----------: | ----------: | --------: | ----------------: | -----------: | -------------------: | -------------: |
+| **gh-general** | 1-00:00:00 | 7-00:00:00 | 5 | 1 | 1 | gpu_general | merlin |
+| **gh-daily** | 0-01:00:00 | 1-00:00:00 | 6 | 500 | 1 | gpu_daily | merlin |
+| **gh-hourly** | 0-00:30:00 | 0-01:00:00 | 7 | 1000 | 1 | gpu_hourly | merlin |
+| **gh-interactive** | 0-01:00:00 | 0-12:00:00 | 7 | 1 | 2 | gpu_gh_interactive | merlin |
+
+All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
+Similarly, if no partition is specified, jobs are automatically submitted to the cluster's default partition.
+
+{{site.data.alerts.tip}}
+For jobs running less than one day, submit them to the gh-daily partition.
+For jobs running less than one hour, use the gh-hourly partition.
+These partitions provide higher priority and ensure quicker scheduling compared to gh-general, which has limited node availability.
+{{site.data.alerts.end}}
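+
+#### Example: batch job on the GPU cluster
+
+As a practical reference, below is a minimal batch script sketch that combines the options described in this section: the `gmerlin7` cluster name, a GPU partition, and a GPU request via `--gres`. The partition, time limit, GPU and CPU counts, and the `./my_gpu_app` executable are placeholders; adjust them to your workload and keep them within the per-job limits enforced by the partition's QoS (for example, `gpu_daily` allows at most `gres/gpu=8` per job).
+```bash
+#!/bin/bash
+#SBATCH --clusters=gmerlin7      # GPU cluster; merlin7 (CPU) is the default cluster
+#SBATCH --partition=gh-daily     # example partition; use the a100-* partitions for the A100 nodes
+#SBATCH --time=0-12:00:00        # must not exceed the partition MaxTime (1-00:00:00 for *-daily)
+#SBATCH --gres=gpu:1             # number of GPUs per node, limited by the partition QoS
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16       # CPUs are restricted to those bound to the allocated GPU(s)
+#SBATCH --mem-per-cpu=2944M      # optional; omitting it falls back to the node's DefMemPerCPU
+
+srun ./my_gpu_app                # placeholder for your GPU-enabled application
+```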
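+
+#### Example: interactive GPU session
+
+The interactive partitions are intended for compilation and small-scale tests, and their QoS caps a session at 1 GPU, 16 CPUs, and 46G (GH) or 60G (A100) of memory per node. The sketch below shows one possible way to request such a session with `srun`, staying within the `gpu_gh_interactive` limits; the time limit and shell invocation are illustrative only, and `a100-interactive` can be used analogously with up to 60G of memory.
+```bash
+# Request an interactive shell on a Grace Hopper node within the gpu_gh_interactive QoS limits
+srun --clusters=gmerlin7 --partition=gh-interactive \
+     --gres=gpu:1 --cpus-per-task=16 --mem=46G --time=02:00:00 \
+     --pty bash -l
+```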