---
title: Slurm merlin7 Configuration
keywords: configuration, partitions, node definition
summary: "This document describes a summary of the Merlin7 Slurm CPU and GPU configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-configuration.html
---

This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Public partitions configuration summary

### CPU public partitions

| PartitionName | DefaultTime | MaxTime    | Priority | Account | Per Job Limits     | Per User Limits    |
|---------------|-------------|------------|----------|---------|--------------------|--------------------|
| general       | 1-00:00:00  | 7-00:00:00 | Low      | merlin  | cpu=1024,mem=1920G | cpu=1024,mem=1920G |
| daily         | 0-01:00:00  | 1-00:00:00 | Medium   | merlin  | cpu=1024,mem=1920G | cpu=2048,mem=3840G |
| hourly        | 0-00:30:00  | 0-01:00:00 | High     | merlin  | cpu=2048,mem=3840G | cpu=8192,mem=15T   |

### GPU public partitions

#### A100 nodes

| PartitionName    | DefaultTime | MaxTime    | Priority  | Account | Per Job Limits                   | Per User Limits                  |
|------------------|-------------|------------|-----------|---------|----------------------------------|----------------------------------|
| a100-general     | 1-00:00:00  | 7-00:00:00 | Low       | merlin  | gres/gpu=4                       | gres/gpu=8                       |
| a100-daily       | 0-01:00:00  | 1-00:00:00 | Medium    | merlin  | gres/gpu=8                       | gres/gpu=8                       |
| a100-hourly      | 0-00:30:00  | 0-01:00:00 | High      | merlin  | gres/gpu=8                       | gres/gpu=8                       |
| a100-interactive | 0-01:00:00  | 0-12:00:00 | Very High | merlin  | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 |

#### Grace-Hopper nodes

| PartitionName  | DefaultTime | MaxTime    | Priority  | Account | Per Job Limits                   | Per User Limits                  |
|----------------|-------------|------------|-----------|---------|----------------------------------|----------------------------------|
| gh-general     | 1-00:00:00  | 7-00:00:00 | Low       | merlin  | gres/gpu=4                       | gres/gpu=8                       |
| gh-daily       | 0-01:00:00  | 1-00:00:00 | Medium    | merlin  | gres/gpu=8                       | gres/gpu=8                       |
| gh-hourly      | 0-00:30:00  | 0-01:00:00 | High      | merlin  | gres/gpu=8                       | gres/gpu=8                       |
| gh-interactive | 0-01:00:00  | 0-12:00:00 | Very High | merlin  | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 |

## CPU cluster: merlin7

By default, jobs will be submitted to merlin7, as it is the primary cluster configured on the login nodes. Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name. However, when necessary, one can specify the cluster as follows:

```bash
#SBATCH --cluster=merlin7
```
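
For context, a minimal CPU batch script could look as follows; the partition, resources, and application name are illustrative only, not recommendations:

```bash
#!/bin/bash
#SBATCH --cluster=merlin7      # submit to the Merlin7 CPU cluster
#SBATCH --partition=general    # example partition; see the partition tables below
#SBATCH --ntasks=4             # number of tasks to run
#SBATCH --time=02:00:00        # walltime, within the partition limits

srun hostname                  # replace with the actual application
```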

### CPU general configuration

The Merlin7 CPU cluster is configured with the `CR_CORE_MEMORY` and `CR_ONE_TASK_PER_CORE` options.

* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with hyper-threading enabled, each core thread is counted as a CPU to fulfill a job's resource requirements.

By default, Slurm will allocate one task per core, which means:

* Each task will consume 2 CPUs, regardless of whether both threads are actively used by the job.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
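
As a sketch of this accounting on the hyper-threaded CPU nodes (the application name is a placeholder), a job requesting 8 tasks is charged 16 CPUs, two threads per core:

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --ntasks=8        # one task per core
# With hyper-threading enabled and one task allocated per core,
# Slurm accounts 2 CPUs (threads) per task: 16 CPUs for this job.

srun ./my_app             # placeholder application binary
```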

### CPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster. This information is essential for understanding how resources are allocated, enabling users to tailor their submission scripts accordingly.

| Nodes          | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features      |
|----------------|---------|----------------|-------|----------------|------|---------------|--------------|---------------|
| login[001-002] | 2       | 64             | 128   | 2              | 256  | 480G          | 1920M        | AMD_EPYC_7713 |
| cn[001-077]    | 2       | 64             | 128   | 2              | 256  | 480G          | 1920M        | AMD_EPYC_7713 |

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:

  * `--mem=<mem_in_MB>`: Allocates memory per node.
  * `--mem-per-cpu=<mem_in_MB>`: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **MaxMemPerNode** value.

* **Impact of disabling Hyper-Threading:** Using the `--hint=nomultithread` option disables one thread per core, effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended. In such cases, you should double the `--mem-per-cpu` value to account for the reduced number of threads, as shown in the example below.
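
A minimal sketch of such an MPI job, assuming the default of 1920M per CPU shown in the node table above (the task count and application name are placeholders):

```bash
#!/bin/bash
#SBATCH --cluster=merlin7
#SBATCH --ntasks=32                # 32 MPI ranks, one per physical core
#SBATCH --hint=nomultithread       # use a single thread per core
#SBATCH --mem-per-cpu=3840M        # doubled from the 1920M default to keep the same total memory

srun ./my_mpi_app                  # placeholder MPI application
```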

{{site.data.alerts.tip}} Always verify the Slurm '/var/spool/slurmd/conf-cache/slurm.conf' configuration file for potential changes. {{site.data.alerts.end}}

### User and job limits with QoS

In the merlin7 CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster efficiency. However, applying limits can occasionally impact the cluster's utilization. For example, user-specific limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing all available resources, which could block other jobs from running. Without job size limits, for instance, a large job might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist effectively.

To implement these limits, we utilize Quality of Service (QoS). Different QoS policies are defined and applied to specific partitions in line with the established resource allocation policies. The table below outlines the various QoS definitions applicable to the merlin7 CPU-based cluster. Here:

* **MaxTRES** specifies resource limits per job.
* **MaxTRESPU** specifies resource limits per user.

| Name        | MaxTRES            | MaxTRESPU          | Scope           |
|-------------|--------------------|--------------------|-----------------|
| normal      |                    |                    | partition       |
| cpu_general | cpu=1024,mem=1920G | cpu=1024,mem=1920G | user, partition |
| cpu_daily   | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition       |
| cpu_hourly  | cpu=2048,mem=3840G | cpu=8192,mem=15T   | partition       |

Where:

* **normal** QoS: This QoS has no limits and is typically applied to partitions that do not require user or job restrictions.
* **cpu_general** QoS: This is the default QoS for merlin7 users. It limits the total resources available to each user. Additionally, this QoS is applied to the general partition, enforcing restrictions at the partition level and overriding user-level QoS.
* **cpu_daily** QoS: Guarantees increased resources for the daily partition, accommodating shorter-duration jobs with higher resource needs.
* **cpu_hourly** QoS: Offers the least constraints, allowing more resources to be used for the hourly partition, which caters to very short-duration jobs.
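
To check which of these QoS policies is attached to your own association, and therefore which limits apply to your jobs by default, a query along the following lines can be used (the output fields are only a suggestion):

```bash
# Show the account/QoS associations of the current user on the merlin7 cluster
sacctmgr show assoc where user=$USER cluster=merlin7 format=Cluster,Account,User,QOS%22
```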

For additional details, refer to the CPU partitions section.

{{site.data.alerts.tip}} Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command. {{site.data.alerts.end}}

### CPU partitions

This section provides a summary of the partitions available in the merlin7 CPU cluster.

Key concepts:

* **PriorityJobFactor**: This value is added to a job's priority (visible in the PARTITION column of the `sprio -l` command). Jobs submitted to partitions with higher PriorityJobFactor values generally run sooner. However, other factors, such as job age and especially fair share, can also influence scheduling.
* **PriorityTier**: Jobs submitted to partitions with higher PriorityTier values take precedence over pending jobs in partitions with lower PriorityTier values. Additionally, jobs from higher PriorityTier partitions can preempt running jobs in lower-tier partitions, where applicable.
* **QoS**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various QoS settings can be found in the User and job limits with QoS section.

{{site.data.alerts.tip}} Always verify partition configurations for potential changes using the 'scontrol show partition' command. {{site.data.alerts.end}}

#### CPU public partitions

| PartitionName | DefaultTime | MaxTime    | TotalNodes | PriorityJobFactor | PriorityTier | QoS         | AllowAccounts |
|---------------|-------------|------------|------------|-------------------|--------------|-------------|---------------|
| general       | 1-00:00:00  | 7-00:00:00 | 50         | 1                 | 1            | cpu_general | merlin        |
| daily         | 0-01:00:00  | 1-00:00:00 | 62         | 500               | 1            | cpu_daily   | merlin        |
| hourly        | 0-00:30:00  | 0-01:00:00 | 77         | 1000              | 1            | cpu_hourly  | merlin        |

All Merlin users are part of the merlin account, which is used as the default account when submitting jobs. Similarly, if no partition is specified, jobs are automatically submitted to the general partition by default.

{{site.data.alerts.tip}} For jobs running less than one day, submit them to the daily partition. For jobs running less than one hour, use the hourly partition. These partitions provide higher priority and ensure quicker scheduling compared to general, which has limited node availability. {{site.data.alerts.end}}
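
For instance, an existing batch script (the script name here is a placeholder) can be redirected to the hourly partition directly from the command line, without editing the script itself:

```bash
# Submit a short job to the high-priority hourly partition with a 45-minute walltime
sbatch --partition=hourly --time=00:45:00 my_job.sh
```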

The hourly partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed by PriorityTier, ensures that jobs submitted to private partitions are prioritized and processed first. As a result, access to the hourly partition might experience delays in such scenarios.

#### CPU private partitions

##### CAS / ASA

| PartitionName | DefaultTime | MaxTime     | TotalNodes | PriorityJobFactor | PriorityTier | QoS    | AllowAccounts |
|---------------|-------------|-------------|------------|-------------------|--------------|--------|---------------|
| asa           | 0-01:00:00  | 14-00:00:00 | 10         | 1                 | 2            | normal | asa           |

##### CNM / Mu3e

| PartitionName | DefaultTime | MaxTime    | TotalNodes | PriorityJobFactor | PriorityTier | QoS    | AllowAccounts |
|---------------|-------------|------------|------------|-------------------|--------------|--------|---------------|
| mu3e          | 1-00:00:00  | 7-00:00:00 | 4          | 1                 | 2            | normal | mu3e, meg     |

##### CNM / MeG

| PartitionName | DefaultTime | MaxTime    | TotalNodes | PriorityJobFactor | PriorityTier | QoS    | AllowAccounts |
|---------------|-------------|------------|------------|-------------------|--------------|--------|---------------|
| meg-short     | 0-01:00:00  | 0-01:00:00 | unlimited  | 1000              | 2            | normal | meg           |
| meg-long      | 1-00:00:00  | 5-00:00:00 | unlimited  | 1                 | 2            | normal | meg           |
| meg-prod      | 1-00:00:00  | 5-00:00:00 | unlimited  | 1000              | 4            | normal | meg           |

## GPU cluster: gmerlin7

As mentioned in previous sections, jobs will by default be submitted to merlin7, as it is the primary cluster configured on the login nodes. To submit jobs to the GPU cluster, the cluster name gmerlin7 must be specified, as follows:

```bash
#SBATCH --cluster=gmerlin7
```
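
For example, a minimal GPU batch script could look as follows; the partition, GPU count, and application name are illustrative only:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin7         # submit to the Merlin7 GPU cluster
#SBATCH --partition=gh-daily       # example partition; see the partition tables below
#SBATCH --gpus=1                   # request one GPU
#SBATCH --time=08:00:00            # walltime, within the daily partition limits

srun ./my_gpu_app                  # placeholder GPU application
```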

### GPU general configuration

The Merlin7 GPU cluster is configured with the `CR_CORE_MEMORY`, `CR_ONE_TASK_PER_CORE`, and `ENFORCE_BINDING_GRES` options.

* This configuration treats both cores and memory as consumable resources.
* On nodes running with hyper-threading enabled (the A100-based nodes), each core thread is counted as a CPU to fulfill a job's resource requirements.
* With `ENFORCE_BINDING_GRES`, the CPUs allocated to a job are restricted to those bound to the selected GPU, ensuring CPU-to-GPU affinity.

By default, Slurm will allocate one task per core, which means:

* For hyper-threaded nodes (the NVIDIA A100-based nodes), each task will consume 2 CPUs, regardless of whether both threads are actively used by the job.
* For the NVIDIA Grace-Hopper-based nodes, each task will consume 1 CPU.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.

### GPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 GPU cluster (gmerlin7). This information is essential for understanding how resources are allocated, enabling users to tailor their submission scripts accordingly.

| Nodes        | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Gres                        | Features               |
|--------------|---------|----------------|-------|----------------|------|---------------|--------------|-----------------------------|------------------------|
| gpu[001-007] | 4       | 72             | 288   | 1              | 288  | 828G          | 2944M        | gpu:gh200:4                 | GH200, NV_H100         |
| gpu[101-105] | 1       | 64             | 64    | 2              | 128  | 480G          | 3840M        | gpu:nvidia_a100-sxm4-80gb:4 | AMD_EPYC_7713, NV_A100 |
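
If a job requires a specific GPU type, the GRES names listed above can be requested explicitly. A minimal sketch, with an illustrative GPU count and a placeholder application name:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin7
#SBATCH --partition=a100-hourly              # example partition served by the A100 nodes
#SBATCH --gres=gpu:nvidia_a100-sxm4-80gb:2   # request two A100 GPUs on one node
#SBATCH --time=00:30:00

srun ./my_gpu_app                            # placeholder GPU application
```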

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:

  * `--mem=<mem_in_MB>`: Allocates memory per node.
  * `--mem-per-cpu=<mem_in_MB>`: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **MaxMemPerNode** value.

* **Impact of disabling Hyper-Threading:** Using the `--hint=nomultithread` option disables one thread per core, effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended. In such cases, you should double the `--mem-per-cpu` value to account for the reduced number of threads.

{{site.data.alerts.tip}} Always verify the Slurm '/var/spool/slurmd/conf-cache/slurm.conf' configuration file for potential changes. {{site.data.alerts.end}}

### User and job limits with QoS

In the gmerlin7 GPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster efficiency. However, applying limits can occasionally impact the cluster's utilization. For example, user-specific limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing all available resources, which could block other jobs from running. Without job size limits, for instance, a large job might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist effectively.

To implement these limits, we utilize Quality of Service (QoS). Different QoS policies are defined and applied to specific partitions in line with the established resource allocation policies. The table below outlines the various QoS definitions applicable to the gmerlin7 GPU-based cluster. Here:

* **MaxTRES** specifies resource limits per job.
* **MaxTRESPU** specifies resource limits per user.

| Name                 | MaxTRES                          | MaxTRESPU                        | Scope           |
|----------------------|----------------------------------|----------------------------------|-----------------|
| normal               |                                  |                                  | partition       |
| gpu_general          | gres/gpu=4                       | gres/gpu=8                       | user, partition |
| gpu_daily            | gres/gpu=8                       | gres/gpu=8                       | partition       |
| gpu_hourly           | gres/gpu=8                       | gres/gpu=8                       | partition       |
| gpu_gh_interactive   | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 | partition       |
| gpu_a100_interactive | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 | partition       |

Where:

* **normal** QoS: This QoS has no limits and is typically applied to partitions that do not require user or job restrictions.
* **gpu_general** QoS: This is the default QoS for gmerlin7 users. It limits the total resources available to each user. Additionally, this QoS is applied to the [a100|gh]-general partitions, enforcing restrictions at the partition level and overriding user-level QoS.
* **gpu_daily** QoS: Guarantees increased resources for the [a100|gh]-daily partitions, accommodating shorter-duration jobs with higher resource needs.
* **gpu_hourly** QoS: Offers the least constraints, allowing more resources to be used for the [a100|gh]-hourly partitions, which cater to very short-duration jobs.
* **gpu_a100_interactive** & **gpu_gh_interactive** QoS: Guarantee interactive access to GPU nodes for software compilation and small testing.

For additional details, refer to the GPU partitions section.

{{site.data.alerts.tip}} Always verify QoS definitions for potential changes using the 'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"' command. {{site.data.alerts.end}}

### GPU partitions

This section provides a summary of the partitions available in the gmerlin7 GPU cluster.

Key concepts:

* **PriorityJobFactor**: This value is added to a job's priority (visible in the PARTITION column of the `sprio -l` command). Jobs submitted to partitions with higher PriorityJobFactor values generally run sooner. However, other factors, such as job age and especially fair share, can also influence scheduling.
* **PriorityTier**: Jobs submitted to partitions with higher PriorityTier values take precedence over pending jobs in partitions with lower PriorityTier values. Additionally, jobs from higher PriorityTier partitions can preempt running jobs in lower-tier partitions, where applicable.
* **QoS**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various QoS settings can be found in the User and job limits with QoS section.

{{site.data.alerts.tip}} Always verify partition configurations for potential changes using the 'scontrol show partition' command. {{site.data.alerts.end}}

#### A100-based partitions

| PartitionName    | DefaultTime | MaxTime    | TotalNodes | PriorityJobFactor | PriorityTier | QoS                  | AllowAccounts |
|------------------|-------------|------------|------------|-------------------|--------------|----------------------|---------------|
| a100-general     | 1-00:00:00  | 7-00:00:00 | 3          | 1                 | 1            | gpu_general          | merlin        |
| a100-daily       | 0-01:00:00  | 1-00:00:00 | 4          | 500               | 1            | gpu_daily            | merlin        |
| a100-hourly      | 0-00:30:00  | 0-01:00:00 | 5          | 1000              | 1            | gpu_hourly           | merlin        |
| a100-interactive | 0-01:00:00  | 0-12:00:00 | 5          | 1                 | 2            | gpu_a100_interactive | merlin        |

All Merlin users are part of the merlin account, which is used as the default account when submitting jobs. Similarly, if no partition is specified, jobs are automatically submitted to the general partition by default.

{{site.data.alerts.tip}} For jobs running less than one day, submit them to the a100-daily partition. For jobs running less than one hour, use the a100-hourly partition. These partitions provide higher priority and ensure quicker scheduling compared to a100-general, which has limited node availability. {{site.data.alerts.end}}
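
For interactive work on an A100 node (e.g., compiling software or running small tests), an allocation can be requested with salloc. The values below simply mirror the a100-interactive QoS limits shown earlier and can be reduced as needed:

```bash
# Request an interactive allocation on an A100 node within the QoS limits
salloc --cluster=gmerlin7 --partition=a100-interactive \
       --gpus=1 --cpus-per-task=16 --mem=60G --time=04:00:00
```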

#### GH-based partitions

| PartitionName  | DefaultTime | MaxTime    | TotalNodes | PriorityJobFactor | PriorityTier | QoS                | AllowAccounts |
|----------------|-------------|------------|------------|-------------------|--------------|--------------------|---------------|
| gh-general     | 1-00:00:00  | 7-00:00:00 | 5          | 1                 | 1            | gpu_general        | merlin        |
| gh-daily       | 0-01:00:00  | 1-00:00:00 | 6          | 500               | 1            | gpu_daily          | merlin        |
| gh-hourly      | 0-00:30:00  | 0-01:00:00 | 7          | 1000              | 1            | gpu_hourly         | merlin        |
| gh-interactive | 0-01:00:00  | 0-12:00:00 | 7          | 1                 | 2            | gpu_gh_interactive | merlin        |

All Merlin users are part of the merlin account, which is used as the default account when submitting jobs. Similarly, if no partition is specified, jobs are automatically submitted to the general partition by default.

{{site.data.alerts.tip}} For jobs running less than one day, submit them to the gh-daily partition. For jobs running less than one hour, use the gh-hourly partition. These partitions provide higher priority and ensure quicker scheduling compared to gh-general, which has limited node availability. {{site.data.alerts.end}}