---
title: Slurm Configuration
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document describes a summary of the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---
This documentation shows the basic Slurm configuration and the options needed to run jobs in the Merlin5 cluster.

The Merlin5 cluster is an old cluster with old hardware which is maintained on a best-effort basis to increase the CPU power of the Merlin cluster.
## Merlin5 CPU nodes definition
The following table shows the default and maximum resources that can be used per node:
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node | Max.Swap |
|---|---|---|---|---|---|
| merlin-c-[18-30] | 1 core | 16 cores | 1 | 60000 | 10000 |
| merlin-c-[31-32] | 1 core | 16 cores | 1 | 124000 | 10000 |
| merlin-c-[33-45] | 1 core | 16 cores | 1 | 60000 | 10000 |
| merlin-c-[46-47] | 1 core | 16 cores | 1 | 124000 | 10000 |
There is one main difference between the Merlin5 and Merlin6 clusters: Merlin5 keeps an old configuration which does not consider memory as a consumable resource. Hence, users can oversubscribe memory. This might trigger some side-effects, but this legacy configuration has been kept to ensure that old jobs keep running in the same way they did a few years ago. If you know that this might be a problem for you, please always use Merlin6 instead.
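As a quick sanity check, the consumable resources of each cluster can be queried from a login node; the sketch below assumes the multi-cluster setup exposes both clusters through the `--clusters` option:

```bash
# Hedged example: check which resources are consumable on each cluster.
# On Merlin5 the SelectTypeParameters are expected not to include memory,
# while on Merlin6 they should (e.g. CR_Core_Memory).
scontrol --clusters=merlin5 show config | grep -i SelectTypeParameters
scontrol --clusters=merlin6 show config | grep -i SelectTypeParameters
```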
## Running jobs in the 'merlin5' cluster
In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.
### Merlin5 CPU cluster
To run jobs in the `merlin5` cluster, users must specify the cluster name in Slurm:

```bash
#SBATCH --clusters=merlin5
```
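The cluster can also be selected at submission time instead of inside the batch script; `myjob.sh` below is just a placeholder name:

```bash
sbatch --clusters=merlin5 myjob.sh   # submit the script to the merlin5 cluster
squeue --clusters=merlin5 -u $USER   # check your jobs on merlin5
```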
### CPU partitions
Users might need to specify the Slurm partition. If no partition is specified, jobs will be submitted to the default partition, `merlin`:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```
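To double check which partitions are visible and their current limits, the partition list can be queried directly from Slurm (the output columns below are chosen for illustration):

```bash
# List merlin5 partitions with their time limit, node count and node list
sinfo --clusters=merlin5 -o "%P %l %D %N"
```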
The table below summarizes all partitions available to users:
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|---|---|---|---|---|---|
| merlin | 5 days | 1 week | All nodes | 500 | 1 |
| merlin-long | 5 days | 21 days | 4 | 1 | 1 |
\*The PriorityJobFactor value will be added to the job priority (PARTITION column in `sprio -l`). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as job age or, mainly, fair share might affect that decision). For the GPU partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with lower PriorityTier values and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.
The `merlin-long` partition is limited to 4 nodes, as it might contain jobs running for up to 21 days.
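The priorities and partition limits described above can be inspected with standard Slurm commands, for example:

```bash
# Show the priority factors of your pending jobs (the PARTITION column reflects PriorityJobFactor)
sprio --clusters=merlin5 -l -u $USER

# Show the full definition of the merlin-long partition, including MaxNodes and MaxTime
scontrol --clusters=merlin5 show partition merlin-long
```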
### Merlin5 CPU Accounts
Users need to ensure that the `merlin` account is specified (or that no account is specified at all), as this is the only account available in the merlin5 cluster. This is mostly relevant for users with multiple Slurm accounts, who may have defined by mistake a different account that exists in one of the other Merlin clusters (i.e. `merlin6`, `gmerlin6`).

```bash
#SBATCH --account=merlin   # Possible values: merlin
```
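If you are unsure which accounts are associated with your user on each cluster, they can be listed as follows:

```bash
# List the Slurm associations (cluster/account/user) defined for your user
sacctmgr show associations user=$USER format=cluster,account%20,user
```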
### Merlin5 CPU specific options
Some options are available when using CPUs; the most common ones are detailed here. Alternative Slurm options for CPU based jobs are also available. Please refer to the man pages of each Slurm command (`man salloc`, `man sbatch`, `man srun`) for further information.

The most common settings are listed below:
```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for 'srun' command
```
Notice that in Merlin5 no hyper-threading is available (while in Merlin6 it is). Hence, in Merlin5 there is no need to specify the `--hint` hyper-threading related options.
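As an illustration only, the sketch below combines the settings above into a minimal Merlin5 batch script; the job name, modules and binary are placeholders to be adapted to your own environment:

```bash
#!/bin/bash
#SBATCH --clusters=merlin5       # run on the Merlin5 CPU cluster
#SBATCH --partition=merlin       # default partition (max. 1 week)
#SBATCH --account=merlin         # the only account available in merlin5
#SBATCH --job-name=mpi_test      # placeholder job name
#SBATCH --ntasks=32              # 32 MPI tasks in total
#SBATCH --ntasks-per-node=16     # at most 16 tasks per node (16 cores, no hyper-threading)
#SBATCH --time=12:00:00          # 12 hours, well below the 5 day default
#SBATCH --output=%x-%j.out       # output file named after job name and job ID

module load gcc openmpi          # placeholder modules; adapt to your software stack
srun ./my_mpi_program            # placeholder binary
```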
### User and job limits
In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure fair usage of the resources and to avoid a single user or job overusing them. However, applying limits might affect the overall usage efficiency of the cluster (for example, pending jobs from a single user while many nodes sit idle due to low overall activity is something that can be seen when user limits are applied). In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all resources of the batch system would drain the entire cluster in order to fit, which is undesirable).

Hence, there is a need to set up wise limits which ensure fair usage of the resources, trying to optimize the overall efficiency of the cluster while allowing jobs of different nature and sizes (that is, single-core vs. parallel jobs of different sizes) to run.
{{site.data.alerts.warning}}Wide limits are provided in the daily and hourly partitions, while for general those limits are more restrictive.
However, we kindly ask users to inform the Merlin administrators when there are plans to submit big jobs which would require a massive draining of nodes in order to be allocated. This applies to jobs requiring the unlimited QoS (see "Per job limits" below).
{{site.data.alerts.end}}
{{site.data.alerts.tip}}If you have different requirements, please let us know and we will try to accommodate or propose a solution for you.{{site.data.alerts.end}}
#### Per job limits
These are limits which apply to a single job. In other words, there is a maximum amount of resources a single job can use. This is described in the table below, and the limits vary depending on the day of the week and the time (working vs. non-working hours). Limits are shown in the format `SlurmQoS(limits)`, where the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`:
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|---|---|---|---|
| general | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) |
| daily | daytime(cpu=704,mem=2750G) | nighttime(cpu=1408,mem=5500G) | unlimited(cpu=2200,mem=8593.75G) |
| hourly | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |
By default, a job can not use more than 704 cores (max CPUs per job); memory is also proportionally limited in the same way. This is equivalent to running a job on up to 8 nodes at once. This limit applies to the general partition (fixed limit) and to the daily partition (only during working hours). Limits are relaxed for the daily partition during non-working hours, and during the weekend the limits are even wider.

For the hourly partition, wider limits are provided even though running very large parallel jobs is not desirable (allocating such jobs requires a massive draining of nodes). Per-job limits are still necessary precisely to avoid draining large parts of the cluster for a single huge job. Hence, the unlimited QoS mostly refers to "per user" limits rather than to "per job" limits (in other words, users can run any number of hourly jobs, but the size of each job is limited, although with generous values).
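The exact QoS limits currently configured can always be verified directly from Slurm; the field names below are the standard `sacctmgr` ones and may vary slightly with the Slurm version:

```bash
# Show per-job (MaxTRES) and per-user (MaxTRESPU) limits for each QoS
sacctmgr show qos format=name%12,maxtres%40,maxtrespu%40
```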
#### Per user limits for CPU partitions
These are limits which apply exclusively to users. In other words, there is a maximum amount of resources a single user can use. This is described in the table below, and the limits vary depending on the day of the week and the time (working vs. non-working hours). Limits are shown in the format `SlurmQoS(limits)`, where the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`:
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|---|---|---|---|
| general | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) |
| daily | daytime(cpu=1408,mem=5500G) | nighttime(cpu=2112,mem=8250G) | unlimited(cpu=6336,mem=24750G) |
| hourly | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) |
By default, users can not use more than 704 cores at the same time (max CPUs per user); memory is also proportionally limited in the same way. This is equivalent to 8 exclusive nodes. This limit applies to the general partition (fixed limit) and to the daily partition (only during working hours). Limits are relaxed for the daily partition during non-working hours and removed during the weekend, and for the hourly partition the user limit restrictions are removed at all times.
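To see how close you are to the per-user limit, the CPUs currently allocated to your running jobs can be summed up, for example:

```bash
# Sum the CPUs allocated to your running jobs (compare with the per-user cpu limit)
squeue -u $USER -t RUNNING -h -o "%C" | awk '{cpus += $1} END {print cpus, "CPUs in use"}'
```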
## Merlin6 GPU
Basic configuration for the merlin5 GPUs is detailed here. For advanced usage, please refer to the section Understanding the Slurm configuration (for advanced users) below.
### GPU nodes definition
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs |
|---|---|---|---|---|---|---|---|---|---|---|
| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | GTX1080 | 1 | 2 |
| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | GTX1080 | 1 | 4 |
| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | GTX1080Ti | 1 | 4 |
| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | RTX2080Ti | 1 | 4 |
{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' for changes in the GPU type and details of the NUMA node. {{site.data.alerts.end}}
### GPU partitions
| GPU Partition | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor\* |
|---|---|---|---|---|---|
| gpu | 1 day | 1 week | 4 | low | 1 |
| gpu-short | 2 hours | 2 hours | 4 | highest | 1000 |
\*The PriorityJobFactor value will be added to the job priority (PARTITION column in `sprio -l`). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as job age or, mainly, fair share might affect that decision). For the GPU partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.
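As a sketch only, a GPU job could request a specific GPU type and partition as shown below; the GPU type string is assumed to match the names in the table above (check `/etc/slurm/gres.conf` for the authoritative names), and the module and binary are placeholders:

```bash
#!/bin/bash
#SBATCH --partition=gpu-short    # or 'gpu' for jobs running up to 1 week
#SBATCH --gres=gpu:GTX1080Ti:2   # request 2 GPUs of a specific type; 'gpu:2' requests any type
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8        # within the 20 cores of merlin-g-[002-013]
#SBATCH --time=01:00:00

module load cuda                 # placeholder module; adapt to your software stack
srun ./my_gpu_program            # placeholder binary
```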
### User and job limits
The GPU cluster enforces some basic user and job limits to ensure fair usage of the cluster and to prevent a single user from overusing the resources. The limits are described below.
#### Per job limits
These are limits applying to a single job. In other words, there is a maximum amount of resources a single job can use. Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below in the format `SlurmQoS(limits)` (the list of possible `SlurmQoS` values can be obtained with the command `sacctmgr show qos`):
| Partition | Mon-Sun 0h-24h |
|---|---|
| gpu | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
| gpu-short | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
With these limits, a single job can not use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory. Any job exceeding such limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerJob`. Since there is no other QoS that temporarily overrides the job limits during the week (as happens, for instance, for the CPU daily partition), such a job needs to be cancelled and resubmitted with resource requests adapted to the limits above.
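The reason why a job is kept pending can be checked in the standard way, for example:

```bash
# Show job ID, partition, state and pending reason; jobs blocked by the QoS limits
# described above will show the corresponding QOSMax...PerJob reason
squeue -u $USER -o "%.10i %.12P %.10T %.30r"
```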
#### Per user limits for GPU partitions
These limits apply exclusively to users. In other words, there is a maximum amount of resources a single user can use. Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below in the format `SlurmQoS(limits)` (the list of possible `SlurmQoS` values can be obtained with the command `sacctmgr show qos`):
| Partition | Mon-Sun 0h-24h |
|---|---|
| gpu | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
| gpu-short | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
With these limits, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory. Jobs submitted by a user already exceeding such limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerUser`. In that case, the jobs will wait in the queue until some of the user's running resources are freed.
Notice that the user limits are wider than the job limits. In this way, a user can run up to two 8-GPU jobs, up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same type of GPU.
## Understanding the Slurm configuration (for advanced users)
Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Historically, Merlin4 and Merlin5 also used Slurm. In the same way, Merlin6 has also been configured with this batch system.

Slurm has been installed in a multi-clustered configuration, allowing the integration of multiple clusters in the same batch system.

For understanding the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:
* `/etc/slurm/slurm.conf` - can be found in the login nodes and computing nodes.
* `/etc/slurm/gres.conf` - can be found in the GPU nodes, and is also propagated to the login nodes and computing nodes for user read access.
* `/etc/slurm/cgroup.conf` - can be found in the computing nodes, and is also propagated to the login nodes for user read access.
The previous configuration files, which can be found in the login nodes, correspond exclusively to the merlin6 cluster configuration. Configuration files for the old merlin5 cluster must be checked directly on any of the merlin5 computing nodes: these are not propagated to the login nodes.
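If reading the files on the nodes is not convenient, the live configuration of each cluster can also be queried from a login node, for example:

```bash
# Query the running configuration of the merlin5 cluster from its slurmctld
scontrol --clusters=merlin5 show config | less

# Or inspect a configuration file directly, e.g. the partition definitions
grep -i "^PartitionName" /etc/slurm/slurm.conf
```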