title | keywords | last_updated | summary | sidebar | permalink |
---|---|---|---|---|---|
Slurm cluster 'gmerlin6' | configuration, partitions, node definition, gmerlin6 | 29 January 2021 | This document summarizes the Slurm configuration. | merlin6_sidebar | /gmerlin6/slurm-configuration.html |
This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster.
Merlin6 GPU nodes definition
The table below summarizes the hardware setup of the different GPU nodes:
Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs |
---|---|---|---|---|---|---|---|---|---|---|
merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | geforce_gtx_1080 | 1 | 2 |
merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | geforce_gtx_1080 | 1 | 4 |
merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | geforce_gtx_1080_ti | 1 | 4 |
merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | geforce_rtx_2080_ti | 1 | 4 |
merlin-g-014 | 1 core | 48 cores | 1 | 4000 | 360448 | 360448 | 10000 | geforce_rtx_2080_ti | 1 | 8 |
merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | A100 | 1 | 8 |
{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' and '/etc/slurm/slurm.conf' for changes in the GPU type and details of the hardware. {{site.data.alerts.end}}
Running jobs in the 'gmerlin6' cluster
In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.
Merlin6 GPU cluster
To run jobs in the `gmerlin6` cluster, users must specify the cluster name in Slurm:
#SBATCH --cluster=gmerlin6
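As a minimal sketch of where this directive sits in a batch script, the example below only prints a short report. The job name is a hypothetical placeholder, and the `--gpus=1` request is an assumption added to make the script complete. `SLURM_CLUSTER_NAME` and `SLURM_JOB_ID` are variables that Slurm normally exports inside a job; outside a job they fall back to `n/a`.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6      # Send the job to the GPU cluster
#SBATCH --job-name=gpu-hello    # Hypothetical job name
#SBATCH --gpus=1                # Assumed request of a single GPU

# Build a one-line report; outside Slurm the variables fall back to "n/a".
report="cluster=${SLURM_CLUSTER_NAME:-n/a} job=${SLURM_JOB_ID:-n/a}"
echo "$report"
```

The script would typically be submitted with `sbatch <script_name>`; the directives are plain comments for bash, so it can also be run directly to check its syntax.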
Merlin6 GPU partitions
Users might need to specify the Slurm partition. If no partition is specified, it will default to `gpu`:
#SBATCH --partition=<partition_name> # Possible <partition_name> values: gpu, gpu-short, gwendolen
The table below summarizes the partitions available to users:
GPU Partition | Default Time | Max Time | PriorityJobFactor* | PriorityTier** |
---|---|---|---|---|
gpu | 1 day | 1 week | 1 | 1 |
gpu-short | 2 hours | 2 hours | 1000 | 500 |
gwendolen | 1 hour | 12 hours | 1000 | 1000 |
*The PriorityJobFactor value is added to the job priority (PARTITION column in `sprio -l`). In other words, jobs sent to higher-priority partitions will usually run first (although other factors, such as job age and especially fair share, may affect that decision). For the GPU partitions, Slurm will also attempt to allocate jobs on higher-priority partitions before lower-priority ones.
**Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with a lower PriorityTier value and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.
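To illustrate how the Max Time column can guide the choice between the two public partitions, here is a small sketch. The helper `pick_partition` and the file name `job.sh` are hypothetical and not part of Slurm or of the cluster tooling; the sketch only encodes the 2 hour and 1 week limits from the table and ignores `gwendolen`, which requires a different account.

```shell
#!/bin/bash
# Hypothetical helper: choose a public GPU partition from the expected
# runtime in whole hours, using the Max Time column of the table above.
pick_partition() {
    local hours=$1
    if [ "$hours" -le 2 ]; then
        echo "gpu-short"          # max 2 hours, higher priority
    elif [ "$hours" -le 168 ]; then
        echo "gpu"                # max 1 week (7 * 24 = 168 hours)
    else
        echo "no public GPU partition allows ${hours}h" >&2
        return 1
    fi
}

# Example: a job expected to need at most 2 hours.
partition=$(pick_partition 2)
echo "sbatch --cluster=gmerlin6 --partition=${partition} --time=02:00:00 job.sh"
```

Requesting a realistic `--time` matters here: a short job sent to `gpu-short` benefits from the higher PriorityJobFactor and PriorityTier of that partition.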
Merlin6 GPU Accounts
Users need to ensure that the public `merlin` account is specified. If no account option is given, this account is used by default.
Specifying it explicitly is mostly relevant for users who have multiple Slurm accounts and might otherwise select a different account by mistake.
#SBATCH --account=merlin # Possible values: merlin, gwendolen
Not all accounts can be used on all partitions. The table below summarizes this:
Slurm Account | Slurm Partitions | Special QoS |
---|---|---|
merlin | gpu, gpu-short | |
gwendolen | gwendolen | gwendolen, gwendolen_public |
By default, all users belong to the `merlin` and `gwendolen` Slurm accounts.
Users only need to specify the `gwendolen` account when using the `gwendolen` partition; otherwise, specifying an account is not needed (it will always default to `merlin`). `gwendolen` is a special account with two different QoS granting different types of access (see details below).
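The pairing between partitions and accounts can be sketched as a small lookup. The function `account_for_partition` is a hypothetical helper written for this illustration; it only encodes the account table above and does not query Slurm.

```shell
#!/bin/bash
# Hypothetical helper: map a partition name to the account it requires,
# following the account table above.
account_for_partition() {
    case "$1" in
        gpu|gpu-short) echo "merlin" ;;
        gwendolen)     echo "gwendolen" ;;
        *)             echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Print the matching directives for a job aimed at the gwendolen partition.
partition="gwendolen"
account=$(account_for_partition "$partition")
echo "#SBATCH --partition=${partition}"
echo "#SBATCH --account=${account}"
```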
The 'gwendolen' account
For running jobs in the `gwendolen` partition, users must specify the `gwendolen` account. The `merlin` account is not allowed to use the `gwendolen` partition.
In addition, in Slurm there is the concept of QoS, which stands for Quality of Service. The `gwendolen` account has two different QoS configured:
- The QoS `gwendolen_public` is set by default for all Merlin users. It restricts the amount of resources that can be used on Gwendolen. For further information about the restrictions, please read the 'User and job limits' documentation.
- The QoS `gwendolen` provides full access to `gwendolen`; however, it is restricted to a set of users belonging to the `unx-gwendolen` Unix group.
Users don't need to specify any QoS; however, they need to be aware of the resource restrictions. If you belong to one of the projects allowed to use Gwendolen without restrictions, please request access to the `unx-gwendolen` Unix group through PSI Service Now.
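A quick way to guess which QoS applies is to check the Unix group membership from a shell. The sketch below is only a hint under that assumption: the helper `has_group` is hypothetical, and the QoS actually attached to a job is decided by the Slurm configuration, not by this check.

```shell
#!/bin/bash
# Return success if group $1 appears in the space-separated list $2
# (defaults to the current user's groups as reported by `id -nG`).
has_group() {
    local wanted=$1
    local group_list=${2:-$(id -nG)}
    local g
    for g in $group_list; do
        if [ "$g" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

if has_group unx-gwendolen; then
    qos_hint="full access (QoS gwendolen)"
else
    qos_hint="restricted access (QoS gwendolen_public)"
fi
echo "Jobs under --account=gwendolen would likely run with ${qos_hint}"
```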
Slurm GPU specific options
Some options are available when using GPUs. These are detailed here.
Number of GPUs and type
When using the GPU cluster, users must specify the number of GPUs they need to use:
#SBATCH --gpus=[<type>:]<number>
The GPU type is optional: if omitted, Slurm will try to allocate any type of GPU.
The available `[<type>:]` values and the `<number>` of GPUs depend on the node. This is detailed in the table below.
Nodes | GPU Type | #GPUs |
---|---|---|
merlin-g-[001] | geforce_gtx_1080 | 2 |
merlin-g-[002-005] | geforce_gtx_1080 | 4 |
merlin-g-[006-009] | geforce_gtx_1080_ti | 4 |
merlin-g-[010-013] | geforce_rtx_2080_ti | 4 |
merlin-g-014 | geforce_rtx_2080_ti | 8 |
merlin-g-100 | A100 | 8 |
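As an illustration, the following sketch requests two GPUs of one specific type from the table above. The time limit and the payload are assumptions chosen for the example. `CUDA_VISIBLE_DEVICES` is normally exported by Slurm for the allocated GPUs, and the `nvidia-smi` call is guarded so the script still runs on hosts without the NVIDIA tools.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=geforce_rtx_2080_ti:2   # Two GPUs of one specific type
#SBATCH --time=01:00:00                # Assumed runtime for this example

# Slurm normally exports CUDA_VISIBLE_DEVICES for the allocated GPUs;
# outside a job the variable is usually unset, so fall back to "none".
visible="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${visible}"

# List the devices when nvidia-smi is available on the node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || true
fi
```

Dropping the `geforce_rtx_2080_ti:` prefix (that is, `--gpus=2`) would let Slurm pick any GPU type that is free.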
Constraint / Features
Instead of specifying the GPU type, users sometimes need to select GPUs by the amount of memory available on the GPU card itself.
This has been defined in Slurm with Features: tags that describe the GPU memory of the different GPU cards.
Users can specify which GPU memory size to use with the `--constraint` option. In that case, there is often no need to specify `[<type>:]` in the `--gpus` option.
#SBATCH --constraint=<Feature> # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
The table below shows the available Features and which GPU card models and GPU nodes they belong to:
Nodes | GPU Type | Feature |
---|---|---|
merlin-g-[001-005] | `geforce_gtx_1080` | `gpumem_8gb` |
merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
merlin-g-100 | `A100` | `gpumem_40gb` |
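A sketch of a job that selects GPUs by memory size rather than by model could look as follows. The partition and the payload are assumptions for the example; the `nvidia-smi` query is guarded so that the script also runs where the NVIDIA tools are missing.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=1                    # No [<type>:] prefix needed here
#SBATCH --constraint=gpumem_11gb    # Any node tagged with 11 GB GPU cards

# Report model and memory of the allocated card when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_info=$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader || true)
else
    gpu_info="nvidia-smi not found on this host"
fi
echo "${gpu_info}"
```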
Other GPU options
Alternative Slurm options for GPU-based jobs are available. Please refer to the man pages of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
Please note that once `[<type>:]` is defined in one option, all other options must use it too!
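The consistency rule for `[<type>:]` can be illustrated with a sketch that combines several of these options. The concrete values (two tasks, 4 CPUs and 16G per GPU) are assumptions chosen to stay within the node sizes listed earlier. `SLURM_NTASKS` and `SLURM_CPUS_PER_GPU` are variables Slurm normally sets inside a job when the matching options are given.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --ntasks=2
#SBATCH --gpus=geforce_gtx_1080_ti:2            # Type given here ...
#SBATCH --gpus-per-task=geforce_gtx_1080_ti:1   # ... so it is repeated here
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=16G

# Report what Slurm granted; outside a job the variables are unset.
tasks="${SLURM_NTASKS:-n/a}"
cpus_per_gpu="${SLURM_CPUS_PER_GPU:-n/a}"
echo "tasks=${tasks} cpus_per_gpu=${cpus_per_gpu}"
```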
Dealing with Hyper-Threading
The `gmerlin6` cluster contains the partition `gwendolen`, which has a node with Hyper-Threading enabled.
In that case, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will generally use it (exceptions apply). For this machine, using HT is generally recommended.
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
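Put together, a job on the `gwendolen` partition with Hyper-Threading enabled might be sketched as below. The GPU and CPU counts are assumptions picked to stay within the public per-job limits described in the limits section, and the payload only reports how many processing units the job can use.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen
#SBATCH --gpus=A100:2            # Assumed request: 2 GPUs
#SBATCH --cpus-per-gpu=16        # Assumed request: 2 x 16 = 32 CPUs
#SBATCH --hint=multithread       # Use the extra hardware threads

# nproc reports the processing units available to this process;
# fall back to 1 on systems where nproc is missing.
cpus=$(nproc 2>/dev/null || echo 1)
echo "CPUs usable by this job: ${cpus}"
```

Switching the last directive to `--hint=nomultithread` and comparing the reported number is a simple way to see the effect of the setting.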
User and job limits
The GPU cluster enforces some basic user and job limits to ensure that a single user cannot monopolize the resources and that the cluster is used fairly. The limits are described below.
Per job limits
These are limits applying to a single job. In other words, there is a maximum of resources a single job can use.
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
Partition | Slurm Account | Mon-Sun 0h-24h |
---|---|---|
gpu | merlin | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
gpu-short | merlin | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
gwendolen | gwendolen | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
gwendolen | gwendolen | gwendolen(No limits, full access granted) |
- With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account (the default account) cannot use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory. Any job exceeding these limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerJob`. Because no other QoS temporarily overrides the job limits during the week (as happens, for instance, in the CPU daily partition), such a job needs to be cancelled, and the requested resources must be adapted to the limits above.
- The gwendolen partition is a special partition with an NVIDIA DGX A100 machine. Public access is possible through the `gwendolen` account; however, this is limited to 2 GPUs per job, 32 CPUs and 121875MB of memory. For full access, the `gwendolen` account with the `gwendolen` QoS (Quality of Service) is needed, and this is restricted to a set of users (belonging to the `unx-gwendolen` Unix group). Any other user will have by default the QoS `gwendolen_public`, which restricts resources in Gwendolen.
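Before submitting, a request can be compared against the per-job limits of the public partitions. The sketch below hard-codes the `gpu_week` values from the table above; the function `check_job_request` is a hypothetical helper, and the reason names simply follow the `QOSMax[Cpu|GRES|Mem]PerJob` pattern quoted in the text.

```shell
#!/bin/bash
# Per-job limits of the gpu_week QoS, copied from the table above.
MAX_CPUS=40
MAX_GPUS=8
MAX_MEM_GB=200

# Print the first limit a request would exceed, or "ok".
# Arguments: CPUs, GPUs, memory in GB (whole numbers).
check_job_request() {
    local cpus=$1 gpus=$2 mem_gb=$3
    if [ "$cpus" -gt "$MAX_CPUS" ]; then
        echo "QOSMaxCpuPerJob"
    elif [ "$gpus" -gt "$MAX_GPUS" ]; then
        echo "QOSMaxGRESPerJob"
    elif [ "$mem_gb" -gt "$MAX_MEM_GB" ]; then
        echo "QOSMaxMemPerJob"
    else
        echo "ok"
    fi
}

check_job_request 20 4 100    # prints: ok
check_job_request 48 4 100    # prints: QOSMaxCpuPerJob
```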
Per user limits for GPU partitions
These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use.
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
Partition | Slurm Account | Mon-Sun 0h-24h |
---|---|---|
gpu | merlin | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
gpu-short | merlin | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
gwendolen | gwendolen | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
gwendolen | gwendolen | gwendolen(No limits, full access granted) |
- With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory. Jobs sent by a user already at these limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerUser`. In that case, the jobs wait in the queue until some of the user's running resources are freed.
- Notice that user limits are wider than job limits. In that way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days; otherwise, it would block other users needing the same type of GPU.
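The arithmetic behind "two 8-GPU jobs or four 4-GPU jobs" can be sketched with integer division over the per-user limits of the public partitions (80 CPUs, 16 GPUs, 400GB). The function `max_concurrent_jobs` is a hypothetical helper; it assumes identical jobs and positive whole-number arguments.

```shell
#!/bin/bash
# Number of identical jobs one user can run at once in the public GPU
# partitions, given per-job CPUs, GPUs and memory in GB.
# The per-user limits (80 CPUs, 16 GPUs, 400 GB) come from the table above.
max_concurrent_jobs() {
    local cpus=$1 gpus=$2 mem_gb=$3
    local by_cpu=$(( 80 / cpus ))
    local by_gpu=$(( 16 / gpus ))
    local by_mem=$(( 400 / mem_gb ))
    local n=$by_cpu
    if [ "$by_gpu" -lt "$n" ]; then n=$by_gpu; fi
    if [ "$by_mem" -lt "$n" ]; then n=$by_mem; fi
    echo "$n"
}

max_concurrent_jobs 40 8 200   # prints: 2
max_concurrent_jobs 10 4 50    # prints: 4
```

The smallest of the three quotients wins, so a job that is light on GPUs can still be capped by its CPU or memory request.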
Advanced Slurm configuration
Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-clustered configuration, which allows multiple clusters to be integrated in the same batch system.
To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:
- `/etc/slurm/slurm.conf` - can be found on the login nodes and computing nodes.
- `/etc/slurm/gres.conf` - can be found on the GPU nodes; it is also propagated to login nodes and computing nodes for user read access.
- `/etc/slurm/cgroup.conf` - can be found on the computing nodes; it is also propagated to login nodes for user read access.
The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on any of the merlin5 or gmerlin6 computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
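When logged in to a gmerlin6 computing node during an active allocation, the files can be skimmed with a small helper that hides comments and blank lines. This is only a convenience sketch: `show_active_lines` is a hypothetical function, and on hosts where a file is missing or unreadable it just prints a notice.

```shell
#!/bin/bash
# Print the active (non-comment, non-blank) lines of a config file,
# or a short notice when the file is not readable on this node.
show_active_lines() {
    local file=$1
    if [ -r "$file" ]; then
        grep -Ev '^[[:space:]]*(#|$)' "$file" || true
    else
        echo "not readable here: $file"
    fi
}

for conf in /etc/slurm/slurm.conf /etc/slurm/gres.conf /etc/slurm/cgroup.conf; do
    echo "== $conf =="
    show_active_lines "$conf"
done
```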