title | keywords | last_updated | summary | sidebar | permalink |
---|---|---|---|---|---|
Slurm cluster 'gmerlin6' | configuration, partitions, node definition, gmerlin6 | 29 January 2021 | This document describes a summary of the Slurm configuration. | merlin6_sidebar | /gmerlin6/slurm-configuration.html |
This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster.
Merlin6 GPU nodes definition
The table below summarizes the hardware setup of the different GPU nodes:
Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs |
---|---|---|---|---|---|---|---|---|---|---|
merlin-g-[001] | 1 core | 8 cores | 1 | 5120 | 102400 | 102400 | 10000 | geforce_gtx_1080 | 1 | 2 |
merlin-g-[002-005] | 1 core | 20 cores | 1 | 5120 | 102400 | 102400 | 10000 | geforce_gtx_1080 | 1 | 4 |
merlin-g-[006-009] | 1 core | 20 cores | 1 | 5120 | 102400 | 102400 | 10000 | geforce_gtx_1080_ti | 1 | 4 |
merlin-g-[010-013] | 1 core | 20 cores | 1 | 5120 | 102400 | 102400 | 10000 | geforce_rtx_2080_ti | 1 | 4 |
merlin-g-014 | 1 core | 48 cores | 1 | 5120 | 360448 | 360448 | 10000 | geforce_rtx_2080_ti | 1 | 8 |
merlin-g-015 | 1 core | 48 cores | 1 | 5120 | 360448 | 360448 | 10000 | A5000 | 1 | 8 |
merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | A100 | 1 | 8 |
{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' and '/etc/slurm/slurm.conf' for changes in the GPU type and details of the hardware. {{site.data.alerts.end}}
Running jobs in the 'gmerlin6' cluster
In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.
Merlin6 GPU cluster
To run jobs in the `gmerlin6` cluster, users must specify the cluster name in Slurm:
#SBATCH --cluster=gmerlin6
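As an illustration, a minimal batch script targeting the GPU cluster could look like the sketch below. Only the `--cluster` line comes from this page; the GPU request, run time and job name are assumptions chosen for the example. The fallbacks on the Slurm environment variables only serve to keep the script harmless when it is executed outside an allocation.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6     # Target the GPU cluster
#SBATCH --gpus=1               # Assumption: one GPU of any type
#SBATCH --time=00:10:00        # Assumption: illustrative run time
#SBATCH --job-name=gpu-hello   # Hypothetical job name

# SLURM_JOB_ID and SLURM_CLUSTER_NAME are exported by Slurm inside a job.
# Outside a job they are unset, so fall back to placeholder values.
job_id="${SLURM_JOB_ID:-none}"
cluster="${SLURM_CLUSTER_NAME:-unknown}"

echo "job=${job_id} cluster=${cluster} host=$(uname -n)"
```

Since no partition is given, the job would go to the default partition described in the next section.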
Merlin6 GPU partitions
Users might need to specify the Slurm partition. If no partition is specified, it will default to `gpu`:
#SBATCH --partition=<partition_name> # Possible <partition_name> values: gpu, gpu-short, gwendolen
The table below summarizes all partitions available to users:
GPU Partition | Default Time | Max Time | PriorityJobFactor* | PriorityTier** |
---|---|---|---|---|
`gpu` | 1 day | 1 week | 1 | 1 |
`gpu-short` | 2 hours | 2 hours | 1000 | 500 |
`gwendolen` | 30 minutes | 2 hours | 1000 | 1000 |
`gwendolen-long` *** | 30 minutes | 8 hours | 1 | 1 |
*The PriorityJobFactor value is added to the job priority (PARTITION column in `sprio -l`). In other words, jobs sent to higher-priority partitions will usually run first (however, other factors such as job age or, mainly, fair share might affect that decision). For the GPU partitions, Slurm will also first attempt to allocate jobs on partitions with higher priority before partitions with lower priority.
**Jobs submitted to a partition with a higher PriorityTier value are dispatched before pending jobs in partitions with a lower PriorityTier value and, if possible, they preempt running jobs from partitions with lower PriorityTier values.
***gwendolen-long is a special partition that is enabled during non-working hours only. As of Nov 2023, the current policy is to disable this partition from Mon to Fri, from 1am to 5pm. Jobs can be submitted at any time, but they can only be scheduled outside this time range.
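The partition settings above can be cross-checked against the live configuration. The following is a small sketch, assuming the standard Slurm client tools (`sinfo`, `scontrol`) are in the `PATH`, as they normally are on a login node; the function name `show_partitions` is hypothetical.

```shell
#!/bin/bash
cluster="gmerlin6"

# Hypothetical helper: print the partition time limits and one partition record.
show_partitions() {
    if command -v sinfo >/dev/null 2>&1 && command -v scontrol >/dev/null 2>&1; then
        # %P = partition name, %L = default time limit, %l = maximum time limit
        sinfo --clusters="$cluster" --format="%P %L %l"
        # The partition record includes PriorityJobFactor and PriorityTier
        scontrol --clusters="$cluster" show partition gpu-short
    else
        echo "Slurm client tools not found; run this on a login node"
    fi
}

show_partitions
```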
Merlin6 GPU Accounts
Users need to ensure that the public `merlin` account is specified. If no account option is specified, jobs default to this account.
This is mostly relevant for users with multiple Slurm accounts, who might specify a different account by mistake.
#SBATCH --account=merlin # Possible values: merlin, gwendolen
Not all accounts can be used on all partitions. This is summarized in the table below:
Slurm Account | Slurm Partitions |
---|---|
`merlin` | `gpu`, `gpu-short` |
`gwendolen` | `gwendolen`, `gwendolen-long` |
By default, all users belong to the `merlin` Slurm account, and jobs are submitted to the `gpu` partition when no partition is defined.
Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen-long` partitions; otherwise, specifying an account is not needed (it always defaults to `merlin`).
The 'gwendolen' account
To run jobs in the `gwendolen`/`gwendolen-long` partitions, users must specify the `gwendolen` account.
The `merlin` account is not allowed to use the Gwendolen partitions.
Gwendolen is restricted to users belonging to the `unx-gwendolen` Unix group. If you belong to a project allowed to use Gwendolen, or you are a user who would like to have access to it, please request access to the `unx-gwendolen` Unix group through PSI Service Now: the request will be redirected to the person responsible for the project (Andreas Adelmann).
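A job for the Gwendolen partitions therefore needs both the partition and the account. The sketch below combines them and adds a membership check for the `unx-gwendolen` Unix group; the single-GPU request is an assumption made only for the example.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen    # Required for gwendolen and gwendolen-long
#SBATCH --gpus=1               # Assumption: illustrative GPU request

# Check whether the current user belongs to the Unix group that grants access.
required_group="unx-gwendolen"
if id -Gn | tr ' ' '\n' | grep -qxF "$required_group"; then
    access="yes"
else
    access="no"
fi

echo "member of ${required_group}: ${access}"
```

Running the check on a login node before submitting can help tell a missing group membership apart from other submission problems.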
Slurm GPU specific options
Some options are available when using GPUs. These are detailed here.
Number of GPUs and type
When using the GPU cluster, users must specify the number of GPUs they need to use:
#SBATCH --gpus=[<type>:]<number>
The GPU type is optional: if omitted, Slurm will try to allocate any type of GPU.
The available `[<type>:]` values and the `<number>` of GPUs depend on the node, as detailed in the table below.
Nodes | GPU Type | #GPUs |
---|---|---|
merlin-g-[001] | geforce_gtx_1080 | 2 |
merlin-g-[002-005] | geforce_gtx_1080 | 4 |
merlin-g-[006-009] | geforce_gtx_1080_ti | 4 |
merlin-g-[010-013] | geforce_rtx_2080_ti | 4 |
merlin-g-014 | geforce_rtx_2080_ti | 8 |
merlin-g-015 | A5000 | 8 |
merlin-g-100 | A100 | 8 |
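As an example of the `--gpus=[<type>:]<number>` syntax, the sketch below requests two GPUs of one type from the table and counts the devices exposed to the job. It assumes that Slurm exports `CUDA_VISIBLE_DEVICES` for the allocated GPUs, which is the usual behaviour but should be verified on the cluster; `count_gpus` is a hypothetical helper.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=geforce_rtx_2080_ti:2   # Two GPUs of one specific type
#SBATCH --time=01:00:00                # Assumption: illustrative run time

# Hypothetical helper: count the entries of a comma-separated device list.
count_gpus() {
    list="$1"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    echo "$list" | tr ',' '\n' | grep -c .
}

# Empty outside a GPU allocation, so the count falls back to 0 there.
gpu_count=$(count_gpus "${CUDA_VISIBLE_DEVICES:-}")
echo "GPUs visible to this job: ${gpu_count}"
```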
Constraint / Features
Instead of specifying the GPU type, users sometimes need to select GPUs by the amount of memory available on the GPU card itself.
This has been defined in Slurm with Features: tags that identify the GPU memory of the different GPU cards.
Users can specify which GPU memory size is needed with the `--constraint` option. In that case, there is often no need to specify `[<type>:]` in the `--gpus` option.
#SBATCH --constraint=<Feature> # Possible values: gpumem_8gb, gpumem_11gb, gpumem_24gb, gpumem_40gb
The table below shows the available Features and which GPU card models and GPU nodes they belong to:
Nodes | GPU Type | Feature |
---|---|---|
merlin-g-[001-005] | `geforce_gtx_1080` | `gpumem_8gb` |
merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
merlin-g-015 | `A5000` | `gpumem_24gb` |
merlin-g-100 | `A100` | `gpumem_40gb` |
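When a job only knows how much GPU memory it needs, the smallest matching Feature can be picked from the sizes listed above. The helper below is a sketch: `pick_feature` and `job.sh` are hypothetical names, and the sizes are taken from the table. Note that `--constraint` selects nodes carrying exactly that Feature, so a larger card is not chosen automatically.

```shell
#!/bin/bash
# Hypothetical helper: print the smallest Feature offering at least the
# requested amount of GPU memory (in GB), using the sizes from the table.
pick_feature() {
    need_gb="$1"
    for size in 8 11 24 40; do
        if [ "$need_gb" -le "$size" ]; then
            echo "gpumem_${size}gb"
            return 0
        fi
    done
    echo "no Feature offers ${need_gb} GB of GPU memory" >&2
    return 1
}

feature=$(pick_feature 10)

# Print the resulting submission command (job.sh is a placeholder script name).
echo "sbatch --clusters=gmerlin6 --gpus=1 --constraint=${feature} job.sh"
```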
Other GPU options
Alternative Slurm options for GPU-based jobs are available. Please refer to the man pages of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
Please notice that once `[<type>:]` is defined in one option, all other GPU options must use it too!
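To show how some of these options combine, the sketch below runs two tasks with one GPU each. All values (GPU type, CPU and memory amounts, run time) are assumptions for the example rather than recommended settings. The `srun` step only runs inside an allocation; elsewhere the script just reports that nothing was launched.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --ntasks=2                              # Two tasks ...
#SBATCH --gpus-per-task=geforce_rtx_2080_ti:1   # ... with one GPU of this type each
#SBATCH --cpus-per-gpu=4                        # Assumption: illustrative value
#SBATCH --mem-per-gpu=16G                       # Assumption: illustrative value
#SBATCH --time=00:30:00                         # Assumption: illustrative value

if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    mode="slurm"
    # Each task prints its rank and the GPUs it was given.
    srun sh -c 'echo "task ${SLURM_PROCID}: GPUs ${CUDA_VISIBLE_DEVICES:-none}"'
else
    mode="dry-run"
    echo "not inside a Slurm allocation; nothing launched"
fi
```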
Dealing with Hyper-Threading
The `gmerlin6` cluster contains the partitions `gwendolen` and `gwendolen-long`, which have a node with Hyper-Threading enabled.
In that case, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will generally use it (exceptions apply). For this machine, HT is generally recommended.
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
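A quick way to see the effect of the hint is to print the number of logical CPUs the job can actually use. The sketch below assumes that CPU binding is enforced for the job, so that `nproc` reflects the allocation; the GPU and CPU counts are illustrative.

```shell
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen
#SBATCH --gpus=1               # Assumption: illustrative value
#SBATCH --cpus-per-gpu=4       # Assumption: illustrative value
#SBATCH --hint=multithread     # Switch to nomultithread to compare

# nproc reports the CPUs available to the current process; fall back to the
# number of online processors if nproc is not installed.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "logical CPUs usable by this process: ${cpus}"
```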
User and job limits
The GPU cluster enforces some basic user and job limits to ensure that a single user cannot monopolize the resources and that the cluster is used fairly. The limits are described below.
Per job limits
These are limits applying to a single job. In other words, there is a maximum of resources a single job can use.
Limits are defined using QoS, which is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
Partition | Slurm Account | Mon-Sun 0h-24h |
---|---|---|
`gpu` | `merlin` | gpu_week(gres/gpu=8) |
`gpu-short` | `merlin` | gpu_week(gres/gpu=8) |
`gwendolen` | `gwendolen` | No limits |
`gwendolen-long` | `gwendolen` | No limits, active from 9pm to 5:30am |
- With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account (default account) cannot use more than 40 CPUs, more than 8 GPUs or more than 200GB. Any job exceeding such limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerJob`. Because no other QoS temporarily overrides the job limits during the week (as happens, for instance, in the CPU daily partition), the job needs to be cancelled and the requested resources adapted to the above limits.
- The `gwendolen` and `gwendolen-long` partitions are two special partitions for an NVIDIA DGX A100 machine. Only users belonging to the `unx-gwendolen` Unix group can run in these partitions. No limits are applied (machine resources can be completely used).
- The `gwendolen-long` partition is available 24h. However,
  - from 5:30am to 9pm the partition is `down` (jobs can be submitted, but cannot run until the partition is set to `active`).
  - from 9pm to 5:30am jobs are allowed to run (partition is set to `active`).
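To find out why a job is stuck, the QoS limits and the pending reason can be inspected from the command line. The following is a sketch, assuming the Slurm client tools are available; `show_limits` is a hypothetical helper and the chosen output columns are just one reasonable selection.

```shell
#!/bin/bash
cluster="gmerlin6"

# Hypothetical helper: print the QoS limits and the current user's jobs.
show_limits() {
    if command -v sacctmgr >/dev/null 2>&1 && command -v squeue >/dev/null 2>&1; then
        # MaxTRES is the per-job limit, MaxTRESPerUser the per-user limit
        sacctmgr show qos format=Name,MaxTRES,MaxTRESPerUser
        # %i = job ID, %P = partition, %T = state, %r = reason (e.g. a QOSMax... message)
        squeue --clusters="$cluster" -u "$(id -un)" --format="%i %P %T %r"
    else
        echo "Slurm client tools not found; run this on a login node"
    fi
}

show_limits
```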
Per user limits for GPU partitions
These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use.
Limits are defined using QoS, which is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
Partition | Slurm Account | Mon-Sun 0h-24h |
---|---|---|
`gpu` | `merlin` | gpu_week(gres/gpu=16) |
`gpu-short` | `merlin` | gpu_week(gres/gpu=16) |
`gwendolen` | `gwendolen` | No limits |
`gwendolen-long` | `gwendolen` | No limits, active from 9pm to 5:30am |
- With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB. Jobs sent by any user already exceeding such limits will stay in the queue with the message `QOSMax[Cpu|GRES|Mem]PerUser`. In that case, jobs wait in the queue until some of the running resources are freed.
- Notice that user limits are wider than job limits. In that way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days; otherwise, other users needing the same type of GPU would be blocked.
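The relation between the two limits is simple integer arithmetic, sketched below for the GPU limits only (the CPU and memory limits apply in the same way and may be reached first). `max_jobs` is a hypothetical helper.

```shell
#!/bin/bash
user_gpu_limit=16   # Per-user GPU limit in the gpu and gpu-short partitions
job_gpu_limit=8     # Per-job GPU limit in the same partitions

# Hypothetical helper: number of jobs of a given GPU size that can run at once.
max_jobs() {
    gpus_per_job="$1"
    if [ "$gpus_per_job" -gt "$job_gpu_limit" ]; then
        echo 0   # Such a job exceeds the per-job limit and stays pending
        return
    fi
    echo $(( user_gpu_limit / gpus_per_job ))
}

for n in 8 4 2 1; do
    echo "${n} GPUs per job: up to $(max_jobs "$n") concurrent jobs"
done
```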
Advanced Slurm configuration
Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-cluster configuration, which allows multiple clusters to be integrated in the same batch system.
To understand the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:
- `/etc/slurm/slurm.conf` - can be found on the login nodes and computing nodes.
- `/etc/slurm/gres.conf` - can be found on the GPU nodes; it is also propagated to login nodes and computing nodes for user read access.
- `/etc/slurm/cgroup.conf` - can be found on the computing nodes; it is also propagated to login nodes for user read access.
The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on any of the merlin5 or gmerlin6 computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
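As a complement to reading the files on a compute node, part of the same information can be queried from the cluster controller. This is a sketch under the assumption that `scontrol` on the login nodes can reach the gmerlin6 controller; it does not replace checking the files themselves, and `show_gpu_node` is a hypothetical helper.

```shell
#!/bin/bash
cluster="gmerlin6"
node="merlin-g-001"   # Any node name from the hardware table

# Hypothetical helper: print the node record, which includes the Gres and
# AvailableFeatures fields as seen by the controller.
show_gpu_node() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol --clusters="$cluster" show node "$node"
    else
        echo "scontrol not found; run this on a login node"
    fi
}

show_gpu_node
```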