---
title: Slurm cluster 'gmerlin6' configuration
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes a summary of the Slurm 'gmerlin6' configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type | Def.#GPUs | Max.#GPUs |
|---|---|---|---|---|---|---|---|---|---|---|
| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080` | 1 | 2 |
| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080` | 1 | 4 |
| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080_ti` | 1 | 4 |
| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_rtx_2080_ti` | 1 | 4 |
| merlin-g-014 | 1 core | 48 cores | 1 | 4000 | 360448 | 360448 | 10000 | `geforce_rtx_2080_ti` | 1 | 8 |
| merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | `A100` | 1 | 8 |

{{site.data.alerts.tip}}Always check `/etc/slurm/gres.conf` and `/etc/slurm/slurm.conf` for changes in the GPU types and details of the hardware.{{site.data.alerts.end}}
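
Alternatively, node definitions (including the GPU GRES entries) can be queried from Slurm itself; a minimal sketch using standard Slurm commands:

```bash
# Show the full definition of one GPU node, including its 'Gres=gpu:...' line
scontrol --clusters=gmerlin6 show node merlin-g-100

# List all nodes together with their generic resources (GRES)
sinfo --clusters=gmerlin6 --Node --Format=nodelist,gres
```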

## Running jobs in the 'gmerlin6' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the gmerlin6 cluster, users must specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it defaults to `gpu`:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

The table below summarizes the partitions available to users:

| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|---|---|---|---|---|
| `gpu` | 1 day | 1 week | 1 | 1 |
| `gpu-short` | 2 hours | 2 hours | 1000 | 500 |
| `gwendolen` | 1 hour | 12 hours | 1000 | 1000 |

\*The PriorityJobFactor value is added to the job priority (PARTITION column in `sprio -l`). In other words, jobs submitted to higher priority partitions will usually run first (although other factors, such as job age or, mainly, fair share, may affect that decision). For the GPU partitions, Slurm will also try to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with lower PriorityTier values and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.
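
To see how these factors contribute to the priority of your own pending jobs, the priority components can be listed per job; for example:

```bash
# Show the priority breakdown (including the PARTITION column) of pending jobs
sprio --clusters=gmerlin6 -l
```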

### Merlin6 GPU Accounts

Users should ensure that the public `merlin` account is specified; if no account option is given, jobs default to this account anyway. This is mostly relevant for users who belong to multiple Slurm accounts and might mistakenly specify a different one.

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen_public, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account | Slurm Partitions |
|---|---|
| `merlin` | `gpu`, `gpu-short` |
| `gwendolen_public` | `gwendolen` |
| `gwendolen` | `gwendolen` |

By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts. `gwendolen` is a restricted account.
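
Putting the above together, a minimal job script for the public GPU partitions could look as follows (the job name and the `nvidia-smi` payload are purely illustrative):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6   # Always specify the GPU cluster
#SBATCH --partition=gpu      # Public partition: gpu or gpu-short
#SBATCH --account=merlin     # Public account for gpu/gpu-short
#SBATCH --gpus=1             # Number of GPUs (optionally typed, see below)
#SBATCH --time=01:00:00      # Requested walltime
#SBATCH --job-name=gpu-test  # Illustrative job name

# Illustrative payload: show the GPU(s) allocated to this job
nvidia-smi
```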

#### The 'gwendolen' accounts

To run jobs in the `gwendolen` partition, users must specify either the `gwendolen_public` or the `gwendolen` account. The `merlin` account is not allowed to use the `gwendolen` partition.

* The `gwendolen_public` account can be used by any Merlin user, and provides restricted access to the resources of gwendolen.
* The `gwendolen` account is restricted to a set of users, and provides full access to gwendolen.

## Slurm GPU specific options

Some options are available when using GPUs. These are detailed here.

### Number of GPUs and type

When using the GPU cluster, users must specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if omitted, Slurm will try to allocate any type of GPU. The possible `[<type>:]` values and the maximum `<number>` of GPUs depend on the node, as detailed in the table below.

| Nodes | GPU Type | #GPUs |
|---|---|---|
| merlin-g-[001] | `geforce_gtx_1080` | 2 |
| merlin-g-[002-005] | `geforce_gtx_1080` | 4 |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | 4 |
| merlin-g-[010-013] | `geforce_rtx_2080_ti` | 4 |
| merlin-g-014 | `geforce_rtx_2080_ti` | 8 |
| merlin-g-100 | `A100` | 8 |
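
For example, to request two RTX 2080 Ti cards from the table above:

```bash
#SBATCH --gpus=geforce_rtx_2080_ti:2  # Two GPUs of one specific type
```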

### Constraint / Features

Instead of specifying the GPU type, users may need to select a GPU by the amount of memory available on the card itself. This is defined in Slurm through **Features**: tags that encode the GPU memory size of the different card models. Users can select the required GPU memory size with the `--constraint` option. In that case, notice that there is usually no need to specify `[<type>:]` in the `--gpus` option.

```bash
#SBATCH --constraint=<Feature>    # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```

The table below shows the available Features and which GPU card models and GPU nodes they belong to:

| Nodes | GPU Type | Feature |
|---|---|---|
| merlin-g-[001-005] | `geforce_gtx_1080` | `gpumem_8gb` |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
| merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
| merlin-g-100 | `A100` | `gpumem_40gb` |
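
For example, to request any two GPUs with 11GB of on-board memory (GTX 1080 Ti or RTX 2080 Ti cards), without fixing the type:

```bash
#SBATCH --gpus=2                  # Any GPU type...
#SBATCH --constraint=gpumem_11gb  # ...as long as the card has 11GB of memory
```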

### Other GPU options

Alternative Slurm options for GPU-based jobs are available. Please refer to the man pages of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`). The most common options are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please notice that once `[<type>:]` is specified in one option, all other GPU options must specify it too!
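
For instance, a hypothetical two-node job repeating the type in every GPU option:

```bash
#SBATCH --nodes=2
#SBATCH --gpus=geforce_rtx_2080_ti:8           # Total GPUs, with type...
#SBATCH --gpus-per-node=geforce_rtx_2080_ti:4  # ...so per-node GPUs must repeat it
```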

### Dealing with Hyper-Threading

The gmerlin6 cluster contains the partition `gwendolen`, whose node has Hyper-Threading enabled. On this node, one should always specify whether or not to use Hyper-Threading: if not defined, Slurm will generally use it (exceptions apply). For this machine, HT is generally recommended.

```bash
#SBATCH --hint=multithread            # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread          # Don't use extra threads with in-core multi-threading.
```
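
As an example, a minimal sketch of a gwendolen job enabling Hyper-Threading (the `nvidia-smi` payload is purely illustrative):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen_public
#SBATCH --gpus=A100:2        # Up to 2 GPUs per job with the public account
#SBATCH --hint=multithread   # Use Hyper-Threading (recommended on this machine)

nvidia-smi
```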

## User and job limits

The GPU cluster enforces some basic user and job limits to ensure that a single user can not abuse the resources and to guarantee a fair usage of the cluster. The limits are described below.

### Per job limits

These limits apply to individual jobs; in other words, they define the maximum amount of resources a single job can request. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|---|---|---|
| `gpu` | `merlin` | `gpu_week(cpu=40,gres/gpu=8,mem=200G)` |
| `gpu-short` | `merlin` | `gpu_week(cpu=40,gres/gpu=8,mem=200G)` |
| `gwendolen` | `gwendolen_public` | `gwendolen_public(cpu=32,gres/gpu=2,mem=121875M)` |
| `gwendolen` | `gwendolen` | No limits, full access granted |
* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account (the default account) can not use more than 40 CPUs, 8 GPUs or 200GB of memory. Any job exceeding these limits will stay in the queue with the message QOSMax[Cpu|GRES|Mem]PerJob. Since no other QoS temporarily overrides the job limits during the week (as happens, for instance, in the CPU `daily` partition), such a job needs to be cancelled and resubmitted with resource requests that fit the limits above.

* The `gwendolen` partition is a special partition with a NVIDIA DGX A100 machine. Public access is possible through the `gwendolen_public` account, but it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job. For full access, the `gwendolen` account is needed, which is restricted to a set of users.
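
The limits currently in place can be inspected directly from Slurm; for example:

```bash
# List the QoS with their per-job and per-user resource limits
sacctmgr show qos format=Name,MaxTRESPerJob,MaxTRESPerUser
```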

### Per user limits for GPU partitions

These limits apply exclusively to users; in other words, they define the maximum amount of resources a single user can use in total. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|---|---|---|
| `gpu` | `merlin` | `gpu_week(cpu=80,gres/gpu=16,mem=400G)` |
| `gpu-short` | `merlin` | `gpu_week(cpu=80,gres/gpu=16,mem=400G)` |
| `gwendolen` | `gwendolen_public` | `gwendolen_public(cpu=64,gres/gpu=4,mem=243750M)` |
| `gwendolen` | `gwendolen` | No limits, full access granted |
* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, 16 GPUs or 400GB of memory in total. Jobs submitted by a user already exceeding these limits will stay in the queue with the message QOSMax[Cpu|GRES|Mem]PerUser. In this case, the jobs can simply wait in the queue until some of the user's running resources are freed.

* Notice that the user limits are twice the job limits. In this way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing that type of GPU.

## Advanced Slurm configuration

Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-cluster configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:

* `/etc/slurm/slurm.conf` - can be found on the login nodes and computing nodes.
* `/etc/slurm/gres.conf` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* `/etc/slurm/cgroup.conf` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on the merlin5 or gmerlin6 computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
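
One way to inspect the gmerlin6 configuration files without an interactive session is to read them from within a short job; a minimal sketch:

```bash
# Print the GPU GRES definitions of a gmerlin6 computing node
srun --clusters=gmerlin6 --partition=gpu-short --account=merlin \
     cat /etc/slurm/gres.conf
```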