---
title: Slurm cluster 'gmerlin6' configuration
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes a summary of the Slurm 'gmerlin6' configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type | Def.#GPUs | Max.#GPUs |
|---|---|---|---|---|---|---|---|---|---|---|
| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080` | 1 | 2 |
| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080` | 1 | 4 |
| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_gtx_1080_ti` | 1 | 4 |
| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | `geforce_rtx_2080_ti` | 1 | 4 |
| merlin-g-014 | 1 core | 48 cores | 1 | 4000 | 360448 | 360448 | 10000 | `geforce_rtx_2080_ti` | 1 | 8 |
| merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | `A100` | 1 | 8 |

{{site.data.alerts.tip}}Always check `/etc/slurm/gres.conf` and `/etc/slurm/slurm.conf` for changes in the GPU types and details of the hardware.{{site.data.alerts.end}}
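
Alternatively, node definitions (including the GPU GRES entries) can be queried from Slurm itself; a minimal sketch using standard Slurm commands:

```bash
# Show the full definition of one GPU node, including its 'Gres=gpu:...' line
scontrol --clusters=gmerlin6 show node merlin-g-100

# List all nodes together with their generic resources (GRES)
sinfo --clusters=gmerlin6 --Node --Format=nodelist,gres
```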

## Running jobs in the 'gmerlin6' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the gmerlin6 cluster, users must specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it defaults to `gpu`:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

The table below summarizes the partitions available to users:

| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|---|---|---|---|---|
| `gpu` | 1 day | 1 week | 1 | 1 |
| `gpu-short` | 2 hours | 2 hours | 1000 | 500 |
| `gwendolen` | 1 hour | 12 hours | 1000 | 1000 |

\*The PriorityJobFactor value is added to the job priority (PARTITION column in `sprio -l`). In other words, jobs submitted to higher priority partitions will usually run first (although other factors, such as job age or, mainly, fair share, may affect that decision). For the GPU partitions, Slurm will also try to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with lower PriorityTier values and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.
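
To see how these factors contribute to the priority of your own pending jobs, the priority components can be listed per job; for example:

```bash
# Show the priority breakdown (including the PARTITION column) of pending jobs
sprio --clusters=gmerlin6 -l
```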

### Merlin6 GPU Accounts

Users should ensure that the public `merlin` account is specified; if no account option is given, jobs default to this account anyway. This is mostly relevant for users who belong to multiple Slurm accounts and might mistakenly specify a different one.

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen_public, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account | Slurm Partitions |
|---|---|
| `merlin` | `gpu`, `gpu-short` |
| `gwendolen_public` | `gwendolen` |
| `gwendolen` | `gwendolen` |

By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts. `gwendolen` is a restricted account.
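
Putting the above together, a minimal job script for the public GPU partitions could look as follows (the job name and the `nvidia-smi` payload are purely illustrative):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6   # Always specify the GPU cluster
#SBATCH --partition=gpu      # Public partition: gpu or gpu-short
#SBATCH --account=merlin     # Public account for gpu/gpu-short
#SBATCH --gpus=1             # Number of GPUs (optionally typed, see below)
#SBATCH --time=01:00:00      # Requested walltime
#SBATCH --job-name=gpu-test  # Illustrative job name

# Illustrative payload: show the GPU(s) allocated to this job
nvidia-smi
```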

#### The 'gwendolen' accounts

To run jobs in the `gwendolen` partition, users must specify either the `gwendolen_public` or the `gwendolen` account. The `merlin` account is not allowed to use the `gwendolen` partition.

* The `gwendolen_public` account can be used by any Merlin user, and provides restricted access to the resources of gwendolen.
* The `gwendolen` account is restricted to a set of users, and provides full access to gwendolen.

## Slurm GPU specific options

Some options are available when using GPUs. These are detailed here.

### Number of GPUs and type

When using the GPU cluster, users must specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if omitted, Slurm will try to allocate any type of GPU. The possible `[<type>:]` values and the maximum `<number>` of GPUs depend on the node, as detailed in the table below.

| Nodes | GPU Type | #GPUs |
|---|---|---|
| merlin-g-[001] | `geforce_gtx_1080` | 2 |
| merlin-g-[002-005] | `geforce_gtx_1080` | 4 |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | 4 |
| merlin-g-[010-013] | `geforce_rtx_2080_ti` | 4 |
| merlin-g-014 | `geforce_rtx_2080_ti` | 8 |
| merlin-g-100 | `A100` | 8 |
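
For example, to request two RTX 2080 Ti cards from the table above:

```bash
#SBATCH --gpus=geforce_rtx_2080_ti:2  # Two GPUs of one specific type
```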

### Constraint / Features

Instead of specifying the GPU type, users may need to select a GPU by the amount of memory available on the card itself. This is defined in Slurm through **Features**: tags that encode the GPU memory size of the different card models. Users can select the required GPU memory size with the `--constraint` option. In that case, notice that there is usually no need to specify `[<type>:]` in the `--gpus` option.

```bash
#SBATCH --constraint=<Feature>    # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```

The table below shows the available Features and which GPU card models and GPU nodes they belong to:

| Nodes | GPU Type | Feature |
|---|---|---|
| merlin-g-[001-005] | `geforce_gtx_1080` | `gpumem_8gb` |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
| merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
| merlin-g-100 | `A100` | `gpumem_40gb` |
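
For example, to request any two GPUs with 11GB of on-board memory (GTX 1080 Ti or RTX 2080 Ti cards), without fixing the type:

```bash
#SBATCH --gpus=2                  # Any GPU type...
#SBATCH --constraint=gpumem_11gb  # ...as long as the card has 11GB of memory
```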

### Other GPU options

Alternative Slurm options for GPU-based jobs are available. Please refer to the man pages of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`). The most common options are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please notice that once `[<type>:]` is specified in one option, all other GPU options must specify it too!
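
For instance, a hypothetical two-node job repeating the type in every GPU option:

```bash
#SBATCH --nodes=2
#SBATCH --gpus=geforce_rtx_2080_ti:8           # Total GPUs, with type...
#SBATCH --gpus-per-node=geforce_rtx_2080_ti:4  # ...so per-node GPUs must repeat it
```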

### Dealing with Hyper-Threading

The gmerlin6 cluster contains the partition `gwendolen`, whose node has Hyper-Threading enabled. On this node, one should always specify whether or not to use Hyper-Threading: if not defined, Slurm will generally use it (exceptions apply). For this machine, HT is generally recommended.

```bash
#SBATCH --hint=multithread            # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread          # Don't use extra threads with in-core multi-threading.
```
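
As an example, a minimal sketch of a gwendolen job enabling Hyper-Threading (the `nvidia-smi` payload is purely illustrative):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen_public
#SBATCH --gpus=A100:2        # Up to 2 GPUs per job with the public account
#SBATCH --hint=multithread   # Use Hyper-Threading (recommended on this machine)

nvidia-smi
```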

## User and job limits

The GPU cluster enforces some basic user and job limits to ensure that a single user can not abuse the resources and to guarantee a fair usage of the cluster. The limits are described below.

### Per job limits

These limits apply to individual jobs; in other words, they define the maximum amount of resources a single job can request. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|---|---|---|
| `gpu` | `merlin` | `gpu_week(cpu=40,gres/gpu=8,mem=200G)` |
| `gpu-short` | `merlin` | `gpu_week(cpu=40,gres/gpu=8,mem=200G)` |
| `gwendolen` | `gwendolen_public` | `gwendolen_public(cpu=32,gres/gpu=2,mem=121875M)` |
| `gwendolen` | `gwendolen` | No limits, full access granted |
* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account (the default account) can not use more than 40 CPUs, 8 GPUs or 200GB of memory. Any job exceeding these limits will stay in the queue with the message QOSMax[Cpu|GRES|Mem]PerJob. Since no other QoS temporarily overrides the job limits during the week (as happens, for instance, in the CPU `daily` partition), such a job needs to be cancelled and resubmitted with resource requests that fit the limits above.

* The `gwendolen` partition is a special partition with a NVIDIA DGX A100 machine. Public access is possible through the `gwendolen_public` account, but it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job. For full access, the `gwendolen` account is needed, which is restricted to a set of users.
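
The limits currently in place can be inspected directly from Slurm; for example:

```bash
# List the QoS with their per-job and per-user resource limits
sacctmgr show qos format=Name,MaxTRESPerJob,MaxTRESPerUser
```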

### Per user limits for GPU partitions

These limits apply exclusively to users; in other words, they define the maximum amount of resources a single user can use in total. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|---|---|---|
| `gpu` | `merlin` | `gpu_week(cpu=80,gres/gpu=16,mem=400G)` |
| `gpu-short` | `merlin` | `gpu_week(cpu=80,gres/gpu=16,mem=400G)` |
| `gwendolen` | `gwendolen_public` | `gwendolen_public(cpu=64,gres/gpu=4,mem=243750M)` |
| `gwendolen` | `gwendolen` | No limits, full access granted |
* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, 16 GPUs or 400GB of memory in total. Jobs submitted by a user already exceeding these limits will stay in the queue with the message QOSMax[Cpu|GRES|Mem]PerUser. In this case, the jobs can simply wait in the queue until some of the user's running resources are freed.

* Notice that the user limits are twice the job limits. In this way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing that type of GPU.

## Advanced Slurm configuration

Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-cluster configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:

* `/etc/slurm/slurm.conf` - can be found on the login nodes and computing nodes.
* `/etc/slurm/gres.conf` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* `/etc/slurm/cgroup.conf` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on the merlin5 or gmerlin6 computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
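
One way to inspect the gmerlin6 configuration files without an interactive session is to read them from within a short job; a minimal sketch:

```bash
# Print the GPU GRES definitions of a gmerlin6 computing node
srun --clusters=gmerlin6 --partition=gpu-short --account=merlin \
     cat /etc/slurm/gres.conf
```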