---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document gives a summary of the Slurm 'gmerlin6' configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below summarizes the hardware setup of the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type                | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :---------------:| :-----------------:| :------------:| :----------------------:| :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000             | 102400           | 102400             | 10000         | **geforce_gtx_1080**    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400             | 10000         | **geforce_gtx_1080**    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400             | 10000         | **geforce_gtx_1080_ti** | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400             | 10000         | **geforce_rtx_2080_ti** | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 4000             | 360448           | 360448             | 10000         | **geforce_rtx_2080_ti** | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900             | 998400           | 998400             | 10000         | **A100**                | 1         | 8         |

{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' and '/etc/slurm/slurm.conf' for changes in the GPU types and hardware details.
{{site.data.alerts.end}}

## Running jobs in the 'gmerlin6' cluster

This chapter covers the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`gpu`**:

```bash
#SBATCH --partition=<partition>  # Possible values: gpu, gpu-short, gwendolen
```

The table below shows all the partitions available to users:

| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|:-------------:| :----------: | :------: | :-----------------: | :--------------: |
| `gpu`         | 1 day        | 1 week   | 1                   | 1                |
| `gpu-short`   | 2 hours      | 2 hours  | 1000                | 500              |
| `gwendolen`   | 1 hour       | 12 hours | 1000                | 1000             |

\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU partitions, Slurm will also try to allocate jobs on the higher priority partitions before the lower priority ones.

\*\*Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.
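Putting the cluster and partition options together, a minimal batch script for the `gpu-short` partition could look as sketched below (the application name and the requested resources are only placeholders, not taken from this documentation):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6     # Submit to the GPU cluster
#SBATCH --partition=gpu-short  # Short GPU jobs (max. 2 hours)
#SBATCH --time=01:00:00        # Requested walltime, within the partition limit
#SBATCH --gpus=1               # Number of GPUs (see the GPU specific options below)

srun ./my_gpu_application      # Placeholder for the actual GPU application
```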
### Merlin6 GPU Accounts

Users need to ensure that the public **`merlin`** account is specified. Not specifying the account option will default to this account. This is mostly relevant for users with multiple Slurm accounts, who may specify a different account by mistake.

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account | Slurm Partitions      | Special QoS                         |
|:-------------:| :--------------------:| :----------------------------------:|
| **`merlin`**  | **`gpu`**,`gpu-short` |                                     |
| `gwendolen`   | `gwendolen`           | `gwendolen`, **`gwendolen_public`** |

By default, all users belong to the `merlin` and `gwendolen` Slurm accounts. Users only need to specify `gwendolen` when using the `gwendolen` partition, otherwise specifying the account is not needed (it will always default to `merlin`). `gwendolen` is a special account with two different **QoS** granting different types of access (see details below).

#### The 'gwendolen' account

For running jobs in the **`gwendolen`** partition, users must specify the `gwendolen` account. The `merlin` account is not allowed to use the `gwendolen` partition.

In addition, Slurm has the concept of **QoS**, which stands for **Quality of Service**. The **`gwendolen`** account has two different QoS configured:

* The **QoS** **`gwendolen_public`** is set by default for all Merlin users. It restricts the amount of resources that can be used on **Gwendolen**. For further information about the restrictions, please read the ['User and Job Limits'](/gmerlin6/slurm-configuration.html#user-and-job-limits) documentation.
* The **QoS** **`gwendolen`** provides full access to **`gwendolen`**, however this is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group.

Users don't need to specify any QoS, however they need to be aware of the resource restrictions. If you belong to one of the projects which is allowed to use **Gwendolen** without restrictions, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/).

### Slurm GPU specific options

Some specific options are available when using GPUs. These are detailed here.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU. The possible `<type>` values and the maximum `<number>` of GPUs depend on the node, as detailed in the table below.

| Nodes                  | GPU Type                  | Max.#GPUs |
|:----------------------:| :------------------------:| :-------: |
| **merlin-g-[001]**     | **`geforce_gtx_1080`**    | 2         |
| **merlin-g-[002-005]** | **`geforce_gtx_1080`**    | 4         |
| **merlin-g-[006-009]** | **`geforce_gtx_1080_ti`** | 4         |
| **merlin-g-[010-013]** | **`geforce_rtx_2080_ti`** | 4         |
| **merlin-g-014**       | **`geforce_rtx_2080_ti`** | 8         |
| **merlin-g-100**       | **`A100`**                | 8         |

#### Constraint / Features

Instead of specifying the GPU **type**, users sometimes need to **select a GPU by the amount of memory available on the GPU card** itself. This is defined in Slurm with **Features**: a tag describing the GPU memory of the different GPU card models. Users can specify the required GPU memory size with the `--constraint` option. In that case, notice that *in many cases there is no need to specify `[<type>:]`* in the `--gpus` option.

```bash
#SBATCH --constraint=<feature>  # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```
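For instance, a job that needs two GPUs with 11GB of GPU memory each, regardless of the exact card model, could combine both options as sketched below (the GPU count is only illustrative):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=2                  # Two GPUs of any model...
#SBATCH --constraint=gpumem_11gb  # ...as long as each card provides 11GB of GPU memory
```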
The table below shows the available **Features** and which GPU card models and GPU nodes they belong to:

| Nodes              | GPU Type              | Feature       |
|:------------------:| :--------------------:| :-----------: |
| merlin-g-[001-005] | `geforce_gtx_1080`    | `gpumem_8gb`  |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
| merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
| merlin-g-100       | `A100`                | `gpumem_40gb` |
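The features and generic resources (GRES) actually configured on each node can also be listed with `sinfo`; the command below is one possible way of doing so (the output format string is chosen freely here):

```bash
# List the gmerlin6 nodes together with their GRES (GPUs) and their features
sinfo --clusters=gmerlin6 --Node --format="%N %G %f"
```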
#### Other GPU options

Additional Slurm options for GPU based jobs are available. Please refer to the **man** pages of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`). The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please notice that when specifying `[<type>:]` in one option, all the other GPU options must use it too!

#### Dealing with Hyper-Threading

The **`gmerlin6`** cluster contains the partition `gwendolen`, which has a node with Hyper-Threading enabled. On that node, one should always specify whether or not to use Hyper-Threading. If not defined, Slurm will generally use it (exceptions apply). For this machine, HT is generally recommended.

```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```

## User and job limits

The GPU cluster enforces some basic user and job limits to ensure that a single user cannot abuse the resources and that the cluster is used fairly. The limits are described below.

### Per job limits

These limits apply to a single job. In other words, there is a maximum amount of resources a single job can use. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account | Mon-Sun 0h-24h                               |
|:-------------:| :------------:| :-------------------------------------------:|
| **gpu**       | **`merlin`**  | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gpu-short** | **`merlin`**  | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gwendolen** | `gwendolen`   | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
| **gwendolen** | `gwendolen`   | gwendolen(No limits, full access granted)    |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account (default account) can not use more than 40 CPUs, 8 GPUs or 200GB of memory. Any job exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**. As there is no other QoS that temporarily overrides the job limits during the week (as happens, for instance, in the CPU **daily** partition), the job needs to be cancelled and resubmitted with the requested resources adapted to the above limits.
* The **gwendolen** partition is a special partition with a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine. Public access is possible through the `gwendolen` account, however this is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job. For full access, the `gwendolen` account with the `gwendolen` **QoS** (Quality of Service) is needed, and this is restricted to a set of users (belonging to the **`unx-gwendolen`** Unix group). Any other user will have by default the QoS **`gwendolen_public`**, which restricts resources in Gwendolen.
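The limits enforced by these QoS can also be queried directly from Slurm. A possible query is shown below (the format field names follow the `sacctmgr` man page and may differ slightly between Slurm versions):

```bash
# Show the per-job (MaxTRES) and per-user (MaxTRESPU) limits of the GPU related QoS
sacctmgr show qos gpu_week,gwendolen_public,gwendolen format=Name%20,MaxTRES%40,MaxTRESPU%40
```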
### Per user limits for GPU partitions

These limits apply exclusively to users. In other words, there is a maximum amount of resources a single user can use. Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account | Mon-Sun 0h-24h                                  |
|:-------------:| :------------:| :-----------------------------------------------:|
| **gpu**       | **`merlin`**  | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gpu-short** | **`merlin`**  | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gwendolen** | `gwendolen`   | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
| **gwendolen** | `gwendolen`   | gwendolen(No limits, full access granted)       |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, 16 GPUs or 400GB of memory in total. Jobs submitted by a user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**. In that case, the jobs can wait in the queue until some of the running resources are freed.
* Notice that the user limits are wider than the per job limits (twice as large). In that way, a user can run, for example, up to two jobs with 8 GPUs each, or up to four jobs with 4 GPUs each. Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same type of GPU.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs. Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found in the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found in the GPU nodes, and is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found in the computing nodes, and is also propagated to the login nodes for user read access.

The configuration files found in the login nodes correspond exclusively to the **merlin6** cluster. Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running), as sketched in the example below.
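As a minimal sketch of that last point, the **gmerlin6** GRES configuration could be printed through a short one-off job (partition, GPU count and time limit are only illustrative values):

```bash
# Run a single command on a gmerlin6 GPU node to inspect its GRES configuration
srun --clusters=gmerlin6 --partition=gpu-short --gpus=1 --time=00:05:00 cat /etc/slurm/gres.conf
```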