---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document summarizes the Slurm configuration of the 'gmerlin6' cluster."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation covers the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type                | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :---------------:| :----------------:| :------------:| :---------------------: | :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080**    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080**    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080_ti** | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_rtx_2080_ti** | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 4000             | 360448           | 360448            | 10000         | **geforce_rtx_2080_ti** | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900             | 998400           | 998400            | 10000         | **A100**                | 1         | 8         |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}

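The same information can also be queried directly from Slurm. A minimal sketch, assuming the standard Slurm client tools are available on the login nodes (the format fields can be adjusted as needed):

```bash
# List the gmerlin6 GPU nodes with their partitions, GRES (GPU) configuration, CPUs and memory
sinfo --clusters=gmerlin6 --Node --Format=nodelist,partition,gres,cpus,memory
```
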
## Running jobs in the 'gmerlin6' cluster

This chapter covers the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

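The cluster also has to be specified when inspecting or managing jobs from the command line. A minimal sketch, where `job.sh` is a hypothetical job script name used only for illustration:

```bash
sbatch --clusters=gmerlin6 job.sh       # Submit job.sh (hypothetical script) to the gmerlin6 cluster
squeue --clusters=gmerlin6 -u $USER     # Show your pending and running jobs in gmerlin6
scancel --clusters=gmerlin6 <job_id>    # Cancel a gmerlin6 job by its job ID
```
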
### Merlin6 GPU partitions

Users may need to specify the Slurm partition. If no partition is specified, jobs default to the **`gpu`** partition:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen, gwendolen-long
```

The table below summarizes all partitions available to users:

| GPU Partition       | Default Time | Max Time   | PriorityJobFactor\* | PriorityTier\*\* |
|:------------------: | :----------: | :--------: | :-----------------: | :--------------: |
| `gpu`               | 1 day        | 1 week     | 1                   | 1                |
| `gpu-short`         | 2 hours      | 2 hours    | 1000                | 500              |
| `gwendolen`         | 30 minutes   | 2 hours    | 1000                | 1000             |
| `gwendolen-long`    | 30 minutes   | 8 hours    | 1                   | 1                |

\*The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs submitted to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

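The current settings of each partition (time limits, allowed accounts, priorities, etc.) can be inspected directly from Slurm; a minimal sketch using standard Slurm commands:

```bash
scontrol -M gmerlin6 show partition gpu        # Full configuration of the 'gpu' partition
scontrol -M gmerlin6 show partition gwendolen  # Full configuration of the 'gwendolen' partition
```
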
### Merlin6 GPU Accounts

Users need to ensure that the public **`merlin`** account is specified. If no account is specified, jobs default to this account.
Specifying it explicitly is mostly relevant for users who have multiple Slurm accounts and might mistakenly submit with a different one.

```bash
#SBATCH --account=merlin   # Possible values: merlin, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account         | Slurm Partitions             |
|:--------------------: | :--------------------------: |
| **`merlin`**          | **`gpu`**,`gpu-short`        |
| `gwendolen`           | `gwendolen`,`gwendolen-long` |

By default, all users belong to the `merlin` Slurm account, and jobs are submitted to the `gpu` partition when no partition is defined.

Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen-long` partitions; otherwise, specifying the account is not needed (it always defaults to `merlin`).

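To check which Slurm accounts your user is entitled to use, the accounting database can be queried; a minimal sketch (the format fields are illustrative and may be adjusted):

```bash
# List the cluster/account associations of the current user
sacctmgr show associations user=$USER format=cluster,account,partition,qos%40
```
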
#### The 'gwendolen' account

To run jobs in the **`gwendolen`**/**`gwendolen-long`** partitions, users must specify the **`gwendolen`** account.
The `merlin` account is not allowed to use the Gwendolen partitions.

Gwendolen is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group. If you belong to a project allowed to use **Gwendolen**, or you are a user who would like to have access to it, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/): the request will be redirected to the person responsible for the project (Andreas Adelmann).

### Slurm GPU specific options

Some Slurm options are specific to the usage of GPUs. The most common ones are detailed below.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU.
The possible `[<type>:]` values and the maximum `<number>` of GPUs depend on the node.
This is detailed in the table below.

| Nodes                  | GPU Type                  | Max.#GPUs |
|:---------------------: | :-----------------------: | :-------: |
| **merlin-g-[001]**     | **`geforce_gtx_1080`**    | 2         |
| **merlin-g-[002-005]** | **`geforce_gtx_1080`**    | 4         |
| **merlin-g-[006-009]** | **`geforce_gtx_1080_ti`** | 4         |
| **merlin-g-[010-013]** | **`geforce_rtx_2080_ti`** | 4         |
| **merlin-g-014**       | **`geforce_rtx_2080_ti`** | 8         |
| **merlin-g-100**       | **`A100`**                | 8         |

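As an example, the following hypothetical job header requests two `geforce_rtx_2080_ti` cards, which according to the table above can only be satisfied by the merlin-g-[010-014] nodes:

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=geforce_rtx_2080_ti:2   # Two GPUs of type geforce_rtx_2080_ti
```
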
#### Constraint / Features

Instead of specifying the GPU **type**, users may sometimes need to **select the GPU by the amount of memory available on the GPU card** itself.
This is defined in Slurm with **Features**: a tag that encodes the GPU memory size of the different GPU cards.
Users can select the required GPU memory size with the `--constraint` option. In that case, notice that *in many cases
there is no need to specify `[<type>:]`* in the `--gpus` option.

```bash
#SBATCH --constraint=<Feature>   # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```

The table below shows the available **Features** and which GPU card models and GPU nodes they belong to:

<table>
<thead>
<tr>
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="3">Merlin6 GPU Computing Nodes</th>
</tr>
<tr>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Nodes</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU Type</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Feature</th>
</tr>
</thead>
<tbody>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[001-005]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_8gb`</b></td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[006-009]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080_ti`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="2"><b>`gpumem_11gb`</b></td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[010-014]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_rtx_2080_ti`</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-100</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`A100`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_40gb`</b></td>
</tr>
</tbody>
</table>

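For example, a hypothetical job requesting two GPUs with 11GB of memory, which could be satisfied by either `geforce_gtx_1080_ti` or `geforce_rtx_2080_ti` cards:

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=2                   # No [<type>:] prefix needed here
#SBATCH --constraint=gpumem_11gb   # Any GPU model providing 11GB of memory
```
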
#### Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please notice that once `[<type>:]` is defined in one of these options, all the other GPU options must use it as well.

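As a hypothetical illustration of this rule, the same GPU type is repeated in every typed GPU option:

```bash
#SBATCH --nodes=1
#SBATCH --gpus=geforce_gtx_1080_ti:2            # GPU type defined here ...
#SBATCH --gpus-per-node=geforce_gtx_1080_ti:2   # ... must also be used in any other typed GPU option
```
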
#### Dealing with Hyper-Threading

The **`gmerlin6`** cluster contains the partitions `gwendolen` and `gwendolen-long`, which have a node with Hyper-Threading enabled.
On that node, one should always specify whether or not to use Hyper-Threading. If not defined, Slurm will
generally use it (exceptions apply). For this machine, Hyper-Threading is generally recommended.

```bash
#SBATCH --hint=multithread     # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread   # Don't use extra threads with in-core multi-threading.
```

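Putting the previous options together, a hypothetical job header for the Hyper-Threaded **gwendolen** node could look as follows (the requested resources are illustrative only):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen
#SBATCH --gpus=A100:4
#SBATCH --hint=multithread   # Hyper-Threading is generally recommended on this machine
```
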
## User and job limits

The GPU cluster enforces some basic user and job limits to ensure fair usage of the cluster and to prevent a single user from monopolizing its resources.
The limits are described below.

### Per job limits

These limits apply to a single job. In other words, they define the maximum amount of resources a single job can use.
Limits are defined using QoS, usually set at the partition level. They are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition          | Slurm Account  | Mon-Sun 0h-24h                               |
|:------------------:| :------------: | :------------------------------------------: |
| **gpu**            | **`merlin`**   | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gpu-short**      | **`merlin`**   | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gwendolen**      | `gwendolen`    | No limits                                    |
| **gwendolen-long** | `gwendolen`    | No limits, active from 9pm to 5:30am         |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) can not use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  Since there is no additional QoS that temporarily overrides the job limits during the week (as happens, for
  instance, in the CPU **daily** partition), such a job needs to be cancelled and resubmitted with resource
  requests that fit within the above limits.

* The **gwendolen** and **gwendolen-long** partitions are two special partitions for a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Only users belonging to the **`unx-gwendolen`** Unix group can run in these partitions. No limits are applied (the machine resources can be used completely).

* The **`gwendolen-long`** partition is available 24h. However,
  * from 5:30am to 9pm the partition is `down` (jobs can be submitted, but can not run until the partition is set to `active`).
  * from 9pm to 5:30am jobs are allowed to run (the partition is set to `active`).

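As a reference, a hypothetical job header that stays exactly within the per-job limits of the `gpu` partition (values are illustrative only):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=40    # At most cpu=40 per job
#SBATCH --gpus=8       # At most gres/gpu=8 per job
#SBATCH --mem=200G     # At most mem=200G per job
```
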
### Per user limits for GPU partitions

These limits apply to each user. In other words, they define the maximum amount of resources a single user can use in total.
Limits are defined using QoS, usually set at the partition level. They are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition          | Slurm Account      | Mon-Sun 0h-24h                                  |
|:------------------:| :----------------: | :---------------------------------------------: |
| **gpu**            | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gpu-short**      | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gwendolen**      | `gwendolen`        | No limits                                       |
| **gwendolen-long** | `gwendolen`        | No limits, active from 9pm to 5:30am            |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory.
  Jobs sent by a user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs will wait in the queue until some of the resources in use are freed.

* Notice that the user limits are wider than the job limits (twice as large). In that way, a user can run up to two jobs using 8 GPUs each, or up to four jobs using 4 GPUs each, etc.
  Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same
  type of GPU.

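The exact values behind these QoS limits can be queried from the Slurm accounting database; a minimal sketch (the format fields are illustrative and may differ between Slurm versions):

```bash
# Show the per-job (MaxTRES) and per-user (MaxTRESPU) limits of the 'gpu_week' QoS
sacctmgr show qos gpu_week format=name%20,maxtres%40,maxtrespu%40
```
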
## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, which allows multiple clusters to be integrated into the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
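
A minimal sketch of how to do this non-interactively, using `srun` to read a configuration file directly on a gmerlin6 node (the requested resources are illustrative only):

```bash
# Print the GRES configuration of a gmerlin6 GPU node
srun --clusters=gmerlin6 --partition=gpu-short --gpus=1 --time=00:01:00 cat /etc/slurm/gres.conf
```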