---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes the Slurm configuration of the 'gmerlin6' cluster."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes.

| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :---------------:| :----------------:| :------------:| :--------: | :-------: | :-------: |
| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080** | 1 | 2 |
| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080** | 1 | 4 |
| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080_ti** | 1 | 4 |
| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_rtx_2080_ti** | 1 | 4 |
| merlin-g-014 | 1 core | 48 cores | 1 | 4000 | 360448 | 360448 | 10000 | **geforce_rtx_2080_ti** | 1 | 8 |
| merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | **A100** | 1 | 8 |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}

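For instance, the GPU types and counts currently defined in Slurm can be checked directly from these files on a node where they are present (a minimal sketch; the paths are those referenced above):

```bash
# List the GPU (gres) definitions known to Slurm on this node
grep -i gres /etc/slurm/slurm.conf
cat /etc/slurm/gres.conf
```
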
## Running jobs in the 'gmerlin6' cluster

In this chapter we will cover the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --clusters=gmerlin6
```

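The cluster can also be selected on the command line at submission time. A minimal sketch, where `my_gpu_job.sh` is a placeholder for your batch script:

```bash
# Submit the script to the GPU cluster and check its status there
sbatch --clusters=gmerlin6 my_gpu_job.sh
squeue --clusters=gmerlin6 -u $USER
```
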
### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`gpu`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

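For example, a short test job could be submitted to the `gpu-short` partition, which has a higher priority but a 2 hour limit (see the table below); the time value here is only illustrative:

```bash
#SBATCH --clusters=gmerlin6
#SBATCH --partition=gpu-short
#SBATCH --time=01:00:00   # must stay within the 2 hour limit of gpu-short
```
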
The table below summarizes all the partitions available to users:

| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-----------------: | :--------------: |
| **<u>gpu</u>** | 1 day | 1 week | 1 | 1 |
| **gpu-short** | 2 hours | 2 hours | 1000 | 500 |
| **gwendolen** | 1 hour | 12 hours | 1000 | 1000 |

**\***The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or mainly **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with a lower *PriorityTier* value
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

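The per-job priority breakdown (including the *PARTITION* factor) can be inspected while jobs are pending; a sketch, assuming `sprio` supports the `-M` cluster option on your installation:

```bash
# Long listing of priority factors for pending jobs in the gmerlin6 cluster
sprio -M gmerlin6 -l
```
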
### Merlin6 GPU Accounts

Users might need to specify the Slurm account to be used. If no account is specified, the **`merlin`** **account** will be used by default:

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen_public, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account | Slurm Partitions |
|:-------------------: | :----------------: |
| **<u>merlin</u>** | `gpu`, `gpu-short` |
| **gwendolen_public** | `gwendolen` |
| **gwendolen** | `gwendolen` |

By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts.
The `gwendolen` **account** is only available to a small set of users; all other users must use `gwendolen_public` instead.

To run jobs in the `gwendolen` **partition**, users must specify either the `gwendolen_public` or the `gwendolen` account.
The `merlin` account is not allowed in the `gwendolen` partition.

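Putting this together, a job targeting **gwendolen** with public access would combine the following options (a sketch based on the accounts and partitions described above):

```bash
#SBATCH --clusters=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen_public   # or 'gwendolen', for users with full access
```
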
### GPU specific options

Some additional options are available when using GPUs. These are detailed here.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU.
The different `[<type>:]` values and `<number>` of GPUs depend on the node.
This is detailed in the table below.

| Nodes | GPU Type | Max.#GPUs |
|:------------------:| :-------------------: | :-------: |
| merlin-g-[001] | `geforce_gtx_1080` | 2 |
| merlin-g-[002-005] | `geforce_gtx_1080` | 4 |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | 4 |
| merlin-g-[010-013] | `geforce_rtx_2080_ti` | 4 |
| merlin-g-014 | `geforce_rtx_2080_ti` | 8 |
| merlin-g-100 | `A100` | 8 |

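For instance, to request two `geforce_rtx_2080_ti` cards (an illustrative sketch; adapt the type and number to the table above):

```bash
#SBATCH --gpus=geforce_rtx_2080_ti:2
```
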
#### Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please note that once `[<type>:]` is defined in one option, all the other GPU options must use it too!

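As an illustrative sketch combining some of these options on a 20-core GTX 1080 node (the values are examples only):

```bash
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=geforce_gtx_1080:1   # 4 GPUs in total; same type in every GPU option
#SBATCH --cpus-per-gpu=5                     # 4 GPUs x 5 CPUs = 20 cores
#SBATCH --mem-per-gpu=16000                  # memory per GPU, in MB
```
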
## User and job limits

The GPU cluster enforces some basic user and job limits to ensure that a single user cannot abuse the resources and that the cluster is used fairly.
The limits are described below.

### Per job limits

These are limits that apply to a single job. In other words, there is a maximum amount of resources a single job can use.
Limits are defined using QoS, and these are usually set at the partition level. The limits are described in the table below with the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|:-------------:| :----------------: | :------------------------------------------: |
| **gpu** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
| **gpu-short** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
| **gwendolen** | `gwendolen` | No limits, full access granted |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) cannot use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  Unlike the CPU **daily** partition, there is no additional QoS that temporarily overrides the job limits during
  certain periods of the week, so such a job needs to be cancelled and resubmitted with resource requests
  that fit the limits above.

* The **gwendolen** partition is a special partition with an **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Public access is possible through the `gwendolen_public` account; however, it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job.
  For full access, the `gwendolen` account is needed, and this is restricted to a small set of users.

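The limits currently enforced by these QoS can be checked at any time; a sketch using `sacctmgr` (field names as in the standard Slurm accounting output):

```bash
# MaxTRES shows the per-job limits, MaxTRESPU the per-user limits of each QoS
sacctmgr show qos format=Name,MaxTRES,MaxTRESPU
```
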
### Per user limits for GPU partitions

These limits apply exclusively to users. In other words, there is a maximum amount of resources a single user can use.
Limits are defined using QoS, and these are usually set at the partition level. The limits are described in the table below with the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition | Slurm Account | Mon-Sun 0h-24h |
|:-------------:| :----------------: | :---------------------------------------------: |
| **gpu** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
| **gpu-short** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
| **gwendolen** | `gwendolen` | No limits, full access granted |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory.
  Jobs sent by any user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs will wait in the queue until some of the running resources are freed.

* Notice that the user limits are wider than the per-job limits. In that way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc.
  Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same
  type of GPU.

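Whether a job is being held back by one of these QoS limits can be seen from its pending reason, for example:

```bash
# For pending jobs, the NODELIST(REASON) column shows messages such as QOSMaxGRESPerUser
squeue --clusters=gmerlin6 -u $USER
```
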
## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).

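Alternatively, most of the relevant settings can be queried through Slurm itself without logging in to a node; a sketch using standard `scontrol` queries against the `gmerlin6` cluster:

```bash
# Partition, node and global configuration of the gmerlin6 cluster
scontrol -M gmerlin6 show partition gwendolen
scontrol -M gmerlin6 show node merlin-g-100
scontrol -M gmerlin6 show config | grep -i -e qos -e mem
```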