---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document summarizes the Slurm configuration of the 'gmerlin6' cluster."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation covers the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type                | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :---------------:| :----------------:| :------------:| :---------------------: | :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080**    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080**    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_gtx_1080_ti** | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400            | 10000         | **geforce_rtx_2080_ti** | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 4000             | 360448           | 360448            | 10000         | **geforce_rtx_2080_ti** | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900             | 998400           | 998400            | 10000         | **A100**                | 1         | 8         |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}

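The same information can also be queried directly from Slurm. A minimal sketch, assuming the standard Slurm client tools are available on the login nodes (the format fields can be adjusted as needed):

```bash
# List the gmerlin6 GPU nodes with their partitions, GRES (GPU) configuration, CPUs and memory
sinfo --clusters=gmerlin6 --Node --Format=nodelist,partition,gres,cpus,memory
```
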
## Running jobs in the 'gmerlin6' cluster

This chapter covers the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

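The cluster also has to be specified when inspecting or managing jobs from the command line. A minimal sketch, where `job.sh` is a hypothetical job script name used only for illustration:

```bash
sbatch --clusters=gmerlin6 job.sh       # Submit job.sh (hypothetical script) to the gmerlin6 cluster
squeue --clusters=gmerlin6 -u $USER     # Show your pending and running jobs in gmerlin6
scancel --clusters=gmerlin6 <job_id>    # Cancel a gmerlin6 job by its job ID
```
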
### Merlin6 GPU partitions

Users may need to specify the Slurm partition. If no partition is specified, jobs default to the **`gpu`** partition:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen, gwendolen-long
```

The table below summarizes all partitions available to users:

| GPU Partition       | Default Time | Max Time   | PriorityJobFactor\* | PriorityTier\*\* |
|:------------------: | :----------: | :--------: | :-----------------: | :--------------: |
| `gpu`               | 1 day        | 1 week     | 1                   | 1                |
| `gpu-short`         | 2 hours      | 2 hours    | 1000                | 500              |
| `gwendolen`         | 30 minutes   | 2 hours    | 1000                | 1000             |
| `gwendolen-long`    | 30 minutes   | 8 hours    | 1                   | 1                |

\*The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs submitted to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

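The current settings of each partition (time limits, allowed accounts, priorities, etc.) can be inspected directly from Slurm; a minimal sketch using standard Slurm commands:

```bash
scontrol -M gmerlin6 show partition gpu        # Full configuration of the 'gpu' partition
scontrol -M gmerlin6 show partition gwendolen  # Full configuration of the 'gwendolen' partition
```
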
### Merlin6 GPU Accounts

Users need to ensure that the public **`merlin`** account is specified. If no account is specified, jobs default to this account.
Specifying it explicitly is mostly relevant for users who have multiple Slurm accounts and might mistakenly submit with a different one.

```bash
#SBATCH --account=merlin   # Possible values: merlin, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account         | Slurm Partitions             |
|:--------------------: | :--------------------------: |
| **`merlin`**          | **`gpu`**,`gpu-short`        |
| `gwendolen`           | `gwendolen`,`gwendolen-long` |

By default, all users belong to the `merlin` Slurm account, and jobs are submitted to the `gpu` partition when no partition is defined.

Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen-long` partitions; otherwise, specifying the account is not needed (it always defaults to `merlin`).

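To check which Slurm accounts your user is entitled to use, the accounting database can be queried; a minimal sketch (the format fields are illustrative and may be adjusted):

```bash
# List the cluster/account associations of the current user
sacctmgr show associations user=$USER format=cluster,account,partition,qos%40
```
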
#### The 'gwendolen' account

To run jobs in the **`gwendolen`**/**`gwendolen-long`** partitions, users must specify the **`gwendolen`** account.
The `merlin` account is not allowed to use the Gwendolen partitions.

Gwendolen is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group. If you belong to a project allowed to use **Gwendolen**, or you are a user who would like to have access to it, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/): the request will be redirected to the person responsible for the project (Andreas Adelmann).

### Slurm GPU specific options

Some Slurm options are specific to the usage of GPUs. The most common ones are detailed below.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU.
The possible `[<type>:]` values and the maximum `<number>` of GPUs depend on the node.
This is detailed in the table below.

| Nodes                  | GPU Type                  | Max.#GPUs |
|:---------------------: | :-----------------------: | :-------: |
| **merlin-g-[001]**     | **`geforce_gtx_1080`**    | 2         |
| **merlin-g-[002-005]** | **`geforce_gtx_1080`**    | 4         |
| **merlin-g-[006-009]** | **`geforce_gtx_1080_ti`** | 4         |
| **merlin-g-[010-013]** | **`geforce_rtx_2080_ti`** | 4         |
| **merlin-g-014**       | **`geforce_rtx_2080_ti`** | 8         |
| **merlin-g-100**       | **`A100`**                | 8         |

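As an example, the following hypothetical job header requests two `geforce_rtx_2080_ti` cards, which according to the table above can only be satisfied by the merlin-g-[010-014] nodes:

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=geforce_rtx_2080_ti:2   # Two GPUs of type geforce_rtx_2080_ti
```
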
#### Constraint / Features

Instead of specifying the GPU **type**, users may sometimes need to **select the GPU by the amount of memory available on the GPU card** itself.
This is defined in Slurm with **Features**: a tag that encodes the GPU memory size of the different GPU cards.
Users can select the required GPU memory size with the `--constraint` option. In that case, notice that *in many cases
there is no need to specify `[<type>:]`* in the `--gpus` option.

```bash
#SBATCH --constraint=<Feature>   # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```

The table below shows the available **Features** and which GPU card models and GPU nodes they belong to:

<table>
<thead>
<tr>
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="3">Merlin6 GPU Computing Nodes</th>
</tr>
<tr>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Nodes</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU Type</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Feature</th>
</tr>
</thead>
<tbody>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[001-005]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_8gb`</b></td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[006-009]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080_ti`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="2"><b>`gpumem_11gb`</b></td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[010-014]</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_rtx_2080_ti`</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-100</b></td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`A100`</td>
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_40gb`</b></td>
</tr>
</tbody>
</table>

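For example, a hypothetical job requesting two GPUs with 11GB of memory, which could be satisfied by either `geforce_gtx_1080_ti` or `geforce_rtx_2080_ti` cards:

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=2                   # No [<type>:] prefix needed here
#SBATCH --constraint=gpumem_11gb   # Any GPU model providing 11GB of memory
```
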
#### Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please notice that once `[<type>:]` is defined in one of these options, all the other GPU options must use it as well.

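As a hypothetical illustration of this rule, the same GPU type is repeated in every typed GPU option:

```bash
#SBATCH --nodes=1
#SBATCH --gpus=geforce_gtx_1080_ti:2            # GPU type defined here ...
#SBATCH --gpus-per-node=geforce_gtx_1080_ti:2   # ... must also be used in any other typed GPU option
```
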
#### Dealing with Hyper-Threading

The **`gmerlin6`** cluster contains the partitions `gwendolen` and `gwendolen-long`, which have a node with Hyper-Threading enabled.
On that node, one should always specify whether or not to use Hyper-Threading. If not defined, Slurm will
generally use it (exceptions apply). For this machine, Hyper-Threading is generally recommended.

```bash
#SBATCH --hint=multithread     # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread   # Don't use extra threads with in-core multi-threading.
```

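Putting the previous options together, a hypothetical job header for the Hyper-Threaded **gwendolen** node could look as follows (the requested resources are illustrative only):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen
#SBATCH --gpus=A100:4
#SBATCH --hint=multithread   # Hyper-Threading is generally recommended on this machine
```
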
## User and job limits

The GPU cluster enforces some basic user and job limits to ensure fair usage of the cluster and to prevent a single user from monopolizing its resources.
The limits are described below.

### Per job limits

These limits apply to a single job. In other words, they define the maximum amount of resources a single job can use.
Limits are defined using QoS, usually set at the partition level. They are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition          | Slurm Account  | Mon-Sun 0h-24h                               |
|:------------------:| :------------: | :------------------------------------------: |
| **gpu**            | **`merlin`**   | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gpu-short**      | **`merlin`**   | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gwendolen**      | `gwendolen`    | No limits                                    |
| **gwendolen-long** | `gwendolen`    | No limits, active from 9pm to 5:30am         |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) can not use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  Since there is no additional QoS that temporarily overrides the job limits during the week (as happens, for
  instance, in the CPU **daily** partition), such a job needs to be cancelled and resubmitted with resource
  requests that fit within the above limits.

* The **gwendolen** and **gwendolen-long** partitions are two special partitions for a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Only users belonging to the **`unx-gwendolen`** Unix group can run in these partitions. No limits are applied (the machine resources can be used completely).

* The **`gwendolen-long`** partition is available 24h. However,
  * from 5:30am to 9pm the partition is `down` (jobs can be submitted, but can not run until the partition is set to `active`).
  * from 9pm to 5:30am jobs are allowed to run (the partition is set to `active`).

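As a reference, a hypothetical job header that stays exactly within the per-job limits of the `gpu` partition (values are illustrative only):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=40    # At most cpu=40 per job
#SBATCH --gpus=8       # At most gres/gpu=8 per job
#SBATCH --mem=200G     # At most mem=200G per job
```
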
### Per user limits for GPU partitions

These limits apply to each user. In other words, they define the maximum amount of resources a single user can use in total.
Limits are defined using QoS, usually set at the partition level. They are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition          | Slurm Account      | Mon-Sun 0h-24h                                  |
|:------------------:| :----------------: | :---------------------------------------------: |
| **gpu**            | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gpu-short**      | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gwendolen**      | `gwendolen`        | No limits                                       |
| **gwendolen-long** | `gwendolen`        | No limits, active from 9pm to 5:30am            |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory.
  Jobs sent by a user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs will wait in the queue until some of the resources in use are freed.

* Notice that the user limits are wider than the job limits (twice as large). In that way, a user can run up to two jobs using 8 GPUs each, or up to four jobs using 4 GPUs each, etc.
  Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same
  type of GPU.

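The exact values behind these QoS limits can be queried from the Slurm accounting database; a minimal sketch (the format fields are illustrative and may differ between Slurm versions):

```bash
# Show the per-job (MaxTRES) and per-user (MaxTRESPU) limits of the 'gpu_week' QoS
sacctmgr show qos gpu_week format=name%20,maxtres%40,maxtrespu%40
```
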
## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, which allows multiple clusters to be integrated into the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
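
A minimal sketch of how to do this non-interactively, using `srun` to read a configuration file directly on a gmerlin6 node (the requested resources are illustrative only):

```bash
# Print the GRES configuration of a gmerlin6 GPU node
srun --clusters=gmerlin6 --partition=gpu-short --gpus=1 --time=00:01:00 cat /etc/slurm/gres.conf
```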