Gwendolen changes

caubet_m 2021-08-04 14:02:49 +02:00
parent 252d08affb
commit a2ddf59412


@@ -49,10 +49,11 @@ Users might need to specify the Slurm partition. If no partition is specified, i
The table below shows all possible partitions available to users:
| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :--------: | :-----------------: | :--------------: |
| `gpu` | 1 day | 1 week | 1 | 1 |
| `gpu-short` | 2 hours | 2 hours | 1000 | 500 |
| `gwendolen` | 30 minutes | 30 minutes | 1000 | 1000 |
| `gwendolen_long` | 30 minutes | 4 hours | 1 | 1 |
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
@@ -71,24 +72,21 @@ This is mostly needed by users which have multiple Slurm accounts, which may def
Not all accounts can be used on all partitions. This is summarized in the table below:
| Slurm Account | Slurm Partitions |
|:-------------------: | :------------------: |
| **`merlin`** | **`gpu`**,`gpu-short` |
| `gwendolen` | `gwendolen`,`gwendolen_long` |
By default, all users belong to the `merlin` Slurm account, and jobs are submitted to the `gpu` partition when no partition is defined.
Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen_long` partitions; otherwise, specifying an account is not needed (it will always default to `merlin`).
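For example, a submission relying on the defaults versus one targeting Gwendolen might look as follows (the script name `job.sh` is just a placeholder):
```bash
# Defaults: 'merlin' account, 'gpu' partition -- nothing needs to be specified
sbatch job.sh

# Gwendolen: the account must be given explicitly
sbatch --partition=gwendolen --account=gwendolen job.sh
```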
#### The 'gwendolen' account
For running jobs in the **`gwendolen`/`gwendolen_long`** partitions, users must specify the **`gwendolen`** account.
The `merlin` account is not allowed to use the Gwendolen partitions.
Gwendolen is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group. If you belong to a project allowed to use **Gwendolen**, or you would like to have access to it, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/): the request will be forwarded to the person responsible for the project (Andreas Adelmann).
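As a sketch only (the GPU count, memory, run time, and executable below are illustrative placeholders, not site recommendations), a batch script for the Gwendolen node could start like this:
```bash
#!/bin/bash
#SBATCH --partition=gwendolen      # or gwendolen_long for the overnight partition
#SBATCH --account=gwendolen        # mandatory: the 'merlin' account cannot submit here
#SBATCH --gres=gpu:2               # example GPU request
#SBATCH --time=00:30:00            # must fit within the partition's Max Time
#SBATCH --mem=100G                 # example memory request

srun ./my_gpu_application          # placeholder executable
```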
### Slurm GPU specific options
@@ -184,7 +182,7 @@ Please, notice that when defining `[<type>:]` once, then all other options must
#### Dealing with Hyper-Threading
The **`gmerlin6`** cluster contains the partitions `gwendolen` and `gwendolen_long`, which share a node with Hyper-Threading enabled.
In that case, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will
generally use it (exceptions apply). For this machine, HT is generally recommended.
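To make the choice explicit rather than relying on Slurm's default, the standard `--hint` option can be passed at submission time; the lines below are illustrations only, with `job.sh` again a placeholder:
```bash
# Use Hyper-Threading (generally recommended on this node)
sbatch --partition=gwendolen --account=gwendolen --hint=multithread job.sh

# Or explicitly disable it
sbatch --partition=gwendolen --account=gwendolen --hint=nomultithread job.sh
```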
@@ -205,11 +203,11 @@ Limits are defined using QoS, and this is usually set at the partition level. Li
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
| Partition | Slurm Account | Mon-Sun 0h-24h |
|:------------------:| :------------: | :------------------------------------------: |
| **gpu** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
| **gpu-short** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
| **gwendolen** | `gwendolen` | No limits |
| **gwendolen_long** | `gwendolen` | No limits, active from 9pm to 5:30am |
* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
(default account) cannot use more than 40 CPUs, more than 8 GPUs, or more than 200GB of memory.
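To inspect these limits directly, the `sacctmgr` command quoted above can be narrowed down with a format string; the field names below are standard Slurm accounting columns, and the exact output depends on the site configuration:
```bash
# List the QoS definitions with their per-job and per-user TRES limits
sacctmgr show qos format=Name%20,MaxTRES%40,MaxTRESPU%40
```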
@@ -218,9 +216,12 @@ As there are no more existing QoS during the week temporary overriding job limit
instance in the CPU **daily** partition), the job needs to be cancelled, and the requested resources
must be adapted according to the above resource limits.
* The **gwendolen** and **gwendolen_long** partitions are two special partitions for a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
Only users belonging to the **`unx-gwendolen`** Unix group can run in these partitions. No limits are applied (machine resources can be completely used).
* The **`gwendolen_long`** partition accepts job submissions 24h a day. However:
  * from 5:30am to 9pm the partition is `down` (jobs can be submitted, but cannot run until the partition is set back to `active`).
  * from 9pm to 5:30am jobs are allowed to run (the partition is set to `active`).
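Whether the partitions are currently usable can be checked with a standard `sinfo` query (shown here only as a hint; the AVAIL column reports the up/down state mentioned above):
```bash
# Show the current state of the Gwendolen partitions
sinfo -p gwendolen,gwendolen_long
```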
### Per user limits for GPU partitions
@@ -229,11 +230,11 @@ Limits are defined using QoS, and this is usually set at the partition level. Li
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
| Partition | Slurm Account | Mon-Sun 0h-24h |
|:------------------:| :----------------: | :---------------------------------------------: |
| **gpu** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
| **gpu-short** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
| **gwendolen** | `gwendolen` | No limits |
| **gwendolen_long** | `gwendolen` | No limits, active from 9pm to 5:30am |
* With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs, or more than 400GB of memory.
Jobs submitted by any user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
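A quick way to check whether one of your pending jobs is being held back by these per-user limits is to look at the reason column in `squeue`; the field selection below is just an illustrative choice of standard format codes:
```bash
# Show your jobs with their state and pending reason (e.g. QOSMaxCpuPerUser)
squeue -u $USER -o "%.12i %.12P %.10T %.30r"
```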