Gwendolen changes
This commit is contained in:
parent
252d08affb
commit
a2ddf59412
@ -48,11 +48,12 @@ Users might need to specify the Slurm partition. If no partition is specified, i
|
||||
|
||||
The table below resumes shows all possible partitions available to users:
|
||||
|
||||
| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|
||||
|:-----------------: | :----------: | :------: | :-----------------: | :--------------: |
|
||||
| `gpu` | 1 day | 1 week | 1 | 1 |
|
||||
| `gpu-short` | 2 hours | 2 hours | 1000 | 500 |
|
||||
| `gwendolen` | 1 hour | 12 hours | 1000 | 1000 |
|
||||
| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|
||||
|:-----------------: | :----------: | :--------: | :-----------------: | :--------------: |
|
||||
| `gpu` | 1 day | 1 week | 1 | 1 |
|
||||
| `gpu-short` | 2 hours | 2 hours | 1000 | 500 |
|
||||
| `gwendolen` | 30 minutes | 30 minutes | 1000 | 1000 |
|
||||
| `gwendolen_long` | 30 minutes | 4 hours | 1 | 1 |
|
||||
|
||||
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority
|
||||
partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU
|
||||
@ -71,24 +72,21 @@ This is mostly needed by users which have multiple Slurm accounts, which may def
|
||||
```
|
||||
Not all the accounts can be used on all partitions. This is resumed in the table below:
|
||||
|
||||
| Slurm Account | Slurm Partitions | Special QoS |
|
||||
|:-------------------: | :------------------: | :---------------------------------: |
|
||||
| **`merlin`** | **`gpu`**,`gpu-short` | |
|
||||
| `gwendolen` | `gwendolen` | `gwendolen`, **`gwendolen_public`** |
|
||||
| Slurm Account | Slurm Partitions |
|
||||
|:-------------------: | :------------------: |
|
||||
| **`merlin`** | **`gpu`**,`gpu-short` |
|
||||
| `gwendolen` | `gwendolen`,`gwendolen_long` |
|
||||
|
||||
By default, all users belong to the `merlin` and `gwendolen` Slurm accounts.
|
||||
By default, all users belong to the `merlin` Slurm accounts, and jobs are submitted to the `gpu` partition when no partition is defined.
|
||||
|
||||
Users only need to specify `gwendolen` when using `gwendolen`, otherwise specfying account is not needed (it will always default to `merlin`). `gwendolen` is a special account, with two different **QoS** granting different types of access (see details below).
|
||||
Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen_long` partitions, otherwise specifying account is not needed (it will always default to `merlin`).
|
||||
|
||||
#### The 'gwendolen' account
|
||||
|
||||
For running jobs in the **`gwendolen`** partition, users must specify the `gwendolen` account. The `merlin` account is not allowed to use the `gwendolen` partition.
|
||||
For running jobs in the **`gwendolen`/`gwendolen_long`** partitions, users must specify the **`gwendolen`** account.
|
||||
The `merlin` account is not allowed to use the Gwendolen partitions.
|
||||
|
||||
In addition, in Slurm there is the concept of **QoS**, which stands for **Quality of Service**. The **`gwendolen`** account has two different QoS configured:
|
||||
* The **QoS** **`gwendolen_public`** is set by default to all Merlin users. This restricts the number of resources than can be used on **Gwendolen**. For further information about restrictions, please read the ['User and Job Limits'](/gmerlin6/slurm-configuration.html#user-and-job-limits) documentation.
|
||||
* The **QoS** **`gwendolen`** provides full access to **`gwendolen`**, however this is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group.
|
||||
|
||||
Users don't need to specify any QoS, however, they need to be aware about resources restrictions. If you belong to one of the projects which is allowed to use **Gwendolen** without restrictions, please request access to the **`unx-gwendolen`** through [PSI Service Now](https://psi.service-now.com/).
|
||||
Gwendolen is restricted to a set of users belonging to the **`unx-gwendolen`** Unix group. If you belong to a project allowed to use **Gwendolen**, or you are a user which would like to have access to it, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/): the request will be redirected to the responsible of the project (Andreas Adelmann).
|
||||
|
||||
### Slurm GPU specific options
|
||||
|
||||
@ -184,7 +182,7 @@ Please, notice that when defining `[<type>:]` once, then all other options must
|
||||
|
||||
#### Dealing with Hyper-Threading
|
||||
|
||||
The **`gmerlin6`** cluster contains the partition `gwendolen`, which has a node with Hyper-Threading enabled.
|
||||
The **`gmerlin6`** cluster contains the partitions `gwendolen` and `gwendolen_long`, which have a node with Hyper-Threading enabled.
|
||||
In that case, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will
|
||||
generally use it (exceptions apply). For this machine, generally HT is recommended.
|
||||
|
||||
@ -204,12 +202,12 @@ These are limits applying to a single job. In other words, there is a maximum of
|
||||
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format: `SlurmQoS(limits)`
|
||||
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
|
||||
|
||||
| Partition | Slurm Account | Mon-Sun 0h-24h |
|
||||
|:-------------:| :------------: | :------------------------------------------: |
|
||||
| **gpu** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
|
||||
| **gpu-short** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
|
||||
| **gwendolen** | `gwendolen` | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
|
||||
| **gwendolen** | `gwendolen` | gwendolen(No limits, full access granted) |
|
||||
| Partition | Slurm Account | Mon-Sun 0h-24h |
|
||||
|:------------------:| :------------: | :------------------------------------------: |
|
||||
| **gpu** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
|
||||
| **gpu-short** | **`merlin`** | gpu_week(cpu=40,gres/gpu=8,mem=200G) |
|
||||
| **gwendolen** | `gwendolen` | No limits |
|
||||
| **gwendolen_long** | `gwendolen` | No limits, active from 9pm to 5:30am |
|
||||
|
||||
* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` acccount
|
||||
(default account) can not use more than 40 CPUs, more than 8 GPUs or more than 200GB.
|
||||
@ -218,9 +216,12 @@ As there are no more existing QoS during the week temporary overriding job limit
|
||||
instance in the CPU **daily** partition), the job needs to be cancelled, and the requested resources
|
||||
must be adapted according to the above resource limits.
|
||||
|
||||
* The **gwendolen** partition is a special partition with a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
|
||||
Public access is possible through the `gwendolen` account, however this is limited to 2 GPUs per job, 32 CPUs and 121875MB of memory).
|
||||
For full access, the `gwendolen` account with `gwendolen` **QoS** (Quality of Service) is needed, and this is restricted to a set of users (belonging to the **`unx-gwendolen`** Unix group). Any other user will have by default a QoS **`gwendolen_public`**, which restricts resources in Gwendolen.
|
||||
* The **gwendolen** and **gwendolen_long** partitions are two special partitions for a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
|
||||
Only users belonging to the **`unx-gwendolen`** Unix group can run in these partitions. No limits are applied (machine resources can be completely used).
|
||||
|
||||
* The **`gwendolen_long`** partition is available 24h. However,
|
||||
* from 5:30am to 9pm the partition is `down` (jobs can be submitted, but can not run until the partition is set to `active`).
|
||||
* from 9pm to 5:30am jobs are allowed to run (partition is set to `active`).
|
||||
|
||||
### Per user limits for GPU partitions
|
||||
|
||||
@ -228,12 +229,12 @@ These limits apply exclusively to users. In other words, there is a maximum of r
|
||||
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format: `SlurmQoS(limits)`
|
||||
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
|
||||
|
||||
| Partition | Slurm Account | Mon-Sun 0h-24h |
|
||||
|:-------------:| :----------------: | :---------------------------------------------: |
|
||||
| **gpu** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
|
||||
| **gpu-short** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
|
||||
| **gwendolen** | `gwendolen` | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
|
||||
| **gwendolen** | `gwendolen` | gwendolen(No limits, full access granted) |
|
||||
| Partition | Slurm Account | Mon-Sun 0h-24h |
|
||||
|:------------------:| :----------------: | :---------------------------------------------: |
|
||||
| **gpu** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
|
||||
| **gpu-short** | **`merlin`** | gpu_week(cpu=80,gres/gpu=16,mem=400G) |
|
||||
| **gwendolen** | `gwendolen` | No limits |
|
||||
| **gwendolen_long** | `gwendolen` | No limits, active from 9pm to 5:30am |
|
||||
|
||||
* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB.
|
||||
Jobs sent by any user already exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
|
||||
|
Loading…
x
Reference in New Issue
Block a user