---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document summarizes the Slurm configuration of the 'gmerlin6' cluster."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type                | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :---------------:| :-----------------:| :------------:| :----------------------:| :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000             | 102400           | 102400              | 10000         | **geforce_gtx_1080**    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400              | 10000         | **geforce_gtx_1080**    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400              | 10000         | **geforce_gtx_1080_ti** | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 4000             | 102400           | 102400              | 10000         | **geforce_rtx_2080_ti** | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 4000             | 360448           | 360448              | 10000         | **geforce_rtx_2080_ti** | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900             | 998400           | 998400              | 10000         | **A100**                | 1         | 8         |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}
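
The GPU types and counts configured in Slurm can also be queried directly from the command line. A small sketch using standard Slurm commands (the exact output format depends on the installed Slurm version):

```bash
# List the generic resources (GRES) configured on each GPU node
sinfo --clusters=gmerlin6 --Format=nodehost,gres:40,statecompact

# Show the full definition of a single node, including its GPUs
scontrol --clusters=gmerlin6 show node merlin-g-014 | grep -i gres
```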

## Running jobs in the 'gmerlin6' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```
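
The cluster name can also be passed directly to the Slurm command line tools when submitting or inspecting jobs from the login nodes. A minimal sketch (`myjob.sh` is a placeholder for your own batch script):

```bash
# Submit a batch script to the gmerlin6 cluster
sbatch --clusters=gmerlin6 myjob.sh

# Check your queued and running jobs on gmerlin6
squeue --clusters=gmerlin6 -u $USER
```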

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`gpu`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

The table below shows all the partitions available to users:

| GPU Partition  | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|:--------------:| :----------: | :------: | :-----------------: | :--------------: |
| **<u>gpu</u>** | 1 day        | 1 week   | 1                   | 1                |
| **gpu-short**  | 2 hours      | 2 hours  | 1000                | 500              |
| **gwendolen**  | 1 hour       | 12 hours | 1000                | 1000             |

**\***The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** can also affect that decision). For the GPU
partitions, Slurm will also try to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.
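
The partition limits and priority settings described above can also be inspected from the command line. A small sketch using standard Slurm tools (the exact output depends on the Slurm version):

```bash
# Show the limits and priority settings of a partition
scontrol --clusters=gmerlin6 show partition gpu-short

# Show the priority factors of your pending jobs (PARTITION column included)
sprio --clusters=gmerlin6 -l -u $USER
```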

### Merlin6 GPU Accounts

Users might need to specify the Slurm account to be used. If no account is specified, the **`merlin`** **account** will be used by default:

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen_public, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account        | Slurm Partitions  |
|:-------------------: | :---------------: |
| **<u>merlin</u>**    | `gpu`,`gpu-short` |
| **gwendolen_public** | `gwendolen`       |
| **gwendolen**        | `gwendolen`       |

By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts.
The `gwendolen` **account** is only available to a small set of users; all other users must use `gwendolen_public` instead.

To run jobs in the `gwendolen` **partition**, users must specify either the `gwendolen_public` or the `gwendolen` account.
The `merlin` account is not allowed in the `gwendolen` partition.
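
Which accounts (associations) are available to your user can be checked with `sacctmgr`. A minimal sketch, assuming the standard Slurm accounting tools are available on the login nodes:

```bash
# List the cluster/account associations of the current user
sacctmgr show associations where user=$USER cluster=gmerlin6 format=cluster,account%20,user,partition
```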

### GPU specific options

Some additional options are available when using GPUs. These are detailed below.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if it is left empty, Slurm will try to allocate any available type of GPU.
The possible `[<type>:]` values and the maximum `<number>` of GPUs depend on the node.
This is detailed in the table below.

| Nodes              | GPU Type              | #GPUs |
|:------------------:| :-------------------: | :---: |
| merlin-g-[001]     | `geforce_gtx_1080`    | 2     |
| merlin-g-[002-005] | `geforce_gtx_1080`    | 4     |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | 4     |
| merlin-g-[010-013] | `geforce_rtx_2080_ti` | 4     |
| merlin-g-014       | `geforce_rtx_2080_ti` | 8     |
| merlin-g-100       | `A100`                | 8     |
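
For example, to request two RTX 2080 Ti GPUs (an illustrative request; adjust the type and number to your needs):

```bash
#SBATCH --gpus=geforce_rtx_2080_ti:2
```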

#### Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please note that once `[<type>:]` is specified in one option, all other GPU options must specify it as well!
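
Putting the previous settings together, a GPU batch script could look like the following sketch. The job name, resource values and application are only illustrative; adjust them to your actual needs:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6            # GPU cluster
#SBATCH --partition=gpu               # Partition: gpu, gpu-short or gwendolen
#SBATCH --account=merlin              # Account: merlin, gwendolen_public or gwendolen
#SBATCH --job-name=gpu-test           # Job name (illustrative)
#SBATCH --time=0-12:00:00             # Requested wall time (within the partition limits)
#SBATCH --gpus=geforce_rtx_2080_ti:4  # Number (and type) of GPUs
#SBATCH --cpus-per-gpu=4              # CPUs allocated per GPU
#SBATCH --mem-per-gpu=20G             # Memory allocated per GPU

# Replace with your actual GPU application
srun my_gpu_application
```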

## User and job limits

The GPU cluster enforces some basic per-user and per-job limits to ensure fair usage of the cluster and to prevent a single user from abusing the resources.
The limits are described below.

### Per job limits

These limits apply to a single job. In other words, they define the maximum amount of resources a single job can use.
Limits are defined using QoS, which is usually set at the partition level. The limits are described in the table below in the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account      | Mon-Sun 0h-24h                               |
|:-------------:| :----------------: | :------------------------------------------: |
| **gpu**       | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gpu-short** | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
| **gwendolen** | `gwendolen`        | No limits, full access granted               |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) can not use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  Since there is no other QoS that temporarily overrides the job limits during the week (as happens, for
  instance, in the CPU **daily** partition), such a job needs to be cancelled and resubmitted with resource
  requests that fit within the above limits.

* The **gwendolen** partition is a special partition with a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Public access is possible through the `gwendolen_public` account, however it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job.
  For full access, the `gwendolen` account is needed, and this is restricted to a small set of users.
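
The QoS limits listed above can be inspected directly with `sacctmgr`. A minimal sketch (the exact field names and widths may vary between Slurm versions):

```bash
# Show per-job (MaxTRES) and per-user (MaxTRESPU) limits of the existing QoS
sacctmgr show qos format=name%20,maxtres%40,maxtrespu%40
```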

### Per user limits for GPU partitions

These limits apply to each user. In other words, they define the maximum amount of resources a single user can use.
Limits are defined using QoS, which is usually set at the partition level. The limits are described in the table below in the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account      | Mon-Sun 0h-24h                                  |
|:-------------:| :----------------: | :---------------------------------------------: |
| **gpu**       | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gpu-short** | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
| **gwendolen** | `gwendolen`        | No limits, full access granted                  |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory.
  Jobs sent by a user already exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs will wait in the queue until some of the user's running resources are freed.

* Notice that the user limits are wider than the job limits: a user can run, for instance, up to two jobs with 8 GPUs each, or up to four jobs with 4 GPUs each.
  Please avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same
  type of GPU.
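
If a job stays pending because of these limits, the reason reported by Slurm can be checked with `squeue`. A small sketch, where the `%r` field prints the pending reason:

```bash
# Show the state and pending reason of your jobs in the gmerlin6 cluster
squeue --clusters=gmerlin6 -u $USER -o "%.18i %.9P %.12j %.8T %r"
```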

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, which allows integrating multiple clusters into the same batch system.

To understand the Slurm configuration of the cluster, it can sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes, and is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes, and is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
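
A possible way to inspect the **gmerlin6** configuration files is to open an interactive session on one of its GPU nodes and read them from there. This is only a sketch, assuming a free GPU is available on the `gpu` partition:

```bash
# Request an interactive shell on a gmerlin6 GPU node
srun --clusters=gmerlin6 --partition=gpu --gpus=1 --pty bash

# Then, from the allocated node:
cat /etc/slurm/gres.conf
less /etc/slurm/slurm.conf
```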