diff --git a/_data/sidebars/merlin6_sidebar.yml b/_data/sidebars/merlin6_sidebar.yml index d065a63..aef31a8 100644 --- a/_data/sidebars/merlin6_sidebar.yml +++ b/_data/sidebars/merlin6_sidebar.yml @@ -9,16 +9,20 @@ entries: # URLs for top-level folders are optional. If omitted it is a bit easier to toggle the accordion. #url: /merlin6/introduction.html folderitems: + - title: Introduction + url: /merlin6/introduction.html - title: Code Of Conduct url: /merlin6/code-of-conduct.html - title: Requesting Accounts url: /merlin6/request-account.html - title: Requesting Projects url: /merlin6/request-project.html + - title: Accessing the Interactive Nodes + url: /merlin6/interactive.html + - title: Accessing the Slurm Clusters + url: /merlin6/slurm-access.html - title: How To Use Merlin folderitems: - - title: Accessing Interactive Nodes - url: /merlin6/interactive.html - title: Accessing from a Linux client url: /merlin6/connect-from-linux.html - title: Accessing from a Windows client @@ -39,8 +43,6 @@ entries: url: /merlin6/using-modules.html - title: Job Submission folderitems: - - title: Accessing Slurm Cluster - url: /merlin6/slurm-access.html - title: Slurm Basic Commands url: /merlin6/slurm-basics.html - title: Running Batch Scripts @@ -49,27 +51,25 @@ entries: url: /merlin6/interactive-jobs.html - title: Slurm Examples url: /merlin6/slurm-examples.html - - title: Slurm Configuration - url: /merlin6/slurm-configuration.html - title: Monitoring url: /merlin6/monitoring.html - - title: Slurm CPU 'merlin6' + - title: Merlin6 CPU Slurm cluster folderitems: - - title: Introduction - url: /merlin6/introduction.html - - title: Hardware And Software Description + - title: Using Slurm - merlin6 + url: /merlin6/slurm-configuration.html + - title: HW/SW description url: /merlin6/hardware-and-software.html - - title: Slurm GPU 'gmerlin6' + - title: Merlin6 GPU Slurm cluster folderitems: - - title: Introduction - url: /gmerlin6/introduction.html - - title: Hardware And Software Description + - title: Using Slurm - gmerlin6 + url: /gmerlin6/slurm-configuration.html + - title: HW/SW description url: /gmerlin6/hardware-and-software.html - - title: Slurm CPU 'merlin5' + - title: Merlin5 CPU Slurm cluster folderitems: - - title: Introduction - url: /merlin5/introduction.html - - title: Hardware And Software Description + - title: Using Slurm - merlin5 + url: /merlin5/slurm-configuration.html + - title: HW/SW description url: /merlin5/hardware-and-software.html - title: Jupyterhub folderitems: diff --git a/pages/gmerlin6/introduction.md b/pages/gmerlin6/cluster-introduction.md similarity index 94% rename from pages/gmerlin6/introduction.md rename to pages/gmerlin6/cluster-introduction.md index 72383bb..ef8e5d1 100644 --- a/pages/gmerlin6/introduction.md +++ b/pages/gmerlin6/cluster-introduction.md @@ -5,10 +5,7 @@ title: Introduction last_updated: 28 June 2019 #summary: "GPU Merlin 6 cluster overview" sidebar: merlin6_sidebar -permalink: /gmerlin6/introduction.html -redirect_from: - - /gmerlin6 - - /gmerlin6/index.html +permalink: /gmerlin6/cluster-introduction.html --- ## About Merlin6 GPU cluster diff --git a/pages/gmerlin6/slurm-configuration.md b/pages/gmerlin6/slurm-configuration.md new file mode 100644 index 0000000..4a7731f --- /dev/null +++ b/pages/gmerlin6/slurm-configuration.md @@ -0,0 +1,192 @@ +--- +title: Slurm cluster 'gmerlin6' +#tags: +keywords: configuration, partitions, node definition, gmerlin6 +last_updated: 29 January 2021 +summary: "This document describes a summary of the Slurm 
'configuration." +sidebar: merlin6_sidebar +permalink: /gmerlin6/slurm-configuration.html +--- + +This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster. + +## Merlin6 GPU nodes definition + +The table below shows a summary of the hardware setup for the different GPU nodes + +| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs | +|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :--------: | :-------: | :-------: | +| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080** | 1 | 2 | +| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080** | 1 | 4 | +| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_gtx_1080_ti** | 1 | 4 | +| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **geforce_rtx_2080_ti** | 1 | 4 | +| merlin-g-014 | 1 core | 48 cores | 1 | 4000 | 360448 | 360448 | 10000 | **geforce_rtx_2080_ti** | 1 | 8 | +| merlin-g-100 | 1 core | 128 cores | 2 | 3900 | 998400 | 998400 | 10000 | **A100** | 1 | 8 | + +{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' and '/etc/slurm/slurm.conf' for changes in the GPU type and details of the hardware. +{{site.data.alerts.end}} + +## Running jobs in the 'gmerlin6' cluster + +In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster. + +### Merlin6 GPU cluster + +To run jobs in the **`gmerlin6`** cluster users **must** specify the cluster name in Slurm: + +```bash +#SBATCH --cluster=gmerlin6 +``` + +### Merlin6 GPU partitions + +Users might need to specify the Slurm partition. If no partition is specified, it will default to **`gpu`**: + +```bash +#SBATCH --partition= # Possible values: gpu, gpu-short, gwendolen +``` + +The table below resumes shows all possible partitions available to users: + +| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* | +|:-----------------: | :----------: | :------: | :-----------------: | :--------------: | +| **gpu** | 1 day | 1 week | 1 | 1 | +| **gpu-short** | 2 hours | 2 hours | 1000 | 500 | +| **gwendolen** | 1 hour | 12 hours | 1000 | 1000 | + +**\***The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority +partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU +partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lesser priority. + +**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partition with lower *PriorityTier* value +and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values. + +### Merlin6 GPU Accounts + +Users might need to specify the Slurm account to be used. If no account is specified, the **`merlin`** **account** will be used as default: + +```bash +#SBATCH --account=merlin # Possible values: merlin, gwendolen_public, gwendolen +``` + +Not all accounts can be used on all partitions. 
This is summarized in the table below:
+
+| Slurm Account        | Slurm Partitions  |
+|:-------------------: | :---------------: |
+| **merlin**           | `gpu`,`gpu-short` |
+| **gwendolen_public** | `gwendolen`       |
+| **gwendolen**        | `gwendolen`       |
+
+By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts.
+The `gwendolen` **account** is only available to a small set of users; other users must use `gwendolen_public` instead.
+
+For running jobs in the `gwendolen` **partition**, users must specify one of the `gwendolen_public` or `gwendolen` accounts.
+The `merlin` account is not allowed in the `gwendolen` partition.
+
+### GPU specific options
+
+Some options are available when using GPUs. These are detailed here.
+
+#### Number of GPUs and type
+
+When using the GPU cluster, users **must** specify the number of GPUs they need to use:
+
+```bash
+#SBATCH --gpus=[<type>:]<number>
+```
+
+The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU.
+The possible `<type>` values and the maximum `<number>` of GPUs depend on the node.
+This is detailed in the table below.
+
+| Nodes              | GPU Type              | #GPUs |
+|:------------------:| :-------------------: | :---: |
+| merlin-g-[001]     | `geforce_gtx_1080`    | 2     |
+| merlin-g-[002-005] | `geforce_gtx_1080`    | 4     |
+| merlin-g-[006-009] | `geforce_gtx_1080_ti` | 4     |
+| merlin-g-[010-013] | `geforce_rtx_2080_ti` | 4     |
+| merlin-g-014       | `geforce_rtx_2080_ti` | 8     |
+| merlin-g-100       | `A100`                | 8     |
+
+#### Other GPU options
+
+Alternative Slurm options for GPU-based jobs are available. Please refer to the **man** pages
+for each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
+The most common settings are listed below:
+
+```bash
+#SBATCH --ntasks=<num_tasks>
+#SBATCH --ntasks-per-gpu=<num_tasks>
+#SBATCH --mem-per-gpu=<memory>
+#SBATCH --cpus-per-gpu=<num_cpus>
+#SBATCH --gpus-per-node=[<type>:]<number>
+#SBATCH --gpus-per-socket=[<type>:]<number>
+#SBATCH --gpus-per-task=[<type>:]<number>
+#SBATCH --gpu-bind=[verbose,]<type>
+```
+
+Please notice that, once `[<type>:]` is defined in one of these options, all other options must use it too!
+
+## User and job limits
+
+The GPU cluster contains some basic user and job limits to ensure that a single user cannot abuse the resources, and to guarantee a fair usage of the cluster.
+The limits are described below.
+
+### Per job limits
+
+These are limits applying to a single job. In other words, there is a maximum of resources a single job can use.
+Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)`
+(the list of possible `SlurmQoS` values can be shown with the command `sacctmgr show qos`):
+
+| Partition     | Slurm Account      | Mon-Sun 0h-24h                               |
+|:-------------:| :----------------: | :------------------------------------------: |
+| **gpu**       | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
+| **gpu-short** | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
+| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
+| **gwendolen** | `gwendolen`        | No limits, full access granted               |
+
+* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
+(default account) cannot use more than 40 CPUs, more than 8 GPUs or more than 200GB.
+Any job exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
+Unlike the CPU **daily** partition, there is no other QoS that temporarily overrides these job limits during the week.
+Hence, such a job needs to be cancelled, and the requested resources must be adapted according to the above resource limits.
+
+* The **gwendolen** partition is a special partition with a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
+Public access is possible through the `gwendolen_public` account; however, it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job.
+For full access, the `gwendolen` account is needed, and this is restricted to a set of users.
+
+### Per user limits for GPU partitions
+
+These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use.
+Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)`
+(the list of possible `SlurmQoS` values can be shown with the command `sacctmgr show qos`):
+
+| Partition     | Slurm Account      | Mon-Sun 0h-24h                                  |
+|:-------------:| :----------------: | :---------------------------------------------: |
+| **gpu**       | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
+| **gpu-short** | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
+| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
+| **gwendolen** | `gwendolen`        | No limits, full access granted                  |
+
+* With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB.
+Jobs sent by any user already exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
+In that case, the jobs will wait in the queue until some of the running resources are freed.
+
+* Notice that user limits are wider than job limits. This way, a user can run, for example, up to two 8-GPU jobs, or up to four 4-GPU jobs.
+Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same
+type of GPU.
+
+## Advanced Slurm configuration
+
+Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
+Slurm has been installed in a **multi-clustered** configuration, which allows integrating multiple clusters in the same batch system.
+
+To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:
+
+* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
+* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
+* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
+
+The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
+Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
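+
+## Example: a basic 'gmerlin6' batch script
+
+The sketch below puts the options described in this page together into a minimal batch script.
+The job name, time limit, GPU type, resource sizes and the `my_gpu_app` command are only placeholders
+(not part of the cluster configuration) and must be adapted to your own workload:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=gpu-example       # Placeholder job name
+#SBATCH --cluster=gmerlin6           # Mandatory: submit to the gmerlin6 GPU cluster
+#SBATCH --partition=gpu              # Possible values: gpu, gpu-short, gwendolen
+#SBATCH --account=merlin             # Possible values: merlin, gwendolen_public, gwendolen
+#SBATCH --time=1-00:00:00            # Must fit within the partition time limits
+#SBATCH --gpus=geforce_rtx_2080_ti:2 # Optional GPU type, followed by the number of GPUs
+#SBATCH --cpus-per-gpu=5             # CPUs allocated per GPU
+#SBATCH --mem-per-gpu=16000          # Memory (in MB) allocated per GPU
+#SBATCH --output=%x-%j.out           # Output file named after the job name and job ID
+
+# Placeholder: load the modules required by your application (e.g. a CUDA module)
+# module load cuda
+
+# Placeholder for the actual GPU application
+srun my_gpu_app
+```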
diff --git a/pages/merlin5/introduction.md b/pages/merlin5/cluster-introduction.md similarity index 95% rename from pages/merlin5/introduction.md rename to pages/merlin5/cluster-introduction.md index 40ccfab..93ba2de 100644 --- a/pages/merlin5/introduction.md +++ b/pages/merlin5/cluster-introduction.md @@ -5,10 +5,7 @@ title: Cluster 'merlin5' last_updated: 07 April 2021 #summary: "Merlin 5 cluster overview" sidebar: merlin6_sidebar -permalink: /merlin5/introduction.html -redirect_from: - - /merlin5 - - /merlin5/index.html +permalink: /merlin5/cluster-introduction.html --- ## Slurm 'merlin5' cluster diff --git a/pages/merlin5/slurm-configuration.md b/pages/merlin5/slurm-configuration.md new file mode 100644 index 0000000..839ac1d --- /dev/null +++ b/pages/merlin5/slurm-configuration.md @@ -0,0 +1,240 @@ +--- +title: Slurm Configuration +#tags: +keywords: configuration, partitions, node definition +last_updated: 20 May 2021 +summary: "This document describes a summary of the Merlin5 Slurm configuration." +sidebar: merlin6_sidebar +permalink: /merlin5/slurm-configuration.html +--- + +This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin5 cluster. + +The Merlin5 cluster is an old cluster with old hardware which is maintained in a best effort for increasing the CPU power of the Merlin cluster. + +## Merlin5 CPU nodes definition + +The following table show default and maximum resources that can be used per node: + +| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node | Max.Swap | +|:----------------:| ---------:| :--------:| :------: | :----------: | :-------:| +| merlin-c-[18-30] | 1 core | 16 cores | 1 | 60000 | 10000 | +| merlin-c-[31-32] | 1 core | 16 cores | 1 | 124000 | 10000 | +| merlin-c-[33-45] | 1 core | 16 cores | 1 | 60000 | 10000 | +| merlin-c-[46-47] | 1 core | 16 cores | 1 | 124000 | 10000 | + +There is one main difference between the Merlin5 and Merlin6 clusters: Merlin5 is keeping an old configuration which does not +consider the memory as a *consumable resource*. Hence, users can *oversubscribe* memory. This might trigger some side-effects, but +this legacy configuration has been kept to ensure that old jobs can keep running in the same way they did a few years ago. +If you know that this might be a problem for you, please, always use Merlin6 instead. + + +## Running jobs in the 'merlin5' cluster + +In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster. + +### Merlin5 CPU cluster + +To run jobs in the **`merlin5`** cluster users **must** specify the cluster name in Slurm: + +```bash +#SBATCH --cluster=merlin5 +``` + +### CPU partitions + +Users might need to specify the Slurm partition. If no partition is specified, it will default to **`merlin`**: + +```bash +#SBATCH --partition= # Possible values: merlin, merlin-long: +``` + +The table below resumes shows all possible partitions available to users: + +| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | +|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: | +| **merlin** | 5 days | 1 week | All nodes | 500 | 1 | +| **merlin-long** | 5 days | 21 days | 4 | 1 | 1 | + +**\***The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). 
In other words, jobs sent to higher priority +partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU +partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lesser priority. + +**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partition with lower *PriorityTier* value +and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values. + +The `merlin-long` partition, as it might contain jobs running for up to 21 days, is limited to 4 nodes. + +### Merlin5 CPU Accounts + +Users need to ensure that the **`merlin`** **account** is specified (or no account is specified). +This is the unique account available in the **merlin5** cluster. +This is mostly needed by users which have multiple Slurm accounts, which may defined by mistake a different account existing in +one of the other Merlin clusters (i.e. `merlin6`, `gmerlin6`). + +```bash +#SBATCH --account=merlin # Possible values: merlin +``` + +### Merlin5 CPU specific options + +Some options are available when using CPUs. These are detailed here. + +Alternative Slurm options for CPU based jobs are available. Please refer to the **man** pages +for each Slurm command for further information about it (`man salloc`, `man sbatch`, `man srun`). +Below are listed the most common settings: + +```bash +#SBATCH --ntasks= +#SBATCH --ntasks-per-core= +#SBATCH --ntasks-per-socket= +#SBATCH --ntasks-per-node= +#SBATCH --mem= +#SBATCH --mem-per-cpu= +#SBATCH --cpus-per-task= +#SBATCH --cpu-bind=[{quiet,verbose},] # only for 'srun' command +``` + +Notice that in **Merlin5** no hyper-threading is available (while in **Merlin6** it is). +Hence, in **Merlin5** there is not need to specify `--hint` hyper-threading related options. + +## User and job limits + +In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to +avoid overabuse of the resources from a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (in example, +pending jobs from a single user while having many idle nodes due to low overall activity is something that can be seen when user limits are applied). +In the same way, these limits can be also used to improve the efficiency of the cluster (in example, without any job size limits, a job requesting all +resources from the batch system would drain the entire cluster for fitting the job, which is undesirable). + +Hence, there is a need of setting up wise limits and to ensure that there is a fair usage of the resources, by trying to optimize the overall efficiency +of the cluster while allowing jobs of different nature and sizes (it is, **single core** based **vs parallel jobs** of different sizes) to run. + +{{site.data.alerts.warning}}Wide limits are provided in the daily and hourly partitions, while for general those limits are +more restrictive. +
However, we kindly ask users to inform the Merlin administrators when there are plans to send big jobs which would require a +massive draining of nodes for allocating such jobs. This would apply to jobs requiring the unlimited QoS (see below "Per job limits") +{{site.data.alerts.end}} + +{{site.data.alerts.tip}}If you have different requirements, please let us know, we will try to accomodate or propose a solution for you. +{{site.data.alerts.end}} + +#### Per job limits + +These are limits which apply to a single job. In other words, there is a maximum of resources a single job can use. This is described in the table below, +and limits will vary depending on the day of the week and the time (*working* vs *non-working* hours). Limits are shown in format: `SlurmQoS(limits)`, +where `SlurmQoS` can be seen with the command `sacctmgr show qos`: + +| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h | +|:----------: | :------------------: | :------------: | :---------------------: | +| **general** | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | +| **daily** | daytime(cpu=704,mem=2750G) | nighttime(cpu=1408,mem=5500G) | unlimited(cpu=2200,mem=8593.75G) | +| **hourly** | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | + +By default, a job can not use more than 704 cores (max CPU per job). In the same way, memory is also proportionally limited. This is equivalent as +running a job using up to 8 nodes at once. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours). +Limits are softed for the **daily** partition during non working hours, and during the weekend limits are even wider. + +For the **hourly** partition, **despite running many parallel jobs is something not desirable** (for allocating such jobs it requires massive draining of nodes), +wider limits are provided. In order to avoid massive nodes drain in the cluster, for allocating huge jobs, setting per job limits is necessary. Hence, **unlimited** QoS +mostly refers to "per user" limits more than to "per job" limits (in other words, users can run any number of hourly jobs, but the job size for such jobs is limited +with wide values). + +#### Per user limits for CPU partitions + +These limits which apply exclusively to users. In other words, there is a maximum of resources a single user can use. This is described in the table below, +and limits will vary depending on the day of the week and the time (*working* vs *non-working* hours). Limits are shown in format: `SlurmQoS(limits)`, +where `SlurmQoS` can be seen with the command `sacctmgr show qos`: + +| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h | +|:-----------:| :----------------: | :------------: | :---------------------: | +| **general** | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | +| **daily** | daytime(cpu=1408,mem=5500G) | nighttime(cpu=2112,mem=8250G) | unlimited(cpu=6336,mem=24750G) | +| **hourly** | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G)| unlimited(cpu=6336,mem=24750G) | + +By default, users can not use more than 704 cores at the same time (max CPU per user). Memory is also proportionally limited in the same way. This is +equivalent to 8 exclusive nodes. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours). 
+For the **hourly** partition, there are no limits restriction and user limits are removed. Limits are softed for the **daily** partition during non +working hours, and during the weekend limits are removed. + +## Merlin6 GPU + +Basic configuration for the **merlin5 GPUs** will be detailed here. +For advanced usage, please refer to [Understanding the Slurm configuration (for advanced users)](/merlin5/slurm-configuration.html#understanding-the-slurm-configuration-for-advanced-users) + +### GPU nodes definition + +| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs | +|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :--------: | :-------: | :-------: | +| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 2 | +| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 4 | +| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080Ti** | 1 | 4 | +| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **RTX2080Ti** | 1 | 4 | + +{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' for changes in the GPU type and details of the NUMA node. +{{site.data.alerts.end}} + +### GPU partitions + +| GPU Partition | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor\* | +|:-----------------: | :----------: | :------: | :-------: | :------: | :-----------------: | +| **gpu** | 1 day | 1 week | 4 | low | 1 | +| **gpu-short** | 2 hours | 2 hours | 4 | highest | 1000 | + +\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority +partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU +partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lesser priority. + +### User and job limits + +The GPU cluster contains some basic user and job limits to ensure that a single user can not overabuse the resources and a fair usage of the cluster. +The limits are described below. + +#### Per job limits + +These are limits applying to a single job. In other words, there is a maximum of resources a single job can use. +Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format: `SlurmQoS(limits)`, +(list of possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`): + +| Partition | Mon-Sun 0h-24h | +|:-------------:| :------------------------------------: | +| **gpu** | gpu_week(cpu=40,gres/gpu=8,mem=200G) | +| **gpu-short** | gpu_week(cpu=40,gres/gpu=8,mem=200G) | + +With these limits, a single job can not use more than 40 CPUs, more than 8 GPUs or more than 200GB. +Any job exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**. +Since there are no more existing QoS during the week temporary overriding job limits (this happens for instance in the CPU **daily** partition), the job needs to be cancelled, and the requested resources must be adapted according to the above resource limits. + +#### Per user limits for CPU partitions + +These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use. 
+Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)`
+(the list of possible `SlurmQoS` values can be shown with the command `sacctmgr show qos`):
+
+| Partition     | Mon-Sun 0h-24h                         |
+|:-------------:| :------------------------------------: |
+| **gpu**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)  |
+| **gpu-short** | gpu_week(cpu=80,gres/gpu=16,mem=400G)  |
+
+With these limits, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB.
+Jobs sent by any user already exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**. In that case, the jobs will wait in the queue until some of the running resources are freed.
+
+Notice that user limits are wider than job limits. This way, a user can run, for example, up to two 8-GPU jobs, or up to four 4-GPU jobs.
+Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same
+type of GPU.
+
+## Understanding the Slurm configuration (for advanced users)
+
+Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
+Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.
+
+Slurm has been installed in a **multi-clustered** configuration, which allows integrating multiple clusters in the same batch system.
+
+To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:
+
+* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
+* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
+* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
+
+The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
+Configuration files for the old **merlin5** cluster must be checked directly on any of the **merlin5** computing nodes: these are not propagated
+to the login nodes.
diff --git a/pages/merlin6/01 introduction/cluster-introduction.md b/pages/merlin6/01 introduction/cluster-introduction.md
new file mode 100644
index 0000000..a82b88b
--- /dev/null
+++ b/pages/merlin6/01 introduction/cluster-introduction.md
@@ -0,0 +1,27 @@
+---
+title: Introduction
+#tags:
+#keywords:
+last_updated: 28 June 2019
+#summary: "Merlin 6 cluster overview"
+sidebar: merlin6_sidebar
+permalink: /merlin6/cluster-introduction.html
+---
+
+## Slurm clusters
+
+* The new Slurm CPU cluster is called [**`merlin6`**](/merlin6/cluster-introduction.html).
+* The new Slurm GPU cluster is called [**`gmerlin6`**](/gmerlin6/cluster-introduction.html).
+* The old Slurm *merlin* cluster is still active and best-effort support is provided.
+The cluster was renamed to [**merlin5**](/merlin5/cluster-introduction.html).
+
+From July 2019, **`merlin6`** became the **default Slurm cluster**, and any job submitted from the login nodes will be submitted to that cluster if no other cluster is specified.
+* Users can keep submitting to the old *`merlin5`* computing nodes by using the option ``--cluster=merlin5``.
+* Users submitting to the **`gmerlin6`** GPU cluster need to specify the option ``--cluster=gmerlin6``.
+
+### Slurm 'merlin6'
+
+**CPU nodes** are configured in a **Slurm** cluster, called **`merlin6`**, and
+this is the _**default Slurm cluster**_. Hence, by default, if no Slurm cluster is
+specified (with the `--cluster` option), this will be the cluster to which the jobs
+will be sent.
diff --git a/pages/merlin6/01 introduction/introduction.md b/pages/merlin6/01 introduction/introduction.md
index 7bf8926..81d65b8 100644
--- a/pages/merlin6/01 introduction/introduction.md
+++ b/pages/merlin6/01 introduction/introduction.md
@@ -11,7 +11,12 @@ redirect_from:
   - /merlin6/index.html
 ---
 
-## About Merlin6
+## The Merlin local HPC cluster
+
+Historically, the local HPC clusters at PSI were named Merlin. Over the years,
+multiple generations of Merlin have been deployed.
+
+### Merlin6
 
 Merlin6 is a the official PSI Local HPC cluster for development and mission-critical applications that has been built in 2019. It replaces
@@ -22,25 +27,26 @@ more compute nodes and cluster storage without significant increase of
 the costs of the manpower and the operations.
 
 Merlin6 is mostly based on **CPU** resources, but also contains a small amount
-of **GPU**-based resources which are mostly used by the BIO experiments.
+of **GPU**-based resources which are mostly used by the BIO Division and Deep Learning projects:
+* The Merlin6 CPU nodes are in a dedicated Slurm cluster called [**`merlin6`**](/merlin6/slurm-configuration.html).
+  * This is the default Slurm cluster configured in the login nodes, and any job submitted without the option `--cluster` will be submitted to this cluster.
+* The Merlin6 GPU resources are in a dedicated Slurm cluster called [**`gmerlin6`**](/gmerlin6/slurm-configuration.html).
+  * Users submitting to the **`gmerlin6`** GPU cluster need to specify the option ``--cluster=gmerlin6``.
 
-### Slurm 'merlin6'
+### Merlin5
 
-**CPU nodes** are configured in a **Slurm** cluster, called **`merlin6`**, and
-this is the _**default Slurm cluster**_. Hence, by default, if no Slurm cluster is
-specified (with the `--cluster` option), this will be the cluster to which the jobs
-will be sent.
+The old Slurm **CPU** *merlin* cluster is still active and is maintained on a best-effort basis.
+* The Merlin5 CPU cluster is called [**merlin5**](/merlin5/slurm-configuration.html).
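+
+The sketch below illustrates how the target cluster is selected at submission time. The script name
+`myscript.batch` is just a placeholder for your own batch script:
+
+```bash
+# merlin6 is the default cluster, so no --cluster option is needed
+sbatch myscript.batch
+
+# Submitting the same script to the gmerlin6 GPU cluster or to the old merlin5 cluster
+sbatch --cluster=gmerlin6 myscript.batch
+sbatch --cluster=merlin5 myscript.batch
+```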
-## Merlin6 Architecture +## Merlin Architecture -### Merlin6 Cluster Architecture Diagram +The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters: + +![Merlin6 Slurm Architecture Design]({{ "/images/merlin-slurm-architecture.png" }}) + +### Merlin6 Architecture Diagram The following image shows the Merlin6 cluster architecture diagram: ![Merlin6 Architecture Diagram]({{ "/images/merlinschema3.png" }}) -### Merlin5 + Merlin6 Slurm Cluster Architecture Design - -The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters: - -![Merlin6 Slurm Architecture Design]({{ "/images/merlin-slurm-architecture.png" }}) diff --git a/pages/merlin6/02 accessing-merlin6/accessing-interactive-nodes.md b/pages/merlin6/02 accessing-merlin6/accessing-interactive-nodes.md index 7cfbf93..5cb393a 100644 --- a/pages/merlin6/02 accessing-merlin6/accessing-interactive-nodes.md +++ b/pages/merlin6/02 accessing-merlin6/accessing-interactive-nodes.md @@ -2,31 +2,13 @@ title: Accessing Interactive Nodes #tags: #keywords: -last_updated: 13 June 2019 +last_updated: 20 May 2021 #summary: "" sidebar: merlin6_sidebar permalink: /merlin6/interactive.html --- - -## Login nodes description - -The Merlin6 login nodes are the official machines for accessing the recources of Merlin6. -From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software. - -The Merlin6 login nodes are the following: - -| Hostname | SSH | NoMachine | #cores | #Threads | CPU | Memory | Scratch | Scratch Mountpoint | -| ------------------- | --- | --------- | ------ |:--------:| :-------------------- | ------ | ---------- | :------------------ | -| merlin-l-001.psi.ch | yes | yes | 2 x 22 | 2 | Intel Xeon Gold 6152 | 384GB | 1.8TB NVMe | ``/scratch`` | -| merlin-l-002.psi.ch | yes | yes | 2 x 22 | 2 | Intel Xeon Gold 6142 | 384GB | 1.8TB NVMe | ``/scratch`` | -| merlin-l-01.psi.ch | yes | - | 2 x 16 | 2 | Intel Xeon E5-2697Av4 | 512GB | 100GB SAS | ``/scratch`` | - ---- - -## Remote Access - -### SSH Access +## SSH Access For interactive command shell access, use an SSH client. We recommend to activate SSH's X11 forwarding to allow you to use graphical applications (e.g. a text editor, but for more performant graphical access, refer to the sections below). X applications are supported @@ -38,26 +20,37 @@ in the login nodes and X11 forwarding can be used for those users who have prope * PSI desktop configuration issues must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*. * Ticket will be redirected to the corresponding Desktop support group (Windows, Linux). -#### Accessing from a Linux client +### Accessing from a Linux client -Refer to [{Accessing Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html) for **Linux** SSH client and X11 configuration. +Refer to [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html) for **Linux** SSH client and X11 configuration. -#### Accessing from a Windows client +### Accessing from a Windows client -Refer to [{Accessing Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html) for **Windows** SSH client and X11 configuration. +Refer to [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html) for **Windows** SSH client and X11 configuration. 
-#### Accessing from a MacOS client
+### Accessing from a MacOS client
 
-Refer to [{Accessing Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html) for **MacOS** SSH client and X11 configuration.
+Refer to [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html) for **MacOS** SSH client and X11 configuration.
 
-### Graphical access using **NoMachine** client
+## NoMachine Remote Desktop Access
 
-X applications are supported in the login nodes and can run efficiently through a **NoMachine** client. This is the officially supported way to run more demanding X applications on Merlin6. The client software can be downloaded from [the Nomachine Website](https://www.nomachine.com/product&p=NoMachine%20Enterprise%20Client).
+X applications are supported in the login nodes and can run efficiently through a **NoMachine** client. This is the officially supported way to run more demanding X applications on Merlin6.
+* For PSI Windows workstations, this can be installed from the Software Kiosk as 'NX Client'. If you have difficulties installing, please request support through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
+* For other workstations, the client software can be downloaded from the [NoMachine website](https://www.nomachine.com/product&p=NoMachine%20Enterprise%20Client).
 
-* Install the NoMachine client locally. For PSI windows machines, this can be installed from the Software Kiosk as 'NX Client'. If you have difficulties installing, please request support through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
-* Configure a new connection in no machine to either `merlin-l-001.psi.ch` or `merlin-l-002.psi.ch`. The 'NX' protocol is recommended. Login nodes are available from the PSI network or through VPN.
-* You can also connect via the photo science division's `rem-acc.psi.ch` jump point. After connecting you will be presented with options to jump to the merlin login nodes. This can be accessed remotely without VPN.
-* NoMachine *client configuration* and *connectivity* for Merlin6 is fully supported by Merlin6 administrators.
-  * Please contact us through the official channels on any configuration issue with NoMachine.
----
+### Configuring NoMachine
+
+Refer to [{How To Use Merlin -> Remote Desktop Access}](/merlin6/nomachine.html) for further instructions on how to configure the NoMachine client and how to access it from PSI and from outside PSI.
+
+## Login nodes hardware description
+
+The Merlin6 login nodes are the official machines for accessing the resources of Merlin6.
+From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software.
+ +The Merlin6 login nodes are the following: + +| Hostname | SSH | NoMachine | #cores | #Threads | CPU | Memory | Scratch | Scratch Mountpoint | +| ------------------- | --- | --------- | ------ |:--------:| :-------------------- | ------ | ---------- | :------------------ | +| merlin-l-001.psi.ch | yes | yes | 2 x 22 | 2 | Intel Xeon Gold 6152 | 384GB | 1.8TB NVMe | ``/scratch`` | +| merlin-l-002.psi.ch | yes | yes | 2 x 22 | 2 | Intel Xeon Gold 6142 | 384GB | 1.8TB NVMe | ``/scratch`` | +| merlin-l-01.psi.ch | yes | - | 2 x 16 | 2 | Intel Xeon E5-2697Av4 | 512GB | 100GB SAS | ``/scratch`` | diff --git a/pages/merlin6/02 accessing-merlin6/accessing-slurm.md b/pages/merlin6/02 accessing-merlin6/accessing-slurm.md index 057615a..3fe71ce 100644 --- a/pages/merlin6/02 accessing-merlin6/accessing-slurm.md +++ b/pages/merlin6/02 accessing-merlin6/accessing-slurm.md @@ -8,48 +8,46 @@ sidebar: merlin6_sidebar permalink: /merlin6/slurm-access.html --- -## The Merlin6 Slurm batch system +## The Merlin Slurm clusters -Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs. -Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has been also configured with this batch system. +Merlin contains a multi-cluster setup, where multiple Slurm clusters coexist under the same umbrella. +It basically contains the following clusters: -Slurm has been installed in a **multi-clustered** configuration, allowing to integrate multiple clusters in the same batch system. -* Two different Slurm clusters exist: **merlin5** and **merlin6**. - * **merlin5** is a cluster with very old hardware (out-of-warranty). - * **merlin5** will exist as long as hardware incidents are soft and easy to repair/fix (i.e. hard disk replacement) - * **merlin6** is the default cluster when running Slurm commands (i.e. sinfo). +* The **Merlin6 Slurm CPU cluster**, which is called [**`merlin6`**](/merlin6/slurm-access.html#merlin6-cpu-cluster-access). +* The **Merlin6 Slurm GPU cluster**, which is called [**`gmerlin6`**](/merlin6/slurm-access.html#merlin6-gpu-cluster-access). +* The *old Merlin5 Slurm CPU cluster*, which is called [**`merlin5`**](/merlin6/slurm-access.html#merlin5-cpu-cluster-access), still supported in a best effort basis. -Please follow the section **Merlin6 Slurm** for more details about configuration and job submission. +## Accessing the Slurm clusters -### Merlin5 Access +Any job submission must be performed from a **Merlin login node**. Please refer to the [**Accessing the Interactive Nodes documentation**](/merlin6/interactive.html) +for further information about how to access the cluster. -Keeping the **merlin5** cluster will allow running jobs in the old computing nodes until users have fully migrated their codes to the new cluster. +In addition, any job *must be submitted from a high performance storage area visible by the login nodes and by the computing nodes*. For this, the possible storage areas are the following: +* `/data/user` +* `/data/project` +* `/shared-scratch` +Please, avoid using `/psi/home` directories for submitting jobs. -From July 2019, **merlin6** becomes the **default cluster**. However, users can keep submitting to the old **merlin5** computing nodes by using -the option ``--cluster=merlin5`` and using the corresponding Slurm partition with ``--partition=merlin``. 
In example:
 
-```bash
-#SBATCH --clusters=merlin6
-```
+### Merlin6 CPU cluster access
 
-Example of how to run a simple command:
+The **Merlin6 CPU cluster** (**`merlin6`**) is the default cluster configured in the login nodes. Any job submission will use this cluster by default, unless
+the option `--cluster` is specified with one of the other existing clusters.
 
-```bash
-srun --clusters=merlin5 --partition=merlin hostname
-sbatch --clusters=merlin5 --partition=merlin myScript.batch
-```
+For further information about how to use this cluster, please visit the [**Merlin6 CPU Slurm Cluster documentation**](/merlin6/slurm-configuration.html).
 
-### Merlin6 Access
+### Merlin6 GPU cluster access
 
-In order to run jobs on the **Merlin6** cluster, you need to specify the following option in your batch scripts:
+The **Merlin6 GPU cluster** (**`gmerlin6`**) is visible from the login nodes. However, to submit jobs to this cluster, one needs to specify the option `--cluster=gmerlin6` when submitting a job or allocation.
 
-```bash
-#SBATCH --clusters=merlin6
-```
+For further information about how to use this cluster, please visit the [**Merlin6 GPU Slurm Cluster documentation**](/gmerlin6/slurm-configuration.html).
 
-Example of how to run a simple command:
+### Merlin5 CPU cluster access
 
-```bash
-srun --clusters=merlin6 hostname
-sbatch --clusters=merlin6 myScript.batch
-```
+The **Merlin5 CPU cluster** (**`merlin5`**) is visible from the login nodes. However, to submit jobs
+to this cluster, one needs to specify the option `--cluster=merlin5` when submitting a job or allocation.
+
+Using this cluster is in general not recommended; however, it is still available for existing users needing
+extra computational resources or longer jobs. Keep in mind that this cluster is only supported on a
+**best effort basis**, and it contains very old hardware and configurations.
+
+For further information about how to use this cluster, please visit the [**Merlin5 CPU Slurm Cluster documentation**](/merlin5/slurm-configuration.html).
diff --git a/pages/merlin6/03 Job Submission/slurm-configuration.md b/pages/merlin6/03 Job Submission/slurm-configuration.md
index e1c39bb..7dc4e44 100644
--- a/pages/merlin6/03 Job Submission/slurm-configuration.md
+++ b/pages/merlin6/03 Job Submission/slurm-configuration.md
@@ -23,7 +23,7 @@ In this documentation is only explained the usage of the **merlin6** Slurm clust
 Basic configuration for the **merlin6 CPUs** cluster will be detailed here.
 For advanced usage, please refer to [Understanding the Slurm configuration (for advanced users)](/merlin6/slurm-configuration.html#understanding-the-slurm-configuration-for-advanced-users)
 
-### CPU nodes definition
+## Merlin6 CPU nodes definition
 
 The following table show default and maximum resources that can be used per node:
 
@@ -120,77 +120,9 @@ equivalent to 8 exclusive nodes. This limit applies to the **general** partition
 For the **hourly** partition, there are no limits restriction and user limits are removed. Limits are softed for the **daily** partition during non
 working hours, and during the weekend limits are removed.
 
-## Merlin6 GPU
-
-Basic configuration for the **merlin6 GPUs** will be detailed here.
-For advanced usage, please refer to [Understanding the Slurm configuration (for advanced users)](/merlin6/slurm-configuration.html#understanding-the-slurm-configuration-for-advanced-users) - -### GPU nodes definition - -| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs | -|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :--------: | :-------: | :-------: | -| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 2 | -| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 4 | -| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080Ti** | 1 | 4 | -| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **RTX2080Ti** | 1 | 4 | - -{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' for changes in the GPU type and details of the NUMA node. -{{site.data.alerts.end}} - -### GPU partitions - -| GPU Partition | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor\* | -|:-----------------: | :----------: | :------: | :-------: | :------: | :-----------------: | -| **gpu** | 1 day | 1 week | 4 | low | 1 | -| **gpu-short** | 2 hours | 2 hours | 4 | highest | 1000 | - -\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority -partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU -partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lesser priority. - -### User and job limits - -The GPU cluster contains some basic user and job limits to ensure that a single user can not overabuse the resources and a fair usage of the cluster. -The limits are described below. - -#### Per job limits - -These are limits applying to a single job. In other words, there is a maximum of resources a single job can use. -Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format: `SlurmQoS(limits)`, -(list of possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`): - -| Partition | Mon-Sun 0h-24h | -|:-------------:| :------------------------------------: | -| **gpu** | gpu_week(cpu=40,gres/gpu=8,mem=200G) | -| **gpu-short** | gpu_week(cpu=40,gres/gpu=8,mem=200G) | - -With these limits, a single job can not use more than 40 CPUs, more than 8 GPUs or more than 200GB. -Any job exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**. -Since there are no more existing QoS during the week temporary overriding job limits (this happens for instance in the CPU **daily** partition), the job needs to be cancelled, and the requested resources must be adapted according to the above resource limits. - -#### Per user limits for CPU partitions - -These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use. -Limits are defined using QoS, and this is usually set at the partition level. 
Limits are described in the table below with the format: `SlurmQoS(limits)`, -(list of possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`): - -| Partition | Mon-Sun 0h-24h | -|:-------------:| :---------------------------------------------------------: | -| **gpu** | gpu_week(cpu=80,gres/gpu=16,mem=400G) | -| **gpu-short** | gpu_week(cpu=80,gres/gpu=16,mem=400G) | - -With these limits, a single user can not use more than 80 CPUs, more than 16 GPUs or more than 400GB. -Jobs sent by any user already exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**. In that case, job can wait in the queue until some of the running resources are freed. - -Notice that user limits are wider than job limits. In that way, a user can run up to two 8 GPUs based jobs, or up to four 4 GPUs based jobs, etc. -Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same -type of GPU. - -## Understanding the Slurm configuration (for advanced users) +## Advanced Slurm configuration Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs. -Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has been also configured with this batch system. - Slurm has been installed in a **multi-clustered** configuration, allowing to integrate multiple clusters in the same batch system. For understanding the Slurm configuration setup in the cluster, sometimes may be useful to check the following files: @@ -200,5 +132,4 @@ For understanding the Slurm configuration setup in the cluster, sometimes may be * ``/etc/slurm/cgroup.conf`` - can be found in the computing nodes, is also propagated to login nodes for user read access. The previous configuration files which can be found in the login nodes, correspond exclusively to the **merlin6** cluster configuration files. -Configuration files for the old **merlin5** cluster must be checked directly on any of the **merlin5** computing nodes: these are not propagated -to the **merlin6** login nodes. +Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (in example, by login in to one of the nodes while a job or an active allocation is running).
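+
+As a sketch of how this can be done, a short interactive job can be used to inspect the configuration
+files of another cluster directly on one of its computing nodes. The partition, GPU request and time
+limit below are only placeholders and should be adapted to your case:
+
+```bash
+# Open an interactive shell on a gmerlin6 computing node (placeholder request: 1 GPU for 10 minutes)
+srun --cluster=gmerlin6 --partition=gpu-short --gpus=1 --time=00:10:00 --pty /bin/bash
+
+# From the allocated node, read the cluster-specific configuration files
+less /etc/slurm/slurm.conf
+less /etc/slurm/gres.conf
+```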