initial formatting changes complete

commit 7db5d0fd05
parent f58c1f57b8
2026-01-06 16:40:15 +01:00
81 changed files with 805 additions and 1112 deletions

@@ -1,12 +1,4 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 29 January 2021
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
---
# Slurm Configuration
This documentation describes the basic Slurm configuration and the options needed to run jobs in the Merlin6 CPU cluster.
@@ -23,11 +15,12 @@ The following table show default and maximum resources that can be used per node
| merlin-c-[313-318] | 1 core | 44 cores | 1 | 748800 | 748800 | 10000 | N/A | N/A |
| merlin-c-[319-324] | 1 core | 44 cores | 2 | 748800 | 748800 | 10000 | N/A | N/A |
If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options, and maximum memory allowed is `Max.Mem/Node`.
In **`merlin6`**, memory is considered a Consumable Resource, as well as the CPU. Hence, both resources are accounted for when submitting a job,
and by default resources can not be oversubscribed. This is a main difference from the old **`merlin5`** cluster, where only CPUs were accounted for
and memory was oversubscribed by default.
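
As a sketch of how these options are used in a job script (the application name and the memory values are placeholders, not recommendations):

```bash
#!/bin/bash
#SBATCH --ntasks=4            # Example: 4 tasks (cores)
#SBATCH --mem-per-cpu=16000   # Request 16000MB per core instead of the default 8GB
##SBATCH --mem=64000          # Alternative: request 64000MB for the whole job
                              # (must stay below Max.Mem/Node)

srun my_application           # 'my_application' is a placeholder
```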
!!! tip "Check Configuration"
@@ -66,12 +59,12 @@ The following *partitions* (also known as *queues*) are configured in Slurm:
| **asa-ansys** | 1 hour | 90 days | unlimited | 1000 | 4 | 15600 |
| **mu3e** | 1 day | 7 days | unlimited | 1000 | 4 | 3712 |
The **PriorityJobFactor** value will be added to the job priority (**PARTITION** column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before trying partitions with lower priority.

Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with a lower **PriorityTier** value
and, if possible, they will preempt running jobs from partitions with lower **PriorityTier** values.
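
To see how these factors contribute to a pending job's priority, the breakdown can be inspected with `sprio` (the job ID below is only an example):

```bash
# Show the per-factor priority breakdown (AGE, FAIRSHARE, PARTITION, ...) for all pending jobs
sprio -l

# Restrict the output to a single job (example job ID)
sprio -l -j 123456
```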
* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
@@ -79,11 +72,18 @@ and, if possible, they will preempt running jobs from partitions with lower *Pri
* **`asa-general`, `asa-daily`, `asa-ansys`, `asa-visas` and `mu3e`** are **private** partitions, belonging to the different experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, these nodes are usually also added to the **`hourly`** partition as extra resources for public users.
!!! tip "Partition Selection"
    Jobs that would run for less than one day should always be sent to
    **daily**, while jobs that would run for less than one hour should be sent
    to **hourly**. This ensures that you have a higher priority than jobs sent
    to partitions with lower priority, and it also matters because **general**
    limits the number of nodes that can be used. The idea behind this is that
    the cluster can not be blocked by long jobs and that resources are always
    available for shorter jobs.
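
As a minimal sketch (assuming a job expected to finish within one hour; the application name is a placeholder), the partition and a matching walltime can be requested as follows:

```bash
#!/bin/bash
#SBATCH --partition=hourly   # Job is expected to finish within one hour
#SBATCH --time=00:55:00      # Walltime kept below the 1 hour partition limit
#SBATCH --ntasks=44          # Example: one full 44-core node

srun my_application          # 'my_application' is a placeholder
```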
### Merlin6 CPU Accounts
Users need to ensure that the public **`merlin`** account is specified. Not specifying any account option will default to this account.
This is mostly relevant for users that have multiple Slurm accounts, who may specify a different account by mistake.
```bash
@@ -100,16 +100,14 @@ Not all the accounts can be used on all partitions. This is resumed in the table
#### Private accounts
* The *`gfa-asa`* and *`mu3e`* accounts are private accounts. These can be used for accessing dedicated partitions with nodes owned by different groups.
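
For instance, a hypothetical submission from a member of the `mu3e` experiment would combine the private account with the matching private partition:

```bash
# Submit to the private mu3e partition using the corresponding private account
sbatch --account=mu3e --partition=mu3e my_job_script.sh   # 'my_job_script.sh' is a placeholder
```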
### Slurm CPU specific options
Alternative Slurm options for CPU based jobs are available. Please refer to the
**man** pages of each Slurm command for further information (`man salloc`,
`man sbatch`, `man srun`). The most common settings are listed below:
```bash
#SBATCH --hint=[no]multithread
@@ -125,8 +123,9 @@ Below are listed the most common settings:
#### Enabling/Disabling Hyper-Threading
The **`merlin6`** cluster contains nodes with Hyper-Threading enabled. One
should always specify whether to use Hyper-Threading or not. If not defined,
Slurm will generally use it (exceptions apply).
```bash
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
@@ -138,7 +137,7 @@ whether to use Hyper-Threading or not. If not defined, Slurm will generally use
Slurm allows defining a set of features in the node definition. These can be used to filter and select nodes according to one or more
specific features. For the CPU nodes, we have the following features:
```text
NodeName=merlin-c-[001-024,101-124,201-224] Features=mem_384gb,xeon-gold-6152
NodeName=merlin-c-[301-312] Features=mem_768gb,xeon-gold-6240r
NodeName=merlin-c-[313-318] Features=mem_768gb,xeon-gold-6240r
@@ -149,26 +148,36 @@ Therefore, users running on `hourly` can select which node they want to use (fat
This is possible by using the option `--constraint=<feature_name>` in Slurm.
Examples:
1. Select nodes with 48 cores only (nodes with [2 x Xeon Gold 6240R](https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html)):
    ```bash
    sbatch --constraint=xeon-gold-6240r ...
    ```
2. Select nodes with 44 cores only (nodes with [2 x Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html)):
    ```bash
    sbatch --constraint=xeon-gold-6152 ...
    ```
3. Select fat memory nodes only:
    ```bash
    sbatch --constraint=mem_768gb ...
    ```
4. Select regular memory nodes only:
    ```bash
    sbatch --constraint=mem_384gb ...
    ```
5. Select fat memory nodes with 48 cores only:
    ```bash
    sbatch --constraint=mem_768gb,xeon-gold-6240r ...
    ```
Detailing exactly which type of nodes you want to use is important. Therefore, for groups with private accounts (`mu3e`, `gfa-asa`) or for
public users running on the `hourly` partition, *constraining nodes by features is recommended*. This becomes even more important when
@@ -178,11 +187,11 @@ having heterogeneous clusters.
In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin6 CPU cluster.
### User and job limits
In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to
avoid overuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user while many nodes sit idle due to low overall activity is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources of the batch system would drain the entire cluster in order to fit the job, which is undesirable).
@@ -190,14 +199,24 @@ Hence, there is a need of setting up wise limits and to ensure that there is a f
of the cluster while allowing jobs of different natures and sizes (that is, **single core** jobs **vs parallel jobs** of different sizes) to run.
!!! warning "Resource Limits"
    Wide limits are provided in the **daily** and **hourly** partitions, while
    for **general** the limits are more restrictive. However, we kindly ask
    users to inform the Merlin administrators when they plan to submit big
    jobs which would require a massive draining of nodes to be allocated.
    This applies to jobs requiring the **unlimited** QoS (see "Per job limits"
    below).
!!! tip "Custom Requirements"
    If you have different requirements, please let us know and we will try to
    accommodate them or propose a solution for you.
#### Per job limits
These are limits which apply to a single job. In other words, there is a
maximum amount of resources a single job can use. Limits are described in the
table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can
be listed with the command `sacctmgr show qos`). Some limits will vary
depending on the day and time of the week.
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|:----------: | :------------------------------: | :------------------------------: | :------------------------------: |
@@ -205,18 +224,29 @@ These are limits which apply to a single job. In other words, there is a maximum
| **daily** | daytime(cpu=704,mem=2750G) | nighttime(cpu=1408,mem=5500G) | unlimited(cpu=2200,mem=8593.75G) |
| **hourly** | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |
By default, a job can not use more than 704 cores (max CPU per job). In the
same way, memory is also proportionally limited. This is equivalent to running
a job using up to 8 nodes at once. This limit applies to the **general**
partition (fixed limit) and to the **daily** partition (only during working
hours).

Limits are relaxed for the **daily** partition during non-working hours, and
during the weekend they are even wider. For the **hourly** partition, wider
limits are provided, **even though running very large parallel jobs is not
desirable** (allocating such jobs requires a massive draining of nodes). In
order to avoid massive node draining in the cluster when allocating huge jobs,
setting per job limits is still necessary. Hence, the **unlimited** QoS mostly
refers to "per user" limits rather than to "per job" limits (in other words,
users can run any number of hourly jobs, but the size of each job is limited
with wide values).
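
The per-job limits actually enforced by each QoS can be checked directly from Slurm. A sketch using standard `sacctmgr` QoS columns (adjust the field list if your Slurm version labels them differently):

```bash
# List all QoS with their per-job TRES limits, per-user TRES limits and wall time limits
sacctmgr show qos format=Name%20,MaxTRES%40,MaxTRESPU%40,MaxWall
```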
#### Per user limits for CPU partitions
These limits apply exclusively to users. In other words, there is a maximum
amount of resources a single user can use. Limits are described in the table
below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be
listed with the command `sacctmgr show qos`). Some limits will vary depending
on the day and time of the week.
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|:-----------:| :----------------------------: | :---------------------------: | :----------------------------: |
@@ -224,15 +254,22 @@ These limits which apply exclusively to users. In other words, there is a maximu
| **daily** | daytime(cpu=1408,mem=5500G) | nighttime(cpu=2112,mem=8250G) | unlimited(cpu=6336,mem=24750G) |
| **hourly** | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G)| unlimited(cpu=6336,mem=24750G) |
By default, users can not use more than 704 cores at the same time (max CPU per
user). Memory is also proportionally limited in the same way. This is
equivalent to 8 exclusive nodes. This limit applies to the **general**
partition (fixed limit) and to the **daily** partition (only during working
hours).

For the **hourly** partition there are no such restrictions and user limits
are removed. Limits are relaxed for the **daily** partition during non-working
hours, and during the weekend they are removed.
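
Similarly, the accounts, partitions and QoS associated with your own user can be listed with `sacctmgr`; a sketch using standard association columns:

```bash
# Show the Slurm associations (cluster, account, partition, QoS) of the current user
sacctmgr show associations user=$USER format=Cluster%12,Account%12,User%12,Partition%12,QOS%40
```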
## Advanced Slurm configuration
Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as
the batch system technology for managing and scheduling jobs. Slurm has been
installed in a **multi-clustered** configuration, which allows multiple
clusters to be integrated in the same batch system.
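
In such a multi-cluster setup, the target cluster can be selected explicitly with the standard `--clusters` (`-M`) option; a short sketch (the script name is a placeholder):

```bash
# Submit a job explicitly to the merlin6 CPU cluster
sbatch --clusters=merlin6 my_job_script.sh

# Check your queued jobs on another cluster (e.g. gmerlin6) from the same login node
squeue --clusters=gmerlin6 -u $USER
```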
To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:
@@ -240,5 +277,10 @@ For understanding the Slurm configuration setup in the cluster, sometimes may be
* ``/etc/slurm/gres.conf`` - found on the GPU nodes; also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - found on the computing nodes; also propagated to the login nodes for user read access.
The configuration files that can be found on the login nodes correspond
exclusively to the **merlin6** cluster configuration files. Configuration
files for the old **merlin5** cluster or for the **gmerlin6** cluster must be
checked directly on one of the **merlin5** or **gmerlin6** computing nodes
(for example, by logging in to one of the nodes while a job or an active
allocation is running).
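
Most of this information can also be queried through Slurm itself instead of reading the files directly; a sketch using standard commands (the node name is taken from the table above):

```bash
# Dump the running Slurm configuration of the cluster you are logged in to
scontrol show config | less

# Show the definition of a specific partition (limits, nodes, priority values)
scontrol show partition daily

# Show the features and resources of a specific node
scontrol show node merlin-c-001
```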