# Slurm Configuration

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin6 CPU cluster.

## Merlin6 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :---------------:| :----------------:| :------------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 352000           | 352000            | 10000         | N/A       | N/A       |
| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 352000           | 352000            | 10000         | N/A       | N/A       |
| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 352000           | 352000            | 10000         | N/A       | N/A       |
| merlin-c-[301-312] | 1 core    | 44 cores  | 2        | 748800           | 748800            | 10000         | N/A       | N/A       |
| merlin-c-[313-318] | 1 core    | 44 cores  | 1        | 748800           | 748800            | 10000         | N/A       | N/A       |
| merlin-c-[319-324] | 1 core    | 44 cores  | 2        | 748800           | 748800            | 10000         | N/A       | N/A       |

If nothing is specified, by default each core will use up to 8GB of memory. More memory can be requested with the `--mem=<mem_in_MB>` or `--mem-per-cpu=<mem_in_MB>` options; the maximum memory allowed is `Max.Mem/Node`.

In **`merlin6`**, memory is considered a Consumable Resource, as is the CPU. Hence, both resources are accounted for when submitting a job, and by default resources can not be oversubscribed. This is a main difference from the old **`merlin5`** cluster, where only CPUs were accounted for and memory was oversubscribed by default.

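As a minimal sketch, a batch script requesting extra memory could look as follows (the values and the `my_app` binary are purely illustrative):

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=16000           # request 16000 MB (~16 GB) for the whole job
# Alternatively, scale memory with the allocated CPUs instead
# (--mem and --mem-per-cpu are mutually exclusive):
##SBATCH --mem-per-cpu=8000   # 8000 MB per CPU; the '##' keeps this line disabled

srun ./my_app                 # 'my_app' is a placeholder for your binary
```
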
!!! tip "Check Configuration"
    Always check `/etc/slurm/slurm.conf` for changes in the hardware.

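The node definitions in the table above can also be queried from Slurm itself, for example:

```bash
# Show resources, features and state of a specific compute node:
scontrol show node merlin-c-001
```
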
### Merlin6 CPU cluster

To run jobs in the **`merlin6`** cluster, users **can optionally** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin6
```

If no cluster name is specified, by default any job will be submitted to this cluster (as this is the main cluster). Hence, this is only necessary when one has to deal with multiple clusters, or when one has defined environment variables which can modify the cluster name.

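When working with multiple clusters, most Slurm client commands accept the same kind of option; a couple of sketched examples (`job.sh` is a placeholder):

```bash
# Submit explicitly to the merlin6 cluster:
sbatch --clusters=merlin6 job.sh

# Check the queue of a specific cluster:
squeue --clusters=merlin6
```
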
### Merlin6 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`general`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: general, daily, hourly
```

The following *partitions* (also known as *queues*) are configured in Slurm:

| CPU Partition       | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* | DefMemPerCPU (MB) |
|:-----------------:  | :----------: | :------: | :-------: | :-----------------: | :--------------: |:-----------------:|
| **<u>general</u>**  | 1 day        | 1 week   | 50        | 1                   | 1                | 4000              |
| **daily**           | 1 day        | 1 day    | 67        | 500                 | 1                | 4000              |
| **hourly**          | 1 hour       | 1 hour   | unlimited | 1000                | 1                | 4000              |
| **asa-general**     | 1 hour       | 2 weeks  | unlimited | 1                   | 2                | 3712              |
| **asa-daily**       | 1 hour       | 1 week   | unlimited | 500                 | 2                | 3712              |
| **asa-visas**       | 1 hour       | 90 days  | unlimited | 1000                | 4                | 3712              |
| **asa-ansys**       | 1 hour       | 90 days  | unlimited | 1000                | 4                | 15600             |
| **mu3e**            | 1 day        | 7 days   | unlimited | 1000                | 4                | 3712              |

\* The **PriorityJobFactor** value is added to the job priority (**PARTITION** column in `sprio -l`). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect that decision). For the GPU partitions, Slurm will also attempt to allocate jobs on higher priority partitions before partitions with lower priority.

\*\* Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower **PriorityTier** values and, if possible, they will preempt running jobs from partitions with lower **PriorityTier** values.

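Partition priorities and per-job priority factors can be inspected directly, for example:

```bash
# Show the priority factors of pending jobs (the PARTITION column
# contains the PriorityJobFactor contribution):
sprio -l

# Show the configuration of a partition, including PriorityJobFactor
# and PriorityTier:
scontrol show partition daily
```
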
* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
* For **`hourly`** there are no limits.
* **`asa-general`**, **`asa-daily`**, **`asa-ansys`**, **`asa-visas`** and **`mu3e`** are **private** partitions, belonging to the experiments that own the machines. **Access is restricted** in all cases. However, by agreement with the experiments, these nodes are usually added to the **`hourly`** partition as extra resources for public use.

!!! tip "Partition Selection"
    Jobs running for less than one day should always be sent to **daily**,
    and jobs running for less than one hour to **hourly**. This ensures a
    higher priority than jobs sent to lower priority partitions, and avoids
    the node limit of **general**. The idea behind this is that the cluster
    can not be blocked by long jobs, and resources remain available for
    shorter jobs.

### Merlin6 CPU accounts

Users need to ensure that the public **`merlin`** account is used. If no account option is specified, jobs default to this account.

Setting the account explicitly is mostly needed by users who have multiple Slurm accounts and may select a different account by mistake.

```bash
#SBATCH --account=merlin  # Possible values: merlin, gfa-asa, mu3e
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account        | Slurm Partitions                                                                   |
| :------------------: | :--------------------------------------------------------------------------------: |
| **<u>merlin</u>**    | `hourly`, `daily`, `general`                                                       |
| **gfa-asa**          | `asa-general`, `asa-daily`, `asa-visas`, `asa-ansys`, `hourly`, `daily`, `general` |
| **mu3e**             | `mu3e`                                                                             |

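Which accounts are available to you can be checked from Slurm, for example:

```bash
# List your Slurm associations (the accounts you may submit with):
sacctmgr show associations user=$USER format=Cluster,Account,User
```
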
#### Private accounts

* The *`gfa-asa`* and *`mu3e`* accounts are private accounts. They can be used for accessing dedicated partitions with nodes owned by the corresponding groups.

### Slurm CPU specific options

Some Slurm options are specific to CPU based jobs. Please refer to the **man** pages of each Slurm command (`man salloc`, `man sbatch`, `man srun`) for further information. The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread               # enable or disable hyper-threading
#SBATCH --ntasks=<ntasks>                    # total number of tasks
#SBATCH --ntasks-per-core=<ntasks>           # maximum tasks per core
#SBATCH --ntasks-per-socket=<ntasks>         # maximum tasks per socket
#SBATCH --ntasks-per-node=<ntasks>           # maximum tasks per node
#SBATCH --mem=<size[units]>                  # memory per node
#SBATCH --mem-per-cpu=<size[units]>          # memory per allocated CPU
#SBATCH --cpus-per-task=<ncpus>              # CPUs per task (e.g. threads)
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for 'srun' command
```

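As a minimal sketch of how these options combine, a hybrid MPI/OpenMP job could look as follows (the partition, sizes and the `my_app` binary are illustrative assumptions, not a prescribed setup):

```bash
#!/bin/bash
#SBATCH --cluster=merlin6
#SBATCH --partition=daily
#SBATCH --ntasks=8                 # 8 MPI tasks in total
#SBATCH --cpus-per-task=4          # 4 OpenMP threads per task
#SBATCH --mem-per-cpu=4000         # 4000 MB per CPU (the partition default)
#SBATCH --hint=nomultithread       # one thread per physical core

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_app                      # 'my_app' is a placeholder for your binary
```
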
#### Enabling/Disabling Hyper-Threading

The **`merlin6`** cluster contains nodes with Hyper-Threading enabled. One should always specify whether or not to use Hyper-Threading. If not defined, Slurm will generally use it (exceptions apply).

```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```

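As a sketch of the practical effect on a 44-core node (`job.sh` is a placeholder), requesting a full node with and without hyper-threading:

```bash
# Use all hardware threads of one 44-core node (88 CPUs with hyper-threading):
sbatch --nodes=1 --ntasks=88 --hint=multithread job.sh

# Use only the physical cores (one task per core):
sbatch --nodes=1 --ntasks=44 --hint=nomultithread job.sh
```
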
#### Constraint / Features

Slurm allows a set of features to be defined in the node definition. These can be used to filter and select nodes according to one or more specific features. For the CPU nodes, we have the following features:

```text
NodeName=merlin-c-[001-024,101-124,201-224] Features=mem_384gb,xeon-gold-6152
NodeName=merlin-c-[301-312] Features=mem_768gb,xeon-gold-6240r
NodeName=merlin-c-[313-318] Features=mem_768gb,xeon-gold-6240r
NodeName=merlin-c-[319-324] Features=mem_384gb,xeon-gold-6240r
```

Therefore, users running on `hourly` can select which type of node they want to use (fat memory nodes vs. regular memory nodes, CPU type). This is done with the option `--constraint=<feature_name>` in Slurm.

Examples:

1. Select nodes with 48 cores only (nodes with [2 x Xeon Gold 6240R](https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html)):

    ```bash
    sbatch --constraint=xeon-gold-6240r ...
    ```

1. Select nodes with 44 cores only (nodes with [2 x Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html)):

    ```bash
    sbatch --constraint=xeon-gold-6152 ...
    ```

1. Select fat memory nodes only:

    ```bash
    sbatch --constraint=mem_768gb ...
    ```

1. Select regular memory nodes only:

    ```bash
    sbatch --constraint=mem_384gb ...
    ```

1. Select fat memory nodes with 48 cores only:

    ```bash
    sbatch --constraint=mem_768gb,xeon-gold-6240r ...
    ```

Detailing exactly which type of nodes you want to use is important. Therefore, for groups with private accounts (`mu3e`, `gfa-asa`) and for public users running on the `hourly` partition, *constraining nodes by features is recommended*. This becomes even more important in heterogeneous clusters.

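Features can also be combined with the OR operator when either node type is acceptable; a short sketch (`job.sh` is a placeholder):

```bash
# Accept either CPU type, whichever becomes available first:
sbatch --constraint="xeon-gold-6152|xeon-gold-6240r" job.sh
```
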
## Running jobs in the 'merlin6' cluster

In this chapter we will cover the basic settings that users need to specify in order to run jobs in the Merlin6 CPU cluster.

### User and job limits

The CPU cluster enforces some limits which apply to jobs and to users. The idea behind this is to ensure a fair usage of the resources and to prevent a single user or job from monopolizing the cluster. However, applying limits might affect the overall usage efficiency of the cluster (for example, when user limits are applied, one may see pending jobs from a single user while many nodes sit idle due to low overall activity). At the same time, these limits can also improve the efficiency of the cluster (for example, without any job size limits, a job requesting all resources of the batch system would drain the entire cluster in order to fit, which is undesirable).

Hence, limits need to be set wisely, ensuring a fair usage of the resources and optimizing the overall efficiency of the cluster, while allowing jobs of different natures and sizes (that is, **single core** as well as **parallel jobs** of different sizes) to run.

!!! warning "Resource Limits"
    Wide limits are provided in the **daily** and **hourly** partitions, while
    for **general** the limits are more restrictive. However, we kindly ask
    users to inform the Merlin administrators when they plan to submit big
    jobs which would require a massive draining of nodes in order to be
    allocated. This applies to jobs requiring the **unlimited** QoS (see
    "Per job limits" below).

!!! tip "Custom Requirements"
    If you have different requirements, please let us know; we will try to
    accommodate them or propose a solution for you.

#### Per job limits

These limits apply to a single job, i.e. they define the maximum amount of resources a single job can use. Limits are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits vary depending on the day and time of the week.

| Partition   | Mon-Fri 0h-18h                   | Sun-Thu 18h-0h                   | From Fri 18h to Mon 0h           |
|:----------: | :------------------------------: | :------------------------------: | :------------------------------: |
| **general** | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        |
| **daily**   | daytime(cpu=704,mem=2750G)       | nighttime(cpu=1408,mem=5500G)    | unlimited(cpu=2200,mem=8593.75G) |
| **hourly**  | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |

By default, a job can not use more than 704 cores (max CPU per job); memory is proportionally limited in the same way. This is equivalent to running a job on up to 8 nodes at once. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).

Limits are relaxed for the **daily** partition during non-working hours, and are even wider during the weekend. For the **hourly** partition, wider limits are provided as well, **although running very large parallel jobs is not desirable** (allocating such jobs requires a massive draining of nodes). Keeping per job limits is necessary precisely to avoid massive node draining in the cluster when allocating huge jobs. Hence, the **unlimited** QoS mostly refers to "per user" limits rather than to "per job" limits (in other words, users can run any number of hourly jobs, but the size of each such job is still limited, with wide values).

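The QoS limits above (and the per-user limits in the next section) can be inspected directly from Slurm; a sketch (output field names may vary slightly between Slurm versions):

```bash
# List all QoS with their per-job and per-user TRES limits:
sacctmgr show qos format=Name%12,MaxTRES%30,MaxTRESPU%30
```
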
#### Per user limits for CPU partitions

These limits apply exclusively to users, i.e. they define the maximum amount of resources a single user can use. Limits are described in the table below in the format `SlurmQoS(limits)` (the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits vary depending on the day and time of the week.

| Partition   | Mon-Fri 0h-18h                 | Sun-Thu 18h-0h                 | From Fri 18h to Mon 0h         |
|:-----------:| :----------------------------: | :----------------------------: | :----------------------------: |
| **general** | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      |
| **daily**   | daytime(cpu=1408,mem=5500G)    | nighttime(cpu=2112,mem=8250G)  | unlimited(cpu=6336,mem=24750G) |
| **hourly**  | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) |

By default, users can not use more than 704 cores at the same time (max CPU per user); memory is proportionally limited in the same way. This is equivalent to 8 exclusive nodes. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).

For the **daily** partition, user limits are relaxed during non-working hours and removed during the weekend. For the **hourly** partition, user limits are removed entirely.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs. Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

For understanding the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.

Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
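
Besides reading the files directly, the configuration actually in effect can be queried from Slurm itself, for example:

```bash
# Print the running Slurm configuration (all values currently in effect):
scontrol show config
```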