---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document provides a summary of the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin5 cluster.

Merlin5 is an older cluster with aging hardware, maintained on a best-effort basis to increase the overall CPU capacity of the Merlin cluster.

## Merlin5 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes            | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node (MB) | Max.Swap (MB) |
|:----------------:| ---------:| :--------:| :------: | :---------------: | :-----------: |
| merlin-c-[18-30] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[31-32] | 1 core    | 16 cores  | 1        | 124000            | 10000         |
| merlin-c-[33-45] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[46-47] | 1 core    | 16 cores  | 1        | 124000            | 10000         |

There is one *main difference between the Merlin5 and Merlin6 clusters*: Merlin5 keeps an old configuration which does not treat memory as a *consumable resource*. Hence, users can *oversubscribe* memory. This might trigger some side effects, but this legacy configuration has been kept to ensure that old jobs keep running the same way they did a few years ago. If you know that this might be a problem for you, please always use Merlin6 instead.
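
For illustration only, memory can still be requested with the standard Slurm options (the value below is a placeholder); keep in mind that, since memory is not a consumable resource in Merlin5, it is not accounted for when other jobs are scheduled on the same node, so the memory of a node can still end up oversubscribed:

```bash
#SBATCH --mem=8000   # Placeholder memory request (~8 GB); in Merlin5 this is not
                     # accounted as a consumable resource, so other jobs sharing
                     # the node may still oversubscribe its memory
```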

## Running jobs in the 'merlin5' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.

### Merlin5 CPU cluster

To run jobs in the **`merlin5`** cluster users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin5
```
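
The same cluster selection also applies to the Slurm client commands, since Merlin runs a multi-clustered Slurm setup (see the advanced configuration section below). A small sketch, where `myjob.sh` is a placeholder for your own batch script:

```bash
# Submit a batch script to the merlin5 cluster (equivalent to having
# the '#SBATCH --cluster=merlin5' directive inside the script):
sbatch --clusters=merlin5 myjob.sh

# Inspect your jobs and the available partitions of the merlin5 cluster:
squeue --clusters=merlin5 -u $USER
sinfo --clusters=merlin5
```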

### Merlin5 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, jobs default to the **`merlin`** partition:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```

The table below summarizes all partitions available to users:

| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:------------------:| :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>merlin</u>**  | 5 days       | 1 week   | All nodes | 500                 | 1                |
| **merlin-long**    | 5 days       | 21 days  | 4         | 1                   | 1                |

**\***The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher-priority partitions will usually run first (although other factors, such as **job age** or, mainly, **fair share**, can affect that decision). Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

The **`merlin-long`** partition **is limited to 4 nodes**, as it might contain jobs running for up to 21 days.
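
For instance, the per-factor priority breakdown of your pending jobs can be inspected with `sprio`; a small sketch, assuming the installed Slurm version supports the `-M/--clusters` option for this command:

```bash
# Show the priority factors (including the PARTITION column) of your pending jobs
sprio -l -M merlin5 -u $USER
```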

### Merlin5 CPU Accounts

Users need to ensure that the public **`merlin`** account is specified; not specifying any account option will default to this account. This is mostly relevant for users who belong to multiple Slurm accounts and who might specify a different account by mistake.

```bash
#SBATCH --account=merlin   # Possible values: merlin
```
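
If you are unsure which Slurm accounts you belong to, you can list your associations; a minimal sketch (the `format` fields shown are optional):

```bash
# List your Slurm associations (cluster, account, partition) across the clusters
sacctmgr show associations where user=$USER format=Cluster,Account,User,Partition
```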

### Slurm CPU specific options

Several Slurm options are relevant for CPU-based jobs. Please refer to the **man** pages of each Slurm command (`man salloc`, `man sbatch`, `man srun`) for further information about them. The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>                    # Number of tasks (processes) to run
#SBATCH --ntasks-per-core=<ntasks>           # Maximum number of tasks per allocated core
#SBATCH --ntasks-per-socket=<ntasks>         # Maximum number of tasks per allocated socket
#SBATCH --ntasks-per-node=<ntasks>           # Maximum number of tasks per allocated node
#SBATCH --mem=<size[units]>                  # Memory required per node
#SBATCH --mem-per-cpu=<size[units]>          # Memory required per allocated CPU
#SBATCH --cpus-per-task=<ncpus>              # Number of CPUs per task (e.g. for multithreaded jobs)
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # CPU binding; only for the 'srun' command
```

Notice that in **Merlin5** no hyper-threading is available (while in **Merlin6** it is). Hence, in **Merlin5** there is no need to specify the `--hint` hyper-threading-related options.
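
As an illustration, a minimal Merlin5 batch script combining the options above could look as follows (the job name, resource numbers, walltime and payload are placeholders, not recommendations):

```bash
#!/bin/bash
#SBATCH --cluster=merlin5        # Mandatory: run in the Merlin5 cluster
#SBATCH --partition=merlin       # Optional: 'merlin' is already the default partition
#SBATCH --account=merlin         # Optional: 'merlin' is already the default account
#SBATCH --job-name=example       # Placeholder job name
#SBATCH --ntasks=16              # Placeholder: 16 tasks, i.e. one full node
#SBATCH --time=1-00:00:00        # Placeholder: 1 day of walltime

# Placeholder payload: replace with your own application
srun hostname
```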

## User and job limits

In the CPU cluster we provide some limits which apply to jobs and users. The idea behind them is to ensure a fair usage of the resources and to avoid a single user or job abusing them. However, applying limits might affect the overall usage efficiency of the cluster (for example, pending jobs from a single user while many nodes stay idle due to low overall activity is something that can be seen when user limits are applied). In the same way, limits can also be used to improve the efficiency of the cluster (for example, without any job size limit, a job requesting all resources of the batch system would drain the entire cluster in order to fit the job, which is undesirable).

Hence, limits need to be set up wisely, ensuring a fair usage of the resources and trying to optimize the overall efficiency of the cluster while allowing jobs of different natures and sizes (that is, **single-core** jobs **vs. parallel jobs** of different sizes) to run.

In the **`merlin5`** cluster, since not many users run on it, these limits are wider than the ones set in the **`merlin6`** and **`gmerlin6`** clusters.

### Per job limits

These limits apply to a single job; in other words, they define the maximum amount of resources a single job can use. They are described in the table below, in the format `SlurmQoS(limits)` (the available `SlurmQoS` values can be listed with the `sacctmgr show qos` command):

| Partition        | Mon-Sun 0h-24h   | Other limits |
|:----------------:| :--------------: | :----------: |
| **merlin**       | merlin5(cpu=384) | None         |
| **merlin-long**  | merlin5(cpu=384) | Max. 4 nodes |

By default, through QoS limits, a job cannot use more than 384 cores (max. CPUs per job). For the `merlin-long` partition this is even more restricted: there is an extra limit of 4 dedicated nodes for this partition. That limit is defined at the partition level and overrides the QoS limit whenever it is more restrictive.
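
The QoS limits currently in place can be checked from the command line; a small sketch (the `format` fields shown are optional and may vary between Slurm versions):

```bash
# List the QoS definitions together with their per-job TRES and walltime limits
sacctmgr show qos format=Name,MaxTRES,MaxWall
```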

### Per user limits for CPU partitions

No per-user limits apply through QoS. For the **`merlin`** partition, a single user could fill the whole batch system with jobs (the restriction is on the job size, as explained above). For the **`merlin-long`** partition, the 4-node limitation still applies.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs. Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found in the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found in the GPU nodes, and is also propagated to login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found in the computing nodes, and is also propagated to login nodes for user read access.

The configuration files found in the login nodes correspond exclusively to the **merlin6** cluster. Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
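
Alternatively, much of the running configuration can be queried through the Slurm client commands without logging in to a node; a sketch relying on the `-M/--clusters` option of `scontrol`:

```bash
# Print the running Slurm configuration of the merlin5 cluster
scontrol --clusters=merlin5 show config

# Show the definition of a specific partition, e.g. 'merlin-long'
scontrol --clusters=merlin5 show partition merlin-long
```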