Updates in Slurm

2019-07-01 18:04:15 +02:00
parent 864ef84a0f
commit 5c2ea17076
4 changed files with 23 additions and 88 deletions


@ -19,22 +19,7 @@ Slurm has been installed in a **multi-clustered** configuration, allowing to int
* **merlin5** will remain available as long as hardware incidents stay minor and easy to repair/fix (e.g. hard disk replacement)
* **merlin6** is the default cluster when submitting jobs.
This document focuses on the **merlin6** cluster. Details for **merlin5** are not covered here; only basic access and recent
changes are explained (the **[Official Merlin5 User Guide](https://intranet.psi.ch/PSI_HPC/Merlin5)** remains valid).
### Merlin6 Slurm Configuration Details
To understand the Slurm configuration of the cluster, it can be useful to check the following files:
* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes and is also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes and is also propagated to the login nodes and computing nodes for user read access.
The configuration files found on the *login nodes* correspond exclusively to the **merlin6** cluster. These
configuration files are also present on the **merlin6** *computing nodes*.
Slurm configuration files for the old **merlin5** cluster have to be checked directly on any of the **merlin5** *computing nodes*: those files *do
not* exist on the **merlin6** *login nodes*.
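For example, these files can be inspected from a login node as sketched below; ``scontrol show config`` additionally prints the configuration as reported by the running controller (the grep pattern is only an illustration):
```bash
# Inspect the merlin6 configuration files listed above (read-only)
less /etc/slurm/slurm.conf
less /etc/slurm/gres.conf

# Query the live configuration from the Slurm controller
scontrol show config | grep -i -E 'SelectType|DefMemPerCPU'
```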
Please follow the section **Merlin6 Slurm** for more details about configuration and job submission.
### Merlin5 Access
@ -49,50 +34,8 @@ srun --clusters=merlin5 --partition=merlin hostname
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```
---
### Merlin6 Access
## Using Slurm 'merlin6' cluster
Basic usage of the **merlin6** cluster is detailed here. For advanced usage, please refer to the following document [LINK TO SLURM ADVANCED CONFIG]()
### Merlin6 Node definition
The following table shows the default and maximum resources that can be used per node:
| Nodes                              | Def.#CPUs | Max.#CPUs | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | Def.#GPUs | Max.#GPUs |
|:---------------------------------- | ---------:| ---------:| ----------------:| ----------------:| -----------------:| -------------:| --------- | --------- |
| merlin-c-[001-022,101-122,201-222] | 1 core | 44 cores | 8000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-g-[001] | 1 core | 8 cores | 8000 | 102498 | 102498 | 10000 | 1 | 2 |
| merlin-g-[002-009] | 1 core | 10 cores | 8000 | 102498 | 102498 | 10000 | 1 | 4 |
If nothing is specified, each core will by default be assigned up to 8GB of memory. More memory can be requested with the ``--mem=<memory>`` option;
the maximum memory allowed is ``Max.Mem/Node``.
In *Merlin6*, memory, like the CPU, is treated as a consumable resource.
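As a minimal sketch (the numbers are only examples), a job that needs more than the per-core default could request memory like this:
```bash
#!/bin/bash
#SBATCH --ntasks=4            # Example: 4 tasks
#SBATCH --mem=64000           # Example: request 64000MB on the node instead of the default 8000MB per core
srun hostname
```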
### Merlin6 Slurm partitions
Partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:
| Partition | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true | 1 day | 1 week | 50 | low |
| **daily** | false | 1 day | 1 day | 60 | medium |
| **hourly** | false | 1 hour | 1 hour | unlimited | highest |
**general** is the *default*: when no partition is specified, jobs are assigned to it. The **general** partition cannot have more than 50 nodes
running jobs. For **daily** this limit is raised to 60 nodes, while for **hourly** there is no limit. Shorter jobs have higher priority than
longer jobs and will in general be scheduled earlier (however, other factors, such as the user's fair-share value, can affect this decision).
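For example, a short test job could be sent to the ``hourly`` partition either from the command line or inside the batch script (``myScript.batch`` is a placeholder):
```bash
# On the command line:
sbatch --partition=hourly myScript.batch

# Or inside the batch script itself:
#SBATCH --partition=hourly
```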
### Merlin6 User limits
By default, users cannot use more than 528 cores at the same time (max. CPUs per user). This limit applies to the **general** and **daily** partitions. For the **hourly** partition, there is no restriction.
These limits are relaxed for the **daily** partition during non-working hours and during the weekend, as follows:
| Partition | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ----------------------- | ---------------------- |
| **general** | 528 | 528 | 528 | 528 |
| **daily** | 528 | 792 | Unlimited | 792 |
| **hourly** | Unlimited | Unlimited | Unlimited | Unlimited |
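How these limits are enforced (per association or per QOS) is not detailed here; as a sketch, the limits and fair-share values applied to your user can usually be listed with:
```bash
# Associations (cluster/account/partition/QOS) for the current user
sacctmgr show associations where user=$USER format=Cluster,Account,User,Partition,QOS

# Per-user limits attached to each QOS
sacctmgr show qos format=Name,MaxWall,MaxTRESPerUser

# Current fair-share and usage for your associations
sshare -U
```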
By default, any job submitted without specifying ``--clusters=`` will use the local cluster, so nothing extra needs to be specified. In any case,
you can optionally add ``--clusters=merlin6`` in order to force submission to the Merlin6 cluster.
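For instance (``myScript.batch`` is a placeholder):
```bash
sbatch myScript.batch                      # Uses the local cluster (merlin6)
sbatch --clusters=merlin6 myScript.batch   # Equivalent: explicitly target merlin6
squeue --clusters=merlin6 -u $USER         # Check your jobs on merlin6
```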


@ -28,9 +28,9 @@ but jobs already queued on the partition may be allocated to nodes and run.
Unless explicitly specified, the default draining policy for each partition will be the following:
* The **general** partition will be soft drained on the previous Friday from 8am.
* The **daily** partition will be soft drained on the previous day from 8am.
* The **hourly** partition will be soft drained on the same Monday from 7am.
* The **general** and **gpu_general** partitions will be soft drained on the previous Friday from 8am.
* The **daily** and **gpu_daily** partitions will be soft drained on the previous day from 8am.
* The **hourly** and **gpu_hourly** partitions will be soft drained on the same Monday from 7am.
Finally, **remaining running jobs will be killed** by default when the downtime starts. In some rare cases, jobs will instead be
*paused* and *resumed* when the downtime finishes.
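As a quick check before a downtime, the drain state of nodes and partitions can be inspected, for example (the node name is only illustrative):
```bash
# Nodes in 'drng'/'drain' state indicate that the partition is being drained
sinfo --partition=general,daily,hourly

# Show the reason why a specific node was set to drain
scontrol show node merlin-c-001 | grep -i reason
```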
@ -39,11 +39,14 @@ just *paused* and *resumed* back when the downtime finished.
The following table contains a summary of the draining policies during a Schedule Downtime:
| **Partition** | **Drain Policy** | **Default Drain Type** | **Default Job Policy** |
|:-------------:| -----------------:| ----------------------:| --------------------------------:|
| **general** | 72h before the SD | soft drain | Kill running jobs when SD starts |
| **daily** | 24h before the SD | soft drain | Kill running jobs when SD starts |
| **hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **Partition** | **Drain Policy** | **Default Drain Type** | **Default Job Policy** |
|:---------------:| -----------------:| ----------------------:| --------------------------------:|
| **general** | 72h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_general** | 72h before the SD | soft drain | Kill running jobs when SD starts |
| **daily** | 24h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_daily** | 24h before the SD | soft drain | Kill running jobs when SD starts |
| **hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |
---


@ -62,11 +62,15 @@ Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting
## Partitions
Merlin6 contains 3 partitions for general purpose. These are ``general``, ``daily`` and ``hourly``. If no partition is defined,
``general`` will be the default. Partition can be defined with the ``--partition`` option as follows:
Merlin6 contains 6 partitions for general purpose:
* For the CPU these are ``general``, ``daily`` and ``hourly``.
* For the GPU these are ``gpu_general``, ``gpu_daily`` and ``gpu_hourly``.
If no partition is defined, ``general`` will be the default. Partition can be defined with the ``--partition`` option as follows:
```bash
#SBATCH --partition=<general|daily|hourly> # Partition to use. 'general' is the 'default'.
#SBATCH --partition=<partition_name> # Partition to use. 'general' is the 'default'.
```
Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
@ -76,14 +80,6 @@ Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more
CPU-based jobs are available for all PSI users. Users must belong to the ``merlin6`` Slurm ``Account`` in order to be able
to run on CPU-based nodes. All users registered in Merlin6 are automatically included in the ``Account``.
### Slurm CPU Mandatory Settings
The following options are mandatory settings that **must be included** in your batch scripts:
```bash
#SBATCH --constraint=mc # Always set it to 'mc' for CPU jobs.
```
### Slurm CPU Recommended Settings
Some settings are not mandatory, but may be needed or useful to specify. These are the following:
@ -105,7 +101,6 @@ The following template should be used by any user submitting jobs to CPU nodes:
#SBATCH --time=<D-HH:MM:SS> # Strictly recommended when using 'general' partition.
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --constraint=mc # You must specify 'mc' when using 'cpu' jobs
#SBATCH --ntasks-per-core=1 # Recommended one thread per core
##SBATCH --exclusive # Uncomment if you need exclusive node usage
@ -130,7 +125,6 @@ are automatically registered to the ``merlin6-gpu`` account. Other users should
The following options are mandatory settings that **must be included** in your batch scripts:
```bash
#SBATCH --constraint=gpu # Always set it to 'gpu' for GPU jobs.
#SBATCH --gres=gpu # Always set at least this option when using GPUs
```
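As an illustration (the GPU model name is only an example), a more specific request can include the GPU type and count, matching the ``gpu:<type>:<number_gpus>`` syntax used in the template below:
```bash
#SBATCH --gres=gpu:2            # Example: request 2 GPUs of any type
#SBATCH --gres=gpu:GTX1080:2    # Example: request 2 GPUs of a specific model
```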
@ -153,11 +147,10 @@ The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/sh
#SBATCH --partition=<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --partition=gpu_<general|daily|hourly> # Specify 'gpu_general', 'gpu_daily' or 'gpu_hourly'
#SBATCH --time=<D-HH:MM:SS> # Strictly recommended when using 'general' partition.
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file>               # Generate custom error file
#SBATCH --constraint=gpu # You must specify 'gpu' for using GPUs
#SBATCH --gres="gpu:<type>:<number_gpus>" # You should specify at least 'gpu'
#SBATCH --ntasks-per-core=1 # GPU nodes have hyper-threading disabled
##SBATCH --exclusive # Uncomment if you need exclusive node usage


@ -15,7 +15,6 @@ permalink: /merlin6/slurm-examples.html
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --constraint=mc # Use CPU batch system
#SBATCH --ntasks-per-core=1 # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem-per-cpu=8000 # Double the default memory per cpu
#SBATCH --time=00:30:00 # Define max time job will run
@ -35,7 +34,6 @@ hyperthreads), hence we want to use the memory as if we were using 2 threads.
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --constraint=mc # Use CPU batch system
#SBATCH --ntasks-per-core=1 # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem=352000 # We want to use the whole memory
#SBATCH --time=00:30:00 # Define max time job will run
@ -58,7 +56,6 @@ the job will use. This must be done in order to avoid conflicts with other jobs
#SBATCH --exclusive # Use the node in exclusive mode
#SBATCH --ntasks=88 # Job will run 88 tasks
#SBATCH --ntasks-per-core=2 # Force Hyper-Threading, will run 2 tasks per core
#SBATCH --constraint=mc # Use CPU batch system
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file
@ -81,7 +78,6 @@ per thread is 4000MB, in total this job can use up to 352000MB memory which is t
#SBATCH --ntasks=44 # Job will run 44 tasks
#SBATCH --ntasks-per-core=1 # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem=352000 # Define the whole memory of the node
#SBATCH --constraint=mc # Use CPU batch system
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file