Updates in Slurm

Slurm has been installed in a **multi-clustered** configuration, allowing interaction with multiple clusters from the same login nodes:

* **merlin5** will exist as long as hardware incidents remain minor and easy to repair (e.g. hard disk replacement).
* **merlin6** is the default cluster when submitting jobs.

This document focuses mostly on the **merlin6** cluster. Details for **merlin5** are not shown here; only basic access and recent changes are explained (the **[Official Merlin5 User Guide](https://intranet.psi.ch/PSI_HPC/Merlin5)** is still valid).
### Merlin6 Slurm Configuration Details

To understand the Slurm setup of the cluster, it can sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - found on the login nodes and computing nodes.
* ``/etc/slurm/cgroup.conf`` - found on the computing nodes; also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - found on the GPU nodes; also propagated to the login nodes and computing nodes for user read access.

The configuration files found on the *login nodes* correspond exclusively to the **merlin6** cluster. These configuration files are also present on the **merlin6** *computing nodes*.

Slurm configuration files for the old **merlin5** cluster have to be checked directly on any of the **merlin5** *computing nodes*: those files *do not* exist on the **merlin6** *login nodes*.

Please follow the section **Merlin6 Slurm** for more details about configuration and job submission.

### Merlin5 Access

Jobs can be submitted to the **merlin5** cluster by specifying the ``--clusters=merlin5`` option:

```bash
srun --clusters=merlin5 --partition=merlin hostname
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```

---

### Merlin6 Access

## Using Slurm 'merlin6' cluster

Basic usage of the **merlin6** cluster is detailed here. For advanced usage, please refer to the following document: [LINK TO SLURM ADVANCED CONFIG]()

### Merlin6 Node definition

The following table shows the default and maximum resources that can be used per node:

| Nodes | Def.#CPUs | Max.#CPUs | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | Def.#GPUs | Max.#GPUs |
|:---------------------------------- | ---------:| ---------:| ----------------:| ----------------:| -----------------:| -------------:| --------- | --------- |
| merlin-c-[001-022,101-122,201-222] | 1 core | 44 cores | 8000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-g-[001] | 1 core | 8 cores | 8000 | 102498 | 102498 | 10000 | 1 | 2 |
| merlin-g-[002-009] | 1 core | 10 cores | 8000 | 102498 | 102498 | 10000 | 1 | 4 |

If nothing else is specified, each core will by default use up to 8GB of memory. More memory per core can be requested with the ``--mem-per-cpu=<memory>`` option; the maximum memory allowed per node is ``Max.Mem/Node`` (requested with ``--mem=<memory>``).

In *Merlin6*, memory is considered a Consumable Resource, as is the CPU.

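As a quick sanity check on the table above, the node totals are consistent with the per-core defaults: 44 cores at the 8000 MB default account for the full 352000 MB of ``Max.Mem/Node``. A minimal sketch using plain shell arithmetic (no Slurm needed):

```bash
#!/bin/bash
# Values taken from the node definition table above.
mem_per_cpu_mb=8000   # Def.Mem/CPU
cores=44              # Max.#CPUs on the merlin-c nodes

# Total memory if every core requests the default: equals Max.Mem/Node.
echo $((mem_per_cpu_mb * cores))   # prints 352000
```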
### Merlin6 Slurm partitions

A partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:

| Partition | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true | 1 day | 1 week | 50 | low |
| **daily** | false | 1 day | 1 day | 60 | medium |
| **hourly** | false | 1 hour | 1 hour | unlimited | highest |

**general** is the *default*: when nothing is specified, jobs are assigned to that partition. **general** cannot have more than 50 nodes running jobs; for **daily** this limit is raised to 60 nodes, while **hourly** has no limit. Shorter jobs have higher priority than longer jobs, hence in general terms they will be scheduled earlier (however, other factors such as the user's fair share value can affect this decision).

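For example, a short job can target the high-priority ``hourly`` partition in its batch script. A minimal sketch of the relevant directives (``00:45:00`` is an arbitrary illustrative walltime):

```bash
#!/bin/bash
#SBATCH --partition=hourly   # highest priority, but jobs are limited to 1 hour
#SBATCH --time=00:45:00      # must fit within the partition's Max Time
```

Omitting ``--partition`` entirely sends the job to ``general`` (1 day default walltime, 1 week maximum, low priority).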
### Merlin6 User limits

By default, users cannot use more than 528 cores at the same time (Max CPU per user). This limit applies to the **general** and **daily** partitions; for the **hourly** partition there is no restriction.
These limits are relaxed for the **daily** partition during non-working hours and during the weekend, as follows:

| Partition | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ---------------------- | ---------------------- |
| **general** | 528 | 528 | 528 | 528 |
| **daily** | 528 | 792 | Unlimited | 792 |
| **hourly** | Unlimited | Unlimited | Unlimited | Unlimited |

By default, any job submitted without specifying ``--clusters=`` will use the local cluster, so nothing extra needs to be specified. You can optionally add ``--clusters=merlin6`` in order to force submission to the Merlin6 cluster.

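A minimal sketch of forcing the target cluster explicitly at submission time (``myScript.batch`` is a placeholder script name):

```bash
# Optional: force submission to the Merlin6 cluster explicitly.
# Without --clusters, the local cluster is used anyway.
sbatch --clusters=merlin6 myScript.batch

# The same option reaches the old merlin5 cluster from the merlin6 login nodes.
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```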

A soft drain means that no new jobs can be submitted to the partition, but jobs already queued on the partition may be allocated to nodes and run.

Unless explicitly specified, the default draining policy for each partition is the following:

* The **general** and **gpu_general** partitions will be soft drained from 8am on the previous Friday.
* The **daily** and **gpu_daily** partitions will be soft drained from 8am on the previous day.
* The **hourly** and **gpu_hourly** partitions will be soft drained from 7am on the same Monday.

Finally, **remaining running jobs will be killed** by default when the downtime starts. In some specific rare cases jobs will just be *paused* and *resumed* when the downtime finishes.

The following table summarizes the draining policies during a Scheduled Downtime (SD):

| **Partition** | **Drain Policy** | **Default Drain Type** | **Default Job Policy** |
|:---------------:| -----------------:| ----------------------:| --------------------------------:|
| **general** | 72h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_general** | 72h before the SD | soft drain | Kill running jobs when SD starts |
| **daily** | 24h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_daily** | 24h before the SD | soft drain | Kill running jobs when SD starts |
| **hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |
| **gpu_hourly** | 1h before the SD | soft drain | Kill running jobs when SD starts |

---

Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) to get the list of supported filename patterns.

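For instance, the ``%j`` pattern expands to the Slurm job ID, which keeps output files from different runs separate (the file names here are illustrative):

```bash
#SBATCH --output=run_%j.out   # '%j' expands to the Slurm job ID
#SBATCH --error=run_%j.err    # one error file per job as well
```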
## Partitions

Merlin6 contains 6 general-purpose partitions:

* For the CPUs, these are ``general``, ``daily`` and ``hourly``.
* For the GPUs, these are ``gpu_general``, ``gpu_daily`` and ``gpu_hourly``.

If no partition is defined, ``general`` will be the default. The partition can be defined with the ``--partition`` option as follows:

```bash
#SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default'.
```

Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about the Merlin6 partition setup.

CPU-based jobs are available to all PSI users. Users must belong to the ``merlin6`` Slurm ``Account`` in order to be able to run on CPU-based nodes. All users registered in Merlin6 are automatically included in this ``Account``.

### Slurm CPU Mandatory Settings

The following options are mandatory settings that **must be included** in your batch scripts:

```bash
#SBATCH --constraint=mc   # Always set it to 'mc' for CPU jobs.
```

### Slurm CPU Recommended Settings

Some settings are not mandatory, but may be needed or useful. These are the following:

The following template should be used by any user submitting jobs to CPU nodes:

```bash
#SBATCH --time=<D-HH:MM:SS>      # Strongly recommended when using the 'general' partition.
#SBATCH --output=<output_file>   # Generate custom output file
#SBATCH --error=<error_file>     # Generate custom error file
#SBATCH --constraint=mc          # You must specify 'mc' for CPU jobs
#SBATCH --ntasks-per-core=1      # Recommended: one thread per core
##SBATCH --exclusive             # Uncomment if you need exclusive node usage
```

Eligible users are automatically registered to the ``merlin6-gpu`` account; other users should request access to it.

The following options are mandatory settings that **must be included** in your batch scripts:

```bash
#SBATCH --constraint=gpu   # Always set it to 'gpu' for GPU jobs.
#SBATCH --gres=gpu         # Always set at least this option when using GPUs
```
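As an illustration, a request for two GPUs might look as follows (the count is arbitrary; ``--gres=gpu:<n>`` works without naming a specific GPU type):

```bash
#SBATCH --constraint=gpu   # mandatory for GPU jobs, as above
#SBATCH --gres=gpu:2       # request 2 GPUs on the node
```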

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/sh
#SBATCH --partition=gpu_<general|daily|hourly>  # Specify 'gpu_general', 'gpu_daily' or 'gpu_hourly'
#SBATCH --time=<D-HH:MM:SS>                     # Strongly recommended when using the 'gpu_general' partition.
#SBATCH --output=<output_file>                  # Generate custom output file
#SBATCH --error=<error_file>                    # Generate custom error file
#SBATCH --constraint=gpu                        # You must specify 'gpu' for using GPUs
#SBATCH --gres="gpu:<type>:<number_gpus>"       # You should specify at least 'gpu'
#SBATCH --ntasks-per-core=1                     # GPU nodes have hyper-threading disabled
##SBATCH --exclusive                            # Uncomment if you need exclusive node usage
```

## Slurm examples

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --constraint=mc         # Use CPU batch system
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem-per-cpu=8000      # Double the default memory per cpu
#SBATCH --time=00:30:00         # Define max time job will run
```
The following example requests the whole memory of the node:

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --constraint=mc         # Use CPU batch system
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem=352000            # We want to use the whole memory
#SBATCH --time=00:30:00         # Define max time job will run
```
The following example runs in exclusive mode and specifies how many tasks the job will use. This must be done in order to avoid conflicts with other jobs:

```bash
#SBATCH --exclusive             # Use the node in exclusive mode
#SBATCH --ntasks=88             # Job will run 88 tasks
#SBATCH --ntasks-per-core=2     # Force Hyper-Threading, will run 2 tasks per core
#SBATCH --constraint=mc         # Use CPU batch system
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file
```
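With hyper-threading enabled, the 88 tasks share the node's 352000 MB, which works out to 4000 MB per task (half of the 8000 MB available per physical core). A quick check with plain shell arithmetic:

```bash
#!/bin/bash
# 44 physical cores x 2 hyper-threads = 88 tasks sharing Max.Mem/Node.
tasks=88
max_mem_node_mb=352000

echo $((max_mem_node_mb / tasks))   # prints 4000 (MB per task)
```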
Since the memory per thread is 4000MB, in total this job can use up to 352000MB of memory, which is the whole memory of the node.

In contrast, the following example runs one task per core and explicitly requests the whole memory of the node:

```bash
#SBATCH --ntasks=44             # Job will run 44 tasks
#SBATCH --ntasks-per-core=1     # Force no Hyper-Threading, will run 1 task per core
#SBATCH --mem=352000            # Define the whole memory of the node
#SBATCH --constraint=mc         # Use CPU batch system
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file
```