Slurm Configuration: Running jobs

Commit 701f5f6d11 (parent a107b80bd0), 2020-12-23 15:36:33 +01:00


@@ -15,7 +15,7 @@ Before starting using the cluster, please read the following rules:
1. Always try to **estimate** and to **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * This will ease *scheduling* and *backfilling*.
   * Slurm will schedule the queued jobs efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***.
2. Try to optimize your jobs for running within **one day**. Please consider the following:
@@ -37,146 +37,117 @@ Before starting using the cluster, please read the following rules:
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.
{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing the jobs to <i>run within one day</i> will help keep the system used fairly and efficiently.
{{site.data.alerts.end}}
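As a sketch of the rules above, a very long run can be split into shorter array tasks gathered in a single job; the script below is only an illustrative template (the program name `mytask` and the array range are hypothetical):

```shell
#!/bin/bash
#SBATCH --clusters=merlin6
#SBATCH --time=1-00:00:00   # job optimized to run within one day
#SBATCH --array=0-99        # gather 100 related runs into a single array job

# 'mytask' is a hypothetical placeholder; each array task processes its own chunk
srun ./mytask --chunk "${SLURM_ARRAY_TASK_ID}"
```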
## Basic commands for running batch scripts
* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking the status of jobs.
* Use **``scancel``** for cancelling/deleting a job from the queue.
{{site.data.alerts.tip}}Use the Linux <b><u>man pages</u></b> (e.g. <i>man sbatch</i>) for checking the available options for the above commands.
{{site.data.alerts.end}}
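As a usage sketch of the commands above (the script name `myjob.sh` and the job ID `1234` are hypothetical):

```shell
sbatch myjob.sh      # submit the batch script; Slurm prints the assigned job ID
squeue -u "$USER"    # check the status of your queued and running jobs
scancel 1234         # cancel/delete job 1234 from the queue
```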
## Basic settings
For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` and ``srun``).
In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
### Common settings
The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more
information about all possible options. Also, do not hesitate to contact us with any questions.
* **Clusters:** For running jobs on the Merlin6 CPU and GPU nodes, users should add the following option:
```bash
#SBATCH --clusters=merlin6
```
Users with proper access can also use the `merlin5` cluster.
* **Partitions:** Except when using the *default* partition, one needs to specify the partition:
  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)

  The partition can be set as follows:
```bash
#SBATCH --partition=<partition_name>  # Partition to use. 'general' is the default
```
* **[Optional] Disabling shared nodes:** By default, nodes can run jobs from multiple users, while ensuring that CPU/memory/GPU resources are dedicated to each job.
One can request exclusive usage of a node (or set of nodes) with the following option:
```bash
#SBATCH --exclusive  # Only if you want a dedicated node
```
* **Time:** It is important to define how long a job should realistically run. This will help Slurm with *scheduling* and *backfilling*, by managing job queues in a more efficient
way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for
the `DefaultTime` and `MaxTime` values.
```bash
#SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Cannot exceed the partition 'MaxTime'
```
* **Output and error files:** By default, Slurm will generate the standard output and standard error files in the directory from where you submit the batch script:
  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.

  If you want to change the default names, it can be done with the options ``--output`` and ``--error``. For example:
```bash
#SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and job ID
#SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and job ID
```
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the full specification of **filename patterns**.
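As an illustration of the two patterns used above (the node name and job ID below are hypothetical), ``%N`` expands to the short hostname of the allocated node and ``%j`` to the job ID. A plain-bash sketch of that expansion, with no Slurm involved:

```shell
# Simulate how the pattern would expand for a hypothetical
# job 1234 running on node merlin-c-018:
pattern="logs/myJob.%N.%j.out"
out="${pattern/\%N/merlin-c-018}"   # %N -> node hostname
out="${out/\%j/1234}"               # %j -> job ID
echo "$out"                         # logs/myJob.merlin-c-018.1234.out
```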
* **Multithreading/No-Multithreading:** Whether a node has multithreading or not depends on the node configuration. By default, HT nodes have HT enabled, and one can ensure this behaviour with the option `--hint` as follows:
```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```
Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to `--hint`. Please contact us in case of doubts.
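For instance, a hybrid MPI/OpenMP job could combine these options as follows. This is only a sketch: the task and core counts as well as the `./hybrid_app` binary are hypothetical:

```shell
#SBATCH --ntasks=4            # 4 MPI tasks
#SBATCH --cpus-per-task=8     # 8 cores per task, for the OpenMP threads
#SBATCH --hint=nomultithread  # one thread per physical core

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_app             # hypothetical MPI/OpenMP binary
```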
### GPU specific settings
The following settings are required for running on the GPU nodes:
* **Slurm account:** When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:
```bash
#SBATCH --account=merlin-gpu # The account 'merlin-gpu' must be used for GPUs
```
* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:
```bash
#SBATCH --gres=gpu # Always set at least this option when using GPUs
```
Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gres` options.
* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying
the GPUs as a consumable resource. These are the following:
  * `--gpus`, *instead of* (but also in addition to) `--gres=gpu`: specifies the total number of GPUs required for the job.
  * `--cpus-per-gpu`, to specify the number of CPUs to be used for each GPU.
  * `--mem-per-gpu`, to specify the amount of memory to be used for each GPU.
  * `--gpus-per-node`, `--gpus-per-socket`, `--gpus-per-task`, to specify how many GPUs per node, socket and/or task need to be allocated.
  * Other advanced options (e.g. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (e.g. *`man sbatch`*) for further information.
Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gpus` options.
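As a sketch of how the new options fit together (the GPU, CPU, and memory figures below are arbitrary examples, not recommendations):

```shell
#SBATCH --account=merlin-gpu  # mandatory Slurm account for GPU nodes
#SBATCH --gpus=2              # 2 GPUs in total for the job
#SBATCH --cpus-per-gpu=5      # 5 CPUs allocated per GPU
#SBATCH --mem-per-gpu=16000   # 16000 MB of memory per GPU
```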
#### GPU advanced settings
GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used.
This is done according to the following rules:
**Until 08.01.2021**, users can define which GPU resources and *how many per node* they need with the ``--gres`` option.
Valid ``gres`` options are ``gpu[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus requested per node>``. For example:
```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```
**From 08.01.2021**, `--gres` is not needed anymore (but can still be used), and `--gpus` and the related options should replace it. `--gpus` works in a similar way, but without
the need of specifying the `gpu` resource. In other words, `--gpus` options are ``[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus to use>``. For example:
```bash
#SBATCH --gpus=GTX1080:4 # Use 4 GPUs with Type=GTX1080
```
This setting can be combined with other options, such as `--gpus-per-node`, in order to accomplish a behaviour similar to `--gres`.
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> for the available <i>Types</i> and for details of the NUMA nodes.
{{site.data.alerts.end}}
## Batch script templates
@@ -232,8 +203,8 @@ The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>    # Specify GPU partition
#SBATCH --gpus="<type>:<number_gpus>"  # Specify the number (and optionally the type) of GPUs
#SBATCH --time=<D-HH:MM:SS>            # Strongly recommended
#SBATCH --output=<output_file>         # Generate custom output file
#SBATCH --error=<error_file>           # Generate custom error file
@@ -242,9 +213,12 @@ The following template should be used by any user submitting jobs to GPU nodes:
## Advanced options example
##SBATCH --nodes=1            # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1           # Uncomment and specify the number of tasks
##SBATCH --cpus-per-gpu=5     # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000  # Uncomment and specify the amount of memory per GPU (in MB)
##SBATCH --gpus-per-node=2    # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=2  # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=1    # Uncomment and specify the number of GPUs per task
```
## Advanced configurations