Slurm Configuration: Running jobs
Before starting to use the cluster, please read the following rules:

1. Always try to **estimate** and **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * This will ease *scheduling* and *backfilling*.
   * Slurm will schedule the queued jobs more efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***.
2. Try to optimize your jobs for running within **one day**. Please consider the following:
3. It is **forbidden** to run **very short jobs**:
   * Running very short jobs causes a lot of overhead.
   * ***Question:*** Is my job a very short job?
     * ***Answer:*** If it lasts a few seconds, or very few minutes, then yes.
   * ***Question:*** How long should my job run?
     * ***Answer:*** As a *rule of thumb*, anything from 5 minutes starts being acceptable, and from 15 minutes is preferred.
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering similar jobs instead (see the sketch after the tip below).

{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing them to <i>run within one day</i> will help keep the system used fairly and efficiently.
{{site.data.alerts.end}}
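
As an example of gathering related jobs, a minimal array-job sketch could look as follows (the script contents, file names and index range are purely illustrative):

```bash
#!/bin/bash
#SBATCH --array=1-100    # Run 100 related tasks as a single array job
#SBATCH --time=00:30:00  # Run time per array task

# Each array task processes its own input, selected via the array index
srun ./mytask input_${SLURM_ARRAY_TASK_ID}.dat
```
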
## Basic commands for running batch scripts

* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
  * As an alternative, ``mpirun`` and ``mpiexec`` can be used. However, it is ***strongly recommended to use ``srun``*** instead.
* Use **``squeue``** for checking the status of jobs.
* Use **``scancel``** for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use the Linux <b><u>man pages</u></b> (e.g. <i>man sbatch</i>) for checking the available options of the above commands.
{{site.data.alerts.end}}
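
As an illustration, a minimal workflow might look as follows (the script name and job ID are hypothetical):

```bash
# Submit the batch script; Slurm replies with the assigned job ID
sbatch myjob.sh    # e.g. "Submitted batch job 12345"

# Check the status of your queued and running jobs
squeue -u $USER

# Cancel the job, if no longer needed, by its job ID
scancel 12345
```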

## Basic settings

For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters may change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` and ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.

* **Clusters:** For running jobs on the Merlin6 CPU and GPU nodes, users should add the following option:
  ```bash
  #SBATCH --clusters=merlin6
  ```
  Users with proper access can also use the `merlin5` cluster.

* **Partitions:** Except when using the *default* partition, one needs to specify the partition:
  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)

  The partition can be set as follows:
  ```bash
  #SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default'
  ```
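
  For example, a job expected to finish within one hour could choose the ``hourly`` partition (the choice here is purely illustrative):
  ```bash
  #SBATCH --clusters=merlin6
  #SBATCH --partition=hourly
  ```
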
* **[Optional] Disabling shared nodes:** By default, nodes can be shared among jobs from multiple users, while ensuring that the CPU, memory and GPU resources requested by each job are dedicated to it.
  One can request exclusive usage of a node (or set of nodes) with the following option:
  ```bash
  #SBATCH --exclusive  # Only if you want a dedicated node
  ```

* **Time:** It is important to define how long a job should realistically run. This will help Slurm with *scheduling* and *backfilling*, by managing the job queues in a more efficient way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for the `DefaultTime` and `MaxTime` values.
  ```bash
  #SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Cannot exceed the partition 'MaxTime'
  ```
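
  For instance, the limits of a partition can be checked before choosing a value (the partition name here is illustrative):
  ```bash
  # Show the DefaultTime and MaxTime limits of the 'daily' partition
  scontrol show partition daily | grep -E 'DefaultTime|MaxTime'
  ```
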
* **Output and error files:** By default, Slurm generates the standard output and standard error files in the directory from which you submit the batch script:
  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.

  If you want to change the default names, this can be done with the options ``--output`` and ``--error``. For example:
  ```bash
  #SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
  #SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid
  ```
  Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for the full specification of **filename patterns**.
* **Multithreading/No-Multithreading:** Whether a node has multithreading or not depends on the node configuration. By default, nodes with hyperthreading (HT) have it enabled, but one can explicitly request the desired behaviour with the `--hint` option as follows:
  ```bash
  #SBATCH --hint=multithread    # Use extra threads with in-core multi-threading
  #SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading
  ```
  Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (among other options) in addition to `--hint`. Please contact us in case of doubt.
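
  As a minimal sketch (the values are purely illustrative), a hybrid MPI/OpenMP job disabling hyperthreading might combine these options as follows:
  ```bash
  #SBATCH --hint=nomultithread  # One thread per physical core
  #SBATCH --ntasks=4            # 4 MPI tasks
  #SBATCH --cpus-per-task=8     # 8 OpenMP threads per MPI task
  ```
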
### GPU specific settings

The following settings are required for running on the GPU nodes:

* **Slurm account:** When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:
  ```bash
  #SBATCH --account=merlin-gpu  # The account 'merlin-gpu' must be used for GPUs
  ```
* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:
  ```bash
  #SBATCH --gres=gpu  # Always set at least this option when using GPUs
  ```
  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gres` options.
* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying the GPUs as a consumable resource. These are the following:
  * `--gpus`, used *instead of* (but also combinable with) `--gres=gpu`: specifies the total number of GPUs required for the job.
  * `--cpus-per-gpu`: specifies the number of CPUs to be used for each GPU.
  * `--mem-per-gpu`: specifies the amount of memory to be used for each GPU.
  * `--gpus-per-node`, `--gpus-per-socket`, `--gpus-per-task`: specify how many GPUs need to be allocated per node, socket or task.
  * Other advanced options (e.g. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (e.g. *`man sbatch`*) for further information.

  Please read **[GPU advanced settings](/merlin6/running-gpu-jobs.html#gpu-advanced-settings)** below for other `--gpus` options.
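
  As an illustrative sketch (all values are hypothetical), a GPU job using these options might request its resources as follows:
  ```bash
  #SBATCH --account=merlin-gpu  # Mandatory account for GPU jobs
  #SBATCH --gpus=2              # 2 GPUs in total for the job
  #SBATCH --cpus-per-gpu=5      # 5 CPUs for each allocated GPU
  #SBATCH --mem-per-gpu=16000   # 16000 MB of memory for each allocated GPU
  ```
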
#### GPU advanced settings

GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process must be used.

**Until 08.01.2021**, users can define which GPU resources, and *how many per node*, they need with the ``--gres`` option.
Valid ``gres`` options are ``gpu[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus requested per node>``. For example:
```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```

**From 08.01.2021**, `--gres` is not needed anymore (although it can still be used); `--gpus` and other related options should replace it. `--gpus` works in a similar way, but without the need of specifying the `gpu` resource. In other words, the `--gpus` options are ``[[:type]:count]``, where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus to use>``. For example:
```bash
#SBATCH --gpus=GTX1080:4  # Use 4 GPUs of type GTX1080
```
This setting can be combined with other options, such as `--gpus-per-node`, to accomplish behaviour similar to `--gres`.
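
For instance, a hypothetical sketch spreading the requested GPUs over two nodes:

```bash
#SBATCH --gpus=GTX1080:4   # 4 x GTX1080 GPUs in total for the job...
#SBATCH --gpus-per-node=2  # ...allocated as 2 GPUs on each of 2 nodes
```
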
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> for the available <i>types</i> and for details of the NUMA node.
{{site.data.alerts.end}}

## Batch script templates

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>    # Specify GPU partition
#SBATCH --gpus="<type>:<number_gpus>"  # You should specify at least 'gpu'
#SBATCH --time=<D-HH:MM:SS>            # Strongly recommended
#SBATCH --output=<output_file>         # Generate custom output file
#SBATCH --error=<error_file>           # Generate custom error file

## Advanced options example
##SBATCH --nodes=1            # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1           # Uncomment and specify the number of tasks to run
##SBATCH --cpus-per-gpu=5     # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000  # Uncomment and specify the amount of memory per GPU (in MB)
##SBATCH --gpus-per-node=2    # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=2  # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=1    # Uncomment and specify the number of GPUs per task
```

## Advanced configurations