---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---
## The rules
Before starting to use the cluster, please read the following rules:
1. To ease and improve *scheduling* and *backfilling*, always try to **estimate** and **define a proper run time** for your jobs:
* Use ``--time=<D-HH:MM:SS>`` for that.
* For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs to run within **one day** at most. Please consider the following:
* Some software can simply scale up by using more nodes while drastically reducing the run time.
* Some software allows saving a specific state, so that a second job can start from that state: ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
* Jobs submitted to **`hourly`** get more priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
* Jobs submitted to **`daily`** get more priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
3. It is **forbidden** to run **very short jobs**: they cause a lot of overhead and can also cause severe problems to the main scheduler.
* ***Question:*** Is my job a very short job? ***Answer:*** If it lasts only a few seconds or a very few minutes, yes.
* ***Question:*** How long should my job run? ***Answer:*** As a *rule of thumb*, from 5 minutes onwards starts being acceptable, and from 15 minutes onwards is preferred.
* Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
4. Do not submit hundreds of similar jobs!
* Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for grouping similar jobs instead.
{{site.data.alerts.tip}}Having a good estimate of the <i>time</i> needed by your jobs, choosing a proper way to run them, and optimizing them to <i>run within one day</i> will help keep the system fair and efficient for everyone.
{{site.data.alerts.end}}
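As a minimal illustration of rules 1 and 2, a job expected to run for about six hours would declare a realistic run time and target the `daily` partition. This is only a sketch; `my_application` is a placeholder for your actual program:
```bash
#!/bin/bash
#SBATCH --time=06:00:00      # realistic run time estimate, well below one day
#SBATCH --partition=daily    # shorter than 1 day: 'daily' gets more priority than 'general'

srun ./my_application        # placeholder for your actual program
```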
## Basic commands for running batch scripts
* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking jobs status.
* Use **``scancel``** for cancelling/deleting a job from the queue.
{{site.data.alerts.tip}}Use Linux <b>'man'</b> pages when needed (e.g. <span style="color:orange;">'man sbatch'</span>), especially for checking the available options of the above commands.
{{site.data.alerts.end}}
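A typical command-line workflow with these commands could look as follows (the script name and job ID are illustrative):
```bash
sbatch myscript.sh       # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER          # check the status of your own pending and running jobs
scancel 12345678         # cancel a job, using the job ID reported by sbatch/squeue
```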
## Basic settings
For a complete list of available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` than in ``srun``).
In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
### Common settings
The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us if you have any questions.
* **Clusters:** For running jobs in the different Slurm clusters, users should add the following option:
```bash
#SBATCH --clusters=<cluster_name> # Possible values: merlin5, merlin6, gmerlin6
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.
* **Partitions:** except when using the *default* partition for each cluster, one needs to specify the partition:
```bash
#SBATCH --partition=<partition_name> # Check each cluster documentation for possible values
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.
* **[Optional] Disabling shared nodes**: by default, nodes are not exclusive, hence multiple users may run jobs on the same node. One can request exclusive node usage with the following option:
```bash
#SBATCH --exclusive # Only if you want a dedicated node
```
* **Time**: it is important to define how long a job is expected to run, as realistically as possible. This will help Slurm with *scheduling* and *backfilling*, and will let Slurm manage the job queues in a more efficient way. This value can never exceed the `MaxTime` of the affected partition.
```bash
#SBATCH --time=<D-HH:MM:SS> # Can not exceed the partition `MaxTime`
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about partition `MaxTime` values.
* **Output and error files**: by default, Slurm will generate standard output (``slurm-%j.out``, where `%j` is the job ID) and standard error (``slurm-%j.err``) files in the directory from where the job was submitted. Users can change the default names with the following options:
```bash
#SBATCH --output=<filename> # Can include path. Patterns accepted (i.e. %j)
#SBATCH --error=<filename> # Can include path. Patterns accepted (i.e. %j)
```
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for the full specification of **filename patterns**; an example using these patterns is shown after this list.
* **Enable/Disable Hyper-Threading**: whether a node has Hyper-Threading or not depends on the node configuration. By default, HT nodes have HT enabled, but one should explicitly specify the desired behaviour with one of the following options:
```bash
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about node configuration and Hyper-Threading.
Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to `--hint`. Please contact us in case of doubt.
{{site.data.alerts.tip}} In general, for the `merlin6` cluster <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a recommended option. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one has to define how a task should be handled within a core, and it is generally not used for hybrid MPI/OpenMP jobs where a single task needs multiple cores.
{{site.data.alerts.end}}
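Putting the common settings together, a job header for the `merlin6` cluster could look like the following sketch. The values, job name and `logs/` directory are only examples; note that Slurm will not create the output directory, it must already exist:
```bash
#!/bin/bash
#SBATCH --clusters=merlin6
#SBATCH --partition=daily
#SBATCH --time=0-08:00:00            # must not exceed the partition MaxTime
#SBATCH --job-name=example
#SBATCH --output=logs/%x-%j.out      # filename patterns: %x = job name, %j = job ID
#SBATCH --error=logs/%x-%j.err
#SBATCH --hint=nomultithread         # do not use in-core multi-threading for this job
```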
## Batch script templates
### CPU-based jobs templates
The following examples apply to the **Merlin6** cluster.
#### Non-multithreaded jobs template
The following template should be used by any user submitting jobs to the Merlin6 CPU nodes:
```bash
#!/bin/bash
#SBATCH --cluster=merlin6 # Cluster name
#SBATCH --partition=general,daily,hourly # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1 # Only mandatory for non-multithreaded single tasks
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=44 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
```
#### Multithreaded jobs template
The following template should be used by any user submitting jobs to the Merlin6 CPU nodes:
```bash
#!/bin/bash
#SBATCH --cluster=merlin6 # Cluster name
#SBATCH --partition=general,daily,hourly # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2 # Only mandatory for multithreaded single tasks
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=88 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88 # Uncomment and specify the number of cores per task
```
### GPU-based jobs templates
The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6 # Cluster name
#SBATCH --partition=gpu,gpu-short # Specify one or multiple partitions, or
#SBATCH --partition=gwendolen,gwendolen-long # Only for Gwendolen users
#SBATCH --gpus="<type>:<num_gpus>" # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
##SBATCH --exclusive # Uncomment if you need exclusive node usage
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify number of nodes to use
##SBATCH --ntasks=1 # Uncomment and specify number of tasks to use
##SBATCH --cpus-per-gpu=5 # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000 # Uncomment and specify the memory per GPU (in MB)
##SBATCH --gpus-per-node=<type>:2 # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2 # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1 # Uncomment and specify the number of GPUs per task
```
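Note that `<type>`, `<num_gpus>` and the other angle-bracket values in the template are placeholders and must be replaced (or the line removed) before submission. Options can also be passed on the command line, where they override the `#SBATCH` directives in the script. For example, requesting two GPUs of any type for an existing script (`my_gpu_job.sh` is a placeholder name):
```bash
sbatch --gpus=2 my_gpu_job.sh    # command-line options override #SBATCH directives
```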
## Advanced configurations
### Array Jobs: launching a large number of related jobs
If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in form of a **simple array job**.
``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8
echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```
This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the "myprogram" executable. Each job will receive the same set of
configurations (e.g. time limit of 8h in the example above).
The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and they will also each
have their own output file.
**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
that only 5 sub jobs may ever run in parallel.
You also can use an array job approach to run over all files in a directory, substituting the payload with
``` bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```
Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.
``` bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```
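Keep in mind that bash arrays are zero-indexed, so the `--array` range should match the number of entries. A minimal sketch for the seven arguments above (the partition and run time are only examples) could be:
``` bash
#!/bin/bash
#SBATCH --job-name=param-scan
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --array=0-6              # 7 sub jobs, one per entry (bash arrays are zero-indexed)

ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```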
### Array jobs: running very long tasks with checkpoint files
If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:
``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00 # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
    # There is a checkpoint file; resume from the last saved state.
    myprogram --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    myprogram
fi
```
The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob can ever run in parallel, so
this will result in subjob n+1 only being started when job n has finished. It will read the checkpoint file
if it is present.
### Packed jobs: running a large number of short tasks
Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. you run the short tasks within the loop of a single Slurm job.
You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.
As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `myprog` program 1000 times with
arguments from 1 to 1000. With the `-N1 -n1 -c1 --exclusive`
options, it ensures that at any point in time at most 44
instances are effectively running, each being allocated one CPU. You
can decide to allocate several CPUs or tasks per instance by adapting
the corresponding parameters.
``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44              # defines the number of parallel tasks

for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```
**Note:** The `&` at the end of the `srun` line is needed so that the script does not block waiting for each task; the tasks run in the background.
The final `wait` command waits for all such background tasks to finish and returns the exit code.