---
title: Running Slurm Scripts
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---
## The rules

Before starting to use the cluster, please read the following rules:

- Always try to estimate and define a proper run time for your jobs:
  - Use `--time=<D-HH:MM:SS>` for that (a concrete example follows this list).
  - This will ease scheduling and backfilling:
    - Slurm will schedule the queued jobs more efficiently.
  - For very long runs, please consider using Job Arrays with Checkpointing.
- Try to optimize your jobs to run within one day. Please consider the following:
  - Some software can simply scale up by using more nodes while drastically reducing the run time.
  - Some software allows saving a specific state, so that a second job can start from that state:
    - Job Arrays with Checkpointing can help you with that.
- Use the 'daily' partition when you can ensure that your job runs within one day:
  - 'daily' will give you higher priority than running in the 'general' queue!
- Running very short jobs is forbidden:
  - Jobs lasting only a few seconds can cause severe problems.
  - Very short jobs cause a lot of scheduling overhead.
  - Question: is my job a very short job?
    - Answer: if it lasts a few seconds or very few minutes, yes.
  - Question: how long should my job run?
    - Answer: as a rule of thumb, from 5 minutes it starts being acceptable; from 15 minutes is preferred.
  - Use Packed Jobs for running a large number of short tasks.
- For short runs lasting less than 1 hour, please use the 'hourly' partition:
  - 'hourly' will give you higher priority than running in the 'daily' queue!
- Do not submit hundreds of similar jobs!
  - Use Array Jobs for gathering them instead.

{{site.data.alerts.tip}}Having a good estimation of the time needed by your jobs, a proper way of running them, and jobs optimized to run within one day will all contribute to a fair and efficient use of the system. {{site.data.alerts.end}}
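As a quick illustration of the `--time` format mentioned above, the following values (purely illustrative) request 30 minutes and 1 day plus 6 hours, respectively:

```bash
#SBATCH --time=00:30:00    # 30 minutes (HH:MM:SS)
#SBATCH --time=1-06:00:00  # 1 day and 6 hours (D-HH:MM:SS)
```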
## Basic commands for running batch scripts

- Use `sbatch` for submitting a batch script to Slurm.
- Use `srun` for running parallel tasks.
- Use `squeue` for checking the status of your jobs.
- Use `scancel` for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use Linux 'man' pages when needed (i.e. 'man sbatch'), mostly for checking the available options for the above commands. {{site.data.alerts.end}}
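For example, a typical minimal workflow might look as follows (the script name and job ID below are just placeholders):

```bash
sbatch myscript.sh   # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER      # check the status of your queued and running jobs
scancel 1234567      # cancel the job with the given job ID, if needed
```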
## Basic settings

For a complete list of options and parameters, it is recommended to use the man pages (i.e. `man sbatch`, `man srun`, `man salloc`).

Please notice that the behaviour of some parameters might change depending on the command used for running jobs (for example, the `--exclusive` behaviour in `sbatch` differs from the one in `srun`).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the man pages (i.e. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.
- **Clusters:** for running jobs on the Merlin6 CPU and GPU nodes, users need to add the following option:

  ```bash
  #SBATCH --clusters=merlin6
  ```

  Users with proper access can also use the `merlin5` cluster.

- **Partitions:** except when using the default partition, one needs to specify the partition:
  - GPU partitions: `gpu`, `gpu-short` (more details: Slurm GPU Partitions)
  - CPU partitions: `general` (default if no partition is specified), `daily` and `hourly` (more details: Slurm CPU Partitions)

  The partition can be set as follows:

  ```bash
  #SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default'
  ```

- **[Optional] Disabling shared nodes:** by default, nodes can share jobs from multiple users, while ensuring that the CPU/Memory/GPU resources of each job are dedicated. One can request exclusive usage of a node (or set of nodes) with the following option:

  ```bash
  #SBATCH --exclusive  # Only if you want a dedicated node
  ```

- **Time:** it is important to define how long a job will realistically run. This will help Slurm when scheduling and backfilling, by managing the job queues in a more efficient way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or GPU Partition Configuration) for the `DefaultTime` and `MaxTime` values.

  ```bash
  #SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Can not exceed the partition 'MaxTime'
  ```

- **Output and error files:** by default, Slurm will generate the standard output and standard error files in the directory from which you submit the batch script:
  - standard output will be written into a file `slurm-$SLURM_JOB_ID.out`;
  - standard error will be written into a file `slurm-$SLURM_JOB_ID.err`.

  If you want to change the default names, it can be done with the options `--output` and `--error`. For example:

  ```bash
  #SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
  #SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid
  ```

  Use the man pages (`man sbatch | grep -A36 '^filename pattern'`) for getting the full specification of filename patterns.

- **Multithreading/No-Multithreading:** whether a node has multithreading or not depends on the node configuration. By default, HT nodes have HT enabled, but one can ensure this feature with the `--hint` option as follows:

  ```bash
  #SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
  #SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
  ```

  Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to `--hint`. Please contact us in case of doubt.

{{site.data.alerts.tip}} In general, --hint=[no]multithread is a mandatory field. On the other hand, --ntasks-per-core is only needed when one wants to define how a task should be handled within a core, and this setting will generally not be used on Hybrid MPI/OpenMP jobs where multiple cores are needed for single tasks. {{site.data.alerts.end}}
### GPU specific settings

The following settings are required for running on the GPU nodes:

- **Slurm account:** when using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the `--account` setting as follows:

  ```bash
  #SBATCH --account=merlin-gpu  # The account 'merlin-gpu' must be used for GPUs
  ```

- **[Valid until 08.01.2021] GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:

  ```bash
  #SBATCH --gres=gpu  # Always set at least this option when using GPUs
  ```

  This option is still valid, as it might be needed by other resources, but for GPUs the new options (i.e. `--gpus`, `--mem-per-gpu`) can be used, which provide more flexibility when running on GPUs. Please read the GPU advanced settings below for other `--gpus` options.

- **[Valid from 08.01.2021] GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying the GPUs as a consumable resource. These are the following:
  - `--gpus=[<type>:]<number>`, instead of (but also in addition to) `--gres=gpu`: specifies the total number of GPUs required for the job.
  - `--gpus-per-node=[<type>:]<number>`, `--gpus-per-socket=[<type>:]<number>`, `--gpus-per-task=[<type>:]<number>`: specify the number of GPUs to be allocated per node, socket and/or task.
  - `--cpus-per-gpu=<number>`: specifies the number of CPUs to be allocated for each GPU.
  - `--mem-per-gpu=<size>`: specifies the amount of memory to be allocated for each GPU.
  - Other advanced options (i.e. `--gpu-bind`). Please see the man pages for sbatch/srun/salloc (i.e. `man sbatch`) for further information.

  Please read the GPU advanced settings below for other `--gpus` options.

- Please consider that one can specify the GPU `type` in some of the options (an example follows this list). If one needs to specify it, then it must be specified in all options defined in the Slurm job.
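For instance, a small GPU request combining the new options might look as follows (the counts mirror the GPU template further below and are purely illustrative); note that the type, when used, is given consistently:

```bash
#SBATCH --gpus=GTX1080:2     # 2 GPUs of type GTX1080 in total
#SBATCH --cpus-per-gpu=5     # 5 CPUs allocated per GPU
#SBATCH --mem-per-gpu=16000  # 16000 MB of memory allocated per GPU
```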
### GPU advanced settings

GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process must be used.

Until 08.01.2021, users can define which GPU resources they need, and how many per node, with the `--gres` option. Valid `gres` options are `gpu[[:type]:count]`, where `type=GTX1080|GTX1080Ti|RTX2080Ti` and `count=<number of gpus requested per node>`. For example:

```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```

From 08.01.2021, `--gres` is not needed anymore (but can still be used), and `--gpus` and other related options should replace it. `--gpus` works in a similar way, but without the need of specifying the `gpu` resource. In other words, valid `--gpus` options are `[<type>:]<count>`, where `type=GTX1080|GTX1080Ti|RTX2080Ti` (which is optional) and `count=<number of gpus to use>`. For example:

```bash
#SBATCH --gpus=GTX1080:4  # Use 4 GPUs with Type=GTX1080
```

This setting can be combined with other settings, such as `--gpus-per-node`, in order to accomplish a behaviour similar to the one of `--gres` (see the sketch at the end of this section).

- Please consider that one can specify the GPU `type` in some of the options. If one needs to specify it, then it must be specified in all options defined in the Slurm job.

{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' for the available Types and for details of the NUMA node. {{site.data.alerts.end}}
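As a minimal sketch (the GPU counts are illustrative), the two styles below should express an equivalent request on a single node:

```bash
# Old style (until 08.01.2021): 4 x GTX1080 GPUs on the node
#SBATCH --gres=gpu:GTX1080:4

# New style (from 08.01.2021): an equivalent request with the new options
#SBATCH --gpus=GTX1080:4
#SBATCH --gpus-per-node=GTX1080:4
```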
## Batch script templates

### CPU-based jobs templates

The following examples apply to the Merlin6 cluster.

#### Non-multithreaded jobs template

The following template should be used by any user submitting non-multithreaded jobs to CPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --hint=nomultithread                # Mandatory for non-multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1                # Only mandatory for non-multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=44                        # Uncomment and specify the number of tasks to use
##SBATCH --ntasks-per-node=44               # Uncomment and specify the number of tasks per node
##SBATCH --cpus-per-task=44                 # Uncomment and specify the number of cores per task
```

#### Multithreaded jobs template

The following template should be used by any user submitting multithreaded jobs to CPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --hint=multithread                  # Mandatory for multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2                # Only mandatory for multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=88                        # Uncomment and specify the number of tasks to use
##SBATCH --ntasks-per-node=88               # Uncomment and specify the number of tasks per node
##SBATCH --cpus-per-task=88                 # Uncomment and specify the number of cores per task
```

### GPU-based jobs templates

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>         # Specify GPU partition
#SBATCH --gpus="<type>:<num_gpus>"          # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --account=merlin-gpu                # The account 'merlin-gpu' must be used
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1                         # Uncomment and specify the number of tasks to use
##SBATCH --cpus-per-gpu=5                   # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000                # Uncomment and specify the memory (in MB) per GPU
##SBATCH --gpus-per-node=<type>:2           # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2         # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1           # Uncomment and specify the number of GPUs per task
```
## Advanced configurations

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs, e.g. for a parameter sweep, you can do this most easily in the form of a simple array job.

```bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```

This will run 8 independent jobs, where each job can use the counter variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's environment to feed the correct input arguments or configuration file to the `myprogram` executable. Each job will receive the same set of configurations (e.g. the time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and each of them will also have its own output file.

Note:

- Do not use such jobs if you have very short tasks, since each array sub-job will incur the full overhead for launching an independent Slurm job. For such cases you should use a packed job (see below).
- If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` defines that only 5 sub-jobs may ever run in parallel.

You can also use the array job approach to run over all files in a directory, substituting the payload with:

```bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```

Note that bash arrays are indexed from 0, so in this case the array range should also start at 0 (e.g. `#SBATCH --array=0-7` for 8 files).

Or, for a trivial case, you could supply the values for a parameter scan in the form of an argument list that gets fed to the program using the counter variable:

```bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```
### Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and your executable is able to create checkpoint files, you can use this strategy:

```bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00  # each sub-job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1     # Run a 10-job array, one job at a time.

if test -e checkpointfile; then
  # There is a checkpoint file; restart the simulation from it.
  myprogram --read-checkp checkpointfile
else
  # There is no checkpoint file, start a new simulation.
  myprogram
fi
```

The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 sub-job can ever run in parallel, so this will result in sub-job n+1 only being started when sub-job n has finished. Each sub-job will read the checkpoint file if it is present.
### Packed jobs: running a large number of short tasks

Since launching a Slurm job incurs some overhead, you should not submit each short task as a separate Slurm job. Use job packing instead, i.e. run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the `sbatch` option of the same name). This switch ensures that only the specified number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for 44 cores (threads), then it will run the `myprog` program 1000 times with arguments passed from 1 to 1000. But with the `-N1 -n1 -c1 --exclusive` option, it will ensure that at any point in time only 44 instances are effectively running, each being allocated one CPU. You can at this point decide to allocate several CPUs or tasks by adapting the corresponding parameters.

```bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44  # defines the number of parallel tasks

for i in {1..1000}; do
    srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```

Note: the `&` at the end of the `srun` line is needed in order not to have the script waiting (blocking). The `wait` command waits for all such background tasks to finish and returns the exit code.