---
title: Slurm Examples
keywords: example, template, examples, templates, running jobs, sbatch
last_updated: 28 June 2019
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-examples.html
---

Single core based job examples

Example 1: Hyperthreaded job

In this example we want to use hyperthreading (--ntasks-per-core=2 and --hint=multithread). In our Merlin6 configuration, the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB, hence each task can use up to 8000MB (2 threads x 4000MB).

#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file

Example 2: Non-hyperthreaded job

In this example we do not want hyper-threading (--ntasks-per-core=1 and --hint=nomultithread). In our Merlin6 configuration, the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB. If we do not specify anything else, our single core task will use the default of 4000MB. However, you can double it with --mem-per-cpu=8000 if you require more memory (remember, the second thread will not be used, so we can safely assign the extra 4000MB to the single active thread).

#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=1     # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread    # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file
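
If the full memory of the core is needed, the optional --mem-per-cpu=8000 setting mentioned above can simply be added to the script:

#SBATCH --mem-per-cpu=8000      # Claim the memory of both hardware threads for the single active thread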

Multi core based job examples

Example 1: MPI with Hyper-Threading

In this example we run a job that will run 88 tasks. Merlin6 Apollo nodes have 44 cores, each with hyper-threading enabled. This means that we can run 2 threads per core, 88 threads in total. To accomplish that, users should specify --ntasks-per-core=2 and --hint=multithread.

Use --nodes=1 if you want to use a node exclusively (88 hyperthreaded tasks would fit in a Merlin6 node).

#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=88             # Job will run 88 tasks
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file

Example 2: MPI without Hyper-Threading

In this example, we want to run a job with 44 tasks, and for performance reasons we want to disable hyper-threading. Merlin6 Apollo nodes have 44 cores, each with hyper-threading enabled. To ensure that only 1 thread is used per task, users should specify --ntasks-per-core=1 and --hint=nomultithread. With this configuration, we tell Slurm to run only 1 task per core without hyperthreading. Hence, each task will be assigned to an independent core.

Use --nodes=1 if you want to use a node exclusively (44 non-hyperthreaded tasks would fit in a Merlin6 node).

#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=44             # Job will run 44 tasks
#SBATCH --ntasks-per-core=1     # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread    # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run 
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file

Example 3: Hyperthreaded Hybrid MPI/OpenMP job

In this example, we want to run a hybrid job using MPI and OpenMP with hyperthreading. The job runs 4 MPI tasks with 8 CPUs per task. Each task in our example requires 128GB of memory, so we specify 16000MB per CPU (8 x 16000MB = 128000MB). Notice that since hyperthreading is enabled, Slurm will use 4 cores per task (with hyperthreading, 2 threads -i.e. 2 Slurm CPUs- fit into a core).

#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8
#SBATCH --partition=hourly
#SBATCH --time=01:00:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=multithread

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file

{{site.data.alerts.tip}} Also, always consider that '--mem-per-cpu' x '--cpus-per-task' can never exceed the maximum amount of memory per node (352000MB). {{site.data.alerts.end}}
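
For the script above, the check is 8 CPUs x 16000MB = 128000MB per task, well below the 352000MB limit. If in doubt, the memory configured on each node can be queried with sinfo, for instance:

sinfo -N -o "%N %m"   # node name and configured memory (in MB) per node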

Example 4: Non-hyperthreaded Hybrid MPI/OpenMP job

In this example, we want to run a hybrid job using MPI and OpenMP without hyperthreading. The job runs 4 MPI tasks with 8 CPUs per task. Each task in our example requires 128GB of memory, so we specify 16000MB per CPU (8 x 16000MB = 128000MB). Notice that since hyperthreading is disabled, Slurm will use 8 cores per task (disabling hyperthreading forces the use of only 1 thread -i.e. 1 CPU- per core).

#!/bin/bash -l                                                                  
#SBATCH --clusters=merlin6                                                      
#SBATCH --job-name=test
#SBATCH --ntasks=4                                                              
#SBATCH --ntasks-per-socket=1                                                   
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8                                                       
#SBATCH --partition=hourly
#SBATCH --time=01:00:00                                                      
#SBATCH --output=srun_%j.out                                                    
#SBATCH --error=srun_%j.err                                                     
#SBATCH --hint=nomultithread                                                    

module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC             # where $MYEXEC is a path to your binary file

{{site.data.alerts.tip}} Also, always consider that '--mem-per-cpu' x '--cpus-per-task' can never exceed the maximum amount of memory per node (352000MB). {{site.data.alerts.end}}

Advanced examples

If you need to run a large number of jobs based on the same executable with systematically varying inputs, e.g. for a parameter sweep, you can do this most easily in the form of a simple array job.

#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun $MYEXEC config-file-${SLURM_ARRAY_TASK_ID}.dat

This will run 8 independent jobs, where each job can use the counter variable SLURM_ARRAY_TASK_ID, defined by Slurm inside the job's environment, to feed the correct input arguments or configuration file to the $MYEXEC executable. Each job will receive the same set of settings (e.g. the time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and each of them will also have its own output file.
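
By default, Slurm names each array task's output file from the job and task IDs. If you prefer explicit names, the sbatch filename patterns %A (master job ID) and %a (array task index) can be used, for example:

#SBATCH --output=test-array_%A_%a.out
#SBATCH --error=test-array_%A_%a.err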

Note:

  • Do not use such jobs if you have very short tasks, since each array sub-job will incur the full overhead of launching an independent Slurm job. For such cases you should use a packed job (see below).
  • If you want to control how many of these jobs can run in parallel, you can use the #SBATCH --array=1-100%5 syntax, as shown in the snippet below. The %5 defines that only 5 sub-jobs may ever run in parallel.

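For instance:

#SBATCH --array=1-100%5   # 100 sub-jobs, at most 5 running at any time
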
You can also use an array job approach to run over all files in a directory, substituting the payload with the lines below (note that Bash arrays are zero-indexed, so the --array range should match, e.g. --array=0-7 for 8 files):

FILES=(/path/to/data/*)
srun $MYEXEC ${FILES[$SLURM_ARRAY_TASK_ID]}

Or, for a trivial case, you could supply the values for a parameter scan in the form of an argument list that gets indexed by the counter variable (again with a matching zero-based --array range, e.g. --array=0-6 for the seven values below).

ARGS=(0.05 0.25 0.5 1 2 5 100)
srun $MYEXEC ${ARGS[$SLURM_ARRAY_TASK_ID]}

Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and your executable is able to create checkpoint files, you can use this strategy:

#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00       # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1   # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
     # A checkpoint file exists; resume the simulation from it.
     $MYEXEC --read-checkp checkpointfile
else
     # There is no checkpoint file, start a new simulation.
     $MYEXEC
fi

The %1 in the #SBATCH --array=1-10%1 statement defines that only 1 sub-job can ever run in parallel, so sub-job n+1 will only be started once sub-job n has finished. Each sub-job will read the checkpoint file if it is present.

Packed jobs: running a large number of short tasks

Since launching a Slurm job incurs some overhead, you should not submit each short task as a separate Slurm job. Use job packing, i.e. run the short tasks within a loop inside a single Slurm job.

You can launch the short tasks using srun with the --exclusive switch (not to be confused with the switch of the same name used in SBATCH directives). This switch ensures that each task step gets dedicated CPUs within the allocation, so at most as many tasks as allocated CPUs will run in parallel.

As an example, the following job submission script asks Slurm for 44 cores (threads), then runs the $MYEXEC program 1000 times with arguments from 1 to 1000. With the -N1 -n1 -c1 --exclusive options, it ensures that at any point in time at most 44 instances are effectively running, each being allocated one CPU. You can decide to allocate several CPUs or tasks per instance by adapting the corresponding parameters.

#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44    # defines the number of parallel tasks
for i in {1..1000}
do
   srun -N1 -n1 -c1 --exclusive $MYEXEC $i &
done
wait

Note: The & at the end of the srun line is needed so that the script does not block on each individual task. The wait command then waits for all these background tasks to finish before the job ends.

Hands-On Example

Copy-paste the following example into a file called myAdvancedTest.batch:

#!/bin/bash
#SBATCH --partition=daily    # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=2            # number of nodes
#SBATCH --ntasks=44          # number of tasks
#SBATCH --ntasks-per-core=1  # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading

module load gcc/9.2.0 openmpi/3.1.5-1_merlin6
module list

echo "Example no-MPI:" ; hostname        # will print one hostname per node
echo "Example MPI:"    ; srun hostname # will print one hostname per ntask

In the above example, the options --nodes=2 and --ntasks=44 are specified. This means that up to 2 nodes are requested and 44 tasks are expected to run, hence 44 cores are needed for running the job. Slurm will try to allocate a maximum of 2 nodes which together provide at least 44 cores. Since our nodes have 44 cores each, if a node is empty (no other users have jobs running there), the job can land on a single node (it has enough cores to run all 44 tasks).

If we want to ensure that the job uses at least two different nodes (e.g. for boosting CPU frequency, or because the job requires more memory per core), other options should be specified.

A good example is --ntasks-per-node=22. This will distribute the 44 tasks equally, running 22 tasks on each of the 2 nodes.

#SBATCH --ntasks-per-node=22

A different example could specify how much memory per core is needed. For instance, --mem-per-cpu=32000 will reserve ~32000MB per core. Since an Apollo node has a maximum of 352000MB, Slurm will only be able to allocate 11 cores per node (11 cores x 32000MB = 352000MB). This means that 4 nodes would be needed (at most 11 tasks per node due to the memory request, and we need to run 44 tasks), so in this case we need to change to --nodes=4 (or remove --nodes). Alternatively, we can decrease --mem-per-cpu to a value that lets all 44 tasks fit on 2 nodes (e.g. with 16000MB per core, 22 cores can be used per node, so 2 nodes are enough).

#SBATCH --mem-per-cpu=16000

Finally, in order to ensure exclusivity of the node, the option --exclusive can be used (see below). This will ensure that the requested nodes are exclusive to the job (no other user's jobs will run on these nodes, and only completely free nodes will be allocated).

#SBATCH --exclusive

This can be combined with the previous examples.
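
For illustration, a possible combined header based on the examples in this section could look as follows (a sketch; the values are those discussed above):

#!/bin/bash
#SBATCH --partition=daily        # name of slurm partition to submit
#SBATCH --time=2:00:00           # limit the execution of this job to 2 hours
#SBATCH --nodes=2                # number of nodes
#SBATCH --ntasks=44              # number of tasks
#SBATCH --ntasks-per-node=22     # equally distribute the tasks over the 2 nodes
#SBATCH --mem-per-cpu=16000      # memory per core (22 x 16000MB fit on one node)
#SBATCH --ntasks-per-core=1      # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread     # Don't use extra threads with in-core multi-threading
#SBATCH --exclusive              # allocate the nodes exclusively for this job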

More advanced configurations can be defined and combined with the previous examples. More information about advanced options can be found at https://slurm.schedmd.com/sbatch.html (or by running 'man sbatch').

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run advanced configurations unless you are sure of what you are doing.