---
title: Running Slurm Scripts
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---

The rules

Before starting to use the cluster, please read the following rules:

  1. Always try to estimate and define a proper run time for your jobs:
    • Use --time=<D-HH:MM:SS> for that.
    • This will ease the scheduling: Slurm can schedule the queued jobs more efficiently.
    • For very long runs, please consider using Job Arrays with Checkpointing.
  2. Try to optimize your jobs for running within one day. Please consider the following:
    • Some software can simply scale up by using more nodes while drastically reducing the run time.
    • Some software allows saving a specific state, so that a second job can start from that state.
    • Job Arrays with Checkpointing can help you with that.
    • Use the 'daily' partition when you are sure that your job can run within one day:
      • 'daily' will give you more priority than running in the 'general' queue!
  3. Running very short jobs is forbidden:
    • Jobs lasting only a few seconds can cause severe problems.
    • Very short jobs cause a lot of scheduling overhead.
    • Question: Is my job a very short job?
      • Answer: If it finishes within a few seconds or a very few minutes, yes.
    • Question: How long should my job run?
      • Answer: As a rule of thumb, from 5 minutes on is acceptable, and from 15 minutes on is preferred.
    • Use Packed Jobs for running a large number of short tasks.
    • For short runs lasting less than 1 hour, please use the 'hourly' partition.
      • 'hourly' will give you more priority than running in the 'daily' queue!
  4. Do not submit hundreds of similar jobs! Use Job Arrays or Packed Jobs instead (see below).

Basic commands for running batch scripts

sbatch is the command used for submitting a batch script to Slurm.

  • Use srun to run parallel tasks.
    • As an alternative, mpirun and mpiexec can be used. However, it is strongly recommended to use srun instead.
  • Use squeue for checking the status of your jobs.
  • Use scancel for deleting a job from the queue.
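
For illustration, a typical minimal session could look as follows (myscript.sh and the job ID are placeholders):

sbatch myscript.sh       # Submit the batch script; Slurm replies with the assigned job ID
squeue -u $USER          # Check the status of your own jobs
scancel 1234567          # Cancel a queued or running job, using its job ID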

Basic settings

For a complete list of available options and parameters it is recommended to use the man pages (man sbatch, man srun, man salloc). Please notice that the behaviour of some parameters may change depending on the command (for example, the --exclusive behaviour in sbatch differs from the one in srun).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

Clusters

  • For running jobs in the Merlin6 computing nodes, users have to add the following option:

    #SBATCH --clusters=merlin6
    
  • For running jobs in the Merlin5 computing nodes, users have to add the following options:

    #SBATCH --clusters=merlin5
    

For advanced users: if you do not care where your jobs run (Merlin5 or Merlin6), you can skip this setting. However, you must make sure that your code can run on both clusters without any problem and that you have defined proper settings for both in your batch script.
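
As a sketch, the cluster can equally be selected on the command line at submission time (myscript.sh is a placeholder); command-line options override the corresponding #SBATCH directives in the script:

sbatch --clusters=merlin5 myscript.sh   # Submit to Merlin5 without editing the script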

Partitions

Merlin6 contains 4 partitions for general purpose, while Merlin5 contains 1 single CPU partition (for historical reasons):

  • Merlin6 has 3 CPU partitions: general, daily and hourly.
  • Merlin6 has 1 GPU partition: gpu.
  • Merlin5 has 1 CPU partition: merlin.

For Merlin6, if no partition is defined, general will be the default; for Merlin5 the default is merlin. The partition can be changed by defining the --partition option as follows:

#SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default' in Merlin6.

Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.

Hyperthreaded vs non-Hyperthreaded jobs

Computing nodes in Merlin6 have hyperthreading enabled: every core runs two threads. In many cases hyperthreading needs to be disabled, as only multithread-based applications will benefit from it. Depending on the job type, users must apply the following parameters:

  • For hyperthreaded jobs, users must specify the following options:

    #SBATCH --hint=multithread   # Mandatory for multithreaded jobs
    #SBATCH --ntasks-per-core=2  # Only needed when a task fits into a core
    
  • For non-hyperthreaded jobs, users must specify the following options:

    #SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
    #SBATCH --ntasks-per-core=1  # Only needed when a task fits into a core
    

{{site.data.alerts.tip}} In general, --hint=[no]multithread is a mandatory field. On the other hand, --ntasks-per-core is only needed when one has to define how a task should be handled within a core; this setting will generally not be used in Hybrid MPI/OpenMP jobs, where multiple cores are needed for a single task. A sketch of such a hybrid job follows below. {{site.data.alerts.end}}
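
To illustrate the hybrid case mentioned in the tip, the following sketch shows a hypothetical MPI/OpenMP job where each task spans several physical cores (the executable name and the task/core counts are only examples):

#!/bin/bash
#SBATCH --hint=nomultithread                # One thread per physical core
#SBATCH --ntasks=8                          # 8 MPI tasks...
#SBATCH --cpus-per-task=4                   # ...each spanning 4 cores for its OpenMP threads

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK # Match OpenMP threads to the allocated cores
srun ./my_hybrid_program                    # Hypothetical hybrid MPI/OpenMP executable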

Shared vs exclusive nodes

The Merlin5 and Merlin6 clusters are designed in a way that allows running MPI/OpenMP processes as well as single-core jobs. To allow this co-existence, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. Users can change this behaviour if they require exclusive usage of nodes.

By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way, mixed nodes are filled up first, which ensures that fully free nodes remain available for MPI/OpenMP jobs.

Exclusive usage of a node can be requested by specifying the --exclusive option as follows:

#SBATCH --exclusive

Time

There are some settings that are not mandatory, but that are useful or even necessary to specify. These are the following:

  • --time: mostly used when you need to specify longer runs in the general partition, but also useful for specifying shorter times. This will affect scheduling priorities, hence it is important to define it (and to define it properly).

    #SBATCH --time=<D-HH:MM:SS>   # Time job needs to run
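
For instance, a job expected to run for at most one and a half days would specify:

    #SBATCH --time=1-12:00:00     # 1 day and 12 hours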
    

Output and Errors

By default, Slurm will write both the standard output and the standard error of the job into a single file, created in the directory from which you submit the batch script:

  • standard output and standard error will both be written into a file slurm-$SLURM_JOB_ID.out.

If you want to change the default names, this can be done with the --output and --error options. For example:

#SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
#SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid

Use man sbatch (e.g. man sbatch | grep -A36 '^filename pattern') to get the full specification of the available filename patterns.
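
As a short overview, commonly used patterns include %j (job ID), %N (short hostname of the first node), %u (user name), %x (job name) and, for array jobs, %A (master job ID) and %a (array task index). For example:

#SBATCH --output=logs/%x.%A_%a.out     # e.g. 'myJob.1234567_3.out' for an array job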

GPU specific settings

Slurm account

When using GPUs, users must switch to the merlin-gpu Slurm account in order to be able to run on GPU-based nodes. This is done with the --account setting as follows:

#SBATCH --account=merlin-gpu     # The account 'merlin-gpu' must be used

GRES

The following options are mandatory settings that must be included in your batch scripts:

#SBATCH --gres=gpu         # Always set at least this option when using GPUs

GRES advanced settings

GPUs are also a shared resource, hence multiple users can run jobs on a single node. However, each GPU must be used by only one user process. Users can define which GPU resources they need with the --gres option. Valid gres options are gpu[[:type]:count], where type=GTX1080|GTX1080Ti and count=<number of GPUs to use>.

For example:

#SBATCH --gres=gpu:GTX1080:4   # Use a node with 4 x GTX1080 GPUs

Important note: due to a bug in the configuration, [:type] (i.e. GTX1080 or GTX1080Ti) is not working. Users should skip it and use only gpu[:count], as shown below. This will be fixed in one of the upcoming downtimes, as it requires a full restart of the batch system.
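
Hence, until the fix is deployed, GPU requests should look like the following sketch:

#SBATCH --gres=gpu:4       # Request 4 GPUs of any type (no [:type] field)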

Batch script templates

CPU-based job templates

The following examples apply to the Merlin6 cluster.

Non-multithreaded jobs template

The following template should be used by any user submitting jobs to CPU nodes:

#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error  file
#SBATCH --hint=nomultithread                # Mandatory for non-multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1                # Only mandatory for non-multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify #nodes to use
##SBATCH --ntasks=44                        # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44               # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44                 # Uncomment and specify the number of cores per task

Multithreaded jobs template

The following template should be used by any user submitting jobs to CPU nodes:

#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error  file
#SBATCH --hint=multithread                  # Mandatory for multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2                # Only mandatory for multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify #nodes to use
##SBATCH --ntasks=88                        # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88               # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88                 # Uncomment and specify the number of cores per task

GPU-based job templates

The following template should be used by any user submitting jobs to GPU nodes:

#!/bin/bash
#SBATCH --partition=gpu_<general|daily|hourly> # Specify 'general' or 'daily' or 'hourly'
#SBATCH --gres="gpu:<type>:<number_gpus>"   # You should specify at least 'gpu'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error  file
#SBATCH --account=merlin-gpu                # The account 'merlin-gpu' must be used
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify number of nodes to use
##SBATCH --ntasks=20                        # Uncomment and specify number of tasks to use
##SBATCH --ntasks-per-node=20               # Uncomment and specify number of tasks per node
##SBATCH --cpus-per-task=1                  # Uncomment and specify the number of cores per task

Advanced configurations

If you need to run a large number of jobs based on the same executable with systematically varying inputs, e.g. for a parameter sweep, you can do this most easily in the form of a simple array job.

#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat

This will run 8 independent jobs, where each job can use the counter variable SLURM_ARRAY_TASK_ID defined by Slurm inside of the job's environment to feed the correct input arguments or configuration file to the "myprogram" executable. Each job will receive the same set of configurations (e.g. time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for it). The jobs will get JobIDs like {some-number}_0 to {some-number}_7, and they also will each have their own output file.

Note:

  • Do not use such jobs if you have very short tasks, since each array subjob will incur the full overhead of launching an independent Slurm job. For such cases you should use a packed job (see below).
  • If you want to control how many of these jobs can run in parallel, you can use the #SBATCH --array=1-100%5 syntax. The %5 defines that only 5 subjobs may ever run in parallel.

You can also use the array job approach to run over all files in a directory, substituting the payload with

FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}

Or, for a trivial case, you could supply the values for a parameter scan in the form of an argument list that gets fed to the program using the counter variable. Note that bash arrays are 0-indexed, hence the job array range must match, e.g. --array=0-6 for the seven values below.

ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}

Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and your executable is able to create checkpoint files, you can use this strategy:

#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00       # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1   # Run a 10-job array, one job at a time.
if test -e checkpointfile; then 
     # There is a checkpoint file; resume the simulation from it.
     myprogram --read-checkp checkpointfile
else
     # There is no checkpoint file, start a new simulation.
     myprogram
fi

The %1 in the #SBATCH --array=1-10%1 statement defines that only 1 subjob can ever run in parallel, so subjob n+1 will only start once subjob n has finished. Each subjob reads the checkpoint file if it is present.

Packed jobs: running a large number of short tasks

Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate Slurm job. Use job packing, i.e. you run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using srun with the --exclusive switch (not to be confused with the switch of the same name used in the SBATCH commands). Together with the --ntasks setting, this ensures that only the specified number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for 44 cores (threads), then run the myprog program 1000 times with arguments from 1 to 1000. With the -N1 -n1 -c1 --exclusive options, it ensures that at any point in time at most 44 instances are effectively running, each being allocated one CPU. You can decide to allocate several CPUs or tasks per instance by adapting the corresponding parameters (see the sketch below).

#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44    # Defines the number of parallel tasks
for i in {1..1000}
do
   srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait

Note: The & at the end of the srun line is needed so that the script does not block; the tasks run in the background. The wait command waits for all such background tasks to finish and returns the exit code.
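
If each short task is itself multithreaded, the same pattern applies with adapted parameters. A hypothetical sketch for tasks needing 4 cores each, so that at most 11 of them run concurrently on the 44 requested cores:

#!/bin/bash
#SBATCH --partition=general
#SBATCH --ntasks=11          # At most 11 tasks run concurrently...
#SBATCH --cpus-per-task=4    # ...each with 4 cores (44 cores in total)

for i in {1..1000}
do
   srun -N1 -n1 -c4 --exclusive ./myprog $i &   # 'myprog' as in the example above
done
wait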