
title: Running Jobs
last_updated: 18 June 2019
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html

Commands for running jobs

  • sbatch: to submit a batch script to Slurm
    • squeue: for checking the status of your jobs
    • scancel: for deleting a job from the queue
  • srun: to run parallel jobs in the batch system
  • salloc: to obtain a Slurm job allocation (a set of nodes), execute command(s), and then release the allocation when the command is finished.
    • salloc is equivalent to an interactive run
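
For instance, a typical workflow with these commands might look like the following sketch (the script name mytest.batch and the job ID 1234567 are hypothetical placeholders):

sbatch mytest.batch      # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER          # check the status of all your queued and running jobs
scancel 1234567          # cancel the job with the given job ID, if needed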

Shared nodes and exclusivity

The Merlin6 cluster has been designed to support MPI/OpenMP jobs as well as single core jobs. To allow them to co-exist, nodes are configured in shared mode by default. This means that multiple jobs from multiple users may land on the same node. Users can change this behaviour if they require exclusive usage of nodes.

By default, Slurm will try to allocate jobs on nodes that are already occupied by processes not requiring exclusive usage of a node. In this way, mixed nodes are filled up first, ensuring that fully free nodes remain available for MPI/OpenMP jobs.

Exclusivity of a node can be set up by specifying the --exclusive option as follows:

#SBATCH --exclusive
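
When --exclusive is set, Slurm allocates the node(s) entirely to your job and no other jobs can share them, so request it only when your job really benefits from whole nodes.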

Output and Errors

By default, Slurm will write the standard output and standard error of your job to files in the directory from which you submitted the batch script:

  • standard output will be written into a file slurm-$SLURM_JOB_ID.out.
  • standard error will be written into a file slurm-$SLURM_JOB_ID.err.

If you want to change the default names, you can do so with the --output and --error options. For example:

#SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
#SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid

Use man sbatch (e.g. man sbatch | grep -A36 '^filename pattern') to get the full specification of filename patterns.
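
Note that Slurm will not create missing directories for the output and error files: with the logs/ example above, make sure the directory exists before submitting the job, e.g.:

mkdir -p logs   # create the log directory before running sbatch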

Partitions

Merlin6 provides three general purpose partitions: general, daily and hourly. If no partition is specified, general is used by default. The partition can be selected with the --partition option as follows:

#SBATCH --partition=<general|daily|hourly>  # Name of the Slurm partition to submit to. 'general' is the default.

Please check the section [Slurm Configuration#Merlin6 Slurm Partitions] for more information about Merlin6 partition setup.
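
To quickly inspect the available partitions and their limits from the command line, you can also query Slurm directly; a minimal sketch:

sinfo -o "%P %l %a %D"   # show partition name, time limit, availability and node count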

CPU-based Jobs Settings

CPU-based jobs are available to all PSI users. Users must belong to the merlin6 Slurm account in order to run on the CPU-based nodes. All users registered in Merlin6 are automatically included in this account.

Slurm CPU Mandatory Settings

The following options are mandatory settings that must be included in your batch scripts:

#SBATCH --constraint=mc   # Always set it to 'mc' for CPU jobs.

Some settings are not mandatory, but may be needed or useful to specify:

  • --time: mostly used when you need to specify longer runs in the general partition, but also useful for requesting shorter times. This may affect scheduling priority.

    #SBATCH --time=<D-HH:MM:SS>   # Time job needs to run
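
Putting these settings together, a minimal CPU batch script could look like the following sketch (the job name, log file paths and the srun command line are hypothetical placeholders):

#!/bin/bash
#SBATCH --job-name=myCpuJob             # Hypothetical job name
#SBATCH --constraint=mc                 # Mandatory for CPU jobs
#SBATCH --partition=daily               # One of: general, daily, hourly
#SBATCH --time=0-01:00:00               # Time the job needs to run (D-HH:MM:SS)
#SBATCH --output=logs/myCpuJob.%N.%j.out
#SBATCH --error=logs/myCpuJob.%N.%j.err

srun my_application                     # Hypothetical application run as a parallel job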
    

GPU-based Jobs Settings

GPU-based jobs are restricted to BIO users; however, access for other PSI users can be requested on demand. Users must belong to the merlin6-gpu Slurm account in order to run on the GPU-based nodes. BIO users belonging to any BIO group are automatically registered in the merlin6-gpu account. Other users should request access from the Merlin6 administrators.

Slurm GPU Mandatory Settings

The following options are mandatory settings that must be included in your batch scripts:

#SBATCH --constraint=gpu   # Always set it to 'gpu' for GPU jobs.
#SBATCH --gres=gpu         # Always set at least this option when using GPUs

GPUs are also a shared resource: multiple users can run jobs on a single node, but each user process must use no more than one GPU. Users can specify which GPU resources they need with the --gres option, according to the following rules:

  • All machines except merlin-g-001 have up to 4 GPUs. merlin-g-001 has up to 2 GPUs.
  • Two different NVIDIA GPU models exist: GTX1080 and GTX1080Ti.

Valid --gres specifications have the form gpu[[:type]:count], where:

  • type: can be GTX1080 or GTX1080Ti
  • count: the number of GPUs to use

For example:

#SBATCH --gres=gpu:GTX1080:4   # Use 4 x GTX1080 GPUs
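
As a complete sketch, a minimal GPU batch script based on the settings above could look as follows (the job name, requested GPU count and application are hypothetical; the --account line may be unnecessary if merlin6-gpu is already your default account):

#!/bin/bash
#SBATCH --job-name=myGpuJob             # Hypothetical job name
#SBATCH --constraint=gpu                # Mandatory for GPU jobs
#SBATCH --gres=gpu:GTX1080:2            # Request 2 x GTX1080 GPUs
#SBATCH --account=merlin6-gpu           # GPU Slurm account (assumption: may be selected automatically)
#SBATCH --time=0-01:00:00               # Time the job needs to run (D-HH:MM:SS)

srun my_gpu_application                 # Hypothetical GPU application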