Slurm

Why do we need SLURM?

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed for Linux clusters. It provides job scheduling and management, resource allocation, and monitoring for computational tasks across multiple nodes.

--> SLURM manages the resources of a cluster's compute nodes and takes care of "fair share" job scheduling for all cluster users

General link for documentation: https://slurm.schedmd.com/documentation.html

Basic commands/concepts

SLURM Partitions (Queues) / sinfo

A partition, sometimes referred to as a "queue," defines a subset of compute nodes that "belong" to this partition and are subject to certain restrictions (e.g. a maximum run time). To see all available partitions and their restrictions, run sinfo on the cluster.

Example on Merlin:

CLUSTER: gmerlin6
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
gwendolen         up    2:00:00      1   idle merlin-g-100
gwendolen-long    up    8:00:00      1   idle merlin-g-100
gpu*              up 7-00:00:00      6    mix merlin-g-[002,004,006,010,013-014]
gpu*              up 7-00:00:00      7  alloc merlin-g-[003,005,007,009,011-012,015]
gpu*              up 7-00:00:00      1   idle merlin-g-008
gpu-short         up    2:00:00      6    mix merlin-g-[002,004,006,010,013-014]
gpu-short         up    2:00:00      7  alloc merlin-g-[003,005,007,009,011-012,015]
gpu-short         up    2:00:00      2   idle merlin-g-[001,008]
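
The listing above can presumably be reproduced with sinfo --clusters=gmerlin6 (the cluster and partition names are the ones from this example). To narrow the output down to a single partition:

sinfo --clusters=gmerlin6 --partition=gpu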

Monitoring current (and past) jobs / squeue

To check how "busy" the cluster currently is, run:

squeue : shows all jobs

squeue -u $USER : shows only your jobs

squeue --cluster=gmerlin6 : shows all jobs on gmerlin6

scontrol show job <job_id> --cluster=gmerlin6 : shows the details for this specific job on this cluster

sacct -M gmerlin6 --format="JobName,JobID%16,User%12,State%16,partition%20,time%12,ReqMem,MaxVMSize" -u $USER -S 2024-05-01 : shows your submitted jobs since a specific date (here 2024-05-01)

Example: (screenshot of the sacct output)

Find more fields that can be displayed at: https://slurm.schedmd.com/sacct.html
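
For pending jobs, squeue can also report an estimated start time. A minimal sketch (the cluster name is taken from the examples above; the estimate depends on the current queue and may change):

squeue -u $USER --cluster=gmerlin6 --start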

Submitting jobs

General submission command:

sbatch myscript.sh : submits a script

Simple example script:

#!/bin/bash

#SBATCH --cluster=merlin6
#SBATCH --partition=hourly
#SBATCH --nodes=1
#SBATCH --job-name="test"

echo "hello"

sleep 1m

This script submits a job to the hourly partition on the merlin6 cluster. It prints "hello" and then sleeps for 1 minute (does nothing). Important: SLURM parameters can also be provided directly on the command line. They will override the #SBATCH parameters given in the script! Example:

sbatch --job-name="bla" myscript.sh will override the parameter #SBATCH --job-name="test" in the script.
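
The same works for any other parameter, e.g. to shorten the time limit and rename the job without editing the script (the values are just illustrative):

sbatch --partition=hourly --time=00:30:00 --job-name="quicktest" myscript.sh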

Some important options:

#SBATCH --mem=<size[units]>             : memory 
#SBATCH --mem-per-cpu=<size[units]>     : memory per CPU
#SBATCH --cpus-per-task=<ncpus>         : cpus per task
#SBATCH --gpus=[<type>:]<number>        : GPUs 
#SBATCH --constraint=<Feature>          : specific type of GPU (node feature)
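
A minimal sketch of a job script combining some of these options (the partition and resource values are illustrative; the feature name for --constraint is cluster-specific, so that line is left commented out):

#!/bin/bash

#SBATCH --clusters=gmerlin6
#SBATCH --partition=gpu-short
#SBATCH --job-name="gpu-test"
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gpus=1
# uncomment and fill in a real feature name to request a specific GPU type:
##SBATCH --constraint=<Feature>

echo "running on $(hostname)"
sleep 1m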

Example from Relion:

#SBATCH --job-name=r401profile-4gpu
#SBATCH --partition=gpu
#SBATCH --clusters=gmerlin6
#SBATCH --gpus=4
#SBATCH --nodes=1
#SBATCH --mem-per-gpu=25600
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=4
#SBATCH --error=Refine3D/job038/run.err
#SBATCH --output=Refine3D/job038/run.out
#SBATCH --open-mode=append
#SBATCH --time=1-00:00:00

Cancelling jobs

scancel <jobid> --cluster=<cluster_name>
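
For example (the job ID is hypothetical; use the one reported by sbatch or shown by squeue):

scancel 12345678 --cluster=gmerlin6

scancel -u $USER --cluster=gmerlin6 cancels all of your jobs on that cluster (use with care).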

Priority/Queuing of Jobs

SLURM works with a "fair share" concept, which means that all users should get roughly the same amount of computational resources over time. To guarantee that, a submitted job gets a priority (a number) assigned that determines the order in which jobs are actually started on the cluster. The priority depends on several factors: Age, Partition, JobSize, etc.

Example: Users who have consumed fewer resources recently will have their jobs prioritized higher, while those who have used more will have lower priority, promoting balanced and fair resource distribution.
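
To inspect the priority of a pending job and the individual factors it is made up of, the sprio command can be used. A sketch (the job ID is hypothetical):

sprio --clusters=gmerlin6 --long -j 12345678

sprio --clusters=gmerlin6 -u $USER shows the priorities of all your pending jobs.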

Some important info/examples:

  • If you request 7 days for a job (maximum time), the JobSize parameter will become larger, meaning your jobs will have a lower priority --> try to request only the time that you will most likely need (plus a bit of backup time)
  • If you know that your job runs in <= one hour, go for the hourly / gpu-short partitions
  • Generally speaking: Try to minimize your request in terms of resources to maximize your priority!
  • If a job is queuing for an (unexpected) longer time, check the reason why the job is queuing (NODELIST(REASON)) from the squeue command. Adjust your job accordingly, if possible, or ask the admins to have a look if you are not sure about the issue.

Example: (screenshot of the squeue output showing the queuing reason)

In this case, a time limit that is not allowed for the partition was requested, which can be fixed.