
---
title: Slurm cluster 'gmerlin6'
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes a summary of the Slurm 'gmerlin6' configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows basic Slurm configuration and options needed to run jobs in the GPU cluster.

Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup of the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | GPU Type            | Def.#GPUs | Max.#GPUs |
|--------------------|-----------|-----------|----------|------------------|------------------|-------------------|---------------|---------------------|-----------|-----------|
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 5120             | 102400           | 102400            | 10000         | geforce_gtx_1080    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 5120             | 102400           | 102400            | 10000         | geforce_gtx_1080    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 5120             | 102400           | 102400            | 10000         | geforce_gtx_1080_ti | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 5120             | 102400           | 102400            | 10000         | geforce_rtx_2080_ti | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 5120             | 360448           | 360448            | 10000         | geforce_rtx_2080_ti | 1         | 8         |
| merlin-g-015       | 1 core    | 48 cores  | 1        | 5120             | 360448           | 360448            | 10000         | A5000               | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900             | 998400           | 998400            | 10000         | A100                | 1         | 8         |

{{site.data.alerts.tip}}Always check '/etc/slurm/gres.conf' and '/etc/slurm/slurm.conf' for changes in the GPU type and details of the hardware. {{site.data.alerts.end}}
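From a login node, the same information can also be queried with standard Slurm commands, for example (a minimal sketch):

```bash
# Show the full Slurm definition (CPUs, memory, GRES) of one GPU node
scontrol --clusters=gmerlin6 show node merlin-g-100

# List all nodes of the cluster with their generic resources (GRES)
sinfo --clusters=gmerlin6 --Node --Format=nodelist,cpus,memory,gres
```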

Running jobs in the 'gmerlin6' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the GPU cluster.

Merlin6 GPU cluster

To run jobs in the gmerlin6 cluster, users must specify the cluster name in Slurm:

#SBATCH --cluster=gmerlin6

Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to gpu:

#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen

The table below shows all partitions available to users:

| GPU Partition      | Default Time | Max Time | PriorityJobFactor* | PriorityTier** |
|--------------------|--------------|----------|--------------------|----------------|
| gpu                | 1 day        | 1 week   | 1                  | 1              |
| gpu-short          | 2 hours      | 2 hours  | 1000               | 500            |
| gwendolen          | 30 minutes   | 2 hours  | 1000               | 1000           |
| gwendolen-long***  | 30 minutes   | 8 hours  | 1                  | 1              |

*The PriorityJobFactor value is added to the job priority (PARTITION column in sprio -l). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as job age or, mainly, fair share may affect that decision). For the GPU partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with lower PriorityTier values and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.

***gwendolen-long is a special partition which is enabled during non-working hours only. As of Nov 2023, the current policy is to disable this partition from Monday to Friday, from 1am to 5pm. Jobs can be submitted at any time, but they will only be scheduled outside this time range.
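The configured time limits and priorities can also be checked directly with Slurm, for example (a sketch):

```bash
# Summary of partitions and their time limits in the gmerlin6 cluster
sinfo --clusters=gmerlin6 --summarize

# Full definition of a partition, including PriorityTier and PriorityJobFactor
scontrol --clusters=gmerlin6 show partition gwendolen-long
```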

Merlin6 GPU Accounts

Users should ensure that the public merlin account is specified. Not specifying any account option will default to this account. This is mostly relevant for users with multiple Slurm accounts, who may specify a different account by mistake.

#SBATCH --account=merlin  # Possible values: merlin, gwendolen

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account | Slurm Partitions          |
|---------------|---------------------------|
| merlin        | gpu, gpu-short            |
| gwendolen     | gwendolen, gwendolen-long |

By default, all users belong to the merlin Slurm account, and jobs are submitted to the gpu partition when no partition is defined.

Users only need to specify the gwendolen account when using the gwendolen or gwendolen-long partitions; otherwise, specifying the account is not needed (it will always default to merlin).
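As an illustration, a minimal job script for the public GPU partitions could look like the sketch below (job name, time and output file are only example values):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6        # GPU cluster
#SBATCH --partition=gpu           # public GPU partition (default)
#SBATCH --account=merlin          # public account (default)
#SBATCH --time=0-12:00:00         # must not exceed the partition Max Time
#SBATCH --job-name=gpu-example
#SBATCH --output=gpu-example.out

echo "Job ${SLURM_JOB_ID} running on $(hostname)"
```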

The 'gwendolen' account

For running jobs in the gwendolen/gwendolen-long partitions, users must specify the gwendolen account. The merlin account is not allowed to use the Gwendolen partitions.

Gwendolen is restricted to a set of users belonging to the unx-gwendolen Unix group. If you belong to a project allowed to use Gwendolen, or you would like to have access to it, please request access to the unx-gwendolen Unix group through PSI Service Now: the request will be redirected to the person responsible for the project (Andreas Adelmann).
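Members of the unx-gwendolen group would then combine the account and partition as sketched below (GPU options are covered in the next section):

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen     # or gwendolen-long for longer, off-hours jobs
#SBATCH --account=gwendolen       # mandatory for the Gwendolen partitions
```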

Slurm GPU specific options

Some Slurm options are specific to GPU usage. These are detailed below.

Number of GPUs and type

When using the GPU cluster, users must specify the number of GPUs they need to use:

#SBATCH --gpus=[<type>:]<number>

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU. The possible [<type>:] values and the <number> of GPUs depend on the node. This is detailed in the table below.

| Nodes              | GPU Type            | #GPUs |
|--------------------|---------------------|-------|
| merlin-g-[001]     | geforce_gtx_1080    | 2     |
| merlin-g-[002-005] | geforce_gtx_1080    | 4     |
| merlin-g-[006-009] | geforce_gtx_1080_ti | 4     |
| merlin-g-[010-013] | geforce_rtx_2080_ti | 4     |
| merlin-g-014       | geforce_rtx_2080_ti | 8     |
| merlin-g-015       | A5000               | 8     |
| merlin-g-100       | A100                | 8     |
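For example, two GPUs of a specific model, or any four GPUs when the model does not matter (illustrative values only):

```bash
# Request two GPUs of a specific type:
#SBATCH --gpus=geforce_rtx_2080_ti:2

# Or let Slurm pick any available GPU type:
#SBATCH --gpus=4
```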

Constraint / Features

Instead of specifying the GPU type, users sometimes need to select a GPU by the amount of memory available on the card itself. This is defined in Slurm with Features: a tag describing the GPU memory of the different GPU card models. Users can specify the required GPU memory size with the --constraint option. In that case, it is usually not necessary to specify [<type>:] in the --gpus option.

#SBATCH --constraint=<Feature>    # Possible values: gpumem_8gb, gpumem_11gb, gpumem_24gb, gpumem_40gb

The table below shows the available Features and which GPU card models and GPU nodes they belong to:

Merlin6 GPU Computing Nodes

| Nodes              | GPU Type              | Feature       |
|--------------------|-----------------------|---------------|
| merlin-g-[001-005] | `geforce_gtx_1080`    | `gpumem_8gb`  |
| merlin-g-[006-009] | `geforce_gtx_1080_ti` | `gpumem_11gb` |
| merlin-g-[010-014] | `geforce_rtx_2080_ti` | `gpumem_11gb` |
| merlin-g-015       | `A5000`               | `gpumem_24gb` |
| merlin-g-100       | `A100`                | `gpumem_40gb` |
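For example, to request any GPU card with 11GB of memory, without caring about the exact model (a sketch):

```bash
#SBATCH --gpus=2                    # number of GPUs, no [<type>:] prefix needed
#SBATCH --constraint=gpumem_11gb    # only nodes with 11GB GPU cards
```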

Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the man pages of each Slurm command for further information (man salloc, man sbatch, man srun). The most common settings are listed below:

#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>

Please notice that once [<type>:] is defined in one option, all other GPU options must include it as well!
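As an illustrative sketch (task, CPU and memory values are only examples, not a recommendation), a multi-task job combining some of these options could look as follows:

```bash
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=A5000:1      # one A5000 GPU per task
#SBATCH --cpus-per-gpu=6             # six CPUs allocated per GPU
#SBATCH --mem-per-gpu=40G            # memory is requested per GPU
#SBATCH --gpu-bind=verbose,single:1  # bind one task per GPU and report the binding
```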

Dealing with Hyper-Threading

The gmerlin6 cluster contains the partitions gwendolen and gwendolen-long, whose node has Hyper-Threading enabled. On that node, one should always specify whether or not to use Hyper-Threading. If not defined, Slurm will generally use it (exceptions apply). For this machine, Hyper-Threading is generally recommended.

#SBATCH --hint=multithread            # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread          # Don't use extra threads with in-core multi-threading.
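As an illustrative sketch (task and CPU counts are only example values), a job using Hyper-Threading on Gwendolen could combine the hint with the task layout as follows:

```bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gwendolen
#SBATCH --account=gwendolen
#SBATCH --hint=multithread        # use both hardware threads of each core
#SBATCH --ntasks=16               # example task count
#SBATCH --cpus-per-task=2         # two hardware threads (one core) per task
```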

User and job limits

The GPU cluster enforces some basic per-user and per-job limits to ensure fair usage of the cluster and to prevent a single user from abusing the resources. The limits are described below.

Per job limits

These limits apply to a single job; in other words, they define the maximum amount of resources a single job can use. Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format SlurmQoS(limits); possible SlurmQoS values can be listed with the command sacctmgr show qos (a sketch of this is shown after the list below):

| Partition      | Slurm Account | Mon-Sun 0h-24h                         |
|----------------|---------------|----------------------------------------|
| gpu            | merlin        | gpu_week(gres/gpu=8)                   |
| gpu-short      | merlin        | gpu_week(gres/gpu=8)                   |
| gwendolen      | gwendolen     | No limits                              |
| gwendolen-long | gwendolen     | No limits, active from 9pm to 5:30am   |
  • With the limits in the public gpu and gpu-short partitions, a single job using the merlin account (default account) can not use more than 40 CPUs, 8 GPUs or 200GB of memory. Any job exceeding these limits will stay in the queue with the reason QOSMax[Cpu|GRES|Mem]PerJob. Since there is no other QoS during the week that temporarily overrides job limits (as happens, for instance, in the CPU daily partition), such a job needs to be cancelled and resubmitted with resource requests adapted to the limits above.

  • The gwendolen and gwendolen-long partitions are two special partitions for an NVIDIA DGX A100 machine. Only users belonging to the unx-gwendolen Unix group can run in these partitions. No limits are applied (the machine resources can be used completely).

  • The gwendolen-long partition accepts job submissions 24 hours a day. However,

    • from 5:30am to 9pm the partition is down (jobs can be submitted, but can not run until the partition is set to active).
    • from 9pm to 5:30am jobs are allowed to run (partition is set to active).
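The exact limits enforced by each QoS can be inspected from a login node, for example (a sketch; format field names may vary slightly between Slurm versions):

```bash
# List all configured QoS
sacctmgr show qos

# Show only the QoS used by the public GPU partitions, with per-job and per-user limits
sacctmgr show qos gpu_week format=Name,MaxTRESPerJob,MaxTRESPerUser
```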

Per user limits for GPU partitions

These limits apply exclusively to users; in other words, they define the maximum amount of resources a single user can use. Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format SlurmQoS(limits); possible SlurmQoS values can be listed with the command sacctmgr show qos:

| Partition      | Slurm Account | Mon-Sun 0h-24h                         |
|----------------|---------------|----------------------------------------|
| gpu            | merlin        | gpu_week(gres/gpu=16)                  |
| gpu-short      | merlin        | gpu_week(gres/gpu=16)                  |
| gwendolen      | gwendolen     | No limits                              |
| gwendolen-long | gwendolen     | No limits, active from 9pm to 5:30am   |
  • With the limits in the public gpu and gpu-short partitions, a single user can not use more than 80 CPUs, 16 GPUs or 400GB of memory. Jobs sent by a user already exceeding these limits will stay in the queue with the reason QOSMax[Cpu|GRES|Mem]PerUser (see the squeue sketch after this list). In this case, the jobs can wait in the queue until some of the running resources are freed.

  • Notice that user limits are wider than job limits: in this way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc. Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same type of GPU.
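Whether a pending job is held back by one of these limits can be checked with squeue, for example (a sketch):

```bash
# Show your pending jobs in the gmerlin6 cluster with the reason they are waiting
squeue --clusters=gmerlin6 --user=$USER --states=PENDING -o "%.12i %.12P %.20j %.8T %.25r"
```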

Advanced Slurm configuration

Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-clustered configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:

  • /etc/slurm/slurm.conf - can be found in the login nodes and computing nodes.
  • /etc/slurm/gres.conf - can be found in the GPU nodes, and is also propagated to login nodes and computing nodes for user read access.
  • /etc/slurm/cgroup.conf - can be found in the computing nodes, and is also propagated to login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on any of the merlin5 or gmerlin6 computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
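For example, one possible way to read the gmerlin6-specific files is to run a short interactive command on a GPU node (a sketch; partition, GPU and time values are only illustrative):

```bash
# Print the GPU GRES definition directly from a gmerlin6 computing node
srun --clusters=gmerlin6 --partition=gpu-short --gpus=1 --time=00:05:00 \
     cat /etc/slurm/gres.conf
```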