---
title: Slurm Configuration
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document describes a summary of the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---

This documentation describes the basic Slurm configuration and the options needed to run jobs in the Merlin5 cluster.

The Merlin5 cluster is an older cluster based on legacy hardware, maintained on a best-effort basis to increase the overall CPU power available in the Merlin cluster.

## Merlin5 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes            | Def. #CPUs | Max. #CPUs | #Threads | Max. Mem/Node (MB) | Max. Swap (MB) |
|------------------|------------|------------|----------|--------------------|----------------|
| merlin-c-[18-30] | 1 core     | 16 cores   | 1        | 60000              | 10000          |
| merlin-c-[31-32] | 1 core     | 16 cores   | 1        | 124000             | 10000          |
| merlin-c-[33-45] | 1 core     | 16 cores   | 1        | 60000              | 10000          |
| merlin-c-[46-47] | 1 core     | 16 cores   | 1        | 124000             | 10000          |

There is one main difference between the Merlin5 and Merlin6 clusters: Merlin5 keeps an old configuration which does not treat memory as a consumable resource. Hence, users can oversubscribe memory. This might trigger side-effects, but this legacy configuration has been kept to ensure that old jobs keep running the same way they did a few years ago. If you know that this might be a problem for your jobs, please always use Merlin6 instead.
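These values can also be queried directly from Slurm. The following is a small sketch using standard Slurm commands (the node name `merlin-c-18` is just an example taken from the table above):

```bash
# Show the Slurm definition (CPUs, memory, state, ...) of a single Merlin5 node
scontrol --clusters=merlin5 show node merlin-c-18

# Node-oriented overview of all Merlin5 nodes, including CPUs and memory per node
sinfo --clusters=merlin5 --Node --long
```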

## Running jobs in the 'merlin5' cluster

In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.

### Merlin5 CPU cluster

To run jobs in the merlin5 cluster, users must specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin5
```
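The cluster can also be selected on the command line rather than inside the batch script. A minimal sketch (the script name `job.sh` is just a placeholder):

```bash
# Submit a batch script to the merlin5 cluster
sbatch --clusters=merlin5 job.sh

# List your own jobs in the merlin5 cluster
squeue --clusters=merlin5 --user=$USER
```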

### Merlin5 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, jobs will be sent to the default partition, which is merlin:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```

The table below summarizes the partitions available to users:

| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|---------------|--------------|----------|-----------|---------------------|------------------|
| merlin        | 5 days       | 1 week   | All nodes | 500                 | 1                |
| merlin-long   | 5 days       | 21 days  | 4         | 1                   | 1                |

\*The PriorityJobFactor value will be added to the job priority (PARTITION column in `sprio -l`). In other words, jobs submitted to partitions with a higher priority will usually run first; however, other factors, such as the job age or, mainly, the fair share, may affect that decision. Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher PriorityTier value will be dispatched before pending jobs in partitions with lower PriorityTier values and, if possible, they will preempt running jobs from partitions with lower PriorityTier values.

The merlin-long partition is limited to 4 nodes, as it might contain jobs running for up to 21 days.
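The partition settings and the current state of their nodes can also be checked directly from Slurm; a short sketch:

```bash
# List the Merlin5 partitions with their time limits and node states
sinfo --clusters=merlin5

# Show the full definition of a single partition (e.g. merlin-long)
scontrol --clusters=merlin5 show partition merlin-long
```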

### Merlin5 CPU Accounts

Users need to ensure that the public merlin account is specified. Not specifying the account option will default to this account. Setting it explicitly is mostly relevant for users with multiple Slurm accounts, who might otherwise submit with a different account by mistake.

```bash
#SBATCH --account=merlin  # Possible values: merlin
```
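Which accounts a user can submit with can be verified from the Slurm accounting database; a short sketch (field names may differ slightly between Slurm versions):

```bash
# List your Slurm associations (cluster, account, user, partition)
sacctmgr show associations where user=$USER format=Cluster,Account,User,Partition
```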

### Slurm CPU specific options

A number of Slurm options are specific to CPU-based jobs. Please refer to the man pages of each Slurm command (man salloc, man sbatch, man srun) for further information. The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for 'srun' command
```

Notice that in Merlin5 hyper-threading is not available (while in Merlin6 it is). Hence, in Merlin5 there is no need to specify `--hint` hyper-threading related options (i.e. `--hint=multithread` or `--hint=nomultithread`).
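As an illustration, a minimal batch script combining the settings described above could look as follows. This is only a sketch: the job name, output file, time limit, task counts and memory request are placeholders that must be adapted to the actual application.

```bash
#!/bin/bash
#SBATCH --cluster=merlin5          # Run in the Merlin5 cluster
#SBATCH --partition=merlin         # Default partition; use merlin-long for jobs longer than 1 week
#SBATCH --account=merlin           # Public Merlin5 account
#SBATCH --job-name=example         # Placeholder job name
#SBATCH --output=example-%j.out    # Placeholder output file (%j expands to the job ID)
#SBATCH --time=1-00:00:00          # 1 day, below the 1 week limit of the merlin partition
#SBATCH --ntasks=16                # Example: 16 tasks, i.e. one full node
#SBATCH --ntasks-per-node=16       # Merlin5 nodes have 16 cores with 1 thread per core
#SBATCH --mem-per-cpu=3500         # Example memory request (MB per task), fits the 60000 MB nodes

# Placeholder payload: replace with the actual application (e.g. an MPI program run via srun)
srun hostname
```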

## User and job limits

In the CPU cluster we provide some limits which apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to prevent a single user or job from abusing the cluster. However, applying limits might affect the overall usage efficiency of the cluster (for example, when user limits are applied, one may see pending jobs from a single user while many nodes stay idle due to low overall activity). In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limit, a job requesting all resources of the batch system would drain the entire cluster in order to fit, which is undesirable).

Hence, limits need to be chosen wisely to ensure a fair usage of the resources, optimizing the overall efficiency of the cluster while still allowing jobs of different natures and sizes (i.e. single-core as well as parallel jobs of different sizes) to run.

Since the merlin5 cluster has few active users, these limits are less restrictive than the ones set in the merlin6 and gmerlin6 clusters.

### Per job limits

These limits apply to individual jobs; in other words, they define the maximum amount of resources a single job can use. They are described in the table below in the format SlurmQoS(limits) (the available SlurmQoS definitions can be listed with the `sacctmgr show qos` command):

| Partition   | Mon-Sun 0h-24h   | Other limits |
|-------------|------------------|--------------|
| merlin      | merlin5(cpu=384) | None         |
| merlin-long | merlin5(cpu=384) | Max. 4 nodes |

By default, the QoS limits prevent a job from using more than 384 cores (maximum CPUs per job). For merlin-long this is further restricted: there is an extra limit of 4 dedicated nodes for this partition. This limit is defined at the partition level and overrides the QoS limit whenever it is more restrictive.
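The QoS limits themselves can be inspected with `sacctmgr`; a sketch (the exact field names may vary slightly between Slurm versions):

```bash
# Show the per-job and per-user TRES limits and the maximum wall time of the merlin5 QoS
sacctmgr show qos merlin5 format=Name,MaxTRES,MaxTRESPU,MaxWall
```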

### Per user limits for CPU partitions

No per-user limits are applied by QoS. In the merlin partition, a single user could fill the whole batch system with jobs (the restriction is on the job size, as explained above). In the merlin-long partition, the 4 node limit still applies.

## Advanced Slurm configuration

Clusters at PSI use the Slurm Workload Manager as the batch system technology for managing and scheduling jobs. Slurm has been installed in a multi-clustered configuration, allowing the integration of multiple clusters within the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* `/etc/slurm/slurm.conf` - can be found in the login nodes and computing nodes.
* `/etc/slurm/gres.conf` - can be found in the GPU nodes, and is also propagated to the login nodes and computing nodes for user read access.
* `/etc/slurm/cgroup.conf` - can be found in the computing nodes, and is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster or for the gmerlin6 cluster must be checked directly on one of the merlin5 or gmerlin6 computing nodes (for instance, by logging in to a node while a job or an active allocation is running).
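Alternatively, most of the running configuration can be queried from Slurm without reading the files directly; the following is a sketch using the cluster, partition and account described above:

```bash
# Print the live Slurm configuration of the merlin5 cluster
scontrol --clusters=merlin5 show config

# Read slurm.conf directly on a merlin5 computing node by running a short job there
srun --clusters=merlin5 --partition=merlin --account=merlin cat /etc/slurm/slurm.conf
```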