
---
title: Slurm Configuration
keywords: configuration, partitions, node definition
last_updated: 23 January 2020
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
---

## About Merlin5 & Merlin6

The new Slurm cluster is called merlin6. However, the old Slurm merlin cluster will be kept for some time and has been renamed to merlin5. This allows jobs to keep running on the old computing nodes until users have fully migrated their codes to the new cluster.

Since July 2019, merlin6 is the default cluster: any job submitted to Slurm will be submitted to that cluster. Users can keep submitting to the old merlin5 computing nodes by using the option `--cluster=merlin5`.
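
For example (where `myjob.batch` is a placeholder for your own job script):

```bash
# Submit to the default cluster (merlin6)
sbatch myjob.batch

# Keep submitting to the old merlin5 computing nodes
sbatch --cluster=merlin5 myjob.batch
```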

This documentation explains the usage of the merlin6 Slurm cluster only.

## Using Slurm 'merlin6' cluster

Basic usage of the merlin6 cluster is detailed here. For advanced usage, please refer to the following document: LINK TO SLURM ADVANCED CONFIG

## Merlin6 Node definition

The following table shows the default and maximum resources that can be used per node:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|--------------------|-----------|-----------|----------|-------------|-------------|--------------|----------|-----------|-----------|
| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000        | 102400      | 102400       | 10000    | 1         | 2         |
| merlin-g-[002-009] | 1 core    | 20 cores  | 1        | 4000        | 102400      | 102400       | 10000    | 1         | 4         |

If nothing is specified, by default each core will use up to 8GB of memory (4000MB per thread, with 2 threads per core on the CPU nodes). Memory can be increased with the `--mem=<mem_in_MB>` and `--mem-per-cpu=<mem_in_MB>` options, and the maximum memory allowed is Max.Mem/Node.
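
As an illustration, a minimal job script requesting more memory than the default could look like the following sketch (`./my_program` is a placeholder):

```bash
#!/bin/bash
#SBATCH --ntasks=1             # one task
#SBATCH --cpus-per-task=1      # one CPU for the task
#SBATCH --mem-per-cpu=16000    # request 16000MB per CPU instead of the 4000MB default
#SBATCH --time=01:00:00        # time limit

srun ./my_program
```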

In Merlin6, memory is considered a Consumable Resource, as is the CPU.

## Merlin6 Slurm partitions

The partition can be specified when submitting a job with the `--partition=<partitionname>` option. The following partitions (also known as queues) are configured in Slurm:

| Partition | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor* |
|-----------|--------------|----------|-----------|----------|--------------------|
| general   | 1 day        | 1 week   | 50        | low      | 1                  |
| daily     | 1 day        | 1 day    | 67        | medium   | 500                |
| hourly    | 1 hour       | 1 hour   | unlimited | highest  | 1000               |

*The PriorityJobFactor value is added to the job priority (PARTITION column in `sprio -l`). In other words, jobs sent to higher priority partitions will usually run first (however, other factors such as job age or, mainly, fair share might affect that decision).

The general partition is the default: when nothing is specified, jobs are assigned to that partition. general cannot have more than 50 nodes running jobs. For daily this limit is extended to 67 nodes, while for hourly there is no limit.

{{site.data.alerts.tip}}Jobs that will run for less than one day should always be sent to daily, while jobs that will run for less than one hour should be sent to hourly. This ensures that they get higher priority than jobs sent to partitions with lower priority, and it also avoids the node limit of general. The idea behind this is that the cluster cannot be blocked by long jobs and resources are always available for shorter jobs. {{site.data.alerts.end}}
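
For example, a job expected to finish within a few hours could be sent to daily, and a very short one to hourly (`myjob.batch` is a placeholder):

```bash
# Job that needs less than a day: use the 'daily' partition
sbatch --partition=daily --time=12:00:00 myjob.batch

# Job that needs less than an hour: use the 'hourly' partition
sbatch --partition=hourly --time=00:30:00 myjob.batch
```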

## Merlin6 user and job limits

In the CPU cluster we enforce some limits which apply to jobs and users. The idea behind this is to ensure fair usage of the resources and to avoid abuse by a single user or job. Applying limits might reduce the overall usage efficiency of the cluster (for example, when user limits apply, a user's jobs may stay pending while many nodes sit idle due to low overall activity). On the other hand, these limits can also improve the efficiency of the cluster (for example, without any job size limits, a job requesting all resources of the batch system would drain the entire cluster in order to fit, which is undesirable).

Hence, sensible limits are needed to ensure fair usage of the resources, trying to optimize the overall efficiency of the cluster while allowing jobs of different natures and sizes (that is, single-core jobs as well as parallel jobs of different sizes) to run.

{{site.data.alerts.warning}}Wide limits are provided in the daily and hourly partitions, while for general the limits are more restrictive.
However, we kindly ask users to inform the Merlin administrators when they plan to submit big jobs that would require massive draining of nodes in order to be allocated. This applies to jobs requiring the unlimited QoS (see "Per job limits" below). {{site.data.alerts.end}}

### Per job limits

These limits apply to a single job: there is a maximum amount of resources a single job can use. The limits are described in the table below and vary depending on the day of the week and the time of day (working vs. non-working hours). Limits are shown in the format `SlurmQoS(limits)`, where the possible SlurmQoS values can be listed with the command `sacctmgr show qos`:

| Partition | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|-----------|-----------------|----------------|------------------------|------------------------|
| general   | normal(cpu=704,mem=2750G)     | normal(cpu=704,mem=2750G)     | normal(cpu=704,mem=2750G)     | normal(cpu=704,mem=2750G)     |
| daily     | daily(cpu=704,mem=2750G)      | nightly(cpu=1408,mem=5500G)   | unlimited(cpu=2112,mem=8250G) | nightly(cpu=1408,mem=5500G)   |
| hourly    | unlimited(cpu=2112,mem=8250G) | unlimited(cpu=2112,mem=8250G) | unlimited(cpu=2112,mem=8250G) | unlimited(cpu=2112,mem=8250G) |
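
To inspect these QoS definitions yourself, the standard Slurm commands can be used, for instance:

```bash
# Show all QoS and their limits; the relevant columns are
# MaxTRES (per-job limits) and MaxTRESPU (per-user limits)
sacctmgr show qos

# Show which QoS is currently attached to each partition
scontrol show partition | grep -E 'PartitionName|QoS'
```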

By default, a job cannot use more than 704 cores (max CPU per job). Memory is proportionally limited in the same way. This is equivalent to running a job on up to 8 nodes at once. This limit applies to the general partition (fixed limit) and to the daily partition (only during working hours). Limits are relaxed for the daily partition during non-working hours, and during the weekend they are even wider.

For the hourly partition, wider limits are provided, even though running very large parallel jobs there is not desirable (allocating such jobs requires massive draining of nodes). To avoid massive node draining in the cluster when allocating huge jobs, per-job limits are still necessary. Hence, the unlimited QoS refers mostly to "per user" limits rather than to "per job" limits (in other words, users can run any number of hourly jobs, but the size of each job is still limited, with wide values).

### Per user limits

These limits apply exclusively to users: there is a maximum amount of resources a single user can use. The limits are described in the table below and vary depending on the day of the week and the time of day (working vs. non-working hours). Limits are shown in the format `SlurmQoS(limits)`, where the possible SlurmQoS values can be listed with the command `sacctmgr show qos`:

| Partition | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|-----------|-----------------|----------------|------------------------|------------------------|
| general   | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      |
| daily     | daily(cpu=1408,mem=5500G)      | nightly(cpu=2112,mem=8250G)    | unlimited(cpu=6336,mem=24750G) | nightly(cpu=2112,mem=8250G)    |
| hourly    | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) |

By default, users cannot use more than 704 cores at the same time (max CPU per user). Memory is proportionally limited in the same way. This is equivalent to 8 exclusive nodes. This limit applies to the general partition (fixed limit) and to the daily partition (only during working hours). For the hourly partition, user limits are essentially removed. Limits are relaxed for the daily partition during non-working hours and essentially removed during the weekend.
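
To see how these limits affect your own jobs, you can check the QoS and pending reason of your jobs; jobs held back by user limits typically show a reason such as QOSMaxCpuPerUserLimit:

```bash
# List your own jobs with partition, QoS, state and pending reason
squeue -u $USER -O jobid,partition,qos,state,reason
```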

## Understanding the Slurm configuration (for advanced users)

Clusters at PSI use the Slurm Workload Manager as the batch system for managing and scheduling jobs. Historically, Merlin4 and Merlin5 also used Slurm, and Merlin6 has been configured with this batch system as well.

Slurm has been installed in a multi-cluster configuration, which allows multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it can be useful to check the following files:

* `/etc/slurm/slurm.conf` - can be found on the login nodes and computing nodes.
* `/etc/slurm/gres.conf` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* `/etc/slurm/cgroup.conf` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
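
Instead of reading those files directly, much of the same information can also be queried through Slurm itself, for example:

```bash
# Print the running configuration of the cluster
scontrol show config | less

# Show the partition definitions (limits, default times, allowed nodes)
scontrol show partitions

# Show the node definitions (CPUs, memory, GRES such as GPUs)
scontrol show nodes | less
```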

The configuration files found on the login nodes correspond exclusively to the merlin6 cluster. Configuration files for the old merlin5 cluster must be checked directly on any of the merlin5 computing nodes: they are not propagated to the merlin6 login nodes.