Slurm configuration, changed per job and per user limits
@@ -2,7 +2,7 @@
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 23 January 2020
last_updated: 29 January 2021
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
@@ -105,7 +105,7 @@ with wide values).

#### Per user limits for CPU partitions

These limits which apply exclusively to users. In other words, there is a maximum of resource a single user can use. This is described in the table below,
These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use. This is described in the table below,
and limits will vary depending on the day of the week and the time (*working* vs *non-working* hours). Limits are shown in the format `SlurmQoS(limits)`,
where `SlurmQoS` can be seen with the command `sacctmgr show qos`:
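A minimal sketch of such a query is shown below; the `format=` column names are an assumption and may differ slightly between Slurm versions, while a bare `sacctmgr show qos` always prints the full table.

```bash
# Print QoS names together with their per-job and per-user TRES limits.
# The format fields are an assumption and may vary by Slurm version;
# a plain `sacctmgr show qos` always shows every column.
sacctmgr show qos format=Name%20,MaxWall,MaxTRES%40,MaxTRESPerUser%40
```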
@@ -150,13 +150,41 @@ partitions, Slurm will also attempt first to allocate jobs on partitions with hi

### User and job limits

The GPU cluster contains some basic user and job limits to ensure that a single user cannot abuse the resources and that the cluster is used fairly.
The limits are described below.

#### Per job limits

Per job limits are the same as the per user limits (see below).
These are limits applying to a single job. In other words, there is a maximum of resources a single job can use.
Limits are defined using QoS and are usually set at the partition level. Limits are described in the table below and are shown in the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Mon-Sun 0h-24h                          |
|:-------------:| :------------------------------------: |
| **gpu**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)    |
| **gpu-short** | gpu_week(cpu=40,gres/gpu=8,mem=200G)    |

With these limits, a single job cannot use more than 40 CPUs, more than 8 GPUs, or more than 200GB of memory.
Any job exceeding such limits will stay in the queue with error **`QOSMax[Cpu|GRES|Mem]PerJob`**.
Since there are no additional QoS during the week that can increase the job limits (as happens, for instance, in the CPU **daily** partition), the job needs to be cancelled and the requested resources adapted to the above limits.
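As an illustration only, a batch script that stays within these per-job limits could look like the sketch below; the job name, run time and application are placeholders rather than values taken from the Merlin6 configuration.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example   # placeholder name
#SBATCH --partition=gpu          # GPU partition described above
#SBATCH --time=1-00:00:00        # placeholder run time (1 day)
#SBATCH --gres=gpu:8             # per-job QoS allows at most gres/gpu=8
#SBATCH --cpus-per-task=40       # per-job QoS allows at most cpu=40
#SBATCH --mem=200G               # per-job QoS allows at most mem=200G

# Asking for more (e.g. --gres=gpu:9 or --mem=250G) would keep the job
# pending with a QOSMax[Cpu|GRES|Mem]PerJob reason.
srun ./my_gpu_application        # placeholder application
```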

#### Per user limits for GPU partitions

By default, a user cannot use more than **two** GPU nodes in parallel. Hence, users are limited to at most 8 GPUs in parallel (from 2 nodes).
These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use.
Limits are defined using QoS and are usually set at the partition level. Limits are described in the table below and are shown in the format `SlurmQoS(limits)`
(the possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Mon-Sun 0h-24h                                                |
|:-------------:| :---------------------------------------------------------: |
| **gpu**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)                        |
| **gpu-short** | gpu_week(cpu=80,gres/gpu=16,mem=400G)                        |

With these limits, a single user cannot use more than 80 CPUs, more than 16 GPUs, or more than 400GB of memory.
Jobs submitted by any user already exceeding such limits will stay in the queue with the error **`QOSMax[Cpu|GRES|Mem]PerUser`**. In that case, the job waits until some of the resources used by this user are freed.
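To check whether a queued job is being held back by these per-user limits, the scheduler's pending reason can be listed with `squeue`; the output format below is only one possible choice, shown for illustration.

```bash
# List your own jobs with their state and the scheduler's reason;
# jobs blocked by the per-user QoS show reasons such as
# QOSMaxCpuPerUser, QOSMaxGRESPerUser or QOSMaxMemPerUser.
squeue -u $USER -o "%.12i %.10P %.20j %.8T %.12M %r"
```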

Notice that user limits are wider than job limits. In this way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc.
Please try to avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same
type of GPU.
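Where only a specific GPU model is really needed, Slurm also accepts the GPU type as part of the `--gres` request, which makes it easier to leave the remaining models free for other users. The sketch below uses a placeholder type string, since the actual GPU type names configured on the cluster are not listed in this document.

```bash
# Request 4 GPUs of one specific model; "GPU_TYPE" is a placeholder and must be
# replaced with a type defined in the cluster's gres.conf. job.sh is a
# placeholder batch script.
sbatch --partition=gpu --gres=gpu:GPU_TYPE:4 --cpus-per-task=20 --mem=100G job.sh
```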

## Understanding the Slurm configuration (for advanced users)