first stab at mkdocs migration
docs/merlin7/03-Slurm-General-Documentation/interactive-jobs.md
@@ -0,0 +1,213 @@
---
title: Running Interactive Jobs
#tags:
keywords: interactive, X11, X, srun, salloc, job, jobs, slurm, nomachine, nx
last_updated: 07 August 2024
summary: "This document describes how to run interactive jobs as well as X based software."
sidebar: merlin7_sidebar
permalink: /merlin7/interactive-jobs.html
---

## The Merlin7 'interactive' partition

On the **`merlin7`** cluster, it is recommended to always run interactive jobs on the **`interactive`** partition.
This partition allows CPU oversubscription (up to four users may share the same CPU) and **has the highest scheduling priority**. Access to this partition is typically quick, making it a convenient extension of the login nodes for interactive workloads.

On the **`gmerlin7`** cluster, additional interactive partitions are available, but these are primarily intended for CPU-only workloads (such as compiling GPU-based software, or creating an allocation for submitting jobs to Grace-Hopper nodes).

{{site.data.alerts.warning}}
Because <b>GPU resources are scarce and expensive</b>, interactive allocations that use GPUs should only be requested when strictly necessary and well justified.
{{site.data.alerts.end}}

## Running interactive jobs

There are two ways to run interactive jobs in Slurm, using the ``salloc`` and ``srun`` commands:

* **``salloc``**: obtains a Slurm job allocation (a set of nodes), executes command(s), and then releases the allocation when the command is finished.
* **``srun``**: runs parallel tasks.

### srun

``srun`` is used to run parallel jobs in the batch system. It can be used within a batch script
(submitted with ``sbatch``), within a job allocation (created with ``salloc``), or directly as a
command, for example from the login nodes.

When used inside a batch script or a job allocation, ``srun`` is constrained to the
amount of resources allocated by the ``sbatch``/``salloc`` commands. In ``sbatch``, these resources
are usually defined inside the batch script with the format ``#SBATCH <option>=<value>``.
In other words, if your batch script or allocation defines 88 tasks (with 1 thread per core)
and 2 nodes, ``srun`` is constrained to that amount of resources (you can use fewer, but never
exceed those limits).
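
A minimal sketch of this behaviour (the resource numbers and the program name are placeholders, not an official example):

```bash
#!/bin/bash
#SBATCH --nodes=2              # allocate 2 nodes
#SBATCH --ntasks=88            # allocate 88 tasks in total
#SBATCH --hint=nomultithread   # 1 thread per core

# 'srun' inherits the allocation above: it may launch up to 88 tasks
# across the 2 allocated nodes, but can never exceed those limits.
srun ./my_parallel_app

# Running a smaller step inside the same allocation is also allowed:
srun --ntasks=2 hostname
```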

When used from a login node, ``srun`` typically runs a specific command or program in an
interactive way. ``srun`` is a blocking process (it blocks the shell prompt until the ``srun``
command finishes, unless you run it in the background with ``&``). This can be very useful for running
interactive software which pops up a window and then submits jobs or runs sub-tasks in the
background (for example, **Relion**, **cisTEM**, etc.).
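
For instance, a sketch of launching such a tool without blocking the prompt (the program name is a placeholder):

```bash
# Run the program through Slurm, but keep the shell prompt usable
srun --clusters=merlin7 --partition=interactive --ntasks=1 ./my_gui_tool &
```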

Refer to ``man srun`` to explore all available options for this command.

<details>
<summary>[Show 'srun' example]: Running 'hostname' command on 3 nodes, using 2 cores (1 task/core) per node</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --ntasks=6 --ntasks-per-node=2 --nodes=3 hostname
cn001.merlin7.psi.ch
cn001.merlin7.psi.ch
cn002.merlin7.psi.ch
cn002.merlin7.psi.ch
cn003.merlin7.psi.ch
cn003.merlin7.psi.ch
</pre>
</details>

### salloc

**``salloc``** is used to obtain a Slurm job allocation (a set of nodes). Once the job is allocated,
users can execute interactive command(s). Once finished (``exit`` or ``Ctrl+D``),
the allocation is released. **``salloc``** is a blocking command, that is, the command will block
until the requested resources are allocated.

When running **``salloc``**, once the resources are allocated, *by default* the user gets
a ***new shell on one of the allocated resources*** (if a user has requested several nodes, a new
shell is opened on the first allocated node). However, this behaviour can be changed by adding
a shell (`$SHELL`) at the end of the `salloc` command. For example:

```bash
# Typical 'salloc' call
salloc --clusters=merlin7 --partition=interactive -N 2 -n 2

# Custom 'salloc' call
# - $SHELL will open a local shell on the login node from which 'salloc' is run
salloc --clusters=merlin7 --partition=interactive -N 2 -n 2 $SHELL
```

<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>Default</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive -N 2 -n 2
salloc: Granted job allocation 161
salloc: Nodes cn[001-002] are ready for job

caubet_m@login001:~> srun hostname
cn002.merlin7.psi.ch
cn001.merlin7.psi.ch

caubet_m@login001:~> exit
exit
salloc: Relinquishing job allocation 161
</pre>
</details>

<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>$SHELL</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --ntasks=2 --nodes=2 $SHELL
salloc: Granted job allocation 165
salloc: Nodes cn[001-002] are ready for job
caubet_m@login001:~> srun hostname
cn001.merlin7.psi.ch
cn002.merlin7.psi.ch
caubet_m@login001:~> exit
exit
salloc: Relinquishing job allocation 165
</pre>
</details>

## Running interactive jobs with X11 support

### Requirements

#### Graphical access

[NoMachine](/merlin7/nomachine.html) is the officially supported service for graphical
access in the Merlin cluster. This service runs on the login nodes. Check the
document [{Accessing Merlin -> NoMachine}](/merlin7/nomachine.html) for details about
how to connect to the **NoMachine** service in the Merlin cluster.

For other, not officially supported, graphical access (X11 forwarding):

* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](/merlin7/connect-from-linux.html)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](/merlin7/connect-from-windows.html)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin7/connect-from-macos.html)

### 'srun' with X11 support

The Merlin6 and Merlin7 clusters allow running any window-based application. For that, you need to
add the option ``--x11`` to the ``srun`` command. For example:

```bash
srun --clusters=merlin7 --partition=interactive --x11 sview
```

will pop up an X11-based Slurm view of the cluster.

In the same manner, you can create a bash shell with X11 support. To do that, add
the option ``--pty`` to the ``srun --x11`` command. Once the resource is allocated,
you can interactively run X11 and non-X11 based commands from there.

```bash
srun --clusters=merlin7 --partition=interactive --x11 --pty bash
```

<details>
<summary>[Show 'srun' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --x11 sview

caubet_m@login001:~>

caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --x11 --pty bash

caubet_m@cn003:~> sview

caubet_m@cn003:~> echo "This was an example"
This was an example

caubet_m@cn003:~> exit
exit
</pre>
</details>

### 'salloc' with X11 support

The **Merlin6** and **Merlin7** clusters allow running any window-based application. For that, you need to
add the option ``--x11`` to the ``salloc`` command. For example:

```bash
salloc --clusters=merlin7 --partition=interactive --x11 sview
```

will pop up an X11-based Slurm view of the cluster.

In the same manner, you can create a bash shell with X11 support. To do that, simply run
``salloc --clusters=merlin7 --partition=interactive --x11``. Once the resource is allocated,
you can interactively run X11 and non-X11 based commands from there.

```bash
salloc --clusters=merlin7 --partition=interactive --x11
```

<details>
<summary>[Show 'salloc' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --x11 sview
salloc: Granted job allocation 174
salloc: Nodes cn001 are ready for job
salloc: Relinquishing job allocation 174

caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --x11
salloc: Granted job allocation 175
salloc: Nodes cn001 are ready for job
caubet_m@cn001:~>

caubet_m@cn001:~> sview

caubet_m@cn001:~> echo "This was an example"
This was an example

caubet_m@cn001:~> exit
exit
salloc: Relinquishing job allocation 175
</pre>
</details>
@@ -0,0 +1,59 @@
---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 GPU A100 nodes
* 8 GPU Grace Hopper nodes

The specification of the node types is:

| Node | #Nodes | CPU | RAM | GRES |
| ----: | ------ | --- | --- | ---- |
| Login Nodes | 2 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200Mhz | |
| CPU Nodes | 77 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200Mhz | |
| A100 GPU Nodes | 5 | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200Mhz | 4 x NV_A100 (80GB) |
| GH GPU Nodes | 3 | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU+GPU) | 4 x NV_GH200 (120GB) |

### Network

The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
at <https://www.glennklockwood.com/garden/slingshot>.

Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.
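
As a quick sanity check, the standard libfabric utility can list the fabric interfaces visible on a node (a sketch, assuming the libfabric tools are installed and in the PATH):

```bash
# Print the fabric providers/interfaces that libfabric detects on this node
fi_info
```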

### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance is built of several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9 TiB (Raid10)
* 8 OSS-D servers, 106 drives per server, 14.5 TiB HDDs (Gridraid / Raid6)
* 4 OSS-F servers, 12 drives per server, 7 TiB SSDs (Raid10)

This gives an effective storage capacity of:

* 10 PB HDD
  * value visible on Linux: HDD 9302.4 TiB
* 162 TB SSD
  * value visible on Linux: SSD 151.6 TiB
* 23.6 TiB of metadata

The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.
@@ -0,0 +1,370 @@
---
title: Slurm merlin7 Configuration
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 Slurm configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Public partitions configuration summary

### CPU public partitions

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -----------------: | -----------: | ----------: | -------: | ---------------: | --------------------: | --------------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | cpu=1024,mem=1920G | cpu=1024,mem=1920G |
| **daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | cpu=1024,mem=1920G | cpu=2048,mem=3840G |
| **hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | cpu=2048,mem=3840G | cpu=8192,mem=15T |
| **interactive** | 0-04:00:00 | 0-12:00:00 | Highest | <u>merlin</u> | cpu=16,mem=30G,node=1 | cpu=32,mem=60G,node=1 |

### GPU public partitions

#### A100 nodes

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -------------------: | -----------: | ----------: | ---------: | -------------: | -------------------------------: | -------------------------------: |
| **a100-general** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | gres/gpu=4 | gres/gpu=8 |
| **a100-daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **a100-hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **a100-interactive** | 0-01:00:00 | 0-12:00:00 | Very High | <u>merlin</u> | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 |

#### Grace-Hopper nodes

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -------------------: | -----------: | ----------: | ---------: | -------------: | -------------------------------: | -------------------------------: |
| **gh-general** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | gres/gpu=4 | gres/gpu=8 |
| **gh-daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **gh-hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **gh-interactive** | 0-01:00:00 | 0-12:00:00 | Very High | <u>merlin</u> | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 |

## CPU cluster: merlin7

**By default, jobs will be submitted to `merlin7`**, as it is the primary cluster configured on the login nodes.
Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
However, when necessary, one can specify the cluster as follows:
```bash
#SBATCH --cluster=merlin7
```
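
The same can be done on the command line; `myscript.sh` below is just a placeholder for your own batch script:

```bash
# Submit explicitly to the 'merlin7' cluster
sbatch --clusters=merlin7 myscript.sh

# Query only the 'merlin7' cluster
squeue --clusters=merlin7
```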

### CPU general configuration

The **Merlin7 CPU cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.

* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
  to fulfill a job's resource requirements.

By default, Slurm will allocate one task per core, which means:

* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
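
As an illustration, the following sketch (the task count is arbitrary) requests 4 tasks; with hyper-threading enabled, each task is pinned to a full core, so Slurm accounts 2 CPUs per task, i.e. 8 CPUs in total:

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --partition=hourly
#SBATCH --ntasks=4             # 4 tasks -> 4 cores -> 8 CPUs (threads) accounted

echo "CPUs allocated on this node: $SLURM_CPUS_ON_NODE"
srun hostname
```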

### CPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.

| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:
  * **`--mem=<mem_in_MB>`**: Allocates memory per node.
  * **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
  effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
  adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads, as shown in the sketch below.
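
A minimal sketch of such an MPI-style request (partition, task count and binary name are placeholders); since only one thread per core is used, `--mem-per-cpu` is doubled from the 1920M default to keep the same memory per task:

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --partition=daily
#SBATCH --ntasks=128
#SBATCH --hint=nomultithread   # one thread per core
#SBATCH --mem-per-cpu=3840     # 2 x 1920M, compensating for the halved CPU count

srun ./my_mpi_app
```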

{{site.data.alerts.tip}}
Always verify the Slurm <b>'/var/spool/slurmd/conf-cache/slurm.conf'</b> configuration file for potential changes.
{{site.data.alerts.end}}

### User and job limits with QoS

In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
effectively.

To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
various QoS definitions applicable to the merlin7 CPU-based cluster. Here:

* `MaxTRES` specifies resource limits per job.
* `MaxTRESPU` specifies resource limits per user.

| Name | MaxTRES | MaxTRESPU | Scope |
| -------------------: | --------------------: | --------------------: | ---------------------: |
| **normal** | | | partition |
| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | <u>user</u>, partition |
| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition |
| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition |
| **cpu_interactive** | cpu=16,mem=30G,node=1 | cpu=32,mem=60G,node=1 | partition |

Where:

* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
  restrictions.
* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
  user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
  overriding user-level QoS.
* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
  with higher resource needs.
* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
  which caters to very short-duration jobs.
* **`cpu_interactive` QoS:** Is restricted to one node and a few CPUs only, and is intended to be used when interactive
  allocations are necessary (`salloc`, `srun`).

For additional details, refer to the [CPU partitions](/merlin7/slurm-configuration.html#CPU-partitions) section.

{{site.data.alerts.tip}}
Always verify QoS definitions for potential changes using the <b>'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"'</b> command.
{{site.data.alerts.end}}

### CPU partitions

This section provides a summary of the partitions available in the `merlin7` CPU cluster.

Key concepts:

* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
  and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
  partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.

{{site.data.alerts.tip}}
Always verify partition configurations for potential changes using the <b>'scontrol show partition'</b> command.
{{site.data.alerts.end}}

#### CPU public partitions

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | --------------: | -------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | 46 | 1 | 1 | cpu_general | <u>merlin</u> |
| **daily** | 0-01:00:00 | 1-00:00:00 | 58 | 500 | 1 | cpu_daily | <u>merlin</u> |
| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | <u>merlin</u> |
| **interactive** | 0-04:00:00 | 0-12:00:00 | 58 | 1 | 2 | cpu_interactive | <u>merlin</u> |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.
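
Both defaults can be overridden in the submission script, for example (a sketch using the public options documented above):

```bash
#SBATCH --account=merlin       # default account; usually not needed
#SBATCH --partition=daily      # override the default 'general' partition
```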

{{site.data.alerts.tip}}
For jobs running less than one day, submit them to the <b>daily</b> partition.
For jobs running less than one hour, use the <b>hourly</b> partition.
These partitions provide higher priority and ensure quicker scheduling compared to <b>general</b>, which has limited node availability.
{{site.data.alerts.end}}

The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first. As a result, access to the
**`hourly`** partition might experience delays in such scenarios.

The **`interactive`** partition is designed specifically for real-time, interactive work. Here are the key characteristics:

* **CPU Oversubscription:** This partition allows CPU oversubscription (configured as `FORCE:4`), meaning that up to four interactive
  jobs may share the same physical CPU core. This can impact performance, but enables fast access for short-term tasks.
* **Highest Scheduling Priority:** Jobs submitted to the interactive partition are always prioritized. They will be scheduled
  before any jobs in other partitions.
* **Intended Use:** This partition is ideal for debugging, testing, compiling, short interactive runs, and other activities where
  immediate access is important.

{{site.data.alerts.warning}}
Because of CPU sharing, the performance on the <b>'interactive'</b> partition may not be optimal for compute-intensive tasks.
For long-running or production workloads, use a dedicated batch partition instead.
{{site.data.alerts.end}}

#### CPU private partitions

##### CAS / ASA

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **asa** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |

##### CNM / Mu3e

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |

##### CNM / MeG

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |

## GPU cluster: gmerlin7

As mentioned in previous sections, by default, jobs will be submitted to `merlin7`, as it is the primary cluster configured on the login nodes.
For submitting jobs to the GPU cluster, **the cluster name `gmerlin7` must be specified**, as follows:
```bash
#SBATCH --cluster=gmerlin7
```

### GPU general configuration

The **Merlin7 GPU cluster** is configured with the **`CR_CORE_MEMORY`**, **`CR_ONE_TASK_PER_CORE`**, and **`ENFORCE_BINDING_GRES`** options.

* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
  to fulfill a job's resource requirements.
* Slurm will allocate the CPUs to the selected GPU.

By default, Slurm will allocate one task per core, which means:

* For hyper-threaded nodes (NVIDIA A100-based nodes), each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
* For the NVIDIA GraceHopper-based nodes, each task will consume 1 **CPU**.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.
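
For instance, a minimal sketch of a single-GPU job on the A100 nodes (partition, CPU count and binary name are placeholders):

```bash
#!/bin/bash
#SBATCH --clusters=gmerlin7
#SBATCH --partition=a100-daily
#SBATCH --gpus=1               # one GPU; the CPUs are bound to the selected GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

srun ./my_gpu_app
```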

### GPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.

| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Gres | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | --------------------------: | ---------------------: |
| gpu[001-007] | 4 | 72 | 288 | 1 | 288 | 828G | 2944M | gpu:gh200:4 | GH200, NV_H100 |
| gpu[101-105] | 1 | 64 | 64 | 2 | 128 | 480G | 3840M | gpu:nvidia_a100-sxm4-80gb:4 | AMD_EPYC_7713, NV_A100 |

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:
  * **`--mem=<mem_in_MB>`**: Allocates memory per node.
  * **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
  effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
  adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads.

{{site.data.alerts.tip}}
Always verify the Slurm <b>'/var/spool/slurmd/conf-cache/slurm.conf'</b> configuration file for potential changes.
{{site.data.alerts.end}}

### User and job limits with QoS

In the `gmerlin7` GPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
effectively.

To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
various QoS definitions applicable to the gmerlin7 GPU-based cluster. Here:

* `MaxTRES` specifies resource limits per job.
* `MaxTRESPU` specifies resource limits per user.

| Name | MaxTRES | MaxTRESPU | Scope |
| -----------------------: | -------------------------------: | -------------------------------: | ---------------------: |
| **normal** | | | partition |
| **gpu_general** | gres/gpu=4 | gres/gpu=8 | <u>user</u>, partition |
| **gpu_daily** | gres/gpu=8 | gres/gpu=8 | partition |
| **gpu_hourly** | gres/gpu=8 | gres/gpu=8 | partition |
| **gpu_gh_interactive** | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 | partition |
| **gpu_a100_interactive** | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 | partition |

Where:

* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
  restrictions.
* **`gpu_general` QoS:** This is the **default QoS** for `gmerlin7` _users_. It limits the total resources available to each
  user. Additionally, this QoS is applied to the `[a100|gh]-general` partitions, enforcing restrictions at the partition level and
  overriding user-level QoS.
* **`gpu_daily` QoS:** Guarantees increased resources for the `[a100|gh]-daily` partitions, accommodating shorter-duration jobs
  with higher resource needs.
* **`gpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `[a100|gh]-hourly` partitions,
  which cater to very short-duration jobs.
* **`gpu_a100_interactive` & `gpu_gh_interactive` QoS:** Guarantee interactive access to GPU nodes for software compilation and
  small testing.

For additional details, refer to the [GPU partitions](/merlin7/slurm-configuration.html#GPU-partitions) section.

{{site.data.alerts.tip}}
Always verify QoS definitions for potential changes using the <b>'sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"'</b> command.
{{site.data.alerts.end}}

### GPU partitions

This section provides a summary of the partitions available in the `gmerlin7` GPU cluster.

Key concepts:

* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
  and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
  partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
  QoS settings can be found in the [User and job limits with QoS](/merlin7/slurm-configuration.html#user-and-job-limits-with-qos) section.

{{site.data.alerts.tip}}
Always verify partition configurations for potential changes using the <b>'scontrol show partition'</b> command.
{{site.data.alerts.end}}

#### A100-based partitions

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -------------------: | -----------: | ----------: | --------: | ----------------: | -----------: | -------------------: | -------------: |
| **a100-general** | 1-00:00:00 | 7-00:00:00 | 3 | 1 | 1 | gpu_general | <u>merlin</u> |
| **a100-daily** | 0-01:00:00 | 1-00:00:00 | 4 | 500 | 1 | gpu_daily | <u>merlin</u> |
| **a100-hourly** | 0-00:30:00 | 0-01:00:00 | 5 | 1000 | 1 | gpu_hourly | <u>merlin</u> |
| **a100-interactive** | 0-01:00:00 | 0-12:00:00 | 5 | 1 | 2 | gpu_a100_interactive | <u>merlin</u> |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

{{site.data.alerts.tip}}
For jobs running less than one day, submit them to the <b>a100-daily</b> partition.
For jobs running less than one hour, use the <b>a100-hourly</b> partition.
These partitions provide higher priority and ensure quicker scheduling compared to <b>a100-general</b>, which has limited node availability.
{{site.data.alerts.end}}

#### GH-based partitions

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -------------------: | -----------: | ----------: | --------: | ----------------: | -----------: | -------------------: | -------------: |
| **gh-general** | 1-00:00:00 | 7-00:00:00 | 5 | 1 | 1 | gpu_general | <u>merlin</u> |
| **gh-daily** | 0-01:00:00 | 1-00:00:00 | 6 | 500 | 1 | gpu_daily | <u>merlin</u> |
| **gh-hourly** | 0-00:30:00 | 0-01:00:00 | 7 | 1000 | 1 | gpu_hourly | <u>merlin</u> |
| **gh-interactive** | 0-01:00:00 | 0-12:00:00 | 7 | 1 | 2 | gpu_gh_interactive | <u>merlin</u> |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

{{site.data.alerts.tip}}
For jobs running less than one day, submit them to the <b>gh-daily</b> partition.
For jobs running less than one hour, use the <b>gh-hourly</b> partition.
These partitions provide higher priority and ensure quicker scheduling compared to <b>gh-general</b>, which has limited node availability.
{{site.data.alerts.end}}
@@ -0,0 +1,68 @@
---
title: Slurm Examples
#tags:
keywords: slurm example, template, examples, templates, running jobs, sbatch, single core based jobs, HT, multithread, no-multithread, mpi, openmp, packed jobs, hands-on, array jobs, gpu
last_updated: 24 May 2023
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-examples.html
---

## Single core based job examples

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```

## Multi-core based job examples

### Pure MPI

```bash
#!/bin/bash
#SBATCH --job-name=purempi
#SBATCH --partition=daily       # Using 'daily' will grant higher priority
#SBATCH --time=24:00:00         # Define max time job will run
#SBATCH --output=%x-%j.out      # Define your output file
#SBATCH --error=%x-%j.err       # Define your error file
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --hint=nomultithread
##SBATCH --cpus-per-task=1

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```

### Hybrid

```bash
#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --partition=daily       # Using 'daily' will grant higher priority
#SBATCH --time=24:00:00         # Define max time job will run
#SBATCH --output=%x-%j.out      # Define your output file
#SBATCH --error=%x-%j.err       # Define your error file
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --hint=multithread
#SBATCH --cpus-per-task=2

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```
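
For hybrid MPI+OpenMP runs it is common to tie the OpenMP thread count to the CPUs allocated per task; a possible addition to the script above (a sketch, not part of the original template):

```bash
# Let OpenMP use exactly the CPUs allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun $MYEXEC
```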