docs/merlin7/03-Slurm-General-Documentation/interactive-jobs.md

# Running Interactive Jobs

## The Merlin7 'interactive' partition

On the **`merlin7`** cluster, it is recommended to always run interactive jobs on the **`interactive`** partition.
This partition allows CPU oversubscription (up to four users may share the same CPU) and **has the highest scheduling priority**. Access to this partition is typically quick, making it a convenient extension of the login nodes for interactive workloads.

On the **`gmerlin7`** cluster, additional interactive partitions are available, but these are primarily intended for CPU-only workloads (such as compiling GPU-based software, or creating an allocation for submitting jobs to Grace Hopper nodes).

!!! warning
    Because **GPU resources are scarce and expensive**, interactive allocations that use GPUs should only be submitted when strictly necessary and well justified.

## Running interactive jobs

There are two ways to run interactive jobs in Slurm, using the ``salloc`` and ``srun`` commands:

* **``salloc``**: obtains a Slurm job allocation (a set of nodes), executes command(s), and then releases the allocation when the command is finished.
* **``srun``**: is used for running parallel tasks.

### srun

``srun`` is used to run parallel jobs in the batch system. It can be used within a batch script
(submitted with ``sbatch``), within a job allocation (created with ``salloc``),
or as a direct command (for example, from the login nodes).

When used inside a batch script or a job allocation, ``srun`` is constrained to the
amount of resources allocated by the ``sbatch``/``salloc`` commands. In ``sbatch``, these
resources are usually defined inside the batch script with lines of the format ``#SBATCH --<option>=<value>``.
In other words, if your batch script or allocation requests 88 tasks (with 1 thread per core)
and 2 nodes, ``srun`` is constrained to that amount of resources (you can use less, but never
exceed those limits).
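As a sketch of this constraint, a minimal batch script (task and node counts are illustrative) could look like:

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --nodes=2        # 2 nodes for the whole allocation
#SBATCH --ntasks=88      # 88 tasks in total across the allocation
#SBATCH --time=00:10:00

# 'srun' below may use up to the allocated 88 tasks on 2 nodes,
# but can never exceed those limits
srun hostname
```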

When used from the login node, ``srun`` usually runs a specific command or software
interactively. ``srun`` is a blocking process (it blocks the bash prompt until the ``srun``
command finishes, unless you run it in the background with ``&``). This can be very useful for running
interactive software which pops up a window and then submits jobs or runs sub-tasks in the
background (for example, **Relion**, **cisTEM**, etc.).

Refer to ``man srun`` to explore all possible options for that command.

<details>
<summary>[Show 'srun' example]: Running 'hostname' command on 3 nodes, using 2 cores (1 task/core) per node</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --ntasks=6 --ntasks-per-node=2 --nodes=3 hostname
cn001.merlin7.psi.ch
cn001.merlin7.psi.ch
cn002.merlin7.psi.ch
cn002.merlin7.psi.ch
cn003.merlin7.psi.ch
cn003.merlin7.psi.ch
</pre>
</details>

### salloc

**``salloc``** is used to obtain a Slurm job allocation (a set of nodes). Once the job is allocated,
users can execute interactive command(s). Once finished (``exit`` or ``Ctrl+D``),
the allocation is released. **``salloc``** is a blocking command, that is, it blocks
until the requested resources are allocated.

When running **``salloc``**, once the resources are allocated, *by default* the user gets
a ***new shell on one of the allocated nodes*** (if a user has requested several nodes, it will
open a new shell on the first allocated node). However, this behaviour can be changed by adding
a shell (`$SHELL`) at the end of the `salloc` command. For example:

```bash
# Typical 'salloc' call
salloc --clusters=merlin7 --partition=interactive -N 2 -n 2

# Custom 'salloc' call
# - $SHELL will open a local shell on the login node from where 'salloc' is running
salloc --clusters=merlin7 --partition=interactive -N 2 -n 2 $SHELL
```

<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>Default</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive -N 2 -n 2
salloc: Granted job allocation 161
salloc: Nodes cn[001-002] are ready for job

caubet_m@login001:~> srun hostname
cn002.merlin7.psi.ch
cn001.merlin7.psi.ch

caubet_m@login001:~> exit
exit
salloc: Relinquishing job allocation 161
</pre>
</details>

<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>$SHELL</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --ntasks=2 --nodes=2 $SHELL
salloc: Granted job allocation 165
salloc: Nodes cn[001-002] are ready for job
caubet_m@login001:~> srun hostname
cn001.merlin7.psi.ch
cn002.merlin7.psi.ch
caubet_m@login001:~> exit
exit
salloc: Relinquishing job allocation 165
</pre>
</details>

## Running interactive jobs with X11 support

### Requirements

#### Graphical access

[NoMachine](../02-How-To-Use-Merlin/nomachine.md) is the officially supported service for graphical
access to the Merlin cluster. This service runs on the login nodes. Check the
document [{Accessing Merlin -> NoMachine}](../02-How-To-Use-Merlin/nomachine.md) for details about
how to connect to the **NoMachine** service in the Merlin cluster.

For other, not officially supported, graphical access (X11 forwarding):

* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](../02-How-To-Use-Merlin/connect-from-linux.md)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](../02-How-To-Use-Merlin/connect-from-windows.md)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](../02-How-To-Use-Merlin/connect-from-macos.md)

### 'srun' with X11 support

The Merlin6 and Merlin7 clusters allow running window-based applications. For that, you need to
add the option ``--x11`` to the ``srun`` command. For example:

```bash
srun --clusters=merlin7 --partition=interactive --x11 sview
```

This will pop up an X11-based Slurm view of the cluster.

In the same manner, you can create a bash shell with X11 support by adding the option
``--pty`` to the ``srun --x11`` command. Once the resource is allocated,
you can interactively run X11 and non-X11 based commands from there.

```bash
srun --clusters=merlin7 --partition=interactive --x11 --pty bash
```

<details>
<summary>[Show 'srun' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --x11 sview

caubet_m@login001:~>

caubet_m@login001:~> srun --clusters=merlin7 --partition=interactive --x11 --pty bash

caubet_m@cn003:~> sview

caubet_m@cn003:~> echo "This was an example"
This was an example

caubet_m@cn003:~> exit
exit
</pre>
</details>

### 'salloc' with X11 support

The **Merlin6** and **Merlin7** clusters allow running window-based applications. For that, you need to
add the option ``--x11`` to the ``salloc`` command. For example:

```bash
salloc --clusters=merlin7 --partition=interactive --x11 sview
```

This will pop up an X11-based Slurm view of the cluster.

In the same manner, you can create a bash shell with X11 support by simply running
``salloc --clusters=merlin7 --partition=interactive --x11``. Once the resource is allocated,
you can interactively run X11 and non-X11 based commands from there.

```bash
salloc --clusters=merlin7 --partition=interactive --x11
```

<details>
<summary>[Show 'salloc' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --x11 sview
salloc: Granted job allocation 174
salloc: Nodes cn001 are ready for job
salloc: Relinquishing job allocation 174

caubet_m@login001:~> salloc --clusters=merlin7 --partition=interactive --x11
salloc: Granted job allocation 175
salloc: Nodes cn001 are ready for job
caubet_m@cn001:~>

caubet_m@cn001:~> sview

caubet_m@cn001:~> echo "This was an example"
This was an example

caubet_m@cn001:~> exit
exit
salloc: Relinquishing job allocation 175
</pre>
</details>

---
title: Slurm cluster 'merlin7'
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/merlin7-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Infrastructure

### Hardware

* 2 CPU-only login nodes
* 77 CPU-only compute nodes
* 5 GPU A100 nodes
* 8 GPU Grace Hopper nodes

The specification of the node types is:

| Node | #Nodes | CPU | RAM | GRES |
| ----: | ------ | --- | --- | ---- |
| Login Nodes | 2 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| CPU Nodes | 77 | _2x_ AMD EPYC 7742 (x86_64 Rome, 64 Cores, 2.25GHz) | 512GB DDR4 3200MHz | |
| A100 GPU Nodes | 5 | _2x_ AMD EPYC 7713 (x86_64 Milan, 64 Cores, 3.2GHz) | 512GB DDR4 3200MHz | 4 x NV_A100 (80GB) |
| GH GPU Nodes | 3 | _2x_ NVidia Grace Neoverse-V2 (SBSA ARM 64bit, 144 Cores, 3.1GHz) | _2x_ 480GB DDR5X (CPU+GPU) | 4 x NV_GH200 (120GB) |

### Network

The Merlin7 cluster builds on top of HPE/Cray technologies, including a high-performance network fabric called Slingshot. This network fabric is able
to provide up to 200 Gbit/s throughput between nodes. Further information on Slingshot can be found at [HPE](https://www.hpe.com/psnow/doc/PSN1012904596HREN) and
at <https://www.glennklockwood.com/garden/slingshot>.

Through software interfaces like [libFabric](https://ofiwg.github.io/libfabric/) (which is available on Merlin7), applications can leverage the network seamlessly.

### Storage

Unlike previous iterations of the Merlin HPC clusters, Merlin7 _does not_ have any local storage. Instead, storage for the entire cluster is provided through
a dedicated storage appliance from HPE/Cray called [ClusterStor](https://www.hpe.com/psnow/doc/PSN1012842049INEN.pdf).

The appliance is built of several storage servers:

* 2 management nodes
* 2 MDS servers, 12 drives per server, 2.9TiB (Raid10)
* 8 OSS-D servers, 106 drives per server, 14.5TiB HDDs (Gridraid / Raid6)
* 4 OSS-F servers, 12 drives per server, 7TiB SSDs (Raid10)

With an effective storage capacity of:

* 10 PB HDD
    * value visible on Linux: HDD 9302.4 TiB
* 162 TB SSD
    * value visible on Linux: SSD 151.6 TiB
* 23.6 TiB of metadata

The storage is directly connected to the cluster (and each individual node) through the Slingshot NIC.

---
title: Slurm merlin7 Configuration
#tags:
keywords: configuration, partitions, node definition
#last_updated: 24 May 2023
summary: "This document describes a summary of the Merlin7 Slurm CPU-based configuration."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and options needed to run jobs in the Merlin7 cluster.

## Public partitions configuration summary

### CPU public partitions

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -----------------: | -----------: | ----------: | -------: | ---------------: | --------------------: | --------------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | cpu=1024,mem=1920G | cpu=1024,mem=1920G |
| **daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | cpu=1024,mem=1920G | cpu=2048,mem=3840G |
| **hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | cpu=2048,mem=3840G | cpu=8192,mem=15T |
| **interactive** | 0-04:00:00 | 0-12:00:00 | Highest | <u>merlin</u> | cpu=16,mem=30G,node=1 | cpu=32,mem=60G,node=1 |

### GPU public partitions

#### A100 nodes

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -------------------: | -----------: | ----------: | ---------: | -------------: | -------------------------------: | -------------------------------: |
| **a100-general** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | gres/gpu=4 | gres/gpu=8 |
| **a100-daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **a100-hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **a100-interactive** | 0-01:00:00 | 0-12:00:00 | Very High | <u>merlin</u> | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 |

#### Grace-Hopper nodes

| PartitionName | DefaultTime | MaxTime | Priority | Account | Per Job Limits | Per User Limits |
| -------------------: | -----------: | ----------: | ---------: | -------------: | -------------------------------: | -------------------------------: |
| **gh-general** | 1-00:00:00 | 7-00:00:00 | Low | <u>merlin</u> | gres/gpu=4 | gres/gpu=8 |
| **gh-daily** | 0-01:00:00 | 1-00:00:00 | Medium | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **gh-hourly** | 0-00:30:00 | 0-01:00:00 | High | <u>merlin</u> | gres/gpu=8 | gres/gpu=8 |
| **gh-interactive** | 0-01:00:00 | 0-12:00:00 | Very High | <u>merlin</u> | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 |

## CPU cluster: merlin7

**By default, jobs will be submitted to `merlin7`**, as it is the primary cluster configured on the login nodes.
Specifying the cluster name is typically unnecessary unless you have defined environment variables that could override the default cluster name.
However, when necessary, one can specify the cluster as follows:

```bash
#SBATCH --clusters=merlin7
```

### CPU general configuration

The **Merlin7 CPU cluster** is configured with the **`CR_CORE_MEMORY`** and **`CR_ONE_TASK_PER_CORE`** options.

* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
  to fulfill a job's resource requirements.

By default, Slurm will allocate one task per core, which means:

* Each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.

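As a sketch of this behavior (task counts are illustrative), a job requesting 4 tasks is charged 8 CPUs, since each task occupies both hyper-threads of a core:

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --ntasks=4       # 4 tasks -> 4 physical cores -> 8 CPUs charged
#SBATCH --time=00:05:00

# Each task is bound to a full core; both of its threads count
# against the job, even if only one thread is actively used.
srun bash -c 'echo "task ${SLURM_PROCID}: $(nproc) CPUs visible"'
```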
### CPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.

| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | ------------: |
| login[001-002] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |
| cn[001-077] | 2 | 64 | 128 | 2 | 256 | 480G | 1920M | AMD_EPYC_7713 |

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:
    * **`--mem=<mem_in_MB>`**: Allocates memory per node.
    * **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
  effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
  adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads.

!!! tip
    Always verify the Slurm `/var/spool/slurmd/conf-cache/slurm.conf` configuration file for potential changes.

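Following the note above, a hedged sketch of an MPI-style submission that disables hyper-threading and doubles the per-CPU memory (the task count and the binary name `./my_mpi_app` are illustrative placeholders):

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --ntasks=128
#SBATCH --hint=nomultithread   # one thread per core: halves the usable CPUs
#SBATCH --mem-per-cpu=3840     # doubled from the 1920M default to compensate

# With hyper-threading disabled, each task gets a full single-threaded core.
srun ./my_mpi_app
```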
### User and job limits with QoS

In the `merlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
limits may result in pending jobs even when many nodes are idle due to low activity.

On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.

Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
effectively.

To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
various QoS definitions applicable to the merlin7 CPU-based cluster. Here:

* `MaxTRES` specifies resource limits per job.
* `MaxTRESPU` specifies resource limits per user.

| Name | MaxTRES | MaxTRESPU | Scope |
| -------------------: | --------------------: | --------------------: | ---------------------: |
| **normal** | | | partition |
| **cpu_general** | cpu=1024,mem=1920G | cpu=1024,mem=1920G | <u>user</u>, partition |
| **cpu_daily** | cpu=1024,mem=1920G | cpu=2048,mem=3840G | partition |
| **cpu_hourly** | cpu=2048,mem=3840G | cpu=8192,mem=15T | partition |
| **cpu_interactive** | cpu=16,mem=30G,node=1 | cpu=32,mem=60G,node=1 | partition |

Where:

* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job
  restrictions.
* **`cpu_general` QoS:** This is the **default QoS** for `merlin7` _users_. It limits the total resources available to each
  user. Additionally, this QoS is applied to the `general` partition, enforcing restrictions at the partition level and
  overriding user-level QoS.
* **`cpu_daily` QoS:** Guarantees increased resources for the `daily` partition, accommodating shorter-duration jobs
  with higher resource needs.
* **`cpu_hourly` QoS:** Offers the least constraints, allowing more resources to be used for the `hourly` partition,
  which caters to very short-duration jobs.
* **`cpu_interactive` QoS:** Is restricted to one node and a few CPUs only, and is intended to be used when interactive
  allocations are necessary (`salloc`, `srun`).

For additional details, refer to the [CPU partitions](#cpu-partitions) section.

!!! tip
    Always verify QoS definitions for potential changes using the `sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"` command.

### CPU partitions

This section provides a summary of the partitions available in the `merlin7` CPU cluster.

Key concepts:

* **`PriorityJobFactor`**: This value is added to a job’s priority (visible in the `PARTITION` column of the `sprio -l` command).
  Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age*
  and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions
  with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier
  partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability
  for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various
  QoS settings can be found in the [User and job limits with QoS](#user-and-job-limits-with-qos) section.

!!! tip
    Always verify partition configurations for potential changes using the `scontrol show partition` command.

#### CPU public partitions

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | --------------: | -------------: |
| **<u>general</u>** | 1-00:00:00 | 7-00:00:00 | 46 | 1 | 1 | cpu_general | <u>merlin</u> |
| **daily** | 0-01:00:00 | 1-00:00:00 | 58 | 500 | 1 | cpu_daily | <u>merlin</u> |
| **hourly** | 0-00:30:00 | 0-01:00:00 | 77 | 1000 | 1 | cpu_hourly | <u>merlin</u> |
| **interactive** | 0-04:00:00 | 0-12:00:00 | 58 | 1 | 2 | cpu_interactive | <u>merlin</u> |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs.
Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

!!! tip
    For jobs running less than one day, submit them to the **daily** partition.
    For jobs running less than one hour, use the **hourly** partition. These
    partitions provide higher priority and ensure quicker scheduling compared
    to **general**, which has limited node availability.

The **`hourly`** partition may include private nodes as an additional buffer. However, the current Slurm partition configuration, governed
by **`PriorityTier`**, ensures that jobs submitted to private partitions are prioritized and processed first. As a result, access to the
**`hourly`** partition might experience delays in such scenarios.

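As a hedged illustration of the tip above, a short job could be steered to the `hourly` partition like this (time and task values are illustrative):

```bash
#!/bin/bash
#SBATCH --clusters=merlin7
#SBATCH --partition=hourly   # jobs shorter than 1 hour schedule faster here
#SBATCH --time=00:45:00      # must stay within the partition MaxTime of 1 hour
#SBATCH --ntasks=8

srun hostname
```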
The **`interactive`** partition is designed specifically for real-time, interactive work. Here are the key characteristics:

* **CPU Oversubscription:** This partition allows CPU oversubscription (configured as `FORCE:4`), meaning that up to four interactive
  jobs may share the same physical CPU core. This can impact performance, but enables fast access for short-term tasks.
* **Highest Scheduling Priority:** Jobs submitted to the interactive partition are always prioritized. They will be scheduled
  before any jobs in other partitions.
* **Intended Use:** This partition is ideal for debugging, testing, compiling, short interactive runs, and other activities where
  immediate access is important.

!!! warning
    Because of CPU sharing, the performance on the **interactive** partition
    may not be optimal for compute-intensive tasks. For long-running or
    production workloads, use a dedicated batch partition instead.

#### CPU private partitions

##### CAS / ASA

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **asa** | 0-01:00:00 | 14-00:00:00 | 10 | 1 | 2 | normal | asa |

##### CNM / Mu3e

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **mu3e** | 1-00:00:00 | 7-00:00:00 | 4 | 1 | 2 | normal | mu3e, meg |

##### CNM / MeG

| PartitionName | DefaultTime | MaxTime | TotalNodes | PriorityJobFactor | PriorityTier | QoS | AllowAccounts |
| -----------------: | -----------: | ----------: | --------: | ----------------: | -----------: | ----------: | -------------: |
| **meg-short** | 0-01:00:00 | 0-01:00:00 | unlimited | 1000 | 2 | normal | meg |
| **meg-long** | 1-00:00:00 | 5-00:00:00 | unlimited | 1 | 2 | normal | meg |
| **meg-prod** | 1-00:00:00 | 5-00:00:00 | unlimited | 1000 | 4 | normal | meg |

## GPU cluster: gmerlin7

As mentioned in previous sections, by default, jobs will be submitted to `merlin7`, as it is the primary cluster configured on the login nodes.
For submitting jobs to the GPU cluster, **the cluster name `gmerlin7` must be specified**, as follows:

```bash
#SBATCH --clusters=gmerlin7
```

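For example, a minimal GPU job submission to an A100 partition might look as follows; this is a sketch, and the partition name and GPU count are assumptions based on the tables above:

```bash
#!/bin/bash
#SBATCH --clusters=gmerlin7
#SBATCH --partition=a100-daily  # one of the A100 public partitions
#SBATCH --gpus=1                # request a single GPU (within the QoS limits)
#SBATCH --time=02:00:00

# CPUs are allocated bound to the selected GPU (ENFORCE_BINDING_GRES)
srun nvidia-smi
```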
### GPU general configuration

The **Merlin7 GPU cluster** is configured with the **`CR_CORE_MEMORY`**, **`CR_ONE_TASK_PER_CORE`**, and **`ENFORCE_BINDING_GRES`** options.

* This configuration treats both cores and memory as consumable resources.
* Since the nodes are running with **hyper-threading** enabled, each core thread is counted as a CPU
  to fulfill a job's resource requirements.
* Slurm will allocate CPUs bound to the selected GPU.

By default, Slurm will allocate one task per core, which means:

* For hyper-threaded nodes (NVIDIA A100-based nodes), each task will consume 2 **CPUs**, regardless of whether both threads are actively used by the job.
* For the NVIDIA GraceHopper-based nodes, each task will consume 1 **CPU**.

This behavior ensures consistent resource allocation but may result in underutilization of hyper-threading in some cases.

### GPU nodes definition

The table below provides an overview of the Slurm configuration for the different node types in the Merlin7 cluster.
This information is essential for understanding how resources are allocated, enabling users to tailor their submission
scripts accordingly.

| Nodes | Sockets | CoresPerSocket | Cores | ThreadsPerCore | CPUs | MaxMemPerNode | DefMemPerCPU | Gres | Features |
| --------------------:| -------: | --------------: | -----: | --------------: | ----: | ------------: | -----------: | --------------------------: | ---------------------: |
| gpu[001-007] | 4 | 72 | 288 | 1 | 288 | 828G | 2944M | gpu:gh200:4 | GH200, NV_H100 |
| gpu[101-105] | 1 | 64 | 64 | 2 | 128 | 480G | 3840M | gpu:nvidia_a100-sxm4-80gb:4 | AMD_EPYC_7713, NV_A100 |

Notes on memory configuration:

* **Memory allocation options:** To request additional memory, use the following options in your submission script:
    * **`--mem=<mem_in_MB>`**: Allocates memory per node.
    * **`--mem-per-cpu=<mem_in_MB>`**: Allocates memory per CPU (equivalent to a core thread).

  The total memory requested cannot exceed the **`MaxMemPerNode`** value.
* **Impact of disabling Hyper-Threading:** Using the **`--hint=nomultithread`** option disables one thread per core,
  effectively halving the number of available CPUs. Consequently, memory allocation will also be halved unless explicitly
  adjusted.

  For MPI-based jobs, where performance generally improves with single-threaded CPUs, this option is recommended.
  In such cases, you should double the **`--mem-per-cpu`** value to account for the reduced number of threads.

!!! tip
    Always verify the Slurm `/var/spool/slurmd/conf-cache/slurm.conf` configuration file for potential changes.

### User and job limits with QoS
|
||||
|
||||
In the `gmerlin7` CPU cluster, we enforce certain limits on jobs and users to ensure fair resource usage and prevent
|
||||
overuse by a single user or job. These limits aim to balance resource availability while maintaining overall cluster
|
||||
efficiency. However, applying limits can occasionally impact the cluster’s utilization. For example, user-specific
|
||||
limits may result in pending jobs even when many nodes are idle due to low activity.
|
||||
|
||||
On the other hand, these limits also enhance cluster efficiency by preventing scenarios such as a single job monopolizing
|
||||
all available resources, which could block other jobs from running. Without job size limits, for instance, a large job
|
||||
might drain the entire cluster to satisfy its resource request, a situation that is generally undesirable.
|
||||
|
||||
Thus, setting appropriate limits is essential to maintain fair resource usage while optimizing cluster efficiency. These
|
||||
limits should allow for a mix of jobs of varying sizes and types, including single-core and parallel jobs, to coexist
|
||||
effectively.
|
||||
|
||||
To implement these limits, **we utilize Quality of Service (QoS)**. Different QoS policies are defined and applied
|
||||
**to specific partitions** in line with the established resource allocation policies. The table below outlines the
|
||||
various QoS definitions applicable to the merlin7 CPU-based cluster. Here:
|
||||
* `MaxTRES` specifies resource limits per job.
|
||||
* `MaxTRESPU` specifies resource limits per user.
|
||||

| Name                     | MaxTRES                          | MaxTRESPU                        | Scope                  |
| -----------------------: | -------------------------------: | -------------------------------: | ---------------------: |
| **normal**               |                                  |                                  | partition              |
| **gpu_general**          | gres/gpu=4                       | gres/gpu=8                       | <u>user</u>, partition |
| **gpu_daily**            | gres/gpu=8                       | gres/gpu=8                       | partition              |
| **gpu_hourly**           | gres/gpu=8                       | gres/gpu=8                       | partition              |
| **gpu_gh_interactive**   | cpu=16,gres/gpu=1,mem=46G,node=1 | cpu=16,gres/gpu=1,mem=46G,node=1 | partition              |
| **gpu_a100_interactive** | cpu=16,gres/gpu=1,mem=60G,node=1 | cpu=16,gres/gpu=1,mem=60G,node=1 | partition              |

Where:

* **`normal` QoS:** This QoS has no limits and is typically applied to partitions that do not require user or job restrictions.
* **`gpu_general` QoS:** This is the **default QoS** for `gmerlin7` _users_. It limits the total resources available to each user. Additionally, this QoS is applied to the `[a100|gh]-general` partitions, enforcing restrictions at the partition level and overriding user-level QoS.
* **`gpu_daily` QoS:** Guarantees increased resources for the `[a100|gh]-daily` partitions, accommodating shorter-duration jobs with higher resource needs.
* **`gpu_hourly` QoS:** Offers the fewest constraints, allowing more resources to be used in the `[a100|gh]-hourly` partitions, which cater to very short-duration jobs.
* **`gpu_a100_interactive` & `gpu_gh_interactive` QoS:** Guarantee interactive access to GPU nodes for software compilation and small tests.

For additional details, refer to the [GPU partitions](#gpu-partitions) section.

!!! tip
    Always verify QoS definitions for potential changes using the `sacctmgr show qos format="Name%22,MaxTRESPU%35,MaxTRES%35"` command.
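
For instance, under the `gpu_general` QoS a single job may request at most `gres/gpu=4`. A sketch of a submission that stays within that per-job limit (module and binary names are placeholders):

```shell
#!/bin/bash
#SBATCH --partition=gh-general
#SBATCH --gpus=4                 # per-job maximum under the gpu_general QoS (MaxTRES gres/gpu=4)
#SBATCH --time=1-00:00:00

module purge
module load $MODULE_NAME         # placeholder: a software package in PModules
srun $MYEXEC                     # placeholder: path to your GPU binary
```

Requesting more than 4 GPUs in a single job on this partition would leave the job pending with a QoS limit reason.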

### GPU partitions

This section provides a summary of the partitions available in the `gmerlin7` GPU cluster.

Key concepts:

* **`PriorityJobFactor`**: This value is added to a job's priority (visible in the `PARTITION` column of the `sprio -l` command). Jobs submitted to partitions with higher `PriorityJobFactor` values generally run sooner. However, other factors like *job age* and especially *fair share* can also influence scheduling.
* **`PriorityTier`**: Jobs submitted to partitions with higher `PriorityTier` values take precedence over pending jobs in partitions with lower `PriorityTier` values. Additionally, jobs from higher `PriorityTier` partitions can preempt running jobs in lower-tier partitions, where applicable.
* **`QoS`**: Specifies the quality of service associated with a partition. It is used to control and restrict resource availability for specific partitions, ensuring that resource allocation aligns with intended usage policies. Detailed explanations of the various QoS settings can be found in the [User and job limits with QoS](#user-and-job-limits-with-qos) section.

!!! tip
    Always verify partition configurations for potential changes using the `scontrol show partition` command.

#### A100-based partitions

| PartitionName        | DefaultTime  | MaxTime     | TotalNodes | PriorityJobFactor | PriorityTier | QoS                  | AllowAccounts  |
| -------------------: | -----------: | ----------: | ---------: | ----------------: | -----------: | -------------------: | -------------: |
| **a100-general**     | 1-00:00:00   | 7-00:00:00  | 3          | 1                 | 1            | gpu_general          | <u>merlin</u>  |
| **a100-daily**       | 0-01:00:00   | 1-00:00:00  | 4          | 500               | 1            | gpu_daily            | <u>merlin</u>  |
| **a100-hourly**      | 0-00:30:00   | 0-01:00:00  | 5          | 1000              | 1            | gpu_hourly           | <u>merlin</u>  |
| **a100-interactive** | 0-01:00:00   | 0-12:00:00  | 5          | 1                 | 2            | gpu_a100_interactive | <u>merlin</u>  |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs. Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

!!! tip
    For jobs running less than one day, submit them to the **a100-daily** partition. For jobs running less than one hour, use the **a100-hourly** partition. These partitions provide higher priority and ensure quicker scheduling compared to **a100-general**, which has limited node availability.
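
As an illustration (module and binary names are placeholders), a short single-GPU job targeting the higher-priority **a100-hourly** partition could look like:

```shell
#!/bin/bash
#SBATCH --partition=a100-hourly
#SBATCH --gpus=1                 # request a single A100
#SBATCH --time=00:30:00          # must fit within the partition's 1 hour MaxTime
#SBATCH --output=%x-%j.out

module purge
module load $MODULE_NAME         # placeholder: a software package in PModules
srun $MYEXEC                     # placeholder: path to your GPU binary
```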

#### GH-based partitions

| PartitionName      | DefaultTime  | MaxTime     | TotalNodes | PriorityJobFactor | PriorityTier | QoS                | AllowAccounts  |
| -----------------: | -----------: | ----------: | ---------: | ----------------: | -----------: | -----------------: | -------------: |
| **gh-general**     | 1-00:00:00   | 7-00:00:00  | 5          | 1                 | 1            | gpu_general        | <u>merlin</u>  |
| **gh-daily**       | 0-01:00:00   | 1-00:00:00  | 6          | 500               | 1            | gpu_daily          | <u>merlin</u>  |
| **gh-hourly**      | 0-00:30:00   | 0-01:00:00  | 7          | 1000              | 1            | gpu_hourly         | <u>merlin</u>  |
| **gh-interactive** | 0-01:00:00   | 0-12:00:00  | 7          | 1                 | 2            | gpu_gh_interactive | <u>merlin</u>  |

All Merlin users are part of the `merlin` account, which is used as the *default account* when submitting jobs. Similarly, if no partition is specified, jobs are automatically submitted to the `general` partition by default.

!!! tip
    For jobs running less than one day, submit them to the **gh-daily** partition. For jobs running less than one hour, use the **gh-hourly** partition. These partitions provide higher priority and ensure quicker scheduling compared to **gh-general**, which has limited node availability.
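
Staying within the `gpu_gh_interactive` QoS limits (`cpu=16,gres/gpu=1,mem=46G,node=1`), an interactive allocation on a Grace Hopper node might be requested as in this sketch:

```shell
# Allocation must fit the gpu_gh_interactive per-job limits;
# --time must stay within the partition's 12 hour MaxTime.
salloc --partition=gh-interactive --nodes=1 --ntasks=1 --cpus-per-task=16 \
       --gpus=1 --mem=46G --time=02:00:00
```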

---
title: Slurm Examples
#tags:
keywords: slurm example, template, examples, templates, running jobs, sbatch, single core based jobs, HT, multithread, no-multithread, mpi, openmp, packed jobs, hands-on, array jobs, gpu
last_updated: 24 May 2023
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin7_sidebar
permalink: /merlin7/slurm-examples.html
---

## Single core based job examples

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```

## Multi-core based job examples

### Pure MPI

```bash
#!/bin/bash
#SBATCH --job-name=purempi
#SBATCH --partition=daily       # Using 'daily' will grant higher priority
#SBATCH --time=24:00:00         # Define max time job will run
#SBATCH --output=%x-%j.out      # Define your output file
#SBATCH --error=%x-%j.err       # Define your error file
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --hint=nomultithread
##SBATCH --cpus-per-task=1

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```

### Hybrid

```bash
#!/bin/bash
#SBATCH --job-name=hybrid
#SBATCH --partition=daily       # Using 'daily' will grant higher priority
#SBATCH --time=24:00:00         # Define max time job will run
#SBATCH --output=%x-%j.out      # Define your output file
#SBATCH --error=%x-%j.err       # Define your error file
#SBATCH --exclusive
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --hint=multithread
#SBATCH --cpus-per-task=2

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software in PModules

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK  # match OpenMP threads to --cpus-per-task
srun $MYEXEC                    # where $MYEXEC is a path to your binary file
```