Docs v2 #8
@@ -13,6 +13,6 @@ entries:
  - title: News
    url: /news.html
    output: web
  - title: Merlin 6
  - title: The Merlin Local HPC Cluster
    url: /merlin6/introduction.html
    output: web
    output: web
@@ -5,7 +5,7 @@ entries:
  - product: Merlin
    version: 6
    folders:
    - title: Introduction
    - title: Quick Start Guide
      # URLs for top-level folders are optional. If omitted it is a bit easier to toggle the accordion.
      #url: /merlin6/introduction.html
      folderitems:
@@ -13,24 +13,22 @@ entries:
        url: /merlin6/introduction.html
      - title: Code Of Conduct
        url: /merlin6/code-of-conduct.html
      - title: Hardware And Software Description
        url: /merlin6/hardware-and-software.html
    - title: Accessing Merlin
      folderitems:
      - title: Requesting Accounts
        url: /merlin6/request-account.html
      - title: Requesting Projects
        url: /merlin6/request-project.html
      - title: Accessing Interactive Nodes
      - title: Accessing the Interactive Nodes
        url: /merlin6/interactive.html
      - title: Accessing the Slurm Clusters
        url: /merlin6/slurm-access.html
    - title: How To Use Merlin
      folderitems:
      - title: Accessing from a Linux client
        url: /merlin6/connect-from-linux.html
      - title: Accessing from a Windows client
        url: /merlin6/connect-from-windows.html
      - title: Accessing from a MacOS client
        url: /merlin6/connect-from-macos.html
      - title: Accessing Slurm Cluster
        url: /merlin6/slurm-access.html
      - title: Merlin6 Storage
        url: /merlin6/storage.html
      - title: Transferring Data
@@ -41,22 +39,38 @@ entries:
        url: /merlin6/nomachine.html
      - title: Configuring SSH Keys
        url: /merlin6/ssh-keys.html
    - title: Job Submission
      folderitems:
      - title: Using PModules
      - title: Software repository - PModules
        url: /merlin6/using-modules.html
    - title: Slurm General Documentation
      folderitems:
      - title: Slurm Basic Commands
        url: /merlin6/slurm-basics.html
      - title: Running Batch Scripts
      - title: Running Slurm Batch Scripts
        url: /merlin6/running-jobs.html
      - title: Running Interactive Jobs
      - title: Running Slurm Interactive Jobs
        url: /merlin6/interactive-jobs.html
      - title: Slurm Examples
      - title: Slurm Batch Script Examples
        url: /merlin6/slurm-examples.html
      - title: Slurm Configuration
        url: /merlin6/slurm-configuration.html
      - title: Monitoring
      - title: Slurm Monitoring
        url: /merlin6/monitoring.html
    - title: Merlin6 CPU Slurm cluster
      folderitems:
      - title: Using Slurm - merlin6
        url: /merlin6/slurm-configuration.html
      - title: HW/SW description
        url: /merlin6/hardware-and-software.html
    - title: Merlin6 GPU Slurm cluster
      folderitems:
      - title: Using Slurm - gmerlin6
        url: /gmerlin6/slurm-configuration.html
      - title: HW/SW description
        url: /gmerlin6/hardware-and-software.html
    - title: Merlin5 CPU Slurm cluster
      folderitems:
      - title: Using Slurm - merlin5
        url: /merlin5/slurm-configuration.html
      - title: HW/SW description
        url: /merlin5/hardware-and-software.html
    - title: Jupyterhub
      folderitems:
      - title: Jupyterhub service
@@ -87,16 +101,8 @@ entries:
        url: /merlin6/ansys-mapdl.html
      - title: ParaView
        url: /merlin6/paraview.html
    - title: Announcements
      folderitems:
      - title: Downtimes
        url: /merlin6/downtimes.html
      - title: Past Downtimes
        url: /merlin6/past-downtimes.html
    - title: Support
      folderitems:
      - title: Migrating From Merlin5
        url: /merlin6/migrating.html
      - title: Known Problems
        url: /merlin6/known-problems.html
      - title: Troubleshooting
@@ -12,13 +12,23 @@ topnav:
topnav_dropdowns:
- title: Topnav dropdowns
  folders:
  - title: Merlin 6
  - title: Quick Start
    folderitems:
    - title: Introduction
      url: /merlin6/introduction.html
    - title: Contact
      url: /merlin6/contact.html
    - title: Using Merlin6
      url: /merlin6/use.html
    - title: User Guide
      url: /merlin6/user-guide.html
    - title: Requesting Accounts
      url: /merlin6/request-account.html
    - title: Requesting Projects
      url: /merlin6/request-project.html
    - title: Accessing the Interactive Nodes
      url: /merlin6/interactive.html
    - title: Accessing the Slurm Clusters
      url: /merlin6/slurm-access.html
  - title: Merlin Slurm Clusters
    folderitems:
    - title: Cluster 'merlin5'
      url: /merlin5/slurm-configuration.html
    - title: Cluster 'merlin6'
      url: /merlin6/slurm-configuration.html
    - title: Cluster 'gmerlin6'
      url: /gmerlin6/slurm-configuration.html
@@ -3,15 +3,14 @@ title: Introduction
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "Merlin 6 cluster overview"
#summary: "GPU Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/introduction.html
redirect_from:
  - /merlin6
  - /merlin6/index.html
permalink: /gmerlin6/cluster-introduction.html
---

## About Merlin6
## About Merlin6 GPU cluster

### Introduction

Merlin6 is the official PSI Local HPC cluster for development and
mission-critical applications, built in 2019. It replaces
@@ -21,10 +20,16 @@ Merlin6 is designed to be extensible, so it is technically possible to add
more compute nodes and cluster storage without a significant increase in
manpower and operational costs.

Merlin6 is mostly based on CPU resources, but also contains a small amount
of GPU-based resources which are mostly used by the BIO experiments.
Merlin6 is mostly based on **CPU** resources, but also contains a small amount
of **GPU**-based resources, which are mostly used by the BIO experiments.

---
### Slurm 'gmerlin6'

The **GPU nodes** have a dedicated **Slurm** cluster, called **`gmerlin6`**.

This cluster contains the same shared storage resources (`/data/user`, `/data/project`, `/shared-scratch`, `/afs`, `/psi/home`)
which are present in the other Merlin Slurm clusters (`merlin5`, `merlin6`). The Slurm `gmerlin6` cluster is maintained
independently to ease access for the users and to keep user accounting separate.

## Merlin6 Architecture
pages/gmerlin6/hardware-and-software-description.md (new file, 123 lines)
@@ -0,0 +1,123 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 19 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /gmerlin6/hardware-and-software.html
---

## Hardware

### GPU Computing Nodes

The Merlin6 GPU cluster was initially built from recycled workstations from different groups in the BIO division.
Since then it has been extended little by little with new nodes from sporadic investments by the same division; a large central investment was never possible.
As a result, the Merlin6 GPU computing cluster is a non-homogeneous setup, consisting of a wide variety of hardware types and components.

In 2018, for the common good, BIO decided to open the cluster to the Merlin users and make it widely accessible to PSI scientists.

The table below summarizes the hardware setup of the Merlin6 GPU computing nodes:
**Merlin6 GPU Computing Nodes**

| Node                 | Processor | Sockets | Cores | Threads | Scratch | Memory | GPU       |
|:--------------------:|:---------:|:-------:|:-----:|:-------:|:-------:|:------:|:---------:|
| **merlin-g-001**     | [Intel Core i7-5960X](https://ark.intel.com/content/www/us/en/ark/products/82930/intel-core-i7-5960x-processor-extreme-edition-20m-cache-up-to-3-50-ghz.html) | 1 | 16 | 2 | 1.8TB | 128GB | GTX1080   |
| **merlin-g-00[2-5]** | [Intel Xeon E5-2640](https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html)                      | 2 | 20 | 1 | 1.8TB | 128GB | GTX1080   |
| **merlin-g-006**     | [Intel Xeon E5-2640](https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html)                      | 2 | 20 | 1 | 800GB | 128GB | GTX1080Ti |
| **merlin-g-00[7-9]** | [Intel Xeon E5-2640](https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html)                      | 2 | 20 | 1 | 3.5TB | 128GB | GTX1080Ti |
| **merlin-g-01[0-3]** | [Intel Xeon Silver 4210R](https://ark.intel.com/content/www/us/en/ark/products/197098/intel-xeon-silver-4210r-processor-13-75m-cache-2-40-ghz.html)           | 2 | 20 | 1 | 1.7TB | 128GB | RTX2080Ti |
### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

The Merlin6 cluster connectivity is based on the [Infiniband FDR and EDR](https://en.wikipedia.org/wiki/InfiniBand) technologies.
This allows fast, very low latency access to the data as well as running extremely efficient MPI-based jobs.
The network speed (56Gbps for **FDR**, 100Gbps for **EDR**) of the different machines can be checked by running the following command on each node:

```bash
ibstat | grep Rate
```

## Software

On the Merlin6 GPU computing nodes we try to keep the software stack coherent with the main [Merlin6](/merlin6/index.html) cluster.

Hence, the Merlin6 GPU nodes run:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-4** or superior cards.
pages/gmerlin6/slurm-configuration.md (new file, 255 lines)
@@ -0,0 +1,255 @@
---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes a summary of the 'gmerlin6' Slurm configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type                | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :---------------------: | :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 4000        | 102400      | 102400       | 10000    | **geforce_gtx_1080**    | 1         | 2         |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 4000        | 102400      | 102400       | 10000    | **geforce_gtx_1080**    | 1         | 4         |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 4000        | 102400      | 102400       | 10000    | **geforce_gtx_1080_ti** | 1         | 4         |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 4000        | 102400      | 102400       | 10000    | **geforce_rtx_2080_ti** | 1         | 4         |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 4000        | 360448      | 360448       | 10000    | **geforce_rtx_2080_ti** | 1         | 8         |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900        | 998400      | 998400       | 10000    | **A100**                | 1         | 8         |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}

## Running jobs in the 'gmerlin6' cluster

In this chapter we cover the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it defaults to **`gpu`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

The table below shows all the partitions available to users:

| GPU Partition | Default Time | Max Time | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-----------------: | :--------------: |
| `gpu`         | 1 day        | 1 week   | 1                   | 1                |
| `gpu-short`   | 2 hours      | 2 hours  | 1000                | 500              |
| `gwendolen`   | 1 hour       | 12 hours | 1000                | 1000             |

\*The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect this decision). For the GPU
partitions, Slurm will also first attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

### Merlin6 GPU Accounts

Users should make sure that the public **`merlin`** account is specified. If no account option is given, it defaults to this account.
This is mostly relevant for users who belong to multiple Slurm accounts and might specify a different account by mistake.

```bash
#SBATCH --account=merlin  # Possible values: merlin, gwendolen_public, gwendolen
```
Not all the accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account        | Slurm Partitions      |
|:-------------------: | :------------------:  |
| **`merlin`**         | **`gpu`**,`gpu-short` |
| `gwendolen_public`   | `gwendolen`           |
| `gwendolen`          | `gwendolen`           |

By default, all users belong to the `merlin` and `gwendolen_public` Slurm accounts. `gwendolen` is a restricted account.

#### The 'gwendolen' accounts

To run jobs in the **`gwendolen`** partition, users must specify either the `gwendolen_public` or the `gwendolen` account.
The `merlin` account is not allowed to use the `gwendolen` partition.

* The **`gwendolen_public`** account can be used by any Merlin user, and provides restricted resource access to **`gwendolen`**.
* The **`gwendolen`** account is restricted to a set of users, and provides full access to **`gwendolen`**.
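
As an illustration, a minimal sketch of the `#SBATCH` header for a job sent to `gwendolen` through the public account (the values are taken from the tables above and may need to be adapted):

```bash
#SBATCH --cluster=gmerlin6          # gwendolen belongs to the gmerlin6 cluster
#SBATCH --partition=gwendolen       # partition reserved for the gwendolen accounts
#SBATCH --account=gwendolen_public  # public account with restricted resources
#SBATCH --gpus=A100:2               # gwendolen_public allows at most 2 GPUs per job
```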

### Slurm GPU specific options

Some options are available when using GPUs. These are detailed here.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need to use:

```bash
#SBATCH --gpus=[<type>:]<number>
```

The GPU type is optional: if left empty, Slurm will try to allocate any type of GPU.
The possible `[<type>:]` values and the `<number>` of GPUs depend on the node.
This is detailed in the table below.

| Nodes                   | GPU Type                  | #GPUs |
|:---------------------:  | :-----------------------: | :---: |
| **merlin-g-[001]**      | **`geforce_gtx_1080`**    | 2     |
| **merlin-g-[002-005]**  | **`geforce_gtx_1080`**    | 4     |
| **merlin-g-[006-009]**  | **`geforce_gtx_1080_ti`** | 4     |
| **merlin-g-[010-013]**  | **`geforce_rtx_2080_ti`** | 4     |
| **merlin-g-014**        | **`geforce_rtx_2080_ti`** | 8     |
| **merlin-g-100**        | **`A100`**                | 8     |
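
For example, a hedged sketch requesting two GTX 1080 Ti cards, or any two GPUs when the type is omitted:

```bash
#SBATCH --gpus=geforce_gtx_1080_ti:2   # two GPUs of a specific type
# or, letting Slurm pick any available GPU type:
#SBATCH --gpus=2
```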

#### Constraint / Features

Instead of specifying the GPU **type**, users may sometimes need to **select the GPU by the amount of memory available on the GPU card** itself.
This is defined in Slurm with **Features**: a tag which encodes the GPU memory of the different GPU cards.
Users can specify the required GPU memory size with the `--constraint` option. In this case, note that *in many cases
there is no need to specify `[<type>:]`* in the `--gpus` option.

```bash
#SBATCH --constraint=<Feature>  # Possible values: gpumem_8gb, gpumem_11gb, gpumem_40gb
```

The table below shows the available **Features** and which GPU card models and GPU nodes they belong to:

**Merlin6 GPU Computing Nodes**

| Nodes                   | GPU Type               | Feature           |
|:-----------------------:|:----------------------:|:-----------------:|
| **merlin-g-[001-005]**  | `geforce_gtx_1080`     | **`gpumem_8gb`**  |
| **merlin-g-[006-009]**  | `geforce_gtx_1080_ti`  | **`gpumem_11gb`** |
| **merlin-g-[010-014]**  | `geforce_rtx_2080_ti`  | **`gpumem_11gb`** |
| **merlin-g-100**        | `A100`                 | **`gpumem_40gb`** |
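
As a sketch, selecting GPUs by memory size rather than by type (values taken from the table above):

```bash
#SBATCH --gpus=2                   # no [<type>:] prefix needed here
#SBATCH --constraint=gpumem_11gb   # any GTX 1080 Ti or RTX 2080 Ti node
```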

#### Other GPU options

Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-gpu=<ntasks>
#SBATCH --mem-per-gpu=<size[units]>
#SBATCH --cpus-per-gpu=<ncpus>
#SBATCH --gpus-per-node=[<type>:]<number>
#SBATCH --gpus-per-socket=[<type>:]<number>
#SBATCH --gpus-per-task=[<type>:]<number>
#SBATCH --gpu-bind=[verbose,]<type>
```

Please note that if `[<type>:]` is defined in one option, then all other options must use it too!

#### Dealing with Hyper-Threading

The **`gmerlin6`** cluster contains the partition `gwendolen`, which has a node with Hyper-Threading enabled.
On that node, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will
generally use it (exceptions apply). For this machine, HT is generally recommended.

```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```
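
Putting the options from this chapter together, a minimal, hypothetical GPU batch script could look as follows; the job name, the module and executable names, and the resource values are placeholders, not official recommendations:

```bash
#!/bin/bash
#SBATCH --job-name=gpu-test           # placeholder job name
#SBATCH --cluster=gmerlin6            # mandatory: target the GPU cluster
#SBATCH --partition=gpu               # default GPU partition
#SBATCH --account=merlin              # default public account
#SBATCH --gpus=geforce_rtx_2080_ti:1  # one RTX 2080 Ti card
#SBATCH --cpus-per-gpu=4              # CPUs allocated per GPU
#SBATCH --mem-per-gpu=16G             # memory allocated per GPU
#SBATCH --time=01:00:00               # walltime limit

module load cuda        # hypothetical PModules package name
srun ./my_gpu_program   # placeholder executable
```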

## User and job limits

The GPU cluster enforces some basic user and job limits to ensure a fair usage of the cluster and that a single user cannot abuse the resources.
The limits are described below.

### Per job limits

These are limits that apply to a single job. In other words, there is a maximum amount of resources a single job can use.
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account      | Mon-Sun 0h-24h                               |
|:-------------:| :----------------: | :------------------------------------------: |
| **gpu**       | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gpu-short** | **`merlin`**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)         |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=32,gres/gpu=2,mem=200G) |
| **gwendolen** | `gwendolen`        | No limits, full access granted               |

* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) cannot use more than 40 CPUs, more than 8 GPUs or more than 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  As there is no other QoS that temporarily overrides the job limits during the week (as happens, for
  instance, in the CPU **daily** partition), such a job needs to be cancelled and resubmitted with
  resource requests adapted to the limits above.

* The **gwendolen** partition is a special partition with a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Public access is possible through the `gwendolen_public` account; however, it is limited to 2 GPUs, 32 CPUs and 121875MB of memory per job.
  For full access, the `gwendolen` account is needed, and this is restricted to a set of users.

### Per user limits for GPU partitions

These limits apply exclusively to users. In other words, there is a maximum amount of resources a single user can use.
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition     | Slurm Account      | Mon-Sun 0h-24h                                  |
|:-------------:| :----------------: | :---------------------------------------------: |
| **gpu**       | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gpu-short** | **`merlin`**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)           |
| **gwendolen** | `gwendolen_public` | gwendolen_public(cpu=64,gres/gpu=4,mem=243750M) |
| **gwendolen** | `gwendolen`        | No limits, full access granted                  |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user cannot use more than 80 CPUs, more than 16 GPUs or more than 400GB of memory.
  Jobs from a user who already exceeds these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs will wait in the queue until some of the running resources are freed.

* Notice that the user limits are wider than the job limits. In that way, a user can run, for example, up to two 8-GPU jobs, or up to four 4-GPU jobs, etc.
  Please avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing the same
  type of GPU.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
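
A hedged sketch of one way to do that: open a short interactive allocation on the gmerlin6 cluster and print the node-local files from within it (the partition, time limit and GPU count are only examples):

```bash
# Allocate one GPU on gmerlin6 for a few minutes and print the node-local GRES configuration
salloc --clusters=gmerlin6 --partition=gpu-short --gpus=1 --time=00:05:00 \
    srun cat /etc/slurm/gres.conf
```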
pages/merlin5/cluster-introduction.md (new file, 44 lines)
@@ -0,0 +1,44 @@
---
title: Cluster 'merlin5'
#tags:
#keywords:
last_updated: 07 April 2021
#summary: "Merlin 5 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin5/cluster-introduction.html
---

## Slurm 'merlin5' cluster

**Merlin5** was the old official PSI Local HPC cluster for development and
mission-critical applications, built in 2016-2017. It was an
extension of the Merlin4 cluster and built from existing hardware due
to a lack of central investment in Local HPC resources. **Merlin5** was
then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
with an important central investment of ~1.5M CHF. **Merlin5** was mostly
based on CPU resources, but also contained a small amount of GPU-based
resources which were mostly used by the BIO experiments.

**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
called **`merlin5`**. In that way, the old CPU computing nodes are still available as extra computation resources,
and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.

The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
cluster, which became the **main Local HPC Cluster**. Hence, **[Merlin6](/merlin6/index.html)**
contains the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) clusters (`merlin5`, `merlin6`, `gmerlin6`).

### Submitting jobs to 'merlin5'

Jobs for the **`merlin5`** Slurm cluster must be submitted from the **Merlin6** login nodes, by using
the option `--clusters=merlin5` with any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc.).
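
For instance, a minimal sketch (the script name is a placeholder):

```bash
# Submit a batch script from a Merlin6 login node to the merlin5 cluster
sbatch --clusters=merlin5 myjob.sh

# or, equivalently, inside the batch script itself:
#SBATCH --clusters=merlin5
```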

## The Merlin Architecture

### Multi Non-Federated Cluster Architecture Design: The Merlin cluster

The following image shows the Slurm architecture design for the Merlin cluster.
It contains a multi non-federated cluster setup, with a central Slurm database
and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):

![Merlin5 architecture]()
pages/merlin5/hardware-and-software-description.md (new file, 97 lines)
@@ -0,0 +1,97 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 09 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin5/hardware-and-software.html
---

## Hardware

### Computing Nodes

Merlin5 is built from recycled nodes, and hardware will be decommissioned as soon as it fails (due to expired warranty and the age of the cluster).
* Merlin5 is based on the [**HPE c7000 Enclosure**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04128339) solution, with 16 x [**HPE ProLiant BL460c Gen8**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04123239) nodes per chassis.
* Connectivity is based on Infiniband **ConnectX-3 QDR-40Gbps**:
  * 16 internal ports for intra-chassis communication
  * 2 connected external ports for inter-chassis communication and storage access.

The table below summarizes the hardware setup of the Merlin5 computing nodes:
**Merlin5 CPU Computing Nodes**

| Chassis | Node                 | Processor | Sockets | Cores | Threads | Scratch | Memory |
|:-------:|:--------------------:|:---------:|:-------:|:-----:|:-------:|:-------:|:------:|
| **#0**  | **merlin-c-[18-30]** | [Intel Xeon E5-2670](https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html) | 2 | 16 | 1 | 50GB | 64GB  |
| **#0**  | **merlin-c-[31,32]** | [Intel Xeon E5-2670](https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html) | 2 | 16 | 1 | 50GB | 128GB |
| **#1**  | **merlin-c-[33-45]** | [Intel Xeon E5-2670](https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html) | 2 | 16 | 1 | 50GB | 64GB  |
| **#1**  | **merlin-c-[46,47]** | [Intel Xeon E5-2670](https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html) | 2 | 16 | 1 | 50GB | 128GB |
### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

The Merlin5 cluster connectivity is based on the [Infiniband QDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This allows fast, very low latency access to the data as well as running extremely efficient MPI-based jobs.
However, this is an old version of Infiniband which requires older drivers, and the software cannot take advantage of the latest features.

## Software

On Merlin5 we try to keep the software stack coherent with the main cluster [Merlin6](/merlin6/index.html).

Hence, Merlin5 runs:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.4.9-2.2.4.0**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), which is an old version, but required because **ConnectX-3** support has been dropped in newer OFED versions.
pages/merlin5/slurm-configuration.md (new file, 142 lines)
@@ -0,0 +1,142 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document describes a summary of the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the Merlin5 cluster.

The Merlin5 cluster is an old cluster with old hardware, maintained on a best-effort basis to increase the CPU power of the Merlin cluster.

## Merlin5 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes            | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node | Max.Swap |
|:----------------:| ---------:| :--------:| :------: | :----------: | :-------:|
| merlin-c-[18-30] | 1 core    | 16 cores  | 1        | 60000        | 10000    |
| merlin-c-[31-32] | 1 core    | 16 cores  | 1        | 124000       | 10000    |
| merlin-c-[33-45] | 1 core    | 16 cores  | 1        | 60000        | 10000    |
| merlin-c-[46-47] | 1 core    | 16 cores  | 1        | 124000       | 10000    |

There is one *main difference between the Merlin5 and Merlin6 clusters*: Merlin5 keeps an old configuration which does not
consider memory as a *consumable resource*. Hence, users can *oversubscribe* memory. This might trigger some side effects, but
this legacy configuration has been kept to ensure that old jobs keep running in the same way they did a few years ago.
If you know that this might be a problem for you, please always use Merlin6 instead.

## Running jobs in the 'merlin5' cluster

In this chapter we cover the basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.

### Merlin5 CPU cluster

To run jobs in the **`merlin5`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin5
```

### Merlin5 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it defaults to **`merlin`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```

The table below shows all the partitions available to users:

| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>merlin</u>**  | 5 days       | 1 week   | All nodes | 500                 | 1                |
| **merlin-long**    | 5 days       | 21 days  | 4         | 1                   | 1                |

**\***The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect this decision). For the GPU
partitions, Slurm will also first attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

The **`merlin-long`** partition **is limited to 4 nodes**, as it might contain jobs running for up to 21 days.

### Merlin5 CPU Accounts

Users should make sure that the public **`merlin`** account is specified. If no account option is given, it defaults to this account.
This is mostly relevant for users who belong to multiple Slurm accounts and might specify a different account by mistake.

```bash
#SBATCH --account=merlin  # Possible values: merlin
```

### Slurm CPU specific options

Some options are available when using CPUs. These are detailed here.

Alternative Slurm options for CPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for the 'srun' command
```

Notice that in **Merlin5** no hyper-threading is available (while in **Merlin6** it is).
Hence, in **Merlin5** there is no need to specify `--hint` hyper-threading related options.
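
As an illustration, a minimal, hypothetical merlin5 batch script combining the options above; the module and executable names and the resource values are placeholders:

```bash
#!/bin/bash
#SBATCH --clusters=merlin5     # mandatory: target the merlin5 cluster
#SBATCH --partition=merlin     # default CPU partition
#SBATCH --account=merlin       # public account
#SBATCH --ntasks=32            # 32 MPI tasks in total
#SBATCH --ntasks-per-node=16   # fill whole 16-core nodes (no hyper-threading on Merlin5)
#SBATCH --time=12:00:00        # walltime limit

module load openmpi     # hypothetical PModules package name
srun ./my_mpi_program   # placeholder executable
```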

## User and job limits

In the CPU cluster we enforce some limits which apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to
avoid a single user or job abusing the resources. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user while many nodes are idle due to low overall activity is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources from the batch system would drain the entire cluster to fit the job, which is undesirable).

Hence, there is a need to set up sensible limits that ensure a fair usage of the resources, optimizing the overall efficiency
of the cluster while allowing jobs of different natures and sizes (that is, **single core** vs **parallel jobs** of different sizes) to run.

In the **`merlin5`** cluster, as not many users are running on it, these limits are wider than the ones set in the **`merlin6`** and **`gmerlin6`** clusters.

### Per job limits

These are limits which apply to a single job. In other words, there is a maximum amount of resources a single job can use. These limits are described in the table below,
with the format `SlurmQoS(limits)` (`SlurmQoS` values can be listed with the `sacctmgr show qos` command):

| Partition        | Mon-Sun 0h-24h   | Other limits |
|:---------------: | :--------------: | :----------: |
| **merlin**       | merlin5(cpu=384) | None         |
| **merlin-long**  | merlin5(cpu=384) | Max. 4 nodes |

By default, through the QoS limits, a job cannot use more than 384 cores (max CPU per job).
However, for `merlin-long` this is even more restricted: there is an extra limit of 4 dedicated nodes for this partition. This is defined
at the partition level, and will override any QoS limit as long as it is more restrictive.

### Per user limits for CPU partitions

No user limits apply by QoS. For the **`merlin`** partition, a single user could fill the whole batch system with jobs (however, the restriction is on the job size, as explained above). For the **`merlin-long`** partition, the 4-node limitation still applies.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
@@ -1,111 +0,0 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/hardware-and-software.html
---

# Hardware And Software Description
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Computing Nodes

The new Merlin6 cluster is a homogeneous solution based on *three* HP Apollo k6000 systems. Each HP Apollo k6000 chassis contains 22 HP XL230k Gen10 blades. However,
each chassis can hold up to 24 blades, so it is possible to upgrade with up to 2 additional nodes per chassis.

Each HP XL230k Gen10 blade can contain up to two processors of the latest Intel® Xeon® Scalable Processor family. The hardware and software configuration is the following:
* 3 x HP Apollo k6000 chassis systems, each one:
  * 22 x [HP Apollo XL230K Gen10](https://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=a00016634enw), each one:
    * 2 x *22 core* [Intel® Xeon® Gold 6152 Scalable Processor](https://ark.intel.com/products/120491/Intel-Xeon-Gold-6152-Processor-30-25M-Cache-2-10-GHz-) (2.10-3.70GHz).
    * 12 x 32 GB (384 GB in total) of DDR4 memory clocked at 2666 MHz.
    * Dual Port InfiniBand ConnectX-5 EDR-100Gbps (low latency network); one active port per chassis.
    * 1 x 1.6TB NVMe SSD Disk
      * ~300GB reserved for the O.S.
      * ~1.2TB reserved for local fast scratch ``/scratch``.
    * Software:
      * RedHat Enterprise Linux 7.6
      * [Slurm](https://slurm.schedmd.com/) v18.08
      * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2
  * 1 x [HPE Apollo InfiniBand EDR 36-port Unmanaged Switch](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00016643enw)
    * 24 internal EDR-100Gbps ports (1 port per blade for internal low latency connectivity)
    * 12 external EDR-100Gbps ports (for external low latency connectivity)
---

## Login Nodes

### merlin-l-0[1,2]

Two login nodes are inherited from the previous Merlin5 cluster: ``merlin-l-01.psi.ch``, ``merlin-l-02.psi.ch``. The hardware and software configuration is the following:

* 2 x HP DL380 Gen9, each one:
  * 2 x *16 core* [Intel® Xeon® Processor E5-2697AV4 Family](https://ark.intel.com/products/91768/Intel-Xeon-Processor-E5-2697A-v4-40M-Cache-2-60-GHz-) (2.60-3.60GHz)
    * Hyper-Threading disabled
  * 16 x 32 GB (512 GB in total) of DDR4 memory clocked at 2400 MHz.
  * Dual Port Infiniband ConnectIB FDR-56Gbps (low latency network).
  * Software:
    * RedHat Enterprise Linux 7.6
    * [Slurm](https://slurm.schedmd.com/) v18.08
    * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2

### merlin-l-00[1,2]

Two new login nodes are available in the new cluster: ``merlin-l-001.psi.ch``, ``merlin-l-002.psi.ch``. The hardware and software configuration is the following:

* 2 x HP DL380 Gen10, each one:
  * 2 x *22 core* [Intel® Xeon® Gold 6152 Scalable Processor](https://ark.intel.com/products/120491/Intel-Xeon-Gold-6152-Processor-30-25M-Cache-2-10-GHz-) (2.10-3.70GHz).
    * Hyper-threading enabled.
  * 24 x 16GB (384 GB in total) of DDR4 memory clocked at 2666 MHz.
  * Dual Port Infiniband ConnectX-5 EDR-100Gbps (low latency network).
  * Software:
    * [NoMachine Terminal Server](https://www.nomachine.com/)
      * Currently only on: ``merlin-l-001.psi.ch``.
    * RedHat Enterprise Linux 7.6
    * [Slurm](https://slurm.schedmd.com/) v18.08
    * [GPFS](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html) v5.0.2 (merlin-l-001), v5.0.3 (merlin-l-002)

---

## Storage

The storage node is based on the [Lenovo Distributed Storage Solution for IBM Spectrum Scale](https://lenovopress.com/lp0626-lenovo-distributed-storage-solution-for-ibm-spectrum-scale-x3650-m5).
The solution is equipped with 334 x 10TB disks providing a usable capacity of 2.316 PiB (2.608PB). The overall solution can provide a maximum read performance of 20GB/s.
* 1 x Lenovo DSS G240, composed of:
  * 2 x ThinkSystem SR650, each one:
    * 2 x Dual Port Infiniband ConnectX-5 EDR-100Gbps (low latency network).
    * 2 x Dual Port Infiniband ConnectX-4 EDR-100Gbps (low latency network).
    * 1 x ThinkSystem RAID 930-8i 2GB Flash PCIe 12Gb Adapter
  * 1 x ThinkSystem SR630
    * 1 x Dual Port Infiniband ConnectX-5 EDR-100Gbps (low latency network).
    * 1 x Dual Port Infiniband ConnectX-4 EDR-100Gbps (low latency network).
  * 4 x Lenovo Storage D3284 High Density Expansion Enclosure, each one:
    * Holds 84 x 3.5" hot-swap drive bays in two drawers. Each drawer has three rows of drives, and each row has 14 drives.
    * Each drive bay contains a 10TB Helium 7.2K NL-SAS HDD.
* 2 x Mellanox SB7800 InfiniBand 1U Switch for High Availability and fast access to the storage with very low latency. Each one:
  * 36 EDR-100Gbps ports

---

## Network

Merlin6 cluster connectivity is based on the [Infiniband](https://en.wikipedia.org/wiki/InfiniBand) technology. This allows fast access with very low latencies to the data as well as running
extremely efficient MPI-based jobs:
* Connectivity amongst computing nodes on different chassis ensures up to 1200Gbps of aggregated bandwidth.
* Intra-chassis connectivity (communication amongst computing nodes in the same chassis) ensures up to 2400Gbps of aggregated bandwidth.
* Communication to the storage ensures up to 800Gbps of aggregated bandwidth.

The Merlin6 cluster currently contains 5 Infiniband managed switches and 3 Infiniband unmanaged switches (one per HP Apollo chassis):
* 1 x MSX6710 (FDR) for connecting old GPU nodes, old login nodes and the MeG cluster to the Merlin6 cluster (and storage). No High Availability mode possible.
* 2 x MSB7800 (EDR) for connecting Login Nodes, Storage and other nodes in High Availability mode.
* 3 x HP EDR Unmanaged switches, each one embedded in each HP Apollo k6000 chassis solution.
* 2 x MSB7700 (EDR) are the top switches, interconnecting the Apollo unmanaged switches and the managed switches (MSX6710, MSB7800).
@ -2,31 +2,13 @@
title: Accessing Interactive Nodes
#tags:
#keywords:
last_updated: 13 June 2019
last_updated: 20 May 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/interactive.html
---

## Login nodes description

The Merlin6 login nodes are the official machines for accessing the resources of Merlin6.
From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software.

The Merlin6 login nodes are the following:

| Hostname            | SSH | NoMachine | #cores | #Threads | CPU                   | Memory | Scratch    | Scratch Mountpoint |
| ------------------- | --- | --------- | ------ |:--------:| :-------------------- | ------ | ---------- | :----------------- |
| merlin-l-001.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6152  | 384GB  | 1.8TB NVMe | ``/scratch``       |
| merlin-l-002.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6142  | 384GB  | 1.8TB NVMe | ``/scratch``       |
| merlin-l-01.psi.ch  | yes | -         | 2 x 16 | 2        | Intel Xeon E5-2697Av4 | 512GB  | 100GB SAS  | ``/scratch``       |

---

## Remote Access

### SSH Access
## SSH Access

For interactive command shell access, use an SSH client. We recommend activating SSH's X11 forwarding so that you can use graphical
applications (e.g. a text editor; for more performant graphical access, refer to the sections below). X applications are supported
@ -38,26 +20,37 @@ in the login nodes and X11 forwarding can be used for those users who have prope

* PSI desktop configuration issues must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* The ticket will be redirected to the corresponding Desktop support group (Windows, Linux).

#### Accessing from a Linux client
### Accessing from a Linux client

Refer to [{Accessing Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html) for **Linux** SSH client and X11 configuration.
Refer to [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html) for **Linux** SSH client and X11 configuration.

#### Accessing from a Windows client
### Accessing from a Windows client

Refer to [{Accessing Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html) for **Windows** SSH client and X11 configuration.
Refer to [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html) for **Windows** SSH client and X11 configuration.

#### Accessing from a MacOS client
### Accessing from a MacOS client

Refer to [{Accessing Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html) for **MacOS** SSH client and X11 configuration.
Refer to [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html) for **MacOS** SSH client and X11 configuration.

### Graphical access using **NoMachine** client
## NoMachine Remote Desktop Access

X applications are supported in the login nodes and can run efficiently through a **NoMachine** client. This is the officially supported way to run more demanding X applications on Merlin6. The client software can be downloaded from [the Nomachine Website](https://www.nomachine.com/product&p=NoMachine%20Enterprise%20Client).
X applications are supported in the login nodes and can run efficiently through a **NoMachine** client. This is the officially supported way to run more demanding X applications on Merlin6.
* For PSI Windows workstations, this can be installed from the Software Kiosk as 'NX Client'. If you have difficulties installing it, please request support through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* For other workstations, the client software can be downloaded from the [Nomachine Website](https://www.nomachine.com/product&p=NoMachine%20Enterprise%20Client).

* Install the NoMachine client locally. For PSI Windows machines, this can be installed from the Software Kiosk as 'NX Client'. If you have difficulties installing it, please request support through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* Configure a new connection in NoMachine to either `merlin-l-001.psi.ch` or `merlin-l-002.psi.ch`. The 'NX' protocol is recommended. Login nodes are available from the PSI network or through VPN.
* You can also connect via the Photon Science Division's `rem-acc.psi.ch` jump point. After connecting you will be presented with options to jump to the Merlin login nodes. This can be accessed remotely without VPN.
* NoMachine *client configuration* and *connectivity* for Merlin6 is fully supported by the Merlin6 administrators.
* Please contact us through the official channels for any configuration issue with NoMachine.
### Configuring NoMachine

---
Refer to [{How To Use Merlin -> Remote Desktop Access}](/merlin6/nomachine.html) for further instructions on how to configure the NoMachine client and how to access it from PSI and from outside PSI.

## Login nodes hardware description

The Merlin6 login nodes are the official machines for accessing the resources of Merlin6.
From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software.

The Merlin6 login nodes are the following:

| Hostname            | SSH | NoMachine | #cores | #Threads | CPU                   | Memory | Scratch    | Scratch Mountpoint |
| ------------------- | --- | --------- | ------ |:--------:| :-------------------- | ------ | ---------- | :----------------- |
| merlin-l-001.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6152  | 384GB  | 1.8TB NVMe | ``/scratch``       |
| merlin-l-002.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6142  | 384GB  | 1.8TB NVMe | ``/scratch``       |
| merlin-l-01.psi.ch  | yes | -         | 2 x 16 | 2        | Intel Xeon E5-2697Av4 | 512GB  | 100GB SAS  | ``/scratch``       |

pages/merlin6/01-Quick-Start-Guide/accessing-slurm.md
@ -0,0 +1,53 @@
---
title: Accessing Slurm Cluster
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-access.html
---

## The Merlin Slurm clusters

Merlin uses a multi-cluster setup, where multiple Slurm clusters coexist under the same umbrella.
It contains the following clusters (a quick way to inspect them is sketched after the list):

* The **Merlin6 Slurm CPU cluster**, called [**`merlin6`**](/merlin6/slurm-access.html#merlin6-cpu-cluster-access).
* The **Merlin6 Slurm GPU cluster**, called [**`gmerlin6`**](/merlin6/slurm-access.html#merlin6-gpu-cluster-access).
* The *old Merlin5 Slurm CPU cluster*, called [**`merlin5`**](/merlin6/slurm-access.html#merlin5-cpu-cluster-access), still supported on a best-effort basis.
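
A quick way to get an overview of this multi-cluster setup from a login node is the following minimal, illustrative sketch using standard Slurm commands (the exact output depends on the current configuration):

```bash
# List the partitions of each Slurm cluster
sinfo --clusters=merlin6,gmerlin6,merlin5

# Show the clusters registered in the Slurm multi-cluster setup
sacctmgr show clusters format=Cluster
```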

## Accessing the Slurm clusters

Any job submission must be performed from a **Merlin login node**. Please refer to the [**Accessing the Interactive Nodes documentation**](/merlin6/interactive.html)
for further information about how to access the cluster.

In addition, any job *must be submitted from a high performance storage area visible by the login nodes and by the computing nodes*. The possible storage areas are the following:
* `/data/user`
* `/data/project`
* `/shared-scratch`

Please avoid using `/psi/home` directories for submitting jobs.
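
For example, a typical submission from a user data directory could look like the following sketch (the `mysim` directory and `mysim.batch` script are placeholders, not existing files):

```bash
# Submit from a high performance storage area instead of /psi/home
cd /data/user/$USER/mysim
sbatch mysim.batch
```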

### Merlin6 CPU cluster access

The **Merlin6 CPU cluster** (**`merlin6`**) is the default cluster configured on the login nodes. Any job submission will use this cluster by default, unless
the option `--cluster` is specified with another of the existing clusters.
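
A minimal sketch (using `myScript.batch` as a placeholder job script):

```bash
# Both submissions go to the default 'merlin6' CPU cluster
sbatch myScript.batch
sbatch --clusters=merlin6 myScript.batch
```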

For further information about how to use this cluster, please visit the [**Merlin6 CPU Slurm Cluster documentation**](/merlin6/slurm-configuration.html).

### Merlin6 GPU cluster access

The **Merlin6 GPU cluster** (**`gmerlin6`**) is visible from the login nodes. However, to submit jobs to this cluster, one needs to specify the option `--cluster=gmerlin6` when submitting a job or allocation.
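
For example (again with `myScript.batch` as a placeholder):

```bash
# Submit to the Merlin6 GPU cluster instead of the default CPU cluster
sbatch --clusters=gmerlin6 myScript.batch
srun --clusters=gmerlin6 hostname
```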

For further information about how to use this cluster, please visit the [**Merlin6 GPU Slurm Cluster documentation**](/gmerlin6/slurm-configuration.html).

### Merlin5 CPU cluster access

The **Merlin5 CPU cluster** (**`merlin5`**) is visible from the login nodes. However, to submit jobs
to this cluster, one needs to specify the option `--cluster=merlin5` when submitting a job or allocation.
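
For example (a minimal sketch; the `merlin` partition name is the one used in the old Merlin5 documentation):

```bash
# Submit to the old Merlin5 cluster (best-effort support only)
sbatch --clusters=merlin5 --partition=merlin myScript.batch
srun --clusters=merlin5 --partition=merlin hostname
```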

Using this cluster is in general not recommended; however, it is still available for old users needing
extra computational resources or longer jobs. Keep in mind that this cluster is only supported on a
**best-effort basis**, and it contains very old hardware and configurations.

For further information about how to use this cluster, please visit the [**Merlin5 CPU Slurm Cluster documentation**](/merlin5/slurm-configuration.html).

pages/merlin6/01-Quick-Start-Guide/introduction.md
@ -0,0 +1,65 @@
---
title: Introduction
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/introduction.html
redirect_from:
  - /merlin6
  - /merlin6/index.html
---

## The Merlin local HPC cluster

Historically, the local HPC clusters at PSI have been named **Merlin**. Over the years,
multiple generations of Merlin have been deployed.

At present, the **Merlin local HPC cluster** consists of _two_ generations: the old **Merlin5** cluster and the newest **Merlin6**.

### Merlin6

Merlin6 is the official PSI local HPC cluster for development and
mission-critical applications. It was built in 2019 and replaces
the Merlin5 cluster.

Merlin6 is designed to be extensible, so it is technically possible to add
more compute nodes and cluster storage without a significant increase in
manpower and operational costs.

Merlin6 contains all the main services needed for running the cluster, including
**login nodes**, **storage**, **computing nodes** and other *subservices*,
connected to the central PSI IT infrastructure.

#### CPU and GPU Slurm clusters

The Merlin6 **computing nodes** are mostly based on **CPU** resources. However,
the cluster also contains a small amount of **GPU**-based resources, which are mostly used
by the BIO Division and by Deep Learning projects.

These computational resources are split into **two** different **[Slurm](https://slurm.schedmd.com/overview.html)** clusters:
* The Merlin6 CPU nodes are in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster called [**`merlin6`**](/merlin6/slurm-configuration.html).
  * This is the **default Slurm cluster** configured in the login nodes: any job submitted without the option `--cluster` will be submitted to this cluster.
* The Merlin6 GPU resources are in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster called [**`gmerlin6`**](/gmerlin6/slurm-configuration.html).
  * Users submitting to the **`gmerlin6`** GPU cluster need to specify the option ``--cluster=gmerlin6``.

### Merlin5

The old Slurm **CPU** *merlin* cluster is still active and is maintained on a best-effort basis.

**Merlin5** only contains **computing node** resources in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster.
* The Merlin5 CPU cluster is called [**merlin5**](/merlin5/slurm-configuration.html).

## Merlin Architecture

The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters:

*(image: Merlin5 & Merlin6 Slurm architecture)*

### Merlin6 Architecture Diagram

The following image shows the Merlin6 cluster architecture diagram:

*(image: Merlin6 architecture diagram)*
@ -1,55 +0,0 @@
---
title: Accessing Slurm Cluster
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-access.html
---

## The Merlin6 Slurm batch system

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.

Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.
* Two different Slurm clusters exist: **merlin5** and **merlin6**.
  * **merlin5** is a cluster with very old hardware (out-of-warranty).
  * **merlin5** will exist as long as hardware incidents are soft and easy to repair/fix (i.e. hard disk replacement).
  * **merlin6** is the default cluster when running Slurm commands (i.e. `sinfo`).

Please follow the section **Merlin6 Slurm** for more details about configuration and job submission.

### Merlin5 Access

Keeping the **merlin5** cluster allows running jobs on the old computing nodes until users have fully migrated their codes to the new cluster.

From July 2019, **merlin6** becomes the **default cluster**. However, users can keep submitting to the old **merlin5** computing nodes by using
the option ``--cluster=merlin5`` and the corresponding Slurm partition with ``--partition=merlin``. For example:

```bash
#SBATCH --clusters=merlin5
```

Example of how to run a simple command:

```bash
srun --clusters=merlin5 --partition=merlin hostname
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```

### Merlin6 Access

In order to run jobs on the **Merlin6** cluster, you need to specify the following option in your batch scripts:

```bash
#SBATCH --clusters=merlin6
```

Example of how to run a simple command:

```bash
srun --clusters=merlin6 hostname
sbatch --clusters=merlin6 myScript.batch
```
@ -28,7 +28,7 @@ Official X11 Forwarding support is through NoMachine. Please follow the document
we provide a small recipe for enabling X11 Forwarding in Linux.

* For enabling client X11 forwarding, add the following to the start of ``~/.ssh/config``
to implicitly add ``-Y`` to all ssh connections:
to implicitly add ``-X`` to all ssh connections:

```bash
ForwardAgent yes
@ -38,9 +38,9 @@ to implicitly add ``-Y`` to all ssh connections:
* Alternatively, you can add the option ``-Y`` to the ``ssh`` command. For example:

```bash
ssh -Y $username@merlin-l-01.psi.ch
ssh -Y $username@merlin-l-001.psi.ch
ssh -Y $username@merlin-l-002.psi.ch
ssh -X $username@merlin-l-01.psi.ch
ssh -X $username@merlin-l-001.psi.ch
ssh -X $username@merlin-l-002.psi.ch
```

* For testing that X11 forwarding works, just run ``xclock``. An X11 based clock should
@ -22,13 +22,23 @@ ssh $username@merlin-l-002.psi.ch

## SSH with X11 Forwarding

Official X11 Forwarding support is through NoMachine. Please follow the document
### Requirements

For running SSH with X11 Forwarding on MacOS, one needs to have an X server running on MacOS.
The official X server for MacOS is **[XQuartz](https://www.xquartz.org/)**. Please ensure
it is running before starting an SSH connection with X11 forwarding.

### SSH with X11 Forwarding in MacOS

Official X11 support is through NoMachine. Please follow the document
[{Job Submission -> Interactive Jobs}](/merlin6/interactive-jobs.html#Requirements) and
[{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for more details. However,
we provide a small recipe for enabling X11 Forwarding in MacOS.

* Ensure that **[XQuartz](https://www.xquartz.org/)** is installed and running on your MacOS.

* For enabling client X11 forwarding, add the following to the start of ``~/.ssh/config``
to implicitly add ``-Y`` to all ssh connections:
to implicitly add ``-X`` to all ssh connections:

```bash
ForwardAgent yes
@ -38,9 +48,9 @@ to implicitly add ``-Y`` to all ssh connections:
* Alternatively, you can add the option ``-Y`` to the ``ssh`` command. For example:

```bash
ssh -Y $username@merlin-l-01.psi.ch
ssh -Y $username@merlin-l-001.psi.ch
ssh -Y $username@merlin-l-002.psi.ch
ssh -X $username@merlin-l-01.psi.ch
ssh -X $username@merlin-l-001.psi.ch
ssh -X $username@merlin-l-002.psi.ch
```

* For testing that X11 forwarding works, just run ``xclock``. An X11 based clock should

pages/merlin6/02-How-To-Use-Merlin/using-modules.md
@ -0,0 +1,152 @@
---
title: Using PModules
#tags:
#keywords:
last_updated: 21 May 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/using-modules.html
---

## Environment Modules

On top of the operating system stack we provide different software using the PSI-developed PModules system.

PModules is the officially supported way of providing software, and each package is deployed by a specific expert. Software
that is used by many people will usually be found in PModules.

If you miss any package/version, or a software feature is missing, contact us. We will check whether it is feasible to install it.

## Module release stages

Three different **release stages** are available in PModules, ensuring proper software life cycling. These are the following: **`unstable`**, **`stable`** and **`deprecated`**.

### Unstable release stage

The **`unstable`** release stage contains *unstable* releases of software. Software compilations here are usually under development or not fully production ready.

This release stage is **not directly visible** to end users, and needs to be explicitly invoked as follows:

```bash
module use unstable
```

Once software is validated and considered production ready, it is moved to the `stable` release stage.

### Stable release stage

The **`stable`** release stage contains *stable* releases of software, which have been thoroughly tested and are fully supported.

This is the ***default*** release stage, and it is visible by default. Whenever possible, users are strongly advised to use packages from this release stage.

### Deprecated release stage

The **`deprecated`** release stage contains *deprecated* releases of software. Software in this release stage is usually deprecated or discontinued by its developers.
Also, minor versions or redundant compilations are moved here as long as there is a valid copy in the *stable* repository.

This release stage is **not directly visible** to users, and needs to be explicitly invoked as follows:

```bash
module use deprecated
```

However, software moved to this release stage can be directly loaded without the need of invoking it. This ensures proper life cycling of the software and makes it transparent for end users.

## PModules commands

Below is a summary of all available commands:

```bash
module use                       # show all available PModule Software Groups as well as Release Stages
module avail                     # to see the list of available software packages provided via pmodules
module use unstable              # to get access to a set of packages not fully tested by the community
module load <package>/<version>  # to load specific software package with a specific version
module search <string>           # to search for a specific software package and its dependencies.
module list                      # to list which software is loaded in your environment
module purge                     # unload all loaded packages and cleanup the environment
```

### module use/unuse

Without any parameter, `use` **lists** all available PModule **Software Groups and Release Stages**.

```bash
module use
```

When followed by a parameter, `use`/`unuse` invokes/uninvokes a PModule **Software Group** or **Release Stage**.

```bash
module use EM         # Invokes the 'EM' software group
module unuse EM       # Uninvokes the 'EM' software group
module use unstable   # Invokes the 'unstable' release stage
module unuse unstable # Uninvokes the 'unstable' release stage
```

### module avail

This option **lists** all available PModule **Software Groups and their packages**.

Please run `module avail --help` for further listing options.
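
For example, after invoking a software group (the `EM` group used above is just an example), `module avail` also lists the packages of that group (illustrative sketch):

```bash
module use EM    # invoke the 'EM' software group
module avail     # now also lists the packages provided by the invoked group
```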

### module search

This is used to **search** for **software packages**. By default, if no **Release Stage** or **Software Group** is specified
in the options of the `module search` command, it will search within the already invoked *Software Groups* and *Release Stages*.
Direct package dependencies will also be shown.

```bash
(base) [caubet_m@merlin-l-001 caubet_m]$ module search openmpi/4.0.5_slurm

Module              Release    Group      Requires
---------------------------------------------------------------------------
openmpi/4.0.5_slurm stable     Compiler   gcc/8.4.0
openmpi/4.0.5_slurm stable     Compiler   gcc/9.2.0
openmpi/4.0.5_slurm stable     Compiler   gcc/9.3.0
openmpi/4.0.5_slurm stable     Compiler   intel/20.4

(base) [caubet_m@merlin-l-001 caubet_m]$ module load intel/20.4 openmpi/4.0.5_slurm
```

Please run `module search --help` for further search options.

### module load/unload

This loads/unloads specific software packages. Packages might have direct dependencies that need to be loaded first. Other dependencies
will be loaded automatically.

In the example below, the ``openmpi/4.0.5_slurm`` package will be loaded, but ``gcc/9.3.0`` must be loaded as well, as it is a strict dependency. Direct dependencies must be loaded in advance. Users can load multiple packages one by one or all at once. This can be useful, for instance, when loading a package with direct dependencies.

```bash
# Single line
module load gcc/9.3.0 openmpi/4.0.5_slurm

# Multiple lines
module load gcc/9.3.0
module load openmpi/4.0.5_slurm
```

#### module purge

This command is an alternative to `module unload`, which can be used to unload **all** loaded module files.

```bash
module purge
```

## When to request new PModules packages

### Missing software

If you do not find a specific software package and you know that other people are interested in it, it can be installed in PModules. Please contact us
and we will try to help with that. Deploying new software in PModules may take a few days.

Usually the installation of new software is possible as long as a few users will use it. If you are interested in maintaining this software,
please let us know.

### Missing version

If the existing PModules versions for a specific package do not fit your needs, it is possible to ask for a new version.

Usually the installation of newer versions will be supported, as long as a few users will use it. The installation of intermediate versions can
be supported if strictly justified.
@ -1,341 +0,0 @@
---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---

## The rules

Before starting to use the cluster, please read the following rules:

1. Always try to **estimate** and **define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * This will ease *scheduling* and *backfilling*.
   * Slurm will schedule the queued jobs efficiently.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs for running within **one day**. Please consider the following:
   * Some software can simply scale up by using more nodes while drastically reducing the run time.
   * Some software allows saving a specific state, and a second job can start from that state.
     * ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
   * Use the **'daily'** partition when you are sure that you can run within one day:
     * ***'daily'*** **will give you more priority than running in the** ***'general'*** **queue!**
3. It is **forbidden** to run **very short jobs**:
   * Running jobs of a few seconds can cause severe problems.
   * Running very short jobs causes a lot of overhead.
   * ***Question:*** Is my job a very short job?
     * ***Answer:*** If it lasts a few seconds or very few minutes, yes.
   * ***Question:*** How long should my job run?
     * ***Answer:*** As a *rule of thumb*, from 5 minutes it starts being OK, from 15 minutes it is preferred.
   * Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
   * For short runs lasting less than 1 hour, please use the **hourly** partition.
     * ***'hourly'*** **will give you more priority than running in the** ***'daily'*** **queue!**
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.

{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing the jobs to <i>run within one day</i> will contribute to a fair and efficient usage of the system.
{{site.data.alerts.end}}

## Basic commands for running batch scripts

* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking job status.
* Use **``scancel``** for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use Linux <b>'man'</b> pages when needed (i.e. <span style="color:orange;">'man sbatch'</span>), mostly for checking the available options for the above commands.
{{site.data.alerts.end}}
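
As a minimal illustration of these commands (the script name `myjob.batch` and the job ID are placeholders):

```bash
sbatch myjob.batch   # submit the batch script; Slurm prints the assigned job ID
squeue -u $USER      # check the status of your pending and running jobs
scancel <jobid>      # cancel a job, using the job ID reported by sbatch/squeue
```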

## Basic settings

For a complete list of available options and parameters it is recommended to use the **man pages** (i.e. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, ``--exclusive`` behaviour in ``sbatch`` differs from ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (i.e. `man sbatch`, `man salloc`, `man srun`) for more
information about all possible options. Also, do not hesitate to contact us with any questions.

* **Clusters:** For running jobs on the Merlin6 CPU and GPU nodes, users should add the following option:

```bash
#SBATCH --clusters=merlin6
```

Users with proper access can also use the `merlin5` cluster.

* **Partitions:** except when using the *default* partition, one needs to specify the partition:
  * GPU partitions: ``gpu``, ``gpu-short`` (more details: **[Slurm GPU Partitions](/merlin6/slurm-configuration.html#gpu-partitions)**)
  * CPU partitions: ``general`` (**default** if no partition is specified), ``daily`` and ``hourly`` (more details: **[Slurm CPU Partitions](/merlin6/slurm-configuration.html#cpu-partitions)**)

Partition can be set as follows:

```bash
#SBATCH --partition=<partition_name>  # Partition to use. 'general' is the 'default'
```

* **[Optional] Disabling shared nodes**: by default, nodes can be shared by jobs from multiple users, while ensuring that CPU/Memory/GPU resources are dedicated.
One can request exclusive usage of a node (or set of nodes) with the following option:

```bash
#SBATCH --exclusive  # Only if you want a dedicated node
```

* **Time**: it is important to define how long a job should run, according to reality. This will help Slurm when *scheduling* and *backfilling*, by managing job queues in a more efficient
way. This value can never exceed the `MaxTime` of the affected partition. Please review the partition information (`scontrol show partition <partition_name>` or [GPU Partition Configuration](/merlin6/slurm-configuration.html#gpu-partitions)) for
the `DefaultTime` and `MaxTime` values.

```bash
#SBATCH --time=<D-HH:MM:SS>  # Time the job needs to run. Can not exceed the partition `MaxTime`
```

* **Output and error files**: by default, Slurm will generate standard output and error files in the directory from where you submit the batch script:
  * standard output will be written into a file ``slurm-$SLURM_JOB_ID.out``.
  * standard error will be written into a file ``slurm-$SLURM_JOB_ID.err``.

If you want to change the default names, it can be done with the options ``--output`` and ``--error``. For example:

```bash
#SBATCH --output=logs/myJob.%N.%j.out  # Generate an output file per hostname and jobid
#SBATCH --error=logs/myJob.%N.%j.err   # Generate an error file per hostname and jobid
```

Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the specification of **filename patterns**.

* **Multithreading/No-Multithreading:** Whether or not a node has multithreading depends on the node configuration. By default, HT nodes have HT enabled, but one can ensure this feature with the option `--hint` as follows:

```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```

Consider that, sometimes, depending on your job requirements, you might also need to set up how many `--ntasks-per-core` or `--cpus-per-task` (or other options) are needed, in addition to the `--hint` option. Please contact us in case of doubts.
{{site.data.alerts.tip}} In general, <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a mandatory field. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one needs to define how a task should be handled within a core, and this setting will not generally be used on hybrid MPI/OpenMP jobs where multiple cores are needed for single tasks.
{{site.data.alerts.end}}

### GPU specific settings

The following settings are required for running on the GPU nodes:

* **Slurm account**: When using GPUs, users must use the `merlin-gpu` Slurm account. This is done with the ``--account`` setting as follows:

```bash
#SBATCH --account=merlin-gpu  # The account 'merlin-gpu' must be used for GPUs
```

* **`[Valid until 08.01.2021]` GRES:** Slurm must be aware that the job will use GPUs. This is done with the `--gres` setting, at least, as follows:

```bash
#SBATCH --gres=gpu  # Always set at least this option when using GPUs
```

This option is still valid as it might be needed by other resources, but for GPUs the new options (i.e. `--gpus`, `--mem-per-gpu`) can be used, which provide more flexibility when running on GPUs.

* **`[Valid from 08.01.2021]` GPU options (instead of GRES):** Slurm must be aware that the job will use GPUs. New options are available for specifying
the GPUs as a consumable resource. These are the following:
  * `--gpus=[<type>:]<number>` *instead of* (but also in addition to) `--gres=gpu`: specifies the total number of GPUs required for the job.
  * `--gpus-per-node=[<type>:]<number>`, `--gpus-per-socket=[<type>:]<number>`, `--gpus-per-task=[<type>:]<number>`, to specify how many GPUs per node, socket and/or task need to be allocated.
  * `--cpus-per-gpu`, to specify the number of CPUs to be used for each GPU.
  * `--mem-per-gpu`, to specify the amount of memory to be used for each GPU.
  * Other advanced options (i.e. `--gpu-bind`). Please see the **man** pages for **sbatch**/**srun**/**salloc** (i.e. *`man sbatch`*) for further information.

Please read **[GPU advanced settings](/merlin6/running-jobs.html#gpu-advanced-settings)** below for further information.

* Please consider that one can specify the GPU `type` in some options. If one needs to specify it, then it must be specified in all options defined in the Slurm job.

#### GPU advanced settings

GPUs are also a shared resource. Hence, multiple users can run jobs on a single node, but only one GPU per user process
must be used.

**Until 08.01.2021**, users can define which GPU resources and *how many per node* they need with the ``--gres`` option.
Valid ``gres`` options are: ``gpu[[:type]:count]`` where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` and ``count=<number of gpus requested per node>``. For example:

```bash
#SBATCH --gres=gpu:GTX1080:4  # Use a node with 4 x GTX1080 GPUs
```

**From 08.01.2021**, `--gres` is not needed anymore (but can still be used), and `--gpus` and related options should replace it. `--gpus` works in a similar way, but without
the need of specifying the `gpu` resource. In other words, `--gpus` options are: ``[<type>:]<count>`` where ``type=GTX1080|GTX1080Ti|RTX2080Ti`` (which is optional) and ``count=<number of gpus to use>``. For example:

```bash
#SBATCH --gpus=GTX1080:4  # Use 4 GPUs with Type=GTX1080
```

This setting can be used together with other settings, such as `--gpus-per-node`, in order to accomplish a behaviour similar to `--gres`.
* Please consider that one can specify the GPU `type` in some of the options. If one needs to specify it, then it must be specified in all options defined in the Slurm job.

{{site.data.alerts.tip}}Always check <span style="color:orange;"><b>'/etc/slurm/gres.conf'</b></span> for the available <span style="color:orange;"><i>Types</i></span> and for details of the NUMA node.
{{site.data.alerts.end}}

## Batch script templates

### CPU-based job templates

The following examples apply to the **Merlin6** cluster.

#### Non-multithreaded jobs template

The following template should be used by any user submitting jobs to CPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --hint=nomultithread                # Mandatory for non-multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1                # Only mandatory for non-multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify #nodes to use
##SBATCH --ntasks=44                        # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44               # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44                 # Uncomment and specify the number of cores per task
```

#### Multithreaded jobs template

The following template should be used by any user submitting jobs to CPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<general|daily|hourly>  # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --hint=multithread                  # Mandatory for multithreaded jobs
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2                # Only mandatory for multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify #nodes to use
##SBATCH --ntasks=88                        # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88               # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88                 # Uncomment and specify the number of cores per task
```

### GPU-based job templates

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=<gpu|gpu-short>         # Specify GPU partition
#SBATCH --gpus="<type>:<num_gpus>"          # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS>                 # Strongly recommended
#SBATCH --output=<output_file>              # Generate custom output file
#SBATCH --error=<error_file>                # Generate custom error file
#SBATCH --account=merlin-gpu                # The account 'merlin-gpu' must be used
##SBATCH --exclusive                        # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1                          # Uncomment and specify number of nodes to use
##SBATCH --ntasks=1                         # Uncomment and specify number of tasks to use
##SBATCH --cpus-per-gpu=5                   # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000                # Uncomment and specify the memory (in MB) per GPU
##SBATCH --gpus-per-node=<type>:2           # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2         # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1           # Uncomment and specify the number of GPUs per task
```

## Advanced configurations

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.

``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```

This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the `myprogram` executable. Each job will receive the same set of
configurations (e.g. time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and they also will each
have their own output file.

**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
that only 5 sub jobs may ever run in parallel.

You also can use an array job approach to run over all files in a directory, substituting the payload with

``` bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```

Or for a trivial case you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.

``` bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```

### Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:

``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00    # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1       # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
    # There is a checkpoint file;
    myprogram --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    myprogram
fi
```

The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob can ever run in parallel, so
this will result in subjob n+1 only being started when job n has finished. It will read the checkpoint file
if it is present.

### Packed jobs: running a large number of short tasks

Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `myprog` program 1000 times with
arguments passed from 1 to 1000. But with the `-N1 -n1 -c1
--exclusive` option, it will control that at any point in time only 44
instances are effectively running, each being allocated one CPU. You
can at this point decide to allocate several CPUs or tasks by adapting
the corresponding parameters.

``` bash
#! /bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44     # defines the number of parallel tasks
for i in {1..1000}
do
   srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```

**Note:** The `&` at the end of the `srun` line is needed to not have the script waiting (blocking).
The `wait` command waits for all such background tasks to finish and returns the exit code.
@ -1,204 +0,0 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 29 January 2021
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
---

## About Merlin5 & Merlin6

The new Slurm cluster is called **merlin6**. However, the old Slurm *merlin* cluster will be kept for some time, and it has been renamed to **merlin5**.
This allows running jobs on the old computing nodes until users have fully migrated their codes to the new cluster.

From July 2019, **merlin6** becomes the **default cluster** and any job submitted to Slurm will be submitted to that cluster. Users can keep submitting to
the old *merlin5* computing nodes by using the option ``--cluster=merlin5``.

This documentation only explains the usage of the **merlin6** Slurm cluster.

## Merlin6 CPU

The basic configuration of the **merlin6** CPU cluster is detailed here.
For advanced usage, please refer to [Understanding the Slurm configuration (for advanced users)](/merlin6/slurm-configuration.html#understanding-the-slurm-configuration-for-advanced-users).

### CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |

If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options, and the maximum memory allowed is `Max.Mem/Node`.
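
For example, a job requiring more memory per core could request it as in the sketch below (the values are only illustrative and must respect the per-node limits in the table above):

```bash
#SBATCH --mem-per-cpu=8000   # request 8000 MB per CPU instead of the default
##SBATCH --mem=16000         # alternatively, request the total memory per node
```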

In *Merlin6*, memory is considered a Consumable Resource, as well as the CPU.

### CPU partitions

A partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:

| CPU Partition       | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor\* |
|:-----------------:  | :----------: | :------: | :-------: | :------: | :-----------------: |
| **<u>general</u>**  | 1 day        | 1 week   | 50        | low      | 1                   |
| **daily**           | 1 day        | 1 day    | 67        | medium   | 500                 |
| **hourly**          | 1 hour       | 1 hour   | unlimited | highest  | 1000                |

\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lower priority.

The **general** partition is the *default*: when nothing is specified, a job will be assigned to that partition by default. **general** can not have more
than 50 nodes running jobs. For **daily** this limitation is extended to 67 nodes, while for **hourly** there are no limits.

{{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
than one hour should be sent to <b>hourly</b>. This ensures that you have higher priority than jobs sent to partitions with lower priority,
and is also advisable because <b>general</b> has a limited number of nodes that can run jobs. The idea behind this is that the cluster can not
be blocked by long jobs and we can always ensure resources for shorter jobs.
{{site.data.alerts.end}}

### User and job limits

In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to
avoid overuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user while having many idle nodes due to low overall activity is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources from the batch system would drain the entire cluster to fit the job, which is undesirable).

Hence, there is a need to set up sensible limits and to ensure that there is a fair usage of the resources, by trying to optimize the overall efficiency
of the cluster while allowing jobs of different nature and sizes (that is, **single core** based **vs parallel jobs** of different sizes) to run.

{{site.data.alerts.warning}}Wide limits are provided in the <b>daily</b> and <b>hourly</b> partitions, while for <b>general</b> those limits are
more restrictive.
<br>However, we kindly ask users to inform the Merlin administrators when there are plans to send big jobs which would require a
massive draining of nodes for allocating such jobs. This would apply to jobs requiring the <b>unlimited</b> QoS (see below <i>"Per job limits"</i>)
{{site.data.alerts.end}}

{{site.data.alerts.tip}}If you have different requirements, please let us know, we will try to accommodate or propose a solution for you.
{{site.data.alerts.end}}

#### Per job limits

These are limits which apply to a single job. In other words, there is a maximum of resources a single job can use. This is described in the table below,
and limits will vary depending on the day of the week and the time (*working* vs *non-working* hours). Limits are shown in the format `SlurmQoS(limits)`,
where `SlurmQoS` can be seen with the command `sacctmgr show qos`:

| Partition    | Mon-Fri 0h-18h                   | Sun-Thu 18h-0h                   | From Fri 18h to Mon 0h           |
|:----------:  | :------------------:             | :------------:                   | :---------------------:          |
| **general**  | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        |
| **daily**    | daytime(cpu=704,mem=2750G)       | nighttime(cpu=1408,mem=5500G)    | unlimited(cpu=2200,mem=8593.75G) |
| **hourly**   | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |

By default, a job can not use more than 704 cores (max CPU per job). In the same way, memory is also proportionally limited. This is equivalent to
running a job using up to 8 nodes at once. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
Limits are relaxed for the **daily** partition during non-working hours, and during the weekend limits are even wider.

For the **hourly** partition, **although running very large parallel jobs is not desirable** (allocating such jobs requires massive draining of nodes),
wider limits are provided. In order to avoid massive node draining in the cluster when allocating huge jobs, setting per job limits is necessary. Hence, the **unlimited** QoS
mostly refers to "per user" limits more than to "per job" limits (in other words, users can run any number of hourly jobs, but the size of such jobs is limited
with wide values).

#### Per user limits for CPU partitions

These are limits which apply exclusively to users. In other words, there is a maximum of resources a single user can use. This is described in the table below,
and limits will vary depending on the day of the week and the time (*working* vs *non-working* hours). Limits are shown in the format `SlurmQoS(limits)`,
where `SlurmQoS` can be seen with the command `sacctmgr show qos`:

| Partition    | Mon-Fri 0h-18h                 | Sun-Thu 18h-0h                 | From Fri 18h to Mon 0h         |
|:-----------: | :----------------:             | :------------:                 | :---------------------:        |
| **general**  | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)      |
| **daily**    | daytime(cpu=1408,mem=5500G)    | nighttime(cpu=2112,mem=8250G)  | unlimited(cpu=6336,mem=24750G) |
| **hourly**   | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G) |

By default, users can not use more than 704 cores at the same time (max CPU per user). Memory is also proportionally limited in the same way. This is
equivalent to 8 exclusive nodes. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
For the **hourly** partition, there are no restrictions and user limits are removed. Limits are relaxed for the **daily** partition during non-working
hours, and during the weekend limits are removed.
|
||||
|
||||
## Merlin6 GPU
|
||||
|
||||
Basic configuration for the **merlin6 GPUs** will be detailed here.
|
||||
For advanced usage, please refer to [Understanding the Slurm configuration (for advanced users)](/merlin6/slurm-configuration.html#understanding-the-slurm-configuration-for-advanced-users)
|
||||
|
||||
### GPU nodes definition
|
||||
|
||||
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type | Def.#GPUs | Max.#GPUs |
|
||||
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :--------: | :-------: | :-------: |
|
||||
| merlin-g-[001] | 1 core | 8 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 2 |
|
||||
| merlin-g-[002-005] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080** | 1 | 4 |
|
||||
| merlin-g-[006-009] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **GTX1080Ti** | 1 | 4 |
|
||||
| merlin-g-[010-013] | 1 core | 20 cores | 1 | 4000 | 102400 | 102400 | 10000 | **RTX2080Ti** | 1 | 4 |
|
||||
|
||||
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> for changes in the GPU type and details of the NUMA node.
|
||||
{{site.data.alerts.end}}
|
||||
|
||||
### GPU partitions
|
||||
|
||||
| GPU Partition | Default Time | Max Time | Max Nodes | Priority | PriorityJobFactor\* |
|
||||
|:-----------------: | :----------: | :------: | :-------: | :------: | :-----------------: |
|
||||
| **<u>gpu</u>** | 1 day | 1 week | 4 | low | 1 |
|
||||
| **gpu-short** | 2 hours | 2 hours | 4 | highest | 1000 |
|
||||
|
||||
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority
|
||||
partitions will usually run first (however, other factors such like **job age** or mainly **fair share** might affect to that decision). For the GPU
|
||||
partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lesser priority.
|
||||
|
||||
### User and job limits

The GPU cluster enforces some basic user and job limits to ensure that a single user can not abuse the resources and that the cluster is used fairly.
The limits are described below.

#### Per job limits

These are limits that apply to a single job. In other words, they define the maximum amount of resources a single job can use.
Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)`
(the list of possible `SlurmQoS` values can be shown with the command `sacctmgr show qos`):

| Partition     | Mon-Sun 0h-24h                         |
|:-------------:| :------------------------------------: |
| **gpu**       | gpu_week(cpu=40,gres/gpu=8,mem=200G)   |
| **gpu-short** | gpu_week(cpu=40,gres/gpu=8,mem=200G)   |

With these limits, a single job can not use more than 40 CPUs, more than 8 GPUs, or more than 200GB of memory.
Any job exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
Since there is no other QoS that temporarily overrides these job limits during the week (as happens, for instance, in the CPU **daily** partition), such a job needs to be cancelled, and the requested resources must be adapted to the resource limits above.

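If a job seems stuck because of these limits, the pending reason can be checked directly with Slurm. A minimal sketch, assuming a placeholder job ID `123456`:

```bash
# Show the state and pending reason of a job in the gmerlin6 cluster
# (123456 is a placeholder job ID; replace it with your own)
squeue --clusters=gmerlin6 --jobs=123456 --Format=JobID,Partition,State,ReasonList

# List the QoS definitions configured in Slurm and their limits
sacctmgr show qos format=Name,MaxTRES,MaxTRESPU
```
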
#### Per user limits for GPU partitions

These limits apply exclusively to users. In other words, they define the maximum amount of resources a single user can use.
Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)`
(the list of possible `SlurmQoS` values can be shown with the command `sacctmgr show qos`):

| Partition     | Mon-Sun 0h-24h                          |
|:-------------:| :-------------------------------------: |
| **gpu**       | gpu_week(cpu=80,gres/gpu=16,mem=400G)   |
| **gpu-short** | gpu_week(cpu=80,gres/gpu=16,mem=400G)   |

With these limits, a single user can not use more than 80 CPUs, more than 16 GPUs, or more than 400GB of memory.
Jobs sent by a user already exceeding such limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**. In that case, the jobs can wait in the queue until some of the running resources are freed.

Notice that user limits are wider than job limits. In that way, a user can run up to two 8-GPU jobs, or up to four 4-GPU jobs, etc.
Please try to avoid occupying all GPUs of the same type for several hours or multiple days, otherwise it would block other users needing the same
type of GPU.

## Understanding the Slurm configuration (for advanced users)

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.

Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.

For understanding the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes, and is also propagated to login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes, and is also propagated to login nodes for user read access.

The previous configuration files, which can be found on the login nodes, correspond exclusively to the **merlin6** cluster configuration.
Configuration files for the old **merlin5** cluster must be checked directly on any of the **merlin5** computing nodes: these are not propagated
to the **merlin6** login nodes.
@ -1,64 +0,0 @@
---
title: Using PModules
#tags:
#keywords:
last_updated: 20 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/using-modules.html
---

## Environment Modules

On top of the operating system stack we provide different software using the PSI-developed PModules system.

PModules is the officially supported way of providing software, and each package is deployed by a specific expert. Software which is used by many people
will usually be found in PModules.

If you miss any package/version, or a software package with a specific feature is missing, contact us. We will study whether it is feasible to install it.

### Basic commands:

Basic generic commands would be:

```bash
module avail                     # to see the list of available software packages provided via pmodules
module use unstable              # to get access to a set of packages not fully tested by the community
module load <package>/<version>  # to load a specific software package with a specific version
module search <string>           # to search for a specific software package and its dependencies
module list                      # to list which software is loaded in your environment
module purge                     # to unload all loaded packages and clean up the environment
```

Also, you can load multiple packages at once. This can be useful, for instance, when loading a package together with its dependencies:

```bash
# Single line
module load gcc/9.2.0 openmpi/3.1.5-1_merlin6

# Multiple lines
module load gcc/9.2.0
module load openmpi/3.1.5-1_merlin6
```

In the example above, we load ``openmpi/3.1.5-1_merlin6``, but we also specify ``gcc/9.2.0``, which is a strict dependency. The dependency must be
loaded in advance.

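A common pattern is to load the required modules at the top of a Slurm batch script, so that the job environment is reproducible. A minimal sketch, reusing the module versions from the example above (the executable name is a placeholder):

```bash
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks=4

# Start from a clean environment, then load the compiler and MPI stack
module purge
module load gcc/9.2.0 openmpi/3.1.5-1_merlin6

# 'mympiprog' is a placeholder for your own MPI executable
srun ./mympiprog
```
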
---

## When to request new PModules packages

### Missing software

If you don't find a specific software package, and you know that other people are also interested in it, it can be installed in PModules. Please contact us
and we will try to help with that. Deploying new software in PModules may take a few days.

Usually the installation of new software is possible, as long as a few users will use it. If you are interested in maintaining this software,
please let us know.

### Missing version

If the existing PModules versions of a specific package do not fit your needs, it is possible to ask for a new version.

Usually the installation of newer versions will be supported, as long as a few users will use it. The installation of intermediate versions can
be supported if this is strictly justified.

@ -61,10 +61,8 @@ until the requested resources are allocated.
When running **``salloc``**, once the resources are allocated, *by default* the user will get
a ***new shell on one of the allocated resources*** (if a user has requested few nodes, it will
prompt a new shell on the first allocated node). This is thanks to the default ``srun`` command
``srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL`` which will run
in the background (users do not need to specify it). However, this behaviour can
be changed by running a different command after the **``salloc``** command. In example:
prompt a new shell on the first allocated node). However, this behaviour can be changed by adding
a shell (`$SHELL`) at the end of the `salloc` command. In example:

```bash
# Typical 'salloc' call
@ -142,9 +140,9 @@ how to connect to the **NoMachine** service in the Merlin cluster.
For other non officially supported graphical access (X11 forwarding):

* For Linux clients, please follow [{Accessing Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html)
* For Windows clients, please follow [{Accessing Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html)
* For MacOS clients, please follow [{Accessing Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html)
* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html)

### 'srun' with x11 support

284
pages/merlin6/03-Slurm-General-Documentation/running-jobs.md
Normal file
@ -0,0 +1,284 @@
---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---

## The rules

Before starting to use the cluster, please read the following rules:

1. To ease and improve *scheduling* and *backfilling*, always try to **estimate** and **define a proper run time** of your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs for running at most within **one day**. Please consider the following:
   * Some software can simply scale up by using more nodes while drastically reducing the run time.
   * Some software allows saving a specific state, and a second job can start from that state: ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
   * Jobs submitted to **`hourly`** get more priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
   * Jobs submitted to **`daily`** get more priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
3. It is **forbidden** to run **very short jobs**, as they cause a lot of overhead and can also cause severe problems to the main scheduler.
   * ***Question:*** Is my job a very short job? ***Answer:*** If it finishes within a few seconds or a very few minutes, yes.
   * ***Question:*** How long should my job run? ***Answer:*** As a rule of thumb, from 5 minutes onwards starts being ok; from 15 minutes onwards is preferred.
   * Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.

{{site.data.alerts.tip}}Having a good estimation of the <i>time</i> needed by your jobs, a proper way for running them, and optimizing the jobs to <i>run within one day</i> will contribute to a fair and efficient use of the system.
{{site.data.alerts.end}}

## Basic commands for running batch scripts

* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking job status.
* Use **``scancel``** for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use Linux <b>'man'</b> pages when needed (i.e. <span style="color:orange;">'man sbatch'</span>), mostly for checking the available options for the above commands.
{{site.data.alerts.end}}

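A minimal sketch of the resulting workflow (the script name and the job ID are placeholders):

```bash
# Submit a batch script
sbatch myjob.sh
# Submitted batch job 123456   <- Slurm prints the job ID

# Check the status of your own jobs
squeue -u $USER

# Cancel a job by its ID
scancel 123456
```
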
## Basic settings

For a complete list of available options and parameters, it is recommended to use the **man pages** (i.e. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used when running jobs (for example, the ``--exclusive`` behaviour in ``sbatch`` differs from ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (i.e. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.

* **Clusters:** For running jobs in the different Slurm clusters, users should add the following option:

  ```bash
  #SBATCH --clusters=<cluster_name>  # Possible values: merlin5, merlin6, gmerlin6
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.

* **Partitions:** except when using the *default* partition of each cluster, one needs to specify the partition:

  ```bash
  #SBATCH --partition=<partition_name>  # Check each cluster documentation for possible values
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.

* **[Optional] Disabling shared nodes**: by default, nodes are not exclusive. Hence, multiple users can run on the same node. One can request exclusive node usage with the following option:

  ```bash
  #SBATCH --exclusive  # Only if you want a dedicated node
  ```

* **Time**: it is important to define how long a job should run, according to reality. This will help Slurm when *scheduling* and *backfilling*, and will let Slurm manage job queues in a more efficient way. This value can never exceed the `MaxTime` of the affected partition.

  ```bash
  #SBATCH --time=<D-HH:MM:SS>  # Can not exceed the partition `MaxTime`
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about partition `MaxTime` values.

* **Output and error files**: by default, Slurm will generate a standard output file (``slurm-%j.out``, where `%j` is the job ID) and a standard error file (``slurm-%j.err``, where `%j` is the job ID) in the directory from where the job was submitted. Users can change the default names with the following options:

  ```bash
  #SBATCH --output=<filename>  # Can include a path. Patterns accepted (i.e. %j)
  #SBATCH --error=<filename>   # Can include a path. Patterns accepted (i.e. %j)
  ```

  Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for getting the list specification of **filename patterns**.

* **Enable/Disable Hyper-Threading**: Whether a node has Hyper-Threading or not depends on the node configuration. By default, HT nodes have HT enabled, but one should specify it in the Slurm command as follows:

  ```bash
  #SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
  #SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about node configuration and Hyper-Threading.
  Consider that, sometimes, depending on your job requirements, you might also need to set how many `--ntasks-per-core` or `--cpus-per-task` (or even other options) to use in addition to the `--hint` option. Please contact us in case of doubts.

{{site.data.alerts.tip}} In general, for the cluster `merlin6`, <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a recommended field. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one needs to define how a task should be handled within a core, and this setting will generally not be used on hybrid MPI/OpenMP jobs where multiple cores are needed for single tasks.
{{site.data.alerts.end}}

## Batch script templates

### CPU-based job templates

The following examples apply to the **Merlin6** cluster.

#### Non-multithreaded jobs template

The following template should be used by any user submitting jobs to the Merlin6 CPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=merlin6                 # Cluster name
#SBATCH --partition=general,daily,hourly  # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended
#SBATCH --output=<output_file>            # Generate custom output file
#SBATCH --error=<error_file>              # Generate custom error file
#SBATCH --hint=nomultithread              # Mandatory for non-multithreaded jobs
##SBATCH --exclusive                      # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1              # Only mandatory for non-multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                        # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=44                      # Uncomment and specify the number of tasks to use
##SBATCH --ntasks-per-node=44             # Uncomment and specify the number of tasks per node
##SBATCH --cpus-per-task=44               # Uncomment and specify the number of cores per task
```

#### Multithreaded jobs template

The following template should be used by any user submitting multithreaded jobs to the Merlin6 CPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=merlin6                 # Cluster name
#SBATCH --partition=general,daily,hourly  # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended
#SBATCH --output=<output_file>            # Generate custom output file
#SBATCH --error=<error_file>              # Generate custom error file
#SBATCH --hint=multithread                # Mandatory for multithreaded jobs
##SBATCH --exclusive                      # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2              # Only mandatory for multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                        # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=88                      # Uncomment and specify the number of tasks to use
##SBATCH --ntasks-per-node=88             # Uncomment and specify the number of tasks per node
##SBATCH --cpus-per-task=88               # Uncomment and specify the number of cores per task
```

### GPU-based job templates

The following template should be used by any user submitting jobs to GPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6                   # Cluster name
#SBATCH --partition=gpu,gpu-short,gwendolen  # Specify one or multiple partitions
#SBATCH --gpus="<type>:<num_gpus>"           # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS>                  # Strongly recommended
#SBATCH --output=<output_file>               # Generate custom output file
#SBATCH --error=<error_file>                 # Generate custom error file
##SBATCH --exclusive                         # Uncomment if you need exclusive node usage
##SBATCH --account=gwendolen_public          # Uncomment if you need to use gwendolen

## Advanced options example
##SBATCH --nodes=1                           # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1                          # Uncomment and specify the number of tasks to use
##SBATCH --cpus-per-gpu=5                    # Uncomment and specify the number of CPUs per GPU
##SBATCH --mem-per-gpu=16000                 # Uncomment and specify the memory per GPU (in MB)
##SBATCH --gpus-per-node=<type>:2            # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2          # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1            # Uncomment and specify the number of GPUs per task
```

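As a concrete illustration, the following sketch requests two GTX1080Ti GPUs in the `gpu` partition. The executable name and the resource numbers are only examples, and the exact GPU type strings should be double-checked in `/etc/slurm/gres.conf`:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6
#SBATCH --partition=gpu
#SBATCH --gpus=GTX1080Ti:2      # Two GPUs of type GTX1080Ti
#SBATCH --cpus-per-gpu=5        # 10 CPUs in total
#SBATCH --mem-per-gpu=16000     # 32GB of memory in total
#SBATCH --time=0-12:00:00

# 'my_gpu_app' is a placeholder for your own executable
srun ./my_gpu_app
```
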
## Advanced configurations

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.

``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```

This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the "myprogram" executable. Each job will receive the same set of
configurations (e.g. time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and they also will each
have their own output file.

**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
that only 5 sub jobs may ever run in parallel.

You can also use an array job approach to run over all files in a directory, substituting the payload with

``` bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```

Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.

``` bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```

### Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:

``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00    # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1       # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
    # There is a checkpoint file; restart from it.
    myprogram --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    myprogram
fi
```

The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob can ever run in parallel, so
this will result in subjob n+1 only being started when subjob n has finished. It will read the checkpoint file
if it is present.

### Packed jobs: running a large number of short tasks

Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. you run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `myprog` program 1000 times with
arguments passed from 1 to 1000. But with the `-N1 -n1 -c1
--exclusive` option, it will control that at any point in time only 44
instances are effectively running, each being allocated one CPU. You
can at this point decide to allocate several CPUs or tasks by adapting
the corresponding parameters.

``` bash
#! /bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44          # defines the number of parallel tasks
for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```

**Note:** The `&` at the end of the `srun` line is needed to not have the script waiting (blocking).
The `wait` command waits for all such background tasks to finish and returns the exit code.

27
pages/merlin6/cluster-introduction.md
Normal file
@ -0,0 +1,27 @@
---
title: Introduction
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/cluster-introduction.html
---

## Slurm clusters

* The new Slurm CPU cluster is called [**`merlin6`**](/merlin6/cluster-introduction.html).
* The new Slurm GPU cluster is called [**`gmerlin6`**](/gmerlin6/cluster-introduction.html).
* The old Slurm *merlin* cluster is still active and best-effort support is provided.
  The cluster was renamed to [**merlin5**](/merlin5/cluster-introduction.html).

From July 2019, **`merlin6`** became the **default Slurm cluster**: any job submitted from the login nodes will be submitted to that cluster if no other cluster is specified. A short example is shown after the following list.
* Users can keep submitting to the old *`merlin5`* computing nodes by using the option ``--cluster=merlin5``.
* Users submitting to the **`gmerlin6`** GPU cluster need to specify the option ``--cluster=gmerlin6``.

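A minimal sketch of selecting the cluster at submission time and checking jobs across clusters (the script names are placeholders):

```bash
# Submit to the default CPU cluster (merlin6)
sbatch my_cpu_job.sh

# Submit explicitly to the GPU cluster or to the old merlin5 cluster
sbatch --clusters=gmerlin6 my_gpu_job.sh
sbatch --clusters=merlin5 my_old_job.sh

# Check your jobs on all clusters at once
squeue --clusters=all --user=$USER
```
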
### Slurm 'merlin6'

**CPU nodes** are configured in a **Slurm** cluster called **`merlin6`**, and
this is the _**default Slurm cluster**_. Hence, by default, if no Slurm cluster is
specified (with the `--cluster` option), this is the cluster to which the jobs
will be sent.
166
pages/merlin6/hardware-and-software-description.md
Normal file
@ -0,0 +1,166 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/hardware-and-software.html
---

## Hardware

### Computing Nodes

The new Merlin6 cluster contains a solution based on **four** [**HPE Apollo k6000 Chassis**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00016641enw):
* *Three* of them contain 24 x [**HP Apollo XL230K Gen10**](https://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=a00016634enw) blades.
* A *fourth* chassis was purchased in 2021 with [**HP Apollo XL230K Gen10**](https://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=a00016634enw) blades dedicated to a few experiments. Blades have slightly different components depending on specific project requirements.

The connectivity for the Merlin6 cluster is based on **ConnectX-5 EDR-100Gbps**, and each chassis contains:
* 1 x [HPE Apollo InfiniBand EDR 36-port Unmanaged Switch](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00016643enw)
  * 24 internal EDR-100Gbps ports (1 port per blade, for internal low latency connectivity)
  * 12 external EDR-100Gbps ports (for external low latency connectivity)

**Merlin6 CPU Computing Nodes**

| Chassis | Node                  | Processor | Sockets | Cores | Threads | Scratch | Memory |
|:-------:|:----------------------|:----------|:-------:|:-----:|:-------:|:-------:|:------:|
| **#0**  | **merlin-c-0[01-24]** | [Intel Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html) | 2 | 44 | 2 | 1.2TB | 384GB |
| **#1**  | **merlin-c-1[01-24]** | [Intel Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html) | 2 | 44 | 2 | 1.2TB | 384GB |
| **#2**  | **merlin-c-2[01-24]** | [Intel Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html) | 2 | 44 | 2 | 1.2TB | 384GB |
| **#3**  | **merlin-c-3[01-06]** | [Intel Xeon Gold 6240R](https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html) | 2 | 48 | 2 | 1.2TB | 384GB |
| **#3**  | **merlin-c-3[07-12]** | [Intel Xeon Gold 6240R](https://ark.intel.com/content/www/us/en/ark/products/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz.html) | 2 | 48 | 2 | 1.2TB | 768GB |

Each blade contains an NVMe disk, where up to 300GB are dedicated to the O.S. and ~1.2TB are reserved for the local `/scratch`.

### Login Nodes

*One old login node* (``merlin-l-01.psi.ch``) is inherited from the previous Merlin5 cluster. Its main use is for running some BIO services (`cryosparc`) and for submitting jobs.
*Two new login nodes* (``merlin-l-001.psi.ch``, ``merlin-l-002.psi.ch``) with a similar configuration to the Merlin6 computing nodes are available for the users. Their main use
is for compiling software and submitting jobs.

The connectivity is based on **ConnectX-5 EDR-100Gbps** for the new login nodes, and **ConnectIB FDR-56Gbps** for the old one.

**Merlin6 Login Nodes**

| Hardware | Node                 | Processor | Sockets | Cores | Threads | Scratch | Memory |
|:--------:|:---------------------|:----------|:-------:|:-----:|:-------:|:-------:|:------:|
| **Old**  | **merlin-l-01**      | [Intel Xeon E5-2697AV4](https://ark.intel.com/products/91768/Intel-Xeon-Processor-E5-2697A-v4-40M-Cache-2-60-GHz-) | 2 | 16 | 2 | 100GB | 512GB |
| **New**  | **merlin-l-00[1,2]** | [Intel Xeon Gold 6152](https://ark.intel.com/content/www/us/en/ark/products/120491/intel-xeon-gold-6152-processor-30-25m-cache-2-10-ghz.html) | 2 | 44 | 2 | 1.8TB | 384GB |

### Storage

The storage node is based on the [Lenovo Distributed Storage Solution for IBM Spectrum Scale](https://lenovopress.com/lp0626-lenovo-distributed-storage-solution-for-ibm-spectrum-scale-x3650-m5).
* 2 x **Lenovo DSS G240** systems, each one composed of 2 IO Nodes **ThinkSystem SR650** mounting 4 x **Lenovo Storage D3284 High Density Expansion** enclosures.
* Each IO node has a connectivity of 400Gbps (4 x EDR 100Gbps ports, 2 of them are **ConnectX-5** and 2 are **ConnectX-4**).

The storage solution is connected to the HPC clusters through 2 x **Mellanox SB7800 InfiniBand 1U Switches** for high availability and load balancing.

### Network

Merlin6 cluster connectivity is based on [**Infiniband**](https://en.wikipedia.org/wiki/InfiniBand) technology. This allows fast access with very low latencies to the data, as well as running
extremely efficient MPI-based jobs:
* Connectivity amongst computing nodes on different chassis ensures up to 1200Gbps of aggregated bandwidth.
* Intra-chassis connectivity (communication amongst computing nodes in the same chassis) ensures up to 2400Gbps of aggregated bandwidth.
* Communication to the storage ensures up to 800Gbps of aggregated bandwidth.

The Merlin6 cluster currently contains 5 Infiniband managed switches and 3 Infiniband unmanaged switches (one per HP Apollo chassis):
* 1 x **MSX6710** (FDR) for connecting old GPU nodes, old login nodes and the MeG cluster to the Merlin6 cluster (and storage). No High Availability mode is possible.
* 2 x **MSB7800** (EDR) for connecting login nodes, storage and other nodes in High Availability mode.
* 3 x **HP EDR Unmanaged** switches, each one embedded in an HP Apollo k6000 chassis solution.
* 2 x **MSB7700** (EDR) as the top switches, interconnecting the Apollo unmanaged switches and the managed switches (MSX6710, MSB7800).

## Software

In Merlin6, we try to keep the latest software stack release to get the latest features and improvements. Due to this, **Merlin6** runs:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-5** or superior cards.
* [MLNX_OFED LTS v.4.9-2.2.4.0](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) is installed for the remaining **ConnectX-3** and **ConnectIB** cards.
203
pages/merlin6/slurm-configuration.md
Normal file
@ -0,0 +1,203 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 29 January 2021
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the Merlin6 CPU cluster.

## Merlin6 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[101-124] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[201-224] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-c-[301-306] | 1 core    | 44 cores  | 2        | 4000        | 352000      | 352000       | 10000    | N/A       | N/A       |

If nothing is specified, by default each core will use up to 8GB of memory. Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options, and the maximum memory allowed is `Max.Mem/Node`.

In **`merlin6`**, memory is considered a Consumable Resource, as well as the CPU. Hence, both resources are accounted when submitting a job,
and by default resources can not be oversubscribed. This is a main difference with the old **`merlin5`** cluster, where only CPUs were accounted
and memory was oversubscribed by default.

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/slurm.conf'</b> for changes in the hardware.
{{site.data.alerts.end}}

## Running jobs in the 'merlin6' cluster

In this chapter we will cover the basic settings that users need to specify in order to run jobs in the Merlin6 CPU cluster.

### Merlin6 CPU cluster

To run jobs in the **`merlin6`** cluster, users **can optionally** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin6
```

If no cluster name is specified, by default any job will be submitted to this cluster (as this is the main cluster).
Hence, this is only necessary if one has to deal with multiple clusters, or when one has defined some environment
variables which can modify the cluster name.

### Merlin6 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`general`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: general, daily, hourly
```

The following *partitions* (also known as *queues*) are configured in Slurm:

| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>general</u>** | 1 day        | 1 week   | 50        | 1                   | 1                |
| **daily**          | 1 day        | 1 day    | 67        | 500                 | 1                |
| **hourly**         | 1 hour       | 1 hour   | unlimited | 1000                | 1                |
| **gfa-asa**        | 1 day        | 1 week   | 11        | 1000                | 1000             |

\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
* For **`hourly`** there are no limits.
* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
nodes are usually added to the **`hourly`** partition as extra resources for the public resources.

{{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
than one hour should be sent to <b>hourly</b>. This ensures that you have higher priority over jobs sent to partitions with less priority,
and it also matters because <b>general</b> limits the number of nodes that can be used. The idea behind this is that the cluster can not
be blocked by long jobs and we can always ensure resources for shorter jobs.
{{site.data.alerts.end}}

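The currently configured partitions and their limits can also be queried directly from Slurm, for example:

```bash
# Summary of partitions, availability and time limits in the merlin6 cluster
sinfo --clusters=merlin6 --summarize

# Per-partition view: partition, time limit, number of nodes, availability
sinfo --clusters=merlin6 -o "%P %l %D %a"
```
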
### Merlin6 CPU Accounts

Users need to ensure that the public **`merlin`** account is specified. Not specifying any account option will default to this account.
This is mostly needed by users who have multiple Slurm accounts, and who may by mistake define a different account.

```bash
#SBATCH --account=merlin   # Possible values: merlin, gfa-asa
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account     | Slurm Partitions                      |
| :---------------: | :----------------------------------: |
| **<u>merlin</u>** | `hourly`,`daily`, `general`           |
| **gfa-asa**       | `gfa-asa`,`hourly`,`daily`, `general` |

#### The 'gfa-asa' private account

Access to the **`gfa-asa`** partition must be done through the **`gfa-asa`** account. This account **is restricted**
to a group of users and is not public.

### Slurm CPU specific options

Some options are available when using CPUs. These are detailed here.

Alternative Slurm options for CPU based jobs are available. Please refer to the **man** pages
of each Slurm command for further information about them (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for the 'srun' command
```

#### Dealing with Hyper-Threading

The **`merlin6`** cluster contains nodes with Hyper-Threading enabled. One should always specify
whether to use Hyper-Threading or not. If not defined, Slurm will generally use it (exceptions apply).

```bash
#SBATCH --hint=multithread   # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
```

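As an illustration of how these options can be combined, below is a minimal sketch of a hybrid MPI/OpenMP job without Hyper-Threading. The executable name and the exact task/CPU split are only examples and must be adapted to your application:

```bash
#!/bin/bash
#SBATCH --cluster=merlin6
#SBATCH --partition=daily
#SBATCH --time=0-12:00:00
#SBATCH --hint=nomultithread   # one thread per physical core
#SBATCH --ntasks=8             # 8 MPI ranks
#SBATCH --cpus-per-task=11     # 11 OpenMP threads per rank
#SBATCH --ntasks-per-node=4    # 4 ranks x 11 threads = 44 cores per node

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# 'my_hybrid_app' is a placeholder for your own MPI/OpenMP executable
srun ./my_hybrid_app
```
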
### User and job limits

In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to
avoid overuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user while many nodes sit idle due to low overall activity is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources of the batch system would drain the entire cluster in order to fit the job, which is undesirable).

Hence, there is a need to set up wise limits and to ensure a fair usage of the resources, by trying to optimize the overall efficiency
of the cluster while allowing jobs of different natures and sizes (that is, **single core** based **vs parallel jobs** of different sizes) to run.

{{site.data.alerts.warning}}Wide limits are provided in the <b>daily</b> and <b>hourly</b> partitions, while for <b>general</b> those limits are
more restrictive.
<br>However, we kindly ask users to inform the Merlin administrators when there are plans to send big jobs which would require a
massive draining of nodes for allocating such jobs. This would apply to jobs requiring the <b>unlimited</b> QoS (see below <i>"Per job limits"</i>).
{{site.data.alerts.end}}

{{site.data.alerts.tip}}If you have different requirements, please let us know and we will try to accommodate or propose a solution for you.
{{site.data.alerts.end}}

#### Per job limits

These are limits which apply to a single job. In other words, there is a maximum amount of resources a single job can use. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits will vary depending on the day and time of the week.

| Partition    | Mon-Fri 0h-18h                   | Sun-Thu 18h-0h                   | From Fri 18h to Mon 0h           |
|:----------:  | :------------------------------: | :------------------------------: | :------------------------------: |
| **general**  | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        | normal(cpu=704,mem=2750G)        |
| **daily**    | daytime(cpu=704,mem=2750G)       | nighttime(cpu=1408,mem=5500G)    | unlimited(cpu=2200,mem=8593.75G) |
| **hourly**   | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |

By default, a job can not use more than 704 cores (max CPUs per job). In the same way, memory is also proportionally limited. This is equivalent to
running a job using up to 8 nodes at once. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
Limits are relaxed for the **daily** partition during non-working hours, and during the weekend limits are even wider.

For the **hourly** partition, wider limits are provided, **even though running many large parallel jobs is not desirable** (allocating such jobs requires a massive draining of nodes).
In order to avoid massive node draining in the cluster when allocating huge jobs, setting per job limits is necessary. Hence, the **unlimited** QoS
mostly refers to "per user" limits rather than to "per job" limits (in other words, users can run any number of hourly jobs, but the size of each such job is limited
with wide values).

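The QoS limits described above can be inspected directly from Slurm; a quick sketch:

```bash
# List the QoS definitions and their per-job / per-user limits
sacctmgr show qos format=Name%12,MaxTRES%40,MaxTRESPU%40

# Show which QoS is attached to each partition
scontrol show partition | grep -E "PartitionName|QoS"
```
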
#### Per user limits for CPU partitions

These are limits which apply exclusively to users. In other words, there is a maximum amount of resources a single user can use. Limits are described in the table below with the format `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits will vary depending on the day and time of the week.

| Partition    | Mon-Fri 0h-18h                 | Sun-Thu 18h-0h                | From Fri 18h to Mon 0h         |
|:-----------: | :----------------------------: | :---------------------------: | :----------------------------: |
| **general**  | normal(cpu=704,mem=2750G)      | normal(cpu=704,mem=2750G)     | normal(cpu=704,mem=2750G)      |
| **daily**    | daytime(cpu=1408,mem=5500G)    | nighttime(cpu=2112,mem=8250G) | unlimited(cpu=6336,mem=24750G) |
| **hourly**   | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G)| unlimited(cpu=6336,mem=24750G) |

By default, users can not use more than 704 cores at the same time (max CPUs per user). Memory is also proportionally limited in the same way. This is
equivalent to 8 exclusive nodes. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
For the **hourly** partition, user limits are removed. Limits are relaxed for the **daily** partition during non-working
hours, and during the weekend limits are removed.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.

For understanding the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes, and is also propagated to login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes, and is also propagated to login nodes for user read access.

The previous configuration files, which can be found on the login nodes, correspond exclusively to the **merlin6** cluster configuration.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
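For a quick look at the multi-cluster setup without opening the configuration files, something like the following can be used (shown here as a sketch):

```bash
# List the clusters registered in the multi-cluster setup
sacctmgr show clusters format=Cluster,ControlHost,ControlPort

# Dump the running configuration of a given cluster without opening slurm.conf
scontrol --clusters=gmerlin6 show config | less
```
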
12
siteinfo.md
@ -26,21 +26,21 @@ Link processing in Jekyll
|
||||
|
||||
Code | Result | Baseurl
|
||||
---- | ------ | -------
|
||||
`{%raw%}[Normal link to source]{%endraw%}{%raw%}(/pages/merlin6/01 introduction/introduction.md){%endraw%}` | [Normal link to source](/pages/merlin6/01 introduction/introduction.md) | ✅
|
||||
`{%raw%}[Normal link to source]{%endraw%}{%raw%}(/pages/merlin6/01-Quick-Start-Guide/introduction.md){%endraw%}` | [Normal link to source](/pages/merlin6/01-Quick-Start-Guide/introduction.md) | ✅
|
||||
`{%raw%}[Normal link to result](/merlin6/introduction.html){%endraw%}` | [Normal link to result](/merlin6/introduction.html) | ❌
|
||||
`{%raw%}[Invalid Escaped link to source]({{"/pages/merlin6/01 introduction/introduction.md"}}){%endraw%}` | [Invalid Escaped link to source]({{"/pages/merlin6/01 introduction/introduction.md"}}) | ❌❗
|
||||
`{%raw%}[Invalid Escaped link to source]({{"/pages/merlin6/01-Quick-Start-Guide/introduction.md"}}){%endraw%}` | [Invalid Escaped link to source]({{"/pages/merlin6/01-Quick-Start-Guide/introduction.md"}}) | ❌❗
|
||||
`{%raw%}[Escaped link to result]({{"/merlin6/introduction.html"}}){%endraw%}` | [Escaped link to result]({{"/merlin6/introduction.html"}}) | ❌
|
||||
`{%raw%}[Reference link to source](srcRef){%endraw%}` | [Reference link to source][srcRef] | ✅
|
||||
`{%raw%}[Reference link to result](dstRef){%endraw%}` | [Reference link to result][dstRef] | ❌
|
||||
`{%raw%}[Liquid Link]({% link pages/merlin6/01 introduction/introduction.md %}){%endraw%}` | [Liquid Link]({% link pages/merlin6/01 introduction/introduction.md %}) | ❌
|
||||
`{%raw%}[Liquid Link]({% link pages/merlin6/01-Quick-Start-Guide/introduction.md %}){%endraw%}` | [Liquid Link]({% link pages/merlin6/01-Quick-Start-Guide/introduction.md %}) | ❌
|
||||
`{%raw%}{%endraw%}` |  | ✅
|
||||
`{%raw%}{%endraw%}` |  | ❌
|
||||
`{%raw%}{% include inline_image.html file="psi-logo.png" alt="Included PSI Logo" %}{%endraw%}` | {% include inline_image.html file="psi-logo.png" alt="Included PSI Logo" -%} | | ❌
|
||||
`{%raw%}{{ "/pages/merlin6/01 introduction/introduction.md" | relative_url }}{%endraw%}` | {{ "/pages/merlin6/01 introduction/introduction.md" | relative_url }} | ✅❗
|
||||
`{%raw%}{{ "/pages/merlin6/01-Quick-Start-Guide/introduction.md" | relative_url }}{%endraw%}` | {{ "/pages/merlin6/01-Quick-Start-Guide/introduction.md" | relative_url }} | ✅❗
|
||||
`{%raw%}{{ "/merlin6/introduction.html" | relative_url }}{%endraw%}` | {{ "/merlin6/introduction.html" | relative_url }} | ✅
|
||||
`{%raw%}{% link pages/merlin6/01 introduction/introduction.md %}{%endraw%}` | {% link pages/merlin6/01 introduction/introduction.md %} | ✅
|
||||
`{%raw%}{% link pages/merlin6/01-Quick-Start-Guide/introduction.md %}{%endraw%}` | {% link pages/merlin6/01-Quick-Start-Guide/introduction.md %} | ✅
|
||||
|
||||
[srcRef]: /pages/merlin6/01 introduction/introduction.md
|
||||
[srcRef]: /pages/merlin6/01-Quick-Start-Guide/introduction.md
|
||||
[dstRef]: /merlin6/introduction.html
|
||||
|
||||
Key:
|
||||
|