first stab at mkdocs migration
docs/merlin5/cluster-introduction.md (normal file, +44 lines)
@@ -0,0 +1,44 @@
---
title: Cluster 'merlin5'
#tags:
#keywords:
last_updated: 07 April 2021
#summary: "Merlin 5 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin5/cluster-introduction.html
---

## Slurm 'merlin5' cluster

**Merlin5** was the official PSI Local HPC cluster for development and
mission-critical applications, built in 2016-2017. It was an
extension of the Merlin4 cluster and was built from existing hardware due
to a lack of central investment in Local HPC resources. **Merlin5** was
then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
backed by a significant central investment of ~1.5M CHF. **Merlin5** was mostly
based on CPU resources, but it also contained a small number of GPU-based
resources, which were mostly used by the BIO experiments.

**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
called **`merlin5`**. In this way, the old CPU computing nodes remain available as extra computing resources
and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.

The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
cluster, which became the **main Local HPC cluster**. Hence, **[Merlin6](/merlin6/index.html)**
hosts the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) clusters (`merlin5`, `merlin6`, `gmerlin6`).

### Submitting jobs to 'merlin5'

Jobs must be submitted to the **`merlin5`** Slurm cluster from the **Merlin6** login nodes, by adding
the option `--clusters=merlin5` to any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc.).

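For illustration, a minimal batch script targeting the `merlin5` cluster could look as follows (a sketch only; the job name, resource request and payload are placeholders, not site defaults):

```bash
#!/bin/bash
#SBATCH --clusters=merlin5      # route the job to the 'merlin5' Slurm cluster
#SBATCH --job-name=test_merlin5 # placeholder job name
#SBATCH --ntasks=1              # placeholder resource request
#SBATCH --time=00:10:00         # placeholder walltime

srun hostname                   # placeholder payload: print where the job ran
```

The same option also works directly on the command line, for example `sbatch --clusters=merlin5 job.sh` or `squeue --clusters=merlin5`.
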
## The Merlin Architecture

### Multi Non-Federated Cluster Architecture Design: The Merlin cluster

The following image shows the Slurm architecture design for the Merlin cluster.
It consists of a multi non-federated cluster setup, with a central Slurm database
and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):



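Since all clusters register against the same central Slurm database, they can be queried together from a login node. A couple of illustrative commands (a sketch; the exact output depends on the site configuration):

```bash
# List the clusters registered in the central Slurm database
sacctmgr show clusters format=Cluster,ControlHost,ControlPort

# Show partition and node status for several clusters at once
sinfo --clusters=merlin5,merlin6,gmerlin6
```
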
docs/merlin5/hardware-and-software-description.md (normal file, +97 lines)
@@ -0,0 +1,97 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 09 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin5/hardware-and-software.html
---

## Hardware

### Computing Nodes

Merlin5 is built from recycled nodes, and hardware will be decommissioned as soon as it fails (due to the expired warranty and the age of the cluster).
* Merlin5 is based on the [**HPE c7000 Enclosure**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04128339) solution, with 16 x [**HPE ProLiant BL460c Gen8**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04123239) nodes per chassis.
* Connectivity is based on InfiniBand **ConnectX-3 QDR-40Gbps**:
    * 16 internal ports for intra-chassis communication
    * 2 connected external ports for inter-chassis communication and storage access.

The table below summarizes the hardware setup of the Merlin5 computing nodes:

<table>
  <thead>
    <tr>
      <th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="8">Merlin5 CPU Computing Nodes</th>
    </tr>
    <tr>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Chassis</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
    </tr>
  </thead>
  <tbody>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#0</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[18-30]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[31,32]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#1</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[33-45]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[46,47]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
    </tr>
  </tbody>
</table>

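These figures can be cross-checked against what Slurm actually advertises for a node, for example (a sketch; the exact fields shown depend on the Slurm version):

```bash
# Show the Slurm definition of one Merlin5 node (CPUs, sockets, threads, memory, ...)
scontrol --clusters=merlin5 show node merlin-c-18
```
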
### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile software and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

The Merlin5 cluster connectivity is based on [InfiniBand QDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This allows fast, very low latency access to the data, as well as running highly efficient MPI-based jobs.
However, this is an old generation of InfiniBand, which requires older drivers, and the software can not take advantage of the latest features.

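As a quick check of the fabric from a compute node (a sketch, assuming the standard `infiniband-diags` tools are installed), the adapter state and link rate can be printed with:

```bash
# Print HCA state and link rate; a ConnectX-3 QDR link should report a 40 Gb/s rate
ibstat
```
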
## Software

In Merlin5, we try to keep the software stack coherent with the main cluster, [Merlin6](/merlin6/index.html).

For this reason, Merlin5 runs the following stack (a quick way to verify the installed versions is sketched below):
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.4.9-2.2.4.0**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), which is an old version, but required because **ConnectX-3** support has been dropped in newer OFED versions.
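
A sketch of how the installed versions can be checked on a node (`mmdiag` and `ofed_info` are the usual GPFS and MLNX_OFED tools; they may require root or a specific PATH depending on the installation):

```bash
cat /etc/redhat-release   # RHEL release
sinfo --version           # Slurm version
mmdiag --version          # GPFS / Spectrum Scale version
ofed_info -s              # MLNX_OFED version
```
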
docs/merlin5/slurm-configuration.md (normal file, +142 lines)
@@ -0,0 +1,142 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document summarizes the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---

This page describes the basic Slurm configuration and the options needed to run jobs in the Merlin5 cluster.

The Merlin5 cluster is built from old hardware and is maintained on a best-effort basis in order to increase the CPU capacity of the Merlin cluster.

## Merlin5 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes            | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node (MB) | Max.Swap (MB) |
|:----------------:| :--------:| :--------:| :------: | :---------------: | :-----------: |
| merlin-c-[18-30] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[31-32] | 1 core    | 16 cores  | 1        | 124000            | 10000         |
| merlin-c-[33-45] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[46-47] | 1 core    | 16 cores  | 1        | 124000            | 10000         |

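These values come from the Slurm node definitions and can be double-checked from a login node (a sketch; the column layout varies between Slurm versions):

```bash
# Per-node CPU and memory configuration as defined in Slurm
sinfo --clusters=merlin5 --Node --long
```
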
There is one *main difference between the Merlin5 and Merlin6 clusters*: Merlin5 keeps an old configuration which does not
treat memory as a *consumable resource*. Hence, users can *oversubscribe* memory. This might trigger some side effects, but
this legacy configuration has been kept to ensure that old jobs keep running in the same way they did a few years ago.
If you know that this might be a problem for you, please always use Merlin6 instead.

## Running jobs in the 'merlin5' cluster

This section covers the basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.

### Merlin5 CPU cluster

To run jobs in the **`merlin5`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --clusters=merlin5
```

### Merlin5 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`merlin`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```

The table below shows all partitions available to users:

| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:------------------:| :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>merlin</u>**  | 5 days       | 1 week   | All nodes | 500                 | 1                |
| **merlin-long**    | 5 days       | 21 days  | 4         | 1                   | 1                |

**\***The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

The **`merlin-long`** partition **is limited to 4 nodes**, as it might contain jobs running for up to 21 days.

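The partition limits listed above can be verified at any time against the live configuration (a sketch; this assumes the login node can reach the `merlin5` controller):

```bash
# Show time limits, node counts and priorities of the merlin5 partitions
scontrol --clusters=merlin5 show partitions
```
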
### Merlin5 CPU Accounts

Users need to ensure that the public **`merlin`** account is used. Not specifying any account option will default to this account.
This is mostly relevant for users with multiple Slurm accounts, who may specify a different account by mistake.

```bash
#SBATCH --account=merlin   # Possible values: merlin
```

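To check which accounts your user is actually associated with in the Slurm database, a query along these lines can be used (a sketch; the fields follow standard `sacctmgr` format options):

```bash
# List your Slurm associations: accounts you may submit with, per cluster
sacctmgr show associations user=$USER format=Cluster,Account,Partition,QOS
```
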
### Slurm CPU specific options

Several Slurm options are relevant for CPU-based jobs. Please refer to the **man** pages
of each Slurm command for further information (`man salloc`, `man sbatch`, `man srun`).
The most common settings are listed below:

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>   # only for the 'srun' command
```

Notice that in **Merlin5** no hyper-threading is available (while in **Merlin6** it is).
Hence, in **Merlin5** there is no need to specify the hyper-threading related `--hint` options.

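Putting the previous settings together, a complete job script for the `merlin5` cluster is sketched below (the job name, sizes, walltime and the application binary are placeholders only):

```bash
#!/bin/bash
#SBATCH --clusters=merlin5       # mandatory: target the merlin5 cluster
#SBATCH --partition=merlin       # or 'merlin-long' for jobs running up to 21 days
#SBATCH --account=merlin         # public merlin account
#SBATCH --job-name=mpi_example   # placeholder job name
#SBATCH --ntasks=32              # e.g. two full nodes (16 cores each, no hyper-threading)
#SBATCH --ntasks-per-node=16
#SBATCH --time=12:00:00          # placeholder walltime

srun ./my_mpi_application        # placeholder payload
```
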
## User and job limits

The CPU cluster enforces some limits which apply to jobs and users. The idea behind this is to ensure fair usage of the resources and to
avoid abuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user alongside many idle nodes, due to low overall activity, is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources of the batch system would drain the entire cluster in order to fit the job, which is undesirable).

Hence, limits need to be set up wisely to ensure fair usage of the resources, trying to optimize the overall efficiency
of the cluster while allowing jobs of different natures and sizes (that is, **single core** based **vs. parallel jobs** of different sizes) to run.

Since not many users run on the **`merlin5`** cluster, its limits are less strict than the ones set in the **`merlin6`** and **`gmerlin6`** clusters.

### Per job limits

These are limits which apply to a single job. In other words, they define the maximum amount of resources a single job can use. These limits are described in the table below,
in the format `SlurmQoS(limits)` (the available `SlurmQoS` values can be listed with the `sacctmgr show qos` command, as sketched further below):

| Partition        | Mon-Sun 0h-24h   | Other limits |
|:----------------:| :--------------: | :----------: |
| **merlin**       | merlin5(cpu=384) | None         |
| **merlin-long**  | merlin5(cpu=384) | Max. 4 nodes |

By default, due to the QoS limits, a job can not use more than 384 cores (max CPUs per job).
However, the `merlin-long` partition is even more restricted: there is an extra limit of 4 dedicated nodes for this partition. This is defined
at the partition level and overrides any QoS limit whenever it is more restrictive.

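The QoS limits themselves can be inspected with `sacctmgr` (a sketch; the per-job limit appears in the `MaxTRES` column, and field names may differ slightly between Slurm versions):

```bash
# Show the QoS definitions and their per-job / per-user resource limits
sacctmgr show qos format=Name%20,MaxTRES%30,MaxTRESPU%30
```
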
### Per user limits for CPU partitions

No per-user limits apply at the QoS level. For the **`merlin`** partition, a single user could fill the whole batch system with jobs (however, the restriction on job size still applies, as explained above). For the **`merlin-long`** partition, the 4-node limitation still applies.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.

To understand the Slurm configuration of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
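
Alternatively, the active configuration of a remote cluster can usually be dumped from a login node without logging in to a compute node (a sketch; availability depends on the multi-cluster setup):

```bash
# Dump the running Slurm configuration of the merlin5 cluster
scontrol --clusters=merlin5 show config | less
```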