2021-04-15 17:38:45 +02:00
parent 19c7f9bb79
commit e9861ef6b5
6 changed files with 356 additions and 88 deletions


@@ -0,0 +1,97 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 09 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin5/hardware-and-software.html
---
## Hardware
### Computing Nodes
Merlin5 is built from recycled nodes; due to the age of the cluster and its expired warranty, hardware will be decommissioned as soon as it fails.
* Merlin5 is based on the [**HPE c7000 Enclosure**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04128339) solution, with 16 x [**HPE ProLiant BL460c Gen8**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04123239) nodes per chassis.
* Connectivity is based on InfiniBand **ConnectX-3 QDR (40 Gbps)**:
  * 16 internal ports for intra-chassis communication.
  * 2 connected external ports for inter-chassis communication and storage access.
The table below summarizes the hardware setup of the Merlin5 computing nodes:
<table>
<thead>
<tr>
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="8">Merlin5 CPU Computing Nodes</th>
</tr>
<tr>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Chassis</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
</tr>
</thead>
<tbody>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#0</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[18-30]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-[31,32]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#1</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[33-45]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td rowspan="1"><b>merlin-c-[46,47]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
</tr>
</tbody>
</table>
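The per-node CPU and memory configuration shown above can also be queried directly from Slurm. The following is a minimal sketch, assuming it is run from a Merlin6 login node, from which the `merlin5` cluster is reachable:

```bash
# List the merlin5 nodes with their CPU count, real memory (MB) and state:
# %N = node name, %c = CPUs, %m = memory, %T = state
sinfo --clusters=merlin5 --Node --format="%N %c %m %T"

# Detailed view of a single node (sockets, cores per socket, threads per core, memory):
scontrol --clusters=merlin5 show node merlin-c-18
```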
### Login Nodes
The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile software and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
### Storage
The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
### Network
Merlin5 cluster connectivity is based on [InfiniBand QDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This provides fast, low-latency access to the data and allows running very efficient MPI-based jobs.
However, this is an old generation of InfiniBand: it requires older drivers, and software cannot take advantage of the latest features.
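To verify the InfiniBand link from a node, the standard InfiniBand diagnostic tools can be used. A minimal sketch, assuming `ibstat` (from `infiniband-diags`, shipped with MLNX_OFED) is installed:

```bash
# Show the state and rate of the local InfiniBand ports;
# on Merlin5 (QDR) the rate should be reported as 40 Gbps.
ibstat | grep -E 'State|Rate'
```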
## Software
In Merlin5, we try to keep the software stack coherent with the main cluster, [Merlin6](/merlin6/index.html).
Therefore, Merlin5 runs the following (a sketch for checking the installed versions follows the list):
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually keep up to date with the most recent releases.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.4.9-2.2.4.0**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), an old version that is nevertheless required because **ConnectX-3** support has been dropped in newer OFED releases.
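A minimal sketch of how these components can be checked from a shell on a Merlin5 node (the GPFS RPM naming is an assumption based on standard installations and may differ):

```bash
# Operating system release
cat /etc/redhat-release

# Slurm version reported by the client commands
sinfo --version

# GPFS / Spectrum Scale packages (standard gpfs.* RPM naming assumed)
rpm -qa 'gpfs*'

# MLNX_OFED version
ofed_info -s
```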


@@ -0,0 +1,47 @@
---
title: Cluster 'merlin5'
#tags:
#keywords:
last_updated: 07 April 2021
#summary: "Merlin 5 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin5/introduction.html
redirect_from:
- /merlin5
- /merlin5/index.html
---
## Slurm 'merlin5' cluster
**Merlin5** was the old official PSI Local HPC cluster for development and
mission-critical applications, built in 2016-2017. It was an
extension of the Merlin4 cluster and was built from existing hardware due
to a lack of central investment in Local HPC resources. **Merlin5** was
then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
thanks to an important central investment of ~1.5M CHF. **Merlin5** was mostly
based on CPU resources, but also contained a small amount of GPU-based
resources, which were mostly used by the BIO experiments.
**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
called **`merlin5`**. In this way, the old CPU computing nodes remain available as extra computation resources
and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.
The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
cluster, which became the **main Local HPC cluster**. Hence, **[Merlin6](/merlin6/index.html)**
hosts the storage that is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) clusters (`merlin5`, `merlin6`, `gmerlin6`).
### Submitting jobs to 'merlin5'
Jobs must be submitted to the **`merlin5`** Slurm cluster from the **Merlin6** login nodes, by adding
the option `--clusters=merlin5` to any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc.), as shown in the sketch below.
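For example (a minimal sketch, where `myjob.sh` is a placeholder for your own batch script):

```bash
# Submit a batch script to the merlin5 cluster instead of the default one
sbatch --clusters=merlin5 myjob.sh

# Request an interactive allocation on merlin5
salloc --clusters=merlin5 --ntasks=4

# Check the queue of the merlin5 cluster
squeue --clusters=merlin5
```

Alternatively, the option can be set inside the batch script itself with a `#SBATCH --clusters=merlin5` directive.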
## The Merlin Architecture
### Multi Non-Federated Cluster Architecture Design: The Merlin cluster
The following image shows the Slurm architecture design of the Merlin cluster.
It consists of a multi non-federated cluster setup, with a central Slurm database
and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):
![Merlin6 Slurm Architecture Design]({{ "/images/merlin-slurm-architecture.png" }})
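Since all the clusters are registered in the same central Slurm database, they can be listed and queried together from a login node. A minimal sketch:

```bash
# List the clusters registered in the central Slurm database (slurmdbd)
sacctmgr show clusters format=Cluster,ControlHost,ControlPort

# Most Slurm commands accept a comma-separated list of clusters, e.g.:
squeue --clusters=merlin5,merlin6,gmerlin6
```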