Updated code

This commit is contained in:
caubet_m 2021-04-20 11:44:38 +02:00
parent e9861ef6b5
commit ddfb50ee31
3 changed files with 160 additions and 30 deletions

View File

@ -0,0 +1,123 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 19 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /gmerlin6/hardware-and-software.html
---
## Hardware
### GPU Computing Nodes
The Merlin6 GPU cluster was initially built from recycled workstations from different groups in the BIO division.
Since then, it has been gradually extended with new nodes funded by sporadic investments from the same division; a large central investment was never possible.
As a result, the Merlin6 GPU computing cluster is non-homogeneous, consisting of a wide variety of hardware types and components.
In 2018, for the common good, BIO decided to open the cluster to Merlin users and make it widely accessible to PSI scientists.
The table below summarizes the hardware setup of the Merlin6 GPU computing nodes:
<table>
<thead>
<tr>
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="8">Merlin6 GPU Computing Nodes</th>
</tr>
<tr>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU</th>
</tr>
</thead>
<tbody>
<tr style="vertical-align:middle;text-align:center;">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-001</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/82930/intel-core-i7-5960x-processor-extreme-edition-20m-cache-up-to-3-50-ghz.html">Intel Core i7-5960X</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">16</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[2-5]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-006</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">800GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[7-9]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">3.5TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-01[0-3]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/197098/intel-xeon-silver-4210r-processor-13-75m-cache-2-40-ghz.html">Intel Xeon Silver 4210R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.7TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">RTX2080Ti</td>
</tr>
</tbody>
</table>
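The table values can be cross-checked directly on a node. Below is a minimal sketch, assuming interactive access to one of the nodes listed above; the `/scratch` mount point is an assumption and may differ on your system:

```bash
# CPU topology as seen by the OS (sockets, cores per socket, threads per core)
lscpu | grep -E '^(Socket|Core|Thread|CPU\(s\))'

# GPUs installed in the node (requires the NVIDIA driver to be loaded)
nvidia-smi -L

# Size of the local scratch area (mount point is an assumption)
df -h /scratch
```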
### Login Nodes
The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`,`merlin6`,`gmerlin6`,etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
### Storage
The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted in all the ***Slurm clusters*** (`merlin5`,`merlin6`,`gmerlin6`,etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
### Network
The Merlin6 cluster connectivity is based on the [Infiniband FDR and EDR](https://en.wikipedia.org/wiki/InfiniBand) technologies.
This provides fast, low-latency access to the data, and allows running very efficient MPI-based jobs.
The network speed of a machine (56Gbps for **FDR**, 100Gbps for **EDR**) can be checked by running the following command on each node:
```bash
ibstat | grep Rate
```
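The output contains `Rate: 56` on **FDR** nodes and `Rate: 100` on **EDR** nodes. To survey several nodes at once, a simple SSH loop can be used; this is only a sketch, assuming password-less SSH access and the GPU node hostnames listed in the table above:

```bash
# Print the InfiniBand rate of each GPU node (hostnames are assumptions)
for host in merlin-g-00{1..9} merlin-g-01{0..3}; do
    echo -n "${host}: "
    ssh "${host}" "ibstat | grep Rate"
done
```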
## Software
On the Merlin6 GPU computing nodes, we try to keep the software stack consistent with the main [Merlin6](/merlin6/index.html) cluster.
Hence, the Merlin6 GPU nodes run:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-4** or superior cards.
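The deployed versions can be verified on any node. A hedged sketch follows; the GPFS and MLNX_OFED queries assume the respective client tools are installed, and `mmdiag` may require root privileges:

```bash
cat /etc/redhat-release   # RHEL release
scontrol --version        # Slurm version
mmdiag --version          # GPFS (IBM Spectrum Scale) build; may require root
ofed_info -s              # MLNX_OFED driver version
```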

View File

@ -1,47 +1,49 @@
 ---
-title: Cluster 'gmerlin6'
+title: Introduction
 #tags:
 #keywords:
-last_updated: 07 April 2021
+last_updated: 28 June 2019
 #summary: "GPU Merlin 6 cluster overview"
 sidebar: merlin6_sidebar
-permalink: /merlin5/introduction.html
+permalink: /gmerlin6/introduction.html
 redirect_from:
 - /gmerlin6
 - /gmerlin6/index.html
 ---
-## Slurm 'merlin5' cluster
-**Merlin5** was the old official PSI Local HPC cluster for development and
-mission-critical applications which was built in 2016-2017. It was an
-extension of the Merlin4 cluster and built from existing hardware due
-to a lack of central investment on Local HPC Resources. **Merlin5** was
-then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
-with an important central investment of ~1,5M CHF. **Merlin5** was mostly
-based on CPU resources, but also contained a small amount of GPU-based
-resources which were mostly used by the BIO experiments.
-**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
-called **`merlin5`**. In that way, the old CPU computing nodes are still available as extra computation resources,
-and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.
-The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
-cluster, which becomes the **main Local HPC Cluster**. Hence, **[Merlin6](/merlin6/index.html)**
-contains the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) Clusters (`merlin5`, `merlin6`, `gmerlin6`).
-### Submitting jobs to 'merlin5'
-To submit jobs to the **`merlin5`** Slurm cluster, it must be done from the **Merlin6** login nodes by using
-the option `--clusters=merlin5` on any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc.).
-## The Merlin Architecture
-### Multi Non-Federated Cluster Architecture Design: The Merlin cluster
-The following image shows the Slurm architecture design for the Merlin cluster.
-It contains a multi non-federated cluster setup, with a central Slurm database
-and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):
+## About the Merlin6 GPU cluster
+### Introduction
+Merlin6 is the official PSI Local HPC cluster for development and
+mission-critical applications, which was built in 2019. It replaces
+the Merlin5 cluster.
+Merlin6 is designed to be extensible, so it is technically possible to add
+more compute nodes and cluster storage without a significant increase in
+manpower and operational costs.
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.
+### Slurm 'gmerlin6'
+The **GPU nodes** have a dedicated **Slurm** cluster, called **`gmerlin6`**.
+This cluster has the same shared storage resources (`/data/user`, `/data/project`, `/shared-scratch`, `/afs`, `/psi/home`)
+which are present in the other Merlin Slurm clusters (`merlin5`, `merlin6`). The Slurm `gmerlin6` cluster is maintained
+independently, to ease access for the users and to keep user accounting independent.
+## Merlin6 Architecture
+### Merlin6 Cluster Architecture Diagram
+The following image shows the Merlin6 cluster architecture diagram:
+![Merlin6 Architecture Diagram]({{ "/images/merlinschema3.png" }})
+### Merlin5 + Merlin6 Slurm Cluster Architecture Design
+The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters:
 ![Merlin6 Slurm Architecture Design]({{ "/images/merlin-slurm-architecture.png" }})
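The new text above introduces the dedicated `gmerlin6` Slurm cluster with its independent accounting. As a sketch of what this means in practice, using standard Slurm multi-cluster commands from a login node (not specific to this commit):

```bash
# List partitions and node states of the gmerlin6 cluster
sinfo --clusters=gmerlin6

# Show today's accounted jobs on gmerlin6 only (independent accounting)
sacct --clusters=gmerlin6 --starttime=today
```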

View File

@ -21,10 +21,15 @@ Merlin6 is designed to be extensible, so it is technically possible to add
 more compute nodes and cluster storage without a significant increase in
 manpower and operational costs.
-Merlin6 is mostly based on CPU resources, but also contains a small amount
-of GPU-based resources which are mostly used by the BIO experiments.
----
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.
+### Slurm 'merlin6'
+**CPU nodes** are configured in a **Slurm** cluster, called **`merlin6`**, and
+this is the _**default Slurm cluster**_. Hence, if no Slurm cluster is
+specified (with the `--cluster` option), jobs will be sent to this
+cluster by default.
 ## Merlin6 Architecture
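The default-cluster behaviour described in the added paragraph can be illustrated with standard Slurm commands; `job.sh` is a hypothetical batch script:

```bash
# No cluster specified: the job goes to the default cluster, merlin6
sbatch job.sh

# Explicitly target another Merlin Slurm cluster
sbatch --clusters=gmerlin6 job.sh   # GPU cluster
sbatch --clusters=merlin5 job.sh    # legacy CPU cluster
```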