Updated code

commit ddfb50ee31 (parent e9861ef6b5)

pages/gmerlin6/hardware-and-software-description.md (new file, 123 lines added)

@@ -0,0 +1,123 @@
---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 19 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /gmerlin6/hardware-and-software.html
---

## Hardware

### GPU Computing Nodes

The Merlin6 GPU cluster was initially built from workstations recycled from different groups in the BIO division.
Since then, it has been extended little by little with new nodes bought through sporadic investments from the same division; a single large central investment was never possible.
As a result, the Merlin6 GPU computing cluster is not homogeneous, and consists of a wide variety of hardware types and components.

In 2018, for the common good, BIO decided to open the cluster to the Merlin users and make it widely accessible to PSI scientists.

The table below summarizes the hardware setup of the Merlin6 GPU computing nodes:

<table>
  <thead>
    <tr>
      <th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="8">Merlin6 GPU Computing Nodes</th>
    </tr>
    <tr>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU</th>
    </tr>
  </thead>
  <tbody>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-001</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/82930/intel-core-i7-5960x-processor-extreme-edition-20m-cache-up-to-3-50-ghz.html">Intel Core i7-5960X</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">16</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[2-5]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-006</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">800GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[7-9]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640 v4</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">3.5TB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-01[0-3]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/197098/intel-xeon-silver-4210r-processor-13-75m-cache-2-40-ghz.html">Intel Xeon Silver 4210R</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">1.7TB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">RTX2080Ti</td>
    </tr>
  </tbody>
</table>

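As an illustration of how these GPU resources are typically requested through Slurm, a minimal batch script could look like the sketch below. This is only a sketch: the GRES specification and any model-specific request (e.g. `gpu:GTX1080:1`) are assumptions and are not taken from this page.

```bash
#!/bin/bash
#SBATCH --clusters=gmerlin6   # submit to the GPU Slurm cluster
#SBATCH --gres=gpu:1          # request one GPU (model-specific GRES names are an assumption)
#SBATCH --time=01:00:00

# Print the GPU(s) allocated to the job (requires the NVIDIA driver on the node)
nvidia-smi
```
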
### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile code and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

The Merlin6 cluster connectivity is based on [InfiniBand FDR and EDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This provides fast, very low latency access to the data, as well as extremely efficient MPI-based jobs.
To check the network speed of a machine (56Gbps for **FDR**, 100Gbps for **EDR**), run the following command on the node:

```bash
ibstat | grep Rate
```
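
For convenience, a minimal sketch for checking several nodes in one go is shown below. The node names are taken from the table above, and SSH access to them is assumed.

```bash
# Illustrative only: print the InfiniBand link rate of a few GPU nodes
for node in merlin-g-001 merlin-g-006 merlin-g-010; do
    echo "== ${node} =="
    ssh "${node}" "ibstat | grep Rate"
done
```
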
## Software

On the Merlin6 GPU computing nodes, we try to keep the software stack coherent with the main [Merlin6](/merlin6/index.html) cluster.

For this reason, the Merlin6 GPU nodes run:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-4** or newer cards.
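
To verify what a given node is actually running, the following commands can be used. This is only a sketch: `mmdiag` usually requires elevated privileges, and the exact output format differs slightly between versions.

```bash
cat /etc/redhat-release   # RedHat Enterprise Linux release
sinfo --version           # Slurm version (any Slurm client command accepts --version)
ofed_info -s              # MLNX_OFED driver stack version
sudo mmdiag --version     # GPFS / IBM Spectrum Scale version
```
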
@@ -1,47 +1,49 @@
 ---
-title: Cluster 'gmerlin6'
+title: Introduction
 #tags:
 #keywords:
-last_updated: 07 April 2021
+last_updated: 28 June 2019
 #summary: "GPU Merlin 6 cluster overview"
 sidebar: merlin6_sidebar
-permalink: /merlin5/introduction.html
+permalink: /gmerlin6/introduction.html
 redirect_from:
   - /gmerlin6
   - /gmerlin6/index.html
 ---

-## Slurm 'merlin5' cluster
+## About Merlin6 GPU cluster

-**Merlin5** was the old official PSI Local HPC cluster for development and
-mission-critical applications which was built in 2016-2017. It was an
-extension of the Merlin4 cluster and built from existing hardware due
-to a lack of central investment on Local HPC Resources. **Merlin5** was
-then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
-with an important central investment of ~1,5M CHF. **Merlin5** was mostly
-based on CPU resources, but also contained a small amount of GPU-based
-resources which were mostly used by the BIO experiments.
+### Introduction

-**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
-called **`merlin5`**. In that way, the old CPU computing nodes are still available as extra computation resources,
-and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.
+Merlin6 is the official PSI Local HPC cluster for development and
+mission-critical applications that has been built in 2019. It replaces
+the Merlin5 cluster.

-The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
-cluster, which becomes the **main Local HPC Cluster**. Hence, **[Merlin6](/merlin6/index.html)**
-contains the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) Clusters (`merlin5`, `merlin6`, `gmerlin6`).
+Merlin6 is designed to be extensible, so is technically possible to add
+more compute nodes and cluster storage without significant increase of
+the costs of the manpower and the operations.

-### Submitting jobs to 'merlin5'
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.

-To submit jobs to the **`merlin5`** Slurm cluster, it must be done from the **Merlin6** login nodes by using
-the option `--clusters=merlin5` on any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc. commands).
+### Slurm 'gmerlin6'

-## The Merlin Architecture
+The **GPU nodes** have a dedicated **Slurm** cluster, called **`gmerlin6`**.

-### Multi Non-Federated Cluster Architecture Design: The Merlin cluster
+This cluster contains the same shared storage resources (`/data/user`, `/data/project`, `/shared-scratch`, `/afs`, `/psi/home`)
+which are present in the other Merlin Slurm clusters (`merlin5`, `merlin6`). The Slurm `gmerlin6` cluster is maintained
+independently to ease access for the users and keep independent user accounting.

-The following image shows the Slurm architecture design for Merlin cluster.
-It contains a multi non-federated cluster setup, with a central Slurm database
-and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):
+## Merlin6 Architecture

-
+### Merlin6 Cluster Architecture Diagram
+
+The following image shows the Merlin6 cluster architecture diagram:
+
+
+
+### Merlin5 + Merlin6 Slurm Cluster Architecture Design
+
+The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters:
+
+
@@ -21,10 +21,15 @@ Merlin6 is designed to be extensible, so is technically possible to add
 more compute nodes and cluster storage without significant increase of
 the costs of the manpower and the operations.

-Merlin6 is mostly based on CPU resources, but also contains a small amount
-of GPU-based resources which are mostly used by the BIO experiments.
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.

----
+### Slurm 'merlin6'
+
+**CPU nodes** are configured in a **Slurm** cluster, called **`merlin6`**, and
+this is the _**default Slurm cluster**_. Hence, by default, if no Slurm cluster is
+specified (with the `--cluster` option), this will be the cluster to which the jobs
+will be sent.

 ## Merlin6 Architecture

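
The added lines above describe `merlin6` as the default Slurm cluster when no `--cluster`/`--clusters` option is given. A short, illustrative comparison follows; the job script name is a placeholder.

```bash
sbatch myjob.sh                       # no --clusters option: goes to the default 'merlin6' cluster
sbatch --clusters=gmerlin6 myjob.sh   # explicitly target the GPU 'gmerlin6' cluster
sbatch --clusters=merlin5 myjob.sh    # explicitly target the legacy 'merlin5' cluster
```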