From ddfb50ee3194d7e882d8a9747fc53c4734b9bcb5 Mon Sep 17 00:00:00 2001
From: caubet_m
Date: Tue, 20 Apr 2021 11:44:38 +0200
Subject: [PATCH] Updated code

---
 .../hardware-and-software-description.md      | 123 ++++++++++++++++++
 pages/gmerlin6/introduction.md                |  56 ++++----
 pages/merlin6/01 introduction/introduction.md |  11 +-
 3 files changed, 160 insertions(+), 30 deletions(-)
 create mode 100644 pages/gmerlin6/hardware-and-software-description.md

diff --git a/pages/gmerlin6/hardware-and-software-description.md b/pages/gmerlin6/hardware-and-software-description.md
new file mode 100644
index 0000000..d71aacb
--- /dev/null
+++ b/pages/gmerlin6/hardware-and-software-description.md
@@ -0,0 +1,123 @@
+---
+title: Hardware And Software Description
+#tags:
+#keywords:
+last_updated: 19 April 2021
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /gmerlin6/hardware-and-software.html
+---
+
+## Hardware
+
+### GPU Computing Nodes
+
+The Merlin6 GPU cluster was initially built from workstations recycled from different groups in the BIO division.
+Since then, it has been gradually extended with new nodes through sporadic investments from the same division;
+a large central investment was never possible. As a result, the Merlin6 GPU computing cluster is a
+non-homogeneous setup, consisting of a wide variety of hardware types and components.
+
+In 2018, for the common good, BIO decided to open the cluster to the Merlin users and make it widely
+accessible to PSI scientists.
+
+The table below summarizes the hardware setup of the Merlin6 GPU computing nodes:
+
+**Merlin6 GPU Computing Nodes**
+
+| Node             | Processor               | Sockets | Cores | Threads | Scratch | Memory | GPU       |
+|------------------|-------------------------|---------|-------|---------|---------|--------|-----------|
+| merlin-g-001     | Intel Core i7-5960X     | 1       | 16    | 2       | 1.8TB   | 128GB  | GTX1080   |
+| merlin-g-00[2-5] | Intel Xeon E5-2640      | 2       | 20    | 1       | 1.8TB   | 128GB  | GTX1080   |
+| merlin-g-006     | Intel Xeon E5-2640      | 2       | 20    | 1       | 800GB   | 128GB  | GTX1080Ti |
+| merlin-g-00[7-9] | Intel Xeon E5-2640      | 2       | 20    | 1       | 3.5TB   | 128GB  | GTX1080Ti |
+| merlin-g-01[0-3] | Intel Xeon Silver 4210R | 2       | 20    | 1       | 1.7TB   | 128GB  | RTX2080Ti |
+
+### Login Nodes
+
+The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
+and are used to compile code and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
+Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
+
+### Storage
+
+The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
+and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
+Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.
+
+### Network
+
+The Merlin6 cluster connectivity is based on [Infiniband FDR and EDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
+This allows fast, very low latency access to the data, as well as running extremely efficient MPI-based jobs.
+The network speed (56Gbps for **FDR**, 100Gbps for **EDR**) of the different machines can be checked by running the following command on each node:
+
+```bash
+ibstat | grep Rate
+```
+
+## Software
+
+On the Merlin6 GPU computing nodes, we try to keep the software stack consistent with the main [Merlin6](/merlin6/index.html) cluster.
+
+Hence, the Merlin6 GPU nodes run:
+* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
+* [**Slurm**](https://slurm.schedmd.com/), which we usually keep up to date with the most recent versions.
+* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
+* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-4** or newer cards.
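The FDR/EDR speeds quoted in the Network section can also be turned into a quick classification check. A minimal sketch: the `ibstat` parsing shown in the comment is an assumption about its output format, and the sample value is hard-coded so the snippet is self-contained:

```shell
# On a Merlin node the rate would come from ibstat, e.g.:
#   rate=$(ibstat | awk '/Rate:/ {print $2; exit}')
# Hard-coded sample value so this sketch runs anywhere:
rate=56

# Map the link rate (in Gbps) to the InfiniBand generation name.
case "$rate" in
  56)  link="FDR" ;;
  100) link="EDR" ;;
  *)   link="unknown" ;;
esac
echo "Link type: $link"   # → Link type: FDR
```

For the sample value of 56 this prints `Link type: FDR`; on a real node the `rate` assignment would be replaced by the `ibstat` pipeline from the comment.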
diff --git a/pages/gmerlin6/introduction.md b/pages/gmerlin6/introduction.md
index 6fc3dff..72383bb 100644
--- a/pages/gmerlin6/introduction.md
+++ b/pages/gmerlin6/introduction.md
@@ -1,47 +1,49 @@
 ---
-title: Cluster 'gmerlin6'
+title: Introduction
 #tags:
 #keywords:
-last_updated: 07 April 2021
+last_updated: 28 June 2019
 #summary: "GPU Merlin 6 cluster overview"
 sidebar: merlin6_sidebar
-permalink: /merlin5/introduction.html
+permalink: /gmerlin6/introduction.html
 redirect_from:
   - /gmerlin6
   - /gmerlin6/index.html
 ---
 
-## Slurm 'merlin5' cluster
+## About Merlin6 GPU cluster
 
-**Merlin5** was the old official PSI Local HPC cluster for development and
-mission-critical applications which was built in 2016-2017. It was an
-extension of the Merlin4 cluster and built from existing hardware due
-to a lack of central investment on Local HPC Resources. **Merlin5** was
-then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
-with an important central investment of ~1,5M CHF. **Merlin5** was mostly
-based on CPU resources, but also contained a small amount of GPU-based
-resources which were mostly used by the BIO experiments.
+### Introduction
 
-**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
-called **`merlin5`**. In that way, the old CPU computing nodes are still available as extra computation resources,
-and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.
+Merlin6 is the official PSI Local HPC cluster for development and
+mission-critical applications, built in 2019. It replaces
+the Merlin5 cluster.
 
-The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
-cluster, which becomes the **main Local HPC Cluster**.
-Hence, **[Merlin6](/merlin6/index.html)**
-contains the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) Clusters (`merlin5`, `merlin6`, `gmerlin6`).
+Merlin6 is designed to be extensible, so it is technically possible to add
+more compute nodes and cluster storage without a significant increase in
+manpower and operations costs.
 
-### Submitting jobs to 'merlin5'
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.
 
-To submit jobs to the **`merlin5`** Slurm cluster, it must be done from the **Merlin6** login nodes by using
-the option `--clusters=merlin5` on any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc. commands).
+### Slurm 'gmerlin6'
 
-## The Merlin Architecture
+The **GPU nodes** have a dedicated **Slurm** cluster, called **`gmerlin6`**.
 
-### Multi Non-Federated Cluster Architecture Design: The Merlin cluster
+This cluster mounts the same shared storage resources (`/data/user`, `/data/project`, `/shared-scratch`, `/afs`, `/psi/home`)
+which are present in the other Merlin Slurm clusters (`merlin5`, `merlin6`). The Slurm `gmerlin6` cluster is maintained
+independently, to ease user access and to keep separate user accounting.
 
-The following image shows the Slurm architecture design for Merlin cluster.
-It contains a multi non-federated cluster setup, with a central Slurm database
-and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):
+## Merlin6 Architecture
+
+### Merlin6 Cluster Architecture Diagram
+
+The following image shows the Merlin6 cluster architecture diagram:
+
+![Merlin6 Architecture Diagram]({{ "/images/merlinschema3.png" }})
+
+### Merlin5 + Merlin6 Slurm Cluster Architecture Design
+
+The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters:
 
 ![Merlin6 Slurm Architecture Design]({{ "/images/merlin-slurm-architecture.png" }})
-

diff --git a/pages/merlin6/01 introduction/introduction.md b/pages/merlin6/01 introduction/introduction.md
index 7266fb0..7bf8926 100644
--- a/pages/merlin6/01 introduction/introduction.md
+++ b/pages/merlin6/01 introduction/introduction.md
@@ -21,10 +21,15 @@
 Merlin6 is designed to be extensible, so is technically possible to add
 more compute nodes and cluster storage without significant increase of
 the costs of the manpower and the operations.
 
-Merlin6 is mostly based on CPU resources, but also contains a small amount
-of GPU-based resources which are mostly used by the BIO experiments.
+Merlin6 is mostly based on **CPU** resources, but also contains a small amount
+of **GPU**-based resources which are mostly used by the BIO experiments.
 
----
+### Slurm 'merlin6'
+
+**CPU nodes** are configured in a **Slurm** cluster, called **`merlin6`**, and
+this is the _**default Slurm cluster**_. Hence, by default, if no Slurm cluster is
+specified (with the `--cluster` option), jobs will be sent to this cluster.
+
 ## Merlin6 Architecture
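The cluster-selection behaviour described in the two introduction pages (default `merlin6` cluster, explicit `--clusters` to reach the others) can be illustrated with a hypothetical batch script. Everything below is a sketch: the resource values and the `nvidia-smi` step are illustrative assumptions, not taken from the Merlin documentation:

```shell
# Write a hypothetical batch script targeting the gmerlin6 GPU cluster.
# Without --clusters, the job would go to the default 'merlin6' cluster.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --clusters=gmerlin6   # send the job to the GPU Slurm cluster
#SBATCH --gpus=1              # request one GPU (standard Slurm >= 19.05 syntax)
#SBATCH --time=01:00:00       # illustrative walltime
srun nvidia-smi
EOF

# On a Merlin login node the script would then be submitted with:
#   sbatch gpu_job.sh
```

The same `--clusters` flag works on `salloc` and `srun` as well, which is how the patch text describes reaching the non-default clusters such as `merlin5`.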