first stab at mkdocs migration
BIN docs/cscs-userlab/downloads/CSCS/PSI_CSCSAllocations2023.xltx (new file)

docs/cscs-userlab/index.md (new file)
@@ -0,0 +1,55 @@

# PSI HPC@CSCS

PSI has a long-standing collaboration with CSCS for offering high-end
HPC resources to PSI projects. PSI co-invested in CSCS' initial
Cray XT3 supercomputer *Horizon* in 2005, and we continue to procure a share on the
CSCS flagship systems.

The share is intended for projects that by their nature cannot profit
from applying for the regular [CSCS user lab allocation
schemes](https://www.cscs.ch/user-lab/allocation-schemes).

We can also help PSI groups procure additional resources under the
PSI conditions; please contact us in such a case.

## Yearly survey for requesting a project on the PSI share

At the end of each year we run a survey process and notify all subscribed
users of the dedicated **PSI HPC@CSCS mailing list** (see below) and the
Merlin cluster lists to enter their resource requests for the next year. Projects
receive resources in the form of allocations over the four quarters of the
following year.

Project requests are reviewed and may be adapted to fit into the
available capacity.

The survey is conducted through ServiceNow: navigate to
[Home > Service Catalog > Research Computing > Apply for computing resources at CSCS](https://psi.service-now.com/psisp?id=psi_new_sc_cat_item&sys_id=8d14bd1e4f9c7b407f7660fe0310c7e9)
and submit the form.

Applications will be reviewed and, in case of oversubscription, the final
resource allocations will be arbitrated by a panel within CSD.

### Instructions for filling out the 2026 survey

* We have a budget of 100 kCHF for 2026, which translates to 435'000 multicore node hours or 35'600 node hours on the GPU Grace Hopper nodes. The minimum allocation for multicore projects is 10'000 node hours; an average project allocation amounts to about 30'000 node hours.
* You need to specify the total resource request for your project in node hours, and how you would like to split the resources over the four quarters. Please enter the allocation per quarter in percent (e.g. 25%, 25%, 25%, 25%). If you indicate nothing, 25% per quarter will be assumed.
* We currently have a total of 65 TB of storage for all projects. Additional storage
can be obtained, but large storage assignments are not in scope for these projects.
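For orientation, the budget figures above imply the following per-node-hour rates. This is illustrative arithmetic only, derived from the stated 2026 budget and conversion numbers:

```bash
# Implied rates from the 2026 figures above: 100 kCHF buys either
# 435'000 multicore node hours or 35'600 Grace Hopper GPU node hours.
mc_rate=$(awk 'BEGIN { printf "%.2f", 100000 / 435000 }')   # CHF per multicore node hour
gpu_rate=$(awk 'BEGIN { printf "%.2f", 100000 / 35600 }')   # CHF per GPU node hour
echo "multicore: ${mc_rate} CHF/node-hour, GPU: ${gpu_rate} CHF/node-hour"
```

An average 30'000 node-hour multicore project therefore corresponds to roughly 6'900 CHF of the budget.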

## CSCS Systems reference information

For 2025 we can offer access to the [CSCS Alps](https://www.cscs.ch/computers/alps) Eiger (CPU multicore) and Daint (GPU) systems.

* [CSCS User Portal](https://user.cscs.ch/)
* Documentation
    * [CSCS Eiger CPU multicore cluster](https://docs.cscs.ch/clusters/eiger/)
    * [CSCS Daint GPU cluster](https://docs.cscs.ch/clusters/daint/)

## Contact information

* PSI Contacts:
    * Mailing list contact: <psi-hpc-at-cscs-admin@lists.psi.ch>
    * Marc Caubet Serrabou <marc.caubet@psi.ch>
    * Derek Feichtinger <derek.feichtinger@psi.ch>
* Mailing list for receiving user notifications and survey information: psi-hpc-at-cscs@lists.psi.ch [(subscribe)](https://psilists.ethz.ch/sympa/subscribe/psi-hpc-at-cscs)

docs/cscs-userlab/transfer-data.md (new file)
@@ -0,0 +1,52 @@

---
title: Transferring Data between PSI and CSCS
#tags:
keywords: CSCS, data-transfer
last_updated: 02 March 2022
summary: "This document shows the procedure for transferring data between CSCS and PSI"
sidebar: CSCS_sidebar
permalink: /CSCS/transfer-data.html
---

# Transferring Data

This document shows how to transfer data between PSI and CSCS using a Linux workstation.

## Preparing SSH configuration

If the directory **`.ssh`** does not exist in your home directory, create it with **`0700`** permissions:

```bash
mkdir ~/.ssh
chmod 0700 ~/.ssh
```

Then create a new file **`.ssh/config`** if it does not exist, or add the following lines
to the already existing file, replacing **`$cscs_accountname`** with your CSCS username:

```bash
Host daint.cscs.ch
    Compression yes
    ProxyJump ela.cscs.ch
Host *.cscs.ch
    User $cscs_accountname
```
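The steps above can also be scripted. A minimal sketch that creates the directory with the right permissions and appends the host entries only if they are not already present (`$cscs_accountname` stays a literal placeholder here; the `0600` mode on the config file is a common convention, not a requirement from this guide):

```bash
# Create ~/.ssh with restrictive permissions, then append the CSCS host
# entries to ~/.ssh/config only if they are not already there.
mkdir -p ~/.ssh && chmod 0700 ~/.ssh
touch ~/.ssh/config && chmod 0600 ~/.ssh/config
if ! grep -q '^Host \*\.cscs\.ch' ~/.ssh/config; then
    cat >> ~/.ssh/config <<'EOF'
Host daint.cscs.ch
    Compression yes
    ProxyJump ela.cscs.ch
Host *.cscs.ch
    User $cscs_accountname
EOF
fi
```

Re-running the snippet is harmless: the `grep` guard keeps the entries from being duplicated.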

### Advanced SSH configuration

Many other SSH settings are available for advanced configurations.
Users who already have entries in their SSH configuration may need to adapt the lines above accordingly.

## Transferring files

Once the above configuration is in place, you can rsync between PSI (e.g. Merlin) and CSCS in either direction:

```bash
# CSCS -> PSI
rsync -azv daint.cscs.ch:<source_path> <destination_path>

# PSI -> CSCS
rsync -azv <source_path> daint.cscs.ch:<destination_path>
```

docs/gmerlin6/cluster-introduction.md (new file)
@@ -0,0 +1,46 @@

---
title: Introduction
#tags:
#keywords:
last_updated: 28 June 2019
#summary: "GPU Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /gmerlin6/cluster-introduction.html
---

## About the Merlin6 GPU cluster

### Introduction

Merlin6 is the official PSI local HPC cluster for development and
mission-critical applications. It was built in 2019 and replaces
the Merlin5 cluster.

Merlin6 is designed to be extensible, so it is technically possible to add
more compute nodes and cluster storage without a significant increase in
manpower and operational costs.

Merlin6 is mostly based on **CPU** resources, but also contains a small number
of **GPU**-based resources which are mostly used by the BIO experiments.

### Slurm 'gmerlin6'

The **GPU nodes** have a dedicated **Slurm** cluster, called **`gmerlin6`**.

This cluster mounts the same shared storage resources (`/data/user`, `/data/project`, `/shared-scratch`, `/afs`, `/psi/home`)
as the other Merlin Slurm clusters (`merlin5`, `merlin6`). The Slurm `gmerlin6` cluster is maintained
independently to ease access for the users and to keep independent user accounting.

## Merlin6 Architecture

### Merlin6 Cluster Architecture Diagram

The following image shows the Merlin6 cluster architecture diagram:

![Merlin6 Architecture Diagram](/images/merlin-arch.png)

### Merlin5 + Merlin6 Slurm Cluster Architecture Design

The following image shows the Slurm architecture design for the Merlin5 & Merlin6 clusters:

![Merlin5 & Merlin6 Slurm Architecture Design](/images/merlin5-6_slurm_architecture.png)
docs/gmerlin6/hardware-and-software-description.md (new file)
@@ -0,0 +1,151 @@

---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 19 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /gmerlin6/hardware-and-software.html
---

## Hardware

### GPU Computing Nodes

The Merlin6 GPU cluster was initially built from recycled workstations from different groups in the BIO division.
Since then it has been extended little by little with new nodes from sporadic investments by the same division; a single large central investment was never possible.
As a result, the Merlin6 GPU computing cluster is a non-homogeneous solution, consisting of a wide variety of hardware types and components.

In 2018, for the common good, BIO decided to open the cluster to the Merlin users and make it widely accessible to PSI scientists.

The table below summarizes the hardware setup of the Merlin6 GPU computing nodes:

<table>
<thead>
<tr>
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="9">Merlin6 GPU Computing Nodes</th>
</tr>
<tr>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPUs</th>
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU Model</th>
</tr>
</thead>
<tbody>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-001</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/82930/intel-core-i7-5960x-processor-extreme-edition-20m-cache-up-to-3-50-ghz.html">Intel Core i7-5960X</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">16</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[2-5]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.8TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">4</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-006</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">800GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">4</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-00[7-9]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/92984/intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html">Intel Xeon E5-2640</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">3.5TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">4</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">GTX1080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-01[0-3]</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://ark.intel.com/content/www/us/en/ark/products/197098/intel-xeon-silver-4210r-processor-13-75m-cache-2-40-ghz.html">Intel Xeon Silver 4210R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">20</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1.7TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">128GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">4</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">RTX2080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-014</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://www.intel.com/content/www/us/en/products/sku/199343/intel-xeon-gold-6240r-processor-35-75m-cache-2-40-ghz/specifications.html?wapkw=Intel(R)%20Xeon(R)%20Gold%206240R%20CP">Intel Xeon Gold 6240R</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2.9TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">8</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">RTX2080Ti</td>
</tr>
<tr style="vertical-align:middle;text-align:center;" ralign="center">
<td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-015</b></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1"><a href="https://www.intel.com/content/www/us/en/products/sku/215279/intel-xeon-gold-5318s-processor-36m-cache-2-10-ghz/specifications.html">Intel Xeon Gold 5318S</a></td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">48</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">1</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">2.9TB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">384GB</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">8</td>
<td style="vertical-align:middle;text-align:center;" rowspan="1">RTX A5000</td>
</tr>
</tbody>
</table>

### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and are used to compile and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster,
and is mounted in all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

The Merlin6 cluster connectivity is based on [InfiniBand FDR and EDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This allows fast access to the data with very low latencies, as well as running extremely efficient MPI-based jobs.
The network speed (56Gbps for **FDR**, 100Gbps for **EDR**) of a machine can be checked by running the following command on the node:

```bash
ibstat | grep Rate
```

## Software

On the Merlin6 GPU computing nodes, we try to keep the software stack coherent with the main [Merlin6](/merlin6/index.html) cluster.

Hence, the Merlin6 GPU nodes run:

* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.5.2-2.2.0.0 or newer**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) for all **ConnectX-4** or newer cards.
docs/gmerlin6/slurm-configuration.md (new file)
@@ -0,0 +1,268 @@

---
title: Slurm cluster 'gmerlin6'
#tags:
keywords: configuration, partitions, node definition, gmerlin6
last_updated: 29 January 2021
summary: "This document describes a summary of the Slurm 'gmerlin6' configuration."
sidebar: merlin6_sidebar
permalink: /gmerlin6/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the GPU cluster.

## Merlin6 GPU nodes definition

The table below shows a summary of the hardware setup for the different GPU nodes (memory and swap values are in MB):

| Nodes              | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | GPU Type   | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :--------: | :-------: | :-------: |
| merlin-g-[001]     | 1 core    | 8 cores   | 1        | 5120        | 102400      | 102400       | 10000    | **geforce_gtx_1080**    | 1 | 2 |
| merlin-g-[002-005] | 1 core    | 20 cores  | 1        | 5120        | 102400      | 102400       | 10000    | **geforce_gtx_1080**    | 1 | 4 |
| merlin-g-[006-009] | 1 core    | 20 cores  | 1        | 5120        | 102400      | 102400       | 10000    | **geforce_gtx_1080_ti** | 1 | 4 |
| merlin-g-[010-013] | 1 core    | 20 cores  | 1        | 5120        | 102400      | 102400       | 10000    | **geforce_rtx_2080_ti** | 1 | 4 |
| merlin-g-014       | 1 core    | 48 cores  | 1        | 5120        | 360448      | 360448       | 10000    | **geforce_rtx_2080_ti** | 1 | 8 |
| merlin-g-015       | 1 core    | 48 cores  | 1        | 5120        | 360448      | 360448       | 10000    | **A5000**               | 1 | 8 |
| merlin-g-100       | 1 core    | 128 cores | 2        | 3900        | 998400      | 998400       | 10000    | **A100**                | 1 | 8 |

{{site.data.alerts.tip}}Always check <b>'/etc/slurm/gres.conf'</b> and <b>'/etc/slurm/slurm.conf'</b> for changes in the GPU type and details of the hardware.
{{site.data.alerts.end}}

## Running jobs in the 'gmerlin6' cluster

In this chapter we cover the basic settings that users need to specify in order to run jobs in the GPU cluster.

### Merlin6 GPU cluster

To run jobs in the **`gmerlin6`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=gmerlin6
```

### Merlin6 GPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it defaults to **`gpu`**:

```bash
#SBATCH --partition=<partition_name>   # Possible <partition_name> values: gpu, gpu-short, gwendolen
```

The table below shows all partitions available to users:

| GPU Partition          | Default Time | Max Time   | PriorityJobFactor\* | PriorityTier\*\* |
|:---------------------: | :----------: | :--------: | :-----------------: | :--------------: |
| `gpu`                  | 1 day        | 1 week     | 1                   | 1                |
| `gpu-short`            | 2 hours      | 2 hours    | 1000                | 500              |
| `gwendolen`            | 30 minutes   | 2 hours    | 1000                | 1000             |
| `gwendolen-long`\*\*\* | 30 minutes   | 8 hours    | 1                   | 1                |

\*The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** may affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

\*\*Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with a lower *PriorityTier* value
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

\*\*\***gwendolen-long** is a special partition which is enabled during non-working hours only. As of _Nov 2023_, the current policy is to disable this partition from Mon to Fri, from 1am to 5pm. Jobs can be submitted at any time, but can only be scheduled outside this time range.

### Merlin6 GPU Accounts

Users need to ensure that the public **`merlin`** account is specified. Not specifying an account will default to this account.
This is mostly relevant for users with multiple Slurm accounts, who might specify a different account by mistake.

```bash
#SBATCH --account=merlin   # Possible values: merlin, gwendolen
```

Not all accounts can be used on all partitions. This is summarized in the table below:

| Slurm Account        | Slurm Partitions             |
|:-------------------: | :------------------:         |
| **`merlin`**         | **`gpu`**,`gpu-short`        |
| `gwendolen`          | `gwendolen`,`gwendolen-long` |

By default, all users belong to the `merlin` Slurm account, and jobs are submitted to the `gpu` partition when no partition is defined.

Users only need to specify the `gwendolen` account when using the `gwendolen` or `gwendolen-long` partitions; otherwise specifying the account is not needed (it always defaults to `merlin`).

#### The 'gwendolen' account

To run jobs in the **`gwendolen`**/**`gwendolen-long`** partitions, users must specify the **`gwendolen`** account.
The `merlin` account is not allowed to use the Gwendolen partitions.

Gwendolen is restricted to the set of users belonging to the **`unx-gwendolen`** Unix group. If you belong to a project allowed to use **Gwendolen**, or you would like to have access to it, please request access to the **`unx-gwendolen`** Unix group through [PSI Service Now](https://psi.service-now.com/): the request will be redirected to the responsible of the project (Andreas Adelmann).

### Slurm GPU specific options

Some options are only available when using GPUs. These are detailed below.

#### Number of GPUs and type

When using the GPU cluster, users **must** specify the number of GPUs they need:

```bash
#SBATCH --gpus=[<type>:]<number>
```
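The mandatory options above can be combined into a single batch script. A minimal sketch, where the partition, GPU type and count, time limit, and the `nvidia-smi` payload are illustrative only and should be adapted to your job:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6          # mandatory: target the GPU cluster
#SBATCH --partition=gpu             # gpu, gpu-short, or gwendolen
#SBATCH --account=merlin            # merlin (default) or gwendolen
#SBATCH --gpus=geforce_gtx_1080:2   # mandatory: number of GPUs; type is optional
#SBATCH --time=01:00:00

# Illustrative payload: show the allocated GPUs.
srun nvidia-smi
```

Submit the script with `sbatch` as usual.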
|
||||
The GPU type is optional: if left empty, it will try allocating any type of GPU.
|
||||
The different `[<type>:]` values and `<number>` of GPUs depends on the node.
|
||||
This is detailed in the below table.
|
||||
|
||||
| Nodes | GPU Type | #GPUs |
|
||||
|:---------------------: | :-----------------------: | :---: |
|
||||
| **merlin-g-[001]** | **`geforce_gtx_1080`** | 2 |
|
||||
| **merlin-g-[002-005]** | **`geforce_gtx_1080`** | 4 |
|
||||
| **merlin-g-[006-009]** | **`geforce_gtx_1080_ti`** | 4 |
|
||||
| **merlin-g-[010-013]** | **`geforce_rtx_2080_ti`** | 4 |
|
||||
| **merlin-g-014** | **`geforce_rtx_2080_ti`** | 8 |
|
||||
| **merlin-g-015** | **`A5000`** | 8 |
|
||||
| **merlin-g-100** | **`A100`** | 8 |
|
||||
|
||||
#### Constraint / Features
|
||||
|
||||
Instead of specifying the GPU **type**, sometimes users would need to **specify the GPU by the amount of memory available in the GPU** card itself.
|
||||
This has been defined in Slurm with **Features**, which is a tag which defines the GPU memory for the different GPU cards.
|
||||
Users can specify which GPU memory size needs to be used with the `--constraint` option. In that case, notice that *in many cases
|
||||
there is not need to specify `[<type>:]`* in the `--gpus` option.
|
||||
|
||||
```bash
|
||||
#SBATCH --contraint=<Feature> # Possible values: gpumem_8gb, gpumem_11gb, gpumem_24gb, gpumem_40gb
|
||||
```
|
||||
|
||||
The table below shows the available **Features** and which GPU card models and GPU nodes they belong to:
|
||||
|
||||
<table>
|
||||
<thead>
|
||||
<tr>
|
||||
<th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="3">Merlin6 GPU Computing Nodes</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Nodes</th>
|
||||
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">GPU Type</th>
|
||||
<th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Feature</th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr style="vertical-align:middle;text-align:center;" ralign="center">
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[001-005]</b></td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080`</td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_8gb`</b></td>
|
||||
</tr>
|
||||
<tr style="vertical-align:middle;text-align:center;" ralign="center">
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[006-009]</b></td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_gtx_1080_ti`</td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="2"><b>`gpumem_11gb`</b></td>
|
||||
</tr>
|
||||
<tr style="vertical-align:middle;text-align:center;" ralign="center">
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-[010-014]</b></td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`geforce_rtx_2080_ti`</td>
|
||||
</tr>
|
||||
<tr style="vertical-align:middle;text-align:center;" ralign="center">
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-015</b></td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`A5000`</td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_24gb`</b></td>
|
||||
</tr>
|
||||
<tr style="vertical-align:middle;text-align:center;" ralign="center">
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-g-100</b></td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1">`A100`</td>
|
||||
<td markdown="span" style="vertical-align:middle;text-align:center;" rowspan="1"><b>`gpumem_40gb`</b></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
|
||||
#### Other GPU options
|
||||
|
||||
Alternative Slurm options for GPU based jobs are available. Please refer to the **man** pages
|
||||
for each Slurm command for further information about it (`man salloc`, `man sbatch`, `man srun`).
|
||||
Below are listed the most common settings:
|
||||
|
||||
```bash
|
||||
#SBATCH --hint=[no]multithread
|
||||
#SBATCH --ntasks=\<ntasks\>
|
||||
#SBATCH --ntasks-per-gpu=\<ntasks\>
|
||||
#SBATCH --mem-per-gpu=\<size[units]\>
|
||||
#SBATCH --cpus-per-gpu=\<ncpus\>
|
||||
#SBATCH --gpus-per-node=[\<type\>:]\<number\>
|
||||
#SBATCH --gpus-per-socket=[\<type\>:]\<number\>
|
||||
#SBATCH --gpus-per-task=[\<type\>:]\<number\>
|
||||
#SBATCH --gpu-bind=[verbose,]\<type\>
|
||||
```
|
||||
|
||||
Please, notice that when defining `[<type>:]` once, then all other options must use it too!
|
||||
|
||||
#### Dealing with Hyper-Threading
|
||||
|
||||
The **`gmerlin6`** cluster contains the partitions `gwendolen` and `gwendolen-long`, which have a node with Hyper-Threading enabled.
|
||||
In that case, one should always specify whether to use Hyper-Threading or not. If not defined, Slurm will
|
||||
generally use it (exceptions apply). For this machine, generally HT is recommended.
|
||||
|
||||
```bash
|
||||
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
|
||||
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
|
||||
```
|
||||
|
||||
## User and job limits
|
||||
|
||||
The GPU cluster contains some basic user and job limits to ensure that a single user can not overabuse the resources and a fair usage of the cluster.
|
||||
The limits are described below.
|
||||
|
||||
### Per job limits
|
||||
|
||||
These are limits applying to a single job. In other words, there is a maximum of resources a single job can use.
|
||||
Limits are defined using QoS, and this is usually set at the partition level. Limits are described in the table below with the format: `SlurmQoS(limits)`
|
||||
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):
|
||||
|
||||
| Partition | Slurm Account | Mon-Sun 0h-24h |
|
||||
|:------------------:| :------------: | :------------------------------------------: |
|
||||
| **gpu** | **`merlin`** | gpu_week(gres/gpu=8) |
|
||||
| **gpu-short** | **`merlin`** | gpu_week(gres/gpu=8) |
|
||||
| **gwendolen** | `gwendolen` | No limits |
|
||||
| **gwendolen-long** | `gwendolen` | No limits, active from 9pm to 5:30am |
|
||||
|
||||
* With the limits in the public `gpu` and `gpu-short` partitions, a single job using the `merlin` account
  (the default account) can not use more than 40 CPUs, 8 GPUs, or 200GB of memory.
  Any job exceeding these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerJob`**.
  Since there is no other QoS that temporarily overrides job limits during the week (as happens, for
  instance, in the CPU **daily** partition), such a job needs to be cancelled and resubmitted with
  resource requests adapted to the limits above.

* The **gwendolen** and **gwendolen-long** partitions are two special partitions for a **[NVIDIA DGX A100](https://www.nvidia.com/en-us/data-center/dgx-a100/)** machine.
  Only users belonging to the **`unx-gwendolen`** Unix group can run in these partitions. No limits are applied (the machine's resources can be used completely).

* The **`gwendolen-long`** partition is available 24h. However,
    * from 5:30am to 9pm the partition is `down` (jobs can be submitted, but can not run until the partition is set to `active`);
    * from 9pm to 5:30am jobs are allowed to run (the partition is set to `active`).

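As an illustration, a batch script requesting exactly the per-job maxima on the public `gpu` partition could look like the sketch below. The cluster name (`gmerlin6`), the application name, and the use of `--gres` rather than newer GPU flags are assumptions, not taken from this page:

```shell
#!/bin/bash
#SBATCH --clusters=gmerlin6      # GPU Slurm cluster (assumed name)
#SBATCH --partition=gpu          # Public GPU partition
#SBATCH --account=merlin         # Default account
#SBATCH --gres=gpu:8             # Per-job GPU cap (gpu_week QoS: gres/gpu=8)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40       # Per-job CPU cap
#SBATCH --mem=200G               # Per-job memory cap

srun ./my_gpu_application        # Placeholder for the actual workload
```

Requesting anything above these values would leave the job pending with a `QOSMax[Cpu|GRES|Mem]PerJob` reason, as described above.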
### Per user limits for GPU partitions

These limits apply to each user; in other words, they define the maximum resources a single user can use at once.

Limits are defined using QoS, usually set at the partition level. They are described in the table below in the format `SlurmQoS(limits)`
(possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`):

| Partition          | Slurm Account | Mon-Sun 0h-24h                       |
|:------------------:|:-------------:|:------------------------------------:|
| **gpu**            | **`merlin`**  | gpu_week(gres/gpu=16)                |
| **gpu-short**      | **`merlin`**  | gpu_week(gres/gpu=16)                |
| **gwendolen**      | `gwendolen`   | No limits                            |
| **gwendolen-long** | `gwendolen`   | No limits, active from 9pm to 5:30am |

* With the limits in the public `gpu` and `gpu-short` partitions, a single user can not use more than 80 CPUs, 16 GPUs, or 400GB of memory.
  Jobs submitted by a user who already exceeds these limits will stay in the queue with the message **`QOSMax[Cpu|GRES|Mem]PerUser`**.
  In that case, the jobs can wait in the queue until some of the user's running resources are freed.

* Notice that the user limits are twice the job limits. This way, a user can run up to two 8-GPU jobs, up to four 4-GPU jobs, etc.
  Please avoid occupying all GPUs of the same type for several hours or multiple days, as this would block other users needing that
  type of GPU.

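The configured caps can be inspected directly with standard Slurm tools; for example (the exact field selections below assume a standard Slurm installation, not anything specific to this cluster):

```shell
# List QoS definitions with their per-job and per-user resource caps
sacctmgr show qos format=Name%15,MaxTRESPerJob%30,MaxTRESPerUser%30

# Show why a pending job is not starting (look for QOSMax*PerJob / QOSMax*PerUser)
squeue -u $USER -O JobID,Partition,StateCompact,Reason
```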
## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated in the same batch system.

To understand the Slurm configuration of the cluster, it may be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).
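In a multi-clustered setup, most Slurm commands accept a `--clusters` option. A few illustrative commands for exploring such a setup (standard Slurm flags; nothing cluster-specific is assumed):

```shell
# List the clusters registered in the central Slurm database
sacctmgr show clusters format=Cluster,ControlHost

# Show your jobs across all clusters at once
squeue --clusters=all -u $USER

# Inspect the live configuration of the cluster you are logged in to
scontrol show config | head
```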
BIN
docs/images/ANSYS/HFSS/01_Select_Scheduler_Menu.png
Normal file
BIN
docs/images/ANSYS/HFSS/02_Select_Scheduler_RSM_Remote.png
Normal file
BIN
docs/images/ANSYS/HFSS/03_Select_Scheduler_Slurm.png
Normal file
BIN
docs/images/ANSYS/HFSS/04_Submit_Job_Menu.png
Normal file
BIN
docs/images/ANSYS/HFSS/05_Submit_Job_Product_Path.png
Normal file
BIN
docs/images/ANSYS/cfx5launcher.png
Normal file
BIN
docs/images/ANSYS/merlin7/HFSS/01_Select_Scheduler_Menu.png
Normal file
BIN
docs/images/ANSYS/merlin7/HFSS/03_Select_Scheduler_Slurm.png
Normal file
BIN
docs/images/ANSYS/merlin7/HFSS/04_Submit_Job_Menu.png
Normal file
BIN
docs/images/ANSYS/merlin7/HFSS/05_Submit_Job_Product_Path.png
Normal file
BIN
docs/images/ANSYS/merlin7/cfx5launcher.png
Normal file
BIN
docs/images/ANSYS/merlin7/merlin7/cfx5launcher.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-1-add_hpc_resource.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-2-add_cluster.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-3-add_scratch_info.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-4-get_slurm_queues.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-5-authenticating.png
Normal file
BIN
docs/images/ANSYS/merlin7/rsm-6-selected-partitions.png
Normal file
BIN
docs/images/ANSYS/rsm-1-add_hpc_resource.png
Normal file
BIN
docs/images/ANSYS/rsm-2-add_cluster.png
Normal file
BIN
docs/images/ANSYS/rsm-3-add_scratch_info.png
Normal file
BIN
docs/images/ANSYS/rsm-4-get_slurm_queues.png
Normal file
BIN
docs/images/ANSYS/rsm-5-authenticating.png
Normal file
BIN
docs/images/ANSYS/rsm-6-selected-partitions.png
Normal file
BIN
docs/images/Access/01-request-merlin5-membership.png
Normal file
BIN
docs/images/Access/01-request-merlin6-membership.png
Normal file
BIN
docs/images/Access/01-request-merlin7-membership.png
Normal file
BIN
docs/images/Access/01-request-unx-group-membership.png
Normal file
BIN
docs/images/NoMachine/screen_nx1.png
Normal file
BIN
docs/images/NoMachine/screen_nx10.png
Normal file
BIN
docs/images/NoMachine/screen_nx2.png
Normal file
BIN
docs/images/NoMachine/screen_nx3.png
Normal file
BIN
docs/images/NoMachine/screen_nx4.png
Normal file
BIN
docs/images/NoMachine/screen_nx5.png
Normal file
BIN
docs/images/NoMachine/screen_nx6.png
Normal file
BIN
docs/images/NoMachine/screen_nx7.png
Normal file
BIN
docs/images/NoMachine/screen_nx8.png
Normal file
BIN
docs/images/NoMachine/screen_nx9.png
Normal file
BIN
docs/images/NoMachine/screen_nx_address.png
Normal file
BIN
docs/images/NoMachine/screen_nx_auth.png
Normal file
BIN
docs/images/NoMachine/screen_nx_configuration.png
Normal file
BIN
docs/images/NoMachine/screen_nx_single_session.png
Normal file
BIN
docs/images/PuTTY/Putty_Disable_Kerberos_GSSAPI.png
Normal file
BIN
docs/images/PuTTY/Putty_Mouse_XTerm.png
Normal file
BIN
docs/images/PuTTY/Putty_Session.png
Normal file
BIN
docs/images/PuTTY/Putty_X11_Forwarding.png
Normal file
BIN
docs/images/Slurm/scom.gif
Normal file
BIN
docs/images/Slurm/sview.png
Normal file
BIN
docs/images/WIP/WIP1.jpeg
Normal file
BIN
docs/images/WIP/WIP1.webp
Normal file
BIN
docs/images/favicon.ico
Normal file
BIN
docs/images/front_page.png
Normal file
BIN
docs/images/hpce_logo.png
Normal file
BIN
docs/images/hpce_logo_full.png
Normal file
BIN
docs/images/jupyter-launch-classic.png
Normal file
BIN
docs/images/jupyter-nbextensions.png
Normal file
BIN
docs/images/jupytext_menu.png
Normal file
BIN
docs/images/merlin-slurm-architecture.png
Normal file
BIN
docs/images/merlinschema3.png
Normal file
BIN
docs/images/psi-logo.png
Normal file
BIN
docs/images/rmount/mount.png
Normal file
BIN
docs/images/rmount/select-mount.png
Normal file
BIN
docs/images/rmount/thunar_mount.png
Normal file
BIN
docs/images/scicat_token.png
Normal file
26
docs/index.md
Normal file
@@ -0,0 +1,26 @@

---
hide:
  - navigation
  - toc
---

# HPCE User Documentation

{ width="500" }
/// caption
The magical trio 🪄
///

The [HPCE
group](https://www.psi.ch/en/awi/high-performance-computing-and-emerging-technologies-group)
is part of the [PSI Center for Scientific Computing, Theory and
Data](https://www.psi.ch/en/csd) at [Paul Scherrer
Institute](https://www.psi.ch). It provides a range of HPC services for PSI
scientists, such as the Merlin series of HPC clusters, and also engages in
research on the technologies (data analysis and machine learning) used on
these systems.

## Quick Links

- user support
- news
18
docs/meg/01-Quick-Start-Guide/introduction.md
Normal file
@@ -0,0 +1,18 @@

---
title: Introduction
#tags:
keywords: introduction, home, welcome, architecture, design
last_updated: 07 September 2022
#summary: "MeG cluster overview"
sidebar: meg_sidebar
permalink: /meg/introduction.html
redirect_from:
  - /meg
  - /meg/index.html
---

## The MEG local HPC cluster

"The MEG II collaboration includes almost 70 physicists from research institutions from five countries. Researchers and technicians from PSI have played a leading role, particularly with providing the high-quality beam, technical support in the detector integration, and in the design, construction, and operation of the detector readout electronics." [Source](https://www.psi.ch/en/cnm/news/in-search-of-new-physics-new-result-from-the-meg-ii-collaboration)

The MEG data analysis cluster is tightly coupled to Merlin and dedicated to the analysis of data from the MEG experiment. It is operated for the Muon Physics group.
50
docs/meg/99-support/contact.md
Normal file
@@ -0,0 +1,50 @@

---
title: Contact
#tags:
keywords: contact, support, snow, service now, mailing list, mailing, email, mail, meg-admins@lists.psi.ch, merlin users
last_updated: 15. Jan 2025
#summary: ""
sidebar: meg_sidebar
permalink: /meg/contact.html
---

## Support

Support can be requested through:

* [PSI Service Now](https://psi.service-now.com/psisp)
* E-Mail: <meg-admins@lists.psi.ch>

Basic contact information is also displayed on every shell login to the system via the *Message of the Day* mechanism.

### PSI Service Now

**[PSI Service Now](https://psi.service-now.com/psisp)** is the official PSI tool for opening incident requests. However, contact via email (see below) is preferred.

* PSI HelpDesk will redirect the incident to the corresponding department, or
* you can assign it directly by checking the box `I know which service is affected` and providing the service name `Local HPC Resources (e.g. MEG) [CF]` (just type in `Local` and you should get the valid completions).

### Contact the MEG Administrators

**E-Mail <meg-admins@lists.psi.ch>** or **<merlin-admins@lists.psi.ch>**

* This is the preferred way to contact the MEG administrators.
  Do not hesitate to contact us with any questions or problems.

---

## Get updated through the Merlin User list!

It is strongly recommended that users subscribe to the Merlin Users mailing list: **<merlin-users@lists.psi.ch>**

This mailing list is the official channel used by the Merlin administrators to inform users about downtimes,
interventions or problems. Users can subscribe in two ways:

* *(Preferred way)* Self-registration through **[Sympa](https://psilists.ethz.ch/sympa/info/merlin-users)**
* If you need to subscribe many people (e.g. your whole group), send a request to the admin list **<merlin-admins@lists.psi.ch>**
  providing a list of email addresses.

---

## The MEG Cluster Team

The PSI Merlin and MEG clusters are managed by the **[High Performance Computing and Emerging Technologies Group](https://www.psi.ch/de/lsm/hpce-group)**, which
is part of the [Science IT Infrastructure and Services department (AWI)](https://www.psi.ch/en/awi) in PSI's [Center for Scientific Computing, Theory and Data (CSD)](https://www.psi.ch/en/csd).
199
docs/meg/99-support/migration-to-merlin7.md
Normal file
@@ -0,0 +1,199 @@

---
#tags:
keywords: meg, merlin6, merlin7, migration, fpsync, rsync
#summary: ""
sidebar: meg_sidebar
last_updated: 28 May 2025
permalink: /meg/migrating.html
---

# MEG to Merlin7 Migration Guide

Welcome to the official documentation for migrating experiment data from **MEG** to **Merlin7**. Please follow the instructions carefully to ensure a smooth and secure transition.

---

## Directory Structure Changes

### MEG vs Merlin6 vs Merlin7

| Cluster | Home Directory     | User Data Directory | Experiment data       | Additional notes |
| ------- | :----------------- | :------------------ | --------------------- | ---------------- |
| merlin6 | /psi/home/`$USER`  | /data/user/`$USER`  | /data/experiments/meg | Symlink /meg     |
| meg     | /meg/home/`$USER`  | N/A                 | /meg                  |                  |
| merlin7 | /data/user/`$USER` | /data/user/`$USER`  | /data/project/meg     |                  |

* The **Merlin6 home and user data directories have been merged** into the single new home directory `/data/user/$USER` on Merlin7.
* The same applies to the home directory on the MEG cluster, which has to be merged into `/data/user/$USER` on Merlin7.
* Users are responsible for moving the data.
* The **experiment directory has been integrated into `/data/project/meg`**.

### Recommended Cleanup Actions

* Remove unused files and datasets.
* Archive large, inactive data sets.

### Mandatory Actions

* Stop activity on MEG and Merlin6 when performing the last rsync.

## Migration Instructions

### Preparation

An `experiment_migration.setup` migration script must be executed from **any MEG node** using the account that will perform the migration.

#### When using the local `root` account

- The script **must be executed after every reboot** of the destination nodes.
- **Reason:** On Merlin7, the home directory for the `root` user resides on ephemeral storage (no physical disk).
  After a reboot, this directory is cleaned, so **SSH keys need to be redeployed** before running the migration again.

#### When using a PSI Active Directory (AD) account

- Applicable accounts include, for example:
  - `gac-meg2_data`
  - `gac-meg2`
- The script only needs to be executed **once**, provided that:
  - The home directory for the AD account is located on a shared storage area.
  - This shared storage is accessible from the node executing the transfer.
- **Reason:** On Merlin7, these accounts have their home directories on persistent shared storage, so the SSH keys remain available across reboots.

To run it:

```bash
experiment_migration.setup
```

This script will:

* Check that you have an account on Merlin7.
* Configure and check that your environment is ready for transferring files via a Slurm job.

If there are issues, the script will:

* Print clear diagnostic output
* Give you some hints to resolve the issue

If you are stuck, email: [merlin-admins@lists.psi.ch](mailto:merlin-admins@lists.psi.ch) or [meg-admins@lists.psi.ch](mailto:meg-admins@lists.psi.ch)

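After the setup script finishes, a quick sanity check is to confirm that passwordless SSH towards a Merlin7 login node works, since the migration relies on it. The login node name below is taken from the transfer examples later on this page:

```shell
# Should print the remote hostname without prompting for a password
ssh login001.merlin7.psi.ch hostname
```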

### Migration Procedure

1. **Run an initial sync**, ideally within a `tmux` session.
    * This copies the bulk of the data from MEG to Merlin7.
    * **IMPORTANT: Do not modify the destination directories.**
    * Before starting the transfer, please ensure that:
        * The source and destination directories are correct.
        * The destination directories exist.
2. **Run additional syncs if needed.**
    * Subsequent syncs can be executed to transfer changes.
    * Ensure that **only one sync for the same directory runs at a time**.
    * Multiple syncs are often required, since the first one may take several hours or even days.
3. Schedule a date for the final migration:
    * Any activity must be stopped on the source directory.
    * Likewise, no activity must take place on the destination until the migration is complete.
4. **Perform a final sync with the `-E` option** (if it applies).
    * Use `-E` **only if you need to delete files on the destination that were removed from the source.**
    * This ensures the destination becomes an exact mirror of the source.
    * **Never use `-E` after the destination has gone into production**, as it will delete new data created there.
5. Disable access on the source folder.
6. Enable access on the destination folder.
    * At this point, **no new syncs must be performed.**

> ⚠️ **Important Notes**
> The `-E` option is destructive; handle with care.
> Always verify that the destination is ready before triggering the final sync.
> For optimal performance, use up to 12 threads with the `-t` option.

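The steps above can be sketched as a short session; the tmux session name and the project argument are illustrative:

```shell
# Start (or re-attach to) a named tmux session so the sync survives SSH disconnects
tmux new-session -A -s meg-migration

# Inside the session, launch the initial sync with 12 threads (the recommended maximum)
experiment_migration.bash -t 12 -p "online"

# Detach with Ctrl-b d; re-attach later with:
tmux attach -t meg-migration
```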

#### Running The Migration Script

The migration script is installed on the `meg-s-001` server at:
`/usr/local/bin/experiment_migration.bash`

This script is primarily a **wrapper** around `fpsync`, providing additional logic for synchronizing MEG experiment data.

```bash
[root@meg-s-001 ~]# experiment_migration.bash --help
Usage: /usr/local/bin/experiment_migration.bash [options] -p <project_name>

Options:
  -t | --threads N                    Number of parallel threads (default: 10). Recommended 12 as max.
  -b | --experiment-src-basedir DIR   Experiment base directory (default: /meg)
  -S | --space-source SPACE           Source project space name (default: data1)
  -B | --experiment-dst-basedir DIR   Experiment base directory (default: /data/project/meg)
  -D | --space-destination SPACE      Destination project space name (default: data1)
  -p | --project-name PRJ_NAME        Mantadory field. MeG project name. Examples:
                                        - 'online'
                                        - 'offline'
                                        - 'shared'
  -F | --force-destination-mkdir      Create the destination parent directory (default: false)
                                        Example: mkdir -p $(dirname /data/project/meg/data1/PROJECT_NAME)
                                        Result:  mkdir -p /data/project/meg/data1
  -s | --split N                      Number of files per split (default: 20000)
  -f | --filesize SIZE                File size threshold (default: 100G)
  -r | --runid ID                     Reuse an existing runid session
  -l | --list-runids                  List available runid sessions and exit
  -x | --delete-runid                 Delete runid. Requires: -r | --runid ID
  -E | --rsync-delete-option          [WARNING] Use this to delete files in the destination
                                      which are not present in the source any more.
                                      [WARNING] USE THIS OPTION CAREFULLY!
                                      Typically used in last rsync to have an exact
                                      mirror of the source directory.
                                      [WARNING] Some files in destination might be deleted!
                                      Use 'man fpsync' for more information.

  -h | --help                         Show this help message
  -v | --verbose                      Run fpsync with -v option
```

> Defaults can be updated if necessary.

#### Migration examples

##### Example: Migrating the Entire `online` Directory

The following example demonstrates how to migrate the **entire `online`** directory.

> 💡 **Tip:**
> You may also choose to migrate only specific subdirectories if needed.
> However, migrating full directories is generally **simpler** and **less error-prone** than handling multiple subdirectory migrations.

```bash
[root@meg-s-001 ~]# experiment_migration.bash -S data1 -D data1 -p "online"
🔄 Transferring project:
   From:    /meg/data1/online
   To:      login001.merlin7.psi.ch:/data/project/meg/data1/online
   Threads: 10 | Split: 20000 files | Max size: 100G
   RunID:

Please confirm to start (y/N):
❌ Transfer cancelled by user.
```

##### Example: Migrating a Specific Subdirectory

The following example demonstrates how to migrate **only a subdirectory**. In this case, we use the `-F` option to create the parent directory on the destination, ensuring that it exists before transferring:

⚠️ **Important:**
- When migrating a subdirectory, **do not** run concurrent migrations on its parent directories.
- For example, avoid running migrations with `-p "shared"` while simultaneously migrating `-p "shared/subprojects"`.

```bash
[root@meg-s-001 ~]# experiment_migration.bash -p "shared/subprojects/meg1" -F
🔄 Transferring project:
   From:    /meg/data1/shared/subprojects/meg1
   To:      login002.merlin7.psi.ch:/data/project/meg/data1/shared/subprojects/meg1
   Threads: 10 | Split: 20000 files | Max size: 100G
   RunID:

Please confirm to start (y/N): N
❌ Transfer cancelled by user.
```

This command initiates the migration of the directory, creating the destination parent directory first (`-F` option):
* Creates the destination directory as follows:

  ```bash
  ssh login002.merlin7.psi.ch mkdir -p /data/project/meg/data1/shared/subprojects
  ```
* Runs fpsync with 10 threads and N parts of at most 20000 files or 100G each:
  * Source: `/meg/data1/shared/subprojects/meg1`
  * Destination: `login002.merlin7.psi.ch:/data/project/meg/data1/shared/subprojects/meg1`
44
docs/merlin5/cluster-introduction.md
Normal file
@@ -0,0 +1,44 @@

---
title: Cluster 'merlin5'
#tags:
#keywords:
last_updated: 07 April 2021
#summary: "Merlin 5 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin5/cluster-introduction.html
---

## Slurm 'merlin5' cluster

**Merlin5** was the old official PSI Local HPC cluster for development and
mission-critical applications, built in 2016-2017. It was an
extension of the Merlin4 cluster and was built from existing hardware due
to a lack of central investment in Local HPC resources. **Merlin5** was
then replaced by the **[Merlin6](/merlin6/index.html)** cluster in 2019,
with a substantial central investment of ~1.5M CHF. **Merlin5** was mostly
based on CPU resources, but also contained a small number of GPU-based
resources which were mostly used by the BIO experiments.

**Merlin5** has been kept as a **Local HPC [Slurm](https://slurm.schedmd.com/overview.html) cluster**,
called **`merlin5`**. This way, the old CPU computing nodes are still available as extra computation resources
and as an extension of the official production **`merlin6`** [Slurm](https://slurm.schedmd.com/overview.html) cluster.

The old Merlin5 _**login nodes**_, _**GPU nodes**_ and _**storage**_ were fully migrated to the **[Merlin6](/merlin6/index.html)**
cluster, which became the **main Local HPC Cluster**. Hence, **[Merlin6](/merlin6/index.html)**
contains the storage which is mounted on the different Merlin HPC [Slurm](https://slurm.schedmd.com/overview.html) clusters (`merlin5`, `merlin6`, `gmerlin6`).

### Submitting jobs to 'merlin5'

Jobs must be submitted to the **`merlin5`** Slurm cluster from the **Merlin6** login nodes, by using
the option `--clusters=merlin5` with any of the Slurm commands (`sbatch`, `salloc`, `srun`, etc.).

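For illustration, submitting and inspecting jobs on `merlin5` from a Merlin6 login node might look like this (the job script name and task count are placeholders):

```shell
# Submit a batch job to the merlin5 cluster
sbatch --clusters=merlin5 my_job.sh

# Request an interactive allocation on merlin5
salloc --clusters=merlin5 --ntasks=4

# Show only your merlin5 jobs in the queue
squeue --clusters=merlin5 -u $USER
```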
## The Merlin Architecture

### Multi Non-Federated Cluster Architecture Design: The Merlin cluster

The following image shows the Slurm architecture design for the Merlin cluster.
It is a multi non-federated cluster setup, with a central Slurm database
and multiple independent clusters (`merlin5`, `merlin6`, `gmerlin6`):

97
docs/merlin5/hardware-and-software-description.md
Normal file
@@ -0,0 +1,97 @@

---
title: Hardware And Software Description
#tags:
#keywords:
last_updated: 09 April 2021
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin5/hardware-and-software.html
---

## Hardware

### Computing Nodes

Merlin5 is built from recycled nodes, and hardware will be decommissioned as soon as it fails (due to expired warranty and the age of the cluster).
* Merlin5 is based on the [**HPE c7000 Enclosure**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04128339) solution, with 16 x [**HPE ProLiant BL460c Gen8**](https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=c04123239) nodes per chassis.
* Connectivity is based on InfiniBand **ConnectX-3 QDR-40Gbps**:
  * 16 internal ports for intra-chassis communication
  * 2 connected external ports for inter-chassis communication and storage access.

The table below summarizes the hardware setup of the Merlin5 computing nodes:

<table>
  <thead>
    <tr>
      <th scope='colgroup' style="vertical-align:middle;text-align:center;" colspan="8">Merlin5 CPU Computing Nodes</th>
    </tr>
    <tr>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Chassis</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Node</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Processor</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Sockets</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Cores</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Threads</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Scratch</th>
      <th scope='col' style="vertical-align:middle;text-align:center;" colspan="1">Memory</th>
    </tr>
  </thead>
  <tbody>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#0</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[18-30]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td rowspan="1"><b>merlin-c-[31,32]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><b>#1</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>merlin-c-[33-45]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2"><a href="https://ark.intel.com/content/www/us/en/ark/products/64595/intel-xeon-processor-e5-2670-20m-cache-2-60-ghz-8-00-gt-s-intel-qpi.html">Intel Xeon E5-2670</a></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">2</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">16</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">1</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="2">50GB</td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1">64GB</td>
    </tr>
    <tr style="vertical-align:middle;text-align:center;">
      <td rowspan="1"><b>merlin-c-[46,47]</b></td>
      <td style="vertical-align:middle;text-align:center;" rowspan="1"><b>128GB</b></td>
    </tr>
  </tbody>
</table>

### Login Nodes

The login nodes are part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster
and are used to compile software and to submit jobs to the different ***Merlin Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Storage

The storage is part of the **[Merlin6](/merlin6/introduction.html)** HPC cluster
and is mounted on all the ***Slurm clusters*** (`merlin5`, `merlin6`, `gmerlin6`, etc.).
Please refer to the **[Merlin6 Hardware Documentation](/merlin6/hardware-and-software.html)** for further information.

### Network

Merlin5 cluster connectivity is based on [Infiniband QDR](https://en.wikipedia.org/wiki/InfiniBand) technology.
This allows fast, very low latency access to the data, as well as running highly efficient MPI-based jobs.
However, this is an old generation of Infiniband: it requires older drivers, and software cannot take advantage of the latest features.

## Software

In Merlin5, we try to keep the software stack coherent with the main cluster [Merlin6](/merlin6/index.html).

Due to this, Merlin5 runs:
* [**RedHat Enterprise Linux 7**](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.9_release_notes/index)
* [**Slurm**](https://slurm.schedmd.com/), which we usually try to keep up to date with the most recent versions.
* [**GPFS v5**](https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/ibmspectrumscale502_welcome.html)
* [**MLNX_OFED LTS v.4.9-2.2.4.0**](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), an old version, but required because **ConnectX-3** support has been dropped in newer OFED versions.

docs/merlin5/slurm-configuration.md (new file, 142 lines)
@@ -0,0 +1,142 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 20 May 2021
summary: "This document summarizes the Merlin5 Slurm configuration."
sidebar: merlin6_sidebar
permalink: /merlin5/slurm-configuration.html
---

This documentation shows the basic Slurm configuration and the options needed to run jobs in the Merlin5 cluster.

The Merlin5 cluster is an old cluster with old hardware, maintained on a best-effort basis to increase the CPU capacity of the Merlin cluster.

## Merlin5 CPU nodes definition

The following table shows the default and maximum resources that can be used per node:

| Nodes            | Def.#CPUs | Max.#CPUs | #Threads | Max.Mem/Node (MB) | Max.Swap (MB) |
|:----------------:| ---------:| :--------:| :------: | :---------------: | :-----------: |
| merlin-c-[18-30] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[31-32] | 1 core    | 16 cores  | 1        | 124000            | 10000         |
| merlin-c-[33-45] | 1 core    | 16 cores  | 1        | 60000             | 10000         |
| merlin-c-[46-47] | 1 core    | 16 cores  | 1        | 124000            | 10000         |

There is one *main difference between the Merlin5 and Merlin6 clusters*: Merlin5 keeps an old configuration which does not
consider memory a *consumable resource*. Hence, users can *oversubscribe* memory. This might trigger some side effects, but
this legacy configuration has been kept to ensure that old jobs keep running the same way they did a few years ago.
If you know that this might be a problem for you, please always use Merlin6 instead.

## Running jobs in the 'merlin5' cluster

In this chapter we will cover the basic settings that users need to specify in order to run jobs in the Merlin5 CPU cluster.

### Merlin5 CPU cluster

To run jobs in the **`merlin5`** cluster, users **must** specify the cluster name in Slurm:

```bash
#SBATCH --cluster=merlin5
```

### Merlin5 CPU partitions

Users might need to specify the Slurm partition. If no partition is specified, it will default to **`merlin`**:

```bash
#SBATCH --partition=<partition_name>  # Possible <partition_name> values: merlin, merlin-long
```

The table below summarizes all partitions available to users:

| CPU Partition      | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>merlin</u>**  | 5 days       | 1 week   | All nodes | 500                 | 1                |
| **merlin-long**    | 5 days       | 21 days  | 4         | 1                   | 1                |

**\***The **PriorityJobFactor** value is added to the job priority (*PARTITION* column in `sprio -l`). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or, mainly, **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt to allocate jobs on partitions with higher priority before partitions with lower priority.

**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with lower *PriorityTier* values
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.

The **`merlin-long`** partition **is limited to 4 nodes**, as it might contain jobs running for up to 21 days.

### Merlin5 CPU Accounts

Users need to ensure that the public **`merlin`** account is specified. Not specifying any account option will default to this account.
Setting the account explicitly is mostly needed by users who have multiple Slurm accounts and might, by mistake, use a different one.

```bash
#SBATCH --account=merlin  # Possible values: merlin
```

### Slurm CPU specific options

Some specific options are available when using CPUs. The most common ones are listed below.
Please refer to the **man** pages of each Slurm command for further information about them (`man salloc`, `man sbatch`, `man srun`):

```bash
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type>  # only for 'srun' command
```

Notice that in **Merlin5** no hyper-threading is available (while in **Merlin6** it is).
Hence, in **Merlin5** there is no need to specify `--hint` hyper-threading related options.
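
Putting the settings of this chapter together, a minimal **`merlin5`** batch script could look as follows (the job name, task count and run time are illustrative values, not recommendations):

```bash
#!/bin/bash
#SBATCH --cluster=merlin5        # mandatory: target the Merlin5 cluster
#SBATCH --partition=merlin       # default partition; use 'merlin-long' for jobs over 1 week
#SBATCH --account=merlin         # public Merlin5 account
#SBATCH --job-name=example       # illustrative job name
#SBATCH --ntasks=16              # e.g. one full node (16 cores, no hyper-threading)
#SBATCH --time=01:00:00          # illustrative run time

# Job payload: report where the job runs and with how many tasks.
msg="Running with ${SLURM_NTASKS:-16} tasks on $(hostname)"
echo "$msg"
```

Submit it with `sbatch script.sh` from a login node. Since the `#SBATCH` lines are plain shell comments, the payload itself can also be smoke-tested locally with `bash script.sh`.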
## User and job limits

In the CPU cluster we enforce some limits which apply to jobs and users. The idea behind this is to ensure fair usage of the resources and to
avoid abuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
pending jobs from a single user while many nodes sit idle due to low overall activity is something that can be seen when user limits are applied).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources of the batch system would drain the entire cluster in order to fit the job, which is undesirable).

Hence, limits need to be set up wisely, ensuring fair usage of the resources and optimizing the overall efficiency
of the cluster while allowing jobs of different natures and sizes (that is, **single-core** *vs* **parallel jobs** of different sizes) to run.

In the **`merlin5`** cluster, as not many users are running on it, these limits are wider than the ones set in the **`merlin6`** and **`gmerlin6`** clusters.

### Per job limits

These are limits which apply to a single job; in other words, there is a maximum amount of resources a single job can use. These limits are described in the table below,
in the format `SlurmQoS(limits)` (the `SlurmQoS` values can be listed with the `sacctmgr show qos` command):

| Partition        | Mon-Sun 0h-24h   | Other limits |
|:---------------: | :--------------: | :----------: |
| **merlin**       | merlin5(cpu=384) | None         |
| **merlin-long**  | merlin5(cpu=384) | Max. 4 nodes |

By default, by QoS limits, a job can not use more than 384 cores (max CPU per job).
For `merlin-long`, this is even more restricted: there is an extra limit of 4 dedicated nodes for this partition. This is defined
at the partition level, and it overrides any QoS limit whenever it is more restrictive.

### Per user limits for CPU partitions

No user limits apply by QoS. For the **`merlin`** partition, a single user could fill the whole batch system with jobs (however, the restriction on job size, as explained above, still applies). For the **`merlin-long`** partition, the 4 node limitation also still applies.

## Advanced Slurm configuration

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.

To understand the Slurm configuration setup of the cluster, it may sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to login nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on one of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).

@@ -0,0 +1,56 @@
---
title: Accessing Interactive Nodes
#tags:
keywords: How to, HowTo, access, accessing, nomachine, ssh
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/interactive.html
---

## SSH Access

For interactive command shell access, use an SSH client. We recommend activating SSH's X11 forwarding to allow you to use graphical
applications (e.g. a text editor; for more performant graphical access, refer to the sections below). X applications are supported
on the login nodes, and X11 forwarding can be used by those users who have properly configured X11 support on their desktops. However:

* Merlin6 administrators **do not offer support** for user desktop configuration (Windows, MacOS, Linux).
  * Hence, Merlin6 administrators **do not offer official support** for X11 client setup.
  * Nevertheless, a generic guide for X11 client setup (*Linux*, *Windows* and *MacOS*) is provided below.
* PSI desktop configuration issues must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
  * The ticket will be redirected to the corresponding desktop support group (Windows, Linux).
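
As a generic, unsupported starting point for Linux/MacOS clients, X11 forwarding can be enabled per host in `~/.ssh/config`. The host alias and username below are placeholders; `merlin-l-001.psi.ch` is one of the login nodes listed further down:

```
Host merlin
    HostName merlin-l-001.psi.ch
    User <your_psi_username>
    ForwardX11 yes
    ForwardX11Trusted yes
```

With this in place, `ssh merlin` opens a session with trusted X11 forwarding (the equivalent of `ssh -Y`).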

### Accessing from a Linux client

Refer to [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html) for **Linux** SSH client and X11 configuration.

### Accessing from a Windows client

Refer to [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html) for **Windows** SSH client and X11 configuration.

### Accessing from a MacOS client

Refer to [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html) for **MacOS** SSH client and X11 configuration.

## NoMachine Remote Desktop Access

X applications are supported on the login nodes and can run efficiently through a **NoMachine** client. This is the officially supported way to run more demanding X applications on Merlin6.
* For PSI Windows workstations, this can be installed from the Software Kiosk as 'NX Client'. If you have difficulties installing it, please request support through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* For other workstations, the client software can be downloaded from the [NoMachine website](https://www.nomachine.com/product&p=NoMachine%20Enterprise%20Client).

### Configuring NoMachine

Refer to [{How To Use Merlin -> Remote Desktop Access}](/merlin6/nomachine.html) for further instructions on how to configure the NoMachine client and how to access it from inside and outside PSI.

## Login nodes hardware description

The Merlin6 login nodes are the official machines for accessing the resources of Merlin6.
From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software.

The Merlin6 login nodes are the following:

| Hostname            | SSH | NoMachine | #Cores | #Threads | CPU                   | Memory | Scratch    | Scratch Mountpoint  |
| ------------------- | --- | --------- | ------ |:--------:| :-------------------- | ------ | ---------- | :------------------ |
| merlin-l-001.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6152  | 384GB  | 1.8TB NVMe | ``/scratch``        |
| merlin-l-002.psi.ch | yes | yes       | 2 x 22 | 2        | Intel Xeon Gold 6142  | 384GB  | 1.8TB NVMe | ``/scratch``        |
| merlin-l-01.psi.ch  | yes | -         | 2 x 16 | 2        | Intel Xeon E5-2697Av4 | 512GB  | 100GB SAS  | ``/scratch``        |

docs/merlin6/01-Quick-Start-Guide/accessing-slurm.md (new file, 53 lines)
@@ -0,0 +1,53 @@
---
title: Accessing Slurm Cluster
#tags:
keywords: slurm, batch system, merlin5, merlin6, gmerlin6, cpu, gpu
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-access.html
---

## The Merlin Slurm clusters

Merlin is a multi-cluster setup, where multiple Slurm clusters coexist under the same umbrella.
It contains the following clusters:

* The **Merlin6 Slurm CPU cluster**, called [**`merlin6`**](/merlin6/slurm-access.html#merlin6-cpu-cluster-access).
* The **Merlin6 Slurm GPU cluster**, called [**`gmerlin6`**](/merlin6/slurm-access.html#merlin6-gpu-cluster-access).
* The *old Merlin5 Slurm CPU cluster*, called [**`merlin5`**](/merlin6/slurm-access.html#merlin5-cpu-cluster-access), still supported on a best-effort basis.
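
The target cluster can be selected either inside the batch script or at submission time. A minimal sketch (the payload is illustrative):

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6   # target cluster: merlin6 (the default), gmerlin6 or merlin5

# Illustrative payload
payload="job payload runs here"
echo "$payload"
```

Command-line options take precedence over `#SBATCH` directives, so `sbatch --cluster=merlin5 job.sh` would redirect the same script to Merlin5 without editing it.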

## Accessing the Slurm clusters

Any job submission must be performed from a **Merlin login node**. Please refer to the [**Accessing the Interactive Nodes documentation**](/merlin6/interactive.html)
for further information about how to access the cluster.

In addition, any job *must be submitted from a high performance storage area visible to the login nodes and the computing nodes*. The possible storage areas are the following:
* `/data/user`
* `/data/project`
* `/shared-scratch`

Please avoid using `/psi/home` directories for submitting jobs.
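
Before submitting, a quick sanity check that the current working directory is one of these areas can be scripted. This is only a convenience sketch based on the list above, not an official tool:

```bash
#!/bin/bash
# Check whether a directory is a recommended job submission area.
check_submit_dir() {
    case "$1" in
        /data/user/*|/data/project/*|/shared-scratch/*) echo "OK: $1" ;;
        *) echo "WARNING: $1 is not a recommended submission area" ;;
    esac
}

check_submit_dir "$PWD"
```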

### Merlin6 CPU cluster access

The **Merlin6 CPU cluster** (**`merlin6`**) is the default cluster configured on the login nodes. Any job submission will use this cluster by default, unless
the option `--cluster` is specified with another of the existing clusters.

For further information about how to use this cluster, please visit the [**Merlin6 CPU Slurm Cluster documentation**](/merlin6/slurm-configuration.html).

### Merlin6 GPU cluster access

The **Merlin6 GPU cluster** (**`gmerlin6`**) is visible from the login nodes. However, to submit jobs to this cluster, one needs to specify the option `--cluster=gmerlin6` when submitting a job or allocation.

For further information about how to use this cluster, please visit the [**Merlin6 GPU Slurm Cluster documentation**](/gmerlin6/slurm-configuration.html).

### Merlin5 CPU cluster access

The **Merlin5 CPU cluster** (**`merlin5`**) is visible from the login nodes. However, to submit jobs
to this cluster, one needs to specify the option `--cluster=merlin5` when submitting a job or allocation.

Using this cluster is in general not recommended; however, it is still available for old users needing
extra computational resources or longer jobs. Keep in mind that this cluster is only supported on a
**best-effort basis**, and it contains very old hardware and configurations.

For further information about how to use this cluster, please visit the [**Merlin5 CPU Slurm Cluster documentation**](/merlin5/slurm-configuration.html).

docs/merlin6/01-Quick-Start-Guide/code-of-conduct.md (new file, 52 lines)
@@ -0,0 +1,52 @@
---
title: Code Of Conduct
#tags:
keywords: code of conduct, rules, principle, policy, policies, administrator, backup
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/code-of-conduct.html
---

## The Basic principle

The basic principle is courtesy and consideration for other users.

* Merlin6 is a system shared by many users, therefore you are kindly requested to apply common courtesy in using its resources. Please follow our guidelines, which aim at providing and maintaining an efficient compute environment for all our users.
* Basic shell programming skills are an essential requirement in a Linux/UNIX HPC cluster environment; proficiency in shell programming is greatly beneficial.

## Interactive nodes

* The interactive nodes (also known as login nodes) are for development and quick testing:
  * It is **strictly forbidden to run production jobs** on the login nodes. All production jobs must be submitted to the batch system.
  * It is **forbidden to run long processes** occupying big parts of a login node's resources.
  * According to the previous rules, **misbehaving running processes will have to be killed** in order to keep the system responsive for other users.

## Batch system

* Make sure that no broken or run-away processes are left behind when your job is done. Keep the process space clean on all nodes.
* During the runtime of a job, it is mandatory to use the ``/scratch`` and ``/shared-scratch`` partitions for temporary data:
  * It is **forbidden** to use ``/data/user``, ``/data/project`` or ``/psi/home/`` for that purpose.
* Always remove files you do not need any more (e.g. core dumps, temporary files) as early as possible. Keep the disk space clean on all nodes.
* Prefer ``/scratch`` over ``/shared-scratch``, and use the latter only when you require the temporary files to be visible from multiple nodes.
* Read the description in **[Merlin6 directory structure](/merlin6/storage.html#merlin6-directories)** to learn the correct usage of each partition type.

## User and project data

* ***Users are responsible for backing up their own data***. It is recommended to back up the data on independent third-party systems (i.e. LTS, Archive, AFS, SwitchDrive, Windows Shares, etc.).
* **`/psi/home`** is the only directory where we can provide daily snapshots for one week, as it contains a small amount of data. These can be found in the directory **`/psi/home/.snapshot/`**.
* ***When a user leaves PSI, the user or their supervisor/team is responsible for backing up the data and moving it out of the cluster***: every few months, the storage space of old users who no longer have an existing and valid PSI account will be recycled.

{{site.data.alerts.warning}}When a user leaves PSI and their account has been removed, their storage space in Merlin may be recycled.
Hence, <b>when a user leaves PSI</b>, the user, their supervisor, or their team <b>must ensure that the data is backed up to an external storage</b>.
{{site.data.alerts.end}}

## System Administrator Rights

* The system administrator has the right to temporarily block access to Merlin6 for an account violating the Code of Conduct, in order to maintain the efficiency and stability of the system.
* Repeated violations by the same user will be escalated to the user's supervisor.
* The system administrator has the right to delete files in the **scratch** directories
  * after a job, if the job failed to clean up its files;
  * during the job, in order to prevent a job from destabilizing one or multiple nodes.
* The system administrator has the right to kill any misbehaving running processes.

docs/merlin6/01-Quick-Start-Guide/introduction.md (new file, 64 lines)
@@ -0,0 +1,64 @@
---
title: Introduction
#tags:
keywords: introduction, home, welcome, architecture, design
last_updated: 07 September 2022
#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/introduction.html
redirect_from:
  - /merlin6
  - /merlin6/index.html
---

## The Merlin local HPC cluster

Historically, the local HPC clusters at PSI were named **Merlin**. Over the years,
multiple generations of Merlin have been deployed.

At present, the **Merlin local HPC cluster** comprises _two_ generations:
* the old **Merlin5** cluster (the `merlin5` Slurm cluster), and
* the newest generation, **Merlin6**, which is divided into two Slurm clusters:
  * `merlin6` as the Slurm CPU cluster
  * `gmerlin6` as the Slurm GPU cluster.

Access to the different Slurm clusters is possible from the [**Merlin login nodes**](/merlin6/interactive.html),
which can be accessed through the [SSH protocol](/merlin6/interactive.html#ssh-access) or the [NoMachine (NX) service](/merlin6/nomachine.html).

The following image shows the Slurm architecture design for the Merlin5 & Merlin6 (CPU & GPU) clusters:

![Merlin6 Slurm Architecture](/images/slurm_architecture.png)

### Merlin6

Merlin6 is the official PSI local HPC cluster for development and
mission-critical applications. It was built in 2019 and replaces
the Merlin5 cluster.

Merlin6 is designed to be extensible, so it is technically possible to add
more compute nodes and cluster storage without a significant increase in
manpower and operational costs.

Merlin6 contains all the main services needed for running the cluster, including
**login nodes**, **storage**, **computing nodes** and other *subservices*,
connected to the central PSI IT infrastructure.

#### CPU and GPU Slurm clusters

The Merlin6 **computing nodes** are mostly based on **CPU** resources. However,
the cluster also contains a small number of **GPU**-based resources, which are mostly used
by the BIO Division and by Deep Learning projects.

These computational resources are split into **two** different **[Slurm](https://slurm.schedmd.com/overview.html)** clusters:
* The Merlin6 CPU nodes are in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster called [**`merlin6`**](/merlin6/slurm-configuration.html).
  * This is the **default Slurm cluster** configured on the login nodes: any job submitted without the option `--cluster` will be submitted to this cluster.
* The Merlin6 GPU resources are in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster called [**`gmerlin6`**](/gmerlin6/slurm-configuration.html).
  * Users submitting to the **`gmerlin6`** GPU cluster need to specify the option ``--cluster=gmerlin6``.

### Merlin5

The old Slurm **CPU** *merlin* cluster is still active and is maintained on a best-effort basis.

**Merlin5** only contains **computing node** resources, in a dedicated **[Slurm](https://slurm.schedmd.com/overview.html)** cluster.
* The Merlin5 CPU cluster is called [**merlin5**](/merlin5/slurm-configuration.html).

docs/merlin6/01-Quick-Start-Guide/requesting-accounts.md (new file, 47 lines)
@@ -0,0 +1,47 @@
---
title: Requesting Merlin Accounts
#tags:
keywords: registration, register, account, merlin5, merlin6, snow, service now
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/request-account.html
---

## Requesting Access to Merlin6

Access to Merlin6 is granted by membership of the PSI user account in the **`svc-cluster_merlin6`** group. Membership of this group also grants access to older generations of Merlin (`merlin5`).

Requesting **Merlin6** access *has to be done* with the corresponding **[Request Linux Group Membership](https://psi.service-now.com/psisp?id=psi_new_sc_cat_item&sys_id=84f2c0c81b04f110679febd9bb4bcbb1)** form, available in the [PSI Service Now Service Catalog](https://psi.service-now.com/psisp).

![Requesting access to Merlin6]({{ "/images/Access/request-merlin6-access.png" }})

Mandatory customizable fields are the following:
* **`Order Access for user`**, which defaults to the logged-in user. However, it is also possible to request access for another user.
* **`Request membership for group`**: for Merlin6, **`svc-cluster_merlin6`** must be selected.
* **`Justification`**: please add a short justification of why access to Merlin6 is necessary.

Once submitted, the Merlin responsible will approve the request as soon as possible (within the next few hours on working days). Once the request is approved, *it may take up to 30 minutes until the account is fully configured*.

## Requesting Access to Merlin5

Access to Merlin5 is granted by membership of the PSI user account in the **`svc-cluster_merlin5`** group. Membership of this group does not grant access to newer generations of Merlin (`merlin6`, `gmerlin6`, and future ones).

Requesting **Merlin5** access *has to be done* with the corresponding **[Request Linux Group Membership](https://psi.service-now.com/psisp?id=psi_new_sc_cat_item&sys_id=84f2c0c81b04f110679febd9bb4bcbb1)** form, available in the [PSI Service Now Service Catalog](https://psi.service-now.com/psisp).

![Requesting access to Merlin5]({{ "/images/Access/request-merlin5-access.png" }})

Mandatory customizable fields are the following:
* **`Order Access for user`**, which defaults to the logged-in user. However, it is also possible to request access for another user.
* **`Request membership for group`**: for Merlin5, **`svc-cluster_merlin5`** must be selected.
* **`Justification`**: please add a short justification of why access to Merlin5 is necessary.

Once submitted, the Merlin responsible will approve the request as soon as possible (within the next few hours on working days). Once the request is approved, *it may take up to 30 minutes until the account is fully configured*.

## Further documentation

Further information is also available in the Linux Central Documentation:
* [Unix Group / Group Management for users](https://linux.psi.ch/documentation/services/user-guide/unix_groups.html)
* [Unix Group / Group Management for group managers](https://linux.psi.ch/documentation/services/admin-guide/unix_groups.html)

**Special thanks** to the **Linux Central Team** and **AIT** for making this possible.

docs/merlin6/01-Quick-Start-Guide/requesting-projects.md (new file, 123 lines)
@@ -0,0 +1,123 @@
---
title: Requesting a Merlin Project
#tags:
keywords: merlin project, project, snow, service now
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/request-project.html
---

A project owns its own storage area in Merlin, which can be accessed by other group members.

Projects can receive a higher storage quota than user areas and should be the primary way of organizing bigger storage requirements
in a multi-user collaboration.

Access to a project's directories is governed by project members belonging to a common **Unix group**. You may use an existing
Unix group or you may have a new Unix group created especially for the project. The **project responsible** will be the owner of
the Unix group (*this is important*)!

This document explains how to request a new Unix group, how to request membership of existing groups, and the procedure for requesting a Merlin project.

## About Unix groups

Before requesting a Merlin project, it is important to have a Unix group that can be used to grant the different members of
the project access to it.

Unix groups in the PSI Active Directory (the PSI central database containing user and group information, and more) are identified by the `unx-` prefix, followed by a name.
In general, PSI employees working on Linux systems (including HPC clusters, like Merlin) can request a new Unix group and become responsible for managing it.
In addition, a list of administrators can be set. The administrators, together with the group manager, can approve or deny membership requests. Further information about this topic
is covered in the [Linux Documentation - Services Admin Guides: Unix Groups / Group Management](https://linux.psi.ch/documentation/services/admin-guide/unix_groups.html), managed by the Central Linux Team.

To grant access to specific Merlin project directories, users may need to be added to specific **Unix groups**:
* Each Merlin project (i.e. `/data/project/{bio|general}/$projectname`) or experiment (i.e. `/data/experiment/$experimentname`) directory has access restricted by ownership and group membership (with very few exceptions allowing public access).
* Users requiring access to a specific restricted project or experiment directory have to request membership of the corresponding Unix group owning the directory.

### Requesting a new Unix group

**If you need a new Unix group** to be created, you first need to request this group through a separate
**[PSI Service Now ticket](https://psi.service-now.com/psisp)**. **Please use the following template.**
You can also specify the login names of the initial group members and the **owner** of the group.
The owner of the group is the person who will be allowed to modify the group.

* Please open an *Incident Request* with subject:
```
Subject: Request for new unix group xxxx
```

* and base the text field of the request on this template
```
Dear HelpDesk

I would like to request a new unix group.

Unix Group Name: unx-xxxxx
Initial Group Members: xxxxx, yyyyy, zzzzz, ...
Group Owner: xxxxx
Group Administrators: aaaaa, bbbbb, ccccc, ....

Best regards,
```

### Requesting Unix group membership

Existing Merlin projects already have a Unix group assigned. To access a project, users must belong to the **Unix group** owning that project.
Supervisors should inform new users which extra groups are needed for their project(s). If this information is not known, one can check the permissions of the directory. For example:

```bash
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# ls -ltrhd /data/project/general/$projectname
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# ls -ltrhd /data/project/bio/$projectname
```
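
Whether you already belong to the owning group can also be checked directly with standard Linux tools. A small sketch (pass the project directory as the first argument; it defaults to the current directory for demonstration):

```bash
# Sketch: check whether your account already has the Unix group that
# owns a given directory. Uses only standard Linux tools (stat, id).
projectdir="${1:-$PWD}"

# Group owning the directory (the group you would need to join):
needed_group=$(stat -c '%G' "$projectdir")

# Compare against the groups of your account:
if id -nG | tr ' ' '\n' | grep -qx "$needed_group"; then
    echo "already a member of $needed_group"
else
    echo "membership of $needed_group must be requested via Service Now"
fi
```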

Requesting membership for a specific Unix group *has to be done* with the corresponding **[Request Linux Group Membership](https://psi.service-now.com/psisp?id=psi_new_sc_cat_item&sys_id=84f2c0c81b04f110679febd9bb4bcbb1)** form, available in the [PSI Service Now Service Catalog](https://psi.service-now.com/psisp).

![Image](../../images/ServiceNow/image2022-3-1_16-19-39.png)

Once submitted, the person responsible for the Unix group has to approve the request.

**Important note**: Requesting access to specific Unix groups requires validation by the person responsible for each group. If you ask for inclusion in many groups, it may take longer, since fulfilment of the request depends on more people.

Further information can be found in the [Linux Documentation - Services User guide: Unix Groups / Group Management](https://linux.psi.ch/documentation/services/user-guide/unix_groups.html).

### Managing Unix Groups

Other administrative operations on Unix groups are covered in the [Linux Documentation - Services Admin Guides: Unix Groups / Group Management](https://linux.psi.ch/documentation/services/admin-guide/unix_groups.html), managed by the Central Linux Team.

## Requesting a Merlin project

Once a Unix group is available, a Merlin project can be requested.
To request a project, please provide the following information in a **[PSI Service Now ticket](https://psi.service-now.com/psisp)**:

* Please open an *Incident Request* with subject:

```
Subject: [Merlin6] Project Request for project name xxxxxx
```

* and base the text field of the request on this template:

```
Dear HelpDesk

I would like to request a new Merlin6 project.

Project Name: xxxxx
UnixGroup: xxxxx # Must be an existing Unix Group

The project responsible is the Owner of the Unix Group.
If you need a storage quota exceeding the defaults, please provide a description
and motivation for the higher storage needs:

Storage Quota: 1TB with a maximum of 1M Files
Reason: (None for default 1TB/1M)

Best regards,
```

The **default storage quota** for a project is 1TB (with a maximum *Number of Files* of 1M). If you need a larger assignment, you
need to request this and provide a description of your storage needs.

## Further documentation

Further information is also available in the Linux Central Documentation:
* [Unix Group / Group Management for users](https://linux.psi.ch/documentation/services/user-guide/unix_groups.html)
* [Unix Group / Group Management for group managers](https://linux.psi.ch/documentation/services/admin-guide/unix_groups.html)

**Special thanks** to the **Linux Central Team** and **AIT** for making this possible.
379
docs/merlin6/02-How-To-Use-Merlin/archive.md
Normal file
@@ -0,0 +1,379 @@
---
title: Archive & PSI Data Catalog
#tags:
keywords: linux, archive, data catalog, archiving, lts, tape, long term storage, ingestion, datacatalog
last_updated: 31 January 2020
summary: "This document describes how to use the PSI Data Catalog for archiving Merlin6 data."
sidebar: merlin6_sidebar
permalink: /merlin6/archive.html
---

## PSI Data Catalog as a PSI Central Service

PSI provides access to the ***Data Catalog*** for **long-term data storage and retrieval**. Data is
stored on the ***PetaByte Archive*** at the **Swiss National Supercomputing Centre (CSCS)**.

The Data Catalog and Archive are suitable for:

* Raw data generated by PSI instruments
* Derived data produced by processing some inputs
* Data required to reproduce PSI research and publications

The Data Catalog is part of PSI's effort to conform to the FAIR principles for data management.
In accordance with this policy, ***data will be publicly released under CC-BY-SA 4.0 after an
embargo period expires.***

The Merlin cluster is connected to the Data Catalog. Hence, users can archive data stored in the
Merlin storage under the ``/data`` directories (currently, ``/data/user`` and ``/data/project``).
Archiving from other directories is also possible; however, the process is much slower, as data
cannot be directly retrieved by the PSI archive central servers (**central mode**) and needs to
be copied to them indirectly (**decentral mode**).

Archiving can be done from any node accessible by the users (usually from the login nodes).

{{site.data.alerts.tip}} Archiving can be done in two different ways:
<br>
<b>'Central mode':</b> Possible for the user and project data directories; this is the
fastest way, as it does not require a remote copy (data is directly retrieved by the central AIT servers from Merlin
through 'merlin-archive.psi.ch').
<br>
<br>
<b>'Decentral mode':</b> Possible for any directory; this is the slowest way of archiving, as it requires
copying ('rsync') the data from Merlin to the central AIT servers.
{{site.data.alerts.end}}

## Procedure

### Overview

Below are the main steps for using the Data Catalog.

* Ingest the dataset into the Data Catalog. This makes the data known to the Data Catalog system at PSI:
    * Prepare a metadata file describing the dataset
    * Run the **``datasetIngestor``** script
    * If necessary, the script will copy the data to the PSI archive servers
        * Usually this is necessary when archiving from directories other than **``/data/user``** or
          **``/data/project``**. It is also necessary when the Merlin export server (**``merlin-archive.psi.ch``**)
          is down for any reason.
* Archive the dataset:
    * Visit [https://discovery.psi.ch](https://discovery.psi.ch)
    * Click **``Archive``** for the dataset
    * The system will now copy the data to the PetaByte Archive at CSCS
* Retrieve data from the catalog:
    * Find the dataset on [https://discovery.psi.ch](https://discovery.psi.ch) and click **``Retrieve``**
    * Wait for the data to be copied to the PSI retrieval system
    * Run the **``datasetRetriever``** script

Since large datasets may take a lot of time to transfer, some steps are designed to happen in the
background. The discovery website can be used to track the progress of each step.

### Account Registration

Two types of account permit access to the Data Catalog. If your data was collected at a ***beamline***, you may
have been assigned a **``p-group``** (e.g. ``p12345``) for the experiment. Other users are assigned an **``a-group``**
(e.g. ``a-12345``).

Groups are usually assigned to a PI, and individual user accounts are then added to the group. This is done
upon user request through PSI Service Now. For existing **a-groups** and **p-groups**, you can follow the standard
central procedures. Alternatively, if you do not know how to do that, follow the Merlin6
**[Requesting extra Unix groups](/merlin6/request-account.html#requesting-extra-unix-groups)** procedure, or open
a **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.

### Documentation

Accessing the Data Catalog is done through the [SciCat software](https://melanie.gitpages.psi.ch/SciCatPages/).
Documentation is here: [ingestManual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html).

#### Loading datacatalog tools

The latest datacatalog software is maintained in the PSI module system. To access it from the Merlin systems, run the following command:

```bash
module load datacatalog
```

This can be done from any host in the Merlin cluster accessible by users. Usually, the login nodes are used for archiving.

### Finding your token

As of 2022-04-14, a secure token is required to interact with the data catalog. This is a long random string that replaces the previous user/password authentication (allowing access for non-PSI use cases). **This string should be treated like a password and not shared.**

1. Go to discovery.psi.ch
1. Click 'Sign in' in the top right corner. Click 'Login with PSI account' and log in on the PSI login page.
1. You should be redirected to your user settings and see a 'User Information' section. If not, click on your username in the top right and choose 'Settings' from the menu.
1. Look for the field 'Catamel Token'. This should be a 64-character string. Click the icon to copy the token.

![Image](../../images/DataCatalog/scicat_token.png)

You will need to save this token for later steps. To avoid including it in all the commands, we suggest saving it to an environment variable (Linux):

```
 SCICAT_TOKEN=RqYMZcqpqMJqluplbNYXLeSyJISLXfnkwlfBKuvTSdnlpKkU
```

(Hint: prefix this line with a space to avoid saving the token to your bash history.)

Tokens expire after 2 weeks and will need to be fetched from the website again.
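
As an alternative to retyping the token every session, it can be kept in a file readable only by you. A sketch (the path ``~/.scicat_token`` is just a suggestion, not a convention the tools rely on):

```bash
# Create the file with owner-only permissions, then store the token in it.
umask 077
printf '%s' 'RqYMZcqpqMJqluplbNYXLeSyJISLXfnkwlfBKuvTSdnlpKkU' > ~/.scicat_token

# In later sessions, load it without typing it into the shell history:
SCICAT_TOKEN=$(cat ~/.scicat_token)
```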

### Ingestion

The first step to ingesting your data into the catalog is to prepare a file describing what data you have. This file is called
**``metadata.json``** and can be created with a text editor (e.g. *``vim``*). It can in principle be saved anywhere,
but keeping it with your archived data is recommended. For more information about the format, see the 'Bio metadata'
section below. An example follows:

```json
{
    "principalInvestigator": "albrecht.gessler@psi.ch",
    "creationLocation": "/PSI/EMF/JEOL2200FS",
    "dataFormat": "TIFF+LZW Image Stack",
    "sourceFolder": "/gpfs/group/LBR/pXXX/myimages",
    "owner": "Wilhelm Tell",
    "ownerEmail": "wilhelm.tell@psi.ch",
    "type": "raw",
    "description": "EM micrographs of amygdalin",
    "ownerGroup": "a-12345",
    "scientificMetadata": {
        "description": "EM micrographs of amygdalin",
        "sample": {
            "name": "Amygdalin beta-glucosidase 1",
            "uniprot": "P29259",
            "species": "Apple"
        },
        "dataCollection": {
            "date": "2018-08-01"
        },
        "microscopeParameters": {
            "pixel size": {
                "v": 0.885,
                "u": "A"
            },
            "voltage": {
                "v": 200,
                "u": "kV"
            },
            "dosePerFrame": {
                "v": 1.277,
                "u": "e/A2"
            }
        }
    }
}
```
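
Before running the ingestor, the file's JSON syntax can be checked locally. A sketch (this assumes ``python3`` is available, as it is on the Merlin login nodes; it validates syntax only, not the SciCat schema):

```bash
# Syntax-check metadata.json; prints a one-line verdict either way.
python3 -m json.tool metadata.json > /dev/null \
    && echo "metadata.json: valid JSON" \
    || echo "metadata.json: syntax error" >&2
```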

It is recommended to use the [ScicatEditor](https://bliven_s.gitpages.psi.ch/SciCatEditor/) for creating metadata files. This is a browser-based tool specifically for ingesting PSI data. Using the tool avoids syntax errors and provides templates for common datasets and options. The finished JSON file can then be downloaded to Merlin or copied into a text editor.

Another option is to use the SciCat graphical interface from NoMachine. This provides a graphical interface for selecting data to archive. It is particularly useful for data associated with a DUO experiment and p-group. Type `SciCat` to get started after loading the `datacatalog` module. The GUI also replaces the command-line ingestion described below.

The following steps can be run from wherever you saved your ``metadata.json``. First, perform a "dry run", which checks the metadata for errors:

```bash
datasetIngestor --token $SCICAT_TOKEN metadata.json
```

It will ask for your PSI credentials and then print some info about the data to be ingested. If there are no errors, proceed to the real ingestion:

```bash
datasetIngestor --token $SCICAT_TOKEN --ingest --autoarchive metadata.json
```

You will be asked whether you want to copy the data to the central system:

* If you are on the Merlin cluster and you are archiving data from ``/data/user`` or ``/data/project``, answer 'no', since the data catalog can
  directly read the data.
* If you are in a directory other than ``/data/user`` and ``/data/project``, or you are on a desktop computer, answer 'yes'. Copying large datasets
  to the PSI archive system may take quite a while (minutes to hours).

If there are no errors, your data has been accepted into the data catalog! From now on, no changes should be made to the ingested data.
This is important, since the next step is for the system to copy all the data to the CSCS Petabyte archive. Writing to tape is slow, so
this process may take several days, and it will fail if any modifications are detected.

If using the ``--autoarchive`` option as suggested above, your dataset should now be in the queue. Check the data catalog:
[https://discovery.psi.ch](https://discovery.psi.ch). Your job should have status 'WorkInProgress'. You will receive an email when the ingestion
is complete.

If you didn't use ``--autoarchive``, you need to manually move the dataset into the archive queue. From **discovery.psi.ch**, navigate to the 'Archive'
tab. You should see the newly ingested dataset. Check the dataset and click **``Archive``**. You should see the status change from **``datasetCreated``** to
**``scheduleArchiveJob``**. This indicates that the data is in the process of being transferred to CSCS.

After a few days the dataset's status will change to **``datasetOnArchive``**, indicating the data is stored. At this point it is safe to delete the data.

#### Useful commands

Running the datasetIngestor in dry-run mode (**without** ``--ingest``) finds most errors. However, it is sometimes convenient to find potential errors
yourself with simple Unix commands.

Find problematic filenames:

```bash
find . -iregex '.*/[^/]*[^a-zA-Z0-9_ ./-][^/]*'
```

Find broken links:

```bash
find -L . -type l
```

Find links pointing outside the directory tree:

```bash
find . -type l -exec bash -c 'realpath --relative-base "`pwd`" "$0" 2>/dev/null |egrep "^[./]" |sed "s|^|$0 ->|" ' '{}' ';'
```

Delete certain files (use with caution):

```bash
# Empty directories
find . -type d -empty -delete
# Backup files
find . -name '*~' -delete
find . -name '*#autosave#' -delete
```

#### Troubleshooting & Known Bugs

* The following message can be safely ignored:

```bash
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
```

  It indicates that no Kerberos token was provided for authentication. You can avoid the warning by first running `kinit` (PSI Linux systems).

* For decentral ingestion cases, the copy step is indicated by a message ``Running [/usr/bin/rsync -e ssh -avxz ...``. It is expected that this
  step will take a long time and may appear to have hung. You can check which files have been successfully transferred using rsync:

```bash
rsync --list-only user_n@pb-archive.psi.ch:archive/UID/PATH/
```

  where UID is the dataset ID (12345678-1234-1234-1234-123456789012) and PATH is the absolute path to your data. Note that rsync creates directories first and that the transfer order is not alphabetical in some cases, but it should be possible to see whether any data has transferred.

* There is currently a limit on the number of files per dataset (technically, the limit is on the total length of all file paths). It is recommended to break up datasets into 300'000 files or less.
* If it is not possible or desirable to split data between multiple datasets, an alternative workaround is to package files into a tarball. For datasets which are already compressed, omit the ``-z`` option for a considerable speedup:

```
tar -cf [output].tar [srcdir]
```

  Uncompressed data can be compressed on the cluster using the following command:

```
sbatch /data/software/Slurm/Utilities/Parallel_TarGz.batch -s [srcdir] -t [output].tar -n
```

  Run ``/data/software/Slurm/Utilities/Parallel_TarGz.batch -h`` for more details and options.

#### Sample ingestion output (datasetIngestor 1.1.11)
<details>
<summary>[Show Example]: Sample ingestion output (datasetIngestor 1.1.11)</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
/data/project/bio/myproject/archive $ datasetIngestor -copy -autoarchive -allowexistingsource -ingest metadata.json
2019/11/06 11:04:43 Latest version: 1.1.11

2019/11/06 11:04:43 Your version of this program is up-to-date
2019/11/06 11:04:43 You are about to add a dataset to the === production === data catalog environment...
2019/11/06 11:04:43 Your username:
user_n
2019/11/06 11:04:48 Your password:
2019/11/06 11:04:52 User authenticated: XXX
2019/11/06 11:04:52 User is member in following a or p groups: XXX
2019/11/06 11:04:52 OwnerGroup information a-XXX verified successfully.
2019/11/06 11:04:52 contactEmail field added: XXX
2019/11/06 11:04:52 Scanning files in dataset /data/project/bio/myproject/archive
2019/11/06 11:04:52 No explicit filelistingPath defined - full folder /data/project/bio/myproject/archive is used.
2019/11/06 11:04:52 Source Folder: /data/project/bio/myproject/archive at /data/project/bio/myproject/archive
2019/11/06 11:04:57 The dataset contains 100000 files with a total size of 50000000000 bytes.
2019/11/06 11:04:57 creationTime field added: 2019-07-29 18:47:08 +0200 CEST
2019/11/06 11:04:57 endTime field added: 2019-11-06 10:52:17.256033 +0100 CET
2019/11/06 11:04:57 license field added: CC BY-SA 4.0
2019/11/06 11:04:57 isPublished field added: false
2019/11/06 11:04:57 classification field added: IN=medium,AV=low,CO=low
2019/11/06 11:04:57 Updated metadata object:
{
  "accessGroups": [
    "XXX"
  ],
  "classification": "IN=medium,AV=low,CO=low",
  "contactEmail": "XXX",
  "creationLocation": "XXX",
  "creationTime": "2019-07-29T18:47:08+02:00",
  "dataFormat": "XXX",
  "description": "XXX",
  "endTime": "2019-11-06T10:52:17.256033+01:00",
  "isPublished": false,
  "license": "CC BY-SA 4.0",
  "owner": "XXX",
  "ownerEmail": "XXX",
  "ownerGroup": "a-XXX",
  "principalInvestigator": "XXX",
  "scientificMetadata": {
    ...
  },
  "sourceFolder": "/data/project/bio/myproject/archive",
  "type": "raw"
}
2019/11/06 11:04:57 Running [/usr/bin/ssh -l user_n pb-archive.psi.ch test -d /data/project/bio/myproject/archive].
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
user_n@pb-archive.psi.ch's password:
2019/11/06 11:05:04 The source folder /data/project/bio/myproject/archive is not centrally available (decentral use case).
The data must first be copied to a rsync cache server.

2019/11/06 11:05:04 Do you want to continue (Y/n)?
Y
2019/11/06 11:05:09 Created dataset with id 12.345.67890/12345678-1234-1234-1234-123456789012
2019/11/06 11:05:09 The dataset contains 108057 files.
2019/11/06 11:05:10 Created file block 0 from file 0 to 1000 with total size of 413229990 bytes
2019/11/06 11:05:10 Created file block 1 from file 1000 to 2000 with total size of 416024000 bytes
2019/11/06 11:05:10 Created file block 2 from file 2000 to 3000 with total size of 416024000 bytes
2019/11/06 11:05:10 Created file block 3 from file 3000 to 4000 with total size of 416024000 bytes
...
2019/11/06 11:05:26 Created file block 105 from file 105000 to 106000 with total size of 416024000 bytes
2019/11/06 11:05:27 Created file block 106 from file 106000 to 107000 with total size of 416024000 bytes
2019/11/06 11:05:27 Created file block 107 from file 107000 to 108000 with total size of 850195143 bytes
2019/11/06 11:05:27 Created file block 108 from file 108000 to 108057 with total size of 151904903 bytes
2019/11/06 11:05:27 short dataset id: 0a9fe316-c9e7-4cc5-8856-e1346dd31e31
2019/11/06 11:05:27 Running [/usr/bin/rsync -e ssh -avxz /data/project/bio/myproject/archive/ user_n@pb-archive.psi.ch:archive
/0a9fe316-c9e7-4cc5-8856-e1346dd31e31/data/project/bio/myproject/archive].
key_cert_check_authority: invalid certificate
Certificate invalid: name is not a listed principal
user_n@pb-archive.psi.ch's password:
Permission denied, please try again.
user_n@pb-archive.psi.ch's password:
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
/usr/libexec/test_acl.sh: line 30: /tmp/tmpacl.txt: Permission denied
...
2019/11/06 12:05:08 Successfully updated {"pid":"12.345.67890/12345678-1234-1234-1234-123456789012",...}
2019/11/06 12:05:08 Submitting Archive Job for the ingested datasets.
2019/11/06 12:05:08 Job response Status: okay
2019/11/06 12:05:08 A confirmation email will be sent to XXX
12.345.67890/12345678-1234-1234-1234-123456789012
</pre>
</details>

### Publishing

After datasets are ingested, they can be assigned a public DOI. This can be included in publications and will make the datasets available on http://doi.psi.ch.

For instructions, please read the ['Publish' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-8).

### Retrieving data

Retrieving data from the archive is also initiated through the Data Catalog. Please read the ['Retrieve' section in the ingest manual](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html#sec-6).

## Further Information

* [PSI Data Catalog](https://discovery.psi.ch)
* [Full Documentation](https://scicatproject.github.io/documentation/Ingestor/ingestManual.html)
* [Published Datasets (doi.psi.ch)](https://doi.psi.ch)
* Data Catalog [PSI page](https://www.psi.ch/photon-science-data-services/data-catalog-and-archive)
* Data Catalog [SciCat Software](https://scicatproject.github.io/)
* [FAIR](https://www.nature.com/articles/sdata201618) definition and [SNF Research Policy](http://www.snf.ch/en/theSNSF/research-policies/open_research_data/Pages/default.aspx#FAIR%20Data%20Principles%20for%20Research%20Data%20Management)
* [Petabyte Archive at CSCS](https://www.cscs.ch/fileadmin/user_upload/contents_publications/annual_reports/AR2017_Online.pdf)
50
docs/merlin6/02-How-To-Use-Merlin/connect-from-linux.md
Normal file
@@ -0,0 +1,50 @@
---
title: Connecting from a Linux Client
#tags:
keywords: linux, connecting, client, configuration, SSH, X11
last_updated: 07 September 2022
summary: "This document describes a recommended setup for a Linux client."
sidebar: merlin6_sidebar
permalink: /merlin6/connect-from-linux.html
---

## SSH without X11 Forwarding

This is the standard method. Official X11 support is provided through [NoMachine](/merlin6/nomachine.html).
For normal SSH sessions, use your SSH client as follows:

```bash
ssh $username@merlin-l-01.psi.ch
ssh $username@merlin-l-001.psi.ch
ssh $username@merlin-l-002.psi.ch
```

## SSH with X11 Forwarding

Official X11 Forwarding support is through NoMachine. Please follow the documents
[{Job Submission -> Interactive Jobs}](/merlin6/interactive-jobs.html#Requirements) and
[{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for more details. However,
we provide a small recipe for enabling X11 Forwarding in Linux.

* To enable client X11 forwarding, add the following to the start of ``~/.ssh/config``
  to implicitly add ``-X`` to all ssh connections:

```bash
ForwardAgent yes
ForwardX11 yes
ForwardX11Trusted yes
```

* Alternatively, you can add the option ``-X`` (or ``-Y`` for trusted forwarding) to the ``ssh`` command. For example:

```bash
ssh -X $username@merlin-l-01.psi.ch
ssh -X $username@merlin-l-001.psi.ch
ssh -X $username@merlin-l-002.psi.ch
```

* To test that X11 forwarding works, just run ``xclock``. An X11-based clock should
  pop up in your client session:

```bash
xclock
```
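
Rather than enabling forwarding globally, it can be restricted to the Merlin login nodes only. A sketch of a ``~/.ssh/config`` entry (the host pattern is an assumption; adjust it to your site):

```bash
# ~/.ssh/config -- enable X11 forwarding only for the Merlin login nodes
Host merlin-l-*.psi.ch
    ForwardX11 yes
    ForwardX11Trusted yes
```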
60
docs/merlin6/02-How-To-Use-Merlin/connect-from-macos.md
Normal file
@@ -0,0 +1,60 @@
---
title: Connecting from a MacOS Client
#tags:
keywords: MacOS, mac os, mac, connecting, client, configuration, SSH, X11
last_updated: 07 September 2022
summary: "This document describes a recommended setup for a MacOS client."
sidebar: merlin6_sidebar
permalink: /merlin6/connect-from-macos.html
---

## SSH without X11 Forwarding

This is the standard method. Official X11 support is provided through [NoMachine](/merlin6/nomachine.html).
For normal SSH sessions, use your SSH client as follows:

```bash
ssh $username@merlin-l-01.psi.ch
ssh $username@merlin-l-001.psi.ch
ssh $username@merlin-l-002.psi.ch
```

## SSH with X11 Forwarding

### Requirements

To run SSH with X11 forwarding on MacOS, you need an X server running on MacOS.
The official X server for MacOS is **[XQuartz](https://www.xquartz.org/)**. Please ensure
it is running before starting an SSH connection with X11 forwarding.

### SSH with X11 Forwarding in MacOS

Official X11 support is through NoMachine. Please follow the documents
[{Job Submission -> Interactive Jobs}](/merlin6/interactive-jobs.html#Requirements) and
[{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for more details. However,
we provide a small recipe for enabling X11 Forwarding in MacOS.

* Ensure that **[XQuartz](https://www.xquartz.org/)** is installed and running on your MacOS system.

* To enable client X11 forwarding, add the following to the start of ``~/.ssh/config``
  to implicitly add ``-X`` to all ssh connections:

```bash
ForwardAgent yes
ForwardX11 yes
ForwardX11Trusted yes
```

* Alternatively, you can add the option ``-X`` (or ``-Y`` for trusted forwarding) to the ``ssh`` command. For example:

```bash
ssh -X $username@merlin-l-01.psi.ch
ssh -X $username@merlin-l-001.psi.ch
ssh -X $username@merlin-l-002.psi.ch
```

* To test that X11 forwarding works, just run ``xclock``. An X11-based clock should
  pop up in your client session:

```bash
xclock
```
47
docs/merlin6/02-How-To-Use-Merlin/connect-from-windows.md
Normal file
@@ -0,0 +1,47 @@
---
title: Connecting from a Windows Client
keywords: microsoft, mocosoft, windows, putty, xming, connecting, client, configuration, SSH, X11
last_updated: 07 September 2022
summary: "This document describes a recommended setup for a Windows client."
sidebar: merlin6_sidebar
permalink: /merlin6/connect-from-windows.html
---

## SSH with PuTTY without X11 Forwarding

PuTTY is one of the most common tools for SSH.

Check if the following software packages are installed on the Windows workstation by
inspecting the *Start* menu (hint: use the *Search* box to save time):
* PuTTY (should be already installed)
* *[Optional]* Xming (needed for [SSH with X11 Forwarding](/merlin6/connect-from-windows.html#ssh-with-x11-forwarding))

If they are missing, you can install them using the Software Kiosk icon on the Desktop.

1. Start PuTTY

2. *[Optional]* Enable ``xterm`` to have similar mouse behaviour as in Linux:

![Image](../../images/Putty/putty_export.png)

3. Create a session to a Merlin login node and *Open*:

![Image](../../images/Putty/putty_session.png)

## SSH with PuTTY with X11 Forwarding

Official X11 Forwarding support is through NoMachine. Please follow the documents
[{Job Submission -> Interactive Jobs}](/merlin6/interactive-jobs.html#Requirements) and
[{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for more details. However,
we provide a small recipe for enabling X11 Forwarding in Windows.

Check if **Xming** is installed on the Windows workstation by inspecting the
*Start* menu (hint: use the *Search* box to save time). If missing, you can install it
using the Software Kiosk icon (should be located on the Desktop).

1. Ensure that an X server (**Xming**) is running. Otherwise, start it.

2. Enable X11 forwarding in your SSH client. For example, in PuTTY:

![Image](../../images/Putty/putty_x11.png)
192
docs/merlin6/02-How-To-Use-Merlin/kerberos.md
Normal file
@@ -0,0 +1,192 @@
---
title: Kerberos and AFS authentication
#tags:
keywords: kerberos, AFS, kinit, klist, keytab, tickets, connecting, client, configuration, slurm
last_updated: 07 September 2022
summary: "This document describes how to use Kerberos."
sidebar: merlin6_sidebar
permalink: /merlin6/kerberos.html
---

Projects and users have their own areas in the central PSI AFS service. In order
|
||||
to access to these areas, valid Kerberos and AFS tickets must be granted.
|
||||
|
||||
These tickets are automatically granted when accessing through SSH with
|
||||
username and password. Alternatively, one can get a granting ticket with the `kinit` (Kerberos)
|
||||
and `aklog` (AFS ticket, which needs to be run after `kinit`) commands.
|
||||
|
||||
Due to PSI security policies, the maximum lifetime of the ticket is 7 days, and the default
|
||||
time is 10 hours. It means than one needs to constantly renew (`krenew` command) the existing
|
||||
granting tickets, and their validity can not be extended longer than 7 days. At this point,
|
||||
one needs to obtain new granting tickets.
|
||||
|
||||
|
||||
## Obtaining granting tickets with username and password

As already described above, the most common use case is to obtain Kerberos and AFS granting tickets
by entering username and password:

* When logging in to Merlin through the SSH protocol with username + password authentication,
  tickets for Kerberos and AFS will be obtained automatically.
* When logging in to Merlin through NoMachine, no Kerberos and AFS tickets are granted. Therefore, users need to
  run `kinit` (to obtain a granting Kerberos ticket) followed by `aklog` (to obtain a granting AFS ticket).
  See further details below.

To manually obtain granting tickets:

1. To obtain a granting Kerberos ticket, run `kinit $USER` and enter the PSI password.
   ```bash
   kinit $USER@D.PSI.CH
   ```
2. To obtain a granting ticket for AFS, run `aklog`. No password is necessary, but a valid
   Kerberos ticket is mandatory.
   ```bash
   aklog
   ```
3. To list the status of your granted tickets, use the `klist` command.
   ```bash
   klist
   ```
4. To extend the validity of existing granting tickets, use the `krenew` command.
   ```bash
   krenew
   ```
   * Keep in mind that the maximum lifetime of granting tickets is 7 days, therefore `krenew` cannot be used beyond that limit; `kinit` should be used instead.
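
The renew-or-reacquire logic above can be wrapped in a small helper, for instance in a shell profile. A minimal sketch (the `ensure_ticket` name is hypothetical, not an existing Merlin tool; `klist -s` exits 0 only when a valid, unexpired ticket is cached):

```shell
# Sketch: renew the current ticket if a valid one is cached, otherwise
# remind the user to authenticate. 'ensure_ticket' is a hypothetical helper.
ensure_ticket() {
    if klist -s 2>/dev/null; then
        krenew && echo "ticket renewed"
    else
        echo "no valid ticket; run: kinit \$USER@D.PSI.CH"
    fi
}
ensure_ticket
```

Remember that `krenew` still cannot push a ticket past the 7-day maximum lifetime.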

## Obtaining granting tickets with a keytab

Sometimes, obtaining granting tickets via password authentication is not possible. An example is user Slurm jobs
requiring access to private areas in AFS. For such cases, there is the possibility to generate a **keytab** file.

Be aware that the **keytab** file must be **private**, **fully protected** by correct permissions and not shared with any
other users.
|
||||
|
||||
### Creating a keytab file

To generate a **keytab**, one has to:

1. Load a newer Kerberos (`krb5/1.20` or higher) from Pmodules:
   ```bash
   module load krb5/1.20
   ```
2. Create a private directory for storing the Kerberos **keytab** file:
   ```bash
   mkdir -p ~/.k5
   ```
3. Run the `ktutil` utility, which comes with the loaded `krb5` Pmodule:
   ```bash
   ktutil
   ```
4. In the `ktutil` console, generate a **keytab** file as follows:
   ```bash
   # Replace $USER by your username
   add_entry -password -k 0 -f -p $USER
   wkt /psi/home/$USER/.k5/krb5.keytab
   exit
   ```
   Notice that you will need to enter your password once. This step is required for generating the **keytab** file.
5. Once back in the main shell, ensure that the file has the proper permissions:
   ```bash
   chmod 0600 ~/.k5/krb5.keytab
   ```
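
Since a leaked keytab is equivalent to a leaked password, it can be worth verifying the permissions before each use. A minimal sketch (the `check_keytab` helper is hypothetical and assumes GNU `stat` on Linux):

```shell
# Sketch: refuse a keytab whose permissions are too open.
# check_keytab is a hypothetical helper, not part of the Merlin tooling.
check_keytab() {
    local keytab="$1"
    [ -f "$keytab" ] || { echo "no keytab at $keytab"; return 1; }
    local mode
    mode=$(stat -c '%a' "$keytab")   # GNU stat: octal permission bits
    if [ "$mode" = "600" ]; then
        echo "keytab permissions OK"
    else
        echo "keytab permissions too open ($mode); run: chmod 0600 $keytab"
        return 1
    fi
}
check_keytab "$HOME/.k5/krb5.keytab" || true
```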

### Obtaining tickets by using keytab files

Once the keytab is created, one can obtain Kerberos tickets without being prompted for a password, as follows:

```bash
kinit -kt ~/.k5/krb5.keytab $USER
aklog
```

## Slurm jobs accessing AFS

Some jobs may require access to private areas in AFS. For that, a valid [**keytab**](/merlin6/kerberos.html#creating-a-keytab-file) file is required.
Then, from inside the batch script one can obtain granting tickets for Kerberos and AFS, which can be used for accessing private AFS areas.

The steps are the following:

* Set `KRB5CCNAME`, which specifies the location of the Kerberos5 credential (ticket) cache. In general it should point to a shared area
  (`$HOME/.k5` is a good location), and it is strongly recommended to generate an independent Kerberos5 credential cache (that is, to create a new credential cache per Slurm job):
  ```bash
  export KRB5CCNAME="$(mktemp "$HOME/.k5/krb5cc_XXXXXX")"
  ```
* To obtain a Kerberos5 granting ticket, run `kinit` using your keytab:
  ```bash
  kinit -kt "$HOME/.k5/krb5.keytab" $USER@D.PSI.CH
  ```
* To obtain a granting AFS ticket, run `aklog`:
  ```bash
  aklog
  ```
* At the end of the job, you can remove/destroy the existing Kerberos tickets:
  ```bash
  kdestroy
  ```

### Slurm batch script example: obtaining KRB+AFS granting tickets

#### Example 1: Independent credential cache per Slurm job

This is the **recommended** way. At the end of the job, it is strongly recommended to remove/destroy the existing Kerberos tickets.

```bash
#!/bin/bash
#SBATCH --partition=hourly   # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=01:00:00      # Strictly recommended when using 'general' partition.
#SBATCH --output=run.out     # Generate custom output file
#SBATCH --error=run.err      # Generate custom error file
#SBATCH --nodes=1            # Number of nodes to use
#SBATCH --ntasks=1           # Number of tasks to run
#SBATCH --cpus-per-task=1
#SBATCH --constraint=xeon-gold-6152
#SBATCH --hint=nomultithread
#SBATCH --job-name=krb5

export KRB5CCNAME="$(mktemp "$HOME/.k5/krb5cc_XXXXXX")"
kinit -kt "$HOME/.k5/krb5.keytab" $USER@D.PSI.CH
aklog
klist

echo "Here should go my batch script code."

# Destroy Kerberos tickets created for this job only
kdestroy
klist
```

#### Example 2: Shared credential cache

Some users may need/prefer to run with a shared cache file. To do that, one needs to
set `KRB5CCNAME` in the **login node** session, before submitting the job.

```bash
export KRB5CCNAME="$(mktemp "$HOME/.k5/krb5cc_XXXXXX")"
```

Then, you can run one or multiple job scripts (or a parallel job with `srun`). `KRB5CCNAME` will be propagated to the
job script or to the parallel job, therefore a single credential cache will be shared amongst different Slurm runs.

```bash
#!/bin/bash
#SBATCH --partition=hourly   # Specify 'general' or 'daily' or 'hourly'
#SBATCH --time=01:00:00      # Strictly recommended when using 'general' partition.
#SBATCH --output=run.out     # Generate custom output file
#SBATCH --error=run.err      # Generate custom error file
#SBATCH --nodes=1            # Number of nodes to use
#SBATCH --ntasks=1           # Number of tasks to run
#SBATCH --cpus-per-task=1
#SBATCH --constraint=xeon-gold-6152
#SBATCH --hint=nomultithread
#SBATCH --job-name=krb5

# KRB5CCNAME is inherited from the login node session
kinit -kt "$HOME/.k5/krb5.keytab" $USER@D.PSI.CH
aklog
klist

echo "Here should go my batch script code."

echo "No need to run 'kdestroy', as the cache may have to survive for running other jobs"
```

`docs/merlin6/02-How-To-Use-Merlin/merlin-rmount.md` (new file, 109 lines)
---
title: Using merlin_rmount
#tags:
keywords: >-
  transferring data, data transfer, rsync, dav, webdav, sftp, ftp, smb, cifs,
  copy data, copying, mount, file, folder, sharing
last_updated: 24 August 2023
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/merlin-rmount.html
---
## Background

Merlin provides a command for mounting remote file systems, called `merlin_rmount`. It
provides a helpful wrapper over the Gnome storage utilities (GIO and GVFS), and supports a wide range of remote file protocols, including

- SMB/CIFS (Windows shared folders)
- WebDav
- AFP
- FTP, SFTP
- [complete list](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_the_desktop_environment_in_rhel_8/managing-storage-volumes-in-gnome_using-the-desktop-environment-in-rhel-8#gvfs-back-ends_managing-storage-volumes-in-gnome)

## Usage

### Start a session

First, start a new session. This will start a new bash shell in the current terminal where you can add further commands.

```
$ merlin_rmount --init
[INFO] Starting new D-Bus RMOUNT session

(RMOUNT STARTED) [bliven_s@merlin-l-002 ~]$
```

Note that behind the scenes this creates a new D-Bus daemon. Running multiple daemons on the same login node leads to unpredictable results, so it is best not to initialize multiple sessions in parallel.

### Standard Endpoints

Standard endpoints can be mounted using

```
merlin_rmount --select-mount
```

Select the desired URL using the arrow keys.



From this list, any of the standard supported endpoints can be mounted.

### Other endpoints

Other endpoints can be mounted using the `merlin_rmount --mount <endpoint>` command.


### Accessing Files

After mounting a volume, the script will print the mountpoint. It should be of the form

```
/run/user/$UID/gvfs/<endpoint>
```

where `$UID` is your unix user id (a 5-digit number, also viewable with `id -u`) and
`<endpoint>` is a string generated from the mount options.

For convenience, it may be useful to add a symbolic link to this gvfs directory. For instance, the following would make all mounted volumes accessible under `~/mnt/`:

```
ln -s /run/user/$UID/gvfs ~/mnt
```

Files are accessible as long as the `merlin_rmount` shell remains open.

### Disconnecting

To disconnect, close the session with one of the following:

- The `exit` command
- CTRL-D
- Closing the terminal

Disconnecting will unmount all volumes.

## Alternatives

### Thunar

Users who prefer a GUI file browser may prefer the `thunar` command, which opens the Gnome file browser. It is also available in NoMachine sessions in the bottom bar (1). Thunar supports the same remote filesystems as `merlin_rmount`; just type the URL in the address bar (2).



When using Thunar within a NoMachine session, file transfers continue after closing NoMachine (as long as the NoMachine session stays active).

Files can also be accessed at the command line as needed (see 'Accessing Files' above).

## Resources

- [BIO docs](https://intranet.psi.ch/en/bio/webdav-data) on using these tools for transferring EM data
- [Red Hat docs on GVFS](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_the_desktop_environment_in_rhel_8/managing-storage-volumes-in-gnome_using-the-desktop-environment-in-rhel-8)
- [gio reference](https://developer-old.gnome.org/gio/stable/gio.html)
`docs/merlin6/02-How-To-Use-Merlin/nomachine.md` (new file, 122 lines)
---
title: Remote Desktop Access
#tags:
keywords: NX, nomachine, remote desktop access, login node, merlin-l-001, merlin-l-002, merlin-nx-01, merlin-nx-02, merlin-nx, rem-acc, vpn
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/nomachine.html
---

Users can log in to Merlin through a Linux Remote Desktop Session. NoMachine
is a desktop virtualization tool, similar to VNC, Remote Desktop, etc.
It uses the NX protocol to enable a graphical login to remote servers.

## Installation

NoMachine is available for PSI Windows computers in the Software Kiosk under the
name **NX Client**. Please use the latest version (at least 6.0). For MacOS and
Linux, the NoMachine client can be downloaded from https://www.nomachine.com/.

## Accessing Merlin6 NoMachine from PSI

The Merlin6 NoMachine service is hosted on the following machine:

* **`merlin-nx.psi.ch`**

This is the **front-end** (hence, *the door*) to the NoMachine **back-end nodes**,
which run the NoMachine desktop service. The **back-end nodes** are the following:

* `merlin-l-001.psi.ch`
* `merlin-l-002.psi.ch`

Any access to the login node desktops must be done through **`merlin-nx.psi.ch`**
(or through **`rem-acc.psi.ch -> merlin-nx.psi.ch`** when connecting from outside PSI).

The **front-end** service running on **`merlin-nx.psi.ch`** will load-balance the sessions
and log in to any of the available nodes in the **back-end**.

**Only 1 session per back-end node** is possible.

The steps necessary for configuring access to the NoMachine service running on a
login node are explained below.

### Creating a Merlin6 NoMachine connection

#### Adding a new connection to the front-end

Click the **Add** button to create a new connection to the **`merlin-nx.psi.ch` front-end**, and fill in
the following fields:

* **Name**: Specify a custom name for the connection. Examples: `merlin-nx`, `merlin-nx.psi.ch`, `Merlin Desktop`
* **Host**: Specify the hostname of the **front-end** service: **`merlin-nx.psi.ch`**
* **Protocol**: Specify the protocol that will be used for the connection. *Recommended* protocol: **`NX`**
* **Port**: Specify the listening port of the **front-end**. It must be **`4000`**.



#### Configuring the NoMachine Authentication Method

Depending on the client version, it may offer different authentication options.
If required, choose your authentication method and **Continue** (**Password** or *Kerberos* are the recommended ones).

You will be asked for your credentials (username / password). **Do not add `PSICH\`** as a prefix for the username.

### Opening NoMachine desktop sessions

By default, when connecting to the **`merlin-nx.psi.ch` front-end**, a new
session will be opened automatically if none exists.

If there are existing sessions, instead of opening a new desktop session, users can reconnect to an
existing one by clicking the proper icon (see image below).



Users can also create a second desktop session by selecting the **`New Desktop`** button (*red* rectangle in the
image below). This will create a second session on the second login node, as long as that node is up and running.



### NoMachine LightDM Session Example

An example of a NoMachine session, which is based on [LightDM](https://github.com/canonical/lightdm)
X Windows:


## Accessing Merlin6 NoMachine from outside PSI

### No VPN access

Access to the Merlin6 NoMachine service is possible without VPN through **`rem-acc.psi.ch`**.
Please follow the steps described in [PSI Remote Interactive Access](https://www.psi.ch/en/photon-science-data-services/remote-interactive-access) for
remote access to the Merlin6 NoMachine services. Once logged in to **`rem-acc.psi.ch`**, you must then log in to the **`merlin-nx.psi.ch` front-end**.

### VPN access

Remote access is also possible through VPN. However, you **must not use `rem-acc.psi.ch`**; instead, connect directly
to the Merlin6 NoMachine **`merlin-nx.psi.ch` front-end** as if you were inside PSI. VPN access should be requested
from the IT department by opening a PSI Service Now ticket:
[VPN Access (PSI employees)](https://psi.service-now.com/psisp?id=psi_new_sc_cat_item&sys_id=beccc01b6f44a200d02a82eeae3ee440).

## Advanced Display Settings

**NoMachine Display Settings** can be accessed and changed either when creating a new session or by clicking the very top right corner of a running session.

### Prevent Rescaling

These settings prevent "blurriness" at the cost of some performance (you may want to choose depending on performance):

* Display > Resize remote display (forces 1:1 pixel sizes)
* Display > Change settings > Quality: Choose Medium-Best Quality
* Display > Change settings > Modify advanced settings
  * Check: Disable network-adaptive display quality (disables lossy compression)
  * Check: Disable client side image post-processing
`docs/merlin6/02-How-To-Use-Merlin/ssh-keys.md` (new file, 159 lines)
---
title: Configuring SSH Keys in Merlin
#tags:
keywords: linux, connecting, client, configuration, SSH, Keys, SSH-Keys, RSA, authorization, authentication
last_updated: 15 Jul 2020
summary: "This document describes how to deploy SSH Keys in Merlin."
sidebar: merlin6_sidebar
permalink: /merlin6/ssh-keys.html
---
Merlin users sometimes need to access the different Merlin services without being constantly asked for a password.
One can achieve that with Kerberos authentication; however, some software requires the setup of SSH keys.
One example is ANSYS Fluent: when used interactively, the GUI communicates with the different nodes
through the SSH protocol, and the use of SSH keys is enforced.

## Setting up SSH Keys on Merlin

For security reasons, users **must always protect SSH keys with a passphrase**.

Users can check whether an SSH key already exists. Keys are placed in the **~/.ssh/** directory. `RSA` encryption
is usually the default, and the corresponding files are **`id_rsa`** (private key) and **`id_rsa.pub`** (public key).

```bash
ls ~/.ssh/id*
```

To create **SSH RSA keys**, one should:

1. Run `ssh-keygen`; a passphrase will be requested twice. You **must remember** this passphrase for the future.
   * For security reasons, ***always protect the key with a passphrase***. The only exception is when running ANSYS software, which in general should not use a passphrase, to simplify running the software in Slurm.
   * This will generate a private key **id_rsa** and a public key **id_rsa.pub** in your **~/.ssh** directory.
2. Add your public key to the **`authorized_keys`** file, and ensure proper permissions for that file, as follows:
   ```bash
   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
   chmod 0600 ~/.ssh/authorized_keys
   ```
3. Configure the SSH client to force the usage of the **psi.ch** domain for trusting keys:
   ```bash
   echo "CanonicalizeHostname yes" >> ~/.ssh/config
   ```
4. Configure further SSH options as follows:
   ```bash
   echo "AddKeysToAgent yes" >> ~/.ssh/config
   echo "ForwardAgent yes" >> ~/.ssh/config
   ```
   Other options may be added.
5. Check that your SSH config file contains at least the lines mentioned in steps 3 and 4:
   ```bash
   (base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# cat ~/.ssh/config
   CanonicalizeHostname yes
   AddKeysToAgent yes
   ForwardAgent yes
   ```
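
To verify that step 2 worked, one can check that the public key really is present as a line of `authorized_keys`. A minimal sketch (the `pubkey_installed` helper is hypothetical; it assumes the default `id_rsa.pub` file name used above):

```shell
# Sketch: confirm the public key has been appended to authorized_keys.
# grep -x matches whole lines, -F treats the key as a fixed string (no regex).
pubkey_installed() {
    local pub="$1" auth="$2"
    grep -qxF "$(cat "$pub" 2>/dev/null)" "$auth" 2>/dev/null
}
if pubkey_installed ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys; then
    echo "public key installed"
else
    echo "public key missing; re-run step 2"
fi
```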

## Using the SSH Keys

### Using the Authentication Agent in an SSH session

By default, when accessing the login node via SSH (with `ForwardAgent=yes`), your
SSH keys will be added to the authentication agent automatically. Hence, no action should be needed by the user. One can configure
`ForwardAgent=yes` as follows:

* **(Recommended)** On your local Linux machine (workstation, laptop or desktop), add the following line to the
  `$HOME/.ssh/config` (or alternatively `/etc/ssh/ssh_config`) file:
  ```
  ForwardAgent yes
  ```
* Alternatively, on each SSH call you can add the option `ForwardAgent=yes` to the SSH command. For example:
  ```bash
  ssh -XY -o ForwardAgent=yes merlin-l-001.psi.ch
  ```

If `ForwardAgent` is not enabled as shown above, one needs to run the authentication agent and then add the key
to the **ssh-agent**. This must be done once per SSH session, as follows:

* Run `eval $(ssh-agent -s)` to start the **ssh-agent** in that SSH session.
* Check whether the authentication agent already has your key added:
  ```bash
  ssh-add -l | grep "/psi/home/$(whoami)/.ssh"
  ```
* If no key is returned in the previous step, you have to add the private key identity to the authentication agent.
  You will be asked for the **passphrase** of your key. This can be done by running:
  ```bash
  ssh-add
  ```

### Using the Authentication Agent in a NoMachine Session

By default, when using a NoMachine session, the `ssh-agent` should be started automatically. Hence, there is no need to
start the agent or forward it.

However, in NoMachine one always needs to add the private key identity to the authentication agent. This can be done as follows:

1. Check whether the authentication agent already has the key added:
   ```bash
   ssh-add -l | grep "/psi/home/$(whoami)/.ssh"
   ```
2. If no key is returned in the previous step, you have to add the private key identity to the authentication agent.
   You will be asked for the **passphrase** of your key. This can be done by running:
   ```bash
   ssh-add
   ```

You only need to run this once per NoMachine session, and it will apply to all terminal windows within that NoMachine session.

## Troubleshooting

### Errors when running 'ssh-add'

If the error `Could not open a connection to your authentication agent.` appears when running `ssh-add`, it means
that the authentication agent is not running. Please follow the previous procedures for starting it.

### Add/Update the SSH RSA key passphrase

If an existing SSH key has no passphrase, or you want to replace an existing passphrase with a new one, you can do it as follows:

```bash
ssh-keygen -p -f ~/.ssh/id_rsa
```

### SSH keys deployed but not working

Please ensure proper permissions on the involved files, and check for typos in the file names:

```bash
chmod u+rwx,go-rwx,g+s ~/.ssh
chmod u+rw-x,go-rwx ~/.ssh/authorized_keys
chmod u+rw-x,go-rwx ~/.ssh/id_rsa
chmod u+rw-x,go+r-wx ~/.ssh/id_rsa.pub
```

### Testing SSH Keys

Once the SSH key is created, one can test that it is valid as follows:

1. Create a **new** SSH session on one of the login nodes:
   ```bash
   ssh merlin-l-001
   ```
2. In the login node session, destroy any existing Kerberos tickets and active SSH keys:
   ```bash
   kdestroy
   ssh-add -D
   ```
3. Add the new private key identity to the authentication agent. You will be asked for the passphrase.
   ```bash
   ssh-add
   ```
4. Check that your key is active in the SSH agent:
   ```bash
   ssh-add -l
   ```
5. SSH to the second login node. No password should be requested:
   ```bash
   ssh -vvv merlin-l-002
   ```

If the last step succeeds, it means that your SSH key is properly set up.

`docs/merlin6/02-How-To-Use-Merlin/storage.md` (new file, 197 lines)
---
title: Merlin6 Storage
#tags:
keywords: storage, /data/user, /data/software, /data/project, /scratch, /shared-scratch, quota, export, user, project, scratch, data, shared-scratch, merlin_quotas
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
redirect_from: /merlin6/data-directories.html
permalink: /merlin6/storage.html
---

## Introduction

This document describes the different directories of the Merlin6 cluster.

### User and project data

* ***Users are responsible for backing up their own data***. It is recommended to back up the data on independent third-party systems (e.g. LTS, Archive, AFS, SwitchDrive, Windows Shares, etc.).
* **`/psi/home`** is the only directory where we can provide daily snapshots for one week, as it contains only a small amount of data. Snapshots can be found in the directory **`/psi/home/.snapshot/`**.
* ***When a user leaves PSI, they or their supervisor/team are responsible for backing up the data and moving it out of the cluster***: every few months, the storage space of former users who no longer have an existing, valid PSI account will be recycled.

{{site.data.alerts.warning}}When a user leaves PSI and their account has been removed, their storage space in Merlin may be recycled.
Hence, <b>when a user leaves PSI</b>, they, their supervisor or their team <b>must ensure that the data is backed up to an external storage</b>.
{{site.data.alerts.end}}

### Checking user quota

For each directory, we provide a way of checking quotas (when required). In addition, a single command, ``merlin_quotas``,
is provided. This is useful for showing all quotas for your filesystems with a single command (including AFS, which is not covered here).

To check your quotas, please run:

```bash
merlin_quotas
```

## Merlin6 directories

Merlin6 offers the following directory classes for users:

* ``/psi/home/<username>``: Private user **home** directory
* ``/data/user/<username>``: Private user **data** directory
* ``/data/project/general/<projectname>``: Shared **project** directory
  * For BIO experiments, a dedicated ``/data/project/bio/$projectname`` exists.
* ``/scratch``: Local *scratch* disk (only visible to the node running a job).
* ``/shared-scratch``: Shared *scratch* disk (visible from all nodes).
* ``/export``: Export directory for data transfer, visible from `ra-merlin-01.psi.ch`, `ra-merlin-02.psi.ch` and the Merlin login nodes.
  * Refer to **[Transferring Data](/merlin6/transfer-data.html)** for more information about the export area and the data transfer service.

{{site.data.alerts.tip}}In GPFS there is a concept called <b>GraceTime</b>. Filesystems have a block (amount of data) quota and a file (number of files) quota.
Each quota has a soft and a hard limit. Once the soft limit is reached, users can keep writing up to their hard limit during the <b>grace period</b>.
Once the <b>GraceTime</b> or the hard limit is reached, users will be unable to write and will need to remove data to get below the soft limit (or ask for a quota increase
where this is possible, see the table below).
{{site.data.alerts.end}}

Properties of the directory classes:

| Directory                          | Block Quota [Soft:Hard] | File Quota [Soft:Hard] | GraceTime | Quota Change Policy: Block               | Quota Change Policy: Files       | Backup | Backup Policy                  |
| ---------------------------------- | ----------------------- | ---------------------- | :-------: | :--------------------------------------- | :------------------------------- | ------ | :----------------------------- |
| /psi/home/$username                | USR [10GB:11GB]         | *Undef*                | N/A       | Up to x2 when strongly justified.        | N/A                              | yes    | Daily snapshots for 1 week     |
| /data/user/$username               | USR [1TB:1.074TB]       | USR [1M:1.1M]          | 7d        | Immutable. Needs a project.              | Changeable when justified.       | no     | Users responsible for backup   |
| /data/project/bio/$projectname     | GRP [1TB:1.074TB]       | GRP [1M:1.1M]          | 7d        | Subject to project requirements.         | Subject to project requirements. | no     | Project responsible for backup |
| /data/project/general/$projectname | GRP [1TB:1.074TB]       | GRP [1M:1.1M]          | 7d        | Subject to project requirements.         | Subject to project requirements. | no     | Project responsible for backup |
| /scratch                           | *Undef*                 | *Undef*                | N/A       | N/A                                      | N/A                              | no     | N/A                            |
| /shared-scratch                    | USR [512GB:2TB]         | USR [2M:2.5M]          | 7d        | Up to x2 when strongly justified.        | Changeable when justified.       | no     | N/A                            |
| /export                            | USR [10MB:20TB]         | USR [512K:5M]          | 10d       | Soft quota can be temporarily increased. | Changeable when justified.       | no     | N/A                            |

{{site.data.alerts.warning}}The use of <b>scratch</b> and <b>export</b> areas as an extension of the quota <i>is forbidden</i>. The <b>scratch</b> and <b>export</b> areas <i>must not contain</i> final data.
<br><b><i>Auto cleanup policies</i></b> are applied in the <b>scratch</b> and <b>export</b> areas.
{{site.data.alerts.end}}

### User home directory

This is the default directory users will land in when logging in to any Merlin6 machine.
It is intended for your scripts, documents, software development, and other files which
you want to have backed up. Do not use it for data storage or I/O-hungry HPC tasks.

This directory is mounted on the login and computing nodes under the path:

```bash
/psi/home/$username
```

Home directories are part of the PSI NFS Central Home storage provided by AIT and
are managed by the Merlin6 administrators.

Users can check their quota by running the following command:

```bash
quota -s
```

#### Home directory policy

* Read **Important: Code of Conduct** for more information about Merlin6 policies.
* It is **forbidden** to use the home directories for I/O-intensive tasks.
  * Use ``/scratch``, ``/shared-scratch``, ``/data/user`` or ``/data/project`` for this purpose.
* Users can retrieve up to 1 week of lost data thanks to the automatic **daily snapshots kept for 1 week**.
  Snapshots can be accessed at this path:

```bash
/psi/home/.snapshot/$username
```

### User data directory

The user data directory is intended for *fast I/O access* and for keeping large amounts of private data.
This directory is mounted on the login and computing nodes under the directory

```bash
/data/user/$username
```

Users can check their quota by running the following command:

```bash
mmlsquota -u <username> --block-size auto merlin-user
```

#### User data directory policy

* Read **Important: Code of Conduct** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during a job's runtime.
  * Use ``/scratch`` or ``/shared-scratch`` for this purpose.
* No backup policy is applied to user data directories: users are responsible for backing up their data.

### Project data directory

This storage is intended for *fast I/O access* and for keeping large amounts of a project's data, where the data can also be
shared by all members of the project (the project's corresponding unix group). We recommend keeping most data in
project-related storage spaces, since this allows users to coordinate. Also, project spaces have more flexible policies
regarding extending the available storage space.

Experiments can request a project space as described in **[Accessing Merlin -> Requesting a Project](/merlin6/request-project.html)**.

Once created, the project data directory will be mounted on the login and computing nodes under the directory:

```bash
/data/project/general/$projectname
```

Project quotas are defined on a per-*group* basis. Users can check the project quota by running the following command:

```bash
mmlsquota -j $projectname --block-size auto -C merlin.psi.ch merlin-proj
```

#### Project directory policy

* Read **Important: Code of Conduct** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during a job's runtime, i.e. for high-throughput I/O on a job's temporary files. Please use ``/scratch`` or ``/shared-scratch`` for this purpose.
* No backups: users are responsible for managing the backups of their data directories.

### Scratch directories

There are two different types of scratch storage: **local** (``/scratch``) and **shared** (``/shared-scratch``).

**Local** scratch should be used for all jobs that do not require the scratch files to be accessible from multiple nodes, which is trivially
true for all jobs running on a single node.
**Shared** scratch is intended for files that need to be accessible from multiple nodes, e.g. by an MPI job whose tasks are spread out over the cluster
and all need to do I/O on the same temporary files.

**Local** scratch on the Merlin6 computing nodes provides a huge number of IOPS thanks to NVMe technology. **Shared** scratch is implemented on a distributed parallel filesystem (GPFS), resulting in higher latency, since it involves remote storage resources and more complex I/O coordination.

``/shared-scratch`` is only mounted on the *Merlin6* computing nodes (i.e. not on the login nodes), and its current size is 50TB. This can be increased in the future.

The properties of the available scratch storage spaces are given in the following table:

| Cluster | Service        | Scratch      | Scratch Mountpoint | Shared Scratch | Shared Scratch Mountpoint | Comments                               |
| ------- | -------------- | ------------ | ------------------ | -------------- | ------------------------- | -------------------------------------- |
| merlin5 | computing node | 50GB / SAS   | ``/scratch``       | ``N/A``        | ``N/A``                   | ``merlin-c-[01-64]``                   |
| merlin6 | login node     | 100GB / SAS  | ``/scratch``       | 50TB / GPFS    | ``/shared-scratch``       | ``merlin-l-0[1,2]``                    |
| merlin6 | computing node | 1.3TB / NVMe | ``/scratch``       | 50TB / GPFS    | ``/shared-scratch``       | ``merlin-c-[001-024,101-124,201-224]`` |
| merlin6 | login node     | 2.0TB / NVMe | ``/scratch``       | 50TB / GPFS    | ``/shared-scratch``       | ``merlin-l-00[1,2]``                   |

#### Scratch directories policy

* Read **[Important: Code of Conduct](#important-code-of-conduct)** for more information about the Merlin6 policies.
* By default, *always* use **local** scratch first, and only use **shared** if your specific use case requires it.
* Temporary files *must be deleted at the end of the job by the user*.
  * Remaining files will be deleted by the system if detected.
* Files not accessed within 28 days will be automatically cleaned up by the system.
* If the scratch areas fill up, admins have the right to clean up the oldest data.
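The cleanup duty from the scratch policy can be automated in the job script itself. A minimal sketch, assuming a per-job directory layout under `/scratch` (the layout and variable names are hypothetical; the `mktemp` fallback is only there so the snippet also runs outside Slurm):

```shell
#!/bin/bash
# Create a per-job scratch directory and guarantee its removal when the
# job script exits, as required by the scratch policy.
if [ -n "$SLURM_JOB_ID" ] && [ -d /scratch ]; then
    JOB_SCRATCH="/scratch/$USER/$SLURM_JOB_ID"   # hypothetical layout
else
    JOB_SCRATCH=$(mktemp -d)                     # fallback outside Slurm
fi
mkdir -p "$JOB_SCRATCH"
trap 'rm -rf "$JOB_SCRATCH"' EXIT                # fires on normal exit and on errors

# ... job commands writing temporary files to $JOB_SCRATCH ...
echo "scratch at $JOB_SCRATCH"
```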
### Export directory

The export directory is exclusively intended for transferring data between outside PSI and Merlin. It is a temporary directory with an auto-cleanup policy.
Please read **[Transferring Data](/merlin6/transfer-data.html)** for more information.

#### Export directory policy

* Temporary files *must be deleted by the user once the transfer is complete*.
  * Remaining files will be deleted by the system if detected.
* Files not accessed within 28 days will be automatically cleaned up by the system.
* If the export area fills up, admins have the right to clean up the oldest data.
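To see which of your files are approaching the 28-day limit, `find` with `-mtime` gives a rough check. Note that `-mtime` tests modification rather than access time, so treat it as an approximation; the directory below is a local stand-in for your export area.

```shell
# Local stand-in for the export area:
exportdir=$(mktemp -d)
touch "$exportdir/fresh.dat"
touch -d '40 days ago' "$exportdir/stale.dat"   # GNU touch date syntax

# Files last modified more than 28 days ago (candidates for cleanup):
find "$exportdir" -type f -mtime +28
```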
---

<!-- docs/merlin6/02-How-To-Use-Merlin/transfer-data.md -->

---
title: Transferring Data
#tags:
keywords: transferring data, data transfer, rsync, winscp, copy data, copying, sftp, import, export, hopx, vpn
last_updated: 24 August 2023
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/transfer-data.html
---
## Overview

Most methods allow data to be either transmitted or received, so it may make sense to
initiate the transfer from either Merlin or the other system, depending on the network
visibility.

- Merlin login nodes are visible from the PSI network, so direct data transfer
  (rsync/WinSCP) is generally preferable. This can be initiated from either endpoint.
- Merlin login nodes can access the internet using a limited set of protocols:
  - SSH-based protocols using port 22 (rsync-over-ssh, sftp, WinSCP, etc.)
  - HTTP-based protocols using ports 80 or 443 (https, WebDAV, etc.)
  - Protocols using other ports require admin configuration and may only work with
    specific hosts (ftp, rsync daemons, etc.)
- Systems on the internet can access the [PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer) service
  `datatransfer.psi.ch`, using SSH-based protocols and [Globus](https://www.globus.org/).
## Direct transfer via Merlin6 login nodes

The following methods transfer data directly via the [login
nodes](/merlin6/interactive.html#login-nodes-hardware-description). They are suitable
for use from within the PSI network.

### Rsync

Rsync is the preferred method for transferring data from Linux/macOS. It allows
transfers to be easily resumed if they get interrupted. The general syntax is:

```bash
rsync -avAHXS <src> <dst>
```

For example, to transfer files from your local computer to a Merlin project
directory:

```bash
rsync -avAHXS ~/localdata user@merlin-l-01.psi.ch:/data/project/general/myproject/
```

You can resume interrupted transfers by simply rerunning the command. Previously
transferred files will be skipped.
### WinSCP

The WinSCP tool can be used for remote file transfer on Windows. It is available
from the Software Kiosk on PSI machines. Add `merlin-l-01.psi.ch` as a host and
connect with your PSI credentials. You can then drag-and-drop files between your
local computer and Merlin.

### SWITCHfilesender

**[SWITCHfilesender](https://filesender.switch.ch/filesender2/?s=upload)** is an installation of the FileSender project (filesender.org), a web-based application that allows authenticated users to securely and easily send arbitrarily large files to other users.

User authentication is provided through SimpleSAMLphp, supporting SAML2, LDAP, RADIUS and more. Users without an account can be sent an upload voucher by an authenticated user. FileSender is developed to the requirements of the higher education and research community.

The purpose of the software is to send a large file to someone, have that file available for download for a certain number of downloads and/or a certain amount of time, and after that automatically delete the file. The software is not intended as a permanent file publishing platform.

**[SWITCHfilesender](https://filesender.switch.ch/filesender2/?s=upload)** is fully integrated with PSI: PSI employees can log in with their PSI account (through the Authentication and Authorization Infrastructure / AAI, by selecting PSI as the institution to log in with).
## PSI Data Transfer

Since August 2024, Merlin is connected to the **[PSI Data Transfer](https://www.psi.ch/en/photon-science-data-services/data-transfer)** service,
`datatransfer.psi.ch`. This is a central service managed by the **[Linux team](https://linux.psi.ch/index.html)**. However, any problems or questions related to it can be directly
[reported](/merlin6/contact.html) to the Merlin administrators, who will forward the request if necessary.

The PSI Data Transfer servers support the following protocols:

* Data Transfer - SSH (scp / rsync)
* Data Transfer - Globus

Notice that `datatransfer.psi.ch` does not allow SSH login; only `rsync`, `scp` and [Globus](https://www.globus.org/) access is allowed.

The following filesystems are mounted:

* `/merlin/export`, which points to the `/export` directory in Merlin.
* `/merlin/data/experiment/mu3e`, which points to the `/data/experiment/mu3e` directories in Merlin.
  * Mu3e sub-directories are mounted RW (read-write), except for `data` (mounted read-only).
* `/merlin/data/project/general`, which points to the `/data/project/general` directories in Merlin.
  * Owners of Merlin projects should request explicit access to it.
  * Currently, only `CSCS` is available for transferring files between PizDaint/Alps and Merlin.
* `/merlin/data/project/bio`, which points to the `/data/project/bio` directories in Merlin.
* `/merlin/data/user`, which points to the `/data/user` directories in Merlin.

Access to the PSI Data Transfer uses ***multi-factor authentication*** (MFA).
Therefore, the Microsoft Authenticator App is required, as explained [here](https://www.psi.ch/en/computing/change-to-mfa).

{{site.data.alerts.tip}}Please follow the
<b><a href="https://www.psi.ch/en/photon-science-data-services/data-transfer">Official PSI Data Transfer</a></b> documentation for further instructions.
{{site.data.alerts.end}}
### Directories

#### /merlin/data/user

User data directories are mounted RW (read-write).

{{site.data.alerts.warning}}Please <b>ensure properly secured permissions</b> on your '/data/user'
directory. By default, when the directory is created, the system applies the most restrictive
permissions. However, this does not prevent users from changing permissions if they wish. At that
point, users become responsible for those changes.
{{site.data.alerts.end}}

#### /merlin/export

Transferring big amounts of data from outside PSI to Merlin is always possible through `/export`.

{{site.data.alerts.tip}}<b>The '/export' directory can be used by any Merlin user.</b>
It is configured in read/write mode. If you need access, please contact the Merlin administrators.
{{site.data.alerts.end}}

{{site.data.alerts.warning}}Using <b>export</b> as an extension of your quota <i>is forbidden</i>.
<br><b><i>Auto-cleanup policies</i></b> in the <b>export</b> area apply to files older than 28 days.
{{site.data.alerts.end}}

##### Exporting data from Merlin

To export data from Merlin to outside PSI by using `/export`:
* From a Merlin login node, copy your data from any directory (i.e. `/data/project`, `/data/user`, `/scratch`) to
  `/export`. Ensure you properly secure your directories and files with proper permissions.
* Once the data is copied, from **`datatransfer.psi.ch`**, copy the data from `/merlin/export` to outside PSI.

##### Importing data to Merlin

To import data from outside PSI to Merlin by using `/export`:
* From **`datatransfer.psi.ch`**, copy the data from outside PSI to `/merlin/export`.
  Ensure you properly secure your directories and files with proper permissions.
* Once the data is copied, from a Merlin login node, copy your data from `/export` to any directory (i.e. `/data/project`, `/data/user`, `/scratch`).

#### Request access to your project directory

Optionally, instead of using `/export`, Merlin project owners can request read/write or read-only access to their project directory.

{{site.data.alerts.tip}}<b>Merlin projects can request direct access.</b>
This can be configured in read/write or read-only mode. If your project needs access, please contact the Merlin administrators.
{{site.data.alerts.end}}
## Connecting to Merlin6 from outside PSI

Merlin6 is fully accessible from within the PSI network. To connect from outside you can use:

- [VPN](https://www.psi.ch/en/computing/vpn) ([alternate instructions](https://intranet.psi.ch/BIO/ComputingVPN))
- [SSH hopx](https://www.psi.ch/en/computing/ssh-hop)
  * Please avoid transferring big amounts of data through **hopx**
- [NoMachine](nomachine.md)
  * Remote interactive access through [**'rem-acc.psi.ch'**](https://www.psi.ch/en/photon-science-data-services/remote-interactive-access)
  * Please avoid transferring big amounts of data through **NoMachine**
## Connecting from Merlin6 to outside file shares

### `merlin_rmount` command

Merlin provides a command for mounting remote file systems, called `merlin_rmount`. It
is a helpful wrapper over the GNOME storage utilities and supports a wide range of remote file systems, including:

- SMB/CIFS (Windows shared folders)
- WebDAV
- AFP
- FTP, SFTP
- [others](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/using_the_desktop_environment_in_rhel_8/managing-storage-volumes-in-gnome_using-the-desktop-environment-in-rhel-8#gvfs-back-ends_managing-storage-volumes-in-gnome)

[More instructions on using `merlin_rmount`](/merlin6/merlin-rmount.html)
<!-- docs/merlin6/02-How-To-Use-Merlin/using-modules.md -->

---
title: Using PModules
#tags:
keywords: Pmodules, software, stable, unstable, deprecated, overlay, overlays, release stage, module, package, packages, library, libraries
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/using-modules.html
---
## Environment Modules

On top of the operating system stack we provide different software using the PSI-developed PModules system.

PModules is the officially supported way to provide software, and each package is deployed by a specific expert. Software that is used by many people will usually be found in PModules.

If you miss a package/version, or a software with a specific missing feature, contact us. We will evaluate whether it is feasible to install it.
## Module release stages

Three different **release stages** are available in PModules, ensuring proper software life cycling: **`unstable`**, **`stable`** and **`deprecated`**.

### Unstable release stage

The **`unstable`** release stage contains *unstable* releases of software. Software compilations here are usually under development or not fully production-ready.

This release stage is **not directly visible** to end users and needs to be explicitly invoked as follows:

```bash
module use unstable
```

Once software is validated and considered production-ready, it is moved to the `stable` release stage.

### Stable release stage

The **`stable`** release stage contains *stable* releases of software, which have been thoroughly tested and are fully supported.

This is the ***default*** release stage and is visible by default. Whenever possible, users are strongly advised to use packages from this release stage.

### Deprecated release stage

The **`deprecated`** release stage contains *deprecated* releases of software. Software in this release stage is usually deprecated or discontinued by its developers.
Also, minor versions or redundant compilations are moved here as long as a valid copy exists in the *stable* repository.

This release stage is **not directly visible** to users and needs to be explicitly invoked as follows:

```bash
module use deprecated
```

However, software moved to this release stage can still be loaded directly without invoking the release stage. This ensures proper life cycling of the software while keeping it transparent for end users.
## Module overlays

Recent PModules releases contain a feature called **PModules overlays**. In Merlin, overlays are used to source software from a different location.
That way, we can have custom private versions of software in the cluster, installed on high-performance storage accessed over a low-latency network.

**PModules overlays** are still ***under development***, so consider that *some features may not work, or may not work as expected*.

PModules overlays can be used from PModules `v1.1.5` onwards. However, Merlin runs PModules `v1.0.0rc10` as the default version.
Therefore, one first needs to load a newer version: this is available in the repositories and can be loaded with the **`module load Pmodules/$version`** command.

Once running the proper PModules version, **overlays** are added (or invoked) with the **`module use $overlay_name`** command.

### overlay_merlin

Some Merlin software is already provided through **PModules overlays** and has been validated for use in that way.
Therefore, Merlin contains an overlay called **`overlay_merlin`**. In this overlay, the software is installed on the Merlin high-performance storage,
specifically in the ``/data/software/pmodules`` directory. In general, if another copy exists in the standard repository, we strongly recommend using
the replica in the `overlay_merlin` overlay instead, as it provides faster access and may also provide some customizations for the Merlin6 cluster.

To load `overlay_merlin`, please run:

```bash
module load Pmodules/1.1.6 # Or newer version
module use overlay_merlin
```

Then, once `overlay_merlin` is invoked, central software installations with the same version (if they exist) are hidden and replaced
by the local ones in Merlin. Releases from the central PModules repository which do not have a copy in the Merlin overlay will remain
visible. For example, for each ANSYS release, one can identify where it is installed by searching for ANSYS in PModules with the `--verbose`
option. This shows the location of the different ANSYS releases as follows:
* For ANSYS releases installed in the central repositories, the path starts with `/opt/psi`
* For ANSYS releases installed in the Merlin6 repository (and/or overriding the central ones), the path starts with `/data/software/pmodules`

```bash
(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# module load Pmodules/1.1.6
module load: unstable module has been loaded -- Pmodules/1.1.6

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# module use overlay_merlin

(base) ❄ [caubet_m@merlin-l-001:/data/user/caubet_m]# module search ANSYS --verbose

Module          Rel.stage  Group  Dependencies/Modulefile
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ANSYS/2019R3    stable     Tools  dependencies:
                                  modulefile:   /data/software/pmodules/Tools/modulefiles/ANSYS/2019R3
ANSYS/2020R1    stable     Tools  dependencies:
                                  modulefile:   /opt/psi/Tools/modulefiles/ANSYS/2020R1
ANSYS/2020R1-1  stable     Tools  dependencies:
                                  modulefile:   /opt/psi/Tools/modulefiles/ANSYS/2020R1-1
ANSYS/2020R2    stable     Tools  dependencies:
                                  modulefile:   /data/software/pmodules/Tools/modulefiles/ANSYS/2020R2
ANSYS/2021R1    stable     Tools  dependencies:
                                  modulefile:   /data/software/pmodules/Tools/modulefiles/ANSYS/2021R1
ANSYS/2021R2    stable     Tools  dependencies:
                                  modulefile:   /data/software/pmodules/Tools/modulefiles/ANSYS/2021R2
```
## PModules commands

Below is a summary of all available commands:

```bash
module use                        # show all available PModules Software Groups as well as Release Stages
module avail                      # see the list of available software packages provided via PModules
module use unstable               # get access to a set of packages not fully tested by the community
module load <package>/<version>   # load a specific software package with a specific version
module search <string>            # search for a specific software package and its dependencies
module list                       # list which software is loaded in your environment
module purge                      # unload all loaded packages and clean up the environment
```
### module use/unuse

Without any parameter, `use` **lists** all available PModules **Software Groups and Release Stages**:

```bash
module use
```

When followed by a parameter, `use`/`unuse` invokes/uninvokes a PModules **Software Group** or **Release Stage**:

```bash
module use EM         # Invokes the 'EM' software group
module unuse EM       # Uninvokes the 'EM' software group
module use unstable   # Invokes the 'unstable' release stage
module unuse unstable # Uninvokes the 'unstable' release stage
```

### module avail

This option **lists** all available PModules **Software Groups and their packages**.

Please run `module avail --help` for further listing options.
### module search

This is used to **search** for **software packages**. By default, if no **Release Stage** or **Software Group** is specified
in the options of the `module search` command, it searches the already invoked *Software Groups* and *Release Stages*.
Direct package dependencies are also shown.

```bash
(base) [caubet_m@merlin-l-001 caubet_m]$ module search openmpi/4.0.5_slurm

Module                 Release    Group      Requires
---------------------------------------------------------------------------
openmpi/4.0.5_slurm    stable     Compiler   gcc/8.4.0
openmpi/4.0.5_slurm    stable     Compiler   gcc/9.2.0
openmpi/4.0.5_slurm    stable     Compiler   gcc/9.3.0
openmpi/4.0.5_slurm    stable     Compiler   intel/20.4

(base) [caubet_m@merlin-l-001 caubet_m]$ module load intel/20.4 openmpi/4.0.5_slurm
```

Please run `module search --help` for further search options.
### module load/unload

This loads/unloads specific software packages. Packages may have direct dependencies that need to be loaded first; other dependencies
will be loaded automatically.

In the example below, the ``openmpi/4.0.5_slurm`` package will be loaded, but ``gcc/9.3.0`` must be loaded as well, as it is a strict dependency. Direct dependencies must be loaded in advance. Users can load multiple packages one by one or all at once. This can be useful, for instance, when loading a package with direct dependencies.

```bash
# Single line
module load gcc/9.3.0 openmpi/4.0.5_slurm

# Multiple lines
module load gcc/9.3.0
module load openmpi/4.0.5_slurm
```

#### module purge

This command is an alternative to `module unload`, which can be used to unload **all** loaded modulefiles:

```bash
module purge
```
## When to request new PModules packages

### Missing software

If you don't find a specific software package, and you know other people are interested in it, it can be installed in PModules. Please contact us
and we will try to help with that. Deploying new software in PModules may take a few days.

Usually the installation of new software is possible as long as a few users will use it. If you are interested in maintaining this software,
please let us know.

### Missing version

If the existing PModules versions of a specific package do not fit your needs, it is possible to ask for a new version.

Usually the installation of newer versions will be supported, as long as a few users will use it. Installation of intermediate versions can
be supported if strictly justified.
<!-- docs/merlin6/03-Slurm-General-Documentation/interactive-jobs.md -->

---
title: Running Interactive Jobs
#tags:
keywords: interactive, X11, X, srun, salloc, job, jobs, slurm, nomachine, nx
last_updated: 07 September 2022
summary: "This document describes how to run interactive jobs as well as X based software."
sidebar: merlin6_sidebar
permalink: /merlin6/interactive-jobs.html
---
## Running interactive jobs

There are two different ways to run interactive jobs in Slurm, using
the ``salloc`` and ``srun`` commands:

* **``salloc``**: obtains a Slurm job allocation (a set of nodes), executes command(s), and releases the allocation when the command is finished.
* **``srun``**: is used for running parallel tasks.
### srun

``srun`` is used to run parallel jobs in the batch system. It can be used within a batch script
(which can be run with ``sbatch``) or within a job allocation (which can be obtained with ``salloc``).
It can also be used as a direct command (for example, from the login nodes).

When used inside a batch script or during a job allocation, ``srun`` is constrained to the
amount of resources allocated by the ``sbatch``/``salloc`` commands. In ``sbatch``, these resources
are usually defined inside the batch script in the format ``#SBATCH <option>=<value>``.
In other words, if your batch script or allocation defines 88 tasks (with 1 thread per core)
and 2 nodes, ``srun`` is constrained to that amount of resources (you can use less, but never
exceed those limits).

When used from the login node, it usually runs a specific command or software
interactively. ``srun`` is a blocking process (it blocks the bash prompt until the ``srun``
command finishes, unless you run it in the background with ``&``). This can be very useful to run
interactive software which pops up a window and then submits jobs or runs sub-tasks in the
background (for example, **Relion**, **cisTEM**, etc.)

Refer to ``man srun`` to explore all possible options for that command.

<details>
<summary>[Show 'srun' example]: Running 'hostname' command on 3 nodes, using 2 cores (1 task/core) per node</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --ntasks=6 --ntasks-per-node=2 --nodes=3 hostname
srun: job 135088230 queued and waiting for resources
srun: job 135088230 has been allocated resources
merlin-c-102.psi.ch
merlin-c-102.psi.ch
merlin-c-101.psi.ch
merlin-c-101.psi.ch
merlin-c-103.psi.ch
merlin-c-103.psi.ch
</pre>
</details>
### salloc
|
||||
|
||||
**``salloc``** is used to obtain a Slurm job allocation (a set of nodes). Once job is allocated,
|
||||
users are able to execute interactive command(s). Once finished (``exit`` or ``Ctrl+D``),
|
||||
the allocation is released. **``salloc``** is a blocking command, it is, command will be blocked
|
||||
until the requested resources are allocated.
|
||||
|
||||
When running **``salloc``**, once the resources are allocated, *by default* the user will get
|
||||
a ***new shell on one of the allocated resources*** (if a user has requested few nodes, it will
|
||||
prompt a new shell on the first allocated node). However, this behaviour can be changed by adding
|
||||
a shell (`$SHELL`) at the end of the `salloc` command. In example:
|
||||
|
||||
```bash
|
||||
# Typical 'salloc' call
|
||||
# - Same as running:
|
||||
# 'salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL'
|
||||
salloc --clusters=merlin6 -N 2 -n 2
|
||||
|
||||
# Custom 'salloc' call
|
||||
# - $SHELL will open a local shell on the login node from where ``salloc`` is running
|
||||
salloc --clusters=merlin6 -N 2 -n 2 $SHELL
|
||||
```
|
||||
|
||||
<details>
|
||||
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>Default</i></summary>
|
||||
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
||||
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2
|
||||
salloc: Pending job allocation 135171306
|
||||
salloc: job 135171306 queued and waiting for resources
|
||||
salloc: job 135171306 has been allocated resources
|
||||
salloc: Granted job allocation 135171306
|
||||
|
||||
(base) [caubet_m@merlin-c-213 ~]$ srun hostname
|
||||
merlin-c-213.psi.ch
|
||||
merlin-c-214.psi.ch
|
||||
|
||||
(base) [caubet_m@merlin-c-213 ~]$ exit
|
||||
exit
|
||||
salloc: Relinquishing job allocation 135171306
|
||||
|
||||
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL
|
||||
salloc: Pending job allocation 135171342
|
||||
salloc: job 135171342 queued and waiting for resources
|
||||
salloc: job 135171342 has been allocated resources
|
||||
salloc: Granted job allocation 135171342
|
||||
|
||||
(base) [caubet_m@merlin-c-021 ~]$ srun hostname
|
||||
merlin-c-021.psi.ch
|
||||
merlin-c-022.psi.ch
|
||||
|
||||
(base) [caubet_m@merlin-c-021 ~]$ exit
|
||||
exit
|
||||
salloc: Relinquishing job allocation 135171342
|
||||
</pre>
|
||||
</details>

<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>$SHELL</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-export-01 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2 $SHELL
salloc: Pending job allocation 135171308
salloc: job 135171308 queued and waiting for resources
salloc: job 135171308 has been allocated resources
salloc: Granted job allocation 135171308

(base) [caubet_m@merlin-export-01 ~]$ srun hostname
merlin-c-218.psi.ch
merlin-c-117.psi.ch

(base) [caubet_m@merlin-export-01 ~]$ exit
exit
salloc: Relinquishing job allocation 135171308
</pre>
</details>

## Running interactive jobs with X11 support

### Requirements

#### Graphical access

[NoMachine](/merlin6/nomachine.html) is the officially supported service for graphical
access to the Merlin cluster. This service runs on the login nodes. Check the
document [{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for details about
how to connect to the **NoMachine** service in the Merlin cluster.

For other, not officially supported graphical access (X11 forwarding):

* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html)

### 'srun' with X11 support

The Merlin5 and Merlin6 clusters allow running window-based applications. For that, add
the option ``--x11`` to the ``srun`` command. For example:

```bash
srun --clusters=merlin6 --x11 xclock
```

will pop up an X11-based clock.

In the same manner, you can create a bash shell with X11 support. To do that, add
the option ``--pty`` to the ``srun --x11`` command. Once the resources are allocated,
you can interactively run both X11 and non-X11 based commands.

```bash
srun --clusters=merlin6 --x11 --pty bash
```

<details>
<summary>[Show 'srun' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 xclock
srun: job 135095591 queued and waiting for resources
srun: job 135095591 has been allocated resources

(base) [caubet_m@merlin-l-001 ~]$

(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 --pty bash
srun: job 135095592 queued and waiting for resources
srun: job 135095592 has been allocated resources

(base) [caubet_m@merlin-c-205 ~]$ xclock

(base) [caubet_m@merlin-c-205 ~]$ echo "This was an example"
This was an example

(base) [caubet_m@merlin-c-205 ~]$ exit
exit
</pre>
</details>

### 'salloc' with X11 support

The **Merlin5** and **Merlin6** clusters allow running window-based applications. For that, add
the option ``--x11`` to the ``salloc`` command. For example:

```bash
salloc --clusters=merlin6 --x11 xclock
```

will pop up an X11-based clock.

In the same manner, you can create a bash shell with X11 support. To do that, simply run
``salloc --clusters=merlin6 --x11``. Once the resources are allocated, you can
interactively run both X11 and non-X11 based commands.

```bash
salloc --clusters=merlin6 --x11
```

<details>
<summary>[Show 'salloc' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11 xclock
salloc: Pending job allocation 135171355
salloc: job 135171355 queued and waiting for resources
salloc: job 135171355 has been allocated resources
salloc: Granted job allocation 135171355
salloc: Relinquishing job allocation 135171355

(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11
salloc: Pending job allocation 135171349
salloc: job 135171349 queued and waiting for resources
salloc: job 135171349 has been allocated resources
salloc: Granted job allocation 135171349
salloc: Waiting for resource configuration
salloc: Nodes merlin-c-117 are ready for job

(base) [caubet_m@merlin-c-117 ~]$ xclock

(base) [caubet_m@merlin-c-117 ~]$ echo "This was an example"
This was an example

(base) [caubet_m@merlin-c-117 ~]$ exit
exit
salloc: Relinquishing job allocation 135171349
</pre>
</details>
---
title: Monitoring
#tags:
keywords: monitoring, jobs, slurm, job status, squeue, sinfo, sacct
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/monitoring.html
---

## Slurm Monitoring

### Job status

The status of submitted jobs can be checked with the ``squeue`` command:

```bash
squeue -u $username
```

Common statuses:

* **merlin-\***: Running on the specified host
* **(Priority)**: Waiting in the queue
* **(Resources)**: At the head of the queue, waiting for machines to become available
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: The job would exceed the per-user limits on
  the number of simultaneous CPUs/nodes. Use `scancel` to remove the job and
  resubmit with fewer resources, or else wait for your other jobs to finish.
* **(PartitionNodeLimit)**: The job exceeds the resources available on this partition.
  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
  resources.

Check the **man** pages (``man squeue``) for all possible options of this command.

<details>
<summary>[Show 'squeue' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
134332544 general spawner- feichtin R 5-06:47:45 1 merlin-c-204
134321376 general subm-tal feichtin R 5-22:27:59 1 merlin-c-204
</pre>
</details>

### Partition status

The status of the nodes and partitions (a.k.a. queues) can be seen with the ``sinfo`` command:

```bash
sinfo
```

Check the **man** pages (``man sinfo``) for all possible options of this command.

<details>
<summary>[Show 'sinfo' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-l-001 ~]# sinfo -l
Thu Jan 23 16:34:49 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
test up 1-00:00:00 1-infinite no NO all 3 mixed merlin-c-[024,223-224]
test up 1-00:00:00 1-infinite no NO all 2 allocated merlin-c-[123-124]
test up 1-00:00:00 1-infinite no NO all 1 idle merlin-c-023
general* up 7-00:00:00 1-50 no NO all 6 mixed merlin-c-[007,204,207-209,219]
general* up 7-00:00:00 1-50 no NO all 57 allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
general* up 7-00:00:00 1-50 no NO all 3 idle merlin-c-[006,021-022]
daily up 1-00:00:00 1-60 no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
daily up 1-00:00:00 1-60 no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
daily up 1-00:00:00 1-60 no NO all 4 idle merlin-c-[006,021-023]
hourly up 1:00:00 1-infinite no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
hourly up 1:00:00 1-infinite no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
hourly up 1:00:00 1-infinite no NO all 4 idle merlin-c-[006,021-023]
gpu up 7-00:00:00 1-infinite no NO all 1 mixed merlin-g-007
gpu up 7-00:00:00 1-infinite no NO all 8 allocated merlin-g-[001-006,008-009]
</pre>
</details>

### Slurm commander

The **[Slurm Commander (scom)](https://github.com/CLIP-HPC/SlurmCommander/)** is a simple but very useful open-source text-based user interface for
efficient interaction with Slurm. It is developed by the **Cloud Infrastructure Project (CLIP-HPC)** with external contributions. To use it,
simply run one of the following commands:

```bash
scom                          # merlin6 cluster
SLURM_CLUSTERS=merlin5 scom   # merlin5 cluster
SLURM_CLUSTERS=gmerlin6 scom  # gmerlin6 cluster
scom -h                       # Help and extra options
scom -d 14                    # Set Job History to 14 days (instead of default 7)
```

With this simple interface, users can interact with their jobs, as well as get information about past and present jobs:
* Filtering jobs by substring is possible with the `/` key.
* Users can perform multiple actions on their jobs (such as cancelling, holding, or requeueing a job), SSH to a node with an already running job,
  or get extended details and statistics of the job itself.

Users can also check the status of the cluster to get statistics and node usage information, as well as information about node properties.

The interface also provides a few job templates for different use cases (e.g. MPI, OpenMP, Hybrid, single core). Users can modify these templates,
save them locally to the current directory, and submit the job to the cluster.

{{site.data.alerts.note}}Currently, <span style="color:darkblue;">scom</span> does not provide live updated information for the <span style="color:darkorange;">[Job History]</span> tab.
To update the Job History information, users have to exit the application with the <span style="color:darkorange;">q</span> key. Other tabs are updated every 5 seconds (default).
In addition, the <span style="color:darkorange;">[Job History]</span> tab contains information for the <b>merlin6</b> CPU cluster only. Future updates will provide information
for other clusters.
{{site.data.alerts.end}}

For further information about how to use **scom**, please refer to the **[Slurm Commander Project webpage](https://github.com/CLIP-HPC/SlurmCommander/)**.

### Job accounting

Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
This command is very flexible and can provide a lot of information. To check all the available options, please read `man sacct`.
Below we summarize some examples that can be useful for users:

```bash
# Today's jobs, basic summary
sacct

# Today's jobs, with details
sacct --long

# Jobs since January 1, 2021, 12:00, with details
sacct -S 2021-01-01T12:00:00 --long

# Specific job accounting
sacct --long -j $jobid

# Jobs custom details, without steps (-X)
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80

# Jobs custom details, with steps
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
```
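For scripted post-processing, `sacct` can also produce machine-readable output via `--parsable2` (`-P`), which delimits fields with `|`. A minimal sketch that tallies jobs per state; the sample records below are hypothetical, standing in for what `sacct -n -X -P --format=JobID,State` would print on the cluster:

```bash
#!/bin/bash
# Hypothetical sample of `sacct -n -X -P --format=JobID,State` output
# (job IDs are made up for illustration).
sample='135171306|COMPLETED
135171342|FAILED
135171355|COMPLETED'

# Count jobs per state from the pipe-delimited records.
echo "$sample" | awk -F'|' '{count[$2]++} END {for (s in count) print s, count[s]}' | sort
```

On a real login node you would pipe `sacct` straight into the `awk` stage instead of using a sample string.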

### Job efficiency

Users can check how efficient their jobs are with the ``seff`` command:

```bash
seff $jobid
```

<details>
<summary>[Show 'seff' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-slurmctld01 ~]# seff 134333893
Job ID: 134333893
Cluster: merlin6
User/Group: albajacas_a/unx-sls
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:26:15
CPU Efficiency: 49.47% of 00:53:04 core-walltime
Job Wall-clock time: 00:06:38
Memory Utilized: 60.73 MB
Memory Efficiency: 0.19% of 31.25 GB
</pre>
</details>
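The CPU efficiency that `seff` reports is simply the utilized CPU time divided by the allocated core-walltime. A quick sketch reproducing the figure from the example above (00:26:15 used out of 00:53:04 core-walltime):

```bash
#!/bin/bash
# Convert H:M:S to seconds (10# avoids octal interpretation of leading zeros).
to_seconds() {
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

used=$(to_seconds "00:26:15")    # CPU Utilized
alloc=$(to_seconds "00:53:04")   # core-walltime (cores x wall-clock time)
awk -v u="$used" -v a="$alloc" 'BEGIN { printf "CPU Efficiency: %.2f%%\n", 100*u/a }'
# Prints: CPU Efficiency: 49.47%
```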

### List job attributes

The ``sjstat`` command displays statistics of jobs under the control of Slurm. To use it:

```bash
sjstat
```

<details>
<summary>[Show 'sjstat' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-l-001 ~]# sjstat -v

Scheduling pool data:
----------------------------------------------------------------------------------
Total Usable Free Node Time Other
Pool Memory Cpus Nodes Nodes Nodes Limit Limit traits
----------------------------------------------------------------------------------
test 373502Mb 88 6 6 1 UNLIM 1-00:00:00
general* 373502Mb 88 66 66 8 50 7-00:00:00
daily 373502Mb 88 72 72 9 60 1-00:00:00
hourly 373502Mb 88 72 72 9 UNLIM 01:00:00
gpu 128000Mb 8 1 1 0 UNLIM 7-00:00:00
gpu 128000Mb 20 8 8 0 UNLIM 7-00:00:00

Running job data:
---------------------------------------------------------------------------------------------------
Time Time Time
JobID User Procs Pool Status Used Limit Started Master/Other
---------------------------------------------------------------------------------------------------
13433377 collu_g 1 gpu PD 0:00 24:00:00 N/A (Resources)
13433389 collu_g 20 gpu PD 0:00 24:00:00 N/A (Resources)
13433382 jaervine 4 gpu PD 0:00 24:00:00 N/A (Priority)
13433386 barret_d 20 gpu PD 0:00 24:00:00 N/A (Priority)
13433382 pamula_f 20 gpu PD 0:00 168:00:00 N/A (Priority)
13433387 pamula_f 4 gpu PD 0:00 24:00:00 N/A (Priority)
13433365 andreani 132 daily PD 0:00 24:00:00 N/A (Dependency)
13433388 marino_j 6 gpu R 1:43:12 168:00:00 01-23T14:54:57 merlin-g-007
13433377 choi_s 40 gpu R 2:09:55 48:00:00 01-23T14:28:14 merlin-g-006
13433373 qi_c 20 gpu R 7:00:04 24:00:00 01-23T09:38:05 merlin-g-004
13433390 jaervine 2 gpu R 5:18 24:00:00 01-23T16:32:51 merlin-g-007
13433390 jaervine 2 gpu R 15:18 24:00:00 01-23T16:22:51 merlin-g-007
13433375 bellotti 4 gpu R 7:35:44 9:00:00 01-23T09:02:25 merlin-g-001
13433358 bellotti 1 gpu R 1-05:52:19 144:00:00 01-22T10:45:50 merlin-g-007
13433377 lavriha_ 20 gpu R 5:13:24 24:00:00 01-23T11:24:45 merlin-g-008
13433370 lavriha_ 40 gpu R 22:43:09 24:00:00 01-22T17:55:00 merlin-g-003
13433373 qi_c 20 gpu R 15:03:15 24:00:00 01-23T01:34:54 merlin-g-002
13433371 qi_c 4 gpu R 22:14:14 168:00:00 01-22T18:23:55 merlin-g-001
13433254 feichtin 2 general R 5-07:26:11 156:00:00 01-18T09:11:58 merlin-c-204
13432137 feichtin 2 general R 5-23:06:25 160:00:00 01-17T17:31:44 merlin-c-204
13433389 albajaca 32 hourly R 41:19 1:00:00 01-23T15:56:50 merlin-c-219
13433387 riemann_ 2 general R 1:51:47 4:00:00 01-23T14:46:22 merlin-c-204
13433370 jimenez_ 2 general R 23:20:45 168:00:00 01-22T17:17:24 merlin-c-106
13433381 jimenez_ 2 general R 4:55:33 168:00:00 01-23T11:42:36 merlin-c-219
13433390 sayed_m 128 daily R 21:49 10:00:00 01-23T16:16:20 merlin-c-223
13433359 adelmann 2 general R 1-05:00:09 48:00:00 01-22T11:38:00 merlin-c-204
13433377 zimmerma 2 daily R 6:13:38 24:00:00 01-23T10:24:31 merlin-c-007
13433375 zohdirad 24 daily R 7:33:16 10:00:00 01-23T09:04:53 merlin-c-218
13433363 zimmerma 6 general R 1-02:54:20 47:50:00 01-22T13:43:49 merlin-c-106
13433376 zimmerma 6 general R 7:25:42 23:50:00 01-23T09:12:27 merlin-c-007
13433371 vazquez_ 16 daily R 21:46:31 23:59:00 01-22T18:51:38 merlin-c-106
13433382 vazquez_ 16 daily R 4:09:23 23:59:00 01-23T12:28:46 merlin-c-024
13433376 jiang_j1 440 daily R 7:11:14 10:00:00 01-23T09:26:55 merlin-c-123
13433376 jiang_j1 24 daily R 7:08:19 10:00:00 01-23T09:29:50 merlin-c-220
13433384 kranjcev 440 daily R 2:48:19 24:00:00 01-23T13:49:50 merlin-c-108
13433371 vazquez_ 16 general R 20:15:15 120:00:00 01-22T20:22:54 merlin-c-210
13433371 vazquez_ 16 general R 21:15:51 120:00:00 01-22T19:22:18 merlin-c-210
13433374 colonna_ 176 daily R 8:23:18 24:00:00 01-23T08:14:51 merlin-c-211
13433374 bures_l 88 daily R 10:45:06 24:00:00 01-23T05:53:03 merlin-c-001
13433375 derlet 88 daily R 7:32:05 24:00:00 01-23T09:06:04 merlin-c-107
13433373 derlet 88 daily R 17:21:57 24:00:00 01-22T23:16:12 merlin-c-002
13433373 derlet 88 daily R 18:13:05 24:00:00 01-22T22:25:04 merlin-c-112
13433365 andreani 264 daily R 4:10:08 24:00:00 01-23T12:28:01 merlin-c-003
13431187 mahrous_ 88 general R 6-15:59:16 168:00:00 01-17T00:38:53 merlin-c-111
13433387 kranjcev 2 general R 1:48:47 4:00:00 01-23T14:49:22 merlin-c-204
13433368 karalis_ 352 general R 1-00:05:22 96:00:00 01-22T16:32:47 merlin-c-013
13433367 karalis_ 352 general R 1-00:06:44 96:00:00 01-22T16:31:25 merlin-c-118
13433385 karalis_ 352 general R 1:37:24 96:00:00 01-23T15:00:45 merlin-c-213
13433374 sato 256 general R 14:55:55 24:00:00 01-23T01:42:14 merlin-c-204
13433374 sato 64 general R 10:43:35 24:00:00 01-23T05:54:34 merlin-c-106
67723568 sato 32 general R 10:40:07 24:00:00 01-23T05:58:02 merlin-c-007
13433265 khanppna 440 general R 3-18:20:58 168:00:00 01-19T22:17:11 merlin-c-008
13433375 khanppna 704 general R 7:31:24 24:00:00 01-23T09:06:45 merlin-c-101
13433371 khanppna 616 general R 21:40:33 24:00:00 01-22T18:57:36 merlin-c-208
</pre>
</details>

### Graphical user interface

When using **ssh** with X11 forwarding (``ssh -XY``), or when using NoMachine, users can use ``sview``.
**SView** is a graphical user interface to view and modify Slurm state. To run **sview**:

```bash
ssh -XY $username@merlin-l-001.psi.ch  # Not necessary when using NoMachine
sview
```

## General Monitoring

The following pages provide basic monitoring for Slurm and the computing nodes.
Currently, monitoring is based on Grafana + InfluxDB. In the future it will
be moved to a different service based on ElasticSearch + LogStash + Kibana.

In the meantime, the following monitoring pages are available on a best-effort
basis:

### Merlin6 Monitoring Pages

* Slurm monitoring:
  * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
  * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
  * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
  * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
  * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)

### Merlin5 Monitoring Pages

* Slurm monitoring:
  * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
  * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
  * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)
---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun, jobs, job, submit, submission, array jobs, array, squeue, sinfo, scancel, packed jobs, short jobs, very short jobs, multithread, rules, no-multithread, HT
last_updated: 07 September 2022
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---

## The rules

Before starting to use the cluster, please read the following rules:

1. To ease and improve *scheduling* and *backfilling*, always try to **estimate** and **define a proper run time** of your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs to run within **one day** at most. Please consider the following:
   * Some software can simply scale up by using more nodes while drastically reducing the run time.
   * Some software allows saving a specific state, from which a second job can continue: ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
   * Jobs submitted to **`hourly`** get more priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
   * Jobs submitted to **`daily`** get more priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
3. It is **forbidden** to run **very short jobs**, as they cause a lot of overhead and can also cause severe problems for the main scheduler.
   * ***Question:*** Is my job a very short job? ***Answer:*** If it lasts only a few seconds or a very few minutes, yes.
   * ***Question:*** How long should my job run? ***Answer:*** As a rule of thumb, 5 minutes or more is acceptable, and 15 minutes or more is preferred.
   * Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** to gather such jobs instead.

{{site.data.alerts.tip}}A good estimation of the <i>time</i> needed by your jobs, a proper way of running them, and optimizing the jobs to <i>run within one day</i> will contribute to a fair and efficient use of the system.
{{site.data.alerts.end}}

## Basic commands for running batch scripts

* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking job status.
* Use **``scancel``** for cancelling/deleting a job from the queue.

{{site.data.alerts.tip}}Use the Linux <b>'man'</b> pages when needed (i.e. <span style="color:orange;">'man sbatch'</span>), mostly for checking the available options of the above commands.
{{site.data.alerts.end}}

## Basic settings

For a complete list of available options and parameters, it is recommended to use the **man pages** (i.e. ``man sbatch``, ``man srun``, ``man salloc``).
Please notice that the behaviour of some parameters might change depending on the command used for running jobs (for example, the ``--exclusive`` behaviour in ``sbatch`` differs from ``srun``).

In this chapter we show the basic parameters which are usually needed in the Merlin cluster.

### Common settings

The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (i.e. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.

* **Clusters:** For running jobs in the different Slurm clusters, users should add the following option:

  ```bash
  #SBATCH --clusters=<cluster_name>  # Possible values: merlin5, merlin6, gmerlin6
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.

* **Partitions:** except when using the *default* partition of each cluster, one needs to specify the partition:

  ```bash
  #SBATCH --partition=<partition_name>  # Check each cluster documentation for possible values
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.

* **[Optional] Disabling shared nodes**: by default, nodes are not exclusive, hence multiple users can run on the same node. One can request exclusive node usage with the following option:

  ```bash
  #SBATCH --exclusive  # Only if you want a dedicated node
  ```

* **Time**: it is important to define a realistic run time for your job. This will help Slurm with *scheduling* and *backfilling*, and will let Slurm manage job queues more efficiently. This value can never exceed the `MaxTime` of the affected partition.

  ```bash
  #SBATCH --time=<D-HH:MM:SS>  # Can not exceed the partition `MaxTime`
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about partition `MaxTime` values.

* **Output and error files**: by default, Slurm generates a standard output file (``slurm-%j.out``) and a standard error file (``slurm-%j.err``), where `%j` is the job ID, in the directory from which the job was submitted. Users can change the default names with the following options:

  ```bash
  #SBATCH --output=<filename>  # Can include a path. Patterns accepted (i.e. %j)
  #SBATCH --error=<filename>   # Can include a path. Patterns accepted (i.e. %j)
  ```

  Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) for the full specification of **filename patterns**.

* **Enable/Disable Hyper-Threading**: whether a node has Hyper-Threading depends on the node configuration. By default, HT nodes have HT enabled, but one should specify the desired behaviour with one of the following options:

  ```bash
  #SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
  #SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
  ```

  Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about node configuration and Hyper-Threading.
  Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (among other options) in addition to `--hint`. Please contact us in case of doubt.
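Returning to the output and error options above, a hypothetical set of directives putting per-job logs into a `logs/` subdirectory (`%x` expands to the job name and `%j` to the job ID, both standard Slurm filename patterns; the `logs/` directory and job name are assumptions for illustration and the directory must exist before submission):

```bash
#SBATCH --job-name=myjob
#SBATCH --output=logs/%x-%j.out  # e.g. logs/myjob-135171306.out
#SBATCH --error=logs/%x-%j.err   # e.g. logs/myjob-135171306.err
```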

{{site.data.alerts.tip}} In general, for the cluster `merlin6` <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a recommended field. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one needs to define how a task should be handled within a core; this setting will generally not be used on hybrid MPI/OpenMP jobs, where multiple cores are needed for single tasks.
{{site.data.alerts.end}}
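As an illustration of how these options interact, a hypothetical hybrid MPI/OpenMP sketch for one non-multithreaded Merlin6 node, assuming 44 physical cores split into 4 MPI ranks with 11 OpenMP threads each (the executable name is made up):

```bash
#SBATCH --hint=nomultithread  # One thread per physical core
#SBATCH --nodes=1
#SBATCH --ntasks=4            # MPI ranks
#SBATCH --cpus-per-task=11    # OpenMP threads per rank: 4 x 11 = 44 cores

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_hybrid_app          # hypothetical hybrid executable
```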

## Batch script templates

### CPU-based job templates

The following examples apply to the **Merlin6** cluster.

#### Non-multithreaded jobs template

The following template should be used by any user submitting non-multithreaded jobs to the Merlin6 CPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=merlin6                 # Cluster name
#SBATCH --partition=general,daily,hourly  # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended
#SBATCH --output=<output_file>            # Generate custom output file
#SBATCH --error=<error_file>              # Generate custom error file
#SBATCH --hint=nomultithread              # Mandatory for non-multithreaded jobs
##SBATCH --exclusive                      # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1              # Only mandatory for non-multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                        # Uncomment and specify #nodes to use
##SBATCH --ntasks=44                      # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44             # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44               # Uncomment and specify the number of cores per task
```

#### Multithreaded jobs template

The following template should be used by any user submitting multithreaded jobs to the Merlin6 CPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=merlin6                 # Cluster name
#SBATCH --partition=general,daily,hourly  # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS>               # Strongly recommended
#SBATCH --output=<output_file>            # Generate custom output file
#SBATCH --error=<error_file>              # Generate custom error file
#SBATCH --hint=multithread                # Mandatory for multithreaded jobs
##SBATCH --exclusive                      # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2              # Only mandatory for multithreaded single tasks

## Advanced options example
##SBATCH --nodes=1                        # Uncomment and specify #nodes to use
##SBATCH --ntasks=88                      # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88             # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88               # Uncomment and specify the number of cores per task
```

### GPU-based job templates

The following template should be used by any user submitting jobs to the GPU nodes:

```bash
#!/bin/bash
#SBATCH --cluster=gmerlin6                    # Cluster name
#SBATCH --partition=gpu,gpu-short             # Specify one or multiple partitions, or
#SBATCH --partition=gwendolen,gwendolen-long  # Only for Gwendolen users
#SBATCH --gpus="<type>:<num_gpus>"            # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS>                   # Strongly recommended
#SBATCH --output=<output_file>                # Generate custom output file
#SBATCH --error=<error_file>                  # Generate custom error file
##SBATCH --exclusive                          # Uncomment if you need exclusive node usage

## Advanced options example
##SBATCH --nodes=1                            # Uncomment and specify the number of nodes to use
##SBATCH --ntasks=1                           # Uncomment and specify the number of tasks to use
##SBATCH --cpus-per-gpu=5                     # Uncomment and specify the number of CPU cores per GPU
##SBATCH --mem-per-gpu=16000                  # Uncomment and specify the memory per GPU
##SBATCH --gpus-per-node=<type>:2             # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2           # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1             # Uncomment and specify the number of GPUs per task
```
|
||||
|
||||
## Advanced configurations

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.

``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```

This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the `myprogram` executable. Each job will receive the same set of
configurations (e.g. the time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_0 to {some-number}_7, and they will each
have their own output file.

**Note:**

* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` defines that only 5 sub jobs may ever run in parallel.

You can also use an array job approach to run over all files in a directory, substituting the payload with

``` bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```

Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.

``` bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```

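One caveat worth noting: bash arrays are 0-indexed, while counters like `--array=1-8` start at 1, so `${ARGS[$SLURM_ARRAY_TASK_ID]}` would skip the first value (and run past the end for the highest task ID). A minimal sketch of one way to handle this; the `SLURM_ARRAY_TASK_ID` assignment here only simulates what Slurm sets inside a real job:

```shell
#!/bin/bash
# Sketch only: SLURM_ARRAY_TASK_ID is set by hand here; Slurm sets it in real jobs.
ARGS=(0.05 0.25 0.5 1 2 5 100)      # 7 values, bash indices 0..6
SLURM_ARRAY_TASK_ID=1               # pretend this is the first sub job (--array=1-7)
IDX=$((SLURM_ARRAY_TASK_ID - 1))    # shift the 1-based counter to 0-based indexing
echo "${ARGS[$IDX]}"                # prints 0.05, the first parameter value
```

Alternatively, simply submit with a 0-based range such as `--array=0-6`.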
### Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:

``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00      # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1         # run a 10-job array, one job at a time

if test -e checkpointfile; then
    # There is a checkpoint file: resume from it.
    myprogram --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    myprogram
fi
```

The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 sub job can ever run in parallel, so
this will result in sub job n+1 only being started when job n has finished. Each job will read the checkpoint file
if it is present.

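What the checkpoint actually contains is entirely up to your program. As a hypothetical stand-in (the `counter.chk` file name is invented for illustration), the resume-or-start pattern at the script level looks like:

```shell
#!/bin/bash
# Hypothetical stand-in for a checkpointing program: each array sub job
# resumes from counter.chk if it exists, otherwise it starts from scratch.
if [ -e counter.chk ]; then
    n=$(cat counter.chk)            # resume from the last checkpoint
else
    n=0                             # no checkpoint yet: fresh start
fi
n=$((n + 1))
echo "$n" > counter.chk             # write the new checkpoint
echo "completed segment $n"
```

Running this twice in the same directory prints `completed segment 1` and then `completed segment 2`, mirroring how sub job n+1 picks up where sub job n stopped.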
### Packed jobs: running a large number of short tasks

Since launching a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `myprog` program 1000 times with
arguments from 1 to 1000. With the `-N1 -n1 -c1 --exclusive` option, it
ensures that at any point in time only 44 instances are effectively
running, each being allocated one CPU. You can decide to allocate
several CPUs or tasks per instance by adapting the corresponding
parameters.

``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44            # defines the number of parallel tasks

for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```

**Note:** The `&` at the end of the `srun` line is needed so that the script does not block on each task.
The `wait` command waits for all such background tasks to finish before the job ends.

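Inside Slurm the throttling comes from `srun --exclusive` blocking until a task slot is free; the surrounding bash pattern can be sketched without Slurm at all. One detail worth knowing: a plain `wait` does not report individual task failures, so if you want to count them you have to wait on each PID. A toy sketch, with a shell test standing in for the payload:

```shell
#!/bin/bash
# Toy sketch of the fan-out/collect pattern (no Slurm involved):
# background subshells stand in for the srun calls, and waiting on each
# PID individually lets us count how many tasks failed.
pids=()
for i in 1 2 3; do
    ( [ "$i" -ne 2 ] ) &            # task 2 deliberately fails
    pids+=($!)
done
failed=0
for pid in "${pids[@]}"; do
    wait "$pid" || failed=$((failed + 1))   # collect each task's exit status
done
echo "failed tasks: $failed"        # prints: failed tasks: 1
```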
@@ -0,0 +1,63 @@

---
title: Slurm Basic Commands
#tags:
keywords: sinfo, squeue, sbatch, srun, salloc, scancel, sview, seff, sjstat, sacct, basic commands, slurm commands, cluster
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-basics.html
---

This document shows some basic commands for using Slurm. Advanced examples for some of these
are explained in other Merlin6 Slurm pages. You can always use `man <command>` for more
information about options and examples.

## Basic commands

Useful Slurm commands:

```bash
sinfo            # show the names of nodes, their occupancy, the names of
                 # Slurm partitions, and limits (try the "-l" option)
squeue           # show the currently running/waiting jobs in Slurm
                 # (the additional "-l" option may also be useful)
sbatch Script.sh # submit a batch script (examples below) to Slurm
srun <command>   # submit a command to Slurm; the same options as in 'sbatch' can be used
salloc           # allocate computing nodes; use it for interactive runs
scancel job_id   # cancel a Slurm job, where job_id is the numeric ID shown by squeue
sview            # X interface for managing jobs and tracking job run information
seff             # calculate the efficiency of a job
sjstat           # list attributes of jobs under Slurm control
sacct            # show job accounting, useful for checking details of finished jobs
```

---

## Advanced commands

```bash
sinfo -N -l      # list nodes, state, resources (#CPUs, memory per node, ...), etc.
sshare -a        # list the shares of associations to a cluster
sprio -l         # view the factors that comprise a job's scheduling priority;
                 # add '-u <username>' to filter by user
```

## Show information for a specific cluster

By default, any of the above commands shows information for the local cluster, which is **merlin6**.

If you want to see the same information for **merlin5**, you have to add the parameter `--clusters=merlin5`.
If you want to see both clusters at the same time, add the option `--federation`.

Examples:

```bash
sinfo                     # 'sinfo' for the local cluster, which is 'merlin6'
sinfo --clusters=merlin5  # 'sinfo' for the non-local cluster 'merlin5'
sinfo --federation        # 'sinfo' for all clusters, i.e. 'merlin5' & 'merlin6'
squeue                    # 'squeue' for the local cluster, which is 'merlin6'
squeue --clusters=merlin5 # 'squeue' for the non-local cluster 'merlin5'
squeue --federation       # 'squeue' for all clusters, i.e. 'merlin5' & 'merlin6'
```

---

354
docs/merlin6/03-Slurm-General-Documentation/slurm-examples.md
Normal file
@@ -0,0 +1,354 @@

---
title: Slurm Examples
#tags:
keywords: slurm example, template, examples, templates, running jobs, sbatch, single core based jobs, HT, multithread, no-multithread, mpi, openmp, packed jobs, hands-on, array jobs, gpu
last_updated: 07 September 2022
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-examples.html
---

## Single core based job examples

### Example 1: Hyperthreaded job

In this example we want to use hyperthreading (``--ntasks-per-core=2`` and ``--hint=multithread``). In our Merlin6 configuration,
the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB, hence each task can use up to 8000MB (2 threads x 4000MB).

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

### Example 2: Non-hyperthreaded job

In this example we do not want hyperthreading (``--ntasks-per-core=1`` and ``--hint=nomultithread``). In our Merlin6 configuration,
the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB. If we do not specify anything else, our
single core task will use the default of 4000MB. However, you can double it with ``--mem-per-cpu=8000`` if you require more memory
(remember, the second thread will not be used, so we can safely assign an extra 4000MB to the single active thread).

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=1     # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread    # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

## Multi core based job examples

### Example 1: MPI with Hyper-Threading

In this example we run a job with 88 tasks. Merlin6 Apollo nodes have 44 cores, each with hyper-threading
enabled. This means that we can run 2 threads per core, 88 threads in total. To accomplish that, users should specify
``--ntasks-per-core=2`` and ``--hint=multithread``.

Use `--nodes=1` if you want to use a node exclusively (88 hyperthreaded tasks fit in a single Merlin6 node).

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=88             # Job will run 88 tasks
#SBATCH --ntasks-per-core=2     # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread      # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

### Example 2: MPI without Hyper-Threading

In this example, we want to run a job with 44 tasks, and for performance reasons we want to disable hyper-threading.
Merlin6 Apollo nodes have 44 cores, each with hyper-threading enabled. To ensure that only 1 thread is used per task,
users should specify ``--ntasks-per-core=1`` and ``--hint=nomultithread``. With this configuration, we tell Slurm to run only 1
task per core without hyperthreading. Hence, each task will be assigned to an independent core.

Use `--nodes=1` if you want to use a node exclusively (44 non-hyperthreaded tasks fit in a single Merlin6 node).

```bash
#!/bin/bash
#SBATCH --partition=hourly      # Using 'hourly' will grant higher priority
#SBATCH --ntasks=44             # Job will run 44 tasks
#SBATCH --ntasks-per-core=1     # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread    # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00         # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

### Example 3: Hyperthreaded Hybrid MPI/OpenMP job

In this example, we want to run a hybrid MPI/OpenMP job with hyperthreading. The job runs 4 MPI
tasks with 8 CPUs per task. Each task in our example requires 128GB of memory, so we specify 16000MB per CPU
(8 x 16000MB = 128000MB). Notice that since hyperthreading is enabled, Slurm will use 4 cores per task (with hyperthreading,
2 threads, a.k.a. Slurm CPUs, fit into a core).

```bash
#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8
#SBATCH --partition=hourly
#SBATCH --time=01:00:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=multithread

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

{{site.data.alerts.tip}} Always consider that **`'--mem-per-cpu' x '--cpus-per-task'`** can **never** exceed the maximum amount of memory per node (352000MB).
{{site.data.alerts.end}}

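The tip above is simple arithmetic; a quick sketch checking the request from this example against the 352000MB node limit quoted in this page:

```shell
#!/bin/bash
# Check Example 3's memory request against the per-node limit quoted above.
mem_per_cpu=16000                   # --mem-per-cpu (MB)
cpus_per_task=8                     # --cpus-per-task
node_limit=352000                   # max memory per Merlin6 Apollo node (MB)
total=$((mem_per_cpu * cpus_per_task))
echo "memory per task: ${total}MB"  # prints: memory per task: 128000MB
if [ "$total" -le "$node_limit" ]; then
    echo "request fits on a node"
fi
```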
### Example 4: Non-hyperthreaded Hybrid MPI/OpenMP job

In this example, we want to run a hybrid MPI/OpenMP job without hyperthreading. The job runs 4 MPI
tasks with 8 CPUs per task. Each task in our example requires 128GB of memory, so we specify 16000MB per CPU
(8 x 16000MB = 128000MB). Notice that since hyperthreading is disabled, Slurm will use 8 cores per task (by disabling
hyperthreading we force the use of only 1 thread, a.k.a. 1 CPU, per core).

```bash
#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8
#SBATCH --partition=hourly
#SBATCH --time=01:00:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=nomultithread

module purge
module load $MODULE_NAME        # where $MODULE_NAME is a software package in PModules
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

{{site.data.alerts.tip}} Always consider that **`'--mem-per-cpu' x '--cpus-per-task'`** can **never** exceed the maximum amount of memory per node (352000MB).
{{site.data.alerts.end}}

## GPU examples

Using GPUs requires two major changes. First, the cluster needs to be set
to `gmerlin6`. This should also be added to later commands pertaining to the
job, e.g. `scancel --cluster=gmerlin6 <jobid>`. Second, the number of GPUs
should be specified using `--gpus`, `--gpus-per-task`, or similar parameters.
Here's an example for a simple test job:

```bash
#!/bin/bash
#SBATCH --partition=gpu         # Or 'gpu-short' for higher priority but a 2-hour limit
#SBATCH --cluster=gmerlin6      # Required for GPU
#SBATCH --gpus=2                # Total number of GPUs
#SBATCH --cpus-per-gpu=5        # Request CPU resources
#SBATCH --time=1-00:00:00       # Define max time job will run
#SBATCH --output=myscript.out   # Define your output file
#SBATCH --error=myscript.err    # Define your error file

module purge
module load cuda                # load any needed modules here
srun $MYEXEC                    # where $MYEXEC is the path to your binary file
```

Slurm will automatically set the GPU visibility (e.g. `$CUDA_VISIBLE_DEVICES`).

## Advanced examples

### Array Jobs: launching a large number of related jobs

If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.

``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun $MYEXEC config-file-${SLURM_ARRAY_TASK_ID}.dat
```

This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID` defined by Slurm inside of the job's
environment to feed the correct input arguments or configuration file
to the `$MYEXEC` executable. Each job will receive the same set of
configurations (e.g. the time limit of 8h in the example above).

The jobs are independent, but they will run in parallel (if the cluster resources allow for
it). The jobs will get JobIDs like {some-number}_0 to {some-number}_7, and they will each
have their own output file.

**Note:**

* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` defines that only 5 sub jobs may ever run in parallel.

You can also use an array job approach to run over all files in a directory, substituting the payload with

``` bash
FILES=(/path/to/data/*)
srun $MYEXEC ${FILES[$SLURM_ARRAY_TASK_ID]}
```

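If you want the array range to follow the directory contents automatically, you can count the files at submission time and build a matching 0-based range. A sketch; `/path/to/data` stays a placeholder and `job.sh` is a hypothetical script name:

```shell
#!/bin/bash
# Sketch: derive a 0-based --array range from the file count, so that
# ${FILES[$SLURM_ARRAY_TASK_ID]} covers every element exactly once.
FILES=(/path/to/data/*)
NFILES=${#FILES[@]}                             # number of matched files
echo "sbatch --array=0-$((NFILES - 1)) job.sh"  # the submission command to run
```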
Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.

``` bash
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun $MYEXEC ${ARGS[$SLURM_ARRAY_TASK_ID]}
```

### Array jobs: running very long tasks with checkpoint files

If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:

``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00      # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1         # run a 10-job array, one job at a time

if test -e checkpointfile; then
    # There is a checkpoint file: resume from it.
    $MYEXEC --read-checkp checkpointfile
else
    # There is no checkpoint file, start a new simulation.
    $MYEXEC
fi
```

The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 sub job can ever run in parallel, so
this will result in sub job n+1 only being started when job n has finished. Each job will read the checkpoint file
if it is present.

### Packed jobs: running a large number of short tasks

Since launching a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing, i.e. run the short tasks within the loop of a single Slurm job.

You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch will ensure that only a specified
number of tasks can run in parallel.

As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `$MYEXEC` program 1000 times with
arguments from 1 to 1000. With the `-N1 -n1 -c1 --exclusive` option, it
ensures that at any point in time only 44 instances are effectively
running, each being allocated one CPU. You can decide to allocate
several CPUs or tasks per instance by adapting the corresponding
parameters.

``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44            # defines the number of parallel tasks

for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive $MYEXEC $i &
done
wait
```

**Note:** The `&` at the end of the `srun` line is needed so that the script does not block on each task.
The `wait` command waits for all such background tasks to finish before the job ends.

## Hands-On Example

Copy-paste the following example into a file called myAdvancedTest.batch:

```bash
#!/bin/bash
#SBATCH --partition=daily       # name of the Slurm partition to submit to
#SBATCH --time=2:00:00          # limit the execution of this job to 2 hours; see sinfo for the max. allowance
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks=44             # number of tasks
#SBATCH --ntasks-per-core=1     # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread    # Don't use extra threads with in-core multi-threading

module load gcc/9.2.0 openmpi/3.1.5-1_merlin6
module list

echo "Example no-MPI:" ; hostname        # prints the hostname of the node running the batch script
echo "Example MPI:"    ; srun hostname   # will print one hostname per task
```

The above example specifies the options ``--nodes=2`` and ``--ntasks=44``. This means that up to 2 nodes are requested,
and 44 tasks are expected to run, so 44 cores are needed. Slurm will try to allocate a maximum of
2 nodes, which together must have at least 44 cores. Since our nodes have 44 cores each, if the nodes are empty (no other users
have running jobs there) the job can land on a single node (it has enough cores to run 44 tasks).

If we want to ensure that the job uses at least two different nodes (e.g. for boosting CPU frequency, or because the job
requires more memory per core), we should specify further options.

A good example is ``--ntasks-per-node=22``. This will equally distribute the 44 tasks over 2 nodes.

```bash
#SBATCH --ntasks-per-node=22
```

A different example is to specify how much memory per core is needed. For instance, ``--mem-per-cpu=32000`` will reserve
~32000MB per core. Since we have a maximum of 352000MB per Apollo node, Slurm will only be able to allocate 11 cores (32000MB x 11 cores = 352000MB) per node.
This means that 4 nodes are needed (a maximum of 11 tasks per node due to the memory definition, and we need to run 44 tasks), so we need to change to ``--nodes=4``
(or remove ``--nodes``). Alternatively, we can decrease ``--mem-per-cpu`` to a lower value which allows the use of at least 44 cores over two nodes (i.e. with ``16000``
we should be able to use 2 nodes).

```bash
#SBATCH --mem-per-cpu=16000
```

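The node count in that paragraph follows from integer arithmetic; a small sketch with the numbers quoted above (352000MB and 44 cores per Apollo node):

```shell
#!/bin/bash
# How many nodes does --mem-per-cpu=32000 force for 44 tasks?
node_mem=352000                              # MB per Apollo node
node_cores=44                                # cores per node
mem_per_cpu=32000
ntasks=44
tasks_per_node=$((node_mem / mem_per_cpu))   # memory-limited: 352000/32000 = 11
if [ "$tasks_per_node" -gt "$node_cores" ]; then
    tasks_per_node=$node_cores               # otherwise limited by the core count
fi
nodes=$(( (ntasks + tasks_per_node - 1) / tasks_per_node ))  # ceiling division
echo "$tasks_per_node tasks/node -> $nodes nodes"   # prints: 11 tasks/node -> 4 nodes
```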
Finally, in order to ensure exclusive use of a node, the option *--exclusive* can be used (see below). This will ensure that
the requested nodes are exclusive to the job (no other users' jobs will run on these nodes, and only completely
free nodes will be allocated).

```bash
#SBATCH --exclusive
```

This can be combined with the previous examples.

More advanced configurations can be defined and combined with the previous examples. More information about advanced
options can be found at https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
advanced configurations unless you are sure of what you are doing.