Migrating merlin6 user guide from jekyll-example1

From lsm-hpce/jekyll-example1 1eada07
Spencer Bliven
2019-06-14 15:38:22 +02:00
parent 7c6f7b177d
commit ebff53c62c
19 changed files with 598 additions and 763 deletions

---
title: Accessing Interactive Nodes
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/interactive.html
---
## Login nodes description
The Merlin6 login nodes are the official machines for accessing the Merlin6 cluster.
From these machines, users can submit jobs to the Slurm batch system as well as visualize data or compile their software.
The Merlin6 login nodes are the following:
| Hostname | SSH | NoMachine | #cores | CPU | Memory | Scratch | Scratch Mountpoint |
| ------------------- | --- | --------- | ----------- |:---------------------------------- | ------ | ---------- |:------------------ |
| merlin-l-01.psi.ch | yes | - | 32 (2 x 16) | 2 x Intel Xeon E5-2697A v4 2.60GHz | 512GB | 100GB SAS | ``/scratch`` |
| merlin-l-02.psi.ch | yes | yes | 32 (2 x 16) | 2 x Intel Xeon E5-2697A v4 2.60GHz | 512GB | 100GB SAS | ``/scratch`` |
| merlin-l-001.psi.ch | - | - | 44 (2 x 22) | 2 x Intel Xeon Gold 6152 2.10GHz | 512GB | 2.0TB NVMe | ``/scratch`` |
| merlin-l-002.psi.ch | - | - | 44 (2 x 22) | 2 x Intel Xeon Gold 6142 2.10GHz | 512GB | 2.0TB NVMe | ``/scratch`` |
* ``merlin-l-001`` and ``merlin-l-002`` are not in production yet, hence SSH access is not possible.
---
## Remote Access
### SSH Access
For interactive command-line access, use an SSH client. We recommend using X11 forwarding, although it is not the officially supported method; it can help when running X applications.
For Linux:
```bash
ssh -XY $username@merlin-l-01.psi.ch
```
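Once connected, X11 forwarding can be quickly verified as sketched below (this assumes a simple X utility such as ``xclock`` is available on the login node; any other X application works as well):
```bash
# If X11 forwarding is active, DISPLAY should be set to something like "localhost:10.0"
echo $DISPLAY
# Launch a simple X application to test the forwarding (xclock is just an example)
xclock &
```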
X applications are supported on the login nodes, and X11 forwarding can be used by users who have properly configured X11 support on their desktops:
* Merlin6 administrators **do not offer support** for user desktop configuration (Windows, MacOS, Linux).
* Hence, Merlin6 administrators **do not offer official support** for X11 client setup.
* However, a generic guide for X11 client setup (Windows, Linux and MacOS) will be provided.
* PSI desktop configuration issues must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* Ticket will be redirected to the corresponding Desktop support group (Windows, Linux).
### NoMachine Access
X applications are supported on the login nodes and can be run through NoMachine. This service is officially supported in the Merlin6 cluster and is the official way to run X applications.
* NoMachine *client installation* support has to be requested through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
* Ticket will be redirected to the corresponding support group (Windows or Linux)
* NoMachine *client configuration* and *connectivity* for Merlin6 is fully supported by Merlin6 administrators.
* Please contact us through the official channels on any configuration issue with NoMachine.
---

---
title: Accessing Merlin6
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/accessing-merlin6.html
---
This chapter describes how to access the Merlin6 cluster.

---
title: Accessing Slurm Cluster
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-access.html
---
## The Merlin6 Slurm batch system
Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm; in the same way, **Merlin6** has been configured with this batch system.
Slurm has been installed in a **multi-clustered** configuration, allowing multiple clusters to be integrated into the same batch system.
* Two different Slurm clusters exist: **merlin5** and **merlin6**.
* **merlin5** is a cluster with very old hardware (out-of-warranty).
* **merlin5** will exist as long as hardware incidents remain minor and easy to repair (e.g. hard disk replacement)
* **merlin6** is the default cluster when submitting jobs.
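Both clusters listed above can be queried directly from the login nodes; a minimal check could look like this:
```bash
# Show partitions and node states for both Slurm clusters
sinfo --clusters=merlin5,merlin6
```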
This document focuses mostly on the **merlin6** cluster. Details for **merlin5** are not covered here; only basic access and recent
changes are explained (the **[Official Merlin5 User Guide](https://intranet.psi.ch/PSI_HPC/Merlin5)** is still valid).
### Merlin6 Slurm Configuration Details
To understand the Slurm configuration of the cluster, it can sometimes be useful to check the following files:
* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes and is also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes and is also propagated to the login nodes and computing nodes for user read access.
The configuration files found on the *login nodes* correspond exclusively to the **merlin6** cluster; the same files are also present
on the **merlin6** *computing nodes*.
Slurm configuration files for the old **merlin5** cluster have to be checked directly on any of the **merlin5** *computing nodes*: those files do
*not* exist on the **merlin6** *login nodes*.
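For a quick look, these files can be inspected directly from a login node; for example:
```bash
# Inspect the merlin6 Slurm configuration file from a login node (read-only)
less /etc/slurm/slurm.conf
# For example, show only the partition definitions
grep -i '^PartitionName' /etc/slurm/slurm.conf
```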
### Merlin5 Access
Keeping the **merlin5** cluster allows jobs to keep running on the old computing nodes until users have fully migrated their code to the new cluster.
From July 2019, **merlin6** becomes the **default cluster**, and any job submitted to Slurm will be submitted to that cluster.
However, users can keep submitting to the old **merlin5** computing nodes by using the option ``--clusters=merlin5`` and the corresponding
Slurm partition with ``--partition=merlin``. For example:
```bash
srun --clusters=merlin5 --partition=merlin hostname
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```
---
## Using Slurm 'merlin6' cluster
Basic usage of the **merlin6** cluster is detailed here. For advanced usage, please refer to the following document: [LINK TO SLURM ADVANCED CONFIG]()
### Merlin6 Node definition
The following table shows the default and maximum resources that can be used per node:
| Nodes                              | Def.#CPUs | Max.#CPUs | Def.Mem/CPU (MB) | Max.Mem/CPU (MB) | Max.Mem/Node (MB) | Max.Swap (MB) | Def.#GPUs | Max.#GPUs |
|:---------------------------------- | ---------:| ---------:| ----------------:| ----------------:| -----------------:| -------------:| --------- | --------- |
| merlin-c-[001-022,101-122,201-222] | 1 core | 44 cores | 8000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-g-[001] | 1 core | 8 cores | 8000 | 102498 | 102498 | 10000 | 1 | 2 |
| merlin-g-[002-009] | 1 core | 10 cores | 8000 | 102498 | 102498 | 10000 | 1 | 4 |
If nothing is specified, each core will by default use up to 8GB of memory. More memory can be requested with the ``--mem=<memory>`` option
(memory per node), up to the ``Max.Mem/Node`` limit.
In *Merlin6*, memory, like the CPU, is treated as a Consumable Resource.
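For instance, a job needing more than the default memory could be submitted as sketched below (the values are illustrative, and ``myScript.batch`` is a placeholder for a real batch script):
```bash
# Request 4 tasks and 64000MB of memory on the node (instead of the default 8000MB per core)
sbatch --ntasks=4 --mem=64000 myScript.batch
```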
### Merlin6 Slurm partitions
The partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:
| Partition | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true | 1 day | 1 week | 50 | low |
| **daily** | false | 1 day | 1 day | 60 | medium |
| **hourly** | false | 1 hour | 1 hour | unlimited | highest |
**general** is the *default* partition: when nothing is specified, jobs are assigned to it. The **general** partition can not have more than 50 nodes
running jobs; for **daily** this limit is raised to 60 nodes, while **hourly** has no limit. Shorter jobs have higher priority than
longer jobs and will in general be scheduled earlier (however, other factors, such as the user's fair share value, can affect this decision).
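A minimal batch script selecting a partition and a matching time limit is sketched below (partition, walltime and resource values are purely illustrative):
```bash
#!/bin/bash
#SBATCH --clusters=merlin6   # submit to the default merlin6 cluster
#SBATCH --partition=daily    # partition for jobs of up to 1 day
#SBATCH --time=04:00:00      # requested walltime, within the partition's Max Time
#SBATCH --ntasks=44          # number of tasks (here, one full node worth of cores)

srun hostname
```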
### Merlin6 User limits
By default, users can not use more than 528 cores at the same time (Max CPU per user). This limit applies to the **general** and **daily** partitions; for the **hourly** partition there is no such restriction.
These limits are relaxed for the **daily** partition during non-working hours and over the weekend, as follows:
| Partition | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ----------------------- | ---------------------- |
| **general** | 528 | 528 | 528 | 528 |
| **daily** | 528 | 792 | Unlimited | 792 |
| **hourly** | Unlimited | Unlimited | Unlimited | Unlimited |
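To see how close your running jobs are to these limits, the allocated cores per job can be listed from a login node; a simple sketch:
```bash
# List your running jobs with the number of allocated cores (%C) per job
squeue --user=$USER --states=RUNNING --format="%.10i %.12P %.8T %.6C %j"
```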

---
title: Merlin6 Data Directories
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/data-directories.html
---
## Merlin6 directory structure
Merlin6 provides the following directories for users:
* ``/psi/home/<username>``: private user **home** directory
* ``/data/user/<username>``: private user **data** directory
* ``/data/project/general/<projectname>``: Shared **Project** directory
* For BIO experiments, a dedicated ``/data/project/bio/$projectname`` directory exists.
* ``/scratch``: Local *scratch* disk.
* ``/shared-scratch``: Shared *scratch* disk.
A summary of each directory is given below:
| Directory                          | Block Quota [Soft:Hard] | File Quota [Soft:Hard]  | Quota Change Policy: Block         | Quota Change Policy: Files        | Backup | Backup Policy                  |
| ---------------------------------- | ----------------------- | ----------------------- |:--------------------------------- |:-------------------------------- | ------ | :----------------------------- |
| /psi/home/$username | USR [10GB:11GB] | *Undef* | Up to x2 when strictly justified. | N/A | yes | Daily snapshots for 1 week |
| /data/user/$username               | USR [1TB:1.074TB]       | USR [1M:1.1M]           | Immutable. Need a project.         | Changeable when justified.        | no     | Users responsible for backup   |
| /data/project/bio/$projectname | GRP [1TB:1.074TB] | GRP [1M:1.1M] | Subject to project requirements. | Subject to project requirements. | no | Project responsible for backup |
| /data/project/general/$projectname | GRP [1TB:1.074TB] | GRP [1M:1.1M] | Subject to project requirements. | Subject to project requirements. | no | Project responsible for backup |
| /scratch | *Undef* | *Undef* | N/A | N/A | no | N/A |
| /shared-scratch | *Undef* | *Undef* | N/A | N/A | no | N/A |
---
## User home directory
Home directories are part of the PSI NFS Central Home storage provided by AIT.
However, administration for the Merlin6 NFS homes is delegated to Merlin6 administrators.
This is the default directory users will land in when logging in to any Merlin6 machine.
This directory is mounted in the login and computing nodes under the directory:
```bash
/psi/home/$username
```
Users can check their quota by running the following command:
```bash
quota -s
```
### Home directory policy
* Read **[Important: Code of Conduct](#important-code-of-conduct)** for more information about Merlin6 policies.
* It is **forbidden** to use the home directories for IO-intensive tasks.
* Use ``/scratch``, ``/shared-scratch``, ``/data/user`` or ``/data/project`` for this purpose.
* Users can recover up to 1 week of lost data thanks to the automatic **daily snapshots kept for 1 week**.
Snapshots are found in the following directory:
```bash
/psi/home/.snapshot/$username
```
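A lost file can then be copied back from one of the snapshots; a sketch is shown below (the exact layout below the snapshot directory and the file names are illustrative):
```bash
# List the available snapshots for your home directory
ls /psi/home/.snapshot/$USER
# Copy a lost file back from a chosen snapshot (names are illustrative)
cp /psi/home/.snapshot/$USER/<snapshot-name>/myfile ~/myfile
```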
---
## User data directory
User data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.
The user data directory is intended for *fast IO access* and for keeping large amounts of private data.
This directory is mounted in the login and computing nodes under the directory:
```bash
/data/user/$username
```
Users can check their quota by running the following command:
```bash
mmlsquota -u <username> --block-size auto merlin-user
```
### User Directory policy
* Read **[Important: Code of Conduct](#important-code-of-conduct)** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
* Use ``/scratch``, ``/shared-scratch`` for this purpose.
* No backup policy is applied for user data directories: users are responsible for backing up their data.
---
## Project data directory
Project data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.
This storage is intended for *fast IO access* and for keeping large amounts of private data, but also for sharing data amongst
the different users of a project.
Creating a project is the way for users to expand their storage space, and it optimizes the usage of the storage
(for instance, by avoiding duplicated data across different users).
Using a project is **highly** recommended when multiple people are involved in the same project and manage similar/common data.
Quotas are defined on a *group* and *fileset* basis: a Unix group must exist for a specific project, or must be created for
any new project. Contact the Merlin6 administrators for more information about this.
The project data directory is mounted in the login and computing nodes under the directory:
```bash
/data/project/general/$projectname    # for BIO projects: /data/project/bio/$projectname
```
Users can check the project quota by running the following command:
```bash
mmrepquota merlin-proj:$projectname
```
### Project Directory policy
* Read **[Important: Code of Conduct](#important-code-of-conduct)** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
* Use ``/scratch``, ``/shared-scratch`` for this purpose.
* No backups: users are responsible for managing the backups of their data directories.
---
## Scratch directories
There are two different types of scratch disk: **local** (``/scratch``) and **shared** (``/shared-scratch``).
Specific details of each type are described below.
Usually, **shared** scratch should be used by jobs running on multiple nodes which need access to a common shared space
for creating temporary files, while **local** scratch should be used by jobs that only need a node-local space for temporary files.
**Local** scratch on the Merlin6 computing nodes provides a very high number of IOPS thanks to NVMe technology,
while **shared** scratch, despite also being very fast, is an external GPFS storage with higher latency.
``/shared-scratch`` is only mounted in the *Merlin6* computing nodes, and its current size is 50TB. Whenever necessary, it can be increased in the future.
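A sketch of a batch script that uses the **local** scratch area and cleans up after itself, in line with the policy below (``my_application`` is a placeholder for the real program):
```bash
#!/bin/bash
#SBATCH --partition=hourly
#SBATCH --time=00:30:00

# Create a private temporary directory on the local scratch disk
SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCHDIR"

# Run the actual work, writing temporary files into the scratch directory
my_application --tmpdir "$SCRATCHDIR"

# Delete the temporary files at the end of the job, as required by the policy
rm -rf "$SCRATCHDIR"
```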
A summary for the scratch directories is the following:
| Cluster | Service | Scratch | Scratch Mountpoint | Shared Scratch | Shared Scratch Mountpoint | Comments |
| ------- | -------------- | ------------ | ------------------ | -------------- | ------------------------- | ------------------------------------- |
| merlin5 | computing node | 50GB / SAS | ``/scratch`` | ``N/A`` | ``N/A`` | ``merlin-c-[01-64]`` |
| merlin6 | login node | 100GB / SAS | ``/scratch`` | ``N/A`` | ``N/A`` | ``merlin-l-0[1,2]`` |
| merlin6 | computing node | 1.3TB / NVMe | ``/scratch``       | 50TB / GPFS    | ``/shared-scratch``       | ``merlin-c-[001-022,101-122,201-222]`` |
| merlin6 | login node | 2.0TB / NVMe | ``/scratch`` | ``N/A`` | ``N/A`` | ``merlin-l-00[1,2]`` |
### Scratch directories policy
* Read **[Important: Code of Conduct](#important-code-of-conduct)** for more information about Merlin6 policies.
* By default, *always* use **local** scratch first, and only use **shared** scratch if your specific use case needs a shared scratch area.
* Temporary files *must be deleted at the end of the job by the user*.
* Remaining files will be deleted by the system if detected.
---

---
title: Requesting Merlin6 Accounts
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/request-account.html
---
## Requesting Access to Merlin6
PSI users with their Linux account belonging to the **svc-cluster_merlin6** group are allowed to use Merlin6.
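Before opening a ticket, you can check whether your account already belongs to the required group (a simple check; the group name is taken from above):
```bash
# List the Unix groups of your account and look for the Merlin6 group
id -Gn $USER | tr ' ' '\n' | grep -x 'svc-cluster_merlin6'
```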
Registration for **Merlin6** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:
* Please open a ticket as *Incident Request*, with subject:
```bash
Subject: [Merlin6] Access Request for user '$username'
```
* Text content (please always use this template):
```bash
Dear HelpDesk,
my name is $Name $Surname with PSI username $username and I would like to request access to the Merlin6 cluster.
Please add me to the following Unix groups:
* 'svc-cluster_merlin6'
Thanks a lot,
$Name $Surname
```
---
## Requesting Access to Merlin5
Merlin5 computing nodes will be available for some time as a **best effort** service.
For accessing the old Merlin5 resources, users should belong to the **svc-cluster_merlin5** Unix Group.
Registration for **Merlin5** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:
* Please open a ticket as *Incident Request*, with subject:
```bash
Subject: [Merlin5] Access Request for user '$username'
```
* Text content (please always use this template):
```bash
Dear HelpDesk,
my name is $Name $Surname with PSI username $username and I would like to request access to the Merlin5 cluster.
Please add me to the following Unix groups:
* 'svc-cluster_merlin5'
Thanks a lot,
$Name $Surname
```
---
## Requesting extra Unix groups
* Some users may need to be added to extra specific Unix groups.
* This will grant access to specific resources.
* For example, some BIO users may need to belong to a specific BIO group in order to have access to that group's project area.
* Supervisors should inform new users which extra groups are needed.
* When requesting access to **[Merlin6](#requesting-access-to-merlin6)** or **[Merlin5](#requesting-access-to-merlin5)**, extra groups can be added in the same *Incident Request*.
* Alternatively, this step can be done later in a separate **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
* If you want to request access to both Merlin5 and Merlin6:
* Use the template from **[Requesting Access to Merlin6](#requesting-access-to-merlin6)** and also add the **``'svc-cluster_merlin5'``** Unix group to the request.