Migrating merlin6 user guide from jekyll-example1
From lsm-hpce/jekyll-example1 1eada07
This commit is contained in:
parent
7c6f7b177d
commit
ebff53c62c
@ -11,18 +11,36 @@ entries:
   - title: Introduction
     url: /merlin6/introduction.html
     output: web
+  - title: Code Of Conduct
+    url: /merlin6/code-of-conduct.html
+    output: web
+  - title: Hardware And Software Description
+    url: /merlin6/hardware-and-software.html
+    output: web
+  - title: Requesting Merlin6 Accounts
+    url: /merlin6/request-account.html
+    output: web
+  - title: Accessing Interactive Nodes
+    url: /merlin6/interactive.html
+    output: web
+  - title: Merlin6 Data Directories
+    url: /merlin6/data-directories.html
+    output: web
+  - title: Accessing Slurm Cluster
+    url: /merlin6/slurm-access.html
+    output: web
+  - title: Slurm Basic Commands
+    url: /merlin6/slurm-basics.html
+    output: web
+  - title: Slurm Configuration
+    url: /merlin6/slurm-configuration.html
+    output: web
   - title: Contact
     url: /merlin6/contact.html
     output: web
-  - title: Using Merlin6
-    url: /merlin6/use.html
+  - title: Migration From Merlin5
+    url: /merlin6/migrating.html
     output: web
-  - title: User Guide
-    url: /merlin6/user-guide.html
+  - title: Known Problems and Troubleshooting
+    url: /merlin6/troubleshooting.html
-    output: web
-  - title: Section 2
-    output: web
-    folderitems:
-    - title: broken
-      url: /merlin6/broken.html
     output: web
@ -1,300 +0,0 @@
---
layout: default
title: Accessing Merlin6
parent: Merlin6 User Guide
nav_order: 4
---

# Accessing Merlin6
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

## Requesting Merlin6 Access

PSI users whose Linux account belongs to the **svc-cluster_merlin6** group are allowed to use Merlin6.

Registration for **Merlin6** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:

* Please open a ticket as an *Incident Request*, with the subject: ``[Merlin6] Access Request for user '<username>'``
* Text content example:

> Dear HelpDesk,
>
> my name is [Name] [Surname] with PSI username [username] and I would like to request access to the Merlin6 cluster.
>
> Please add me to the following Unix groups:
> * 'svc-cluster_merlin6'
>
> Thanks a lot,
> <Name> <Username>

### Requesting extra Unix groups

* Some users may need to be added to extra specific Unix groups.
* For example, some BIO users may need to belong to a specific BIO group.
* Extra groups can be requested in the *Incident Request* described in [Requesting Merlin6 Access](#Requesting-Merlin6-Access).
* Alternatively, this step can be done later in a separate **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.

---

## Requesting Merlin5 Access

The Merlin5 computing nodes will remain available for some time as a **best effort** service. Users who need the old resources should belong to the **svc-cluster_merlin5** Unix group.

Registration for **Merlin5** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:

* Please open a ticket as an *Incident Request*, with the subject: ``[Merlin5] Access Request for user '<username>'``
* Text content example:

> Dear HelpDesk,
>
> my name is [Name] [Surname] with PSI username [username] and I would like to request access to the old Merlin5 cluster.
>
> Please add me to the following Unix groups:
> * 'svc-cluster_merlin5'
>
> Thanks a lot,
> <Name> <Username>

* As described before, you can state in the *Incident Request* which extra Unix groups should be added to your user.

---

## Accessing Login/Interactive Nodes

The Merlin6 login nodes are the following:

* <tt><b>merlin-l-01.psi.ch</b></tt>
  * *SSH* access
  * **Hardware description:** 32 cores (2 x 16-core Intel Xeon E5-2697A v4), 512GB RAM, 100GB ``/scratch`` on SAS disk.
* <tt><b>merlin-l-02.psi.ch</b></tt>
  * *SSH* & **NoMachine** access
  * **Hardware description:** 32 cores (2 x 16-core Intel Xeon E5-2697A v4), 512GB RAM, 100GB ``/scratch`` on SAS disk.
<!--* <tt><b>merlin-l-001.psi.ch</b></tt> - SSH Access -->
<!--* <tt><b>merlin-l-002.psi.ch</b></tt> - SSH Access -->

Login nodes are the official service for accessing the Merlin6 cluster. From there users can submit jobs to the Slurm batch system as well as visualize or compile their software.

### SSH Access

For interactive command-line access, use SSH to reach the login nodes. For example:

```bash
ssh -XY [username]@merlin-l-01.psi.ch
```

X applications are supported on the login nodes and can be opened through SSH by users who have properly configured X11 on their desktops. The SSH options ``-XY`` are mandatory for that.
* Merlin6 administrators **do not offer support** for configuring user desktops (Windows, macOS and Linux). Hence, Merlin6 administrators **do not offer support** for X11 client setup.
* Any desktop configuration issue must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
  * The ticket will be redirected to the corresponding Desktop support group.

### NoMachine Access

X applications are supported on the login nodes and can run through NoMachine. This service is officially supported in the Merlin6 cluster and is the official X service.
* NoMachine client installation/configuration support has to be requested through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
  * The ticket will be redirected to the corresponding support group.

---

## Accessing Merlin6 data

### Merlin6 directory structure

Merlin6 contains the following directories available to users:

* ``/psi/home/<username>``: private user **home** directory
* ``/data/user/<username>``: private user **data** directory
* ``/data/project/general/<projectname>``: shared **project** directory
* ``/scratch``: local *scratch* disk.
* ``/shared-scratch``: shared *scratch* disk.

#### User home directory

The user home directory can be found on login nodes and computing nodes under the ``/psi/home/<username>`` directory. This is the default directory users will land in when logging in to any Merlin6 machine.

Home directories are part of the PSI NFS Central Home storage provided by AIT; however, quota administration for the Merlin6 cluster is delegated to the Merlin6 administrators.

Home directory policies:
* **Per-user quota policy**:
  * **Soft**: 10GB
  * **Hard**: 11GB
  * Quota can only be increased when strictly justified.
  * Check home quota with the command: ``quota -s``
* **Backup policy**:
  * **Daily snapshots for 1 week**: users can recover up to 1 week of their lost data.
  * **Snapshot location**: ``/psi/home/.snapshop/<username>``
* **Restrictions**
  * Read **[Important: Code of Conduct](## Important: Code of Conduct)** for more information about Merlin6 policies.
  * It is **forbidden** to use the home directories for IO-intensive tasks (for example, IO-intensive data access during job runtime).
    * Use ``/scratch``, ``/shared-scratch``, ``/data/user`` or ``/data/project`` for this purpose.

#### User data directory

The user data directory can be found on login nodes and computing nodes under the ``/data/user/<username>`` directory.
This storage is intended for fast IO access and for keeping a large amount of private data.

User data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.

User data directory policies:
* **Per-user block quota policy** (also known as GPFS USR Block Limits):
  * **Soft**: 1TB
  * **Hard**: 1.074TB
  * Block quota limits can not be increased. For extra space, a project must exist or be created (``/data/project/<projectname>``).
  * Check data quota with the command: ``mmlsquota -u <username> --block-size auto merlin-user``
* **Per-user number-of-files quota policy** (also known as GPFS USR File Limits):
  * **Soft**: 1,048,576
  * **Hard**: 1,126,400
  * File quota can be increased. For extra files, contact the Merlin6 administrators.
  * Check data quota with the command: ``mmlsquota -u <username> --block-size auto merlin-user``
* **Backup policy**:
  * No backups: users are responsible for managing the backups of their data directories.
* **Restrictions**
  * Read **[Important: Code of Conduct](## Important: Code of Conduct)** for more information about Merlin6 policies.
  * It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
    * Use ``/scratch`` or ``/shared-scratch`` for this purpose.
  * For temporary interactive user data (from a login node), it is allowed to use it as a ``scratch``-like area (for example, compiling, unpacking tarballs, etc.).

#### Project data directory

The project data directory can be found on login nodes and computing nodes under the ``/data/project/general/<projectname>`` directory.
This storage is intended for fast IO access and for keeping a large amount of private data, but also for sharing data amongst
different users working on the same project. Creating a project is the way users can expand their storage space, and it
optimizes the usage of the storage (for instance, by avoiding duplicated data for different users).
Using a project is **highly** recommended when multiple persons are involved in the same project managing similar/common data.
Quotas are defined on a *group* and *fileset* basis: a Unix group name must exist for a specific project, or must be created for any new project.
Contact the Merlin6 administrators for more information about that.

Project data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.

Project data directory policies:
* **Per-group and fileset block quota policy** (also known as GPFS GRP Block Limits):
  * **Soft**: 1TB
  * **Hard**: 1.074TB
  * Block quota limits can be increased on demand when strictly justified, and they can be defined when a project is created. For extra space, contact the Merlin6 administrators.
  * Check data quota with the command: ``mmlsquota -g <groupname> --block-size auto merlin-proj``
* **Per-group and fileset number-of-files quota policy** (also known as GPFS GRP File Limits):
  * **Soft**: 1,048,576
  * **Hard**: 1,126,400
  * File quota can be increased on demand, and it can be defined when a new project is created. For extra files, contact the Merlin6 administrators.
  * Check data quota with the command: ``mmlsquota -g <groupname> --block-size auto merlin-proj``
* **Backup policy**:
  * No backups: users are responsible for managing the backups of their data directories.
* **Restrictions**
  * Read **[Important: Code of Conduct](## Important: Code of Conduct)** for more information about Merlin6 policies.
  * It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
    * Use ``/scratch`` or ``/shared-scratch`` for this purpose.
  * For temporary interactive user data (from a login node), it is allowed to use it as a ``scratch``-like area (for example, compiling, unpacking tarballs, etc.).

#### Scratch directories

There are two different types of scratch disk: **local** (``/scratch``) and **shared** (``/shared-scratch``). Specific details of each type are described below.

Usually **shared** scratch is used by jobs which need access to a common shared space for storing temporary files, while **local** scratch should be used by jobs which
need to store temporary files not used by jobs running on other computing nodes.

**Local** scratch on the computing nodes provides a huge number of IOPS thanks to the NVMe technology, while **shared** scratch, despite also being very fast, is an external storage with more latency.

By default, *always* use **local** scratch first and only use **shared** scratch if your specific use case needs a shared scratch area.

##### Local Scratch

Local scratch is used for creating temporary files/directories needed by running jobs. Temporary files *must be deleted at the end of the job by the user*. Remaining files will be
deleted by the system if detected. Always use **local** scratch by default and only use **shared** scratch when **local** does not fit your needs.

Local scratch is available on login and computing nodes, and its size depends on the host:

* **Merlin6 login nodes:**
  * **Old login nodes:** a */scratch* partition of ~100GB is available on each of the login nodes ``merlin-l-01`` and ``merlin-l-02``. ``/scratch`` is mounted on *SAS* disks.
  * **New login nodes:** a */scratch* partition of ~1.6TB is available on each of the login nodes ``merlin-l-001`` and ``merlin-l-002``. ``/scratch`` is mounted on extremely fast *NVMe Flash* disks.
* **Merlin5 computing nodes**: a */scratch* partition of ~50GB is available on each computing node. ``/scratch`` is mounted on *SAS* disks for Merlin5 nodes.
* **Merlin6 computing nodes**: a */scratch* partition of ~1.2TB is available on each computing node. ``/scratch`` is mounted on extremely fast *NVMe Flash* disks for Merlin6 nodes.

#### Shared Scratch

Shared scratch (``/shared-scratch``) is used for creating temporary files/directories needed by running jobs. Temporary files *must be deleted at the end of the job by the user*.
Remaining files will be deleted by the system if detected. Only use **shared** scratch when you need to create **shared** temporary files; otherwise use **local** scratch by default.

``/shared-scratch`` is only available on the computing nodes, and its current size is 50TB. ``/shared-scratch`` is an independent GPFS filesystem in the new Merlin6 GPFS storage cluster,
and it can be increased in the future if necessary.

---

## Using the Slurm batch system

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.

Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.

To understand the Slurm configuration setup of the cluster, it can sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.

The configuration files found on the login nodes correspond exclusively to the **merlin6** cluster configuration.
Configuration files for the old **merlin5** cluster must be checked directly on any of the **merlin5** computing nodes: these are not propagated
to the **merlin6** login nodes.

### About Merlin5 & Merlin6

The new Slurm cluster is called **merlin6**. However, the old Slurm *merlin* cluster will be kept for some time, and it has been renamed to **merlin5**.
This allows jobs to keep running on the old computing nodes until users have fully migrated their codes to the new cluster.

From July 2019, **merlin6** becomes the **default cluster** and any job submitted to Slurm will be submitted to that cluster. Users can keep submitting to
the old *merlin5* computing nodes by using the option ``--clusters=merlin5``.

This documentation only covers the usage of the **merlin6** Slurm cluster.

### Using Slurm 'merlin6' cluster

Basic usage of the **merlin6** cluster is detailed here. For advanced usage, please use the following document [LINK TO SLURM ADVANCED CONFIG]()

#### Merlin6 Node definition

The following table shows the default and maximum resources that can be used per node:

| Nodes                              | Def.#CPUs | Max.#CPUs | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:---------------------------------- | ---------:| ---------:| -----------:| -----------:| ------------:| --------:| --------- | --------- |
| merlin-c-[001-022,101-122,201-222] | 1 core    | 44 cores  | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-g-[001]                     | 1 core    | 8 cores   | 8000        | 102498      | 102498       | 10000    | 1         | 2         |
| merlin-g-[002-009]                 | 1 core    | 10 cores  | 8000        | 102498      | 102498       | 10000    | 1         | 4         |

If nothing is specified, by default each core will use up to 8GB of memory. More memory per core can be specified with the ``--mem=<memory>`` option,
and the maximum memory allowed is ``Max.Mem/Node``.

In *Merlin6*, memory is considered a Consumable Resource, as is the CPU.

#### Merlin6 Slurm partitions

The partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:

| Partition   | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true              | 1 day        | 1 week   | 50        | low      |
| **daily**   | false             | 1 day        | 1 day    | 60        | medium   |
| **hourly**  | false             | 1 hour       | 1 hour   | unlimited | highest  |

**general** is the *default*, so when nothing is specified a job will be assigned to that partition. **general** can not have more than 50 nodes
running jobs. For **daily** this limit is extended to 60 nodes, while for **hourly** there are no limits. Shorter jobs have higher priority than
longer jobs, hence in general they will be scheduled earlier (however, other factors, such as the user's fair-share value, can affect this decision).

#### Merlin6 User limits

By default, users can not use more than 528 cores at the same time (max. CPUs per user). This limit applies to the **general** and **daily** partitions. For the **hourly** partition, there is no restriction.
These limits are relaxed for the **daily** partition during non-working hours and during the weekend as follows:

| Partition   | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ---------------------- | ---------------------- |
| **general** | 528             | 528            | 528                    | 528                    |
| **daily**   | 528             | 792            | Unlimited              | 792                    |
| **hourly**  | Unlimited       | Unlimited      | Unlimited              | Unlimited              |
@ -0,0 +1,57 @@
---
title: Accessing Interactive Nodes
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/interactive.html
---

## Login nodes description

The Merlin6 login nodes are the official machines for accessing the Merlin6 cluster.
From these machines, users can submit jobs to the Slurm batch system as well as visualize or compile their software.

The Merlin6 login nodes are the following:

| Hostname            | SSH | NoMachine | #cores      | CPU                                | Memory | Scratch    | Scratch Mountpoint |
| ------------------- | --- | --------- | ----------- |:---------------------------------- | ------ | ---------- |:------------------ |
| merlin-l-01.psi.ch  | yes | -         | 32 (2 x 16) | 2 x Intel Xeon E5-2697A v4 2.60GHz | 512GB  | 100GB SAS  | ``/scratch``       |
| merlin-l-02.psi.ch  | yes | yes       | 32 (2 x 16) | 2 x Intel Xeon E5-2697A v4 2.60GHz | 512GB  | 100GB SAS  | ``/scratch``       |
| merlin-l-001.psi.ch | -   | -         | 44 (2 x 22) | 2 x Intel Xeon Gold 6152 2.10GHz   | 512GB  | 2.0TB NVMe | ``/scratch``       |
| merlin-l-002.psi.ch | -   | -         | 44 (2 x 22) | 2 x Intel Xeon Gold 6142 2.10GHz   | 512GB  | 2.0TB NVMe | ``/scratch``       |

* ``merlin-l-001`` and ``merlin-l-002`` are not in production yet, hence SSH access is not possible.

---

## Remote Access

### SSH Access

For interactive command access, use an SSH client. We recommend using X11 forwarding, even though it is not the officially supported access method; it may help with opening X applications (a client-side configuration sketch is shown further below).

For Linux:

```bash
ssh -XY $username@merlin-l-01.psi.ch
```

X applications are supported on the login nodes, and X11 forwarding can be used by users who have properly configured X11 support on their desktops:
* Merlin6 administrators **do not offer support** for user desktop configuration (Windows, macOS, Linux).
  * Hence, Merlin6 administrators **do not offer official support** for X11 client setup.
  * However, a generic guide for X11 client setup (Windows, Linux and macOS) will be provided.
* PSI desktop configuration issues must be addressed through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
  * The ticket will be redirected to the corresponding Desktop support group (Windows, Linux).
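
If you connect frequently, X11 forwarding can also be enabled per host in the SSH client configuration. The following is a minimal sketch of an ``~/.ssh/config`` entry for an OpenSSH client on Linux/macOS; the ``merlin`` alias and the username are illustrative placeholders, not official names:

```bash
# ~/.ssh/config -- minimal sketch (OpenSSH client)
Host merlin                       # illustrative alias, choose any name
    HostName merlin-l-01.psi.ch   # or merlin-l-02.psi.ch
    User your_psi_username        # replace with your PSI username
    ForwardX11 yes                # equivalent to 'ssh -X'
    ForwardX11Trusted yes         # equivalent to 'ssh -Y'
```

With such an entry in place, ``ssh merlin`` behaves like ``ssh -XY your_psi_username@merlin-l-01.psi.ch``.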

### NoMachine Access

X applications are supported on the login nodes and can run through NoMachine. This service is officially supported in the Merlin6 cluster and is the official X service.
* NoMachine *client installation* support has to be requested through **[PSI Service Now](https://psi.service-now.com/psisp)** as an *Incident Request*.
  * The ticket will be redirected to the corresponding support group (Windows or Linux).
* NoMachine *client configuration* and *connectivity* for Merlin6 are fully supported by the Merlin6 administrators.
  * Please contact us through the official channels for any configuration issue with NoMachine.

---
pages/merlin6/accessing-merlin6/accessing-merlin6.md (new file, 11 lines)
@ -0,0 +1,11 @@
---
title: Accessing Merlin6
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/accessing-merlin6.html
---

This chapter describes how to access the Merlin6 cluster.
pages/merlin6/accessing-merlin6/accessing-slurm.md (new file, 98 lines)
@ -0,0 +1,98 @@
---
title: Accessing Slurm Cluster
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-access.html
---

## The Merlin6 Slurm batch system

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.

Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.
* Two different Slurm clusters exist: **merlin5** and **merlin6**.
  * **merlin5** is a cluster with very old hardware (out of warranty).
  * **merlin5** will exist as long as hardware incidents remain minor and easy to repair (i.e. hard disk replacement).
  * **merlin6** is the default cluster when submitting jobs (a quick check is shown below).
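
To quickly see which clusters and partitions are visible from a login node, the standard Slurm client commands accept a ``--clusters`` option. A small sketch, assuming the multi-cluster setup described above:

```bash
# Show the partitions of both clusters from a merlin6 login node
sinfo --clusters=merlin5,merlin6

# Show your own queued/running jobs on a specific cluster
squeue --clusters=merlin6 --user=$USER
```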

This document is mostly focused on the **merlin6** cluster. Details for **merlin5** are not shown here, and only basic access and recent
changes will be explained (the **[Official Merlin5 User Guide](https://intranet.psi.ch/PSI_HPC/Merlin5)** is still valid).

### Merlin6 Slurm Configuration Details

To understand the Slurm configuration setup of the cluster, it can sometimes be useful to check the following files:

* ``/etc/slurm/slurm.conf`` - can be found on the login nodes and computing nodes.
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.

The configuration files found on the *login nodes* correspond exclusively to the **merlin6** cluster configuration. These
configuration files are also present on the **merlin6** *computing nodes*.

Slurm configuration files for the old **merlin5** cluster have to be checked directly on any of the **merlin5** *computing nodes*: those files *do
not* exist on the **merlin6** *login nodes*.
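
The same information can also be queried through Slurm itself instead of reading the files directly; for example, with the standard ``scontrol`` client (a small sketch, and the grep patterns are just examples):

```bash
# Print the full running configuration of the cluster served by this login node
scontrol show config | less

# Show only a few scheduling-related entries, for example
scontrol show config | grep -i -e schedulertype -e selecttype
```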

### Merlin5 Access

Keeping the **merlin5** cluster allows jobs to keep running on the old computing nodes until users have fully migrated their codes to the new cluster.

From July 2019, **merlin6** becomes the **default cluster** and any job submitted to Slurm will be submitted to that cluster.
However, users can keep submitting to the old **merlin5** computing nodes by using the option ``--clusters=merlin5`` together with the corresponding
Slurm partition, ``--partition=merlin``. For example:

```bash
srun --clusters=merlin5 --partition=merlin hostname
sbatch --clusters=merlin5 --partition=merlin myScript.batch
```

---

## Using Slurm 'merlin6' cluster

Basic usage of the **merlin6** cluster is detailed here. For advanced usage, please use the following document [LINK TO SLURM ADVANCED CONFIG]()

### Merlin6 Node definition

The following table shows the default and maximum resources that can be used per node:

| Nodes                              | Def.#CPUs | Max.#CPUs | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:---------------------------------- | ---------:| ---------:| -----------:| -----------:| ------------:| --------:| --------- | --------- |
| merlin-c-[001-022,101-122,201-222] | 1 core    | 44 cores  | 8000        | 352000      | 352000       | 10000    | N/A       | N/A       |
| merlin-g-[001]                     | 1 core    | 8 cores   | 8000        | 102498      | 102498       | 10000    | 1         | 2         |
| merlin-g-[002-009]                 | 1 core    | 10 cores  | 8000        | 102498      | 102498       | 10000    | 1         | 4         |

If nothing is specified, by default each core will use up to 8GB of memory. More memory per core can be specified with the ``--mem=<memory>`` option,
and the maximum memory allowed is ``Max.Mem/Node``.

In *Merlin6*, memory is considered a Consumable Resource, as is the CPU.
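
As an illustration of these defaults, the following submissions request more memory than the 8GB-per-core default; the script name and the resource values are placeholders, not site recommendations (Slurm interprets plain numbers as megabytes):

```bash
# Request 4 tasks with 16GB per core (64GB in total)
sbatch --ntasks=4 --mem-per-cpu=16000 myScript.batch

# Alternatively, request a total amount of memory for the job
sbatch --ntasks=4 --mem=64000 myScript.batch
```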

### Merlin6 Slurm partitions

The partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:

| Partition   | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true              | 1 day        | 1 week   | 50        | low      |
| **daily**   | false             | 1 day        | 1 day    | 60        | medium   |
| **hourly**  | false             | 1 hour       | 1 hour   | unlimited | highest  |

**general** is the *default*, so when nothing is specified a job will be assigned to that partition. **general** can not have more than 50 nodes
running jobs. For **daily** this limit is extended to 60 nodes, while for **hourly** there are no limits. Shorter jobs have higher priority than
longer jobs, hence in general they will be scheduled earlier (however, other factors, such as the user's fair-share value, can affect this decision).
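
A minimal batch script sketch showing how a partition and a matching time limit can be selected (partition names are taken from the table above; everything else is an illustrative placeholder):

```bash
#!/bin/bash
#SBATCH --job-name=example        # illustrative job name
#SBATCH --partition=hourly        # general (default), daily or hourly
#SBATCH --time=00:30:00           # must fit within the partition's Max Time
#SBATCH --ntasks=1
#SBATCH --output=example-%j.out   # %j expands to the Slurm job id

hostname
```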

### Merlin6 User limits

By default, users can not use more than 528 cores at the same time (max. CPUs per user). This limit applies to the **general** and **daily** partitions. For the **hourly** partition, there is no restriction.
These limits are relaxed for the **daily** partition during non-working hours and during the weekend as follows (an example of checking your current core usage is shown after the table):

| Partition   | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ---------------------- | ---------------------- |
| **general** | 528             | 528            | 528                    | 528                    |
| **daily**   | 528             | 792            | Unlimited              | 792                    |
| **hourly**  | Unlimited       | Unlimited      | Unlimited              | Unlimited              |
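
To see how close you are to these limits, you can count the cores currently allocated to your running jobs; a small sketch using standard Slurm commands (the format strings are just one possible choice):

```bash
# List your running jobs together with the number of allocated CPUs
squeue --user=$USER --states=RUNNING --format="%.10i %.9P %.8T %.6C"

# Sum the allocated CPUs over all of your running jobs
squeue --user=$USER --states=RUNNING --noheader --format="%C" | awk '{s+=$1} END {print s}'
```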
pages/merlin6/accessing-merlin6/merlin6-directories.md (new file, 156 lines)
@ -0,0 +1,156 @@
---
title: Merlin6 Data Directories
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/data-directories.html
---

## Merlin6 directory structure

Merlin6 contains the following directories available to users:

* ``/psi/home/<username>``: private user **home** directory
* ``/data/user/<username>``: private user **data** directory
* ``/data/project/general/<projectname>``: shared **project** directory
  * For BIO experiments, a dedicated ``/data/project/bio/$projectname`` exists.
* ``/scratch``: local *scratch* disk.
* ``/shared-scratch``: shared *scratch* disk.

A summary for each directory:

| Directory                          | Block Quota [Soft:Hard] | File Quota [Soft:Hard] | Quota Change Policy: Block        | Quota Change Policy: Files       | Backup | Backup Policy                  |
| ---------------------------------- | ----------------------- | ---------------------- |:--------------------------------- |:-------------------------------- | ------ |:------------------------------ |
| /psi/home/$username                | USR [10GB:11GB]         | *Undef*                | Up to x2 when strictly justified. | N/A                              | yes    | Daily snapshots for 1 week     |
| /data/user/$username               | USR [1TB:1.074TB]       | USR [1M:1.1M]          | Immutable. Needs a project.       | Changeable when justified.       | no     | Users responsible for backup   |
| /data/project/bio/$projectname     | GRP [1TB:1.074TB]       | GRP [1M:1.1M]          | Subject to project requirements.  | Subject to project requirements. | no     | Project responsible for backup |
| /data/project/general/$projectname | GRP [1TB:1.074TB]       | GRP [1M:1.1M]          | Subject to project requirements.  | Subject to project requirements. | no     | Project responsible for backup |
| /scratch                           | *Undef*                 | *Undef*                | N/A                               | N/A                              | no     | N/A                            |
| /shared-scratch                    | *Undef*                 | *Undef*                | N/A                               | N/A                              | no     | N/A                            |

---

## User home directory

Home directories are part of the PSI NFS Central Home storage provided by AIT.
However, administration of the Merlin6 NFS homes is delegated to the Merlin6 administrators.

This is the default directory users will land in when logging in to any Merlin6 machine.
It is mounted on the login and computing nodes under:

```bash
/psi/home/$username
```

Users can check their quota by running the following command:

```bash
quota -s
```

### Home directory policy

* Read the **[Code of Conduct](/merlin6/code-of-conduct.html)** for more information about Merlin6 policies.
* It is **forbidden** to use the home directories for IO-intensive tasks.
  * Use ``/scratch``, ``/shared-scratch``, ``/data/user`` or ``/data/project`` for this purpose.
* Users can recover up to 1 week of their lost data thanks to the automatic **daily snapshots kept for 1 week**.
  Snapshots are found in the following directory:

```bash
/psi/home/.snapshop/$username
```
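
Restoring a lost file is then a matter of copying it back from the snapshot area; a sketch, assuming the snapshot tree mirrors the home directory layout (the snapshot and file names below are placeholders and the exact layout may differ):

```bash
# List the available snapshots for your home directory
ls /psi/home/.snapshop/$USER/

# Copy a file back from a given snapshot into your current home
cp -p /psi/home/.snapshop/$USER/<snapshot-name>/myfile.txt ~/myfile.txt
```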

---

## User data directory

User data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.

The user data directory is intended for *fast IO access* and for keeping a large amount of private data.
It is mounted on the login and computing nodes under:

```bash
/data/user/$username
```

Users can check their quota by running the following command:

```bash
mmlsquota -u <username> --block-size auto merlin-user
```

### User Directory policy

* Read the **[Code of Conduct](/merlin6/code-of-conduct.html)** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
  * Use ``/scratch`` or ``/shared-scratch`` for this purpose.
* No backup policy is applied to user data directories: users are responsible for backing up their data.

---

## Project data directory

Project data directories are part of the Merlin6 storage cluster, whose technology is based on GPFS.

This storage is intended for *fast IO access* and for keeping a large amount of private data, but also for sharing data amongst
different users working on the same project.
Creating a project is the way users can expand their storage space, and it optimizes the usage of the storage
(for instance, by avoiding duplicated data for different users).

Using a project is **highly** recommended when multiple persons are involved in the same project managing similar/common data.
Quotas are defined on a *group* and *fileset* basis: a Unix group name must exist for a specific project, or must be created for
any new project. Contact the Merlin6 administrators for more information about that.

The project data directory is mounted on the login and computing nodes under:

```bash
/data/project/$projectname
```

Users can check the project quota by running the following command:

```bash
mmrepquota merlin-proj:$projectname
```
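
Within an existing project directory, shared access is usually managed with ordinary Unix group permissions; a generic sketch, assuming the project Unix group already exists (both the group and project names below are placeholders):

```bash
# Create a subdirectory writable by every member of the project group
mkdir -p /data/project/general/myproject/shared
chgrp unx-myproject /data/project/general/myproject/shared
chmod 2770 /data/project/general/myproject/shared   # setgid keeps new files in the project group
```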

### Project Directory policy

* Read the **[Code of Conduct](/merlin6/code-of-conduct.html)** for more information about Merlin6 policies.
* It is **forbidden** to use the data directories as a ``scratch`` area during job runtime.
  * Use ``/scratch`` or ``/shared-scratch`` for this purpose.
* No backups: users are responsible for managing the backups of their data directories.

---

## Scratch directories

There are two different types of scratch disk: **local** (``/scratch``) and **shared** (``/shared-scratch``).
Specific details of each type are described below.

Usually **shared** scratch is used by jobs running on multiple nodes which need access to a common shared space
for creating temporary files, while **local** scratch should be used by jobs needing a node-local space for creating temporary files.

**Local** scratch on the Merlin6 computing nodes provides a huge number of IOPS thanks to the NVMe technology,
while **shared** scratch, despite also being very fast, is an external GPFS storage with more latency.

``/shared-scratch`` is only mounted on the *Merlin6* computing nodes, and its current size is 50TB. Whenever necessary, it can be increased in the future.

A summary of the scratch directories:

| Cluster | Service        | Scratch      | Scratch Mountpoint | Shared Scratch | Shared Scratch Mountpoint | Comments                               |
| ------- | -------------- | ------------ | ------------------ | -------------- | ------------------------- | -------------------------------------- |
| merlin5 | computing node | 50GB / SAS   | ``/scratch``       | ``N/A``        | ``N/A``                   | ``merlin-c-[01-64]``                   |
| merlin6 | login node     | 100GB / SAS  | ``/scratch``       | ``N/A``        | ``N/A``                   | ``merlin-l-0[1,2]``                    |
| merlin6 | computing node | 1.3TB / NVMe | ``/scratch``       | 50TB / GPFS    | ``/shared-scratch``       | ``merlin-c-[001-022,101-122,201-222]`` |
| merlin6 | login node     | 2.0TB / NVMe | ``/scratch``       | ``N/A``        | ``N/A``                   | ``merlin-l-00[1,2]``                   |

### Scratch directories policy

* Read the **[Code of Conduct](/merlin6/code-of-conduct.html)** for more information about Merlin6 policies.
* By default, *always* use **local** scratch first and only use **shared** scratch if your specific use case needs a shared scratch area.
* Temporary files *must be deleted at the end of the job by the user* (see the sketch after this list).
  * Remaining files will be deleted by the system if detected.
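
A common pattern is to create a per-job directory on the local scratch disk and remove it when the job finishes; a minimal batch script sketch (paths, partition and resource values are illustrative only):

```bash
#!/bin/bash
#SBATCH --partition=hourly
#SBATCH --time=00:30:00
#SBATCH --ntasks=1

# Per-job temporary directory on the node-local scratch disk
SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"

# ... run the actual work here, writing temporary data into $SCRATCHDIR ...
hostname > placeholder_output.txt

# Copy the results you want to keep back to your data directory, then clean up
cp placeholder_output.txt /data/user/$USER/
cd
rm -rf "$SCRATCHDIR"
```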

---
@ -0,0 +1,76 @@
---
title: Requesting Merlin6 Accounts
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/request-account.html
---

## Requesting Access to Merlin6

PSI users whose Linux account belongs to the **svc-cluster_merlin6** group are allowed to use Merlin6.

Registration for **Merlin6** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:

* Please open a ticket as an *Incident Request*, with the subject:

```bash
Subject: [Merlin6] Access Request for user '$username'
```

* Text content (please always use this template):

```bash
Dear HelpDesk,

my name is $Name $Surname with PSI username $username and I would like to request access to the Merlin6 cluster.

Please add me to the following Unix groups:
* 'svc-cluster_merlin6'

Thanks a lot,
$Name $Surname
```

---

## Requesting Access to Merlin5

Merlin5 computing nodes will be available for some time as a **best effort** service.
For accessing the old Merlin5 resources, users should belong to the **svc-cluster_merlin5** Unix group.

Registration for **Merlin5** access *must be done* through **[PSI Service Now](https://psi.service-now.com/psisp)**:

* Please open a ticket as an *Incident Request*, with the subject:

```bash
Subject: [Merlin5] Access Request for user '$username'
```

* Text content (please always use this template):

```bash
Dear HelpDesk,

my name is $Name $Surname with PSI username $username and I would like to request access to the Merlin5 cluster.

Please add me to the following Unix groups:
* 'svc-cluster_merlin5'

Thanks a lot,
$Name $Surname
```

---

## Requesting extra Unix groups

* Some users may need to be added to extra specific Unix groups.
  * This will grant access to specific resources.
  * For example, some BIO users may need to belong to a specific BIO group to have access to the project area of that group.
  * Supervisors should inform new users which extra groups are needed.
* When requesting access to **[Merlin6](#requesting-access-to-merlin6)** or **[Merlin5](#requesting-access-to-merlin5)**, extra groups can be added in the same *Incident Request*.
  * Alternatively, this step can be done later in a separate **[PSI Service Now](https://psi.service-now.com/psisp)** ticket.
* If you want to request access to both Merlin5 and Merlin6:
  * Use the template in **[Requesting Access to Merlin6](#requesting-access-to-merlin6)** and also add the **``'svc-cluster_merlin5'``** Unix group to the request.
@ -1,35 +0,0 @@
---
layout: default
title: Code Of Conduct
parent: Merlin6 User Guide
nav_order: 2
---

# Code Of Conduct
{: .no_toc }

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

---

The basic principle is courtesy and consideration for other users.

* Merlin6 is a shared resource, not your laptop; therefore you are kindly requested to behave in a way you would be happy to see other users behaving towards you.
* Basic shell programming skills in a Linux/UNIX environment are a must-have requirement for HPC users; proficiency in shell programming would be greatly beneficial.
* The login nodes are for development and quick testing:
  * It is **strictly forbidden to run production jobs** on the login nodes.
  * It is **forbidden to run long processes** occupying a big part of the resources.
  * *Any processes misbehaving according to these rules will be killed.*
* All production jobs should be submitted using the batch system.
* Make sure that no broken or run-away processes are left when your job is done. Keep the process space clean on all nodes.
* During the runtime of a job, it is mandatory to use the ``/scratch`` and ``/shared-scratch`` partitions for temporary data. This also applies to temporary data generated on login nodes:
  * It is **forbidden** to use ``/data/user``, ``/data/project`` or ``/psi/home/`` for that purpose.
  * Always remove files you do not need any more (e.g. core dumps, temporary files) as early as possible. Keep the disk space clean on all nodes.
  * Read the description in **[Merlin6 directory structure](### Merlin6 directory structure)** for the correct usage of each partition type.

The system administrator has the right to block access to Merlin6 for an account violating the Code of Conduct, in which case the issue will be escalated to the user's supervisor.
The system administrator has the right to delete files in the *scratch* directories exceeding the above rules.
pages/merlin6/code-of-conduct.md (new file, 40 lines)
@ -0,0 +1,40 @@
---
title: Code Of Conduct
#tags:
#keywords:
last_updated: 13 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/code-of-conduct.html
---

## The Basic Principle

The basic principle is courtesy and consideration for other users.

* Merlin6 is a shared resource, not your laptop; therefore you are kindly requested to behave in a way you would be happy to see other users behaving towards you.
* Basic shell programming skills in a Linux/UNIX environment are a must-have requirement for HPC users; proficiency in shell programming would be greatly beneficial.

## Interactive nodes

* The interactive nodes (also known as login nodes) are for development and quick testing:
  * It is **strictly forbidden to run production jobs** on the login nodes.
  * It is **forbidden to run long processes** occupying a big part of the resources.
  * According to the previous rules, **any misbehaving running processes will be killed.**
* All production jobs must be submitted to the batch system.

## Batch system

* Make sure that no broken or run-away processes are left when your job is done. Keep the process space clean on all nodes.
* During the runtime of a job, it is mandatory to use the ``/scratch`` and ``/shared-scratch`` partitions for temporary data:
  * It is **forbidden** to use ``/data/user``, ``/data/project`` or ``/psi/home/`` for that purpose.
  * Always remove files you do not need any more (e.g. core dumps, temporary files) as early as possible. Keep the disk space clean on all nodes.
  * Prioritize the use of ``/scratch`` over ``/shared-scratch`` and use the latter only when clearly needed (for example, when a shared disk visible from multiple nodes is required).
  * Read the description in **[Merlin6 directory structure](/merlin6/data-directories.html)** for the correct usage of each partition type.

## System Administrator Rights

* The system administrator has the right to block access to Merlin6 for an account violating the Code of Conduct.
  * The issue will be escalated to the user's supervisor.
* The system administrator has the right to delete files in the **scratch** directories exceeding the above rules.
* The system administrator has the right to kill any misbehaving running processes.
@ -1,9 +1,9 @@
 ---
 title: Contact
-tags:
+#tags:
-keywords:
+#keywords:
 last_updated: 13 June 2019
-summary: "Contact information for merlin support"
+#summary: ""
 sidebar: merlin6_sidebar
 permalink: /merlin6/contact.html
 ---
@ -18,37 +18,36 @@ Support can be done through:

 ### PSI Service Now

-[PSI Service Now](https://psi.service-now.com/psisp) is the official tool for opening incidents.
-PSI HelpDesk will redirect the incident to the corresponding department, but you can always assign it directly to us
-(``Assignment Group['itsm-sci_hpc_loc']``).
+* **[PSI Service Now](https://psi.service-now.com/psisp)**: is the official tool for opening incidents.
+  * PSI HelpDesk will redirect the incident to the corresponding department, or
+  * you can always assign it directly to us: **Assignment Group[``'itsm-sci_hpc_loc'``]**.

 ### Contact Merlin6 Administrators

-An official mail list is available for contacting Merlin6 Administrators:
-* <merlin-admins@lists.psi.ch>
-  * This is the official way to contact Merlin6 Administrators.
-  * Do not hesitate to contact us on any question, request and/or problem.
+* **E-Mail: <merlin-admins@lists.psi.ch>**
+  * This is the official way to contact Merlin6 Administrators.
+  * Do not hesitate to contact us on any question, request and/or issue.

 ---

 ## Get updated through the Merlin User list!

-Is *strictly* recommended to register to the Merlin Users mail list:
-* <merlin-users@lists.psi.ch>
-  * please subscribe to this list to receive updates about Merlin6 general
-    information, interventions and system improvements useful for users.
-  * Users can be subscribed in two ways:
-    * [Sympa Link](https://psilists.ethz.ch/sympa/info/merlin-users)
-    * Send a request to the admin list: <merlin-admins@lists.psi.ch>
-
-This is the official channel we use to inform users about downtimes, interventions or problems.
+It is strictly recommended to register to the Merlin Users mail list:
+* **E-Mail: <merlin-users@lists.psi.ch>**
+  * please subscribe to this list to receive updates about Merlin6 general
+    information, interventions and system improvements useful for users.
+  * Users can be subscribed in two ways:
+    * **[Sympa Link](https://psilists.ethz.ch/sympa/info/merlin-users)**
+    * Send a request to the admin list: **<merlin-admins@lists.psi.ch>**
+
+This mail list is the official channel used by the Merlin6 administrators to inform users about downtimes, interventions or problems.

 ---

 ## The Merlin6 Team

-Merlin6 is managed by the [High Performance Computing and Emerging technologies Group](https://www.psi.ch/de/lsm/hpce-group), which
-is one of the from from the [Laboratory for Scientific Computing and Modelling](https://www.psi.ch/de/lsm).
+Merlin6 is managed by the **[High Performance Computing and Emerging technologies Group](https://www.psi.ch/de/lsm/hpce-group)**, which
+is one of the groups of the **[Laboratory for Scientific Computing and Modelling](https://www.psi.ch/de/lsm)**.

-For more information about our team and contacts please visit: <https://www.psi.ch/de/lsm/hpce-group>
+For more information about our team and contacts please visit: **<https://www.psi.ch/de/lsm/hpce-group>**
@ -1,8 +1,11 @@
 ---
-layout: default
 title: Hardware And Software Description
-parent: Merlin6 User Guide
-nav_order: 3
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/hardware-and-software.html
 ---

 # Hardware And Software Description
@ -98,8 +101,8 @@ The solution is equipped with 334 x 10TB disks providing a useable capacity of 2

 Merlin6 cluster connectivity is based on the [Infiniband](https://en.wikipedia.org/wiki/InfiniBand) technology. This allows fast access with very low latencies to the data as well as running
 extremely efficient MPI-based jobs:
 * Connectivity amongst different computing nodes on different chassis ensures up to 1200Gbps of aggregated bandwidth.
 * Inter connectivity (communication amongst computing nodes in the same chassis) ensures up to 2400Gbps of aggregated bandwidth.
 * Communication to the storage ensures up to 800Gbps of aggregated bandwidth.

 Merlin6 cluster currently contains 5 Infiniband Managed switches and 3 Infiniband Unmanaged switches (one per HP Apollo chassis):
@ -1,9 +1,9 @@
 ---
 title: Introduction
-tags:
+#tags:
-keywords:
+#keywords:
 last_updated: 13 June 2019
-summary: "Merlin 6 cluster overview"
+#summary: "Merlin 6 cluster overview"
 sidebar: merlin6_sidebar
 permalink: /merlin6/introduction.html
 ---
@ -25,4 +25,4 @@ of GPU-based resources which are mostly used by the BIO experiments.

 ## Merlin6

-
+[][https://lsm-hpce.gitpages.psi.ch/jekyll-example1/docs/merlin6-user-guide/source/images/merlinschema3.png]{: .shadow}
@ -1,20 +1,11 @@
 ---
-layout: default
-title: Known Problems and Troubleshooting
-parent: Merlin6 User Guide
-nav_order: 7
----
-
-# Known Problems and Troubleshooting
-{: .no_toc }
-
-## Table of contents
-{: .no_toc .text-delta }
-
-1. TOC
-{:toc}
-
+title: Known Problems and Troubleshooting
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/troubleshooting.html
 ---

 ## Known Problems
@ -37,7 +28,7 @@ paraview --mesa

It may happen that code compiled on one machine will not run on another, throwing an exception like "(Illegal instruction)".
Check (with the "hostname" command) which node you are on and compare it with the names listed in the first item. We have observed a few applications
that cannot run on merlin-c-01..16 because of this problem (note that these machines are more than 5 years old). Hint: you may
choose a particular flavour of machine for your Slurm job; check the "--cores-per-socket" option of sbatch:

```bash
sbatch --cores-per-socket=8 Script.sh # will filter the selection of the machines and exclude the oldest ones, merlin-c-01..16
```
@ -48,7 +39,7 @@ sbatch --cores-per-socket=8 Script.sh # will filter the selection of the machine

### Before asking for help

If you have problems running jobs and you want to report something or just ask for help,
please gather and attach the following information in advance:

* Unix username and session (``who am i`` command output)
@ -56,10 +47,10 @@ please gather and attach in advance the following information:
* Slurm batch script location (path to script and input/output files)
* Slurm job_id (the ``id`` is returned by the ``sbatch``/``salloc`` command, but it can also be taken from the ``squeue`` command)

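The following is only a minimal sketch of how this information could be collected in one go; the job ID and the output file name are placeholders, not values from the guide:

```bash
# Hypothetical helper: collect the support information listed above into one file.
# JOBID is a placeholder; replace it with your own Slurm job id.
JOBID=1234567
{
  echo "== Session ==";    who am i
  echo "== Directory ==";  pwd
  echo "== Modules ==";    module list
  echo "== My jobs ==";    squeue -u "$(whoami)"
  echo "== Job detail =="; scontrol show job "$JOBID"
} > support-info.txt 2>&1
```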
### Troubleshooting SSH

Use the ssh command with the "-vvv" option and copy and paste (no screenshots, please)
the output into your request in Service-Now. Example:

```bash
ssh -Y -vvv bond_j@merlin-l-01
```
@ -67,26 +58,26 @@ ssh -Y -vvv bond_j@merlin-l-01

### Troubleshooting SLURM

If one copies Slurm commands or batch scripts from another cluster,
they may need some changes (often minor) to run successfully on Merlin5.
Examine the error message carefully, especially concerning the options
used in the Slurm commands.

Try to submit jobs using the examples given in the section "Using Batch System to Submit Jobs to Merlin5".
If you can successfully run an example for a type of job (OpenMP, MPI) similar to yours,
try to edit the example to run your application.

If the problem remains, then, in your request in Service-Now, describe the problem with the details
needed to reproduce it. Include the output of the following commands:

```bash
date
hostname
pwd
module list

# All slurm commands used with the corresponding output
```

Do not delete any output and error files generated by Slurm.
Make a copy of the failed job script if you would like to edit it in the meantime.

@ -1,10 +1,11 @@
---
-layout: default
title: Merlin6 Slurm
-parent: Merlin6 User Guide
-nav_order: 5
-has_children: true
-permalink: /docs/merlin6-user-guide/merlin6-slurm/merlin6-slurm.html
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/slurm.html
---

# Merlin6 Slurm
@ -1,20 +1,11 @@
---
-layout: default
title: Slurm Basic Commands
-parent: Merlin6 Slurm
-grand_parent: Merlin6 User Guide
-nav_order: 1
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/slurm-basics.html
---

-# Slurm Basic Commands
-{: .no_toc }
-
-## Table of contents
-{: .no_toc .text-delta }
-
-1. TOC
-{:toc}
-
---

## Basic commands
@ -42,13 +33,13 @@ sprio -l # to view the factors that comprise a job's scheduling priority

## Basic slurm example

You can copy-paste the following example into a file called ``mySlurm.batch``.
Some basic parameters are explained in the example.
Please note that ``#`` marks an enabled option while ``##`` marks a commented-out option (no effect).

```bash
#!/bin/sh
#SBATCH --partition=daily         # name of slurm partition to submit. Can be 'general' (default if not specified), 'daily', 'hourly'.
#SBATCH --job-name="mySlurmTest"  # name of the job. Useful when submitting different types of jobs for filtering (i.e. 'squeue' command)
#SBATCH --time=0-12:00:00         # time limit. Here it is shortened to 12 hours (default and max for 'daily' is 1 day).
#SBATCH --exclude=merlin-c-001    # nodes you want to exclude from the job allocation
@ -74,17 +65,17 @@ squeue # check its status

---

## Advanced slurm test script

Copy-paste the following example into a file called ``myAdvancedTest.batch``:

```bash
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=2            # number of nodes
#SBATCH --ntasks=24          # number of tasks

module load gcc/8.3.0 openmpi/3.1.3
module list

@ -92,15 +83,15 @@ echo "Example no-MPI:" ; hostname # will print one hostname per node
echo "Example MPI:" ; mpirun hostname # will print one hostname per ntask
```

In the above example the options ``--nodes=2`` and ``--ntasks=24`` are specified. This means that up to 2 nodes are requested,
and the job is expected to run 24 tasks. Hence, 24 cores are needed for running that job. Slurm will try to allocate a maximum of 2 nodes,
both together having at least 24 cores. Since our nodes have 44 cores each, if the nodes are empty (no other users
have running jobs there), the job will land on a single node (it has enough cores to run 24 tasks).

If you want to ensure that the job uses at least two different nodes (i.e. for boosting CPU frequency, or because the job requires
more memory per core), you should specify additional options.

A good example is ``--ntasks-per-node=12``. This will equally distribute the 24 tasks over the 2 nodes (12 tasks per node).

```bash
#SBATCH --ntasks-per-node=12
@ -115,9 +106,9 @@ can allow the use of at least 12 cores per node (i.e. ``28000``)
#SBATCH --mem-per-cpu=28000
```

Finally, in order to ensure exclusive use of a node, the option *--exclusive* can be used (see below). This will ensure that
the requested nodes are exclusive to the job (no other user's jobs will interact with these nodes, and only completely
free nodes will be allocated).

```bash
#SBATCH --exclusive
@ -125,7 +116,7 @@ free nodes will be allocated).

This can be combined with the previous examples.
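As an illustration only (not part of the original guide), the options discussed above could be combined into a single batch header; the values below are the same illustrative ones used earlier on this page:

```bash
#!/bin/bash
#SBATCH --partition=daily        # one of 'general', 'daily', 'hourly'
#SBATCH --job-name="myCombined"  # illustrative job name
#SBATCH --time=0-12:00:00        # within the 'daily' limit
#SBATCH --nodes=2                # spread the job over two nodes
#SBATCH --ntasks=24              # 24 tasks in total
#SBATCH --ntasks-per-node=12     # 12 tasks on each node
#SBATCH --mem-per-cpu=28000      # memory per core, as discussed above
#SBATCH --exclusive              # whole nodes, no sharing with other jobs

module load gcc/8.3.0 openmpi/3.1.3
mpirun hostname                  # one line per task, spread over both nodes
```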

More advanced configurations can be defined and combined with the previous examples. More information about advanced
options can be found at the following link: https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
@ -133,7 +124,7 @@ advanced configurations unless your are sure of what you are doing.

---

## Environment Modules

On top of the operating system stack we provide different software using the PSI-developed
pmodule system. Useful commands:
@ -144,9 +135,9 @@ module load gnuplot/5.2.0 # to load specific version of
module search hdf                                # try it out to see which versions of the hdf5 package are provided and with which dependencies
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17 # load the specific version of hdf5, compiled with specific versions of gcc and openmpi
module use unstable                              # to get access to packages which are not yet considered fully stable by the module provider (may be a very fresh version, or not yet tested by the community)
module list                                      # to see which software is loaded in your environment
```

### Requests for New Software

If you miss some package/version, contact us.
@ -1,25 +1,16 @@
---
-layout: default
title: Slurm Configuration
-parent: Merlin6 Slurm
-grand_parent: Merlin6 User Guide
-nav_order: 2
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/slurm-configuration.html
---

-# Slurm Configuration
-{: .no_toc }
-
-## Table of contents
-{: .no_toc .text-delta }
-
-1. TOC
-{:toc}
-
---

## Using the Slurm batch system

Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Historically, *Merlin4* and *Merlin5* also used Slurm. In the same way, **Merlin6** has also been configured with this batch system.

Slurm has been installed in a **multi-clustered** configuration, allowing the integration of multiple clusters in the same batch system.
@ -30,7 +21,7 @@ For understanding the Slurm configuration setup in the cluster, sometimes may be
* ``/etc/slurm/cgroup.conf`` - can be found on the computing nodes; it is also propagated to the login nodes for user read access.
* ``/etc/slurm/gres.conf`` - can be found on the GPU nodes; it is also propagated to the login nodes and computing nodes for user read access.

The previous configuration files, which can be found on the login nodes, correspond exclusively to the **merlin6** cluster configuration files.
Configuration files for the old **merlin5** cluster must be checked directly on any of the **merlin5** computing nodes: these are not propagated
to the **merlin6** login nodes.

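As a quick cross-check of the active settings without opening those files, something like the following could be run from a login node (a sketch only; the grep pattern is just an example):

```bash
# Ask the scheduler itself for the running configuration (read-only).
scontrol show config | grep -i -E 'ClusterName|SelectType|DefMemPerCPU'

# Or read the propagated merlin6 configuration file directly:
less /etc/slurm/slurm.conf
```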
@ -40,9 +31,9 @@ The new Slurm cluster is called **merlin6**. However, the old Slurm *merlin* clu
This allows jobs to keep running on the old computing nodes until users have fully migrated their codes to the new cluster.

From July 2019, **merlin6** becomes the **default cluster** and any job submitted to Slurm will be submitted to that cluster. Users can keep submitting to
the old *merlin5* computing nodes by using the option ``--cluster=merlin5``.

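For example (an illustrative sketch of the option just described; the batch script name is a placeholder):

```bash
# Submit a batch script to the old merlin5 cluster instead of the default merlin6.
sbatch --cluster=merlin5 mySlurm.batch

# List jobs on the merlin5 cluster.
squeue -M merlin5
```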
This documentation only explains the usage of the **merlin6** Slurm cluster.

### Using Slurm 'merlin6' cluster

@ -65,26 +56,26 @@ In *Merlin6*, memory is considered a Consumable Resource, as well as the CPU.

#### Merlin6 Slurm partitions

The partition can be specified when submitting a job with the ``--partition=<partitionname>`` option.
The following *partitions* (also known as *queues*) are configured in Slurm:

| Partition   | Default Partition | Default Time | Max Time | Max Nodes | Priority |
|:----------- | ----------------- | ------------ | -------- | --------- | -------- |
| **general** | true              | 1 day        | 1 week   | 50        | low      |
| **daily**   | false             | 1 day        | 1 day    | 60        | medium   |
| **hourly**  | false             | 1 hour       | 1 hour   | unlimited | highest  |

**general** is the *default*, so when nothing is specified the job will be assigned to that partition. **general** cannot have more than 50 nodes
running jobs. For **daily** this limit is extended to 60 nodes, while for **hourly** there are no limits. Shorter jobs have higher priority than
longer jobs, so in general terms they will be scheduled earlier (however, other factors, such as the user's fair-share value, can affect this decision).

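For example (an illustrative sketch reusing the partition names from the table above; ``mySlurm.batch`` is a placeholder script):

```bash
sbatch mySlurm.batch                      # no partition given: job goes to 'general'
sbatch --partition=daily  mySlurm.batch   # up to 1 day, medium priority
sbatch --partition=hourly mySlurm.batch   # up to 1 hour, highest priority
```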
#### Merlin6 User limits

By default, users cannot use more than 528 cores at the same time (Max CPU per user). This limit applies to the **general** and **daily** partitions. For the **hourly** partition, there is no restriction.
These limits are relaxed for the **daily** partition outside working hours and during the weekend, as follows:

| Partition   | Mon-Fri 08h-18h | Sun-Thu 18h-0h | From Fri 18h to Sun 8h | From Sun 8h to Mon 18h |
|:----------- | --------------- | -------------- | ---------------------- | ---------------------- |
| **general** | 528             | 528            | 528                    | 528                    |
| **daily**   | 528             | 792            | Unlimited              | 792                    |
| **hourly**  | Unlimited       | Unlimited      | Unlimited              | Unlimited              |

@ -1,11 +1,13 @@
---
title: Merlin6 User Guide
-tags:
+#tags:
-keywords:
+#keywords:
last_updated: 13 June 2019
-summary: "Everything you need to know to run jobs"
+#summary: "Merlin 6 cluster overview"
sidebar: merlin6_sidebar
permalink: /merlin6/user-guide.html
---

-Welcome to the PSI Merlin6 Cluster.
+# Merlin6 User Guide
+
+Welcome to the PSI Merlin6 Cluster.
@ -1,20 +1,11 @@
---
-layout: default
title: Migration From Merlin5
-parent: Merlin6 User Guide
-nav_order: 7
+#tags:
+#keywords:
+last_updated: 13 June 2019
+#summary: ""
+sidebar: merlin6_sidebar
+permalink: /merlin6/migrating.html
---

-# Migration From Merlin5
-{: .no_toc }
-
-## Table of contents
-{: .no_toc .text-delta }
-
-1. TOC
-{:toc}
-
---

## Merlin5 vs Merlin6
@ -38,7 +29,7 @@ nav_order: 7
where:
* **Block** is the capacity size in GB and TB
* **Files** is the number of files + directories in Millions (M)
* The user data directory ``/data/user`` has a strict user block quota limit policy. If more disk space is required, a 'project' must be created.

### Project directory

@ -46,26 +37,26 @@ where:

In Merlin5 the concept of a *project* did not exist. A similar concept (*group*) existed and was mostly focused on BIO experiments.

Quite often different users work on *a similar* / *the same* project. Data was shared in different ways,
such as by allowing other users to access private data, or by keeping duplicates in the directory of each user needing access to that data.
This makes the storage usage inefficient and insecure.

There is another problem related to that: when a user leaves, we have plenty of data which needs to be kept and nobody becomes
responsible for it. In addition, after several months the user is unregistered from PSI and we end up with orphaned data which needs to
be kept, while we sometimes lose track of the user.

Because of that, we want to restrict the usage of individual data and favour project (shared) data. There will be one main responsible for
the project, but if for some reason this person leaves, the responsibility can be taken over by somebody else (a successor if one exists, the supervisor, or, in the
worst case, the admin).

#### Requesting a *project*

For requesting a *project*, users must provide the following:

* Define a *'project'* directory name. This must be unique.
* Have an existing *project* **Unix Group**.
  * This can be requested through [PSI Service Now](https://psi.service-now.com/psisp)
  * The Unix group must start with *``unx-``*
  * This Unix Group will be the default group for the *'project'*
* Define a project main responsible and supervisor
* Define and justify quota requirements:
@ -78,7 +69,7 @@ For requesting a *project* users must provide:

### Phase 1 [June]: Pre-migration
* Users keep working on Merlin5
* Merlin5 production directories: ``'/gpfs/home/'``, ``'/gpfs/data'``, ``'/gpfs/group'``
* Users may raise any problems (quota limits, inaccessible files, etc.) to merlin-admins@lists.psi.ch
* Users can start migrating data (see [Migration steps](#migration-steps))
* Users should copy their data from Merlin5 /gpfs/data to Merlin6 /data/user
@ -87,11 +78,11 @@ For requesting a *project* users must provide:

### Phase 2 [July-October]: Migration to Merlin6
* Merlin6 becomes the official cluster, and directories are switched to the new structure:
  * Merlin6 production directories: ``'/psi/home/'``, ``'/data/user'``, ``'/data/project'``
  * Merlin5 directories available in RO: ``'/gpfs/home/'``, ``'/gpfs/data'``, ``'/gpfs/group'``
* Users can keep migrating their data (see [Migration steps](#migration-steps))
  * ALL data must be migrated
* Job submissions go by default to Merlin6. Submission to the Merlin5 computing nodes is still possible.
* Users should inform us when their migration is done and which directories were migrated. Deletion of such directories can then be requested by the admins.

### Phase 3 [November]: Merlin5 Decommission
@ -128,7 +119,7 @@ This can take several hours or days:
### Step 2: Mirroring

Once the first migration is done, a second ``rsync`` should be run. This is done with ``--delete``. With this option ``rsync`` will
delete from the destination all files that were removed from the source, but it will also propagate
new files from the source to the destination.

```bash
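# The actual command is cut off in this hunk; as an illustration only, a mirroring
# pass as described above might look like this (the paths are placeholders):
rsync -av --delete /gpfs/data/$USER/ /data/user/$USER/
```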
@ -1,255 +0,0 @@
---
title: Using Merlin6
tags:
keywords:
last_updated: 13 June 2019
summary: "Everything you need to know to run jobs"
sidebar: merlin6_sidebar
permalink: /merlin6/use.html
---

## Important: Code of Conduct

The basic principle is courtesy and consideration for other users.

* Merlin6 is a shared resource, not your laptop, therefore you are kindly requested to behave in a way you would be happy to see other users behaving towards you.
* Basic shell programming skills in a Linux/UNIX environment are a must-have requirement for HPC users; proficiency in shell programming would be greatly beneficial.
* The login nodes are for development and quick testing:
  * It is **strictly forbidden to run production jobs** on the login nodes.
  * It is **forbidden to run long processes** occupying a big part of the resources.
  * *Any processes misbehaving according to these rules will be killed.*
* All production jobs should be submitted using the batch system.
* Make sure that no broken or run-away processes are left when your job is done. Keep the process space clean on all nodes.
* Remove files you do not need any more (e.g. core dumps, temporary files) as early as possible. Keep the disk space clean on all nodes.

The system administrator has the right to block access to Merlin6 for an account violating the Code of Conduct, in which case the issue will be escalated to the user's supervisor.

---

## Merlin6 Access

### HowTo: Request Access to Merlin6

* PSI users with their Linux accounts belonging to the *svc-cluster_merlin6* group are allowed to use Merlin6.

* Registration for Merlin6 access must be done through [PSI Service Now](https://psi.service-now.com/psisp)
  * Please open it as an Incident request, with subject: ``[Merlin6] Access Request for user '<username>'``

### HowTo: Access to Merlin6

Use SSH to access the login nodes:
* <tt><b>merlin-l-01.psi.ch</b></tt> ("merlin" '-' 'el' '-' 'zero' 'one')
* <tt><b>merlin-l-02.psi.ch</b></tt> ("merlin" '-' 'el' '-' 'zero' 'two')

Examples:
<pre>
ssh -Y merlin-l-01
ssh -Y bond_j@merlin-l-02.psi.ch
</pre>

<!--
### Home and Data Directories

Default quota for the home directory */gpfs/home/$USER* is 10GB.
Until a service for an automatic backup of the home directories is announced
to be in production, the users are responsible for managing the backups
of their home directories.

The data directories */gpfs/data/$USER* have much larger quotas per user (default is 1TB, extendible on request) than the home directories,
but there will be no automatic backup of the data directories.
The users are fully responsible for backup and restore operations
in the data directories.

Command to see your quota on merlin5:
<pre>
/usr/lpp/mmfs/bin/mmlsquota -u $USER --block-size auto merlin5
</pre>

### Scratch disk and Temporary Files

A */scratch* partition of ~50GB is available on each computing node. This partition should be the one used by the users for creating temporary files and/or
directories that are needed by running jobs. Temporary files *must be deleted at the end of the job*.

Example of how to use the */scratch* disk:

<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=4            # you request 4 nodes

<b># Create scratch directory</b>
<i>SCRATCHDIR="/scratch/$(id -un)/${SLURM_JOB_ID}"
mkdir -p ${SCRATCHDIR}</i>
...
<b># Core code, generating temporary files in $SCRATCHDIR</b>
...
<b># Copy final results (whenever needed)</b>
<i>mkdir /gpfs/home/$(id -un)/${SLURM_JOB_ID}</i>
<i>cp -pr /scratch/$(id -un)/${SLURM_JOB_ID}/my_results /gpfs/home/$(id -un)/${SLURM_JOB_ID}</i>
<b># Cleanup temporary data and directories</b>
<i>rm -rf /scratch/$(id -un)/${SLURM_JOB_ID}</i>
<i>rmdir /scratch/$(id -un)</i>
</pre>

### Using Batch System to Submit Jobs to Merlin5

The Slurm Workload Manager is used on Merlin5 to manage and schedule jobs.
Please see "man slurm" and references therein for more details.
There are many tutorials and howtos on Slurm elsewhere, e.g. at CSCS.
We shall provide some typical examples for submitting different types of jobs.

Useful slurm commands:
<pre>
sinfo            # to see the name of nodes, their occupancy, name of slurm partitions, limits (try out with "-l" option)
squeue           # to see the currently running/waiting jobs in slurm (additional "-l" option may also be useful)
sbatch Script.sh # to submit a script (example below) to the slurm
scancel job_id   # to cancel a slurm job, job id is the numeric id seen by squeue
</pre>

Other advanced commands:
<pre>
sinfo -N -l      # list nodes, state, resources (number of CPUs, memory per node, etc.), and other information
sshare -a        # to list shares of associations to a cluster
sprio -l         # to view the factors that comprise a job's scheduling priority (add -u <username> for filtering user)
</pre>

#### Simple slurm test script (copy-paste the following example into a file Script.sh):
<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=4            # you request 4 nodes

hostname                     # will print one name, since executed on one node
echo
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17
mpirun hostname              # will be executed on all 4 nodes (see above --nodes)
echo
sleep 60                     # useless work occupying 4 merlin nodes
module list
</pre>

Submit the job to slurm and check its status:
<pre>
sbatch Script.sh # submit this job to slurm
squeue           # check its status
</pre>

#### Advanced slurm test script (copy-paste the following example into a file Script.sh):
<pre>
#!/bin/bash
#SBATCH --partition=merlin   # name of slurm partition to submit
#SBATCH --time=2:00:00       # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=2            # number of nodes
#SBATCH --ntasks=24          # number of tasks

hostname                     # will print one name, since executed on one node
echo
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17
mpirun hostname              # will be executed on all allocated nodes (see above --nodes)
echo
sleep 60                     # useless work occupying the allocated merlin nodes
module list
</pre>

In the above example the options *--nodes=2* and *--ntasks=24* are specified. This means that 2 nodes are requested,
and the job is expected to run 24 tasks. Hence, 24 cores are needed for running that job. Slurm will try to allocate 2 nodes
with similar resources, having at least 12 cores/node.

Usually, 2 nodes with 12 cores/node would fit the allocation decision. However, other combinations may be possible
(i.e. 2 nodes with 16 cores/node). In this second case, it could happen that other users are running jobs on the allocated
nodes (in this example, up to 4 cores per node could be used by other users' jobs, still leaving at least 12 cores per node
available, which is the minimum number of tasks/cores required by our job).

In order to ensure exclusivity of the node, the option *--exclusive* can be used (see below). This will ensure that
the requested nodes are exclusive for the job (no other user's jobs will interact with these nodes, and only completely
free nodes will be allocated).

<pre>
#SBATCH --exclusive
</pre>

More advanced configurations can be defined and combined with the previous examples. More information about advanced
options can be found at the following link: https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
advanced configurations unless you are sure of what you are doing.

### Environment Modules

On top of the operating system stack we provide different software using the PSI-developed
pmodule system. Useful commands:
<pre>
module avail                                     # to see the list of available software provided via pmodules
module load gnuplot/5.2.0                        # to load a specific version of the gnuplot package
module search hdf                                # try it out to see which versions of the hdf5 package are provided and with which dependencies
module load gcc/6.2.0 openmpi/1.10.2 hdf5/1.8.17 # load the specific version of hdf5, compiled with specific versions of gcc and openmpi
module use unstable                              # to get access to packages which are not yet considered fully stable by the module provider (may be a very fresh version, or not yet tested by the community)
module list                                      # to see which software is loaded in your environment
</pre>

#### Requests for New Software

If you miss some package/version, contact us.

### Known Problems and Troubleshooting

#### Paraview, ANSYS and openGL

Try to use the X11 (mesa) driver for paraview and ANSYS instead of OpenGL:
<pre>
module load ANSYS
fluent -driver x11
</pre>

<pre>
module load paraview
paraview --mesa
</pre>

#### Illegal instructions

It may happen that code compiled on one machine will not run on another, throwing an exception like "(Illegal instruction)".
Check (with the "hostname" command) which node you are on and compare it with the names listed in the first item. We have observed a few applications
that cannot run on merlin-c-01..16 because of this problem (note that these machines are more than 5 years old). Hint: you may
choose a particular flavour of machine for your slurm job; check the "--cores-per-socket" option of sbatch:
<pre>
sbatch --cores-per-socket=8 Script.sh # will filter the selection of the machines and exclude the oldest ones, merlin-c-01..16
</pre>

#### Troubleshooting SSH

Use the ssh command with the "-vvv" option and copy and paste (no screenshots, please)
the output into your request in Service-Now. Example:

<pre>
ssh -Y -vvv bond_j@merlin-l-01
</pre>

#### Troubleshooting SLURM

If one copies Slurm commands or batch scripts from another cluster,
they may need some changes (often minor) to run successfully on Merlin5.
Examine the error message carefully, especially concerning the options
used in the slurm commands.

Try to submit jobs using the examples given in the section "Using Batch System to Submit Jobs to Merlin5".
If you can successfully run an example for a type of job (OpenMP, MPI) similar to yours,
try to edit the example to run your application.

If the problem remains, then, in your request in Service-Now, describe the problem with the details
needed to reproduce it. Include the output of the following commands:

<pre>
date
hostname
pwd
module list
# All slurm commands used with the corresponding output
</pre>

Do not delete any output and error files generated by Slurm.
Make a copy of the failed job script if you would like to edit it in the meantime.
-->