Doc changes

This commit is contained in:
2021-05-21 12:34:19 +02:00
parent 42d8f38934
commit fcfdbf1344
46 changed files with 447 additions and 528 deletions


@ -0,0 +1,235 @@
---
title: Running Interactive Jobs
#tags:
keywords: interactive, X11, X, srun
last_updated: 23 January 2020
summary: "This document describes how to run interactive jobs as well as X based software."
sidebar: merlin6_sidebar
permalink: /merlin6/interactive-jobs.html
---
## Running interactive jobs
Slurm provides two commands for running interactive jobs, ``salloc`` and ``srun``:
* **``salloc``**: obtains a Slurm job allocation (a set of nodes), executes the given command(s), and releases the allocation when the command finishes.
* **``srun``**: runs parallel tasks.
### srun
``srun`` is used to run parallel jobs in the batch system. It can be used within a batch script
(submitted with ``sbatch``) or within a job allocation (obtained with ``salloc``).
It can also be used as a direct command (for example, from the login nodes).
When used inside a batch script or a job allocation, ``srun`` is constrained to the
resources allocated by the ``sbatch``/``salloc`` commands. With ``sbatch``, these resources
are usually defined inside the batch script in the format ``#SBATCH <option>=<value>``.
In other words, if your batch script or allocation defines 88 tasks (with 1 thread per core)
on 2 nodes, ``srun`` is constrained to that amount of resources (you can use less, but never
exceed those limits).
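
As a minimal sketch of this constraint, the batch script below requests 88 tasks on 2 nodes and then launches job steps of different sizes. The `step_size` helper is hypothetical (it is not part of Slurm) and only caps a requested task count at the allocation size; the fallback value of 88 is an assumption that lets the script be dry-run outside Slurm.

```bash
#!/bin/bash
#SBATCH --clusters=merlin6
#SBATCH --nodes=2
#SBATCH --ntasks=88
#SBATCH --hint=nomultithread   # do not use the extra hardware threads

# Slurm exports the allocation size as SLURM_NTASKS inside the job.
# The fallback (88) is an assumption that only allows a dry run outside Slurm.
allocated_tasks="${SLURM_NTASKS:-88}"

# Hypothetical helper: cap a requested step size at the allocation size.
step_size() {
    local requested="$1"
    if (( requested > allocated_tasks )); then
        echo "$allocated_tasks"
    else
        echo "$requested"
    fi
}

# Launch job steps only when srun is available (i.e. on the cluster).
if command -v srun >/dev/null 2>&1; then
    srun --ntasks="$(step_size 88)" hostname   # uses the whole allocation
    srun --ntasks="$(step_size 44)" hostname   # uses fewer tasks: allowed
fi
```

Capping the value in the script is only one possible design; without it, a job step that asks for more tasks than the allocation provides is expected to be rejected by Slurm.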
When used from the login node, ``srun`` usually runs a specific command or software
interactively. ``srun`` is a blocking process (it blocks the bash prompt until the ``srun``
command finishes, unless you run it in the background with ``&``). This is very useful for
interactive software that pops up a window and then submits jobs or runs sub-tasks in the
background (for example, **Relion**, **cisTEM**, etc.).
Refer to ``man srun`` to explore all the options available for that command.
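
The blocking behaviour can be sketched as follows. The `launch` wrapper is hypothetical: it calls `srun` when the command exists and otherwise runs the command locally, so that the snippet can also be tried on a machine without Slurm.

```bash
# Hypothetical wrapper: use srun when available, otherwise run locally.
launch() {
    if command -v srun >/dev/null 2>&1; then
        srun --clusters=merlin6 --ntasks=1 "$@"
    else
        "$@"
    fi
}

launch sleep 1 &                 # '&' returns the prompt immediately
job_pid=$!
echo "prompt is free while the command runs"
wait "$job_pid"                  # block only when the result is needed
echo "command finished with status $?"
```

Without the trailing `&`, the `echo` would only run after the launched command had finished.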
<details>
<summary>[Show 'srun' example]: Running 'hostname' command on 3 nodes, using 2 cores (1 task/core) per node</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --ntasks=6 --ntasks-per-node=2 --nodes=3 hostname
srun: job 135088230 queued and waiting for resources
srun: job 135088230 has been allocated resources
merlin-c-102.psi.ch
merlin-c-102.psi.ch
merlin-c-101.psi.ch
merlin-c-101.psi.ch
merlin-c-103.psi.ch
merlin-c-103.psi.ch
</pre>
</details>
### salloc
**``salloc``** is used to obtain a Slurm job allocation (a set of nodes). Once the job is allocated,
users can execute interactive command(s). Once finished (``exit`` or ``Ctrl+D``),
the allocation is released. **``salloc``** is a blocking command, that is, it blocks
until the requested resources are allocated.
When running **``salloc``**, once the resources are allocated, *by default* the user gets
a ***new shell on one of the allocated resources*** (if several nodes were requested, the
new shell opens on the first allocated node). However, this behaviour can be changed by adding
a shell (`$SHELL`) at the end of the `salloc` command. For example:
```bash
# Typical 'salloc' call
# - Same as running:
# 'salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL'
salloc --clusters=merlin6 -N 2 -n 2
# Custom 'salloc' call
# - $SHELL will open a local shell on the login node from which 'salloc' was run
salloc --clusters=merlin6 -N 2 -n 2 $SHELL
```
<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>Default</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2
salloc: Pending job allocation 135171306
salloc: job 135171306 queued and waiting for resources
salloc: job 135171306 has been allocated resources
salloc: Granted job allocation 135171306
(base) [caubet_m@merlin-c-213 ~]$ srun hostname
merlin-c-213.psi.ch
merlin-c-214.psi.ch
(base) [caubet_m@merlin-c-213 ~]$ exit
exit
salloc: Relinquishing job allocation 135171306
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL
salloc: Pending job allocation 135171342
salloc: job 135171342 queued and waiting for resources
salloc: job 135171342 has been allocated resources
salloc: Granted job allocation 135171342
(base) [caubet_m@merlin-c-021 ~]$ srun hostname
merlin-c-021.psi.ch
merlin-c-022.psi.ch
(base) [caubet_m@merlin-c-021 ~]$ exit
exit
salloc: Relinquishing job allocation 135171342
</pre>
</details>
<details>
<summary>[Show 'salloc' example]: Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - <i>$SHELL</i></summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-export-01 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2 $SHELL
salloc: Pending job allocation 135171308
salloc: job 135171308 queued and waiting for resources
salloc: job 135171308 has been allocated resources
salloc: Granted job allocation 135171308
(base) [caubet_m@merlin-export-01 ~]$ srun hostname
merlin-c-218.psi.ch
merlin-c-117.psi.ch
(base) [caubet_m@merlin-export-01 ~]$ exit
exit
salloc: Relinquishing job allocation 135171308
</pre>
</details>
## Running interactive jobs with X11 support
### Requirements
#### Graphical access
[NoMachine](/merlin6/nomachine.html) is the officially supported service for graphical
access to the Merlin cluster. This service runs on the login nodes. See
[{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for details about
how to connect to the **NoMachine** service in the Merlin cluster.
For other graphical access methods that are not officially supported (X11 forwarding):
* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](/merlin6/connect-from-macos.html)
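
Whichever access method is used, X11 applications need a display to draw on. A quick sanity check before submitting an X11 job might look like the sketch below; `has_display` is a hypothetical helper, and it relies only on the standard `DISPLAY` environment variable.

```bash
# Hypothetical helper: succeed only when an X11 display is advertised.
# Takes the display string as an optional argument (default: $DISPLAY).
has_display() {
    local display="${1-${DISPLAY:-}}"
    if [ -z "$display" ]; then
        echo "No X11 display found: connect through NoMachine or enable X11 forwarding" >&2
        return 1
    fi
    echo "Using display $display"
}
```

It can then guard an interactive launch, for instance `has_display && srun --clusters=merlin6 --x11 xclock`.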
### 'srun' with x11 support
The Merlin5 and Merlin6 clusters allow running any window-based (X11) application. To do so,
add the option ``--x11`` to the ``srun`` command. For example:
```bash
srun --clusters=merlin6 --x11 xclock
```
This pops up an X11-based clock.
In the same manner, you can create a bash shell with X11 support. To do so, add
the option ``--pty`` to the ``srun --x11`` command. Once the resources are allocated, you
can interactively run X11 and non-X11 based commands from there.
```bash
srun --clusters=merlin6 --x11 --pty bash
```
<details>
<summary>[Show 'srun' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 xclock
srun: job 135095591 queued and waiting for resources
srun: job 135095591 has been allocated resources
(base) [caubet_m@merlin-l-001 ~]$
(base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 --pty bash
srun: job 135095592 queued and waiting for resources
srun: job 135095592 has been allocated resources
(base) [caubet_m@merlin-c-205 ~]$ xclock
(base) [caubet_m@merlin-c-205 ~]$ echo "This was an example"
This was an example
(base) [caubet_m@merlin-c-205 ~]$ exit
exit
</pre>
</details>
### 'salloc' with x11 support
The **Merlin5** and **Merlin6** clusters allow running any window-based (X11) application. To do so,
add the option ``--x11`` to the ``salloc`` command. For example:
```bash
salloc --clusters=merlin6 --x11 xclock
```
This pops up an X11-based clock.
In the same manner, you can create a bash shell with X11 support. To do so, just
run ``salloc --clusters=merlin6 --x11``. Once the resources are allocated, you
can interactively run X11 and non-X11 based commands from there.
```bash
salloc --clusters=merlin6 --x11
```
<details>
<summary>[Show 'salloc' with X11 support examples]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11 xclock
salloc: Pending job allocation 135171355
salloc: job 135171355 queued and waiting for resources
salloc: job 135171355 has been allocated resources
salloc: Granted job allocation 135171355
salloc: Relinquishing job allocation 135171355
(base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11
salloc: Pending job allocation 135171349
salloc: job 135171349 queued and waiting for resources
salloc: job 135171349 has been allocated resources
salloc: Granted job allocation 135171349
salloc: Waiting for resource configuration
salloc: Nodes merlin-c-117 are ready for job
(base) [caubet_m@merlin-c-117 ~]$ xclock
(base) [caubet_m@merlin-c-117 ~]$ echo "This was an example"
This was an example
(base) [caubet_m@merlin-c-117 ~]$ exit
exit
salloc: Relinquishing job allocation 135171349
</pre>
</details>


@ -0,0 +1,229 @@
---
title: Monitoring
#tags:
#keywords:
last_updated: 20 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/monitoring.html
---
## Slurm Monitoring
### Job status
The status of submitted jobs can be checked with the ``squeue`` command:
```bash
squeue -u $username
```
Common statuses:
* **merlin-\***: Running on the specified host
* **(Priority)**: Waiting in the queue
* **(Resources)**: At the head of the queue, waiting for machines to become available
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: Job would exceed per-user limitations on
the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
resubmit with fewer resources, or else wait for your other jobs to finish.
* **(PartitionNodeLimit)**: Exceeds all resources available on this partition.
Run `scancel` and resubmit to a different partition (`-p`) or with fewer
resources.
Check in the **man** pages (``man squeue``) for all possible options for this command.
<details>
<summary>[Show 'squeue' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
134332544 general spawner- feichtin R 5-06:47:45 1 merlin-c-204
134321376 general subm-tal feichtin R 5-22:27:59 1 merlin-c-204
</pre>
</details>
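
The common statuses listed above can be turned into a small lookup, sketched below. `explain_reason` is a hypothetical helper (not part of Slurm) that maps a pending reason to the advice given in this section; the commented pipeline assumes the `squeue` format fields `%i` (job ID) and `%r` (reason).

```bash
# Translate a pending reason reported by squeue into the advice listed above.
# Hypothetical helper: not part of Slurm.
explain_reason() {
    case "$1" in
        Priority)
            echo "waiting in the queue" ;;
        Resources)
            echo "at the head of the queue, waiting for machines to become available" ;;
        AssocGrpCpuLimit|AssocGrpNodeLimit)
            echo "per-user CPU/node limit reached: wait for other jobs or resubmit with fewer resources" ;;
        PartitionNodeLimit)
            echo "exceeds the partition resources: resubmit to another partition or with fewer resources" ;;
        *)
            echo "no advice available for reason '$1'" ;;
    esac
}

# On the cluster, squeue can print only the job ID (%i) and the reason (%r):
#   squeue -u "$USER" --noheader --format="%i %r" |
#       while read -r jobid reason; do
#           echo "$jobid: $(explain_reason "$reason")"
#       done
explain_reason Resources
```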
### Partition status
The status of the nodes and partitions (a.k.a. queues) can be seen with the ``sinfo`` command:
```bash
sinfo
```
Check in the **man** pages (``man sinfo``) for all possible options for this command.
<details>
<summary>[Show 'sinfo' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-l-001 ~]# sinfo -l
Thu Jan 23 16:34:49 2020
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
test up 1-00:00:00 1-infinite no NO all 3 mixed merlin-c-[024,223-224]
test up 1-00:00:00 1-infinite no NO all 2 allocated merlin-c-[123-124]
test up 1-00:00:00 1-infinite no NO all 1 idle merlin-c-023
general* up 7-00:00:00 1-50 no NO all 6 mixed merlin-c-[007,204,207-209,219]
general* up 7-00:00:00 1-50 no NO all 57 allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
general* up 7-00:00:00 1-50 no NO all 3 idle merlin-c-[006,021-022]
daily up 1-00:00:00 1-60 no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
daily up 1-00:00:00 1-60 no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
daily up 1-00:00:00 1-60 no NO all 4 idle merlin-c-[006,021-023]
hourly up 1:00:00 1-infinite no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
hourly up 1:00:00 1-infinite no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
hourly up 1:00:00 1-infinite no NO all 4 idle merlin-c-[006,021-023]
gpu up 7-00:00:00 1-infinite no NO all 1 mixed merlin-g-007
gpu up 7-00:00:00 1-infinite no NO all 8 allocated merlin-g-[001-006,008-009]
</pre>
</details>
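
The long listing above can also be post-processed. The sketch below sums the idle nodes of one partition; `idle_nodes` is a hypothetical helper and it assumes the column layout of the `sinfo -l` example (node count in column 8, state in column 9).

```bash
# Hypothetical helper: sum the idle nodes of one partition.
# Reads 'sinfo -l' style output on stdin; assumes NODES is column 8 and STATE column 9.
idle_nodes() {
    local partition="$1"
    awk -v part="$partition" '
        { name = $1; sub(/\*$/, "", name) }          # strip the default-partition marker
        name == part && $9 == "idle" { total += $8 }
        END { print total + 0 }                      # print 0 when nothing matched
    '
}

# Usage on the cluster (not run here):
#   sinfo -l | idle_nodes daily
```

Parsing by column position is fragile if the output format changes; selecting explicit fields with the `sinfo` format option would be a more robust alternative.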
### Job efficiency
Users can check how efficiently their jobs use the allocated resources with the ``seff`` command:
```bash
seff $jobid
```
<details>
<summary>[Show 'seff' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-slurmctld01 ~]# seff 134333893
Job ID: 134333893
Cluster: merlin6
User/Group: albajacas_a/unx-sls
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:26:15
CPU Efficiency: 49.47% of 00:53:04 core-walltime
Job Wall-clock time: 00:06:38
Memory Utilized: 60.73 MB
Memory Efficiency: 0.19% of 31.25 GB
</pre>
</details>
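
The CPU efficiency in the example above is the CPU time used divided by the core-walltime (cores multiplied by wall-clock time): 8 cores × 00:06:38 (398 s) = 3184 s = 00:53:04, and 1575 s / 3184 s ≈ 49.47%. The sketch below reproduces that arithmetic; `to_seconds` and `cpu_efficiency` are hypothetical helper names, not Slurm tools.

```bash
# Convert a [D-]HH:MM:SS time string into seconds.
to_seconds() {
    local t="$1" days=0 h m s
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"
        t="${t#*-}"
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so that values such as "08" are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# CPU efficiency (%) = CPU time used / (cores * wall-clock time) * 100
cpu_efficiency() {
    local cpu_used cores wall
    cpu_used="$(to_seconds "$1")"
    cores="$2"
    wall="$(to_seconds "$3")"
    awk -v c="$cpu_used" -v n="$cores" -v w="$wall" \
        'BEGIN { printf "%.2f\n", 100 * c / (n * w) }'
}

cpu_efficiency 00:26:15 8 00:06:38   # values from the 'seff' example; prints 49.47
```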
### List job attributes
The ``sjstat`` command displays statistics of the jobs under the control of Slurm. To use it, run:
```bash
sjstat
```
<details>
<summary>[Show 'sjstat' example]</summary>
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
[root@merlin-l-001 ~]# sjstat -v
Scheduling pool data:
----------------------------------------------------------------------------------
Total Usable Free Node Time Other
Pool Memory Cpus Nodes Nodes Nodes Limit Limit traits
----------------------------------------------------------------------------------
test 373502Mb 88 6 6 1 UNLIM 1-00:00:00
general* 373502Mb 88 66 66 8 50 7-00:00:00
daily 373502Mb 88 72 72 9 60 1-00:00:00
hourly 373502Mb 88 72 72 9 UNLIM 01:00:00
gpu 128000Mb 8 1 1 0 UNLIM 7-00:00:00
gpu 128000Mb 20 8 8 0 UNLIM 7-00:00:00
Running job data:
---------------------------------------------------------------------------------------------------
Time Time Time
JobID User Procs Pool Status Used Limit Started Master/Other
---------------------------------------------------------------------------------------------------
13433377 collu_g 1 gpu PD 0:00 24:00:00 N/A (Resources)
13433389 collu_g 20 gpu PD 0:00 24:00:00 N/A (Resources)
13433382 jaervine 4 gpu PD 0:00 24:00:00 N/A (Priority)
13433386 barret_d 20 gpu PD 0:00 24:00:00 N/A (Priority)
13433382 pamula_f 20 gpu PD 0:00 168:00:00 N/A (Priority)
13433387 pamula_f 4 gpu PD 0:00 24:00:00 N/A (Priority)
13433365 andreani 132 daily PD 0:00 24:00:00 N/A (Dependency)
13433388 marino_j 6 gpu R 1:43:12 168:00:00 01-23T14:54:57 merlin-g-007
13433377 choi_s 40 gpu R 2:09:55 48:00:00 01-23T14:28:14 merlin-g-006
13433373 qi_c 20 gpu R 7:00:04 24:00:00 01-23T09:38:05 merlin-g-004
13433390 jaervine 2 gpu R 5:18 24:00:00 01-23T16:32:51 merlin-g-007
13433390 jaervine 2 gpu R 15:18 24:00:00 01-23T16:22:51 merlin-g-007
13433375 bellotti 4 gpu R 7:35:44 9:00:00 01-23T09:02:25 merlin-g-001
13433358 bellotti 1 gpu R 1-05:52:19 144:00:00 01-22T10:45:50 merlin-g-007
13433377 lavriha_ 20 gpu R 5:13:24 24:00:00 01-23T11:24:45 merlin-g-008
13433370 lavriha_ 40 gpu R 22:43:09 24:00:00 01-22T17:55:00 merlin-g-003
13433373 qi_c 20 gpu R 15:03:15 24:00:00 01-23T01:34:54 merlin-g-002
13433371 qi_c 4 gpu R 22:14:14 168:00:00 01-22T18:23:55 merlin-g-001
13433254 feichtin 2 general R 5-07:26:11 156:00:00 01-18T09:11:58 merlin-c-204
13432137 feichtin 2 general R 5-23:06:25 160:00:00 01-17T17:31:44 merlin-c-204
13433389 albajaca 32 hourly R 41:19 1:00:00 01-23T15:56:50 merlin-c-219
13433387 riemann_ 2 general R 1:51:47 4:00:00 01-23T14:46:22 merlin-c-204
13433370 jimenez_ 2 general R 23:20:45 168:00:00 01-22T17:17:24 merlin-c-106
13433381 jimenez_ 2 general R 4:55:33 168:00:00 01-23T11:42:36 merlin-c-219
13433390 sayed_m 128 daily R 21:49 10:00:00 01-23T16:16:20 merlin-c-223
13433359 adelmann 2 general R 1-05:00:09 48:00:00 01-22T11:38:00 merlin-c-204
13433377 zimmerma 2 daily R 6:13:38 24:00:00 01-23T10:24:31 merlin-c-007
13433375 zohdirad 24 daily R 7:33:16 10:00:00 01-23T09:04:53 merlin-c-218
13433363 zimmerma 6 general R 1-02:54:20 47:50:00 01-22T13:43:49 merlin-c-106
13433376 zimmerma 6 general R 7:25:42 23:50:00 01-23T09:12:27 merlin-c-007
13433371 vazquez_ 16 daily R 21:46:31 23:59:00 01-22T18:51:38 merlin-c-106
13433382 vazquez_ 16 daily R 4:09:23 23:59:00 01-23T12:28:46 merlin-c-024
13433376 jiang_j1 440 daily R 7:11:14 10:00:00 01-23T09:26:55 merlin-c-123
13433376 jiang_j1 24 daily R 7:08:19 10:00:00 01-23T09:29:50 merlin-c-220
13433384 kranjcev 440 daily R 2:48:19 24:00:00 01-23T13:49:50 merlin-c-108
13433371 vazquez_ 16 general R 20:15:15 120:00:00 01-22T20:22:54 merlin-c-210
13433371 vazquez_ 16 general R 21:15:51 120:00:00 01-22T19:22:18 merlin-c-210
13433374 colonna_ 176 daily R 8:23:18 24:00:00 01-23T08:14:51 merlin-c-211
13433374 bures_l 88 daily R 10:45:06 24:00:00 01-23T05:53:03 merlin-c-001
13433375 derlet 88 daily R 7:32:05 24:00:00 01-23T09:06:04 merlin-c-107
13433373 derlet 88 daily R 17:21:57 24:00:00 01-22T23:16:12 merlin-c-002
13433373 derlet 88 daily R 18:13:05 24:00:00 01-22T22:25:04 merlin-c-112
13433365 andreani 264 daily R 4:10:08 24:00:00 01-23T12:28:01 merlin-c-003
13431187 mahrous_ 88 general R 6-15:59:16 168:00:00 01-17T00:38:53 merlin-c-111
13433387 kranjcev 2 general R 1:48:47 4:00:00 01-23T14:49:22 merlin-c-204
13433368 karalis_ 352 general R 1-00:05:22 96:00:00 01-22T16:32:47 merlin-c-013
13433367 karalis_ 352 general R 1-00:06:44 96:00:00 01-22T16:31:25 merlin-c-118
13433385 karalis_ 352 general R 1:37:24 96:00:00 01-23T15:00:45 merlin-c-213
13433374 sato 256 general R 14:55:55 24:00:00 01-23T01:42:14 merlin-c-204
13433374 sato 64 general R 10:43:35 24:00:00 01-23T05:54:34 merlin-c-106
67723568 sato 32 general R 10:40:07 24:00:00 01-23T05:58:02 merlin-c-007
13433265 khanppna 440 general R 3-18:20:58 168:00:00 01-19T22:17:11 merlin-c-008
13433375 khanppna 704 general R 7:31:24 24:00:00 01-23T09:06:45 merlin-c-101
13433371 khanppna 616 general R 21:40:33 24:00:00 01-22T18:57:36 merlin-c-208
</pre>
</details>
### Graphical user interface
When connecting through **ssh** with X11 forwarding (``ssh -XY``), users can run ``sview``, a graphical user
interface to view and modify the Slurm state. To run **sview**:
```bash
ssh -XY $username@merlin-l-001.psi.ch
sview
```
!['sview' graphical user interface]({{ "/images/Slurm/sview.png" }})
## General Monitoring
The following pages provide basic monitoring for Slurm and the computing nodes.
Currently, monitoring is based on Grafana + InfluxDB. In the future it will
be moved to a different service based on ElasticSearch + LogStash + Kibana.
In the meantime, the following monitoring pages are available on a best-effort
basis:
### Merlin6 Monitoring Pages
* Slurm monitoring:
  * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
  * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
  * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
  * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
  * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)
### Merlin5 Monitoring Pages
* Slurm monitoring:
  * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
  * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
  * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)


@ -0,0 +1,284 @@
---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun
last_updated: 23 January 2020
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---
## The rules
Before you start using the cluster, please read the following rules:
1. To ease and improve *scheduling* and *backfilling*, always try to **estimate and define a proper run time** for your jobs:
   * Use ``--time=<D-HH:MM:SS>`` for that.
   * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs to run within **one day** at most. Please consider the following:
   * Some software can simply scale up by using more nodes, drastically reducing the run time.
   * Some software can save a specific state so that a second job can start from that state: ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)*** can help you with that.
   * Jobs submitted to **`hourly`** get higher priority than jobs submitted to **`daily`**: always use **`hourly`** for jobs shorter than 1 hour.
   * Jobs submitted to **`daily`** get higher priority than jobs submitted to **`general`**: always use **`daily`** for jobs shorter than 1 day.
3. It is **forbidden** to run **very short jobs**: they cause a lot of overhead and can also cause severe problems for the main scheduler.
   * ***Question:*** Is my job a very short job? ***Answer:*** If it lasts a few seconds or only a very few minutes, yes.
   * ***Question:*** How long should my job run? ***Answer:*** As a *rule of thumb*, jobs of 5 minutes start being acceptable, and 15 minutes or more is preferred.
   * Use ***[Packed Jobs](/merlin6/running-jobs.html#packed-jobs-running-a-large-number-of-short-tasks)*** for running a large number of short tasks.
4. Do not submit hundreds of similar jobs!
   * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** to group such jobs instead.
{{site.data.alerts.tip}}Having a good estimate of the <i>time</i> needed by your jobs, choosing a proper way of running them, and optimizing the jobs to <i>run within one day</i> will help make the use of the system fair and efficient.
{{site.data.alerts.end}}
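
The partition rules above can be sketched as a small helper. `pick_partition` is a hypothetical function, and the thresholds (60 minutes for `hourly`, 24 hours for `daily`) are assumptions derived from the rules above; check the cluster documentation for the actual partition limits.

```bash
# Hypothetical helper: suggest a Merlin6 partition for an estimated run time in minutes.
# Thresholds are assumptions based on the rules above (1 hour, 1 day).
pick_partition() {
    local minutes="$1"
    if (( minutes <= 60 )); then
        echo "hourly"
    elif (( minutes <= 24 * 60 )); then
        echo "daily"
    else
        echo "general"
    fi
}

pick_partition 45     # prints hourly
pick_partition 600    # prints daily
pick_partition 3000   # prints general
```

The result could be passed on submission, for example `sbatch --partition="$(pick_partition 45)" --time=00:45:00 job.sh`, where `job.sh` is a placeholder for your own batch script.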
## Basic commands for running batch scripts
* Use **``sbatch``** for submitting a batch script to Slurm.
* Use **``srun``** for running parallel tasks.
* Use **``squeue``** for checking jobs status.
* Use **``scancel``** for cancelling/deleting a job from the queue.
{{site.data.alerts.tip}}Use the Linux <b>'man'</b> pages when needed (e.g. <span style="color:orange;">'man sbatch'</span>), mostly for checking the available options for the above commands.
{{site.data.alerts.end}}
## Basic settings
For a complete list of the available options and parameters, it is recommended to use the **man pages** (e.g. ``man sbatch``, ``man srun``, ``man salloc``).
Please note that the behaviour of some parameters might change depending on the command used to run jobs (for example, ``--exclusive`` behaves differently in ``sbatch`` than in ``srun``).
This chapter shows the basic parameters that are usually needed in the Merlin cluster.
### Common settings
The following settings are the minimum required for running a job on the Merlin CPU and GPU nodes. Please consider taking a look at the **man pages** (e.g. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us with any questions.
* **Clusters:** For running jobs in the different Slurm clusters, users should add the following option:
```bash
#SBATCH --clusters=<cluster_name> # Possible values: merlin5, merlin6, gmerlin6
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.
* **Partitions:** except when using the *default* partition for each cluster, one needs to specify the partition:
```bash
#SBATCH --partition=<partition_name> # Check each cluster documentation for possible values
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.
* **[Optional] Disabling shared nodes**: by default, nodes are not exclusive. Hence, multiple users can run in the same node. One can request exclusive node usage with the following option:
```bash
#SBATCH --exclusive # Only if you want a dedicated node
```
* **Time**: it is important to define a realistic run time for each job. This helps Slurm with *scheduling* and *backfilling*, and lets Slurm manage the job queues more efficiently. This value can never exceed the `MaxTime` of the affected partition.
```bash
#SBATCH --time=<D-HH:MM:SS> # Can not exceed the partition `MaxTime`
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about partition `MaxTime` values.
* **Output and error files**: by default, Slurm writes both the standard output and the standard error of a job to ``slurm-%j.out`` (where `%j` is the job ID) in the directory from which the job was submitted. Users can change the default file name, and send errors to a separate file, with the following options:
```bash
#SBATCH --output=<filename> # Can include path. Patterns accepted (i.e. %j)
#SBATCH --error=<filename> # Can include path. Patterns accepted (i.e. %j)
```
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) to get the full specification of the **filename patterns**.
* **Enable/Disable Hyper-Threading**: Whether a node has Hyper-Threading or not depends on the node configuration. By default, HT nodes have HT enabled, but one should specify it explicitly in the Slurm command as follows:
```bash
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
```
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about node configuration and Hyper-Threading.
Note that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to the `--hint` option. Please contact us if in doubt.
{{site.data.alerts.tip}} In general, for the cluster `merlin6`, <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a recommended field. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one needs to define how a task should be handled within a core, and this setting is generally not used in hybrid MPI/OpenMP jobs where multiple cores are needed for single tasks.
{{site.data.alerts.end}}
## Batch script templates
### CPU-based jobs templates
The following examples apply to the **Merlin6** cluster.
#### Non-multithreaded jobs template
The following template should be used by any user submitting non-multithreaded jobs to the Merlin6 CPU nodes:
```bash
#!/bin/bash
#SBATCH --clusters=merlin6 # Cluster name
#SBATCH --partition=general,daily,hourly # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --hint=nomultithread # Mandatory for non-multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=1 # Only mandatory for non-multithreaded single tasks
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=44 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=44 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=44 # Uncomment and specify the number of cores per task
```
#### Multithreaded jobs template
The following template should be used by any user submitting multithreaded jobs to the Merlin6 CPU nodes:
```bash
#!/bin/bash
#SBATCH --clusters=merlin6 # Cluster name
#SBATCH --partition=general,daily,hourly # Specify one or multiple partitions
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
#SBATCH --hint=multithread # Mandatory for multithreaded jobs
##SBATCH --exclusive # Uncomment if you need exclusive node usage
##SBATCH --ntasks-per-core=2 # Only mandatory for multithreaded single tasks
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify #nodes to use
##SBATCH --ntasks=88 # Uncomment and specify #tasks to use
##SBATCH --ntasks-per-node=88 # Uncomment and specify #tasks per node
##SBATCH --cpus-per-task=88 # Uncomment and specify the number of cores per task
```
### GPU-based jobs templates
The following template should be used by any user submitting jobs to GPU nodes:
```bash
#!/bin/bash
#SBATCH --clusters=gmerlin6 # Cluster name
#SBATCH --partition=gpu,gpu-short,gwendolen # Specify one or multiple partitions
#SBATCH --gpus="<type>:<num_gpus>" # <type> is optional, <num_gpus> is mandatory
#SBATCH --time=<D-HH:MM:SS> # Strongly recommended
#SBATCH --output=<output_file> # Generate custom output file
#SBATCH --error=<error_file> # Generate custom error file
##SBATCH --exclusive # Uncomment if you need exclusive node usage
##SBATCH --account=gwendolen_public # Uncomment if you need to use gwendolen
## Advanced options example
##SBATCH --nodes=1 # Uncomment and specify number of nodes to use
##SBATCH --ntasks=1 # Uncomment and specify number of tasks to use
##SBATCH --cpus-per-gpu=5 # Uncomment and specify the number of cores per GPU
##SBATCH --mem-per-gpu=16000 # Uncomment and specify the memory (in MB) per GPU
##SBATCH --gpus-per-node=<type>:2 # Uncomment and specify the number of GPUs per node
##SBATCH --gpus-per-socket=<type>:2 # Uncomment and specify the number of GPUs per socket
##SBATCH --gpus-per-task=<type>:1 # Uncomment and specify the number of GPUs per task
```
## Advanced configurations
### Array Jobs: launching a large number of related jobs
If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in the form of a **simple array job**.
``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8
echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```
This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID`, defined by Slurm inside the job's
environment, to feed the correct input arguments or configuration file
to the "myprogram" executable. Each job receives the same set of
configurations (e.g. the time limit of 8h in the example above).
The jobs are independent, but they will run in parallel (if the cluster resources allow
it). The jobs get JobIDs like {some-number}_1 to {some-number}_8, and each of them
has its own output file.
**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead of launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` defines
that at most 5 sub jobs may run in parallel at any time.
You can also use an array job approach to run over all files in a directory, substituting the payload with the lines below. Bash arrays are zero-indexed, so the array range should then start at 0 (for example, `--array=0-7` for 8 files):
``` bash
FILES=(/path/to/data/*)
srun ./myprogram ${FILES[$SLURM_ARRAY_TASK_ID]}
```
Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.
``` bash
# 7 values with indices 0 to 6: submit with --array=0-6
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun ./my_program.exe ${ARGS[$SLURM_ARRAY_TASK_ID]}
```
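The two payload snippets above index a Bash array with `SLURM_ARRAY_TASK_ID`. Bash arrays start at 0, so with a 1-based range such as `--array=1-8` the first element would be skipped. The following is a minimal sketch of one way to handle this, assuming a 1-based array range; `pick_input` is a hypothetical helper name, not part of Slurm.

``` bash
#!/bin/bash
# Hypothetical helper: map a 1-based array task ID to one entry of a list.
# Usage: pick_input <task_id> <item>...
pick_input() {
    local task_id=$1
    shift
    local items=("$@")
    local index=$((task_id - 1))      # Bash arrays start at 0
    if (( index < 0 || index >= ${#items[@]} )); then
        echo "task ID ${task_id} has no matching input" >&2
        return 1
    fi
    printf '%s\n' "${items[index]}"
}

# Inside the batch script (the path is a placeholder):
#   FILES=(/path/to/data/*)
#   srun ./myprogram "$(pick_input "$SLURM_ARRAY_TASK_ID" "${FILES[@]}")"
```

The bounds check makes a sub job fail early with a clear message when the array range is larger than the number of inputs.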
### Array jobs: running very long tasks with checkpoint files
If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:
``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00 # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
# There is a checkpoint file;
myprogram --read-checkp checkpointfile
else
# There is no checkpoint file, start a new simulation.
myprogram
fi
```
The `%1` in the `#SBATCH --array=1-10%1` statement defines that only one subjob can run at a time, so
subjob n+1 is only started when subjob n has finished. Each subjob will read the checkpoint file
if it is present.
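If the simulation finishes before all subjobs have run, the remaining subjobs would start the program again. The following is a minimal sketch of one way to avoid this, assuming the program exits with status 0 only when the whole simulation is complete; `run_chunk` and the `simulation.done` marker file are hypothetical names chosen for illustration.

``` bash
#!/bin/bash
# Hypothetical wrapper around one subjob of the checkpoint chain.
# Usage: run_chunk <program> [checkpoint_file] [done_marker]
run_chunk() {
    local prog=$1 checkpoint=${2:-checkpointfile} marker=${3:-simulation.done}
    if [ -e "$marker" ]; then
        echo "simulation already complete, nothing to do"
        return 0
    fi
    if [ -e "$checkpoint" ]; then
        # There is a checkpoint file: continue from it.
        "$prog" --read-checkp "$checkpoint" || return $?
    else
        # There is no checkpoint file: start a new simulation.
        "$prog" || return $?
    fi
    # Only reached when the program exited successfully before the time limit
    touch "$marker"
}
```

In the batch script above, the `if ... fi` block would then be replaced by a single `run_chunk myprogram` call.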
### Packed jobs: running a large number of short tasks
Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing instead, i.e. run the short tasks within a loop inside a single Slurm job.
You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch ensures that each task gets its own
dedicated CPUs, so only as many tasks run in parallel as fit into the job's allocation.
As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `myprog` program 1000 times with
arguments from 1 to 1000. With the `-N1 -n1 -c1 --exclusive` options,
at any point in time only 44
instances are effectively running, each being allocated one CPU. You
can allocate several CPUs or tasks per instance by adapting
the corresponding parameters.
``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44 # defines the number of parallel tasks
for i in {1..1000}
do
srun -N1 -n1 -c1 --exclusive ./myprog $i &
done
wait
```
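**Note:** The `&` at the end of the `srun` line is needed so that the script does not wait (block) on each task.
The `wait` command blocks until all such background tasks have finished. Without it, the batch script would exit while tasks are still running. Called without arguments, `wait` always returns 0, so it does not report failed tasks.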
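When the exit status of each short task matters, one option is to remember the process ID of every background task and wait for each one individually. The following is a minimal sketch using plain background commands so that it can be tried outside of a job; `run_all` is a hypothetical helper name, and in a batch script each command would be one `srun ... &` line.

``` bash
#!/bin/bash
# Hypothetical helper: run every argument as a background command,
# then wait for each one individually and print the number of failures.
run_all() {
    local pids=() failed=0 cmd pid
    for cmd in "$@"; do
        bash -c "$cmd" &          # in a batch script: srun -N1 -n1 -c1 --exclusive ... &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        # 'wait <pid>' returns the exit status of that background task
        wait "$pid" || failed=$((failed + 1))
    done
    echo "$failed"
}
```

For example, `run_all true false "exit 3"` reports two failed tasks.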

View File

@ -0,0 +1,62 @@
---
title: Slurm Basic Commands
#tags:
#keywords:
last_updated: 19 June 2019
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-basics.html
---
This document shows some basic commands for using Slurm. Advanced examples for some of these
are explained in other Merlin6 Slurm pages. You can always use the ```man <command>``` pages for more
information about options and examples.
## Basic commands
Useful Slurm commands:
```bash
sinfo # to see the names of the nodes, their occupancy,
# the names of the Slurm partitions and their limits (try the "-l" option)
squeue # to see the currently running/waiting jobs in Slurm
# (the additional "-l" option may also be useful)
sbatch Script.sh # to submit a script (example below) to Slurm.
srun <command> # to submit a command to Slurm. The same options as in 'sbatch' can be used.
salloc # to allocate computing nodes. Use it for interactive runs.
scancel job_id # to cancel a Slurm job; job_id is the numeric ID shown by squeue.
sview # X interface for managing jobs and tracking job run information.
seff job_id # calculates the efficiency of a job
sjstat # lists attributes of jobs under Slurm control
```
---
## Advanced basic commands:
```bash
sinfo -N -l # list nodes, state, resources (#CPUs, memory per node, ...), etc.
sshare -a # to list shares of associations to a cluster
sprio -l # to view the factors that comprise a job's scheduling priority
# add '-u <username>' for filtering user
```
## Show information for specific cluster
By default, any of the above commands shows information about the local cluster, which is **merlin6**.
If you want to see the same information for **merlin5** you have to add the parameter ``--clusters=merlin5``.
If you want to see both clusters at the same time, add the option ``--federation``.
Examples:
```bash
sinfo # 'sinfo' local cluster which is 'merlin6'
sinfo --clusters=merlin5 # 'sinfo' non-local cluster 'merlin5'
sinfo --federation # 'sinfo' all clusters which are 'merlin5' & 'merlin6'
squeue # 'squeue' local cluster which is 'merlin6'
squeue --clusters=merlin5 # 'squeue' non-local cluster 'merlin5'
squeue --federation # 'squeue' all clusters which are 'merlin5' & 'merlin6'
```
---

View File

@ -0,0 +1,203 @@
---
title: Slurm Configuration
#tags:
keywords: configuration, partitions, node definition
last_updated: 29 January 2021
summary: "This document describes a summary of the Merlin6 configuration."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-configuration.html
---
This documentation shows basic Slurm configuration and options needed to run jobs in the Merlin6 CPU cluster.
## Merlin6 CPU nodes definition
The following table shows the default and maximum resources that can be used per node:
| Nodes | Def.#CPUs | Max.#CPUs | #Threads | Def.Mem/CPU | Max.Mem/CPU | Max.Mem/Node | Max.Swap | Def.#GPUs | Max.#GPUs |
|:------------------:| ---------:| :--------:| :------: | :----------:| :----------:| :-----------:| :-------:| :-------: | :-------: |
| merlin-c-[001-024] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[101-124] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[201-224] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
| merlin-c-[301-306] | 1 core | 44 cores | 2 | 4000 | 352000 | 352000 | 10000 | N/A | N/A |
If nothing is specified, by default each core will use up to 8GB of memory (2 threads x 4000MB). Memory can be increased with the `--mem=<mem_in_MB>` and
`--mem-per-cpu=<mem_in_MB>` options, and the maximum memory allowed is `Max.Mem/Node`.
In **`merlin6`**, memory is considered a Consumable Resource, as is the CPU. Hence, both resources are accounted for when submitting a job,
and by default resources can not be oversubscribed. This is a main difference from the old **`merlin5`** cluster, where only CPUs were accounted for
and memory was oversubscribed by default.
{{site.data.alerts.tip}}Always check <b>'/etc/slurm/slurm.conf'</b> for changes in the hardware.
{{site.data.alerts.end}}
## Running jobs in the 'merlin6' cluster
In this chapter we will cover basic settings that users need to specify in order to run jobs in the Merlin6 CPU cluster.
### Merlin6 CPU cluster
To run jobs in the **`merlin6`** cluster users **can optionally** specify the cluster name in Slurm:
```bash
#SBATCH --clusters=merlin6
```
If no cluster name is specified, by default any job will be submitted to this cluster (as this is the main cluster).
Hence, this is only necessary if one has to deal with multiple clusters or when one has defined some environment
variables which can modify the cluster name.
### Merlin6 CPU partitions
Users might need to specify the Slurm partition. If no partition is specified, it will default to **`general`**:
```bash
#SBATCH --partition=<partition_name> # Possible <partition_name> values: general, daily, hourly
```
The following *partitions* (also known as *queues*) are configured in Slurm:
| CPU Partition | Default Time | Max Time | Max Nodes | PriorityJobFactor\* | PriorityTier\*\* |
|:-----------------: | :----------: | :------: | :-------: | :-----------------: | :--------------: |
| **<u>general</u>** | 1 day | 1 week | 50 | 1 | 1 |
| **daily** | 1 day | 1 day | 67 | 500 | 1 |
| **hourly** | 1 hour | 1 hour | unlimited | 1000 | 1 |
| **gfa-asa** | 1 day | 1 week | 11 | 1000 | 1000 |
\*The **PriorityJobFactor** value will be added to the job priority (*PARTITION* column in `sprio -l` ). In other words, jobs sent to higher priority
partitions will usually run first (however, other factors such as **job age** or mainly **fair share** might affect that decision). For the GPU
partitions, Slurm will also attempt first to allocate jobs on partitions with higher priority over partitions with lower priority.
**\*\***Jobs submitted to a partition with a higher **PriorityTier** value will be dispatched before pending jobs in partitions with a lower *PriorityTier* value
and, if possible, they will preempt running jobs from partitions with lower *PriorityTier* values.
* The **`general`** partition is the **default**. It can not have more than 50 nodes running jobs.
* For **`daily`** this limitation is extended to 67 nodes.
* For **`hourly`** there are no limits.
* **`gfa-asa`** is a **private hidden** partition, belonging to one experiment. **Access is restricted**. However, by agreement with the experiment,
nodes are usually added to the **`hourly`** partition as extra resources for the public resources.
{{site.data.alerts.tip}}Jobs which would run for less than one day should always be sent to <b>daily</b>, while jobs that would run for less
than one hour should be sent to <b>hourly</b>. This ensures that you have the highest priority over jobs sent to partitions with lower priority,
and it also avoids the limit on the number of nodes that applies to <b>general</b>. The idea behind this is that the cluster can not
be blocked by long jobs and that we can always ensure resources for shorter jobs.
{{site.data.alerts.end}}
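The recommendation above can be written down as a small decision rule. The following is a minimal sketch that follows the *Max Time* column of the partition table; `suggest_partition` is a hypothetical helper name and is not provided on the cluster.

```bash
#!/bin/bash
# Hypothetical helper: suggest a partition for an expected run time in minutes,
# following the 'Max Time' column of the partition table.
suggest_partition() {
    local minutes=$1
    if (( minutes <= 60 )); then
        echo hourly
    elif (( minutes <= 24 * 60 )); then
        echo daily
    elif (( minutes <= 7 * 24 * 60 )); then
        echo general
    else
        echo "run time exceeds the 1 week maximum" >&2
        return 1
    fi
}

# Example: a job expected to run for 10 hours
suggest_partition 600
```

The result can then be used as the value of `#SBATCH --partition=<partition_name>`.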
### Merlin6 CPU Accounts
Users need to ensure that the public **`merlin`** account is specified. If no account option is specified, this account is used by default.
This is mostly needed by users who have multiple Slurm accounts and who may define a different account by mistake.
```bash
#SBATCH --account=merlin # Possible values: merlin, gfa-asa
```
Not all the accounts can be used on all partitions. This is summarized in the table below:
| Slurm Account | Slurm Partitions |
| :------------------: | :----------------------------------: |
| **<u>merlin</u>** | `hourly`,`daily`, `general` |
| **gfa-asa** | `gfa-asa`,`hourly`,`daily`, `general` |
#### The 'gfa-asa' private account
The **`gfa-asa`** partition can only be accessed through the **`gfa-asa`** account. This account **is restricted**
to a group of users and is not public.
### Slurm CPU specific options
Some options are available when using CPUs. These are detailed here.
Alternative Slurm options for CPU based jobs are available. Please refer to the **man** pages
for each Slurm command for further information about it (`man salloc`, `man sbatch`, `man srun`).
Below are listed the most common settings:
```bash
#SBATCH --hint=[no]multithread
#SBATCH --ntasks=<ntasks>
#SBATCH --ntasks-per-core=<ntasks>
#SBATCH --ntasks-per-socket=<ntasks>
#SBATCH --ntasks-per-node=<ntasks>
#SBATCH --mem=<size[units]>
#SBATCH --mem-per-cpu=<size[units]>
#SBATCH --cpus-per-task=<ncpus>
#SBATCH --cpu-bind=[{quiet,verbose},]<type> # only for 'srun' command
```
#### Dealing with Hyper-Threading
The **`merlin6`** cluster contains nodes with Hyper-Threading enabled. One should always specify
whether to use Hyper-Threading or not. If not defined, Slurm will generally use it (exceptions apply).
```bash
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading.
```
### User and job limits
In the CPU cluster we provide some limits which basically apply to jobs and users. The idea behind this is to ensure a fair usage of the resources and to
avoid abuse of the resources by a single user or job. However, applying limits might affect the overall usage efficiency of the cluster (for example,
when user limits are applied, one can see pending jobs from a single user while many nodes are idle due to low overall activity).
In the same way, these limits can also be used to improve the efficiency of the cluster (for example, without any job size limits, a job requesting all
resources from the batch system would drain the entire cluster in order to fit the job, which is undesirable).
Hence, there is a need for setting up wise limits that ensure a fair usage of the resources, by trying to optimize the overall efficiency
of the cluster while allowing jobs of different nature and sizes (that is, **single core** based **vs parallel jobs** of different sizes) to run.
{{site.data.alerts.warning}}Wide limits are provided in the <b>daily</b> and <b>hourly</b> partitions, while for <b>general</b> those limits are
more restrictive.
<br>However, we kindly ask users to inform the Merlin administrators when there are plans to send big jobs which would require a
massive draining of nodes for allocating such jobs. This would apply to jobs requiring the <b>unlimited</b> QoS (see below <i>"Per job limits"</i>)
{{site.data.alerts.end}}
{{site.data.alerts.tip}}If you have different requirements, please let us know and we will try to accommodate them or propose a solution for you.
{{site.data.alerts.end}}
#### Per job limits
These are limits which apply to a single job. In other words, there is a maximum of resources a single job can use. Limits are described in the table below with the format: `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits will vary depending on the day and time of the week.
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|:----------: | :------------------------------: | :------------------------------: | :------------------------------: |
| **general** | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) |
| **daily** | daytime(cpu=704,mem=2750G) | nighttime(cpu=1408,mem=5500G) | unlimited(cpu=2200,mem=8593.75G) |
| **hourly** | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) | unlimited(cpu=2200,mem=8593.75G) |
By default, a job can not use more than 704 CPUs (max CPU per job). In the same way, memory is also proportionally limited. This is equivalent to
running a job using up to 8 nodes at once. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
Limits are relaxed for the **daily** partition during non working hours, and during the weekend limits are even wider.
For the **hourly** partition, wider limits are provided, **although running large parallel jobs is not desirable** (allocating such jobs requires a massive draining of nodes).
Setting per job limits is still necessary in order to avoid a massive draining of nodes in the cluster for allocating huge jobs. Hence, the **unlimited** QoS
mostly refers to "per user" limits more than to "per job" limits (in other words, users can run any number of hourly jobs, but the job size for such jobs is limited
with wide values).
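The CPU and memory figures in the table follow from the node definition (44 cores with 2 threads each and 352000MB per node). The following is a minimal sketch of that arithmetic; the helper names `nodes_for_cpu_limit` and `mem_gb_for_nodes` are hypothetical and only used for illustration.

```bash
#!/bin/bash
# Node figures from the node table: 44 cores x 2 threads, 352000 MB of memory
CPUS_PER_NODE=$((44 * 2))
MEM_PER_NODE_MB=352000

# Hypothetical helpers, named here only for illustration
nodes_for_cpu_limit() {
    echo $(( $1 / CPUS_PER_NODE ))
}
mem_gb_for_nodes() {
    echo $(( $1 * MEM_PER_NODE_MB / 1024 ))
}

nodes_for_cpu_limit 704    # prints 8
mem_gb_for_nodes 8         # prints 2750
```

The same calculation gives 16 nodes and 5500G for the `nighttime` limit of 1408 CPUs.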
#### Per user limits for CPU partitions
These limits apply exclusively to users. In other words, there is a maximum of resources a single user can use. Limits are described in the table below with the format: `SlurmQoS(limits)` (possible `SlurmQoS` values can be listed with the command `sacctmgr show qos`). Some limits will vary depending on the day and time of the week.
| Partition | Mon-Fri 0h-18h | Sun-Thu 18h-0h | From Fri 18h to Mon 0h |
|:-----------:| :----------------------------: | :---------------------------: | :----------------------------: |
| **general** | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) | normal(cpu=704,mem=2750G) |
| **daily** | daytime(cpu=1408,mem=5500G) | nighttime(cpu=2112,mem=8250G) | unlimited(cpu=6336,mem=24750G) |
| **hourly** | unlimited(cpu=6336,mem=24750G) | unlimited(cpu=6336,mem=24750G)| unlimited(cpu=6336,mem=24750G) |
By default, users can not use more than 704 CPUs at the same time (max CPU per user). Memory is also proportionally limited in the same way. This is
equivalent to 8 exclusive nodes. This limit applies to the **general** partition (fixed limit) and to the **daily** partition (only during working hours).
For the **hourly** partition, user limits are removed. Limits are relaxed for the **daily** partition during non
working hours, and during the weekend limits are removed.
## Advanced Slurm configuration
Clusters at PSI use the [Slurm Workload Manager](http://slurm.schedmd.com/) as the batch system technology for managing and scheduling jobs.
Slurm has been installed in a **multi-clustered** configuration, allowing to integrate multiple clusters in the same batch system.
For understanding the Slurm configuration setup in the cluster, it may sometimes be useful to check the following files:
* ``/etc/slurm/slurm.conf`` - can be found in the login nodes and computing nodes.
* ``/etc/slurm/gres.conf`` - can be found in the GPU nodes, and is also propagated to login nodes and computing nodes for user read access.
* ``/etc/slurm/cgroup.conf`` - can be found in the computing nodes, and is also propagated to login nodes for user read access.
The configuration files above, which can be found in the login nodes, correspond exclusively to the **merlin6** cluster.
Configuration files for the old **merlin5** cluster or for the **gmerlin6** cluster must be checked directly on any of the **merlin5** or **gmerlin6** computing nodes (for example, by logging in to one of the nodes while a job or an active allocation is running).

View File

@ -0,0 +1,329 @@
---
title: Slurm Examples
#tags:
keywords: example, template, examples, templates, running jobs, sbatch
last_updated: 28 June 2019
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-examples.html
---
## Single core based job examples
### Example 1: Hyperthreaded job
In this example we want to use hyperthreading (``--ntasks-per-core=2`` and ``--hint=multithread``). In our Merlin6 configuration,
the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB, hence each task can use up to 8000MB (2 threads x 4000MB).
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=2 # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
### Example 2: Non-hyperthreaded job
In this example we do not want hyper-threading (``--ntasks-per-core=1`` and ``--hint=nomultithread``). In our Merlin6 configuration,
the default memory per CPU (a CPU is equivalent to a core thread) is 4000MB. If we do not specify anything else, our
single core task will use a default of 4000MB. However, we could double it with ``--mem-per-cpu=8000`` if more memory is required
(remember, the second thread will not be used so we can safely assign +4000MB to the only active thread).
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --ntasks-per-core=1 # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
## Multi core based job examples
### Example 1: MPI with Hyper-Threading
In this example we run a job that will run 88 tasks. Merlin6 Apollo nodes have 44 cores each, with hyper-threading
enabled. This means that we can run 2 threads per core, 88 threads in total. To accomplish that, users should specify
``--ntasks-per-core=2`` and ``--hint=multithread``.
Use `--nodes=1` if you want to use a node exclusively (88 hyperthreaded tasks would fit in a Merlin6 node).
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --ntasks=88 # Job will run 88 tasks
#SBATCH --ntasks-per-core=2 # Request the max ntasks be invoked on each core
#SBATCH --hint=multithread # Use extra threads with in-core multi-threading
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
### Example 2: MPI without Hyper-Threading
In this example, we want to run a job that will run 44 tasks, and for performance reasons we want to disable hyper-threading.
Merlin6 Apollo nodes have 44 cores, each one with hyper-threading enabled. To ensure that only 1 thread will be used per task,
users should specify ``--ntasks-per-core=1`` and ``--hint=nomultithread``. With this configuration, we tell Slurm to run only 1
task per core and not to use hyperthreading. Hence, each task will be assigned to an independent core.
Use `--nodes=1` if you want to use a node exclusively (44 non-hyperthreaded tasks would fit in a Merlin6 node).
```bash
#!/bin/bash
#SBATCH --partition=hourly # Using 'hourly' will grant higher priority
#SBATCH --ntasks=44 # Job will run 44 tasks
#SBATCH --ntasks-per-core=1 # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading
#SBATCH --time=00:30:00 # Define max time job will run
#SBATCH --output=myscript.out # Define your output file
#SBATCH --error=myscript.err # Define your error file
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
### Example 3: Hyperthreaded Hybrid MPI/OpenMP job
In this example, we want to run a Hybrid Job using MPI and OpenMP using hyperthreading. In this job, we want to run 4 MPI
tasks by using 8 CPUs per task. Each task in our example requires 128GB of memory. Then we specify 16000MB per CPU
(8 x 16000MB = 128000MB). Notice that since hyperthreading is enabled, Slurm will use 4 cores per task (with hyperthreading
2 threads -a.k.a. Slurm CPUs- fit into a core).
```bash
#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8
#SBATCH --partition=hourly
#SBATCH --time=01:00:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=multithread
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
{{site.data.alerts.tip}} Also, always consider that **`'--mem-per-cpu' x '--cpus-per-task'`** can **never** exceed the maximum amount of memory per node (352000MB).
{{site.data.alerts.end}}
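In a hybrid job like the one above, the OpenMP runtime usually needs to be told how many threads each MPI task may start. A common convention, shown here as a minimal sketch, is to derive `OMP_NUM_THREADS` from the `SLURM_CPUS_PER_TASK` variable that Slurm exports when `--cpus-per-task` is set. Whether `$MYEXEC` honours `OMP_NUM_THREADS` is an assumption about the application.

```bash
#!/bin/bash
# Use the value Slurm exports for '--cpus-per-task'; fall back to 1 thread
# when the variable is not set (e.g. when testing outside of a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} OpenMP threads per MPI task"
```

These lines would go before the `srun $MYEXEC` line of the batch script.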
### Example 4: Non-hyperthreaded Hybrid MPI/OpenMP job
In this example, we want to run a Hybrid Job using MPI and OpenMP without hyperthreading. In this job, we want to run 4 MPI
tasks by using 8 CPUs per task. Each task in our example requires 128GB of memory. Then we specify 16000MB per CPU
(8 x 16000MB = 128000MB). Notice that since hyperthreading is disabled, Slurm will use 8 cores per task (disabling hyperthreading
we force the use of only 1 thread -a.k.a. 1 CPU- per core).
```bash
#!/bin/bash -l
#SBATCH --clusters=merlin6
#SBATCH --job-name=test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=1
#SBATCH --mem-per-cpu=16000
#SBATCH --cpus-per-task=8
#SBATCH --partition=hourly
#SBATCH --time=01:00:00
#SBATCH --output=srun_%j.out
#SBATCH --error=srun_%j.err
#SBATCH --hint=nomultithread
module purge
module load $MODULE_NAME # where $MODULE_NAME is a software in PModules
srun $MYEXEC # where $MYEXEC is a path to your binary file
```
{{site.data.alerts.tip}} Also, always consider that **`'--mem-per-cpu' x '--cpus-per-task'`** can **never** exceed the maximum amount of memory per node (352000MB).
{{site.data.alerts.end}}
## Advanced examples
### Array Jobs: launching a large number of related jobs
If you need to run a large number of jobs based on the same executable with systematically varying inputs,
e.g. for a parameter sweep, you can do this most easily in form of a **simple array job**.
``` bash
#!/bin/bash
#SBATCH --job-name=test-array
#SBATCH --partition=daily
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
#SBATCH --array=1-8
echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun $MYEXEC config-file-${SLURM_ARRAY_TASK_ID}.dat
```
This will run 8 independent jobs, where each job can use the counter
variable `SLURM_ARRAY_TASK_ID`, defined by Slurm inside the job's
environment, to feed the correct input arguments or configuration file
to the `$MYEXEC` executable. Each job will receive the same set of
configurations (e.g. a time limit of 8h in the example above).
The jobs are independent, and they will run in parallel if the cluster resources allow for
it. The jobs will get JobIDs like {some-number}_1 to {some-number}_8, and each of them will
have its own output file.
**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` defines
that at most 5 sub jobs may run at the same time.
You can also use an array job approach to run over all files in a directory, substituting the payload with
``` bash
# Bash arrays are zero-based: submit with --array=0-<N-1> for N files
FILES=(/path/to/data/*)
srun $MYEXEC "${FILES[$SLURM_ARRAY_TASK_ID]}"
```
Or, for a trivial case, you could supply the values for a parameter scan in the form
of an argument list that gets fed to the program using the counter variable.
``` bash
# 7 values with indices 0 to 6: submit with --array=0-6
ARGS=(0.05 0.25 0.5 1 2 5 100)
srun $MYEXEC ${ARGS[$SLURM_ARRAY_TASK_ID]}
```
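For longer parameter lists, an alternative to an inline array is a text file with one parameter set per line, where each array task reads the line matching its task ID. The following is a minimal sketch of that approach; `get_params` and `params.txt` are hypothetical names chosen for illustration.

``` bash
#!/bin/bash
# Hypothetical helper: print line <task_id> (1-based) of a parameter file.
# Usage: get_params <task_id> <param_file>
get_params() {
    local task_id=$1 param_file=$2
    sed -n "${task_id}p" "$param_file"
}

# Inside the batch script, with '#SBATCH --array=1-N' and N lines in params.txt:
#   srun $MYEXEC $(get_params "$SLURM_ARRAY_TASK_ID" params.txt)
```

Because `sed` counts lines from 1, this variant matches a 1-based `--array` range without any index shift.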
### Array jobs: running very long tasks with checkpoint files
If you need to run a job for much longer than the queues (partitions) permit, and
your executable is able to create checkpoint files, you can use this
strategy:
``` bash
#!/bin/bash
#SBATCH --job-name=test-checkpoint
#SBATCH --partition=general
#SBATCH --ntasks=1
#SBATCH --time=7-00:00:00 # each job can run for 7 days
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10%1 # Run a 10-job array, one job at a time.
if test -e checkpointfile; then
# There is a checkpoint file;
$MYEXEC --read-checkp checkpointfile
else
# There is no checkpoint file, start a new simulation.
$MYEXEC
fi
```
The `%1` in the `#SBATCH --array=1-10%1` statement defines that only one subjob can run at a time, so
subjob n+1 is only started when subjob n has finished. Each subjob will read the checkpoint file
if it is present.
### Packed jobs: running a large number of short tasks
Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
Slurm job. Use job packing instead, i.e. run the short tasks within a loop inside a single Slurm job.
You can launch the short tasks using `srun` with the `--exclusive` switch (not to be confused with the
switch of the same name used in the SBATCH commands). This switch ensures that each task gets its own
dedicated CPUs, so only as many tasks run in parallel as fit into the job's allocation.
As an example, the following job submission script will ask Slurm for
44 cores (threads), then it will run the `$MYEXEC` program 1000 times with
arguments from 1 to 1000. With the `-N1 -n1 -c1 --exclusive` options,
at any point in time only 44
instances are effectively running, each being allocated one CPU. You
can allocate several CPUs or tasks per instance by adapting
the corresponding parameters.
``` bash
#!/bin/bash
#SBATCH --job-name=test-packed
#SBATCH --partition=general
#SBATCH --time=7-00:00:00
#SBATCH --ntasks=44 # defines the number of parallel tasks
for i in {1..1000}
do
srun -N1 -n1 -c1 --exclusive $MYEXEC $i &
done
wait
```
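**Note:** The `&` at the end of the `srun` line is needed so that the script does not wait (block) on each task.
The `wait` command blocks until all such background tasks have finished. Without it, the batch script would exit while tasks are still running. Called without arguments, `wait` always returns 0, so it does not report failed tasks.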
## Hands-On Example
Copy and paste the following example into a file called `myAdvancedTest.batch`:
```bash
#!/bin/bash
#SBATCH --partition=daily # name of slurm partition to submit
#SBATCH --time=2:00:00 # limit the execution of this job to 2 hours, see sinfo for the max. allowance
#SBATCH --nodes=2 # number of nodes
#SBATCH --ntasks=44 # number of tasks
#SBATCH --ntasks-per-core=1 # Request the max ntasks be invoked on each core
#SBATCH --hint=nomultithread # Don't use extra threads with in-core multi-threading
module load gcc/9.2.0 openmpi/3.1.5-1_merlin6
module list
echo "Example no-MPI:" ; hostname # will print the hostname of the first allocated node only
echo "Example MPI:" ; srun hostname # will print one hostname per ntask
```
The example above specifies the options ``--nodes=2`` and ``--ntasks=44``. This means that 2 nodes are requested
and that 44 tasks are expected to run, hence 44 cores are needed for running that job. Slurm will allocate
2 nodes which together provide at least 44 cores. Since our nodes have 44 cores each, this does not guarantee an even split: if a node is empty (no other users
have running jobs there), most of the tasks can land on that single node.
If we want to ensure that the tasks are evenly spread over two different nodes (e.g. for boosting CPU frequency, or because the job
requires more memory per core), we should specify other options.
A good example is ``--ntasks-per-node=22``. This will equally distribute 22 tasks on each of the 2 nodes.
```bash
#SBATCH --ntasks-per-node=22
```
A different example could be specifying how much memory per core is needed. For instance, ``--mem-per-cpu=32000`` will reserve
~32000MB per core. Since we have a maximum of 352000MB per Apollo node, Slurm will only be able to allocate 11 cores (32000MB x 11 cores = 352000MB) per node.
This means that 4 nodes will be needed (max 11 tasks per node due to the memory definition, and we need to run 44 tasks), so in this case we need to change to ``--nodes=4``
(or remove ``--nodes``). Alternatively, we can decrease ``--mem-per-cpu`` to a lower value which allows the use of at least 22 cores per node (e.g. with ``16000``
the job should be able to use 2 nodes).
```bash
#SBATCH --mem-per-cpu=16000
```
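The node count implied by a given ``--mem-per-cpu`` value can be estimated with a little integer arithmetic. The following is a minimal sketch for single-CPU tasks on nodes with 352000MB and 44 cores; `nodes_needed` is a hypothetical helper name, and the result is only an estimate of what Slurm will allocate.

```bash
#!/bin/bash
# Hypothetical helper: estimate how many nodes are needed for <ntasks>
# single-CPU tasks when each CPU reserves <mem_per_cpu> MB of memory.
# Usage: nodes_needed <ntasks> <mem_per_cpu_mb>
nodes_needed() {
    local ntasks=$1 mem_per_cpu=$2
    local node_mem=352000 node_cores=44
    local fit=$(( node_mem / mem_per_cpu ))       # tasks that fit by memory
    if (( fit == 0 )); then
        echo "a single task does not fit into one node" >&2
        return 1
    fi
    if (( fit > node_cores )); then
        fit=$node_cores                           # never more than the cores
    fi
    echo $(( (ntasks + fit - 1) / fit ))          # round up
}

nodes_needed 44 32000    # prints 4 (11 tasks per node)
nodes_needed 44 16000    # prints 2 (22 tasks per node)
```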
Finally, in order to ensure exclusivity of the nodes, the ``--exclusive`` option can be used (see below). This will ensure that
the requested nodes are exclusive for the job (no other users' jobs will run on these nodes, and only completely
free nodes will be allocated).
```bash
#SBATCH --exclusive
```
This can be combined with the previous examples.
More advanced configurations can be defined and can be combined with the previous examples. More information about advanced
options can be found in the following link: https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').
If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
advanced configurations unless you are sure of what you are doing.