refactored slurm general docs merlin6

docs/merlin6/slurm-general-docs/interactive-jobs.md (new file, 217 lines)
@@ -0,0 +1,217 @@
# Running Interactive Jobs

## Running interactive jobs

There are two ways to run interactive jobs in Slurm, using the `salloc` and `srun` commands:

* **`salloc`**: obtains a Slurm job allocation (a set of nodes), executes command(s), and releases the allocation when the command is finished.
* **`srun`**: runs parallel tasks.

### srun

`srun` is used to run parallel jobs in the batch system. It can be used within a batch script
(which can be run with `sbatch`), or within a job allocation (which can be run with `salloc`).
It can also be used as a direct command (for example, from the login nodes).

When used inside a batch script or during a job allocation, `srun` is restricted to the
amount of resources allocated by the `sbatch`/`salloc` commands. In `sbatch`, these resources
are usually defined inside the batch script with the format `#SBATCH <option>=<value>`.
In other words, if you define 88 tasks (and 1 thread per core) and 2 nodes in your batch script
or allocation, `srun` is restricted to that amount of resources (you can use less, but never
exceed those limits).
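
As a minimal sketch of how such limits are defined (the time value and the application name
`my_parallel_app` are illustrative only), a batch script could look like this:

```bash
#!/bin/bash
#SBATCH --clusters=merlin6       # Slurm cluster to use
#SBATCH --nodes=2                # 2 nodes, as in the example above
#SBATCH --ntasks=88              # 88 tasks in total
#SBATCH --hint=nomultithread     # 1 task per physical core
#SBATCH --time=01:00:00          # realistic run time estimate (example value)

# 'srun' inherits the allocation defined above: it may use fewer
# resources than requested, but it can never exceed them.
srun my_parallel_app
```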

When used from the login node, `srun` is usually used to run a specific command or piece of software
interactively. `srun` is a blocking process (it blocks the bash prompt until the `srun`
command finishes, unless you run it in the background with `&`). This can be very useful for
interactive software which pops up a window and then submits jobs or runs sub-tasks in the
background (for example, **Relion**, **cisTEM**, etc.).
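
For instance, a sketch of launching such a tool in the background so that the login-node prompt
stays usable (`my_interactive_tool` is a placeholder for e.g. Relion or cisTEM, and the
single-task request is only an assumption):

```bash
# Run a graphical tool through Slurm in the background so the prompt stays free
srun --clusters=merlin6 --ntasks=1 --x11 my_interactive_tool &
```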

Refer to `man srun` to explore all possible options for this command.

??? note "Running the 'hostname' command on 3 nodes, using 2 cores (1 task/core) per node"
    ```console
    (base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --ntasks=6 --ntasks-per-node=2 --nodes=3 hostname
    srun: job 135088230 queued and waiting for resources
    srun: job 135088230 has been allocated resources
    merlin-c-102.psi.ch
    merlin-c-102.psi.ch
    merlin-c-101.psi.ch
    merlin-c-101.psi.ch
    merlin-c-103.psi.ch
    merlin-c-103.psi.ch
    ```

### salloc

**`salloc`** is used to obtain a Slurm job allocation (a set of nodes). Once the job is allocated,
users are able to execute interactive command(s). Once finished (`exit` or `Ctrl+D`),
the allocation is released. **`salloc`** is a blocking command, that is, it blocks
until the requested resources are allocated.

When running **`salloc`**, once the resources are allocated, *by default* the user gets
a ***new shell on one of the allocated resources*** (if a user has requested several nodes, the
new shell is opened on the first allocated node). However, this behaviour can be changed by adding
a shell (`$SHELL`) at the end of the `salloc` command. For example:

```bash
# Typical 'salloc' call
# - Same as running:
#   'salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL'
salloc --clusters=merlin6 -N 2 -n 2

# Custom 'salloc' call
# - $SHELL will open a local shell on the login node from where 'salloc' is running
salloc --clusters=merlin6 -N 2 -n 2 $SHELL
```

??? note "Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - *default*"
    ```console
    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2
    salloc: Pending job allocation 135171306
    salloc: job 135171306 queued and waiting for resources
    salloc: job 135171306 has been allocated resources
    salloc: Granted job allocation 135171306

    (base) [caubet_m@merlin-c-213 ~]$ srun hostname
    merlin-c-213.psi.ch
    merlin-c-214.psi.ch

    (base) [caubet_m@merlin-c-213 ~]$ exit
    exit
    salloc: Relinquishing job allocation 135171306

    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 -N 2 -n 2 srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --pty --preserve-env --mpi=none $SHELL
    salloc: Pending job allocation 135171342
    salloc: job 135171342 queued and waiting for resources
    salloc: job 135171342 has been allocated resources
    salloc: Granted job allocation 135171342

    (base) [caubet_m@merlin-c-021 ~]$ srun hostname
    merlin-c-021.psi.ch
    merlin-c-022.psi.ch

    (base) [caubet_m@merlin-c-021 ~]$ exit
    exit
    salloc: Relinquishing job allocation 135171342
    ```

??? note "Allocating 2 cores (1 task/core) in 2 nodes (1 core/node) - `$SHELL`"
    ```console
    (base) [caubet_m@merlin-export-01 ~]$ salloc --clusters=merlin6 --ntasks=2 --nodes=2 $SHELL
    salloc: Pending job allocation 135171308
    salloc: job 135171308 queued and waiting for resources
    salloc: job 135171308 has been allocated resources
    salloc: Granted job allocation 135171308

    (base) [caubet_m@merlin-export-01 ~]$ srun hostname
    merlin-c-218.psi.ch
    merlin-c-117.psi.ch

    (base) [caubet_m@merlin-export-01 ~]$ exit
    exit
    salloc: Relinquishing job allocation 135171308
    ```

## Running interactive jobs with X11 support

### Requirements

#### Graphical access

[NoMachine](/merlin6/nomachine.html) is the officially supported service for graphical
access in the Merlin cluster. This service runs on the login nodes. Check the
document [{Accessing Merlin -> NoMachine}](/merlin6/nomachine.html) for details about
how to connect to the **NoMachine** service in the Merlin cluster.

For other, not officially supported, graphical access (X11 forwarding):

* For Linux clients, please follow [{How To Use Merlin -> Accessing from Linux Clients}](/merlin6/connect-from-linux.html)
* For Windows clients, please follow [{How To Use Merlin -> Accessing from Windows Clients}](/merlin6/connect-from-windows.html)
* For MacOS clients, please follow [{How To Use Merlin -> Accessing from MacOS Clients}](/merl6/connect-from-macos.html)

### 'srun' with X11 support

The Merlin5 and Merlin6 clusters allow running window-based applications. For that, you need to
add the option `--x11` to the `srun` command. For example:

```bash
srun --clusters=merlin6 --x11 xclock
```

will pop up an X11-based clock.

In the same manner, you can create a bash shell with X11 support. To do that, add
the option `--pty` to the `srun --x11` command. Once the resource is allocated,
you can interactively run X11 and non-X11 commands from there.

```bash
srun --clusters=merlin6 --x11 --pty bash
```

??? note "Using 'srun' with X11 support"
    ```console
    (base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 xclock
    srun: job 135095591 queued and waiting for resources
    srun: job 135095591 has been allocated resources

    (base) [caubet_m@merlin-l-001 ~]$

    (base) [caubet_m@merlin-l-001 ~]$ srun --clusters=merlin6 --x11 --pty bash
    srun: job 135095592 queued and waiting for resources
    srun: job 135095592 has been allocated resources

    (base) [caubet_m@merlin-c-205 ~]$ xclock

    (base) [caubet_m@merlin-c-205 ~]$ echo "This was an example"
    This was an example

    (base) [caubet_m@merlin-c-205 ~]$ exit
    exit
    ```

### 'salloc' with X11 support

The **Merlin5** and **Merlin6** clusters allow running window-based applications. For that, you need to
add the option `--x11` to the `salloc` command. For example:

```bash
salloc --clusters=merlin6 --x11 xclock
```

will pop up an X11-based clock.

In the same manner, you can create a bash shell with X11 support. To do that, just run
`salloc --clusters=merlin6 --x11`. Once the resource is allocated,
you can interactively run X11 and non-X11 commands from there.

```bash
salloc --clusters=merlin6 --x11
```

??? note "Using 'salloc' with X11 support"
    ```console
    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11 xclock
    salloc: Pending job allocation 135171355
    salloc: job 135171355 queued and waiting for resources
    salloc: job 135171355 has been allocated resources
    salloc: Granted job allocation 135171355
    salloc: Relinquishing job allocation 135171355

    (base) [caubet_m@merlin-l-001 ~]$ salloc --clusters=merlin6 --x11
    salloc: Pending job allocation 135171349
    salloc: job 135171349 queued and waiting for resources
    salloc: job 135171349 has been allocated resources
    salloc: Granted job allocation 135171349
    salloc: Waiting for resource configuration
    salloc: Nodes merlin-c-117 are ready for job

    (base) [caubet_m@merlin-c-117 ~]$ xclock

    (base) [caubet_m@merlin-c-117 ~]$ echo "This was an example"
    This was an example

    (base) [caubet_m@merlin-c-117 ~]$ exit
    exit
    salloc: Relinquishing job allocation 135171349
    ```

docs/merlin6/slurm-general-docs/monitoring.md (new file, 278 lines)
@@ -0,0 +1,278 @@
# Monitoring

## Slurm Monitoring

### Job status

The status of submitted jobs can be checked with the `squeue` command:

```bash
squeue -u $username
```

Common statuses:

* **merlin-\***: Running on the specified host
* **(Priority)**: Waiting in the queue
* **(Resources)**: At the head of the queue, waiting for machines to become available
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: The job would exceed the per-user limits on
  the number of simultaneous CPUs/nodes. Use `scancel` to remove the job and
  resubmit with fewer resources, or else wait for your other jobs to finish.
* **(PartitionNodeLimit)**: Exceeds all resources available on this partition.
  Run `scancel` and resubmit to a different partition (`-p`) or with fewer
  resources.

Check the **man** pages (`man squeue`) for all possible options of this command.
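
For instance, a couple of handy variations (the job ID below is purely illustrative, taken from
the example output further down):

```bash
# Show only your pending jobs, with the reason they are waiting
squeue -u $USER -t PENDING

# Cancel one of your jobs by its job ID
scancel 134332544
```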

??? note "'squeue' example"
    ```console
    # squeue -u feichtinger
         JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     134332544   general spawner- feichtin  R 5-06:47:45      1 merlin-c-204
     134321376   general subm-tal feichtin  R 5-22:27:59      1 merlin-c-204
    ```

### Partition status

The status of the nodes and partitions (a.k.a. queues) can be seen with the `sinfo` command:

```bash
sinfo
```

Check the **man** pages (`man sinfo`) for all possible options of this command.
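
For example, the view can be narrowed down to a single partition or to one of the other clusters
(partition and cluster names as used elsewhere on this page):

```bash
# Only the 'hourly' partition
sinfo -p hourly

# Nodes of the gmerlin6 (GPU) cluster
sinfo --clusters=gmerlin6
```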

??? note "'sinfo' example"
    ```console
    # sinfo -l
    Thu Jan 23 16:34:49 2020
    PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
    test         up 1-00:00:00 1-infinite   no       NO        all      3       mixed merlin-c-[024,223-224]
    test         up 1-00:00:00 1-infinite   no       NO        all      2   allocated merlin-c-[123-124]
    test         up 1-00:00:00 1-infinite   no       NO        all      1        idle merlin-c-023
    general*     up 7-00:00:00       1-50   no       NO        all      6       mixed merlin-c-[007,204,207-209,219]
    general*     up 7-00:00:00       1-50   no       NO        all     57   allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
    general*     up 7-00:00:00       1-50   no       NO        all      3        idle merlin-c-[006,021-022]
    daily        up 1-00:00:00       1-60   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
    daily        up 1-00:00:00       1-60   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
    daily        up 1-00:00:00       1-60   no       NO        all      4        idle merlin-c-[006,021-023]
    hourly       up    1:00:00 1-infinite   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
    hourly       up    1:00:00 1-infinite   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
    hourly       up    1:00:00 1-infinite   no       NO        all      4        idle merlin-c-[006,021-023]
    gpu          up 7-00:00:00 1-infinite   no       NO        all      1       mixed merlin-g-007
    gpu          up 7-00:00:00 1-infinite   no       NO        all      8   allocated merlin-g-[001-006,008-009]
    ```

### Slurm commander

The **[Slurm Commander (scom)](https://github.com/CLIP-HPC/SlurmCommander/)** is a simple but very useful open source text-based user interface for
efficient interaction with Slurm. It is developed by the **Cloud Infrastructure Project (CLIP-HPC)** with external contributions. To use it, one can
simply run one of the following commands:

```bash
scom                          # merlin6 cluster
SLURM_CLUSTERS=merlin5 scom   # merlin5 cluster
SLURM_CLUSTERS=gmerlin6 scom  # gmerlin6 cluster
scom -h                       # Help and extra options
scom -d 14                    # Set Job History to 14 days (instead of the default 7)
```

With this simple interface, users can interact with their jobs, as well as get information about past and present jobs:

* Filtering jobs by substring is possible with the `/` key.
* Users can perform multiple actions on their jobs (such as cancelling,
  holding or requeueing a job), SSH to a node where one of their jobs is running,
  or get extended details and statistics of the job itself.

Users can also check the status of the cluster, to get statistics and node usage information, as well as information about node properties.

The interface also provides a few job templates for different use cases (e.g. MPI, OpenMP, hybrid, single core). Users can modify these templates,
save them locally to the current directory, and submit the job to the cluster.

!!! note
    Currently, `scom` does not provide live updated information for the <span
    style="color:darkorange;">[Job History]</span> tab. To update Job History
    information, users have to exit the application with the <span
    style="color:darkorange;">q</span> key. Other tabs will be updated every 5
    seconds (default). Also, the <span style="color:darkorange;">[Job
    History]</span> tab contains information for the **merlin6** CPU cluster
    only. Future updates will provide information for other clusters.

For further information about how to use **scom**, please refer to the **[Slurm Commander Project webpage](https://github.com/CLIP-HPC/SlurmCommander/)**.



### Job accounting

Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the `sacct` command.
This command is very flexible and can provide a lot of information. To check all the available options, please read `man sacct`.
Below, we summarize some examples that can be useful for users:

```bash
# Today's jobs, basic summary
sacct

# Today's jobs, with details
sacct --long

# Jobs since January 1, 2021, 12:00, with details
sacct -S 2021-01-01T12:00:00 --long

# Specific job accounting
sacct --long -j $jobid

# Jobs custom details, without steps (-X)
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80

# Jobs custom details, with steps
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
```
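
As an additional sketch, `sacct` output can also be filtered to spot jobs that did not finish
cleanly (the chosen columns are just an example):

```bash
# Today's jobs that are not in COMPLETED state, with their exit codes
sacct -X --format=JobID,JobName%20,State,ExitCode,Elapsed | grep -v COMPLETED
```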

### Job efficiency

Users can check how efficient their jobs are with the `seff` command:

```bash
seff $jobid
```

??? note "'seff' example"
    ```console
    # seff 134333893
    Job ID: 134333893
    Cluster: merlin6
    User/Group: albajacas_a/unx-sls
    State: COMPLETED (exit code 0)
    Nodes: 1
    Cores per node: 8
    CPU Utilized: 00:26:15
    CPU Efficiency: 49.47% of 00:53:04 core-walltime
    Job Wall-clock time: 00:06:38
    Memory Utilized: 60.73 MB
    Memory Efficiency: 0.19% of 31.25 GB
    ```
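
To get a quick efficiency overview of several jobs at once, `seff` can be combined with `sacct`;
a small illustrative loop (a sketch, not an official tool) could look like this:

```bash
# Report efficiency for every job you ran today (one 'seff' call per job ID)
for id in $(sacct -X -n -o JobID); do
    seff "$id"
done
```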

### List job attributes

The `sjstat` command is used to display statistics of jobs under the control of Slurm. To use it:

```bash
sjstat
```

??? note "'sjstat' example"
    ```console
    # sjstat -v

    Scheduling pool data:
    ----------------------------------------------------------------------------------
                                Total  Usable   Free   Node        Time  Other
    Pool         Memory   Cpus  Nodes   Nodes  Nodes  Limit       Limit  traits
    ----------------------------------------------------------------------------------
    test       373502Mb     88      6       6      1  UNLIM  1-00:00:00
    general*   373502Mb     88     66      66      8     50  7-00:00:00
    daily      373502Mb     88     72      72      9     60  1-00:00:00
    hourly     373502Mb     88     72      72      9  UNLIM    01:00:00
    gpu        128000Mb      8      1       1      0  UNLIM  7-00:00:00
    gpu        128000Mb     20      8       8      0  UNLIM  7-00:00:00

    Running job data:
    ---------------------------------------------------------------------------------------------------
    JobID     User      Procs  Pool     Status  Time Used   Time Limit  Started         Master/Other
    ---------------------------------------------------------------------------------------------------
    13433377  collu_g   1      gpu      PD      0:00        24:00:00    N/A             (Resources)
    13433389  collu_g   20     gpu      PD      0:00        24:00:00    N/A             (Resources)
    13433382  jaervine  4      gpu      PD      0:00        24:00:00    N/A             (Priority)
    13433386  barret_d  20     gpu      PD      0:00        24:00:00    N/A             (Priority)
    13433382  pamula_f  20     gpu      PD      0:00        168:00:00   N/A             (Priority)
    13433387  pamula_f  4      gpu      PD      0:00        24:00:00    N/A             (Priority)
    13433365  andreani  132    daily    PD      0:00        24:00:00    N/A             (Dependency)
    13433388  marino_j  6      gpu      R       1:43:12     168:00:00   01-23T14:54:57  merlin-g-007
    13433377  choi_s    40     gpu      R       2:09:55     48:00:00    01-23T14:28:14  merlin-g-006
    13433373  qi_c      20     gpu      R       7:00:04     24:00:00    01-23T09:38:05  merlin-g-004
    13433390  jaervine  2      gpu      R       5:18        24:00:00    01-23T16:32:51  merlin-g-007
    13433390  jaervine  2      gpu      R       15:18       24:00:00    01-23T16:22:51  merlin-g-007
    13433375  bellotti  4      gpu      R       7:35:44     9:00:00     01-23T09:02:25  merlin-g-001
    13433358  bellotti  1      gpu      R       1-05:52:19  144:00:00   01-22T10:45:50  merlin-g-007
    13433377  lavriha_  20     gpu      R       5:13:24     24:00:00    01-23T11:24:45  merlin-g-008
    13433370  lavriha_  40     gpu      R       22:43:09    24:00:00    01-22T17:55:00  merlin-g-003
    13433373  qi_c      20     gpu      R       15:03:15    24:00:00    01-23T01:34:54  merlin-g-002
    13433371  qi_c      4      gpu      R       22:14:14    168:00:00   01-22T18:23:55  merlin-g-001
    13433254  feichtin  2      general  R       5-07:26:11  156:00:00   01-18T09:11:58  merlin-c-204
    13432137  feichtin  2      general  R       5-23:06:25  160:00:00   01-17T17:31:44  merlin-c-204
    13433389  albajaca  32     hourly   R       41:19       1:00:00     01-23T15:56:50  merlin-c-219
    13433387  riemann_  2      general  R       1:51:47     4:00:00     01-23T14:46:22  merlin-c-204
    13433370  jimenez_  2      general  R       23:20:45    168:00:00   01-22T17:17:24  merlin-c-106
    13433381  jimenez_  2      general  R       4:55:33     168:00:00   01-23T11:42:36  merlin-c-219
    13433390  sayed_m   128    daily    R       21:49       10:00:00    01-23T16:16:20  merlin-c-223
    13433359  adelmann  2      general  R       1-05:00:09  48:00:00    01-22T11:38:00  merlin-c-204
    13433377  zimmerma  2      daily    R       6:13:38     24:00:00    01-23T10:24:31  merlin-c-007
    13433375  zohdirad  24     daily    R       7:33:16     10:00:00    01-23T09:04:53  merlin-c-218
    13433363  zimmerma  6      general  R       1-02:54:20  47:50:00    01-22T13:43:49  merlin-c-106
    13433376  zimmerma  6      general  R       7:25:42     23:50:00    01-23T09:12:27  merlin-c-007
    13433371  vazquez_  16     daily    R       21:46:31    23:59:00    01-22T18:51:38  merlin-c-106
    13433382  vazquez_  16     daily    R       4:09:23     23:59:00    01-23T12:28:46  merlin-c-024
    13433376  jiang_j1  440    daily    R       7:11:14     10:00:00    01-23T09:26:55  merlin-c-123
    13433376  jiang_j1  24     daily    R       7:08:19     10:00:00    01-23T09:29:50  merlin-c-220
    13433384  kranjcev  440    daily    R       2:48:19     24:00:00    01-23T13:49:50  merlin-c-108
    13433371  vazquez_  16     general  R       20:15:15    120:00:00   01-22T20:22:54  merlin-c-210
    13433371  vazquez_  16     general  R       21:15:51    120:00:00   01-22T19:22:18  merlin-c-210
    13433374  colonna_  176    daily    R       8:23:18     24:00:00    01-23T08:14:51  merlin-c-211
    13433374  bures_l   88     daily    R       10:45:06    24:00:00    01-23T05:53:03  merlin-c-001
    13433375  derlet    88     daily    R       7:32:05     24:00:00    01-23T09:06:04  merlin-c-107
    13433373  derlet    88     daily    R       17:21:57    24:00:00    01-22T23:16:12  merlin-c-002
    13433373  derlet    88     daily    R       18:13:05    24:00:00    01-22T22:25:04  merlin-c-112
    13433365  andreani  264    daily    R       4:10:08     24:00:00    01-23T12:28:01  merlin-c-003
    13431187  mahrous_  88     general  R       6-15:59:16  168:00:00   01-17T00:38:53  merlin-c-111
    13433387  kranjcev  2      general  R       1:48:47     4:00:00     01-23T14:49:22  merlin-c-204
    13433368  karalis_  352    general  R       1-00:05:22  96:00:00    01-22T16:32:47  merlin-c-013
    13433367  karalis_  352    general  R       1-00:06:44  96:00:00    01-22T16:31:25  merlin-c-118
    13433385  karalis_  352    general  R       1:37:24     96:00:00    01-23T15:00:45  merlin-c-213
    13433374  sato      256    general  R       14:55:55    24:00:00    01-23T01:42:14  merlin-c-204
    13433374  sato      64     general  R       10:43:35    24:00:00    01-23T05:54:34  merlin-c-106
    67723568  sato      32     general  R       10:40:07    24:00:00    01-23T05:58:02  merlin-c-007
    13433265  khanppna  440    general  R       3-18:20:58  168:00:00   01-19T22:17:11  merlin-c-008
    13433375  khanppna  704    general  R       7:31:24     24:00:00    01-23T09:06:45  merlin-c-101
    13433371  khanppna  616    general  R       21:40:33    24:00:00    01-22T18:57:36  merlin-c-208
    ```

### Graphical user interface

When using **ssh** with X11 forwarding (`ssh -XY`), or when using NoMachine, users can use `sview`.
**SView** is a graphical user interface to view and modify Slurm state. To run **sview**:

```bash
ssh -XY $username@merlin-l-001.psi.ch  # Not necessary when using NoMachine
sview
```



## General Monitoring

The following pages contain basic monitoring for Slurm and the computing nodes.
Currently, monitoring is based on Grafana + InfluxDB. In the future it will
be moved to a different service based on ElasticSearch + LogStash + Kibana.

In the meantime, the following monitoring pages are available on a best-effort
basis:

### Merlin6 Monitoring Pages

* Slurm monitoring:
    * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
    * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
    * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
    * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
    * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)

### Merlin5 Monitoring Pages

* Slurm monitoring:
    * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
    * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
* Nodes monitoring:
    * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)

@@ -1,20 +1,11 @@
---
title: Running Slurm Scripts
#tags:
keywords: batch script, slurm, sbatch, srun, jobs, job, submit, submission, array jobs, array, squeue, sinfo, scancel, packed jobs, short jobs, very short jobs, multithread, rules, no-multithread, HT
last_updated: 07 September 2022
summary: "This document describes how to run batch scripts in Slurm."
sidebar: merlin6_sidebar
permalink: /merlin6/running-jobs.html
---

# Running Slurm Scripts

## The rules

Before starting to use the cluster, please read the following rules:

1. To ease and improve *scheduling* and *backfilling*, always try to **estimate and define a proper run time** for your jobs:
    * Use `--time=<D-HH:MM:SS>` for that.
    * For very long runs, please consider using ***[Job Arrays with Checkpointing](/merlin6/running-jobs.html#array-jobs-running-very-long-tasks-with-checkpoint-files)***
2. Try to optimize your jobs to run within **one day** at most. Please consider the following:
    * Some software can simply scale up by using more nodes while drastically reducing the run time.
@@ -28,8 +19,10 @@ Before starting using the cluster, please read the following rules:
4. Do not submit hundreds of similar jobs!
    * Use ***[Array Jobs](/merlin6/running-jobs.html#array-jobs-launching-a-large-number-of-related-jobs)*** for gathering jobs instead.

!!! tip
    Having a good estimation of the *time* needed by your jobs, a proper way of
    running them, and optimizing the jobs to *run within one day* will contribute
    to a fair and efficient use of the system.

## Basic commands for running batch scripts

@@ -38,13 +31,13 @@ Before starting using the cluster, please read the following rules:
* Use **`squeue`** for checking job status.
* Use **`scancel`** for cancelling/deleting a job from the queue.

!!! tip
    Use Linux `man` pages when needed (i.e. `man sbatch`), mostly for checking the available options for the above commands.
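
For instance, a minimal submit/inspect/cancel cycle looks like this (`myjob.sh` is a placeholder
for your own batch script, and the job ID comes from the `sbatch`/`squeue` output):

```bash
sbatch myjob.sh     # submit the batch script
squeue -u $USER     # check its status in the queue
scancel <jobid>     # cancel it, if needed
```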
|
||||
|
||||
## Basic settings
|
||||
|
||||
For a complete list of options and parameters available is recommended to use the **man pages** (i.e. ``man sbatch``, ``man srun``, ``man salloc``).
|
||||
Please, notice that behaviour for some parameters might change depending on the command used when running jobs (in example, ``--exclusive`` behaviour in ``sbatch`` differs from ``srun``).
|
||||
For a complete list of options and parameters available is recommended to use the **man pages** (i.e. `man sbatch`, `man srun`, `man salloc`).
|
||||
Please, notice that behaviour for some parameters might change depending on the command used when running jobs (in example, `--exclusive` behaviour in `sbatch` differs from `srun`).
|
||||
|
||||
In this chapter we show the basic parameters which are usually needed in the Merlin cluster.
|
||||
|
||||
@@ -53,12 +46,15 @@ In this chapter we show the basic parameters which are usually needed in the Mer
|
||||
The following settings are the minimum required for running a job in the Merlin CPU and GPU nodes. Please, consider taking a look to the **man pages** (i.e. `man sbatch`, `man salloc`, `man srun`) for more information about all possible options. Also, do not hesitate to contact us on any questions.
|
||||
|
||||
* **Clusters:** For running jobs in the different Slurm clusters, users should to add the following option:
|
||||
|
||||
```bash
|
||||
#SBATCH --clusters=<cluster_name> # Possible values: merlin5, merlin6, gmerlin6
|
||||
```
|
||||
|
||||
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html),[**`gmerlin6`**](/gmerlin6/slurm-configuration.html),[**`merlin5`**](/merlin5/slurm-configuration.html) for further information.
|
||||
|
||||
* **Partitions:** except when using the *default* partition for each cluster, one needs to specify the partition:
|
||||
|
||||
```bash
|
||||
#SBATCH --partition=<partition_name> # Check each cluster documentation for possible values
|
||||
```
|
||||
@@ -66,34 +62,46 @@ The following settings are the minimum required for running a job in the Merlin
Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information.

* **[Optional] Disabling shared nodes**: by default, nodes are not exclusive and multiple users can run jobs on the same node. One can request exclusive node usage with the following option:

```bash
#SBATCH --exclusive   # Only if you want a dedicated node
```

* **Time**: it is important to define how long a job should run, as realistically as possible. This will help Slurm with *scheduling* and *backfilling*, and will let Slurm manage job queues more efficiently. This value can never exceed the `MaxTime` of the affected partition.

```bash
#SBATCH --time=<D-HH:MM:SS>  # Can not exceed the partition `MaxTime`
```

Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about partition `MaxTime` values.

* **Output and error files**: by default, the Slurm script will generate standard output (``slurm-%j.out``, where `%j` is the job_id) and error (``slurm-%j.err``, where `%j` is the job_id) files in the directory from where the job was submitted. Users can change the default names with the following options:
* **Output and error files**: by default, the Slurm script will generate standard output (`slurm-%j.out`, where `%j` is the job_id) and error (`slurm-%j.err`, where `%j` is the job_id) files in the directory from where the job was submitted. Users can change the default names with the following options:

```bash
#SBATCH --output=<filename>  # Can include path. Patterns accepted (i.e. %j)
#SBATCH --error=<filename>   # Can include path. Patterns accepted (i.e. %j)
```
Use **man sbatch** (``man sbatch | grep -A36 '^filename pattern'``) to get the full specification of **filename patterns**.

Use **man sbatch** (`man sbatch | grep -A36 '^filename pattern'`) to get the full specification of **filename patterns**.
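
For instance, a hypothetical header using the `%x` (job name) and `%j` (job ID) patterns could look as follows (the `logs/` directory is an assumption and must exist before submission):

```bash
#SBATCH --job-name=myanalysis
#SBATCH --output=logs/%x-%j.out   # e.g. logs/myanalysis-135088230.out
#SBATCH --error=logs/%x-%j.err    # %x = job name, %j = job ID
```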
* **Enable/Disable Hyper-Threading**: whether a node has Hyper-Threading or not depends on the node configuration. By default, HT nodes have HT enabled, but one should explicitly specify the desired behaviour with the following options:

```bash
#SBATCH --hint=multithread    # Use extra threads with in-core multi-threading.
#SBATCH --hint=nomultithread  # Don't use extra threads with in-core multi-threading.
```

Refer to the documentation of each cluster ([**`merlin6`**](/merlin6/slurm-configuration.html), [**`gmerlin6`**](/gmerlin6/slurm-configuration.html), [**`merlin5`**](/merlin5/slurm-configuration.html)) for further information about node configuration and Hyper-Threading.
Consider that, depending on your job requirements, you might also need to set `--ntasks-per-core` or `--cpus-per-task` (or even other options) in addition to the `--hint` option. Please contact us in case of doubts.

{{site.data.alerts.tip}} In general, for the cluster `merlin6` <span style="color:orange;"><b>--hint=[no]multithread</b></span> is a recommended field. On the other hand, <span style="color:orange;"><b>--ntasks-per-core</b></span> is only needed when
one needs to define how a task should be handled within a core, and this setting will not be generally used on Hybrid MPI/OpenMP jobs where multiple cores are needed for single tasks.
{{site.data.alerts.end}}
!!! tip
    In general, for the cluster `merlin6`
    <span style="color:orange;">**--hint=[no]multithread**</span> is a recommended
    field. On the other hand,
    <span style="color:orange;">**--ntasks-per-core**</span> is only needed when
    one needs to define how a task should be handled within a core, and this
    setting will not be generally used on Hybrid MPI/OpenMP jobs where
    multiple cores are needed for single tasks.
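
Putting the settings above together, a minimal batch script for the CPU cluster could look like the sketch below. The partition name, resources, module and program are placeholders and must be adapted to your case:

```bash
#!/bin/bash
#SBATCH --clusters=merlin6       # Run in the merlin6 CPU cluster
#SBATCH --partition=general      # Placeholder: check the cluster documentation for valid partitions
#SBATCH --time=0-01:00:00        # 1 hour; must not exceed the partition MaxTime
#SBATCH --output=run-%j.out      # Standard output (%j = job ID)
#SBATCH --error=run-%j.err       # Standard error
#SBATCH --ntasks=1               # Single task
#SBATCH --hint=nomultithread     # Do not use the extra Hyper-Threading threads

module load gcc                  # Placeholder: load whatever your software needs
srun ./myprogram                 # Placeholder for the actual payload
```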

## Batch script templates

@@ -187,7 +195,6 @@ e.g. for a parameter sweep, you can do this most easily in form of a **simple ar

echo $(date) "I am job number ${SLURM_ARRAY_TASK_ID}"
srun myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat

```

This will run 8 independent jobs, where each job can use the counter
@@ -200,11 +207,11 @@ The jobs are independent, but they will run in parallel (if the cluster resource
it). The jobs will get JobIDs like {some-number}_0 to {some-number}_7, and they also will each
have their own output file.

**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
!!! note
    * Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
    * If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
      that only 5 sub jobs may ever run in parallel (see the sketch below).
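
As an illustration, a hypothetical array job header using such a throttle could look as follows (job name, file names and resources are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=sweep
#SBATCH --array=1-100%5             # 100 sub jobs, at most 5 running at the same time
#SBATCH --time=01:00:00
#SBATCH --output=sweep-%A_%a.out    # %A = array master job ID, %a = array task ID

srun ./myprogram config-file-${SLURM_ARRAY_TASK_ID}.dat
```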

You can also use an array job approach to run over all files in a directory, substituting the payload with

``` bash
@@ -247,7 +254,6 @@ The `%1` in the `#SBATCH --array=1-10%1` statement defines that only 1 subjob ca
this will result in subjob n+1 only being started when job n has finished. It will read the checkpoint file
if it is present.

### Packed jobs: running a large number of short tasks

Since the launching of a Slurm job incurs some overhead, you should not submit each short task as a separate
@@ -264,7 +270,7 @@ arguments passed from 1 to 1000. But with the =-N1 -n1 -c1
instances are effectively running, each being allocated one CPU. You
can at this point decide to allocate several CPUs or tasks by adapting
the corresponding parameters.

``` bash
#! /bin/bash
#SBATCH --job-name=test-checkpoint
@@ -279,6 +285,7 @@ done
wait
```

**Note:** The `&` at the end of the `srun` line is needed so that the script does not wait (block) for each task.
The `wait` command waits for all such background tasks to finish and returns the exit code.
!!! note
    The `&` at the end of the `srun` line is needed so that the script does not wait (block) for each task.
    The `wait` command waits for all such background tasks to finish and returns the exit code.
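
A minimal, self-contained sketch of the packed-job pattern described above could look as follows (job name, the number of tasks and the payload `./mytask` are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=packed-tasks
#SBATCH --clusters=merlin6
#SBATCH --ntasks=8              # The allocation provides 8 CPUs for the whole pack
#SBATCH --time=01:00:00

# Launch 1000 short tasks as job steps; with -N1 -n1 -c1 each step uses a single
# CPU of the allocation, so only a subset of them runs at any given time.
for i in $(seq 1 1000); do
    srun -N1 -n1 -c1 ./mytask "$i" &
done
wait   # Wait for all background job steps to finish
```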
@@ -1,15 +1,7 @@
---
title: Slurm Basic Commands
#tags:
keywords: sinfo, squeue, sbatch, srun, salloc, scancel, sview, seff, sjstat, sacct, basic commands, slurm commands, cluster
last_updated: 07 September 2022
#summary: ""
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-basics.html
---
# Slurm Basic Commands

In this document some basic commands for using Slurm are shown. Advanced examples for some of these
are explained in other Merlin6 Slurm pages. You can always use ```man <command>``` pages for more
are explained in other Merlin6 Slurm pages. You can always use `man <command>` pages for more
information about options and examples.

## Basic commands
@@ -33,7 +25,7 @@ sacct # Show job accounting, useful for checking details of finished

---

## Advanced basic commands:
## Advanced basic commands

```bash
sinfo -N -l # list nodes, state, resources (#CPUs, memory per node, ...), etc.
@@ -44,7 +36,7 @@ sprio -l # to view the factors that comprise a job's scheduling priority

## Show information for specific cluster

By default, any of the above commands shows information of the local cluster which is ***merlin6**.
By default, any of the above commands shows information about the local cluster, which is **merlin6**.

If you want to see the same information for **merlin5**, you have to add the parameter ``--clusters=merlin5``.
If you want to see both clusters at the same time, add the option ``--federation``.
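
For example, a possible way of combining these options with the basic commands:

```bash
sinfo --clusters=merlin5              # Partition and node information for the merlin5 cluster
squeue --clusters=merlin6 -u $USER    # Your jobs in the merlin6 cluster
squeue --federation                   # Jobs across all federated clusters at once
```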
@@ -1,12 +1,4 @@
---
title: Slurm Examples
#tags:
keywords: slurm example, template, examples, templates, running jobs, sbatch, single core based jobs, HT, multithread, no-multithread, mpi, openmp, packed jobs, hands-on, array jobs, gpu
last_updated: 07 September 2022
summary: "This document shows different template examples for running jobs in the Merlin cluster."
sidebar: merlin6_sidebar
permalink: /merlin6/slurm-examples.html
---
# Slurm Examples

## Single core based job examples

@@ -211,10 +203,11 @@ The jobs are independent, but they will run in parallel (if the cluster resource
it). The jobs will get JobIDs like {some-number}_0 to {some-number}_7, and they also will each
have their own output file.

**Note:**
* Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
* If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
that only 5 sub jobs may ever run in parallel.
!!! note

    * Do not use such jobs if you have very short tasks, since each array sub job will incur the full overhead for launching an independent Slurm job. For such cases you should use a **packed job** (see below).
    * If you want to control how many of these jobs can run in parallel, you can use the `#SBATCH --array=1-100%5` syntax. The `%5` will define
      that only 5 sub jobs may ever run in parallel.

You can also use an array job approach to run over all files in a directory, substituting the payload with

@@ -290,8 +283,10 @@ done
wait
```

**Note:** The `&` at the end of the `srun` line is needed so that the script does not wait (block) for each task.
The `wait` command waits for all such background tasks to finish and returns the exit code.
!!! note

    The `&` at the end of the `srun` line is needed so that the script does not wait (block) for each task.
    The `wait` command waits for all such background tasks to finish and returns the exit code.

## Hands-On Example
@@ -348,7 +343,7 @@ free nodes will be allocated).
This can be combined with the previous examples.

More advanced configurations can be defined and can be combined with the previous examples. More information about advanced
options can be found in the following link: https://slurm.schedmd.com/sbatch.html (or run 'man sbatch').
options can be found in the following link: <https://slurm.schedmd.com/sbatch.html> (or run `man sbatch`).

If you have questions about how to properly execute your jobs, please contact us through merlin-admins@lists.psi.ch. Do not run
If you have questions about how to properly execute your jobs, please contact us through <mailto:merlin-admins@lists.psi.ch>. Do not run
advanced configurations unless you are sure of what you are doing.
@@ -92,6 +92,12 @@ nav:
- merlin6/how-to-use-merlin/ssh-keys.md
- merlin6/how-to-use-merlin/kerberos.md
- merlin6/how-to-use-merlin/using-modules.md
- Slurm General Documentation:
    - merlin6/slurm-general-docs/slurm-basic-commands.md
    - merlin6/slurm-general-docs/running-jobs.md
    - merlin6/slurm-general-docs/interactive-jobs.md
    - merlin6/slurm-general-docs/slurm-examples.md
    - merlin6/slurm-general-docs/monitoring.md
- PSI@CSCS:
    - cscs-userlab/index.md
    - cscs-userlab/transfer-data.md