256 lines
14 KiB
Markdown
256 lines
14 KiB
Markdown
---
|
|
title: Monitoring
|
|
#tags:
|
|
#keywords:
|
|
last_updated: 20 June 2019
|
|
#summary: ""
|
|
sidebar: merlin6_sidebar
|
|
permalink: /merlin6/monitoring.html
|
|
---
|
|
|
|
## Slurm Monitoring
|
|
|
|
### Job status
|
|
|
|
The status of submitted jobs can be check with the ``squeue`` command:
|
|
|
|
```bash
|
|
squeue -u $username
|
|
```
|
|
|
|
Common statuses:
|
|
|
|
* **merlin-\***: Running on the specified host
|
|
* **(Priority)**: Waiting in the queue
|
|
* **(Resources)**: At the head of the queue, waiting for machines to become available
|
|
* **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: Job would exceed per-user limitations on
|
|
the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and
|
|
resubmit with fewer resources, or else wait for your other jobs to finish.
|
|
* **(PartitionNodeLimit)**: Exceeds all resources available on this partition.
|
|
Run `scancel` and resubmit to a different partition (`-p`) or with fewer
|
|
resources.
|
|
|
|
Check in the **man** pages (``man squeue``) for all possible options for this command.
|
|
|
|
<details>
|
|
<summary>[Show 'squeue' example]</summary>
|
|
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
|
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
|
|
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
|
|
134332544 general spawner- feichtin R 5-06:47:45 1 merlin-c-204
|
|
134321376 general subm-tal feichtin R 5-22:27:59 1 merlin-c-204
|
|
</pre>
|
|
</details>
|
|
|
|
### Partition status
|
|
|
|
The status of the nodes and partitions (a.k.a. queues) can be seen with the ``sinfo`` command:
|
|
|
|
```bash
|
|
sinfo
|
|
```
|
|
|
|
Check in the **man** pages (``man sinfo``) for all possible options for this command.
|
|
|
|
<details>
|
|
<summary>[Show 'sinfo' example]</summary>
|
|
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
|
[root@merlin-l-001 ~]# sinfo -l
|
|
Thu Jan 23 16:34:49 2020
|
|
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
|
|
test up 1-00:00:00 1-infinite no NO all 3 mixed merlin-c-[024,223-224]
|
|
test up 1-00:00:00 1-infinite no NO all 2 allocated merlin-c-[123-124]
|
|
test up 1-00:00:00 1-infinite no NO all 1 idle merlin-c-023
|
|
general* up 7-00:00:00 1-50 no NO all 6 mixed merlin-c-[007,204,207-209,219]
|
|
general* up 7-00:00:00 1-50 no NO all 57 allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
|
|
general* up 7-00:00:00 1-50 no NO all 3 idle merlin-c-[006,021-022]
|
|
daily up 1-00:00:00 1-60 no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
|
|
daily up 1-00:00:00 1-60 no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
|
|
daily up 1-00:00:00 1-60 no NO all 4 idle merlin-c-[006,021-023]
|
|
hourly up 1:00:00 1-infinite no NO all 9 mixed merlin-c-[007,024,204,207-209,219,223-224]
|
|
hourly up 1:00:00 1-infinite no NO all 59 allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
|
|
hourly up 1:00:00 1-infinite no NO all 4 idle merlin-c-[006,021-023]
|
|
gpu up 7-00:00:00 1-infinite no NO all 1 mixed merlin-g-007
|
|
gpu up 7-00:00:00 1-infinite no NO all 8 allocated merlin-g-[001-006,008-009]
|
|
</pre>
|
|
</details>
|
|
|
|
### Job accounting
|
|
|
|
Users can check detailed information of jobs (pending, running, completed, failed, etc.) with the `sacct` command.
|
|
This command is very flexible and can provide a lot of information. For checking all the available options, please read `man sacct`.
|
|
Below, we summarize some examples that can be useful for the users:
|
|
|
|
```bash
|
|
# Today jobs, basic summary
|
|
sacct
|
|
|
|
# Today jobs, with details
|
|
sacct --long
|
|
|
|
# Jobs from January 1, 2022, 12pm, with details
|
|
sacct -S 2021-01-01T12:00:00 --long
|
|
|
|
# Specific job accounting
|
|
sacct --long -j $jobid
|
|
|
|
# Jobs custom details, without steps (-X)
|
|
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
|
|
|
|
# Jobs custom details, with steps
|
|
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
|
|
```
|
|
|
|
### Job efficiency
|
|
|
|
Users can check how efficient are their jobs. For that, the ``seff`` command is available.
|
|
|
|
```bash
|
|
seff $jobid
|
|
```
|
|
|
|
<details>
|
|
<summary>[Show 'seff' example]</summary>
|
|
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
|
[root@merlin-slurmctld01 ~]# seff 134333893
|
|
Job ID: 134333893
|
|
Cluster: merlin6
|
|
User/Group: albajacas_a/unx-sls
|
|
State: COMPLETED (exit code 0)
|
|
Nodes: 1
|
|
Cores per node: 8
|
|
CPU Utilized: 00:26:15
|
|
CPU Efficiency: 49.47% of 00:53:04 core-walltime
|
|
Job Wall-clock time: 00:06:38
|
|
Memory Utilized: 60.73 MB
|
|
Memory Efficiency: 0.19% of 31.25 GB
|
|
</pre>
|
|
</details>
|
|
|
|
### List job attributes
|
|
|
|
The ``sjstat`` command is used to display statistics of jobs under control of SLURM. To use it
|
|
|
|
```bash
|
|
jstat
|
|
```
|
|
|
|
<details>
|
|
<summary>[Show 'sjstat' example]</summary>
|
|
<pre class="terminal code highlight js-syntax-highlight plaintext" lang="plaintext" markdown="false">
|
|
[root@merlin-l-001 ~]# sjstat -v
|
|
|
|
Scheduling pool data:
|
|
----------------------------------------------------------------------------------
|
|
Total Usable Free Node Time Other
|
|
Pool Memory Cpus Nodes Nodes Nodes Limit Limit traits
|
|
----------------------------------------------------------------------------------
|
|
test 373502Mb 88 6 6 1 UNLIM 1-00:00:00
|
|
general* 373502Mb 88 66 66 8 50 7-00:00:00
|
|
daily 373502Mb 88 72 72 9 60 1-00:00:00
|
|
hourly 373502Mb 88 72 72 9 UNLIM 01:00:00
|
|
gpu 128000Mb 8 1 1 0 UNLIM 7-00:00:00
|
|
gpu 128000Mb 20 8 8 0 UNLIM 7-00:00:00
|
|
|
|
Running job data:
|
|
---------------------------------------------------------------------------------------------------
|
|
Time Time Time
|
|
JobID User Procs Pool Status Used Limit Started Master/Other
|
|
---------------------------------------------------------------------------------------------------
|
|
13433377 collu_g 1 gpu PD 0:00 24:00:00 N/A (Resources)
|
|
13433389 collu_g 20 gpu PD 0:00 24:00:00 N/A (Resources)
|
|
13433382 jaervine 4 gpu PD 0:00 24:00:00 N/A (Priority)
|
|
13433386 barret_d 20 gpu PD 0:00 24:00:00 N/A (Priority)
|
|
13433382 pamula_f 20 gpu PD 0:00 168:00:00 N/A (Priority)
|
|
13433387 pamula_f 4 gpu PD 0:00 24:00:00 N/A (Priority)
|
|
13433365 andreani 132 daily PD 0:00 24:00:00 N/A (Dependency)
|
|
13433388 marino_j 6 gpu R 1:43:12 168:00:00 01-23T14:54:57 merlin-g-007
|
|
13433377 choi_s 40 gpu R 2:09:55 48:00:00 01-23T14:28:14 merlin-g-006
|
|
13433373 qi_c 20 gpu R 7:00:04 24:00:00 01-23T09:38:05 merlin-g-004
|
|
13433390 jaervine 2 gpu R 5:18 24:00:00 01-23T16:32:51 merlin-g-007
|
|
13433390 jaervine 2 gpu R 15:18 24:00:00 01-23T16:22:51 merlin-g-007
|
|
13433375 bellotti 4 gpu R 7:35:44 9:00:00 01-23T09:02:25 merlin-g-001
|
|
13433358 bellotti 1 gpu R 1-05:52:19 144:00:00 01-22T10:45:50 merlin-g-007
|
|
13433377 lavriha_ 20 gpu R 5:13:24 24:00:00 01-23T11:24:45 merlin-g-008
|
|
13433370 lavriha_ 40 gpu R 22:43:09 24:00:00 01-22T17:55:00 merlin-g-003
|
|
13433373 qi_c 20 gpu R 15:03:15 24:00:00 01-23T01:34:54 merlin-g-002
|
|
13433371 qi_c 4 gpu R 22:14:14 168:00:00 01-22T18:23:55 merlin-g-001
|
|
13433254 feichtin 2 general R 5-07:26:11 156:00:00 01-18T09:11:58 merlin-c-204
|
|
13432137 feichtin 2 general R 5-23:06:25 160:00:00 01-17T17:31:44 merlin-c-204
|
|
13433389 albajaca 32 hourly R 41:19 1:00:00 01-23T15:56:50 merlin-c-219
|
|
13433387 riemann_ 2 general R 1:51:47 4:00:00 01-23T14:46:22 merlin-c-204
|
|
13433370 jimenez_ 2 general R 23:20:45 168:00:00 01-22T17:17:24 merlin-c-106
|
|
13433381 jimenez_ 2 general R 4:55:33 168:00:00 01-23T11:42:36 merlin-c-219
|
|
13433390 sayed_m 128 daily R 21:49 10:00:00 01-23T16:16:20 merlin-c-223
|
|
13433359 adelmann 2 general R 1-05:00:09 48:00:00 01-22T11:38:00 merlin-c-204
|
|
13433377 zimmerma 2 daily R 6:13:38 24:00:00 01-23T10:24:31 merlin-c-007
|
|
13433375 zohdirad 24 daily R 7:33:16 10:00:00 01-23T09:04:53 merlin-c-218
|
|
13433363 zimmerma 6 general R 1-02:54:20 47:50:00 01-22T13:43:49 merlin-c-106
|
|
13433376 zimmerma 6 general R 7:25:42 23:50:00 01-23T09:12:27 merlin-c-007
|
|
13433371 vazquez_ 16 daily R 21:46:31 23:59:00 01-22T18:51:38 merlin-c-106
|
|
13433382 vazquez_ 16 daily R 4:09:23 23:59:00 01-23T12:28:46 merlin-c-024
|
|
13433376 jiang_j1 440 daily R 7:11:14 10:00:00 01-23T09:26:55 merlin-c-123
|
|
13433376 jiang_j1 24 daily R 7:08:19 10:00:00 01-23T09:29:50 merlin-c-220
|
|
13433384 kranjcev 440 daily R 2:48:19 24:00:00 01-23T13:49:50 merlin-c-108
|
|
13433371 vazquez_ 16 general R 20:15:15 120:00:00 01-22T20:22:54 merlin-c-210
|
|
13433371 vazquez_ 16 general R 21:15:51 120:00:00 01-22T19:22:18 merlin-c-210
|
|
13433374 colonna_ 176 daily R 8:23:18 24:00:00 01-23T08:14:51 merlin-c-211
|
|
13433374 bures_l 88 daily R 10:45:06 24:00:00 01-23T05:53:03 merlin-c-001
|
|
13433375 derlet 88 daily R 7:32:05 24:00:00 01-23T09:06:04 merlin-c-107
|
|
13433373 derlet 88 daily R 17:21:57 24:00:00 01-22T23:16:12 merlin-c-002
|
|
13433373 derlet 88 daily R 18:13:05 24:00:00 01-22T22:25:04 merlin-c-112
|
|
13433365 andreani 264 daily R 4:10:08 24:00:00 01-23T12:28:01 merlin-c-003
|
|
13431187 mahrous_ 88 general R 6-15:59:16 168:00:00 01-17T00:38:53 merlin-c-111
|
|
13433387 kranjcev 2 general R 1:48:47 4:00:00 01-23T14:49:22 merlin-c-204
|
|
13433368 karalis_ 352 general R 1-00:05:22 96:00:00 01-22T16:32:47 merlin-c-013
|
|
13433367 karalis_ 352 general R 1-00:06:44 96:00:00 01-22T16:31:25 merlin-c-118
|
|
13433385 karalis_ 352 general R 1:37:24 96:00:00 01-23T15:00:45 merlin-c-213
|
|
13433374 sato 256 general R 14:55:55 24:00:00 01-23T01:42:14 merlin-c-204
|
|
13433374 sato 64 general R 10:43:35 24:00:00 01-23T05:54:34 merlin-c-106
|
|
67723568 sato 32 general R 10:40:07 24:00:00 01-23T05:58:02 merlin-c-007
|
|
13433265 khanppna 440 general R 3-18:20:58 168:00:00 01-19T22:17:11 merlin-c-008
|
|
13433375 khanppna 704 general R 7:31:24 24:00:00 01-23T09:06:45 merlin-c-101
|
|
13433371 khanppna 616 general R 21:40:33 24:00:00 01-22T18:57:36 merlin-c-208
|
|
</pre>
|
|
</details>
|
|
|
|
### Graphical user interface
|
|
|
|
When using **ssh** with X11 forwarding (``ssh -XY``) users can use ``sview``. **SView** is a graphical user
|
|
interface to view and modify Slurm state. To run **sview**:
|
|
|
|
```bash
|
|
ssh -XY $username@merlin-l-001.psi.ch
|
|
sview
|
|
```
|
|
|
|

|
|
|
|
|
|
## General Monitoring
|
|
|
|
The following pages contain basic monitoring for Slurm and computing nodes.
|
|
Currently, monitoring is based on Grafana + InfluxDB. In the future it will
|
|
be moved to a different service based on ElasticSearch + LogStash + Kibana.
|
|
|
|
In the meantime, the following monitoring pages are available in a best effort
|
|
support:
|
|
|
|
### Merlin6 Monitoring Pages
|
|
|
|
* Slurm monitoring:
|
|
* ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)***
|
|
* [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s)
|
|
* [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s)
|
|
* Nodes monitoring:
|
|
* [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s)
|
|
* [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s)
|
|
|
|
### Merlin5 Monitoring Pages
|
|
|
|
* Slurm monitoring:
|
|
* [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s)
|
|
* [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s)
|
|
* Nodes monitoring:
|
|
* [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)
|