gitea-pages/pages/merlin6/03 Job Submission/monitoring.md
2020-06-16 18:20:08 +02:00

title: Monitoring
last_updated: 20 June 2019
sidebar: merlin6_sidebar
permalink: /merlin6/monitoring.html

Slurm Monitoring

Job status

The status of submitted jobs can be checked with the squeue command:

squeue -u $username

Common statuses:

  • merlin-*: Running on the specified host
  • (Priority): Waiting in the queue
  • (Resources): At the head of the queue, waiting for machines to become available
  • (AssocGrpCpuLimit), (AssocGrpNodeLimit): Job would exceed per-user limitations on the number of simultaneous CPUs/Nodes. Use scancel to remove the job and resubmit with fewer resources, or else wait for your other jobs to finish.
  • (PartitionNodeLimit): Exceeds all resources available on this partition. Run scancel and resubmit to a different partition (-p) or with fewer resources.

See the man page (man squeue) for all available options of this command.

[Show 'squeue' example]
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134332544   general spawner- feichtin  R 5-06:47:45      1 merlin-c-204
         134321376   general subm-tal feichtin  R 5-22:27:59      1 merlin-c-204
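
The reasons listed under "Common statuses" can also be filtered with a small script. The following is a minimal sketch, assuming the default squeue column layout shown above (state in the fifth column, NODELIST(REASON) in the eighth) and job names without spaces. The name `jobs_with_reason` is a hypothetical helper, not part of Slurm. It reads squeue output on standard input and prints the IDs of pending jobs held for a given reason:

```shell
# Hypothetical helper (not part of Slurm): print the IDs of pending jobs
# whose reason matches the first argument, e.g. AssocGrpCpuLimit.
# Assumes default squeue columns: $1 = JOBID, $5 = ST, $8 = NODELIST(REASON).
jobs_with_reason() {
    awk -v reason="($1)" 'NR > 1 && $5 == "PD" && $8 == reason { print $1 }'
}
```

It could be used as `squeue -u $username | jobs_with_reason AssocGrpCpuLimit`; the printed job IDs are candidates for scancel and resubmission with fewer resources.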

Partition status

The status of the nodes and partitions (a.k.a. queues) can be seen with the sinfo command:

sinfo

See the man page (man sinfo) for all available options of this command.

[Show 'sinfo' example]
[root@merlin-l-001 ~]# sinfo -l
Thu Jan 23 16:34:49 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
test         up 1-00:00:00 1-infinite   no       NO        all      3       mixed merlin-c-[024,223-224]
test         up 1-00:00:00 1-infinite   no       NO        all      2   allocated merlin-c-[123-124]
test         up 1-00:00:00 1-infinite   no       NO        all      1        idle merlin-c-023
general*     up 7-00:00:00       1-50   no       NO        all      6       mixed merlin-c-[007,204,207-209,219]
general*     up 7-00:00:00       1-50   no       NO        all     57   allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
general*     up 7-00:00:00       1-50   no       NO        all      3        idle merlin-c-[006,021-022]
daily        up 1-00:00:00       1-60   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
daily        up 1-00:00:00       1-60   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
daily        up 1-00:00:00       1-60   no       NO        all      4        idle merlin-c-[006,021-023]
hourly       up    1:00:00 1-infinite   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
hourly       up    1:00:00 1-infinite   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
hourly       up    1:00:00 1-infinite   no       NO        all      4        idle merlin-c-[006,021-023]
gpu          up 7-00:00:00 1-infinite   no       NO        all      1       mixed merlin-g-007
gpu          up 7-00:00:00 1-infinite   no       NO        all      8   allocated merlin-g-[001-006,008-009]
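
To pick out free nodes from a long listing, the output can be filtered. This is a minimal sketch, assuming the `sinfo -l` column layout shown above (NODES in the eighth column, STATE in the ninth, NODELIST in the tenth). The name `idle_nodes` is a hypothetical helper, not a Slurm command:

```shell
# Hypothetical helper (not part of Slurm): read `sinfo -l` output on stdin and
# print "<partition> <node count> <node list>" for every row in state "idle".
# The trailing "*" that marks the default partition is stripped.
idle_nodes() {
    awk '$9 == "idle" { p = $1; sub(/\*$/, "", p); print p, $8, $10 }'
}
```

It could be used as `sinfo -l | idle_nodes`. sinfo can also filter natively with its `-t` (states) and `-p` (partition) options, for example `sinfo -p general -t idle`.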

Job efficiency

Users can check how efficiently their jobs used the allocated resources with the seff command:

seff $jobid
[Show 'seff' example]
[root@merlin-slurmctld01 ~]# seff 134333893
Job ID: 134333893
Cluster: merlin6
User/Group: albajacas_a/unx-sls
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:26:15
CPU Efficiency: 49.47% of 00:53:04 core-walltime
Job Wall-clock time: 00:06:38
Memory Utilized: 60.73 MB
Memory Efficiency: 0.19% of 31.25 GB
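
The numbers in this example are consistent with CPU efficiency being the CPU time utilized divided by the core-walltime, where core-walltime is the wall-clock time multiplied by the number of cores (here 8 x 00:06:38 = 00:53:04). The sketch below reproduces that arithmetic. The function names `to_seconds` and `cpu_efficiency` are hypothetical helpers, and the sketch assumes times in [D-]HH:MM:SS format:

```shell
# Hypothetical helper: convert a [D-]HH:MM:SS time string to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        else         print $1 * 3600 + $2 * 60 + $3
    }'
}

# Hypothetical helper: CPU efficiency in percent.
# Arguments: <CPU utilized> <job wall-clock time> <number of cores>
cpu_efficiency() {
    used=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v u="$used" -v w="$wall" -v c="$3" \
        'BEGIN { printf "%.2f\n", 100 * u / (w * c) }'
}
```

With the values from the example, `cpu_efficiency 00:26:15 00:06:38 8` prints 49.47, matching the seff output.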

List job attributes

The sjstat command displays statistics of jobs under the control of Slurm. To use it:

sjstat
[Show 'sjstat' example]
[root@merlin-l-001 ~]# sjstat -v

Scheduling pool data:

                          Total  Usable   Free   Node  Time        Other
Pool         Memory Cpus  Nodes   Nodes  Nodes  Limit  Limit       traits
test       373502Mb   88      6       6      1  UNLIM  1-00:00:00
general*   373502Mb   88     66      66      8     50  7-00:00:00
daily      373502Mb   88     72      72      9     60  1-00:00:00
hourly     373502Mb   88     72      72      9  UNLIM  01:00:00
gpu        128000Mb    8      1       1      0  UNLIM  7-00:00:00
gpu        128000Mb   20      8       8      0  UNLIM  7-00:00:00

Running job data:

                                                  Time       Time            Time
JobID     User      Procs  Pool     Status        Used      Limit         Started  Master/Other
13433377  collu_g       1  gpu      PD            0:00   24:00:00             N/A  (Resources)
13433389  collu_g      20  gpu      PD            0:00   24:00:00             N/A  (Resources)
13433382  jaervine      4  gpu      PD            0:00   24:00:00             N/A  (Priority)
13433386  barret_d     20  gpu      PD            0:00   24:00:00             N/A  (Priority)
13433382  pamula_f     20  gpu      PD            0:00  168:00:00             N/A  (Priority)
13433387  pamula_f      4  gpu      PD            0:00   24:00:00             N/A  (Priority)
13433365  andreani    132  daily    PD            0:00   24:00:00             N/A  (Dependency)
13433388  marino_j      6  gpu      R          1:43:12  168:00:00  01-23T14:54:57  merlin-g-007
13433377  choi_s       40  gpu      R          2:09:55   48:00:00  01-23T14:28:14  merlin-g-006
13433373  qi_c         20  gpu      R          7:00:04   24:00:00  01-23T09:38:05  merlin-g-004
13433390  jaervine      2  gpu      R             5:18   24:00:00  01-23T16:32:51  merlin-g-007
13433390  jaervine      2  gpu      R            15:18   24:00:00  01-23T16:22:51  merlin-g-007
13433375  bellotti      4  gpu      R          7:35:44    9:00:00  01-23T09:02:25  merlin-g-001
13433358  bellotti      1  gpu      R       1-05:52:19  144:00:00  01-22T10:45:50  merlin-g-007
13433377  lavriha_     20  gpu      R          5:13:24   24:00:00  01-23T11:24:45  merlin-g-008
13433370  lavriha_     40  gpu      R         22:43:09   24:00:00  01-22T17:55:00  merlin-g-003
13433373  qi_c         20  gpu      R         15:03:15   24:00:00  01-23T01:34:54  merlin-g-002
13433371  qi_c          4  gpu      R         22:14:14  168:00:00  01-22T18:23:55  merlin-g-001
13433254  feichtin      2  general  R       5-07:26:11  156:00:00  01-18T09:11:58  merlin-c-204
13432137  feichtin      2  general  R       5-23:06:25  160:00:00  01-17T17:31:44  merlin-c-204
13433389  albajaca     32  hourly   R            41:19    1:00:00  01-23T15:56:50  merlin-c-219
13433387  riemann_      2  general  R          1:51:47    4:00:00  01-23T14:46:22  merlin-c-204
13433370  jimenez_      2  general  R         23:20:45  168:00:00  01-22T17:17:24  merlin-c-106
13433381  jimenez_      2  general  R          4:55:33  168:00:00  01-23T11:42:36  merlin-c-219
13433390  sayed_m     128  daily    R            21:49   10:00:00  01-23T16:16:20  merlin-c-223
13433359  adelmann      2  general  R       1-05:00:09   48:00:00  01-22T11:38:00  merlin-c-204
13433377  zimmerma      2  daily    R          6:13:38   24:00:00  01-23T10:24:31  merlin-c-007
13433375  zohdirad     24  daily    R          7:33:16   10:00:00  01-23T09:04:53  merlin-c-218
13433363  zimmerma      6  general  R       1-02:54:20   47:50:00  01-22T13:43:49  merlin-c-106
13433376  zimmerma      6  general  R          7:25:42   23:50:00  01-23T09:12:27  merlin-c-007
13433371  vazquez_     16  daily    R         21:46:31   23:59:00  01-22T18:51:38  merlin-c-106
13433382  vazquez_     16  daily    R          4:09:23   23:59:00  01-23T12:28:46  merlin-c-024
13433376  jiang_j1    440  daily    R          7:11:14   10:00:00  01-23T09:26:55  merlin-c-123
13433376  jiang_j1     24  daily    R          7:08:19   10:00:00  01-23T09:29:50  merlin-c-220
13433384  kranjcev    440  daily    R          2:48:19   24:00:00  01-23T13:49:50  merlin-c-108
13433371  vazquez_     16  general  R         20:15:15  120:00:00  01-22T20:22:54  merlin-c-210
13433371  vazquez_     16  general  R         21:15:51  120:00:00  01-22T19:22:18  merlin-c-210
13433374  colonna_    176  daily    R          8:23:18   24:00:00  01-23T08:14:51  merlin-c-211
13433374  bures_l      88  daily    R         10:45:06   24:00:00  01-23T05:53:03  merlin-c-001
13433375  derlet       88  daily    R          7:32:05   24:00:00  01-23T09:06:04  merlin-c-107
13433373  derlet       88  daily    R         17:21:57   24:00:00  01-22T23:16:12  merlin-c-002
13433373  derlet       88  daily    R         18:13:05   24:00:00  01-22T22:25:04  merlin-c-112
13433365  andreani    264  daily    R          4:10:08   24:00:00  01-23T12:28:01  merlin-c-003
13431187  mahrous_     88  general  R       6-15:59:16  168:00:00  01-17T00:38:53  merlin-c-111
13433387  kranjcev      2  general  R          1:48:47    4:00:00  01-23T14:49:22  merlin-c-204
13433368  karalis_    352  general  R       1-00:05:22   96:00:00  01-22T16:32:47  merlin-c-013
13433367  karalis_    352  general  R       1-00:06:44   96:00:00  01-22T16:31:25  merlin-c-118
13433385  karalis_    352  general  R          1:37:24   96:00:00  01-23T15:00:45  merlin-c-213
13433374  sato        256  general  R         14:55:55   24:00:00  01-23T01:42:14  merlin-c-204
13433374  sato         64  general  R         10:43:35   24:00:00  01-23T05:54:34  merlin-c-106
67723568  sato         32  general  R         10:40:07   24:00:00  01-23T05:58:02  merlin-c-007
13433265  khanppna    440  general  R       3-18:20:58  168:00:00  01-19T22:17:11  merlin-c-008
13433375  khanppna    704  general  R          7:31:24   24:00:00  01-23T09:06:45  merlin-c-101
13433371  khanppna    616  general  R         21:40:33   24:00:00  01-22T18:57:36  merlin-c-208
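
Because the running job data can be long, it may help to aggregate it. The following is a minimal sketch, assuming the column order given in the sjstat header (JobID, User, Procs, Pool, Status, ...). The name `running_procs_per_pool` is a hypothetical helper, not part of Slurm. It sums the Procs column per pool for jobs in state R:

```shell
# Hypothetical helper (not part of Slurm): read `sjstat` output on stdin and
# print "<pool> <total procs>" for running jobs, sorted by pool name.
# Assumes $3 = Procs, $4 = Pool, $5 = Status; pool summary rows and pending
# jobs are skipped because their fifth field is never "R".
running_procs_per_pool() {
    awk '$5 == "R" { procs[$4] += $3 }
         END { for (p in procs) print p, procs[p] }' | sort
}
```

It could be used as `sjstat | running_procs_per_pool`.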

Graphical user interface

When connected through ssh with X11 forwarding (ssh -XY), users can run sview, a graphical user interface for viewing and modifying Slurm state. To run sview:

ssh -XY $username@merlin-l-001.psi.ch
sview

!['sview' graphical user interface]({{ "/images/Slurm/sview.png" }})

General Monitoring

The following pages contain basic monitoring for Slurm and computing nodes. Currently, monitoring is based on Grafana + InfluxDB. In the future it will be moved to a different service based on ElasticSearch + LogStash + Kibana.

In the meantime, the following monitoring pages are available on a best-effort basis:

Merlin6 Monitoring Pages

Merlin5 Monitoring Pages