--- title: Monitoring #tags: #keywords: last_updated: 20 June 2019 #summary: "" sidebar: merlin6_sidebar permalink: /merlin6/monitoring.html --- ## Slurm Monitoring ### Job status The status of submitted jobs can be check with the ``squeue`` command: ```bash squeue -u $username ``` Common statuses: * **merlin-\***: Running on the specified host * **(Priority)**: Waiting in the queue * **(Resources)**: At the head of the queue, waiting for machines to become available * **(AssocGrpCpuLimit), (AssocGrpNodeLimit)**: Job would exceed per-user limitations on the number of simultaneous CPUs/Nodes. Use `scancel` to remove the job and resubmit with fewer resources, or else wait for your other jobs to finish. * **(PartitionNodeLimit)**: Exceeds all resources available on this partition. Run `scancel` and resubmit to a different partition (`-p`) or with fewer resources. Check in the **man** pages (``man squeue``) for all possible options for this command.
[Show 'squeue' example]
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134332544   general spawner- feichtin  R 5-06:47:45      1 merlin-c-204
         134321376   general subm-tal feichtin  R 5-22:27:59      1 merlin-c-204
### Partition status The status of the nodes and partitions (a.k.a. queues) can be seen with the ``sinfo`` command: ```bash sinfo ``` Check in the **man** pages (``man sinfo``) for all possible options for this command.
[Show 'sinfo' example]
[root@merlin-l-001 ~]# sinfo -l
Thu Jan 23 16:34:49 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
test         up 1-00:00:00 1-infinite   no       NO        all      3       mixed merlin-c-[024,223-224]
test         up 1-00:00:00 1-infinite   no       NO        all      2   allocated merlin-c-[123-124]
test         up 1-00:00:00 1-infinite   no       NO        all      1        idle merlin-c-023
general*     up 7-00:00:00       1-50   no       NO        all      6       mixed merlin-c-[007,204,207-209,219]
general*     up 7-00:00:00       1-50   no       NO        all     57   allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
general*     up 7-00:00:00       1-50   no       NO        all      3        idle merlin-c-[006,021-022]
daily        up 1-00:00:00       1-60   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
daily        up 1-00:00:00       1-60   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
daily        up 1-00:00:00       1-60   no       NO        all      4        idle merlin-c-[006,021-023]
hourly       up    1:00:00 1-infinite   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
hourly       up    1:00:00 1-infinite   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
hourly       up    1:00:00 1-infinite   no       NO        all      4        idle merlin-c-[006,021-023]
gpu          up 7-00:00:00 1-infinite   no       NO        all      1       mixed merlin-g-007
gpu          up 7-00:00:00 1-infinite   no       NO        all      8   allocated merlin-g-[001-006,008-009]
### Job accounting Users can check detailed information of jobs (pending, running, completed, failed, etc.) with the `sacct` command. This command is very flexible and can provide a lot of information. For checking all the available options, please read `man sacct`. Below, we summarize some examples that can be useful for the users: ```bash # Today jobs, basic summary sacct # Today jobs, with details sacct --long # Jobs from January 1, 2022, 12pm, with details sacct -S 2021-01-01T12:00:00 --long # Specific job accounting sacct --long -j $jobid # Jobs custom details, without steps (-X) sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80 # Jobs custom details, with steps sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80 ``` ### Job efficiency Users can check how efficient are their jobs. For that, the ``seff`` command is available. ```bash seff $jobid ```
[Show 'seff' example]
[root@merlin-slurmctld01 ~]# seff 134333893
Job ID: 134333893
Cluster: merlin6
User/Group: albajacas_a/unx-sls
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:26:15
CPU Efficiency: 49.47% of 00:53:04 core-walltime
Job Wall-clock time: 00:06:38
Memory Utilized: 60.73 MB
Memory Efficiency: 0.19% of 31.25 GB
### List job attributes The ``sjstat`` command is used to display statistics of jobs under control of SLURM. To use it ```bash jstat ```
[Show 'sjstat' example]
[root@merlin-l-001 ~]# sjstat -v

Scheduling pool data:
----------------------------------------------------------------------------------
                           Total  Usable   Free   Node   Time      Other          
Pool         Memory  Cpus  Nodes   Nodes  Nodes  Limit  Limit      traits         
----------------------------------------------------------------------------------
test        373502Mb    88      6       6      1  UNLIM 1-00:00:00   
general*    373502Mb    88     66      66      8     50 7-00:00:00   
daily       373502Mb    88     72      72      9     60 1-00:00:00   
hourly      373502Mb    88     72      72      9  UNLIM   01:00:00   
gpu         128000Mb     8      1       1      0  UNLIM 7-00:00:00   
gpu         128000Mb    20      8       8      0  UNLIM 7-00:00:00   

Running job data:
---------------------------------------------------------------------------------------------------
                                                 Time        Time            Time                  
JobID    User      Procs Pool      Status        Used       Limit         Started  Master/Other    
---------------------------------------------------------------------------------------------------
13433377 collu_g       1 gpu       PD            0:00    24:00:00             N/A  (Resources)
13433389 collu_g      20 gpu       PD            0:00    24:00:00             N/A  (Resources)
13433382 jaervine      4 gpu       PD            0:00    24:00:00             N/A  (Priority)
13433386 barret_d     20 gpu       PD            0:00    24:00:00             N/A  (Priority)
13433382 pamula_f     20 gpu       PD            0:00   168:00:00             N/A  (Priority)
13433387 pamula_f      4 gpu       PD            0:00    24:00:00             N/A  (Priority)
13433365 andreani    132 daily     PD            0:00    24:00:00             N/A  (Dependency)
13433388 marino_j      6 gpu       R          1:43:12   168:00:00  01-23T14:54:57  merlin-g-007
13433377 choi_s       40 gpu       R          2:09:55    48:00:00  01-23T14:28:14  merlin-g-006
13433373 qi_c         20 gpu       R          7:00:04    24:00:00  01-23T09:38:05  merlin-g-004
13433390 jaervine      2 gpu       R             5:18    24:00:00  01-23T16:32:51  merlin-g-007
13433390 jaervine      2 gpu       R            15:18    24:00:00  01-23T16:22:51  merlin-g-007
13433375 bellotti      4 gpu       R          7:35:44     9:00:00  01-23T09:02:25  merlin-g-001
13433358 bellotti      1 gpu       R       1-05:52:19   144:00:00  01-22T10:45:50  merlin-g-007
13433377 lavriha_     20 gpu       R          5:13:24    24:00:00  01-23T11:24:45  merlin-g-008
13433370 lavriha_     40 gpu       R         22:43:09    24:00:00  01-22T17:55:00  merlin-g-003
13433373 qi_c         20 gpu       R         15:03:15    24:00:00  01-23T01:34:54  merlin-g-002
13433371 qi_c          4 gpu       R         22:14:14   168:00:00  01-22T18:23:55  merlin-g-001
13433254 feichtin      2 general   R       5-07:26:11   156:00:00  01-18T09:11:58  merlin-c-204
13432137 feichtin      2 general   R       5-23:06:25   160:00:00  01-17T17:31:44  merlin-c-204
13433389 albajaca     32 hourly    R            41:19     1:00:00  01-23T15:56:50  merlin-c-219
13433387 riemann_      2 general   R          1:51:47     4:00:00  01-23T14:46:22  merlin-c-204
13433370 jimenez_      2 general   R         23:20:45   168:00:00  01-22T17:17:24  merlin-c-106
13433381 jimenez_      2 general   R          4:55:33   168:00:00  01-23T11:42:36  merlin-c-219
13433390 sayed_m     128 daily     R            21:49    10:00:00  01-23T16:16:20  merlin-c-223
13433359 adelmann      2 general   R       1-05:00:09    48:00:00  01-22T11:38:00  merlin-c-204
13433377 zimmerma      2 daily     R          6:13:38    24:00:00  01-23T10:24:31  merlin-c-007
13433375 zohdirad     24 daily     R          7:33:16    10:00:00  01-23T09:04:53  merlin-c-218
13433363 zimmerma      6 general   R       1-02:54:20    47:50:00  01-22T13:43:49  merlin-c-106
13433376 zimmerma      6 general   R          7:25:42    23:50:00  01-23T09:12:27  merlin-c-007
13433371 vazquez_     16 daily     R         21:46:31    23:59:00  01-22T18:51:38  merlin-c-106
13433382 vazquez_     16 daily     R          4:09:23    23:59:00  01-23T12:28:46  merlin-c-024
13433376 jiang_j1    440 daily     R          7:11:14    10:00:00  01-23T09:26:55  merlin-c-123
13433376 jiang_j1     24 daily     R          7:08:19    10:00:00  01-23T09:29:50  merlin-c-220
13433384 kranjcev    440 daily     R          2:48:19    24:00:00  01-23T13:49:50  merlin-c-108
13433371 vazquez_     16 general   R         20:15:15   120:00:00  01-22T20:22:54  merlin-c-210
13433371 vazquez_     16 general   R         21:15:51   120:00:00  01-22T19:22:18  merlin-c-210
13433374 colonna_    176 daily     R          8:23:18    24:00:00  01-23T08:14:51  merlin-c-211
13433374 bures_l      88 daily     R         10:45:06    24:00:00  01-23T05:53:03  merlin-c-001
13433375 derlet       88 daily     R          7:32:05    24:00:00  01-23T09:06:04  merlin-c-107
13433373 derlet       88 daily     R         17:21:57    24:00:00  01-22T23:16:12  merlin-c-002
13433373 derlet       88 daily     R         18:13:05    24:00:00  01-22T22:25:04  merlin-c-112
13433365 andreani    264 daily     R          4:10:08    24:00:00  01-23T12:28:01  merlin-c-003
13431187 mahrous_     88 general   R       6-15:59:16   168:00:00  01-17T00:38:53  merlin-c-111
13433387 kranjcev      2 general   R          1:48:47     4:00:00  01-23T14:49:22  merlin-c-204
13433368 karalis_    352 general   R       1-00:05:22    96:00:00  01-22T16:32:47  merlin-c-013
13433367 karalis_    352 general   R       1-00:06:44    96:00:00  01-22T16:31:25  merlin-c-118
13433385 karalis_    352 general   R          1:37:24    96:00:00  01-23T15:00:45  merlin-c-213
13433374 sato        256 general   R         14:55:55    24:00:00  01-23T01:42:14  merlin-c-204
13433374 sato         64 general   R         10:43:35    24:00:00  01-23T05:54:34  merlin-c-106
67723568 sato         32 general   R         10:40:07    24:00:00  01-23T05:58:02  merlin-c-007
13433265 khanppna    440 general   R       3-18:20:58   168:00:00  01-19T22:17:11  merlin-c-008
13433375 khanppna    704 general   R          7:31:24    24:00:00  01-23T09:06:45  merlin-c-101
13433371 khanppna    616 general   R         21:40:33    24:00:00  01-22T18:57:36  merlin-c-208
### Graphical user interface When using **ssh** with X11 forwarding (``ssh -XY``) users can use ``sview``. **SView** is a graphical user interface to view and modify Slurm state. To run **sview**: ```bash ssh -XY $username@merlin-l-001.psi.ch sview ``` !['sview' graphical user interface]({{ "/images/Slurm/sview.png" }}) ## General Monitoring The following pages contain basic monitoring for Slurm and computing nodes. Currently, monitoring is based on Grafana + InfluxDB. In the future it will be moved to a different service based on ElasticSearch + LogStash + Kibana. In the meantime, the following monitoring pages are available in a best effort support: ### Merlin6 Monitoring Pages * Slurm monitoring: * ***[Merlin6 Slurm Statistics - XDMOD](https://merlin-slurmmon01.psi.ch/)*** * [Merlin6 Slurm Live Status](https://hpc-monitor02.psi.ch/d/QNcbW1AZk/merlin6-slurm-live-status?orgId=1&refresh=10s) * [Merlin6 Slurm Overview](https://hpc-monitor02.psi.ch/d/94UxWJ0Zz/merlin6-slurm-overview?orgId=1&refresh=10s) * Nodes monitoring: * [Merlin6 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/JmvLR8gZz/merlin6-computing-cpu-nodes?orgId=1&refresh=10s) * [Merlin6 GPU Nodes Overview](https://hpc-monitor02.psi.ch/d/gOo1Z10Wk/merlin6-computing-gpu-nodes?orgId=1&refresh=10s) ### Merlin5 Monitoring Pages * Slurm monitoring: * [Merlin5 Slurm Live Status](https://hpc-monitor02.psi.ch/d/o8msZJ0Zz/merlin5-slurm-live-status?orgId=1&refresh=10s) * [Merlin5 Slurm Overview](https://hpc-monitor02.psi.ch/d/eWLEW1AWz/merlin5-slurm-overview?orgId=1&refresh=10s) * Nodes monitoring: * [Merlin5 CPU Nodes Overview](https://hpc-monitor02.psi.ch/d/ejTyWJAWk/merlin5-computing-cpu-nodes?orgId=1&refresh=10s)