---
title: Monitoring
keywords: monitoring, jobs, slurm, job status, squeue, sinfo, sacct
last_updated: 07 September 2022
sidebar: merlin6_sidebar
permalink: /merlin6/monitoring.html
---

Slurm Monitoring

Job status

The status of submitted jobs can be checked with the squeue command:

squeue -u $username

Common statuses:

  • merlin-*: Running on the specified host
  • (Priority): Waiting in the queue
  • (Resources): At the head of the queue, waiting for machines to become available
  • (AssocGrpCpuLimit), (AssocGrpNodeLimit): Job would exceed per-user limitations on the number of simultaneous CPUs/Nodes. Use scancel to remove the job and resubmit with fewer resources, or else wait for your other jobs to finish.
  • (PartitionNodeLimit): Exceeds all resources available on this partition. Run scancel and resubmit to a different partition (-p) or with fewer resources.

Check the man pages (man squeue) for all possible options for this command.

[Show 'squeue' example]
[root@merlin-slurmctld01 ~]# squeue -u feichtinger
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
         134332544   general spawner- feichtin  R 5-06:47:45      1 merlin-c-204
         134321376   general subm-tal feichtin  R 5-22:27:59      1 merlin-c-204
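
If a job is stuck with one of the limit-related reasons above, it can be cancelled and resubmitted with fewer resources or to another partition. A minimal sketch (the job ID is taken from the example above, and myjob.sh is a placeholder script name):

# Show only your pending jobs, including the reason why they are waiting
squeue -u $USER -t PENDING -o "%.18i %.9P %.12j %.8T %.20r"

# Cancel a job that would exceed the per-user or partition limits...
scancel 134332544

# ...and resubmit it with fewer resources or to another partition
sbatch --ntasks=8 --partition=daily myjob.sh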

Partition status

The status of the nodes and partitions (a.k.a. queues) can be seen with the sinfo command:

sinfo

Check the man pages (man sinfo) for all possible options for this command.

[Show 'sinfo' example]
[root@merlin-l-001 ~]# sinfo -l
Thu Jan 23 16:34:49 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
test         up 1-00:00:00 1-infinite   no       NO        all      3       mixed merlin-c-[024,223-224]
test         up 1-00:00:00 1-infinite   no       NO        all      2   allocated merlin-c-[123-124]
test         up 1-00:00:00 1-infinite   no       NO        all      1        idle merlin-c-023
general*     up 7-00:00:00       1-50   no       NO        all      6       mixed merlin-c-[007,204,207-209,219]
general*     up 7-00:00:00       1-50   no       NO        all     57   allocated merlin-c-[001-005,008-020,101-122,201-203,205-206,210-218,220-222]
general*     up 7-00:00:00       1-50   no       NO        all      3        idle merlin-c-[006,021-022]
daily        up 1-00:00:00       1-60   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
daily        up 1-00:00:00       1-60   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
daily        up 1-00:00:00       1-60   no       NO        all      4        idle merlin-c-[006,021-023]
hourly       up    1:00:00 1-infinite   no       NO        all      9       mixed merlin-c-[007,024,204,207-209,219,223-224]
hourly       up    1:00:00 1-infinite   no       NO        all     59   allocated merlin-c-[001-005,008-020,101-124,201-203,205-206,210-218,220-222]
hourly       up    1:00:00 1-infinite   no       NO        all      4        idle merlin-c-[006,021-023]
gpu          up 7-00:00:00 1-infinite   no       NO        all      1       mixed merlin-g-007
gpu          up 7-00:00:00 1-infinite   no       NO        all      8   allocated merlin-g-[001-006,008-009]
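
Beyond the default output, sinfo also accepts a custom format string, which can be handy for a quick one-line summary per partition. A small sketch using standard format fields:

# Per-partition summary: availability, time limit, node count and state
sinfo -o "%.10P %.5a %.11l %.6D %.10t"

# Long, node-oriented view of a single partition
sinfo -p hourly -N -l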

Slurm commander

The Slurm Commander (scom) is a simple but very useful open-source, text-based user interface for efficient interaction with Slurm. It is developed by the Cloud Infrastructure Project (CLIP-HPC) with external contributions. To use it, simply run one of the following commands:

scom                         # merlin6 cluster
SLURM_CLUSTERS=merlin5  scom # merlin5 cluster
SLURM_CLUSTERS=gmerlin6 scom # gmerlin6 cluster
scom -h                      # Help and extra options
scom -d 14                   # Set Job History to 14 days (instead of default 7)
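
The SLURM_CLUSTERS variable (or the equivalent -M/--clusters option) is not specific to scom; it can also be used with the standard Slurm commands to query the other clusters, for example:

squeue -M gmerlin6 -u $USER   # your jobs on the gmerlin6 cluster
sinfo -M merlin5              # partition status of the merlin5 cluster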

With this simple interface, users can interact with their jobs and get information about past and present jobs:

  • Filtering jobs by substring is possible with the / key.
  • Users can perform multiple actions on their jobs (such as cancelling, holding or requeueing a job), SSH into a node where one of their jobs is already running, or get extended details and statistics for the job itself.

Also, users can check the status of the cluster to get statistics and node usage information, as well as information about node properties.

The interface also provides a few job templates for different use cases (e.g. MPI, OpenMP, hybrid, single core). Users can modify these templates, save them locally to the current directory, and submit the job to the cluster.

{{site.data.alerts.note}}Currently, scom does not provide live updates for the [Job History] tab; to refresh it, users have to exit the application with the q key and start it again. Other tabs are updated every 5 seconds (default). In addition, the [Job History] tab contains information for the merlin6 CPU cluster only. Future updates will provide information for other clusters.{{site.data.alerts.end}}

For further information about how to use scom, please refer to the Slurm Commander Project webpage.

!['scom' text-based user interface]({{ "/images/Slurm/scom.gif" }})

Job accounting

Users can check detailed information about jobs (pending, running, completed, failed, etc.) with the sacct command. This command is very flexible and can provide a lot of information; for all the available options, please read man sacct. Below we summarize some examples that can be useful for users:

# Today jobs, basic summary
sacct

# Today jobs, with details
sacct --long

# Jobs since January 1, 2021, 12:00 (noon), with details
sacct -S 2021-01-01T12:00:00 --long

# Specific job accounting
sacct --long -j $jobid

# Jobs custom details, without steps (-X)
sacct -X --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80

# Jobs custom details, with steps
sacct --format=User%20,JobID,Jobname,partition,state,time,submit,start,end,elapsed,AveRss,MaxRss,MaxRSSTask,MaxRSSNode%20,MaxVMSize,nnodes,ncpus,ntasks,reqcpus,totalcpu,reqmem,cluster,TimeLimit,TimeLimitRaw,cputime,nodelist%50,AllocTRES%80
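
sacct also accepts time-range and state filters, which is useful for reviewing only the jobs that ended in a given state. A small sketch (the dates are placeholders):

# Failed or timed-out jobs between two dates, without steps (-X)
sacct -X -S 2021-01-01 -E 2021-01-31 --state=FAILED,TIMEOUT --format=JobID,JobName,Partition,State,ExitCode,Elapsed,NodeList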

Job efficiency

Users can check how efficient their jobs are with the seff command:

seff $jobid
[Show 'seff' example]
[root@merlin-slurmctld01 ~]# seff 134333893
Job ID: 134333893
Cluster: merlin6
User/Group: albajacas_a/unx-sls
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:26:15
CPU Efficiency: 49.47% of 00:53:04 core-walltime
Job Wall-clock time: 00:06:38
Memory Utilized: 60.73 MB
Memory Efficiency: 0.19% of 31.25 GB
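
Since seff reads the same accounting data as sacct, the two can be combined, for example to print the efficiency report of every job that ran today. A minimal sketch (note that seff only gives meaningful numbers for jobs that have already finished):

# Run seff on each of today's jobs (job steps excluded with -X, header suppressed with -n)
for jobid in $(sacct -X -n -o JobID | tr -d ' '); do
    seff "$jobid"
done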

List job attributes

The sjstat command displays statistics for jobs under the control of Slurm. To use it:

sjstat
[Show 'sjstat' example]
[root@merlin-l-001 ~]# sjstat -v

Scheduling pool data:

                          Total  Usable   Free    Node  Time        Other
Pool        Memory  Cpus  Nodes   Nodes  Nodes   Limit  Limit       traits
test      373502Mb    88      6       6      1   UNLIM  1-00:00:00
general*  373502Mb    88     66      66      8      50  7-00:00:00
daily     373502Mb    88     72      72      9      60  1-00:00:00
hourly    373502Mb    88     72      72      9   UNLIM  01:00:00
gpu       128000Mb     8      1       1      0   UNLIM  7-00:00:00
gpu       128000Mb    20      8       8      0   UNLIM  7-00:00:00

Running job data:

JobID     User      Procs  Pool      St   Time Used  Time Limit  Time Started    Master/Other
13433377  collu_g       1  gpu       PD        0:00    24:00:00  N/A             (Resources)
13433389  collu_g      20  gpu       PD        0:00    24:00:00  N/A             (Resources)
13433382  jaervine      4  gpu       PD        0:00    24:00:00  N/A             (Priority)
13433386  barret_d     20  gpu       PD        0:00    24:00:00  N/A             (Priority)
13433382  pamula_f     20  gpu       PD        0:00   168:00:00  N/A             (Priority)
13433387  pamula_f      4  gpu       PD        0:00    24:00:00  N/A             (Priority)
13433365  andreani    132  daily     PD        0:00    24:00:00  N/A             (Dependency)
13433388  marino_j      6  gpu       R      1:43:12   168:00:00  01-23T14:54:57  merlin-g-007
13433377  choi_s       40  gpu       R      2:09:55    48:00:00  01-23T14:28:14  merlin-g-006
13433373  qi_c         20  gpu       R      7:00:04    24:00:00  01-23T09:38:05  merlin-g-004
13433390  jaervine      2  gpu       R         5:18    24:00:00  01-23T16:32:51  merlin-g-007
13433390  jaervine      2  gpu       R        15:18    24:00:00  01-23T16:22:51  merlin-g-007
13433375  bellotti      4  gpu       R      7:35:44     9:00:00  01-23T09:02:25  merlin-g-001
13433358  bellotti      1  gpu       R   1-05:52:19   144:00:00  01-22T10:45:50  merlin-g-007
13433377  lavriha_     20  gpu       R      5:13:24    24:00:00  01-23T11:24:45  merlin-g-008
13433370  lavriha_     40  gpu       R     22:43:09    24:00:00  01-22T17:55:00  merlin-g-003
13433373  qi_c         20  gpu       R     15:03:15    24:00:00  01-23T01:34:54  merlin-g-002
13433371  qi_c          4  gpu       R     22:14:14   168:00:00  01-22T18:23:55  merlin-g-001
13433254  feichtin      2  general   R   5-07:26:11   156:00:00  01-18T09:11:58  merlin-c-204
13432137  feichtin      2  general   R   5-23:06:25   160:00:00  01-17T17:31:44  merlin-c-204
13433389  albajaca     32  hourly    R        41:19     1:00:00  01-23T15:56:50  merlin-c-219
13433387  riemann_      2  general   R      1:51:47     4:00:00  01-23T14:46:22  merlin-c-204
13433370  jimenez_      2  general   R     23:20:45   168:00:00  01-22T17:17:24  merlin-c-106
13433381  jimenez_      2  general   R      4:55:33   168:00:00  01-23T11:42:36  merlin-c-219
13433390  sayed_m     128  daily     R        21:49    10:00:00  01-23T16:16:20  merlin-c-223
13433359  adelmann      2  general   R   1-05:00:09    48:00:00  01-22T11:38:00  merlin-c-204
13433377  zimmerma      2  daily     R      6:13:38    24:00:00  01-23T10:24:31  merlin-c-007
13433375  zohdirad     24  daily     R      7:33:16    10:00:00  01-23T09:04:53  merlin-c-218
13433363  zimmerma      6  general   R   1-02:54:20    47:50:00  01-22T13:43:49  merlin-c-106
13433376  zimmerma      6  general   R      7:25:42    23:50:00  01-23T09:12:27  merlin-c-007
13433371  vazquez_     16  daily     R     21:46:31    23:59:00  01-22T18:51:38  merlin-c-106
13433382  vazquez_     16  daily     R      4:09:23    23:59:00  01-23T12:28:46  merlin-c-024
13433376  jiang_j1    440  daily     R      7:11:14    10:00:00  01-23T09:26:55  merlin-c-123
13433376  jiang_j1     24  daily     R      7:08:19    10:00:00  01-23T09:29:50  merlin-c-220
13433384  kranjcev    440  daily     R      2:48:19    24:00:00  01-23T13:49:50  merlin-c-108
13433371  vazquez_     16  general   R     20:15:15   120:00:00  01-22T20:22:54  merlin-c-210
13433371  vazquez_     16  general   R     21:15:51   120:00:00  01-22T19:22:18  merlin-c-210
13433374  colonna_    176  daily     R      8:23:18    24:00:00  01-23T08:14:51  merlin-c-211
13433374  bures_l      88  daily     R     10:45:06    24:00:00  01-23T05:53:03  merlin-c-001
13433375  derlet       88  daily     R      7:32:05    24:00:00  01-23T09:06:04  merlin-c-107
13433373  derlet       88  daily     R     17:21:57    24:00:00  01-22T23:16:12  merlin-c-002
13433373  derlet       88  daily     R     18:13:05    24:00:00  01-22T22:25:04  merlin-c-112
13433365  andreani    264  daily     R      4:10:08    24:00:00  01-23T12:28:01  merlin-c-003
13431187  mahrous_     88  general   R   6-15:59:16   168:00:00  01-17T00:38:53  merlin-c-111
13433387  kranjcev      2  general   R      1:48:47     4:00:00  01-23T14:49:22  merlin-c-204
13433368  karalis_    352  general   R   1-00:05:22    96:00:00  01-22T16:32:47  merlin-c-013
13433367  karalis_    352  general   R   1-00:06:44    96:00:00  01-22T16:31:25  merlin-c-118
13433385  karalis_    352  general   R      1:37:24    96:00:00  01-23T15:00:45  merlin-c-213
13433374  sato        256  general   R     14:55:55    24:00:00  01-23T01:42:14  merlin-c-204
13433374  sato         64  general   R     10:43:35    24:00:00  01-23T05:54:34  merlin-c-106
67723568  sato         32  general   R     10:40:07    24:00:00  01-23T05:58:02  merlin-c-007
13433265  khanppna    440  general   R   3-18:20:58   168:00:00  01-19T22:17:11  merlin-c-008
13433375  khanppna    704  general   R      7:31:24    24:00:00  01-23T09:06:45  merlin-c-101
13433371  khanppna    616  general   R     21:40:33    24:00:00  01-22T18:57:36  merlin-c-208

Graphical user interface

When using ssh with X11 forwarding (ssh -XY), or when using NoMachine, users can run sview. SView is a graphical user interface for viewing and modifying the Slurm state. To run sview:

ssh -XY $username@merlin-l-001.psi.ch # Not necessary when using NoMachine
sview

!['sview' graphical user interface]({{ "/images/Slurm/sview.png" }})

General Monitoring

The following pages contain basic monitoring for Slurm and the computing nodes. Currently, monitoring is based on Grafana + InfluxDB. In the future, it will be moved to a different service based on Elasticsearch + Logstash + Kibana.

In the meantime, the following monitoring pages are available with best-effort support:

Merlin6 Monitoring Pages

Merlin5 Monitoring Pages