2026-06-24 14:12:03 +02:00

Slurm Extra Metrics

Since version 25.11, SLURM provides Prometheus-compatible metrics (which can be ingested by Grafana). Full documentation: https://slurm.schedmd.com/metrics.html.

The endpoint has some limitations though:

  • node and node status are not mapped together, i.e. state information is from
  • total node count accounting metrics are limited (mainly for performance reasons)

This repo provides an exporter that complements the SLURM native metrics endpoint, i.e. it is meant to run alongside it rather than replace it.
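As an illustration of the first limitation, the sketch below shows one way a node and its state could be tied together as labels on a single metric in the Prometheus text exposition format. The metric name `slurm_node_state` and the helper names are hypothetical and chosen for this example only; they are not necessarily what the exporter in this repo emits.

```python
def escape_label(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_node_states(node_states):
    """Render a {node_name: state} mapping as Prometheus exposition text.

    The metric name `slurm_node_state` is an assumption made for this sketch.
    """
    lines = [
        "# HELP slurm_node_state Node state as a label (value is always 1).",
        "# TYPE slurm_node_state gauge",
    ]
    # Sort by node name so the output is stable between scrapes.
    for node, state in sorted(node_states.items()):
        labels = f'node="{escape_label(node)}",state="{escape_label(state.lower())}"'
        lines.append(f"slurm_node_state{{{labels}}} 1")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_node_states({"node02": "IDLE", "node01": "ALLOCATED"}), end="")
```

With node and state as labels on the same sample, a dashboard query can filter or group by either one, which is what the native endpoint does not offer according to the list above.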

Setup

The slurm_extra_metrcs.py script can be run as a service using the systemd service unit.

Important

The exporter uses scontrol to capture the metrics. In a multi-cluster setup, each SLURM cluster must be specified with the -C flag; otherwise only metrics from the first cluster are exported. This is related to a bug reported at https://support.schedmd.com/show_bug.cgi?id=25407.
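To make the per-cluster behaviour concrete, here is a minimal sketch of querying scontrol once per cluster and extracting node states. It is not the exporter's actual code: the function names are hypothetical, and it assumes that each cluster is passed on to scontrol via its `-M`/`--clusters` option and that the one-line output (`--oneliner`) contains `NodeName=` and `State=` fields.

```python
import subprocess


def scontrol_command(cluster=None):
    """Build the scontrol argument list, optionally targeting one cluster."""
    cmd = ["scontrol", "--oneliner", "show", "nodes"]
    if cluster:
        # Assumption: the cluster is selected with scontrol's -M option.
        cmd[1:1] = ["-M", cluster]
    return cmd


def parse_oneliner(text):
    """Map node name to state from `scontrol --oneliner show nodes` output.

    Tokens without '=' (e.g. words inside a free-text Reason) are ignored;
    the first occurrence of a key on a line wins.
    """
    states = {}
    for line in text.splitlines():
        fields = {}
        for token in line.split():
            if "=" in token:
                key, value = token.split("=", 1)
                fields.setdefault(key, value)
        if "NodeName" in fields and "State" in fields:
            states[fields["NodeName"]] = fields["State"]
    return states


def collect(clusters):
    """Run scontrol once per cluster and return {cluster: {node: state}}."""
    result = {}
    for cluster in clusters:
        proc = subprocess.run(
            scontrol_command(cluster),
            capture_output=True, text=True, check=True,
        )
        result[cluster] = parse_oneliner(proc.stdout)
    return result
```

Calling `collect(["alpha", "beta"])` (cluster names are placeholders) would issue one scontrol call per cluster, which mirrors why every cluster has to be listed explicitly.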

It is advised to double-check that the bind address is 127.0.0.1, i.e. localhost, to avoid exposing the endpoint to the wider network.
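The check on the bind address can also be enforced in code. The following is a sketch using only the Python standard library, not the exporter's real implementation: `is_loopback`, `make_server`, and `MetricsHandler` are hypothetical names, and the response body is a placeholder.

```python
import ipaddress
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def is_loopback(bind_address):
    """Return True if the address only accepts connections from this host."""
    try:
        return ipaddress.ip_address(bind_address).is_loopback
    except ValueError:
        # Not an IP literal; accept only the conventional hostname.
        return bind_address == "localhost"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve a placeholder metrics page on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = b"# placeholder: no metrics collected in this sketch\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(bind_address="127.0.0.1", port=0):
    """Create the HTTP server, refusing any non-loopback bind address.

    Port 0 lets the OS pick a free port; this sketch covers IPv4 only.
    """
    if not is_loopback(bind_address):
        raise ValueError(f"refusing to bind to non-loopback address {bind_address!r}")
    return ThreadingHTTPServer((bind_address, port), MetricsHandler)
```

Binding to 0.0.0.0 would make the endpoint reachable from every network interface, so failing early on such a value is a cheap safeguard.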
