Icinga2 Configuration

Icinga2 is in production, but checks are still being added:

  • standard Linuxfabrik checks
  • 🏗️ support for automatically installed Icinga1 checks by Puppet (see issue)
  • support for custom checks

You get an overview of your nodes in Icinga2 at monitoring.psi.ch, where you can also handle alerts, create service windows, etc.

The configuration itself, however, is not done there but in Hiera, from where it is propagated automatically.

TL;DR

I, admin of xyz.psi.ch, want ...

... monitoring with e-mails during office hours:

icinga2::enable: true
icinga2::agent::enable: true
icinga2::alerting::enable: true

... monitoring with SMS around the clock:

icinga2::enable: true
icinga2::agent::enable: true
icinga2::alerting::enable: true
icinga2::alerting::severity: 1

... just to be able to check the monitoring state on monitoring.psi.ch:

icinga2::enable: true
icinga2::agent::enable: true
icinga2::alerting::enable: false
icinga2::alerting::severity: 5

... no monitoring:

icinga2::enable: false

Basic Configuration

Enable monitoring with Icinga2 by setting

icinga2::enable: true

(which is false by default for RHEL7 and RHEL8, but true for RHEL9 and later).

This only runs the ping test, which checks whether the host is online on the network. For further checks on the host itself, the agent needs to be started:

icinga2::agent::enable: true

(here too, the default is false for RHEL7 and RHEL8, but true for RHEL9 and later).

Even then no alerts are sent out; more precisely, they are suppressed by a global, infinite service window. To change this, set

icinga2::alerting::enable: true

By default these alerts are now sent to the admins during office hours. For further fine-tuning of notifications, check out the chapters Notifications and Check Customization.

Finally, if Icinga2 should be managed without Puppet (not recommended except for Icinga2 infrastructure servers), set

icinga2::puppet: false

Web Access

Users and groups listed in aaa::admins and icinga2::web::users have access to these nodes on monitoring.psi.ch. Prefix group names with % to distinguish them from users.
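As a sketch, granting web access to one extra user and one AD group could look like this in Hiera. The names jdoe and unx-my_project are hypothetical placeholders, and the example assumes that icinga2::web::users is a plain list:

```yaml
# Hypothetical names; replace with real AD logins and groups.
icinga2::web::users:
  - 'jdoe'             # single user, by login name
  - '%unx-my_project'  # AD group, marked by the % prefix
```

The quotes matter for the group entry, because a plain (unquoted) YAML scalar may not start with %.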

Notifications

Notification Recipients

By default, notifications are sent to all admins, that is, the users and groups listed in Hiera at aaa::admins, with the exception of the default admins from common.yaml and the group unx-lx_support. If the admins should not be notified, disable the sending of messages to them with

icinga2::alerting::notify_admins: false

In addition to, or instead of, the admins, you can list notification recipients in the Hiera list icinga2::alerting::contacts. You can list

  • AD users by login name
  • AD groups with % as prefix to their name
  • plain e-mail addresses
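A minimal sketch of such a recipient list, covering all three kinds of entries. The login name, group name and e-mail address are made up for illustration; here the admins are additionally switched off, so only the listed contacts would be notified:

```yaml
# Hypothetical recipients; replace with your own.
icinga2::alerting::notify_admins: false  # optional: notify only the contacts below
icinga2::alerting::contacts:
  - 'jdoe'                     # AD user by login name
  - '%unx-my_oncall'           # AD group, % prefix (quoted, since % is special in YAML)
  - 'team-alerts@example.org'  # plain e-mail address
```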

Notification Time Restrictions

By default, notifications for warnings and alerts are sent out during office hours, that is, Monday to Friday, 08:00 - 17:00.

This can be configured in Hiera with the icinga2::alerting::severity key, which is 4 by default. The following options are possible:

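| node severity | media            | time         |
|---------------|------------------|--------------|
| 1             | SMS and e-mail   | 24x7         |
| 2             | e-mail           | 24x7         |
| 3             | e-mail           | office hours |
| 4             | e-mail           | office hours |
| 5             | no notifications | never        |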

(Currently 3 and 4 behave the same.)

Please note that for services which have the criticality variable set, the time when notifications are sent out is restricted as well:

| service criticality | time         |
|---------------------|--------------|
| -                   | 24x7         |
| A                   | 24x7         |
| B                   | office hours |
| C                   | never        |

The more restrictive of the two settings applies, e.g. a service with criticality C will never cause a notification, regardless of the node severity.
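To illustrate how node severity and service criticality combine, the following sketch puts a node on severity 1 while restricting a single service to criticality B. Going by the two tables above, notifications for 'CPU Usage' would then be limited to office hours, while services without a criticality keep the 24x7 schedule. The choice of service is only an example:

```yaml
# Node level: notify around the clock (SMS and e-mail).
icinga2::alerting::severity: 1

# Service level: this one check is restricted to office hours.
icinga2::service_check::customize:
  'CPU Usage':
    criticality: B
```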

To receive notification messages by SMS, you need to register your mobile phone number with Icinga2. You can request this by writing to icinga2-support@psi.ch. Alternatively, you will get an e-mail asking you to do so when the first SMS was supposed to be sent to you and your phone number is still missing.

Default Checks

By default we already run a comprehensive set of checks. Some of them can be fine-tuned in Hiera. Whenever you have a use case which is not covered yet, please talk to us.

Check Customization

Most checks can have custom parameters. The variables you can adapt are listed as "Custom Variables" on the page of the given service. In Hiera, the key icinga2::service_check::customize takes a multi-level hash: the first level is the service name, and below it are the variable names with their new values.

Example "CPU Usage"

Let's look at the example of the "CPU Usage" service:

[Screenshot: "CPU Usage" service page]

If the machine is a number cruncher and it is fine for the CPU to be fully utilized, you can have the check always report OK:

icinga2::service_check::customize:
  'CPU Usage':
    cpu_usage_always_ok: true

If, on the contrary, you want an immediate notification when the CPU is overused, the following snippet is more advisable:

icinga2::service_check::customize:
  'CPU Usage':
    criticality: A

If it is a Linuxfabrik plugin, you will find a link under "Notes" that points to the documentation of the check. This might shed more light on the effect of these variables.

Example "Kernel Ring Buffer (dmesg)"

Another check which easily produces false alerts, but also has great potential to signal severe kernel or hardware issues, is the check of the kernel log (dmesg).

If you conclude that a given message can safely be ignored, you may add it to the ignore list; any message matching one of the listed strings partially will then be ignored in the future:

icinga2::service_check::customize:
  'Kernel Ring Buffer (dmesg)':
    'dmesg_ignore':
      - 'blk_update_request: I/O error, dev fd0, sector 0'
      - 'integrity: Problem loading X.509 certificate -126'

If you think that this log message can be globally ignored, please inform the Linux Team so we can ignore it by default.

Note that after dealing with the issue you can reset this check by executing the following on the node:

dmesg --clear

Extra Checks

TLS/SSL Certificate Expiration

To monitor the expiration of one or more certificates, you need to give the node the additional server role ssl-cert in Hiera (except for role::jupyterserver):

icinga2::additional_server_role:
  - 'ssl-cert'

Then list the certificate files you want to have checked:

icinga2::service_check::customize:
  'TLS/SSL Certificate Expiration':
    ssl_cert_files:
      - '/etc/xrdp/cert.pem'
      - '/etc/httpd/ssl/node.crt'

Besides the file list, you may set the warning time in days with the attribute ssl_cert_warning (7 by default) and the critical time with the attribute ssl_cert_critical (3 by default).

If you run your own PKI, you might also check a CA certificate for expiration with
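A sketch of a check with earlier thresholds, assuming that ssl_cert_warning and ssl_cert_critical are set in the same hash as ssl_cert_files. The values 30 and 14 are arbitrary examples:

```yaml
icinga2::service_check::customize:
  'TLS/SSL Certificate Expiration':
    ssl_cert_files:
      - '/etc/httpd/ssl/node.crt'
    ssl_cert_warning: 30   # warn 30 days before expiry (default: 7)
    ssl_cert_critical: 14  # critical 14 days before expiry (default: 3)
```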

icinga2::additional_server_role:
  - 'ca-cert'

icinga2::service_check::customize:
  'CA Certificate Expiration':
    ssl_cert_files:
      - '/etc/my_pki/ca.pem'

Here, by default, the check warns when fewer than 180 days remain and turns critical when fewer than 30 days remain.

Check for Systemd Service Status

To check if a daemon or service has been successfully started by systemd, configure:

icinga2::custom_service:
  'XRDP Active':
    template: 'st-agent-awi-lx-service-active'
    vars:
      criticality: 'A'
      service_names:
        - 'xrdp'
        - 'xrdp-sesman'

The name (here XRDP Active) needs to be unique among all Icinga "services" of a single host. The service_names variable needs to contain the names of one or more systemd services to be monitored.

You can create several of these checks.
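For instance, two independent checks with different criticality could be defined side by side. The second entry, including its name 'Web Server Active' and the unit httpd, is a hypothetical example:

```yaml
icinga2::custom_service:
  'XRDP Active':
    template: 'st-agent-awi-lx-service-active'
    vars:
      criticality: 'A'
      service_names:
        - 'xrdp'
        - 'xrdp-sesman'
  # Hypothetical second check; the name only has to be unique on the host.
  'Web Server Active':
    template: 'st-agent-awi-lx-service-active'
    vars:
      criticality: 'B'
      service_names:
        - 'httpd'
```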

External Connection Checks (Active Checks)

For this we have fully custom service checks.

The example below is for an RDP port:

icinga2::custom_service:
  'RDP Access':
    command: 'tcp'
    agent: false
    perf_data: true
    vars:
      criticality: 'A'
      tcp_port: 3389

Possible commands are http, tcp, udp, ssl, ssh or ftp.

Note that if you want to reference the hostname, you can use a macro, e.g.:

    http_vhost: '$host.name$'

Note that macros only work for check command arguments.
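Put into context, an HTTP check using this macro might look like the sketch below. The service name 'Web Frontend' and the chosen criticality are hypothetical; the structure follows the RDP example above:

```yaml
icinga2::custom_service:
  'Web Frontend':            # hypothetical name, must be unique on the host
    command: 'http'
    agent: false             # active check from outside, not run on the agent
    vars:
      criticality: 'B'
      http_vhost: '$host.name$'  # macro, resolved to the host's name
```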

The actual service name is up to you; it only needs to be unique.

Other Custom Checks

It is also possible to create a fully custom check. Note, however, that the command or service template used needs to be available or configured by some other means on the Icinga Master. The check plugin executed on the Icinga Satellite or by the Icinga agent also needs to be already available or distributed by other means. So please reach out to the Linux Team to find out how best to do it and to ensure that everything is in place.

icinga2::custom_service:
  'My Service Check 1':
    template: 'st-agent-lf-file-size'
    vars:
      criticality: 'B'
      file_size_filename: '/var/my/growing/file'
      file_size_warning: '100M'
      file_size_critical: '200M'
  'My Service Check 2':
    command: 'tcp'
    agent: false
    vars:
      criticality: 'A'
      tcp_port: 3389
    perf_data: true

Below icinga2::custom_service, set the name of the service/service check as it will be seen in Icingaweb. The possible arguments are then

  • command: the check command to issue
  • template: the service template to inherit from
  • agent: whether the command runs on the agent or on the satellite; only used if template is not set, default is true
  • vars: hash with arguments for the service check
  • perf_data: whether performance data should be recorded and a performance graph shown, default is false

You are free to choose the actual service name; it only needs to be unique.