gitea-pages/admin-guide/troubleshooting/methods-and-tools.rst
2021-05-05 14:24:27 +02:00

Methods and Tools

This section covers the general methods and tools available for troubleshooting RHEL systems.

Methodology

When solving problems, it is helpful to use a structured approach (as opposed to randomly trying things until the system seems to work again) and to keep notes of what was observed and changed.

The Google SRE book has useful information, especially the chapter "Effective Troubleshooting".

Tools

Services

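Services can be inspected with systemctl(1). For example, systemctl status shows a unit's state, its processes, and its most recent log lines:

# systemctl status sssd.service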

● sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-06-21 16:26:48 CEST; 5 days ago
 Main PID: 691 (sssd)
   CGroup: /system.slice/sssd.service
           ├─691 /usr/sbin/sssd -i --logger=files
           ├─746 /usr/libexec/sssd/sssd_be --domain D.PSI.CH --uid 0 --gid 0 --logger=files
           ├─758 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
           └─759 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Starting System Security Services Daemon...
Jun 21 16:26:48 lxdev05.psi.ch sssd[691]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[be[D.PSI.CH]][746]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[pam][759]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[nss][758]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Started System Security Services Daemon.
Jun 25 10:59:22 lxdev05.psi.ch [sssd[krb5_child[5223]]][5223]: Preauthentication failed
Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed
Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed
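
For scripted checks, systemctl also provides the is-active and is-enabled subcommands, which print a single state word per unit. The following is a minimal sketch under that assumption; the helper name unit_state is made up for this example and is not part of systemd.

```shell
# unit_state UNIT...
# Print one line per unit with its active and enabled state.
# The function name is hypothetical; is-active and is-enabled are
# standard systemctl subcommands.
unit_state() {
    for unit in "$@"; do
        # Both subcommands exit non-zero for inactive/disabled units,
        # so the exit status is deliberately ignored here.
        active=$(systemctl is-active "$unit" 2>/dev/null) || true
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null) || true
        printf '%s active=%s enabled=%s\n' \
            "$unit" "${active:-unknown}" "${enabled:-unknown}"
    done
}
```

It could be called as unit_state sssd.service sshd.service. To get an overview of everything that is currently broken, systemctl --failed lists all units in the failed state.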

Processes

Processes can be investigated through a variety of tools:

  1. The files in /proc/$PID/, in particular
    1. /proc/$PID/fd/*: the open files of the process
    2. /proc/$PID/environ: the process' environment
  2. strace(1) allows tracing a process' system calls.
  3. ltrace(1) allows tracing a process' library calls.

Note

Both strace(1) and ltrace(1) slow the target process down considerably, which can cause timeouts or change timing-dependent behaviour.
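
The tools above can be wrapped in a few small helpers. This is only a sketch: the function names are invented for illustration, and reading another user's /proc entries or attaching strace(1) to their processes normally requires root.

```shell
# Print the environment of process $1, one variable per line.
# /proc/PID/environ separates its entries with NUL bytes.
proc_environ() {
    tr '\0' '\n' < "/proc/$1/environ"
}

# List the open file descriptors of process $1 and what they point to.
proc_open_files() {
    ls -l "/proc/$1/fd"
}

# Attach strace to the running process $1, follow its children (-f),
# trace only file-related system calls and write the trace to file $2.
# Interrupt with Ctrl-C to detach; the target keeps running.
trace_file_calls() {
    strace -f -e trace=file -o "$2" -p "$1"
}
```

For example, proc_open_files 691 would show the descriptors of the sssd main process from the systemctl output earlier in this section.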

System and Application Logs

Starting with RHEL 7, almost all system logs end up in the journal, which can be queried with journalctl(1). One important exception is sssd(8), which provides authentication against Active Directory. Its logs can be found in /var/log/sssd.

journalctl offers a lot of functionality. The following list shows the most important features:

  1. List all reboots/show logs starting at a specific reboot:

    # journalctl --list-boots
    -10 19d173f56d314912820486b9ddfd7d6c Thu 2018-06-21 11:20:55 CEST—Thu 2018-06-21
     -9 3a5a050289314221a3863b88a0eef367 Thu 2018-06-21 11:26:33 CEST—Thu 2018-06-21
     -8 f9726e6c9ce44678ab68a2fc12b1c12c Thu 2018-06-21 11:43:38 CEST—Thu 2018-06-21
     -7 b4e6bc84ff8840adbc698992cd1900d2 Thu 2018-06-21 14:55:42 CEST—Thu 2018-06-21
     -6 81b78d0d09934937a24a73bfcd3d8ede Thu 2018-06-21 15:06:18 CEST—Thu 2018-06-21
     -5 dd78e29c073448ad9731c6c18288c97a Thu 2018-06-21 15:23:15 CEST—Thu 2018-06-21
     -4 0fc2f05d12664d3aba6364102401d5fb Thu 2018-06-21 15:29:36 CEST—Thu 2018-06-21
     -3 412bbe36d12546bab749a2a63fad99ca Thu 2018-06-21 15:34:19 CEST—Thu 2018-06-21
     -2 c5189f2006c245d7833bb8fe20e62545 Thu 2018-06-21 16:07:51 CEST—Thu 2018-06-21
     -1 7c47950edd194ff4b6a67d3556672430 Thu 2018-06-21 16:11:28 CEST—Thu 2018-06-21
      0 61ea098edc924030aafb7a822c2df0e3 Thu 2018-06-21 16:26:46 CEST—Wed 2018-06-27
    
    # journalctl -b
    -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
    Jun 21 16:26:46 lxdev05.psi.ch systemd-journal[85]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
    Jun 21 16:26:46 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset
    
    # journalctl -b -2
    -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
    Jun 21 16:07:51 lxdev05.psi.ch systemd-journal[87]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
    Jun 21 16:07:51 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset
  2. Show logs starting from a given date/time:

    # journalctl --since 2018-06-23
    # journalctl --since '2018-06-23 18:13'
  3. Show logs for a given unit, e.g. a given service:

    # journalctl -u sshd.service
    # journalctl -u pli-puppet-run.timer
  4. Show the last N messages (1000 by default):

    # journalctl -e
    # journalctl -e -n 250
  5. List all systemd timers (this is done with systemctl(1), not journalctl):

    # systemctl list-timers
    NEXT                          LEFT          LAST                          PASSED  UNIT                         ACTIVATES
    Wed 2018-06-27 16:45:01 CEST  2h 16min left Tue 2018-06-26 16:45:01 CEST  21h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
    Thu 2018-06-28 07:31:00 CEST  17h left      Wed 2018-06-27 07:31:25 CEST  6h ago  pli-puppet-run.timer         pli-puppet-run.service
    
    2 timers listed.
    Pass --all to see loaded but inactive timers, too.
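
The journalctl options listed above can be combined. As a hedged sketch, the helper below (its name is hypothetical) shows only messages of priority err or worse for one unit since the current boot, using the standard -b, -p, -u, --since and --no-pager options.

```shell
# recent_unit_errors UNIT [SINCE]
# Show messages of priority "err" and above for UNIT from the current
# boot, starting at SINCE (defaults to "today", i.e. midnight).
# The function name is made up for this example.
recent_unit_errors() {
    journalctl --no-pager -b -p err -u "$1" --since "${2:-today}"
}
```

For example, recent_unit_errors sshd.service '2018-06-23 18:13' narrows the output to errors logged by sshd after the given time.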

Filesystems and Storage

Check filesystem capacity using df(1):

# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root     8.0G  1.4G  6.7G  17% /
devtmpfs                        909M     0  909M   0% /dev
tmpfs                           920M     0  920M   0% /dev/shm
tmpfs                           920M  816K  920M   1% /run
tmpfs                           920M     0  920M   0% /sys/fs/cgroup
/dev/sda1                       976M  198M  728M  22% /boot
/dev/mapper/vg_root-lv_tmp     1014M   34M  981M   4% /tmp
/dev/mapper/vg_root-lv_var      2.9G  1.4G  1.5G  47% /var
/dev/mapper/vg_root-lv_var_log  2.0G  160M  1.9G   8% /var/log
/dev/mapper/vg_root-lv_openafs 1008M  1.3M  956M   1% /var/cache/openafs
tmpfs                           184M  4.0K  184M   1% /run/user/0

Check available inodes (IFree is roughly the number of additional files that can still be created):

# df -i
Filesystem                      Inodes   IUsed   IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root    4194304   48891 4145413    2% /
devtmpfs                        232630     383  232247    1% /dev
tmpfs                           235485       1  235484    1% /dev/shm
tmpfs                           235485     575  234910    1% /run
tmpfs                           235485      16  235469    1% /sys/fs/cgroup
/dev/sda1                        65536     348   65188    1% /boot
/dev/mapper/vg_root-lv_tmp      524288     316  523972    1% /tmp
/dev/mapper/vg_root-lv_var     1474560 1042691  431869   71% /var
/dev/mapper/vg_root-lv_var_log 1048576      81 1048495    1% /var/log
/dev/mapper/vg_root-lv_openafs   65536      11   65525    1% /var/cache/openafs
tmpfs                           235485       2  235483    1% /run/user/0
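
Both checks can be automated with a small filter. The sketch below assumes POSIX-format output (df -P or df -iP), where the usage percentage is the fifth column and the mount point the sixth; the function name over_threshold is invented for this example, and mount points containing spaces are not handled.

```shell
# over_threshold LIMIT
# Read `df -P` or `df -iP` output on stdin and print the mount point
# and usage of every filesystem at or above LIMIT percent.
# Hypothetical helper; assumes mount points without spaces.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%$/, "", use)        # strip the trailing percent sign
            if (use + 0 >= limit) print $6, $5
        }'
}
```

Typical usage would be df -P | over_threshold 90 for capacity and df -iP | over_threshold 70 for inodes. With the inode listing above, the second call would report /var at 71%.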

Networking

Test hostname resolution with getent(1), for example getent hosts www.psi.ch. Unlike nslookup(1) or dig(1), it uses the system resolver and therefore honours /etc/nsswitch.conf and /etc/hosts.
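
To test several names in one go, the getent call can be wrapped in a loop. This is a minimal sketch; the function name check_resolution is an assumption made for this example, and it relies on getent exiting non-zero when a name cannot be resolved.

```shell
# check_resolution NAME...
# Resolve each NAME through the system resolver and report the result.
# Hypothetical helper built on `getent hosts`.
check_resolution() {
    for name in "$@"; do
        if addr=$(getent hosts "$name"); then
            # Keep only the first field (the address) of the answer.
            printf 'OK   %s -> %s\n' "$name" "${addr%% *}"
        else
            printf 'FAIL %s\n' "$name"
        fi
    done
}
```

For example: check_resolution www.psi.ch lxdev05.psi.ch.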

The system's IP addresses and routes can be displayed using ip(8):

# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:50:56:9d:6d:03 brd ff:ff:ff:ff:ff:ff
    inet 10.129.160.195/24 brd 10.129.160.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe9d:6d03/64 scope link 
       valid_lft forever preferred_lft forever

# ip route
default via 10.129.160.1 dev ens160
10.129.160.0/24 dev ens160 proto kernel scope link src 10.129.160.195 
169.254.0.0/16 dev ens160 scope link metric 1002    
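
A common follow-up is to check whether the default gateway is reachable. The sketch below extracts the gateway from the routing table and pings it; both function names are made up for this example, and it assumes a single default route of the form "default via ADDRESS ...".

```shell
# Print the gateway address of the first default route.
default_gateway() {
    ip route show default |
        awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Send three echo requests to the default gateway, if there is one.
ping_gateway() {
    gw=$(default_gateway)
    [ -n "$gw" ] && ping -c 3 "$gw"
}
```

To see which route and source address the kernel would choose for a particular destination, ip route get ADDRESS can be used in the same way.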

The link status and other information of an interface can be displayed using ethtool(8):

  1. Link status:

    # ethtool ens160
    Settings for ens160:
      [...]
        Speed: 10000Mb/s
        Duplex: Full
      [...]
        Link detected: yes
  2. Statistics (driver-specific, but look for errors/discards/dropped):

    # ethtool -S ens160
    NIC statistics:
         Tx Queue#: 0
           TSO pkts tx: 21529
           TSO bytes tx: 91036062
           ucast pkts tx: 1036632
           ucast bytes tx: 235421707
           mcast pkts tx: 8
           mcast bytes tx: 648
           bcast pkts tx: 7
           bcast bytes tx: 294
           pkts tx err: 0
           pkts tx discard: 0
           drv dropped tx total: 0
              too many frags: 0
              giant hdr: 0
              hdr err: 0
              tso: 0
           ring full: 0
           pkts linearized: 0
           hdr cloned: 0
           giant hdr: 0
         Rx Queue#: 0
           LRO pkts rx: 6913
           LRO byte rx: 100534073
           ucast pkts rx: 551554
           ucast bytes rx: 161369441
           mcast pkts rx: 4
           mcast bytes rx: 344
           bcast pkts rx: 753276
           bcast bytes rx: 45787629
           pkts rx OOB: 0
           pkts rx err: 0
           drv dropped rx total: 0
              err: 0
              fcs: 0
           rx buf alloc fail: 0
         tx timeout count: 0

Packages

The integrity of installed packages can be checked with rpm(8):

# rpm -Vv pciutils
.........    /usr/sbin/lspci
.........    /usr/sbin/setpci
.........    /usr/sbin/update-pciids
.........    /usr/share/doc/pciutils-3.5.1
.........  d /usr/share/doc/pciutils-3.5.1/COPYING
.........  d /usr/share/doc/pciutils-3.5.1/ChangeLog
.........  d /usr/share/doc/pciutils-3.5.1/README
.........  d /usr/share/doc/pciutils-3.5.1/pciutils.lsm
.........  d /usr/share/man/man8/lspci.8.gz
.........  d /usr/share/man/man8/setpci.8.gz
.........  d /usr/share/man/man8/update-pciids.8.gz

Running rpm -Vav verifies all installed packages and takes a long time. See the man page for details on the output format. Note that reported changes, especially in configuration files, are often expected.
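
Since changes to configuration files are usually expected, it can be useful to filter them out. The sketch below assumes the verify output format shown above: a nine-character flag field, an optional attribute marker (c for configuration files, d for documentation), then the path. The function name changed_non_config is invented for this example.

```shell
# changed_non_config
# Read `rpm -V` output on stdin and print only entries that differ from
# the package database and are not marked as configuration files.
# Hypothetical helper; lines with nine dots are unmodified files.
changed_non_config() {
    awk '$1 != "........." && $2 != "c" { print }'
}
```

Usage: rpm -Va | changed_non_config. The package owning a reported file can then be looked up with rpm -qf PATH.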