Methods and Tools
This section covers the general methods and tools available for troubleshooting RHEL systems.
Methodology
When solving problems, it is helpful to use a structured approach (as opposed to randomly trying things until the system seems to work again) and to keep notes.
The Google SRE book has useful information, especially the chapter on troubleshooting.
Tools
Services
Services can be inspected with systemctl(1). Example:
# systemctl status sssd.service
● sssd.service - System Security Services Daemon
Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-06-21 16:26:48 CEST; 5 days ago
Main PID: 691 (sssd)
CGroup: /system.slice/sssd.service
├─691 /usr/sbin/sssd -i --logger=files
├─746 /usr/libexec/sssd/sssd_be --domain D.PSI.CH --uid 0 --gid 0 --logger=files
├─758 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
└─759 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files
Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Starting System Security Services Daemon...
Jun 21 16:26:48 lxdev05.psi.ch sssd[691]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[be[D.PSI.CH]][746]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[pam][759]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch sssd[nss][758]: Starting up
Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Started System Security Services Daemon.
Jun 25 10:59:22 lxdev05.psi.ch [sssd[krb5_child[5223]]][5223]: Preauthentication failed
Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed
Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed
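When it is not yet clear which service is misbehaving, listing every failed unit can be a useful first step. The following is a minimal sketch: the helper name failed_units is hypothetical (it is not part of systemd), and it assumes the usual column order of systemctl list-units (UNIT, LOAD, ACTIVE, SUB, DESCRIPTION). It reads from standard input so that it can also be tried on saved output.

```shell
# Hypothetical helper: print the unit name (column 1) of every line whose
# ACTIVE state (column 3) is "failed".
# Expected input: systemctl list-units --all --no-legend --plain
failed_units() {
    awk '$3 == "failed" { print $1 }'
}

# Usage on a live system:
#   systemctl list-units --all --no-legend --plain | failed_units
```

Each unit printed this way can then be inspected individually with systemctl status.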
Processes
Processes can be investigated through a variety of tools:
- The files in /proc/$PID/, in particular
  - /proc/$PID/fd/*: the open files of the process
  - /proc/$PID/environ: the process' environment
- strace(1) allows tracing a process' system calls.
- ltrace(1) allows tracing a process' library calls.
Note
Both strace(1) and ltrace(1) slow the target process down a lot, which might cause
problems.
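The /proc files mentioned above can be read with ordinary tools. The following sketch wraps two of them in small helpers; the function names proc_environ and proc_open_files are hypothetical. Inspecting processes owned by another user generally requires root.

```shell
# Hypothetical helper: print the environment of process $1, one VAR=value
# per line. Entries in /proc/PID/environ are separated by NUL bytes.
proc_environ() {
    tr '\0' '\n' < "/proc/$1/environ"
}

# Hypothetical helper: list the open file descriptors of process $1 and
# the files, sockets or pipes they point to.
proc_open_files() {
    ls -l "/proc/$1/fd"
}

# Usage: inspect the current shell itself.
#   proc_environ "$$"
#   proc_open_files "$$"
```

Unlike strace(1) and ltrace(1), reading these files does not slow the target process down, so it is a reasonable first look before attaching a tracer.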
System and Application Logs
Starting with RHEL 7 almost all system logs end up in the journal,
which can be queried with journalctl. One important
exception is sssd(8), which provides authentication against
Active Directory. Its logs can be found in
/var/log/sssd.
journalctl offers a lot of functionality. The following list shows the most
important features:
- List all reboots/show logs starting at a specific reboot:
  # journalctl --list-boots
  -10 19d173f56d314912820486b9ddfd7d6c Thu 2018-06-21 11:20:55 CEST—Thu 2018-06-21
   -9 3a5a050289314221a3863b88a0eef367 Thu 2018-06-21 11:26:33 CEST—Thu 2018-06-21
   -8 f9726e6c9ce44678ab68a2fc12b1c12c Thu 2018-06-21 11:43:38 CEST—Thu 2018-06-21
   -7 b4e6bc84ff8840adbc698992cd1900d2 Thu 2018-06-21 14:55:42 CEST—Thu 2018-06-21
   -6 81b78d0d09934937a24a73bfcd3d8ede Thu 2018-06-21 15:06:18 CEST—Thu 2018-06-21
   -5 dd78e29c073448ad9731c6c18288c97a Thu 2018-06-21 15:23:15 CEST—Thu 2018-06-21
   -4 0fc2f05d12664d3aba6364102401d5fb Thu 2018-06-21 15:29:36 CEST—Thu 2018-06-21
   -3 412bbe36d12546bab749a2a63fad99ca Thu 2018-06-21 15:34:19 CEST—Thu 2018-06-21
   -2 c5189f2006c245d7833bb8fe20e62545 Thu 2018-06-21 16:07:51 CEST—Thu 2018-06-21
   -1 7c47950edd194ff4b6a67d3556672430 Thu 2018-06-21 16:11:28 CEST—Thu 2018-06-21
    0 61ea098edc924030aafb7a822c2df0e3 Thu 2018-06-21 16:26:46 CEST—Wed 2018-06-27
  # journalctl -b
  -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
  Jun 21 16:26:46 lxdev05.psi.ch systemd-journal[85]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
  Jun 21 16:26:46 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset
  # journalctl -b -2
  -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
  Jun 21 16:07:51 lxdev05.psi.ch systemd-journal[87]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
  Jun 21 16:07:51 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset
- Show logs starting from a given date/time:
  # journalctl --since 2018-06-23
  # journalctl --since '2018-06-23 18:13'
- Show logs for a given unit, e.g. a given service:
  # journalctl -u sshd.service
  # journalctl -u pli-puppet-run.timer
- Show the last N messages (1000 by default):
  # journalctl -e
  # journalctl -e -n 250
- List all systemd timers:
  # systemctl list-timers
  NEXT                          LEFT           LAST                          PASSED   UNIT                          ACTIVATES
  Wed 2018-06-27 16:45:01 CEST  2h 16min left  Tue 2018-06-26 16:45:01 CEST  21h ago  systemd-tmpfiles-clean.timer  systemd-tmpfiles-clean.service
  Thu 2018-06-28 07:31:00 CEST  17h left       Wed 2018-06-27 07:31:25 CEST  6h ago   pli-puppet-run.timer          pli-puppet-run.service

  2 timers listed.
  Pass --all to see loaded but inactive timers, too.
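When a journal is large, it can help to see which programs log the most before reading individual messages. The following is a minimal sketch: the helper name top_senders is hypothetical, and it assumes journalctl's default short output format, where the fifth whitespace-separated field is the sender followed by its PID in brackets. Continuation lines of multi-line messages are not handled.

```shell
# Hypothetical helper: count journal lines per sender, most active first.
# Expected input: journalctl output in the default "short" format.
top_senders() {
    awk '
        /^-- / { next }                 # skip "-- Logs begin at ..." headers
        NF >= 5 {
            id = $5
            sub(/:$/, "", id)           # drop the trailing colon
            sub(/\[[0-9]+\]$/, "", id)  # drop the trailing [PID]
            count[id]++
        }
        END { for (id in count) print count[id], id }
    ' | sort -rn
}

# Usage on a live system:
#   journalctl -b --no-pager | top_senders | head
```

A sender that dominates the list is often a good candidate for a closer look with journalctl -u.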
Filesystems and Storage
Check filesystem capacity using df(1):
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root 8.0G 1.4G 6.7G 17% /
devtmpfs 909M 0 909M 0% /dev
tmpfs 920M 0 920M 0% /dev/shm
tmpfs 920M 816K 920M 1% /run
tmpfs 920M 0 920M 0% /sys/fs/cgroup
/dev/sda1 976M 198M 728M 22% /boot
/dev/mapper/vg_root-lv_tmp 1014M 34M 981M 4% /tmp
/dev/mapper/vg_root-lv_var 2.9G 1.4G 1.5G 47% /var
/dev/mapper/vg_root-lv_var_log 2.0G 160M 1.9G 8% /var/log
/dev/mapper/vg_root-lv_openafs 1008M 1.3M 956M 1% /var/cache/openafs
tmpfs 184M 4.0K 184M 1% /run/user/0
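On systems with many mount points, it can be easier to filter the df output than to read the whole table. The following is a minimal sketch: the helper name over_threshold is hypothetical, and it assumes the POSIX output format of df -P, where the fifth column is the usage percentage and the sixth is the mount point. Mount points containing spaces are not handled.

```shell
# Hypothetical helper: print mount point and usage of every filesystem
# whose usage is at or above $1 percent.
# Expected input: df -P (the same column layout applies to df -P -i).
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # "47%" -> "47"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a live system:
#   df -P | over_threshold 80
```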
Check available inodes (roughly the maximum number of files that can be created):
# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 4194304 48891 4145413 2% /
devtmpfs 232630 383 232247 1% /dev
tmpfs 235485 1 235484 1% /dev/shm
tmpfs 235485 575 234910 1% /run
tmpfs 235485 16 235469 1% /sys/fs/cgroup
/dev/sda1 65536 348 65188 1% /boot
/dev/mapper/vg_root-lv_tmp 524288 316 523972 1% /tmp
/dev/mapper/vg_root-lv_var 1474560 1042691 431869 71% /var
/dev/mapper/vg_root-lv_var_log 1048576 81 1048495 1% /var/log
/dev/mapper/vg_root-lv_openafs 65536 11 65525 1% /var/cache/openafs
tmpfs 235485 2 235483 1% /run/user/0
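When inode usage is high on one filesystem (such as /var in the output above), the next question is usually which directory holds all the files. The following is a minimal sketch: the helper name inode_hogs is hypothetical. It counts entries below each immediate subdirectory and stays on one filesystem via -xdev. File names containing newlines would be miscounted.

```shell
# Hypothetical helper: for each immediate subdirectory of $1, print the
# number of filesystem entries it contains (the directory itself plus
# everything below it), largest first.
inode_hogs() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        printf '%s %s\n' "$(find "$dir" -xdev | wc -l | tr -d ' ')" "$dir"
    done | sort -rn
}

# Usage on a live system (may take a while on large trees):
#   inode_hogs /var | head
```

Running the helper again on the largest subdirectory narrows the search step by step.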
Networking
Test hostname resolution with getent(1), for example
getent hosts www.psi.ch. Unlike nslookup(1) or dig(1), it uses the
system resolver.
The system's IP addresses and routes can be displayed using ip(8):
# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:50:56:9d:6d:03 brd ff:ff:ff:ff:ff:ff
inet 10.129.160.195/24 brd 10.129.160.255 scope global ens160
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fe9d:6d03/64 scope link
valid_lft forever preferred_lft forever
# ip route
default via 10.129.160.1 dev ens160
10.129.160.0/24 dev ens160 proto kernel scope link src 10.129.160.195
169.254.0.0/16 dev ens160 scope link metric 1002
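For a quick overview, or for use in scripts, the one-line output mode of ip(8) is easier to process than the multi-line default. The following is a minimal sketch: the helper name iface_addrs is hypothetical, and it assumes the ip -o layout in which the second field is the interface name and the fourth is the address with its prefix length.

```shell
# Hypothetical helper: print "interface address/prefix" for every address.
# Expected input: ip -o address show
iface_addrs() {
    awk '{ print $2, $4 }'
}

# Usage on a live system:
#   ip -o address show | iface_addrs
```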
The link status and other information of an interface can be
displayed using ethtool(8):
- Link status:
  # ethtool ens160
  Settings for ens160:
  [...]
  Speed: 10000Mb/s
  Duplex: Full
  [...]
  Link detected: yes
- Statistics (driver-specific, but look for errors/discards/dropped):
  # ethtool -S ens160
  NIC statistics:
  Tx Queue#: 0
  TSO pkts tx: 21529
  TSO bytes tx: 91036062
  ucast pkts tx: 1036632
  ucast bytes tx: 235421707
  mcast pkts tx: 8
  mcast bytes tx: 648
  bcast pkts tx: 7
  bcast bytes tx: 294
  pkts tx err: 0
  pkts tx discard: 0
  drv dropped tx total: 0
  too many frags: 0
  giant hdr: 0
  hdr err: 0
  tso: 0
  ring full: 0
  pkts linearized: 0
  hdr cloned: 0
  giant hdr: 0
  Rx Queue#: 0
  LRO pkts rx: 6913
  LRO byte rx: 100534073
  ucast pkts rx: 551554
  ucast bytes rx: 161369441
  mcast pkts rx: 4
  mcast bytes rx: 344
  bcast pkts rx: 753276
  bcast bytes rx: 45787629
  pkts rx OOB: 0
  pkts rx err: 0
  drv dropped rx total: 0
  err: 0
  fcs: 0
  rx buf alloc fail: 0
  tx timeout count: 0
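Because the statistics list is long and most counters are harmless, it can help to show only the error-like counters that are not zero. The following is a minimal sketch: the helper name nonzero_errors is hypothetical, and the keyword list is an assumption based on the counter names in the output above. Other drivers use different names, so the pattern may need adjusting.

```shell
# Hypothetical helper: print counters whose name suggests a problem
# (err, drop, discard, fail, fcs) and whose value is not zero.
# Expected input: ethtool -S <interface>
nonzero_errors() {
    awk -F: '
        tolower($1) ~ /err|drop|discard|fail|fcs/ && $NF + 0 != 0 {
            sub(/^[ \t]+/, "")          # strip leading indentation
            print
        }
    '
}

# Usage on a live system:
#   ethtool -S ens160 | nonzero_errors
```

No output means that none of the matched counters has a non-zero value.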
Packages
The integrity of installed packages can be checked with rpm(8):
# rpm -Vv pciutils
......... /usr/sbin/lspci
......... /usr/sbin/setpci
......... /usr/sbin/update-pciids
......... /usr/share/doc/pciutils-3.5.1
......... d /usr/share/doc/pciutils-3.5.1/COPYING
......... d /usr/share/doc/pciutils-3.5.1/ChangeLog
......... d /usr/share/doc/pciutils-3.5.1/README
......... d /usr/share/doc/pciutils-3.5.1/pciutils.lsm
......... d /usr/share/man/man8/lspci.8.gz
......... d /usr/share/man/man8/setpci.8.gz
......... d /usr/share/man/man8/update-pciids.8.gz
Running rpm -Vav verifies all installed packages, which takes a long time.
See the man page for details on the output format. Note that changes,
especially in configuration files, can be normal.
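Since modified configuration files are often expected, it can be useful to hide them and look only at the remaining differences. The following is a minimal sketch: the helper name non_config_changes is hypothetical, and it assumes the rpm verify output layout in which configuration files carry the attribute marker c as the second field.

```shell
# Hypothetical helper: drop lines for configuration files (marker "c")
# and print everything else unchanged.
# Expected input: rpm -Va (without -v, so only differing files are listed)
non_config_changes() {
    awk '$2 != "c"'
}

# Usage on a live system (takes a long time):
#   rpm -Va | non_config_changes
```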