===================
Methods and Tools
===================

This section covers the general methods and tools available for troubleshooting
RHEL systems.


Methodology
===========

When solving problems, it is helpful to use a structured approach (as opposed to
randomly trying things until the system seems to work again) and to keep notes.

The `Google SRE book <https://landing.google.com/sre/book.html>`_ has useful
information, especially the `chapter on troubleshooting
<https://landing.google.com/sre/book/chapters/effective-troubleshooting.html>`_.


Tools
=====


Services
--------

Services can be inspected with :manpage:`systemctl(1)`. Example output of
``systemctl status sssd.service``::

  ● sssd.service - System Security Services Daemon
     Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2018-06-21 16:26:48 CEST; 5 days ago
   Main PID: 691 (sssd)
     CGroup: /system.slice/sssd.service
             ├─691 /usr/sbin/sssd -i --logger=files
             ├─746 /usr/libexec/sssd/sssd_be --domain D.PSI.CH --uid 0 --gid 0 --logger=files
             ├─758 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
             └─759 /usr/libexec/sssd/sssd_pam --uid 0 --gid 0 --logger=files

  Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Starting System Security Services Daemon...
  Jun 21 16:26:48 lxdev05.psi.ch sssd[691]: Starting up
  Jun 21 16:26:48 lxdev05.psi.ch sssd[be[D.PSI.CH]][746]: Starting up
  Jun 21 16:26:48 lxdev05.psi.ch sssd[pam][759]: Starting up
  Jun 21 16:26:48 lxdev05.psi.ch sssd[nss][758]: Starting up
  Jun 21 16:26:48 lxdev05.psi.ch systemd[1]: Started System Security Services Daemon.
  Jun 25 10:59:22 lxdev05.psi.ch [sssd[krb5_child[5223]]][5223]: Preauthentication failed
  Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed
  Jun 25 10:59:24 lxdev05.psi.ch [sssd[krb5_child[5224]]][5224]: Preauthentication failed


Processes
---------

Processes can be investigated with a variety of tools:

1. The files in ``/proc/$PID/``, in particular

   a) ``/proc/$PID/fd/*``: the open files of the process
   b) ``/proc/$PID/environ``: the process's environment

2. :manpage:`strace(1)` traces a process's system calls.
3. :manpage:`ltrace(1)` traces a process's library calls.

.. note:: Both :manpage:`strace(1)` and :manpage:`ltrace(1)` slow the target
          process down **a lot**, which can cause problems of its own (for
          example timeouts).

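
The ``/proc`` entries listed above can be read with ordinary shell tools. The
following is a minimal sketch rather than a definitive recipe: the function
names ``proc_env`` and ``proc_fds`` are made up for this example, and
inspecting another user's process usually requires root.

.. code-block:: shell

   # Hypothetical helper: print the environment of process $1, one
   # VAR=value per line. Entries in /proc/PID/environ are NUL-separated.
   proc_env() {
       tr '\0' '\n' < "/proc/$1/environ"
   }

   # Hypothetical helper: list the open file descriptors of process $1
   # together with the file, socket or pipe each one points to.
   proc_fds() {
       for fd in "/proc/$1/fd"/*; do
           [ -L "$fd" ] || continue    # skip the unexpanded pattern if empty
           printf '%s -> %s\n' "${fd##*/}" "$(readlink "$fd")"
       done
   }

For instance, ``proc_env $$`` prints the environment of the current shell.
Note that ``environ`` reflects the environment the process was started with,
not variables it changed later.
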

System and Application Logs
---------------------------

Starting with RHEL 7, almost all system logs end up in the journal, which can be
queried with ``journalctl``. One important exception is :manpage:`sssd(8)`,
which provides authentication against Active Directory. Its logs can be found in
``/var/log/sssd``.


:manpage:`journalctl(1)` offers a lot of functionality. The following list shows
the most important features:


1. List all boots, or show the logs of a specific boot (``-b`` is the current
   boot, ``-b -2`` the boot two before it)::

     # journalctl --list-boots
     -10 19d173f56d314912820486b9ddfd7d6c Thu 2018-06-21 11:20:55 CEST—Thu 2018-06-21
      -9 3a5a050289314221a3863b88a0eef367 Thu 2018-06-21 11:26:33 CEST—Thu 2018-06-21
      -8 f9726e6c9ce44678ab68a2fc12b1c12c Thu 2018-06-21 11:43:38 CEST—Thu 2018-06-21
      -7 b4e6bc84ff8840adbc698992cd1900d2 Thu 2018-06-21 14:55:42 CEST—Thu 2018-06-21
      -6 81b78d0d09934937a24a73bfcd3d8ede Thu 2018-06-21 15:06:18 CEST—Thu 2018-06-21
      -5 dd78e29c073448ad9731c6c18288c97a Thu 2018-06-21 15:23:15 CEST—Thu 2018-06-21
      -4 0fc2f05d12664d3aba6364102401d5fb Thu 2018-06-21 15:29:36 CEST—Thu 2018-06-21
      -3 412bbe36d12546bab749a2a63fad99ca Thu 2018-06-21 15:34:19 CEST—Thu 2018-06-21
      -2 c5189f2006c245d7833bb8fe20e62545 Thu 2018-06-21 16:07:51 CEST—Thu 2018-06-21
      -1 7c47950edd194ff4b6a67d3556672430 Thu 2018-06-21 16:11:28 CEST—Thu 2018-06-21
       0 61ea098edc924030aafb7a822c2df0e3 Thu 2018-06-21 16:26:46 CEST—Wed 2018-06-27

     # journalctl -b
     -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
     Jun 21 16:26:46 lxdev05.psi.ch systemd-journal[85]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
     Jun 21 16:26:46 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset

     # journalctl -b -2
     -- Logs begin at Thu 2018-06-21 11:20:55 CEST, end at Wed 2018-06-27 14:20:01 CEST. --
     Jun 21 16:07:51 lxdev05.psi.ch systemd-journal[87]: Runtime journal is using 8.0M (max allowed 91.9M, trying to leave 137.9M free of 911.8M av
     Jun 21 16:07:51 lxdev05.psi.ch kernel: Initializing cgroup subsys cpuset

2. Show logs starting from a given date/time::

     # journalctl --since 2018-06-23
     # journalctl --since '2018-06-23 18:13'

3. Show logs for a given unit, e.g. a given service::

     # journalctl -u sshd.service
     # journalctl -u pli-puppet-run.timer

4. Show the last N messages (1000 by default)::

     # journalctl -e
     # journalctl -e -n 250


5. List all systemd timers (this uses ``systemctl``, not ``journalctl``)::

     # systemctl list-timers
     NEXT                         LEFT          LAST                         PASSED  UNIT                         ACTIVATES
     Wed 2018-06-27 16:45:01 CEST 2h 16min left Tue 2018-06-26 16:45:01 CEST 21h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
     Thu 2018-06-28 07:31:00 CEST 17h left      Wed 2018-06-27 07:31:25 CEST 6h ago  pli-puppet-run.timer         pli-puppet-run.service

     2 timers listed.
     Pass --all to see loaded but inactive timers, too.


Filesystems and Storage
-----------------------

Check filesystem capacity using :manpage:`df(1)`::

  # df -h
  Filesystem                      Size  Used Avail Use% Mounted on
  /dev/mapper/vg_root-lv_root     8.0G  1.4G  6.7G  17% /
  devtmpfs                        909M     0  909M   0% /dev
  tmpfs                           920M     0  920M   0% /dev/shm
  tmpfs                           920M  816K  920M   1% /run
  tmpfs                           920M     0  920M   0% /sys/fs/cgroup
  /dev/sda1                       976M  198M  728M  22% /boot
  /dev/mapper/vg_root-lv_tmp     1014M   34M  981M   4% /tmp
  /dev/mapper/vg_root-lv_var      2.9G  1.4G  1.5G  47% /var
  /dev/mapper/vg_root-lv_var_log  2.0G  160M  1.9G   8% /var/log
  /dev/mapper/vg_root-lv_openafs 1008M  1.3M  956M   1% /var/cache/openafs
  tmpfs                           184M  4.0K  184M   1% /run/user/0


Check the available inodes. Every file needs one, so ``IFree`` is roughly the
number of files that can still be created, regardless of free space::

  # df -i
  Filesystem                      Inodes   IUsed   IFree IUse% Mounted on
  /dev/mapper/vg_root-lv_root    4194304   48891 4145413    2% /
  devtmpfs                        232630     383  232247    1% /dev
  tmpfs                           235485       1  235484    1% /dev/shm
  tmpfs                           235485     575  234910    1% /run
  tmpfs                           235485      16  235469    1% /sys/fs/cgroup
  /dev/sda1                        65536     348   65188    1% /boot
  /dev/mapper/vg_root-lv_tmp      524288     316  523972    1% /tmp
  /dev/mapper/vg_root-lv_var     1474560 1042691  431869   71% /var
  /dev/mapper/vg_root-lv_var_log 1048576      81 1048495    1% /var/log
  /dev/mapper/vg_root-lv_openafs   65536      11   65525    1% /var/cache/openafs
  tmpfs                           235485       2  235483    1% /run/user/0

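
In the sample output above, ``/var`` has used 71% of its inodes while only 47%
of its space is in use, which is easy to overlook in a long listing. The sketch
below filters ``df -i`` output for such filesystems. The function name
``inode_alert`` and the default threshold of 70 are assumptions made for this
example, and mount points containing spaces are not handled.

.. code-block:: shell

   # Hypothetical helper: read "df -iP" output on stdin and print every
   # filesystem whose inode usage is at or above $1 percent (default: 70,
   # an arbitrary example threshold).
   inode_alert() {
       awk -v limit="${1:-70}" '
           NR == 1 { next }              # skip the header line
           {
               use = $5
               sub(/%$/, "", use)        # "71%" -> "71"
               # Filesystems that report "-" instead of a number are ignored.
               if (use ~ /^[0-9]+$/ && use + 0 >= limit + 0)
                   printf "%s %s%% inodes used\n", $6, use
           }'
   }

It could be used as ``df -iP | inode_alert 70``; the ``-P`` option asks
``df`` for the POSIX layout with one line per filesystem.
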

Networking
----------

Test hostname resolution with :manpage:`getent(1)`, for example
``getent hosts www.psi.ch``. Unlike :manpage:`nslookup(1)` or :manpage:`dig(1)`,
which query DNS servers directly, it uses the system resolver and therefore also
respects ``/etc/hosts`` and ``/etc/nsswitch.conf``.

The system's IP addresses and routes can be displayed using :manpage:`ip(8)`::

  # ip address
  1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
      link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      inet 127.0.0.1/8 scope host lo
         valid_lft forever preferred_lft forever
      inet6 ::1/128 scope host
         valid_lft forever preferred_lft forever
  2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
      link/ether 00:50:56:9d:6d:03 brd ff:ff:ff:ff:ff:ff
      inet 10.129.160.195/24 brd 10.129.160.255 scope global ens160
         valid_lft forever preferred_lft forever
      inet6 fe80::250:56ff:fe9d:6d03/64 scope link
         valid_lft forever preferred_lft forever

  # ip route
  default via 10.129.160.1 dev ens160
  10.129.160.0/24 dev ens160 proto kernel scope link src 10.129.160.195
  169.254.0.0/16 dev ens160 scope link metric 1002

The link status and other information of an interface can be displayed using
:manpage:`ethtool(8)`:

1. Link status::

     # ethtool ens160
     Settings for ens160:
             [...]
             Speed: 10000Mb/s
             Duplex: Full
             [...]
             Link detected: yes

2. Statistics (driver-specific, but look for errors/discards/dropped)::

     # ethtool -S ens160
     NIC statistics:
          Tx Queue#: 0
            TSO pkts tx: 21529
            TSO bytes tx: 91036062
            ucast pkts tx: 1036632
            ucast bytes tx: 235421707
            mcast pkts tx: 8
            mcast bytes tx: 648
            bcast pkts tx: 7
            bcast bytes tx: 294
            pkts tx err: 0
            pkts tx discard: 0
            drv dropped tx total: 0
               too many frags: 0
               giant hdr: 0
               hdr err: 0
               tso: 0
            ring full: 0
            pkts linearized: 0
            hdr cloned: 0
            giant hdr: 0
          Rx Queue#: 0
            LRO pkts rx: 6913
            LRO byte rx: 100534073
            ucast pkts rx: 551554
            ucast bytes rx: 161369441
            mcast pkts rx: 4
            mcast bytes rx: 344
            bcast pkts rx: 753276
            bcast bytes rx: 45787629
            pkts rx OOB: 0
            pkts rx err: 0
            drv dropped rx total: 0
               err: 0
               fcs: 0
            rx buf alloc fail: 0
          tx timeout count: 0

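
Since ``ethtool -S`` output is long and mostly zeros on a healthy interface, it
can help to show only the suspicious counters. The following is a rough sketch:
the function name ``nic_problem_counters`` is hypothetical, and the list of
name fragments (``err``, ``drop``, ``discard``, ``fail``, ``timeout``) is an
assumption that may need adjusting, because counter names are driver-specific.

.. code-block:: shell

   # Hypothetical helper: read "ethtool -S IFACE" output on stdin and print
   # only the counters whose name suggests a problem and whose value is
   # not zero.
   nic_problem_counters() {
       awk -F': *' '
           tolower($1) ~ /err|drop|discard|fail|timeout/ && $2 + 0 != 0 {
               gsub(/^[ \t]+/, "", $1)    # strip the leading indentation
               printf "%s: %s\n", $1, $2
           }'
   }

It could be used as ``ethtool -S ens160 | nic_problem_counters``. For the
sample output above it would print nothing, because all of the matching
counters are zero.
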

Packages
--------

The integrity of installed packages can be checked with :manpage:`rpm(8)`::

  # rpm -Vv pciutils
  .........    /usr/sbin/lspci
  .........    /usr/sbin/setpci
  .........    /usr/sbin/update-pciids
  .........    /usr/share/doc/pciutils-3.5.1
  .........  d /usr/share/doc/pciutils-3.5.1/COPYING
  .........  d /usr/share/doc/pciutils-3.5.1/ChangeLog
  .........  d /usr/share/doc/pciutils-3.5.1/README
  .........  d /usr/share/doc/pciutils-3.5.1/pciutils.lsm
  .........  d /usr/share/man/man8/lspci.8.gz
  .........  d /usr/share/man/man8/setpci.8.gz
  .........  d /usr/share/man/man8/update-pciids.8.gz

A line that starts with nine dots means the file passed every check. Running
``rpm -Vav`` verifies **all** installed packages and takes a long time. See the
man page for details on the output format. Note that changes, especially in
configuration files, can be perfectly normal.

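
Because changes in configuration files are often expected, it can be useful to
separate them from changes to other files. The sketch below does this based on
the attribute marker that ``rpm -V`` prints between the result string and the
path (``c`` for configuration files). The function name ``rpm_changed`` and the
``config:``/``other:`` prefixes are made up for this example.

.. code-block:: shell

   # Hypothetical helper: read "rpm -V" or "rpm -Vv" output on stdin, drop
   # files that passed every check, and label the rest by file type.
   rpm_changed() {
       awk '
           $1 == "........." { next }                   # all checks passed
           NF >= 3 && $2 == "c" { print "config: " $0; next }
           { print "other:  " $0 }'
   }

It could be used as ``rpm -Va | rpm_changed``. Lines labelled ``other:``
usually deserve a closer look than lines labelled ``config:``.
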