From 010affa6ba32040bd1cf318e1deecff7a46f9e4d Mon Sep 17 00:00:00 2001 From: ebner Date: Wed, 7 Aug 2024 16:34:32 +0200 Subject: [PATCH] cleanup --- admin-guide/troubleshooting/boot.md | 2 +- admin-guide/troubleshooting/deployment.md | 79 ++++++++++++ admin-guide/troubleshooting/deployment.rst | 112 ------------------ admin-guide/troubleshooting/pcie_bus_error.md | 4 +- admin-guide/troubleshooting/sssd.md | 2 +- 5 files changed, 83 insertions(+), 116 deletions(-) create mode 100644 admin-guide/troubleshooting/deployment.md delete mode 100644 admin-guide/troubleshooting/deployment.rst diff --git a/admin-guide/troubleshooting/boot.md b/admin-guide/troubleshooting/boot.md index 04589c61..8f8cca28 100644 --- a/admin-guide/troubleshooting/boot.md +++ b/admin-guide/troubleshooting/boot.md @@ -1,4 +1,4 @@ -# Troubleshouting Boot Issues +# Boot Issues ## SecureBoot diff --git a/admin-guide/troubleshooting/deployment.md b/admin-guide/troubleshooting/deployment.md new file mode 100644 index 00000000..f41bd22c --- /dev/null +++ b/admin-guide/troubleshooting/deployment.md @@ -0,0 +1,79 @@ +# Deployment + +A deployment roughly has the following phases: +1. DHCP followed by PXE boot. +2. Kickstart installation followed by a reboot. +3. Initial Puppet run, followed by updates, followed by another Puppet run and a reboot. + + +## PXE boot/iPXE + +When deployment fails during the PXE phase, it is usually due to one of the following: + +1. No network connectivity - This is usually indicated by messages similar to ``No link on XXX``. +2. No DHCP in the connected network (e.g. DMZ, tier3) - The DHCP requests by the BIOS/UEFI firmware will time out. +3. Firewall (no TFTP/HTTP to the relevant servers) +4. Incompatibilities between iPXE and network card (NIC) +5. Incorrect sysdb entry (hence iPXE entry incorrect). + +If there is no DHCP, the manually provided static network information may be wrong or may belong to a different network than the one connected to the host. 
+ + ## Infiniband + +Infiniband can cause installation problems, especially in the initial phase, when iPXE tries to load the configuration file. As a general rule, disable PXE on all Infiniband cards. + +However, this is not always enough: iPXE sometimes still recognizes the Infiniband card as the first device (with MAC address ``79:79:79:79:79:79``) and tries to get the configuration file for it. + + +## Kickstart + +Typical problems during the Kickstart phase: +1. The Kickstart file cannot be retrieved from the sysdb server __sysdb.psi.ch__. Typically caused by incorrect sysdb entries or firewalls. +2. Partitioning fails. This can happen because + - No disk is recognized, or the wrong disk is used + - Packages or other installation data cannot be downloaded. Can be caused by firewalls or incorrect sysdb entries. + +## Hiera + +Typical problems are Hiera errors, e.g. the following: +```bash +Info: Using configured environment 'prod' +Info: Retrieving pluginfacts +Info: Retrieving plugin +Info: Loading facts +Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch +Warning: Not using cache on failed catalog +Error: Could not retrieve catalog; skipping run +``` + +The error message shows that the value for `console::mount_root` could not be found in Hiera. + + +## Active Directory + +Sometimes the Active Directory join fails, usually for one of these three reasons: + +- There is already an Active Directory computer object for the same system from a previous Windows installation. In this case, delete the computer object and restart the installation. +- Firewall restrictions +- Old Puppet certificates from a previous SL6 installation are used on the system. 
In this case, delete the certificates on the client with `find /etc/puppetlabs -name '*.pem' -delete` and clean up any certificates on the Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the installation. + +### Rejoin Active Directory + +If the AD join seems to be broken (failed logins, etc.), the node can be rejoined automatically: +- remove `/etc/krb5.keytab` +- run Puppet, e.g. with `puppet agent --test` + + +## YFS / AFS + +If the ``yfs-client`` does not start (cannot load kernel module) due to `key not available`: + +```bash +Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service... +Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available +``` + +then SecureBoot is most probably blocking the loading of the unsigned `yfs` kernel module. + +Disable secure boot in the BIOS/EFI settings. diff --git a/admin-guide/troubleshooting/deployment.rst b/admin-guide/troubleshooting/deployment.rst deleted file mode 100644 index db09ff43..00000000 --- a/admin-guide/troubleshooting/deployment.rst +++ /dev/null @@ -1,112 +0,0 @@ -============ - Deployment -============ - -Deployment roughly has the following phases: - -1. DHCP followed by PXE boot. -2. Kickstart installation followed by a reboot. -3. Initial Puppet run, followed by updates, followed by another Puppet run and a - reboot. - - -PXE boot/iPXE -============= - -When deployment fails during the PXE phase it usually due to one of the -following: - -1. No network connectivity - - This is usually indicated by messages similar to ``No link on XXX``. - -2. No DHCP in the connected network (eg DMZ, tier3) - - The DHCP requests by the BIOS/UEFI firmware will time out. - -3. Firewall (no TFTP/HTTP to the relevant servers) -4. Incompatibilities between iPXE and network card (NIC) -5. Incorrect sysdb entry (hence iPXE entry incorrect). 
- -If there is not DHCP, the static network information provided manually is -possibly wrong or for a different network than the one connected to the host. - - -Infiniband ----------- - -Infiniband can generally cause installation problem, expecially in the -initial phase, when iPXE tries to load the configuration file. - -As a general rule, disable PXE on all Infiniband cards. - -Anyway this is not always enough since it happens that iPXE recognized -anyway the Infiniband card as the first device (with MAC -address ``79:79:79:79:79:79``) and tries to get configuration file for -that. - - -Kickstart -========= - -Typical problems during the Kickstart phase: - -1. The Kickstart file cannot be retrieved from the sysdb server - ``sysdb.psi.ch``. Typically caused by incorrect sysdb entries or firewalls. -2. Partitioning fails. This can happen because - - a) No disk is recognized, or the wrong disk is used - b) Packages or other installation data cannot be downloaded. Can be caused by - firewalls or incorrect sysdb entries. - - -First Puppet Run -================ - -A typical problem are Hiera errors, eg the following:: - - # puppet agent --test - Info: Using configured environment 'prod' - Info: Retrieving pluginfacts - Info: Retrieving plugin - Info: Loading facts - Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch - Warning: Not using cache on failed catalog - Error: Could not retrieve catalog; skipping run - -The error message shows that the value for ``console::mount_root`` could not be -found in Hiera. 
- -Sometimes the Active Directory join fails, usually for one of these three -reasons: - -- There is already an Active Directory computer object for the same system from - a previous Windows installation. In this case, delete the computer object and - restart the installation. -- Firewall restrictions -- Old Puppet certificates from a previous SL6 installation are used on the - system. In this case delete the certificates on the client with ``find - /etc/puppetlabs -name '*.pem' -delete`` and clean up any certificates on the - Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the - installation. - -Rejoin the Active Directory -=========================== - -If the AD join seams to be broken (failed logins, etc.), then the node can be automatically rejoined again: - -- remove ``/etc/krb5.keytab`` -- run puppet, e.g. with ``puppet agent --test`` - - -Cannot Load YFS Kernel Module -============================= - -If the ``yfs-client`` does not start due to "key not available" :: - - Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service... - Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available - -then there is most probably SecureBoot blocking the loading of the unsigned ``yfs`` kernel module. - -Please disable secure boot in the BIOS/firmware settings. 
diff --git a/admin-guide/troubleshooting/pcie_bus_error.md b/admin-guide/troubleshooting/pcie_bus_error.md index 37ed973b..cfc81ac4 100644 --- a/admin-guide/troubleshooting/pcie_bus_error.md +++ b/admin-guide/troubleshooting/pcie_bus_error.md @@ -1,7 +1,7 @@ # PCIe Bus Error When there are PCI Express bus errors like -``` +```bash Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: AER: TLP Header: 34000000 e1000010 89148914 00000000 Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID) Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: device [8086:464d] error status/mask=00100000/00010000 @@ -18,7 +18,7 @@ One thing you might try is disabling **Active State Power Management** (ASPM) in To do so set in Hiera -``` +```yaml base::enable_pcie_aspm: false ``` diff --git a/admin-guide/troubleshooting/sssd.md b/admin-guide/troubleshooting/sssd.md index ada159b8..881c5815 100644 --- a/admin-guide/troubleshooting/sssd.md +++ b/admin-guide/troubleshooting/sssd.md @@ -1,4 +1,4 @@ -# sssd Authentication +# SSSD ## Check Domain State As `root` check what domains are configured: