2024-08-07 16:34:32 +02:00
parent 16b814f9b0
commit 010affa6ba
5 changed files with 83 additions and 116 deletions

@@ -1,4 +1,4 @@
# Troubleshouting Boot Issues
# Boot Issues
## SecureBoot

@@ -0,0 +1,79 @@
# Deployment
A deployment roughly has the following phases:
1. DHCP followed by PXE boot.
2. Kickstart installation followed by a reboot.
3. Initial Puppet run, followed by updates, followed by another Puppet run and a reboot.
## PXE boot/iPXE
When deployment fails during the PXE phase, it is usually due to one of the following:
1. No network connectivity - This is usually indicated by messages similar to ``No link on XXX``.
2. No DHCP in the connected network (e.g. DMZ, tier3) - The DHCP requests by the BIOS/UEFI firmware will time out.
3. Firewall (no TFTP/HTTP to the relevant servers)
4. Incompatibilities between iPXE and network card (NIC)
5. Incorrect sysdb entry (hence iPXE entry incorrect).
If there is no DHCP, the manually provided static network information may be wrong or may belong to a different network than the one the host is connected to.
## Infiniband
Infiniband can cause installation problems, especially in the initial phase, when iPXE tries to load the configuration file. As a general rule, disable PXE on all Infiniband cards.
However, this is not always enough: iPXE sometimes still recognizes the Infiniband card as the first device (with MAC address ``79:79:79:79:79:79``) and tries to fetch the configuration file for it.
## Kickstart
Typical problems during the Kickstart phase:
1. The Kickstart file cannot be retrieved from the sysdb server __sysdb.psi.ch__. Typically caused by incorrect sysdb entries or firewalls.
2. Partitioning fails. This can happen because:
- No disk is recognized, or the wrong disk is used
- Packages or other installation data cannot be downloaded. Can be caused by firewalls or incorrect sysdb entries.
## Hiera
Hiera errors are a typical problem, e.g. the following:
```bash
Info: Using configured environment 'prod'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
```
The error message shows that the value for `console::mount_root` could not be found in Hiera.
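When the agent output is long, the missing key can be pulled out of it mechanically. The following is a minimal sketch, not part of the original procedure: `missing_hiera_key` is a hypothetical helper name, and the `puppet lookup` call in the comment assumes shell access to the Puppet server and the node and environment from the example above.

```bash
# Print the Hiera key named in a failed lookup() error, reading agent output on stdin.
# (Hypothetical helper for illustration.)
missing_hiera_key() {
  sed -n "s/.*did not find a value for the name '\([^']*\)'.*/\1/p"
}

# Example with a shortened form of the error message above:
echo "Error: Function lookup() did not find a value for the name 'console::mount_root' at console.pp:1" \
  | missing_hiera_key

# On the Puppet server, the key can then be traced through the hierarchy, e.g.:
#   puppet lookup console::mount_root --node lxdev05.psi.ch --environment prod --explain
```

The `--explain` output lists each hierarchy level that was searched, which usually shows where the value should have been defined.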
## Active Directory
Sometimes the Active Directory join fails, usually for one of these three reasons:
- There is already an Active Directory computer object for the same system from a previous Windows installation. In this case, delete the computer object and restart the installation.
- Firewall restrictions
- Old Puppet certificates from a previous SL6 installation are used on the system. In this case delete the certificates on the client with `find /etc/puppetlabs -name '*.pem' -delete` and clean up any certificates on the Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the installation.
### Rejoin Active Directory
If the AD join seems to be broken (failed logins, etc.), the node can be rejoined automatically:
- remove `/etc/krb5.keytab`
- run puppet, e.g. with `puppet agent --test`
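The two steps can be wrapped in a small function. This is only a sketch under assumptions: `rejoin_ad` and the `PUPPET_CMD` override are hypothetical names, and the keytab is moved aside to a `.bak` copy rather than deleted, which deviates slightly from the steps above so that the old file can be restored.

```bash
# Rejoin a node to Active Directory by removing the keytab and running Puppet.
# Run as root. The first argument (keytab path) defaults to /etc/krb5.keytab.
rejoin_ad() {
  local keytab="${1:-/etc/krb5.keytab}"

  # Move the keytab out of the way instead of deleting it (our addition).
  if [ -e "$keytab" ]; then
    mv "$keytab" "$keytab.bak" || return 1
  fi

  # PUPPET_CMD is a hypothetical override, handy for dry runs;
  # by default the agent is run as described above.
  ${PUPPET_CMD:-puppet agent --test}
}
```

Calling `rejoin_ad` without arguments acts on `/etc/krb5.keytab`. Note that `puppet agent --test` uses detailed exit codes, so an exit status of 2 means that changes were applied, not that the run failed.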
## YFS / AFS
If the ``yfs-client`` does not start (cannot load kernel module) due to `key not available`:
```bash
Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available
```
then SecureBoot is most probably blocking the loading of the unsigned `yfs` kernel module.
Disable SecureBoot in the BIOS/EFI settings.
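Before changing firmware settings, the diagnosis can be confirmed from the running system. The sketch below rests on a few assumptions: `module_key_rejected` is a hypothetical helper, the systemd unit is assumed to be named `yfs-client`, and the `mokutil` package is assumed to be installed.

```bash
# Succeed if the log text on stdin shows a kernel module rejected for a missing key.
# (Hypothetical helper for illustration.)
module_key_rejected() {
  grep -q "could not insert '.*': Required key not available"
}

# Typical usage on the affected host, as root:
#   journalctl -b -u yfs-client | module_key_rejected && mokutil --sb-state
#   modinfo -F signer yfs    # empty output: the module carries no signature
```

`mokutil --sb-state` reports whether SecureBoot is currently enabled, so the two commands together show whether the firmware setting and an unsigned module are indeed the cause.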

@@ -1,112 +0,0 @@
============
Deployment
============
Deployment roughly has the following phases:
1. DHCP followed by PXE boot.
2. Kickstart installation followed by a reboot.
3. Initial Puppet run, followed by updates, followed by another Puppet run and a
reboot.
PXE boot/iPXE
=============
When deployment fails during the PXE phase it usually due to one of the
following:
1. No network connectivity
This is usually indicated by messages similar to ``No link on XXX``.
2. No DHCP in the connected network (eg DMZ, tier3)
The DHCP requests by the BIOS/UEFI firmware will time out.
3. Firewall (no TFTP/HTTP to the relevant servers)
4. Incompatibilities between iPXE and network card (NIC)
5. Incorrect sysdb entry (hence iPXE entry incorrect).
If there is not DHCP, the static network information provided manually is
possibly wrong or for a different network than the one connected to the host.
Infiniband
----------
Infiniband can generally cause installation problem, expecially in the
initial phase, when iPXE tries to load the configuration file.
As a general rule, disable PXE on all Infiniband cards.
Anyway this is not always enough since it happens that iPXE recognized
anyway the Infiniband card as the first device (with MAC
address ``79:79:79:79:79:79``) and tries to get configuration file for
that.
Kickstart
=========
Typical problems during the Kickstart phase:
1. The Kickstart file cannot be retrieved from the sysdb server
``sysdb.psi.ch``. Typically caused by incorrect sysdb entries or firewalls.
2. Partitioning fails. This can happen because
a) No disk is recognized, or the wrong disk is used
b) Packages or other installation data cannot be downloaded. Can be caused by
firewalls or incorrect sysdb entries.
First Puppet Run
================
A typical problem are Hiera errors, eg the following::
# puppet agent --test
Info: Using configured environment 'prod'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
The error message shows that the value for ``console::mount_root`` could not be
found in Hiera.
Sometimes the Active Directory join fails, usually for one of these three
reasons:
- There is already an Active Directory computer object for the same system from
a previous Windows installation. In this case, delete the computer object and
restart the installation.
- Firewall restrictions
- Old Puppet certificates from a previous SL6 installation are used on the
system. In this case delete the certificates on the client with ``find
/etc/puppetlabs -name '*.pem' -delete`` and clean up any certificates on the
Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the
installation.
Rejoin the Active Directory
===========================
If the AD join seams to be broken (failed logins, etc.), then the node can be automatically rejoined again:
- remove ``/etc/krb5.keytab``
- run puppet, e.g. with ``puppet agent --test``
Cannot Load YFS Kernel Module
=============================
If the ``yfs-client`` does not start due to "key not available" ::
Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available
then there is most probably SecureBoot blocking the loading of the unsigned ``yfs`` kernel module.
Please disable secure boot in the BIOS/firmware settings.

@@ -1,7 +1,7 @@
# PCIe Bus Error
When there are PCI Express bus errors like
```
```bash
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: AER: TLP Header: 34000000 e1000010 89148914 00000000
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: device [8086:464d] error status/mask=00100000/00010000
@@ -18,7 +18,7 @@ One thing you might try is disabling **Active State Power Management** (ASPM) in
To do so set in Hiera
```
```yaml
base::enable_pcie_aspm: false
```

@@ -1,4 +1,4 @@
# sssd Authentication
# SSSD
## Check Domain State
As `root` check what domains are configured: