2024-08-07 16:34:32 +02:00


Deployment

A deployment roughly has the following phases:

  1. DHCP followed by PXE boot.
  2. Kickstart installation followed by a reboot.
  3. Initial Puppet run, followed by updates, followed by another Puppet run and a reboot.

PXE boot/iPXE

When deployment fails during the PXE phase, it is usually due to one of the following:

  1. No network connectivity - This is usually indicated by messages similar to No link on XXX.
  2. No DHCP in the connected network (e.g. DMZ, tier3) - The DHCP requests by the BIOS/UEFI firmware will time out.
  3. Firewall (no TFTP/HTTP to the relevant servers)
  4. Incompatibilities between iPXE and the network card (NIC)
  5. Incorrect sysdb entry (and hence an incorrect iPXE entry).

If there is no DHCP, the manually provided static network information may be wrong, or may belong to a different network than the one the host is connected to.

Infiniband

Infiniband can cause installation problems, especially in the initial phase, when iPXE tries to load the configuration file. As a general rule, disable PXE on all Infiniband cards.

This is not always enough, however: iPXE sometimes still recognizes the Infiniband card as the first device (with MAC address 79:79:79:79:79:79) and tries to get the configuration file for that device.

Kickstart

Typical problems during the Kickstart phase:

  1. The Kickstart file cannot be retrieved from the sysdb server sysdb.psi.ch. This is typically caused by incorrect sysdb entries or firewalls.
  2. Partitioning fails. This can happen because no disk is recognized, or because the wrong disk is used.
  3. Packages or other installation data cannot be downloaded. This can be caused by firewalls or incorrect sysdb entries.

Hiera

Hiera errors are a typical problem, e.g. the following:

Info: Using configured environment 'prod'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

The error message shows that the value for console::mount_root could not be found in Hiera.
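
When the agent output is long, the missing key can be pulled out mechanically. The following is a minimal sketch, not part of the original guide: the helper name missing_hiera_keys is hypothetical, and it assumes the error wording matches the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the Hiera keys named in
# "did not find a value for the name '...'" errors.
# Reads Puppet agent output on stdin and prints each key once.
missing_hiera_keys() {
    sed -n "s/.*did not find a value for the name '\([^']*\)'.*/\1/p" | sort -u
}

# Usage on the affected node (prints console::mount_root for the log above):
#   puppet agent --test 2>&1 | missing_hiera_keys
#
# The key can then be checked on the Puppet server, for example with:
#   puppet lookup console::mount_root --node lxdev05.psi.ch --environment prod --explain
```

The --explain output lists the Hiera levels that were searched, which usually shows in which file the value is expected.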

Active Directory

Sometimes the Active Directory join fails, usually for one of these three reasons:

  • There is already an Active Directory computer object for the same system from a previous Windows installation. In this case, delete the computer object and restart the installation.
  • Firewall restrictions
  • Old Puppet certificates from a previous SL6 installation are used on the system. In this case delete the certificates on the client with find /etc/puppetlabs -name '*.pem' -delete and clean up any certificates on the Puppet server with puppet cert clean $HOSTNAME. Then restart the installation.
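
Before deleting old certificates, it can help to list what the find command would remove. The sketch below is an illustration only; the function name list_puppet_pems is hypothetical, and the default directory is the one used in the command above.

```shell
#!/bin/sh
# Hypothetical dry run: list the certificate files that the cleanup
# would delete, without deleting anything.
# $1: Puppet directory (defaults to /etc/puppetlabs, as in the text above).
list_puppet_pems() {
    find "${1:-/etc/puppetlabs}" -name '*.pem' -print
}

# Usage on the client:
#   list_puppet_pems                                 # review the list
#   find /etc/puppetlabs -name '*.pem' -delete       # then delete
#
# On the Puppet server:
#   puppet cert clean "$HOSTNAME"
```

The puppet cert subcommand was removed in Puppet 6. If the Puppet server runs that version or later (the text does not state which version is in use), the equivalent is puppetserver ca clean --certname $HOSTNAME.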

Rejoin Active Directory

If the AD join seems to be broken (failed logins, etc.), the node can be rejoined automatically:

  • remove /etc/krb5.keytab
  • run puppet, e.g. with puppet agent --test
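
The two steps can be combined as sketched below. This is one possible variant under stated assumptions: the function name retire_keytab is hypothetical, and it renames the keytab to a .bak copy instead of deleting it, which the original steps do not do.

```shell
#!/bin/sh
# Hypothetical helper: move the Kerberos keytab out of the way so that
# the next Puppet run performs the AD join again.
# $1: keytab path (defaults to /etc/krb5.keytab, as in the steps above).
# Assumption: a backup copy is kept; the original steps simply remove the file.
retire_keytab() {
    keytab="${1:-/etc/krb5.keytab}"
    if [ -e "$keytab" ]; then
        mv "$keytab" "$keytab.bak"
    fi
}

# Usage (as root on the affected node):
#   retire_keytab && puppet agent --test
```

Once logins work again, the .bak file can be deleted.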

YFS / AFS

If the yfs-client does not start because the kernel module cannot be loaded due to a missing key:

Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available

then Secure Boot is most probably blocking the loading of the unsigned yfs kernel module.

Disable Secure Boot in the BIOS/UEFI settings.
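
Whether Secure Boot is currently active can be confirmed from the running system before rebooting into the firmware settings. The following sketch assumes the mokutil tool is installed, which the guide does not state; the function name secure_boot_enabled is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the given output of
# "mokutil --sb-state" reports Secure Boot as enabled.
# $1: the text printed by mokutil --sb-state.
secure_boot_enabled() {
    printf '%s\n' "$1" | grep -q '^SecureBoot enabled'
}

# Usage on the affected node:
#   if secure_boot_enabled "$(mokutil --sb-state 2>/dev/null)"; then
#       echo "Secure Boot is enabled; the unsigned yfs module will be rejected"
#   fi
#
# The modprobe error itself can be found in the journal with:
#   journalctl -b | grep 'Required key not available'
```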