============
Deployment
============
Deployment roughly has the following phases:

1. DHCP followed by PXE boot.
2. Kickstart installation followed by a reboot.
3. Initial Puppet run, followed by updates, followed by another Puppet run and a
   reboot.

PXE boot/iPXE
=============
When deployment fails during the PXE phase, it is usually due to one of the
following:

1. No network connectivity.
   This is usually indicated by messages similar to ``No link on XXX``.
2. No DHCP in the connected network (e.g. DMZ, tier3).
   The DHCP requests by the BIOS/UEFI firmware will time out.
3. Firewall (no TFTP/HTTP to the relevant servers).
4. Incompatibilities between iPXE and the network card (NIC).
5. Incorrect sysdb entry (hence an incorrect iPXE entry).

If there is no DHCP, the static network information provided manually is
possibly wrong or for a different network than the one connected to the host.

Infiniband
----------
Infiniband can generally cause installation problems, especially in the
initial phase, when iPXE tries to load the configuration file.
As a general rule, disable PXE on all Infiniband cards.

However, this is not always enough: iPXE sometimes still recognizes the
Infiniband card as the first device (with MAC address ``79:79:79:79:79:79``)
and tries to fetch the configuration file for it.

Kickstart
=========
Typical problems during the Kickstart phase:

1. The Kickstart file cannot be retrieved from the boot server
   ``boot00.psi.ch``. Typically caused by incorrect sysdb entries or firewalls.
2. Partitioning fails. This can happen because

   a) No disk is recognized, or the wrong disk is used
   b) Packages or other installation data cannot be downloaded. Can be caused by
      firewalls or incorrect sysdb entries.

First Puppet Run
================
Hiera errors are a typical problem, e.g. the following::

    # puppet agent --test
    Info: Using configured environment 'prod'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
    Warning: Not using cache on failed catalog
    Error: Could not retrieve catalog; skipping run

The error message shows that the value for ``console::mount_root`` could not be
found in Hiera.
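
To narrow down such an error, it can help to pull the missing key and the node
name out of the error line and ask Hiera directly. The following is only a
sketch: the helper names ``missing_hiera_key``, ``failing_node`` and
``suggest_lookup`` are hypothetical, the environment defaults to ``prod`` as in
the log above, and the printed ``puppet lookup`` command is assumed to be run
on the Puppet server, where facts for the node are available::

    # Extract the key from a "did not find a value for the name '...'" line.
    missing_hiera_key() {
        printf '%s\n' "$1" |
            sed -n "s/.*did not find a value for the name '\([^']*\)'.*/\1/p"
    }

    # Extract the node name from the trailing "on node <fqdn>" part.
    failing_node() {
        printf '%s\n' "$1" | sed -n 's/.* on node \([^ ]*\)$/\1/p'
    }

    # Print (do not run) a lookup command for the key; $2 is the environment.
    suggest_lookup() {
        printf 'puppet lookup %s --node %s --environment %s --explain\n' \
            "$(missing_hiera_key "$1")" "$(failing_node "$1")" "${2:-prod}"
    }

With ``--explain``, ``puppet lookup`` lists the hierarchy levels it searched,
which usually shows in which Hiera file the value is expected.
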

Sometimes the Active Directory join fails, usually for one of these three
reasons:

- There is already an Active Directory computer object for the same system from
  a previous Windows installation. In this case, delete the computer object and
  restart the installation.
- Firewall restrictions
- Old Puppet certificates from a previous SL6 installation are used on the
  system. In this case delete the certificates on the client with
  ``find /etc/puppetlabs -name '*.pem' -delete`` and clean up any certificates
  on the Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the
  installation.
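
The certificate cleanup from the last item can be made a little safer by
listing the files before deleting them. This is a minimal sketch, not an
established procedure: the function name ``clean_client_certs`` and its
``delete`` argument are hypothetical, and ``-delete`` requires a ``find`` that
supports it (GNU ``find`` does)::

    # List *.pem files below a directory (default: /etc/puppetlabs).
    # Files are only removed when the second argument is "delete".
    clean_client_certs() {
        dir=${1:-/etc/puppetlabs}
        if [ "${2:-dry-run}" = "delete" ]; then
            find "$dir" -name '*.pem' -print -delete
        else
            find "$dir" -name '*.pem' -print
        fi
    }

    # On the client (as root):
    #   clean_client_certs                          # dry run, only lists files
    #   clean_client_certs /etc/puppetlabs delete   # actually deletes them
    # On the Puppet server:
    #   puppet cert clean "$HOSTNAME"                   # Puppet 5 and older
    #   puppetserver ca clean --certname "$HOSTNAME"    # Puppet 6 and newer

The ``puppet cert`` subcommand was removed in Puppet 6, so which of the two
server-side commands applies depends on the Puppet version in use.
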

Rejoin the Active Directory
===========================

If the AD join seems to be broken (failed logins, etc.), the node can be
rejoined automatically:

- Remove ``/etc/krb5.keytab``.
- Run Puppet, e.g. with ``puppet agent --test``.
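
The two steps above can be sketched as follows. The helper
``set_aside_keytab`` is hypothetical; it moves the keytab to a ``.bak`` copy
instead of deleting it, on the assumption that the Puppet run only checks
whether ``/etc/krb5.keytab`` itself exists::

    # Move the keytab out of the way so that it can be restored if needed.
    set_aside_keytab() {
        keytab=${1:-/etc/krb5.keytab}
        if [ ! -e "$keytab" ]; then
            echo "no keytab at $keytab" >&2
            return 1
        fi
        mv -- "$keytab" "$keytab.bak"
    }

    # On the affected node (as root):
    #   set_aside_keytab
    #   puppet agent --test          # exit code 2 means changes were applied
    #   klist -k /etc/krb5.keytab    # should list the host principals again

If ``klist -k`` shows entries for the host afterwards, the new keytab has been
written and the ``.bak`` copy can be removed.
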

Cannot Load YFS Kernel Module
=============================

If the ``yfs-client`` does not start due to "key not available"::

    Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
    Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available

then Secure Boot is most probably blocking the loading of the unsigned ``yfs``
kernel module. Please disable Secure Boot in the BIOS/firmware settings.
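
Before rebooting into the firmware setup, the Secure Boot state can be
confirmed from the running system. This is a small sketch that assumes the
``mokutil`` package is installed; the helper name ``sb_enabled`` is
hypothetical::

    # Return 0 if the given `mokutil --sb-state` output reports Secure Boot
    # as enabled, 1 otherwise.
    sb_enabled() {
        case "$1" in
            *"SecureBoot enabled"*) return 0 ;;
            *) return 1 ;;
        esac
    }

    # On the affected node:
    #   if sb_enabled "$(mokutil --sb-state 2>/dev/null)"; then
    #       echo "Secure Boot is enabled"
    #       modinfo -F signer yfs    # prints nothing for an unsigned module
    #   fi
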