cleanup
@@ -1,4 +1,4 @@
|
||||
# Troubleshouting Boot Issues
|
||||
# Boot Issues
|
||||
|
||||
## SecureBoot
|
||||
|
||||
|
||||
79
admin-guide/troubleshooting/deployment.md
Normal file
@@ -0,0 +1,79 @@
|
||||
# Deployment
|
||||
|
||||
A deployment roughly has the following phases:
|
||||
1. DHCP followed by PXE boot.
|
||||
2. Kickstart installation followed by a reboot.
|
||||
3. Initial Puppet run, followed by updates, followed by another Puppet run and a reboot.
|
||||
|
||||
|
||||
## PXE boot/iPXE
|
||||
|
||||
When deployment fails during the PXE phase, it is usually due to one of the following:
|
||||
|
||||
1. No network connectivity - This is usually indicated by messages similar to ``No link on XXX``.
|
||||
2. No DHCP in the connected network (e.g. DMZ, tier3) - The DHCP requests by the BIOS/UEFI firmware will time out.
|
||||
3. Firewall (no TFTP/HTTP to the relevant servers)
|
||||
4. Incompatibilities between iPXE and network card (NIC)
|
||||
5. Incorrect sysdb entry (hence iPXE entry incorrect).
|
||||
|
||||
If there is no DHCP, the manually provided static network information may be wrong, or may belong to a different network than the one connected to the host.
|
||||
|
||||
|
||||
## Infiniband
|
||||
|
||||
Infiniband can cause installation problems, especially in the initial phase, when iPXE tries to load the configuration file. As a general rule, disable PXE on all Infiniband cards.
|
||||
|
||||
However, this is not always enough: iPXE sometimes still recognizes the Infiniband card as the first device (with MAC address ``79:79:79:79:79:79``) and tries to fetch the configuration file for it.
|
||||
|
||||
|
||||
## Kickstart
|
||||
|
||||
Typical problems during the Kickstart phase:
|
||||
1. The Kickstart file cannot be retrieved from the sysdb server __sysdb.psi.ch__. Typically caused by incorrect sysdb entries or firewalls.
|
||||
2. Partitioning fails. This can happen because
|
||||
- No disk is recognized, or the wrong disk is used
|
||||
- Packages or other installation data cannot be downloaded. Can be caused by firewalls or incorrect sysdb entries.
|
||||
|
||||
## Hiera
|
||||
|
||||
A typical problem is a Hiera error, e.g. the following:
|
||||
```bash
|
||||
Info: Using configured environment 'prod'
|
||||
Info: Retrieving pluginfacts
|
||||
Info: Retrieving plugin
|
||||
Info: Loading facts
|
||||
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
|
||||
Warning: Not using cache on failed catalog
|
||||
Error: Could not retrieve catalog; skipping run
|
||||
```
|
||||
|
||||
The error message shows that the value for `console::mount_root` could not be found in Hiera.
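When the agent output is long, it can help to pull out just the missing key names. The following is a minimal sketch, assuming the error text has the same wording as the message above; the function name `missing_hiera_keys` is hypothetical and not part of Puppet.

```bash
# Print each Hiera key named in a "lookup() did not find a value" error.
# Reads Puppet agent output on stdin; prints every distinct key once.
# Assumption: the error wording matches the example message above.
missing_hiera_keys() {
    sed -n "s/.*did not find a value for the name '\([^']*\)'.*/\1/p" | sort -u
}
```

A possible use is `puppet agent --test 2>&1 | missing_hiera_keys`. Each reported key then needs a value in Hiera for the affected node or role.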
|
||||
|
||||
|
||||
## Active Directory
|
||||
|
||||
Sometimes the Active Directory join fails, usually for one of these three reasons:
|
||||
|
||||
- There is already an Active Directory computer object for the same system from a previous Windows installation. In this case, delete the computer object and restart the installation.
|
||||
- Firewall restrictions
|
||||
- Old Puppet certificates from a previous SL6 installation are used on the system. In this case delete the certificates on the client with `find /etc/puppetlabs -name '*.pem' -delete` and clean up any certificates on the Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the installation.
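Since `find ... -delete` removes files without asking, it may be worth listing the matches first. The following is a small sketch that runs the same search with `-print` instead of `-delete`; the function name `list_puppet_pems` is hypothetical.

```bash
# List the certificate files that the cleanup would remove, without deleting.
# $1: directory to search (default: /etc/puppetlabs, as in the text above).
list_puppet_pems() {
    find "${1:-/etc/puppetlabs}" -name '*.pem' -print
}
```

Calling `list_puppet_pems` with no argument on the client shows what the delete command above would touch.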
|
||||
|
||||
### Rejoin Active Directory
|
||||
|
||||
If the AD join seems to be broken (failed logins, etc.), the node can be rejoined automatically:
|
||||
- remove `/etc/krb5.keytab`
|
||||
- run puppet, e.g. with `puppet agent --test`
|
||||
|
||||
|
||||
## YFS / AFS
|
||||
|
||||
If the ``yfs-client`` does not start (cannot load kernel module) due to `key not available`:
|
||||
|
||||
```bash
|
||||
Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
|
||||
Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available
|
||||
```
|
||||
|
||||
then SecureBoot is most probably blocking the loading of the unsigned `yfs` kernel module.
|
||||
|
||||
Disable SecureBoot in the BIOS/UEFI settings.
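To check a saved journal excerpt for this symptom, a filter along the following lines could be used. This is only a sketch that matches the exact `modprobe` wording from the log above; the function name `module_blocked_by_key` is hypothetical.

```bash
# Succeed (exit status 0) if the log on stdin contains the
# "Required key not available" rejection for the given kernel module.
# $1: module name (default: yfs). Assumes the modprobe wording shown above.
module_blocked_by_key() {
    grep -q "could not insert '${1:-yfs}': Required key not available"
}
```

For example, `journalctl -b | module_blocked_by_key yfs && echo "yfs rejected: check SecureBoot"`. Where the `mokutil` tool is installed, `mokutil --sb-state` should additionally report whether SecureBoot is currently enabled.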
|
||||
@@ -1,112 +0,0 @@
|
||||
============
|
||||
Deployment
|
||||
============
|
||||
|
||||
Deployment roughly has the following phases:
|
||||
|
||||
1. DHCP followed by PXE boot.
|
||||
2. Kickstart installation followed by a reboot.
|
||||
3. Initial Puppet run, followed by updates, followed by another Puppet run and a
|
||||
reboot.
|
||||
|
||||
|
||||
PXE boot/iPXE
|
||||
=============
|
||||
|
||||
When deployment fails during the PXE phase it usually due to one of the
|
||||
following:
|
||||
|
||||
1. No network connectivity
|
||||
|
||||
This is usually indicated by messages similar to ``No link on XXX``.
|
||||
|
||||
2. No DHCP in the connected network (eg DMZ, tier3)
|
||||
|
||||
The DHCP requests by the BIOS/UEFI firmware will time out.
|
||||
|
||||
3. Firewall (no TFTP/HTTP to the relevant servers)
|
||||
4. Incompatibilities between iPXE and network card (NIC)
|
||||
5. Incorrect sysdb entry (hence iPXE entry incorrect).
|
||||
|
||||
If there is not DHCP, the static network information provided manually is
|
||||
possibly wrong or for a different network than the one connected to the host.
|
||||
|
||||
|
||||
Infiniband
|
||||
----------
|
||||
|
||||
Infiniband can generally cause installation problem, expecially in the
|
||||
initial phase, when iPXE tries to load the configuration file.
|
||||
|
||||
As a general rule, disable PXE on all Infiniband cards.
|
||||
|
||||
Anyway this is not always enough since it happens that iPXE recognized
|
||||
anyway the Infiniband card as the first device (with MAC
|
||||
address ``79:79:79:79:79:79``) and tries to get configuration file for
|
||||
that.
|
||||
|
||||
|
||||
Kickstart
|
||||
=========
|
||||
|
||||
Typical problems during the Kickstart phase:
|
||||
|
||||
1. The Kickstart file cannot be retrieved from the sysdb server
|
||||
``sysdb.psi.ch``. Typically caused by incorrect sysdb entries or firewalls.
|
||||
2. Partitioning fails. This can happen because
|
||||
|
||||
a) No disk is recognized, or the wrong disk is used
|
||||
b) Packages or other installation data cannot be downloaded. Can be caused by
|
||||
firewalls or incorrect sysdb entries.
|
||||
|
||||
|
||||
First Puppet Run
|
||||
================
|
||||
|
||||
A typical problem are Hiera errors, eg the following::
|
||||
|
||||
# puppet agent --test
|
||||
Info: Using configured environment 'prod'
|
||||
Info: Retrieving pluginfacts
|
||||
Info: Retrieving plugin
|
||||
Info: Loading facts
|
||||
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Function lookup() did not find a value for the name 'console::mount_root' at /srv/puppet/code/dev/envs/prod/code/modules/role/manifests/console.pp:1 on node lxdev05.psi.ch
|
||||
Warning: Not using cache on failed catalog
|
||||
Error: Could not retrieve catalog; skipping run
|
||||
|
||||
The error message shows that the value for ``console::mount_root`` could not be
|
||||
found in Hiera.
|
||||
|
||||
Sometimes the Active Directory join fails, usually for one of these three
|
||||
reasons:
|
||||
|
||||
- There is already an Active Directory computer object for the same system from
|
||||
a previous Windows installation. In this case, delete the computer object and
|
||||
restart the installation.
|
||||
- Firewall restrictions
|
||||
- Old Puppet certificates from a previous SL6 installation are used on the
|
||||
system. In this case delete the certificates on the client with ``find
|
||||
/etc/puppetlabs -name '*.pem' -delete`` and clean up any certificates on the
|
||||
Puppet server with ``puppet cert clean $HOSTNAME``. Then restart the
|
||||
installation.
|
||||
|
||||
Rejoin the Active Directory
|
||||
===========================
|
||||
|
||||
If the AD join seams to be broken (failed logins, etc.), then the node can be automatically rejoined again:
|
||||
|
||||
- remove ``/etc/krb5.keytab``
|
||||
- run puppet, e.g. with ``puppet agent --test``
|
||||
|
||||
|
||||
Cannot Load YFS Kernel Module
|
||||
=============================
|
||||
|
||||
If the ``yfs-client`` does not start due to "key not available" ::
|
||||
|
||||
Sep 02 13:21:34 pc12661.psi.ch systemd[1]: Starting AuriStorFS Client Service...
|
||||
Sep 02 13:21:34 pc12661.psi.ch modprobe[29282]: modprobe: ERROR: could not insert 'yfs': Required key not available
|
||||
|
||||
then there is most probably SecureBoot blocking the loading of the unsigned ``yfs`` kernel module.
|
||||
|
||||
Please disable secure boot in the BIOS/firmware settings.
|
||||
@@ -1,7 +1,7 @@
|
||||
# PCIe Bus Error
|
||||
|
||||
When there are PCI Express bus errors like
|
||||
```
|
||||
```bash
|
||||
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: AER: TLP Header: 34000000 e1000010 89148914 00000000
|
||||
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
|
||||
Oct 05 11:26:19 pc16209.psi.ch kernel: pcieport 10000:e0:06.0: device [8086:464d] error status/mask=00100000/00010000
|
||||
@@ -18,7 +18,7 @@ One thing you might try is disabling **Active State Power Management** (ASPM) in
|
||||
|
||||
To do so set in Hiera
|
||||
|
||||
```
|
||||
```yaml
|
||||
base::enable_pcie_aspm: false
|
||||
```
|
||||
|
||||
|
||||
@@ -1,4 +1,4 @@
|
||||
# sssd Authentication
|
||||
# SSSD
|
||||
|
||||
## Check Domain State
|
||||
As `root` check what domains are configured:
|
||||
|
||||