Red Hat Enterprise Linux 8
Production Ready
The central infrastructure (automatic provisioning, upstream package synchronisation and Puppet) is stable and production ready.
Configuration management is done with Puppet, as for RHEL 7. RHEL 7 and RHEL 8 hosts can share the same hierarchy in Hiera and thus also the "same" configuration. Where the configuration for RHEL 7 and RHEL 8 differs, the idea is to keep both variants in parallel in Hiera and let Puppet select the right one.
When moving to RHEL 8, please also consider implementing the following two migrations:
- migrate from Icinga1 to Icinga2, as Icinga1 will be decommissioned by the end of 2024
- configure the network explicitly in Hiera with `networking::setup`, especially if you have static IP addresses or static routes
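As a sketch, such an explicit network configuration in Hiera could look like the snippet below. The key names besides `networking::setup` (`networking::interfaces`, `ipaddr`, `gateway`, `networking::routes`) are hypothetical; check the site's networking module for the real ones:

```
# hypothetical Hiera keys -- consult the networking module for the actual names
networking::setup: true
networking::interfaces:
  eno1:
    ipaddr: '192.0.2.10'       # example static address (TEST-NET-1)
    netmask: '255.255.255.0'
    gateway: '192.0.2.1'
networking::routes:
  - '198.51.100.0/24 via 192.0.2.1'   # example static route
```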
Bugs and issues can be reported in the Linux project in JIRA.
Documentation
Disk Layout
The default partition schema for RHEL 8 is:
- create one primary `/boot` partition of 1 GB
- create the `vg_root` volume group that uses the rest of the disk
- on `vg_root` create the following logical volumes:
  - `lv_root` of 14 GB size for `/`
  - `lv_home` of 2 GB size for `/home`
  - `lv_var` of 8 GB size for `/var`
  - `lv_var_log` of 3 GB size for `/var/log`
  - `lv_var_tmp` of 2 GB size for `/var/tmp`
  - `lv_tmp` of 2 GB size for `/tmp`
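Expressed as Anaconda kickstart directives, this schema would look roughly as follows. This is a sketch assuming XFS as the filesystem; the actual provisioning templates may differ:

```
# /boot as primary partition, rest of the disk as physical volume for vg_root
part /boot --fstype=xfs --size=1024 --asprimary
part pv.01 --grow
volgroup vg_root pv.01
# logical volumes on vg_root (sizes in MiB)
logvol /        --vgname=vg_root --name=lv_root    --fstype=xfs --size=14336
logvol /home    --vgname=vg_root --name=lv_home    --fstype=xfs --size=2048
logvol /var     --vgname=vg_root --name=lv_var     --fstype=xfs --size=8192
logvol /var/log --vgname=vg_root --name=lv_var_log --fstype=xfs --size=3072
logvol /var/tmp --vgname=vg_root --name=lv_var_tmp --fstype=xfs --size=2048
logvol /tmp     --vgname=vg_root --name=lv_tmp     --fstype=xfs --size=2048
```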
Caveats
Missing or Replaced Packages
List of packages removed in RHEL 8
| RHEL 7 | RHEL 8 | remarks |
|---|---|---|
| a2ps | - | enscript is recommended as replacement; a2ps upstream, enscript upstream |
| blt | - | blt upstream, does not work with newer Tk versions (source) |
| devtoolset* | gcc-toolset* | |
| git-cvs | - | cvs itself is not supported by RHEL 8, but available through EPEL. Still missing is the support for `git cvsimport`. |
| gnome-icon-theme-legacy | - | used for RHEL 7 Icewm |
| ... | ... | here I stopped research, please report/document further packages |
Missing RAID Drivers
Missing RAID Drivers during Installation
For RHEL 8 Red Hat phased out some hardware drivers; there is an official list, but I also found some missing drivers that are not listed there.
Installation with an unsupported RAID adapter then fails as the installer does not find a system disk to use.
To figure out which driver you need, it is best to go to the installer shell or boot a rescue Linux over the network, and on the shell check the PCI device ID of the RAID controller with
$ lspci -nn
...
82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)
...
The ID is in the rightmost square brackets. Then check if there are drivers available.
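If you need to pick the ID out of that output programmatically (e.g. when surveying many hosts), a small shell snippet suffices; the sample line below is the 3ware controller from the output above:

```shell
# extract the vendor:device ID (the rightmost bracket pair) from an lspci -nn line
line='82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)'
# the class code [0104] has no colon, so only the vendor:device pair matches
id=$(printf '%s\n' "$line" | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]')
echo "$id"    # prints 13c1:1010
```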
I will now focus on ElRepo, which provides drivers that are no longer supported by Red Hat. Check the PCI device ID against their list of device IDs (https://elrepo.org/tiki/DeviceIDs). If you found a driver there, driver disks are also provided.
There are two options for providing this driver disk to the installer:
- download the corresponding `.iso` file, extract it onto a USB stick labelled `OEMDRV` and have the stick connected during installation
- extend the kernel command line with `inst.dd=$URL_OF_ISO_FILE`, e.g. with a custom Grub config on the boot server or with the sysdb/bob attribute `kernel_cmdline`
(Red Hat documentation of this procedure)
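As a sketch, the second option could look like the following Grub entry on the boot server; the paths and the URL are placeholders for wherever the kernel, initrd and driver disk image are actually served:

```
menuentry 'RHEL 8 install with ElRepo driver disk' {
    # inst.dd points the installer at the driver disk ISO (placeholder URL)
    linux  /rhel8/vmlinuz inst.dd=http://bootserver.example.org/dd/driver-disk.iso
    initrd /rhel8/initrd.img
}
```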
At the end do not forget to enable the ElRepo RPM package repository in Hiera to also get new drivers for updated kernels:
# enable 3rd-party drivers from ElRepo
rpm_repos::default:
- 'elrepo_rhel8'
Missing RAID Drivers on Kernel Upgrade
If the machine does not boot after provisioning or after a kernel upgrade, showing
Warning: /dev/mapper/vg_root-lv_root does not exist
Warning: /dev/vg_root/lv_root does not exist
after a lot of
Warning: dracut-initqueue timeout - starting timeout scripts
then it could be that support for the RAID controller was removed with the new kernel; e.g. for the LSI MegaRAID SAS there is a dedicated article.
For the LSI MegaRAID SAS there is still a driver available in ElRepo, so it can be installed during provisioning by Puppet. To do so add to Hiera:
base::pkg_group::....:
- 'kmod-megaraid_sas'
rpm_repos::default:
- 'elrepo_rhel8'
AFS cache partition not created due to existing XFS signature
It can happen when upgrading an existing RHEL 7 installation that the puppet run produces
Error: Execution of '/usr/sbin/lvcreate -n lv_openafs --size 2G vg_root' returned 5: WARNING: xfs signature detected on /dev/vg_root/lv_openafs at offset 0. Wipe it? [y/n]: [n]
This needs to be fixed manually:
- run the command from the error message and approve the wipe (or use `--yes`)
- run `puppet agent -t` to finalize the configuration
Puppet run fails to install KCM related service/timer on Slurm node
The Puppet run fails with
Notice: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0] (corrective)
Notice: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0] (corrective)
This is caused by the use of KCM as the default Kerberos credential cache in RHEL 8:
- for RHEL 8 it was recommended to use the KCM provided by sssd as the Kerberos credential cache
- a major issue of this KCM is that it does not remove outdated caches
- this leads to a denial-of-service situation: when all 64 slots are filled, new logins start to fail (this is persistent, a reboot does not help)
- we fix this issue by regularly running a cleanup script in user context
- this "user context" is handled by the `systemd --user` instance, which is started on the first login and keeps running until the last session ends
- that systemd user instance is started by `pam_systemd.so`
- `pam_systemd.so` and `pam_slurm_adopt.so` conflict because both want to set up cgroups
- because of this there is no `pam_systemd.so` configured on Slurm nodes, and thus there is no `systemd --user` instance
I see two options to solve this issue:
- do not use KCM
- somehow get a systemd user instance running
do not use KCM
This can be done in Hiera; to get back to the RHEL 7 behavior, set
aaa::default_krb_cache: "KEYRING:persistent:%{literal('%')}{uid}"
then there will be no KCM magic any more. We could also make this happen automatically in Puppet when Slurm is enabled.
somehow get a systemd user instance running
pam_systemd.so does not want to take its hands off cgroups:
https://github.com/systemd/systemd/issues/13535
But there is documentation on how to get (part of?) the pam_systemd.so functionality running with Slurm:
https://slurm.schedmd.com/pam_slurm_adopt.html#PAM_CONFIG
(the Prolog, TaskProlog and Epilog part).
I wonder whether that also starts a systemd --user instance, or whether it would be possible to somehow integrate starting one therein.
Workstation Installation Takes Long and Seems to Hang
On the very first Puppet run, the command installing the GUI packages takes up to 10 minutes and looks like it is hanging. Usually this happens after the installation of /etc/sssd/sssd.conf. Just give it a bit of time.
"yum/dnf search" Gives Permission Denied as Normal User
It works fine apart from the error message below:
Failed to store expired repos cache: [Errno 13] Permission denied: '/var/cache/dnf/x86_64/8/expired_repos.json'
which is IMHO OK, as a normal user should not be allowed to make changes there.