Red Hat Enterprise Linux 8
Production Ready
The central infrastructure (automatic provisioning, upstream package synchronisation and Puppet) is stable and production ready.
Configuration management is done with Puppet, as for RHEL 7. RHEL 7 and RHEL 8 hosts can share the same hierarchy in Hiera and thus also the "same" configuration. Where the configuration for RHEL 7 and RHEL 8 differs, the idea is to keep both variants in parallel in Hiera and let Puppet select the right one.
When moving to RHEL 8, please also consider implementing the following two migrations:
- migrate from Icinga1 to Icinga2, as Icinga1 will be decommissioned by the end of 2024
- configure the network explicitly in Hiera with `networking::setup`, especially if you have static IP addresses or static routes
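As a sketch, such an explicit network configuration in Hiera could look like the snippet below. The key names besides `networking::setup` (`networking::interfaces`, `ipaddr`, `gateway`, `networking::routes`) are hypothetical; check the site's networking module for the real ones:

```
# hypothetical Hiera keys -- consult the networking module for the actual names
networking::setup: true
networking::interfaces:
  eno1:
    ipaddr: '192.0.2.10'       # example static address (TEST-NET-1)
    netmask: '255.255.255.0'
    gateway: '192.0.2.1'
networking::routes:
  - '198.51.100.0/24 via 192.0.2.1'   # example static route
```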
Bugs and issues can be reported in the Linux project in JIRA.
Documentation
Disk Layout
The default partition schema for RHEL 8 is:
- create one primary `/boot` partition of 1 GB
- create the `vg_root` volume group that uses the rest of the disk
- on `vg_root` create the following logical volumes:
  - `lv_root` of 14 GB size for `/`
  - `lv_home` of 2 GB size for `/home`
  - `lv_var` of 8 GB size for `/var`
  - `lv_var_log` of 3 GB size for `/var/log`
  - `lv_var_tmp` of 2 GB size for `/var/tmp`
  - `lv_tmp` of 2 GB size for `/tmp`
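Expressed as Anaconda kickstart directives, this schema would look roughly as follows. This is a sketch assuming XFS as the filesystem; the actual provisioning templates may differ:

```
# /boot as primary partition, rest of the disk as physical volume for vg_root
part /boot --fstype=xfs --size=1024 --asprimary
part pv.01 --grow
volgroup vg_root pv.01
# logical volumes on vg_root (sizes in MiB)
logvol /        --vgname=vg_root --name=lv_root    --fstype=xfs --size=14336
logvol /home    --vgname=vg_root --name=lv_home    --fstype=xfs --size=2048
logvol /var     --vgname=vg_root --name=lv_var     --fstype=xfs --size=8192
logvol /var/log --vgname=vg_root --name=lv_var_log --fstype=xfs --size=3072
logvol /var/tmp --vgname=vg_root --name=lv_var_tmp --fstype=xfs --size=2048
logvol /tmp     --vgname=vg_root --name=lv_tmp     --fstype=xfs --size=2048
```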
Caveats
Missing or Replaced Packages
List of packages removed in RHEL 8
| RHEL 7 | RHEL 8 | remarks |
|---|---|---|
| a2ps | - | enscript is recommended as replacement; a2ps upstream, enscript upstream |
| blt | - | blt upstream, does not work with newer Tk versions (source) |
| devtoolset* | gcc-toolset* | |
| git-cvs | - | cvs itself is not supported by RHEL 8, but available through EPEL. Still missing is the support for `git cvsimport`. |
| gnome-icon-theme-legacy | - | used for RHEL 7 Icewm |
| ... | ... | here I stopped research, please report/document further packages |
Missing RAID Drivers
Missing RAID Drivers during Installation
For RHEL 8 Red Hat phased out some hardware drivers; there is an official list, but I also found some missing drivers that are not listed there.
Installation with an unsupported RAID adapter then fails as the installer does not find a system disk to use.
To figure out which driver you need, it is best to go to the installer shell or boot a rescue Linux over the network, and on the shell check the PCI device ID of the RAID controller with
$ lspci -nn
...
82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)
...
The ID is in the rightmost square brackets. Then check if there are drivers available.
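If you need to pick the ID out of that output programmatically (e.g. when surveying many hosts), a small shell snippet suffices; the sample line below is the 3ware controller from the output above:

```shell
# extract the vendor:device ID (the rightmost bracket pair) from an lspci -nn line
line='82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)'
# the class code [0104] has no colon, so only the vendor:device pair matches
id=$(printf '%s\n' "$line" | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tr -d '[]')
echo "$id"    # prints 13c1:1010
```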
I will now focus on ElRepo, which provides drivers that are no longer supported by Red Hat. Check the PCI device ID against their list of device IDs (https://elrepo.org/tiki/DeviceIDs). If you found a driver there, driver disks are also provided.
There are two options for providing this driver disk to the installer:
- download the corresponding `.iso` file, extract it onto a USB stick labelled `OEMDRV` and have the stick connected during installation
- extend the kernel command line with `inst.dd=$URL_OF_ISO_FILE`, e.g. with a custom Grub config on the boot server or with the sysdb/bob attribute `kernel_cmdline`
(Red Hat documentation of this procedure)
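As a sketch, the second option could look like the following Grub entry on the boot server; the paths and the URL are placeholders for wherever the kernel, initrd and driver disk image are actually served:

```
menuentry 'RHEL 8 install with ElRepo driver disk' {
    # inst.dd points the installer at the driver disk ISO (placeholder URL)
    linux  /rhel8/vmlinuz inst.dd=http://bootserver.example.org/dd/driver-disk.iso
    initrd /rhel8/initrd.img
}
```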
At the end do not forget to enable the ElRepo RPM package repository in Hiera to also get new drivers for updated kernels:
# enable 3rd-party drivers from ElRepo
rpm_repos::default:
- 'elrepo_rhel8'
Missing RAID Drivers on Kernel Upgrade
If the machine does not boot after provisioning or after a kernel upgrade, showing
Warning: /dev/mapper/vg_root-lv_root does not exist
Warning: /dev/vg_root/lv_root does not exist
after a lot of
Warning: dracut-initqueue timeout - starting timeout scripts
then it could be that support for the RAID controller was removed with the new kernel; e.g. for the LSI MegaRAID SAS there is a dedicated article.
For the LSI MegaRAID SAS there is still a driver available in ElRepo, so it can be installed during provisioning by Puppet. To do so add to Hiera:
base::pkg_group::....:
- 'kmod-megaraid_sas'
rpm_repos::default:
- 'elrepo_rhel8'
AFS cache partition not created due to existing XFS signature
It can happen when upgrading an existing RHEL 7 installation that the puppet run produces
Error: Execution of '/usr/sbin/lvcreate -n lv_openafs --size 2G vg_root' returned 5: WARNING: xfs signature detected on /dev/vg_root/lv_openafs at offset 0. Wipe it? [y/n]: [n]
This needs to be fixed manually:
- run the command from the error message and approve the wipe (or use `--yes`)
- run `puppet agent -t` to finalize the configuration
Puppet run fails to install KCM related service/timer on Slurm node
The Puppet run fails with
Notice: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0] (corrective)
Notice: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0] (corrective)
This is caused by the use of KCM as the default Kerberos credential cache in RHEL 8:
- for RHEL 8 it was recommended to use the KCM provided by sssd as the Kerberos credential cache
- a major issue of this KCM is that it does not remove outdated caches
- this leads to a denial-of-service situation: when all 64 slots are filled, new logins start to fail (this is persistent, a reboot does not help)
- we fix this issue by regularly running a cleanup script in user context
- this "user context" is handled by the `systemd --user` instance, which is started on the first login and keeps running until the last session ends
- that systemd user instance is started by `pam_systemd.so`
- `pam_systemd.so` and `pam_slurm_adopt.so` conflict because both want to set up cgroups
- because of this there is no `pam_systemd.so` configured on Slurm nodes, and thus there is no `systemd --user` instance
I see two options to solve this issue:
- do not use KCM
- somehow get a systemd user instance running
do not use KCM
This can be done in Hiera; to get back to the RHEL 7 behavior, set
aaa::default_krb_cache: "KEYRING:persistent:%{literal('%')}{uid}"
then there will be no KCM magic any more. We could also make this happen automatically in Puppet when Slurm is enabled.
somehow get a systemd user instance running
pam_systemd.so does not want to take its hands off cgroups:
https://github.com/systemd/systemd/issues/13535
But there is documentation on how to get (part of?) the pam_systemd.so functionality running with Slurm:
https://slurm.schedmd.com/pam_slurm_adopt.html#PAM_CONFIG
(the Prolog, TaskProlog and Epilog part).
I wonder whether that also starts a systemd --user instance, or whether it would be possible to somehow integrate starting one therein.
Workstation Installation Takes Long and Seems to Hang
On the very first Puppet run, the command installing the GUI packages takes up to 10 minutes and looks like it is hanging. Usually this happens after the installation of /etc/sssd/sssd.conf. Just give it a bit of time.
"yum/dnf search" Gives Permission Denied as Normal User
It works fine apart from the error message below:
Failed to store expired repos cache: [Errno 13] Permission denied: '/var/cache/dnf/x86_64/8/expired_repos.json'
which is IMHO OK, as a normal user should not be allowed to make changes there.