# Red Hat Enterprise Linux 8
## Production Ready
The central infrastructure (automatic provisioning, upstream package synchronisation and Puppet) is stable and production ready.
Configuration management is done with Puppet, as for RHEL 7. RHEL 7 and RHEL 8 hosts can share the same hierarchy in Hiera and thus also the "same" configuration. Where the configuration for RHEL 7 and RHEL 8 differs, the idea is to keep both variants in parallel in Hiera and let Puppet select the right one.
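One common way to keep both variants in parallel is a hierarchy level keyed on the OS major release. A generic Hiera 5 sketch using standard Puppet facts (illustrative only, the actual PSI hierarchy may differ):

```
# hiera.yaml excerpt (illustrative; actual PSI hierarchy may differ)
hierarchy:
  - name: "Per OS release"
    path: "os/%{facts.os.name}-%{facts.os.release.major}.yaml"
  - name: "Common"
    path: "common.yaml"
```

With such a level, `os/RedHat-7.yaml` and `os/RedHat-8.yaml` can carry the differing values while everything shared stays in `common.yaml`.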
Please also consider implementing the following two migrations when moving to RHEL 8:
- migrate from Icinga1 to [Icinga2](../admin-guide/configuration/icinga2), as Icinga1 will be decommissioned by end of 2024
- explicit [network configuration in Hiera](../admin-guide/configuration/networking) with `networking::setup`, especially if you have static IP addresses or static routes
Bugs and issues can be reported in the [Linux project in JIRA](https://jira.psi.ch/browse/PSILINUX).
## Documentation
* [Installation](installation)
* [CUDA and Nvidia Drivers](nvidia)
* [Kerberos](kerberos)
* [Desktop](desktop)
* [Hardware Compatibility](hardware_compatibility)
* [Vendor Documentation](vendor_documentation)
## Disk Layout
The default partition schema for RHEL 8 is:
- create one primary ``/boot`` partition of 1 GB;
- create the ``vg_root`` volume group that uses the rest of the disk;
- on ``vg_root`` create the following logical volumes:
  - ``lv_root`` of 14 GB size for ``/``;
  - ``lv_home`` of 2 GB size for ``/home``;
  - ``lv_var`` of 8 GB size for ``/var``;
  - ``lv_var_log`` of 3 GB size for ``/var/log``;
  - ``lv_var_tmp`` of 2 GB size for ``/var/tmp``;
  - ``lv_tmp`` of 2 GB size for ``/tmp``.
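Expressed as a kickstart sketch, this schema would look roughly like the following (illustrative only, sizes in MiB; this is not the actual provisioning template):

```
part /boot --fstype=xfs --size=1024
part pv.01 --size=1 --grow
volgroup vg_root pv.01
logvol /        --vgname=vg_root --name=lv_root    --fstype=xfs --size=14336
logvol /home    --vgname=vg_root --name=lv_home    --fstype=xfs --size=2048
logvol /var     --vgname=vg_root --name=lv_var     --fstype=xfs --size=8192
logvol /var/log --vgname=vg_root --name=lv_var_log --fstype=xfs --size=3072
logvol /var/tmp --vgname=vg_root --name=lv_var_tmp --fstype=xfs --size=2048
logvol /tmp     --vgname=vg_root --name=lv_tmp    --fstype=xfs --size=2048
```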
## Caveats
### Missing or Replaced Packages
[List of packages removed in RHEL 8](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/considerations_in_adopting_rhel_8/index#removed-packages_changes-to-packages)
| RHEL 7 | RHEL 8 | remarks |
| --- | --- | --- |
| `a2ps` | `enscript` (recommended replacement) | [`enscript` upstream](https://www.gnu.org/software/enscript/), [`a2ps` upstream](https://www.gnu.org/software/a2ps/) |
| `blt` | - | [`blt` upstream](http://blt.sourceforge.net/), does not work with newer Tk versions ([source](https://wiki.tcl-lang.org/page/BLT)) |
| `devtoolset*` | `gcc-toolset*` | |
| `git-cvs` | - | `cvs` itself is not supported in RHEL 8 but is available through EPEL; support for `git cvsimport` is still missing |
| `gnome-icon-theme-legacy` | - | was used for Icewm on RHEL 7 |
| ... | ... | here I stopped research, please report/document further packages |
### Missing RAID Drivers
#### Missing RAID Drivers during Installation
For RHEL 8, Red Hat phased out some hardware drivers. There is an [official list](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/considerations_in_adopting_rhel_8/hardware-enablement_considerations-in-adopting-rhel-8#removed-adapters_hardware-enablement), but I also found missing drivers that are not listed there.
Installation with an unsupported RAID adapter then fails as the installer does not find a system disk to use.
To figure out which driver you need, go to the installer shell or boot a rescue Linux over the network, and on the shell check the PCI device ID of the RAID controller with
```
$ lspci -nn
...
82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)
...
```
The ID is in the rightmost square brackets. Then check if there are drivers available.
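If you want to script this, the ID can be extracted from an `lspci -nn` line with standard shell tools (a small sketch using the sample line from above):

```shell
# Extract the PCI vendor:device ID, i.e. the rightmost [xxxx:xxxx] pair.
# The class code bracket (e.g. [0104]) has no colon and is not matched.
line='82:00.0 RAID bus controller [0104]: 3ware Inc 9750 SAS2/SATA-II RAID PCIe [13c1:1010] (rev 05)'
id=$(printf '%s\n' "$line" | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' | tail -n 1 | tr -d '[]')
echo "$id"   # 13c1:1010
```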
I will now focus on [ElRepo](https://elrepo.org/), which provides drivers no longer supported by Red Hat. Check the PCI device ID against their [list of device IDs](https://elrepo.org/tiki/DeviceIDs). If you find a driver there, [driver disks are also provided](https://linuxsoft.cern.ch/elrepo/dud/el8/x86_64/).
There are two options for providing this driver disk to the installer:
1. Download the according `.iso` file, extract it onto a USB stick labelled `OEMDRV` and have the stick connected during installation.
2. Extend the kernel command line with `inst.dd=$URL_OF_ISO_FILE`, e.g. with a custom Grub config on the [boot server](https://git.psi.ch/linux-infra/network-boot) or with the sysdb/bob attribute `kernel_cmdline`.
([Red Hat documentation of this procedure](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html-single/performing_an_advanced_rhel_8_installation/index#updating-drivers-during-installation_installing-rhel-as-an-experienced-user))
At the end do not forget to enable the ElRepo RPM package repository in Hiera to also get new drivers for updated kernels:
```
# enable 3rd-party drivers from ElRepo
rpm_repos::default:
- 'elrepo_rhel8'
```
#### Missing RAID Drivers on Kernel Upgrade
If the machine does not boot after provisioning or after a kernel upgrade, with
```
Warning: /dev/mapper/vg_root-lv_root does not exist
Warning: /dev/vg_root/lv_root does not exist
```
after a lot of
```
Warning: dracut-initqueue timeout - starting timeout scripts
```
then it could be that support for the RAID controller was removed with the new kernel, e.g. for the LSI MegaRAID SAS there is a [dedicated article](https://access.redhat.com/solutions/3751841).
For the LSI MegaRAID SAS a driver is still available in ElRepo, so it can be installed by Puppet during provisioning. To do so, add to Hiera:
```
base::pkg_group::....:
- 'kmod-megaraid_sas'

rpm_repos::default:
- 'elrepo_rhel8'
```
### AFS cache partition not created due to existing XFS signature
When upgrading an existing RHEL 7 installation it can happen that the Puppet run produces
```
Error: Execution of '/usr/sbin/lvcreate -n lv_openafs --size 2G vg_root' returned 5: WARNING: xfs signature detected on /dev/vg_root/lv_openafs at offset 0. Wipe it? [y/n]: [n]
```
This needs to be fixed manually:
- run the failing command manually and approve the wipe (or add `--yes`)
- run `puppet agent -t` to finalize the configuration
### Puppet run fails to install KCM related service/timer on Slurm node
The Puppet run fails with
```
Notice: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Systemd::Service[kcm-destroy]/Exec[start-global-user-service-kcm-destroy]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-destroy.service' returned 1 instead of one of [0] (corrective)
Notice: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: Failed to connect to bus: Connection refused
Error: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0]
Error: /Stage[main]/Profile::Aaa/Profile::Custom_timer[kcm-cleanup]/Systemd::Timer[kcm-cleanup]/Exec[start-global-user-timer-kcm-cleanup]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/systemctl --quiet start --global kcm-cleanup.timer' returned 1 instead of one of [0] (corrective)
```
This is caused by the use of KCM as the default Kerberos credential cache in RHEL 8:
- for RHEL 8 it was recommended to use the KCM provided by sssd as the Kerberos credential cache
- a major issue of this KCM is that it does not remove outdated caches
- this leads to a denial-of-service situation: once all 64 slots are filled, new logins start to fail (this is persistent, a reboot does not help)
- we fix this issue by regularly running a cleanup script in the user context
- this "user context" is handled by the `systemd --user` instance, which is started on the first login and keeps running until the last session ends
- that systemd user instance is started by `pam_systemd.so`
- `pam_systemd.so` and `pam_slurm_adopt.so` conflict because both want to set up cgroups
- because of this there is no `pam_systemd.so` configured on Slurm nodes, and thus no `systemd --user` instance
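For context, the units mentioned in the Puppet log are ordinary `systemd --user` units. A generic sketch of what such a cleanup timer can look like (the name matches the log above, but the contents here are purely illustrative assumptions, not the actual PSI unit):

```
# kcm-cleanup.timer (illustrative sketch, not the real PSI unit)
[Unit]
Description=Regularly clean up stale KCM credential caches

[Timer]
OnUnitActiveSec=1h

[Install]
WantedBy=timers.target
```

Without a running `systemd --user` instance there is simply nothing to execute such units, which is what the "Failed to connect to bus" errors above reflect.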
I see two options to solve this issue:
- do not use KCM
- get a systemd user instance running somehow
#### do not use KCM
This can be done in Hiera; to get back to the RHEL 7 behavior, set
    aaa::default_krb_cache: "KEYRING:persistent:%{literal('%')}{uid}"
then there will be no KCM magic any more.
We could also make this automatically happen in Puppet when Slurm is enabled.
#### get a systemd user instance running somehow
`pam_systemd.so` does not want to take its hands off cgroups:
https://github.com/systemd/systemd/issues/13535
But it is documented how to get (part of?) the `pam_systemd.so` functionality running with Slurm:
https://slurm.schedmd.com/pam_slurm_adopt.html#PAM_CONFIG
(the Prolog, TaskProlog and Epilog part).
I wonder whether that also starts a `systemd --user` instance, or whether the start of one could somehow be integrated there.
### Workstation Installation Takes Long and Seems to Hang
On the very first Puppet run the command to install the GUI packages takes up to 10 minutes and looks like it is hanging, usually right after the installation of `/etc/sssd/sssd.conf`. Just give it a bit of time.
### "yum/dnf search" Gives Permission Denied as Normal User
It works fine apart from the error message below:
```
Failed to store expired repos cache: [Errno 13] Permission denied: '/var/cache/dnf/x86_64/8/expired_repos.json'
```
which is IMHO OK, as a normal user should not be allowed to make changes there.
|