
# CUDA and Proprietary Nvidia GPU Drivers on RHEL 8

Managing Nvidia software comes with its own set of challenges.
The most common cases are covered by our Puppet configuration.
Those are discussed in the first chapter; more details follow further below.


## Hiera Configuration

Changes in Hiera are forwarded by Puppet to the node, but **not applied**.
They are applied on **reboot**.
Alternatively, you can run `/opt/pli/libexec/ensure-nvidia-software` at a safe moment: no process may be using CUDA, and the desktop will be restarted.
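
Whether the moment is safe could be checked roughly as sketched here. This is only a minimal sketch and not part of the Puppet setup: the helper names `count_compute_pids` and `apply_if_idle` are made up for illustration, and it assumes `nvidia-smi` is available on the node. It only sees compute (CUDA) processes, so it does not protect a running desktop session from being restarted.

```shell
# Count the lines on stdin that contain a PID (one PID per line expected).
count_compute_pids() {
    grep -c '[0-9]' || true   # grep exits non-zero on zero matches, ignore that
}

# Hypothetical wrapper: run the PSI script only if no CUDA process is active.
# If nvidia-smi is missing or fails, the count is 0 and the script is run.
apply_if_idle() {
    busy=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | count_compute_pids)
    if [ "$busy" -eq 0 ]; then
        /opt/pli/libexec/ensure-nvidia-software
    else
        echo "GPU busy ($busy compute processes), not applying changes" >&2
        return 1
    fi
}
```

Call `apply_if_idle` as root once the functions are defined.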

### I just need the Nvidia GPU drivers

Nothing needs to be done, they are installed by default when Nvidia GPUs or accelerators are found.

### I need CUDA

Set `nvidia::cuda::enable: true` in Hiera and it will automatically install suitable Nvidia drivers and the newest possible CUDA version.

The `nvidia-persistenced` service is automatically started. If you do not want it, set `nvidia::cuda::nvidia_persistenced::enable: false`.

### I need a specific CUDA version

Then you can additionally set `nvidia::cuda::version` to the desired version.
The version must be fully specified (all three numbers, with X.Y.0 for the GA version).
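
Before putting a version into Hiera, its format can be checked with a small helper like the following. This is a sketch only; the function name `is_full_version` is made up for illustration and it checks nothing but the "three numbers" shape, not whether such a version exists in the repository.

```shell
# Succeeds only if the argument consists of exactly three dot-separated numbers.
is_full_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example usage with arbitrary example values:
is_full_version "11.8.0" && echo "11.8.0 is fully specified"
is_full_version "11.8"   || echo "11.8 is missing the third number"
```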

Note that newer CUDA versions do not support older drivers, for details see Table 3 in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

### I do not want the Nvidia drivers

Set in Hiera `nvidia::driver::enable: false`. Note this will be ignored if CUDA is enabled (see above).

Note that already installed drivers are not removed automatically; you need to remove them by hand.

### I need the Nvidia drivers from a specific driver branch

The driver branch can be selected in Hiera with `nvidia::driver::branch`. It will then use the latest driver version of that branch. Note that only production branches are available in the PSI package repository.

### I need an Nvidia driver of a given version

This is not recommended, but it is possible by setting the exact driver version (X.Y.Z, excluding the package iteration number) in Hiera with `nvidia::driver::version`.

If the driver version is too old, it will install an older kernel version and you will need a second reboot to activate it.


## Versioning Mess

I did not find much information about the Nvidia driver version structure and policy. Still, I concluded that they use the following pattern.

### Driver Branches

Their drivers are organized in driver branches, as you can see for example in their [Unix Driver Archive](https://www.nvidia.com/en-us/drivers/unix/), where they are noted as e.g. `470.xx series`.

There are `Production` and `New Feature` branches (and, on the above linked page, a `Beta Version` which is not linked to any of the above branches (yet?)).

Such a branch can be considered a major release, with new branches adding support for new hardware or removing support for old hardware.
The drivers within a branch are maintained for quite a long time. Individual drivers in that branch get increasing version numbers which all start with the same first "branch" number.

In the RPM repo there are more branches available than listed in the [Unix Driver Archive](https://www.nvidia.com/en-us/drivers/unix/). It is not possible to find out retrospectively which type a given branch belongs to. My guess is that the "Legacy" section lists only the production/long term support branches.

Also it is not possible to find out from the package meta information if a driver is considered beta or not. That you only find out by googling "Nvidia $DRIVER_VERSION" and looking at the respective driver page. In my experience the first few driver versions of a branch are usually "beta".

### What Driver \[Branch] for which Hardware

To figure out what driver branch to use for given hardware, go to their [Download page](https://www.nvidia.de/Download/index.aspx) and search for its Linux driver. It will then point out a driver version, and the first number of that version indicates the driver branch to use.
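
Extracting the branch from a driver version is a plain string operation. A minimal sketch, with the function name `driver_branch` made up for illustration:

```shell
# Print the branch (first number) of a driver version such as 460.106.00.
driver_branch() {
    printf '%s\n' "${1%%.*}"   # strip everything from the first dot onwards
}

driver_branch "460.106.00"
```

On a node with a working driver, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function.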

Note that this is not always the full story. For example the Tesla K40c gives [driver 460.106.00](https://www.nvidia.de/Download/driverResults.aspx/182244/en-us), whereas the [470 driver](https://www.nvidia.com/Download/driverResults.aspx/194637/en-us/) still works, even though the hardware is not listed as supported there. My guess is that they somehow publicly differentiate between "Data Center Driver" and "Display Driver", but still they have everything in, or at least in the production/long term support branch.

Another option to figure out the driver is the third-party tool [`nvidia-detect`](http://elrepo.org/tiki/nvidia-detect) by ElRepo. It tells which driver package from ElRepo it suggests, but it can also be used to figure out which production/long term support branch can be used (and only production/long term support branches, e.g. it would never point out the 460 branch and this is how I figured out that Tesla K40c works with 470 despite the Nvidia documentation not saying so).

### CUDA - Driver Compatibility

A CUDA version needs a suitably new driver version, but old CUDA versions are supported by newer driver versions (drivers are backwards-compatible). To figure out up to which CUDA version your installed driver supports, check out "Table 3" of the [CUDA release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). For each driver branch there is a major 11.x.0 release with possible further bugfix releases.
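
Once the minimum driver version for a CUDA release has been looked up in that table, the comparison with the installed driver can be scripted. The following is a sketch under assumptions: the function name `version_ge` is made up, it relies on `sort -V` from GNU coreutils, and the version numbers in the usage line are arbitrary placeholders, not values taken from the table.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if $2 sorts first (or both are equal),
    # then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Placeholder values: installed driver first, required minimum second.
if version_ge "470.82.01" "470.42.01"; then
    echo "driver is new enough"
fi
```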


## Manual Operation

Instead of using Puppet/Hiera, you may also manage the drivers manually.

Note that the drivers made available by default are curated, that is, they contain only non-beta production drivers. If you want all drivers available, you need to use `https://repo01.psi.ch/el8/sources/cuda8/` as URL for the package repository.

### Select the Driver Branch

In the RPM package repository the driver branches are mapped to module streams, so there are different streams for different branches and `dnf module list nvidia-driver` will tell you what is available:

```
# dnf module list nvidia-driver
Last metadata expiration check: 2:37:29 ago on Mon 28 Nov 2022 09:15:57 AM CET.
CUDA and drivers from Nvidia
Name            Stream            Profiles                  Summary
nvidia-driver   latest            default [d], fm, ks, src  Nvidia driver for latest branch
nvidia-driver   latest-dkms [d]   default [d], fm, ks       Nvidia driver for latest-dkms branch
nvidia-driver   open-dkms         default [d], fm, ks, src  Nvidia driver for open-dkms branch
nvidia-driver   418               default [d], fm, ks, src  Nvidia driver for 418 branch
nvidia-driver   418-dkms          default [d], fm, ks       Nvidia driver for 418-dkms branch
nvidia-driver   440               default [d], fm, ks, src  Nvidia driver for 440 branch
nvidia-driver   440-dkms          default [d], fm, ks       Nvidia driver for 440-dkms branch
nvidia-driver   450               default [d], fm, ks, src  Nvidia driver for 450 branch
nvidia-driver   450-dkms          default [d], fm, ks       Nvidia driver for 450-dkms branch
nvidia-driver   455               default [d], fm, ks, src  Nvidia driver for 455 branch
nvidia-driver   455-dkms          default [d], fm, ks       Nvidia driver for 455-dkms branch
nvidia-driver   460               default [d], fm, ks, src  Nvidia driver for 460 branch
nvidia-driver   460-dkms          default [d], fm, ks       Nvidia driver for 460-dkms branch
nvidia-driver   465               default [d], fm, ks, src  Nvidia driver for 465 branch
nvidia-driver   465-dkms          default [d], fm, ks       Nvidia driver for 465-dkms branch
nvidia-driver   470               default [d], fm, ks, src  Nvidia driver for 470 branch
nvidia-driver   470-dkms [e]      default [d] [i], fm, ks   Nvidia driver for 470-dkms branch
nvidia-driver   495               default [d], fm, ks, src  Nvidia driver for 495 branch
nvidia-driver   495-dkms          default [d], fm, ks       Nvidia driver for 495-dkms branch
nvidia-driver   510               default [d], fm, ks, src  Nvidia driver for 510 branch
nvidia-driver   510-dkms          default [d], fm, ks       Nvidia driver for 510-dkms branch
nvidia-driver   515               default [d], fm, ks, src  Nvidia driver for 515 branch
nvidia-driver   515-dkms          default [d], fm, ks       Nvidia driver for 515-dkms branch
nvidia-driver   515-open          default [d], fm, ks, src  Nvidia driver for 515-open branch
nvidia-driver   520               default [d], fm, ks, src  Nvidia driver for 520 branch
nvidia-driver   520-dkms          default [d], fm, ks       Nvidia driver for 520-dkms branch
nvidia-driver   520-open          default [d], fm, ks, src  Nvidia driver for 520-open branch

Hint: [d]efault, [e]nabled, [x]disabled, [i]nstalled
#
```
The first try would be to pick the number of the desired branch. Currently the `520*` and `latest` streams are empty because the drivers were removed.

The "number only" module streams contain precompiled drivers for some kernels. Note that for older branches or older drivers it may not be precompiled for the latest kernel version. For older branches I had the experience that the `*-dkms` module stream works better for newer kernels. But I did not manage to do "real" DKMS with them, that means compiling the translation layer of any given driver version for whatever kernel. Feel free to update this guide or to tell the Core Linux Team if you found a working procedure.

Finally the `*-open` module streams contain the new open source drivers which currently do not provide the full feature set of the proprietary ones.

### Install a Driver

It works best to install the whole module stream:
```
dnf module install "nvidia-driver:$STREAM"
```

Alternatively the module stream can be enabled first (`dnf module enable "nvidia-driver:$STREAM"`) and the packages installed individually afterwards, but then you have to figure out yourself which packages are needed.

If the installation command is rather unhappy and complains a lot about `is filtered out by modular filtering`, then there is already a module stream enabled and some driver installed. So to clean that up do:
```
dnf remove cuda-driver nvidia-driver
dnf module reset nvidia-driver
```
Note that this will also remove installed CUDA packages.
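
To see which stream is currently enabled before resetting, the `[e]` marker in the `dnf module list` output can be picked out. A minimal sketch, assuming the column layout shown in the listing above; the function name `enabled_stream` is made up for illustration:

```shell
# Read `dnf module list nvidia-driver` output on stdin and print the name of
# every stream whose marker column contains [e] (enabled).
enabled_stream() {
    awk '$1 == "nvidia-driver" && $3 ~ /\[e\]/ { print $2 }'
}

# Usage on a real node:
#   dnf module list nvidia-driver | enabled_stream
```

With the listing above as input this prints `470-dkms`.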

### Install CUDA

It is not recommended to install the `cuda` meta-package directly, because that requires the latest drivers from the "new feature" branch. It is better to install the `cuda-11-x` meta-package instead, which installs the CUDA version suitable to your driver and then keeps it updated with bugfix releases of this specific major release. Check out Table 3 in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) for details.
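
The meta-package name follows from the first two numbers of the CUDA version. A small sketch of that mapping, with the function name `cuda_metapackage` made up for illustration and `11.8.0` used only as an example value:

```shell
# Print the meta-package name for a CUDA version, e.g. 11.8.0 -> cuda-11-8.
# Note: plain POSIX sh, so the helper variables are global.
cuda_metapackage() {
    major=${1%%.*}        # part before the first dot
    rest=${1#*.}          # part after the first dot
    minor=${rest%%.*}     # second number only
    printf 'cuda-%s-%s\n' "$major" "$minor"
}

# Usage on a real node (as root):
#   dnf install "$(cuda_metapackage 11.8.0)"
```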

The `cuda` meta-package is by default excluded as explained above. If you still want to use it, do
```
dnf --disableexcludes cuda install cuda
```

After manual CUDA installation you should think about enabling and starting `nvidia-persistenced`:
```
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
```


## Regular Tasks by the Core Linux Team
- classify new driver branches and beta versions in the [snapshot preparation script](https://git.psi.ch/linux-infra/repo01_pli-scripts/-/blob/master/libexec/fix-snapshot/20_remove_nvidia_beta_drivers#L90)
- update the latest production branch in [Puppet managed Nvidia software installation script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/ensure-nvidia-software#L17)
- add more production/long term support branches supported by [`nvidia-detect`](http://elrepo.org/tiki/nvidia-detect)  to the [Puppet managed Nvidia software installation script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/ensure-nvidia-software#L62)
- update the [driver version to CUDA version mapping script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/suitable_cuda_version#L21) according to new entries in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)