# CUDA and Proprietary Nvidia GPU Drivers on RHEL 8

Managing Nvidia software comes with its own set of challenges.
The most common cases are covered by our Puppet configuration.
Those are discussed in the first chapter; more details follow further below.

## Hiera Configuration

Changes in Hiera are forwarded by Puppet to the node, but **not applied**.
They are applied on **reboot**.
Alternatively you can run `/opt/pli/libexec/ensure-nvidia-software` at a safe moment: no process may be using CUDA, and the desktop will be restarted.

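Whether the moment is safe can be checked before running the script by hand. The following is only a minimal sketch and not part of the Puppet setup: the helper name `gpu_is_idle` is made up for illustration, and it assumes that `nvidia-smi --query-compute-apps=pid --format=csv,noheader` prints one PID per line for every process currently using CUDA.

```shell
# Hypothetical helper: succeeds if the given list of compute PIDs is empty.
gpu_is_idle() {
    # Strip all whitespace so that blank output counts as "no process"
    [ -z "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Usage on a GPU node (assumption: nvidia-smi is installed and working):
#   if gpu_is_idle "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)"; then
#       /opt/pli/libexec/ensure-nvidia-software
#   fi
```

Note that this only looks at compute processes; the desktop session is restarted by the script either way.
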
### I just need the Nvidia GPU drivers

Nothing needs to be done; they are installed by default when Nvidia GPUs or accelerators are found.

### I need CUDA

Set `nvidia::cuda::enable: true` in Hiera and it will automatically install suitable Nvidia drivers and the newest possible CUDA version.

The `nvidia_persistenced` service is started automatically. If you do not want it, set `nvidia::cuda::nvidia_persistenced::enable: false`.

### I need a specific CUDA version

Then you can additionally set `nvidia::cuda::version` to the desired version.
The version must be fully specified (all three numbers, with X.Y.0 for the GA version).

Note that newer CUDA versions do not support older drivers; for details see Table 3 in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).

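The "fully specified" rule can be checked before a value is committed to Hiera. This is a small sketch under the assumption that a version always consists of exactly three dot-separated numbers; the function name `is_full_cuda_version` is hypothetical and not part of our tooling.

```shell
# Hypothetical helper: succeeds only for fully specified versions such as 11.4.0
is_full_cuda_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

for v in 11.4 11.4.0 11.4.2; do
    if is_full_cuda_version "$v"; then
        echo "$v: ok"
    else
        echo "$v: incomplete, use X.Y.Z"
    fi
done
```
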
### I do not want the Nvidia drivers

Set `nvidia::driver::enable: false` in Hiera. Note that this will be ignored if CUDA is enabled (see above).

Note that drivers which are already installed do not get removed automatically. You would need to do that by hand.

### I need the Nvidia drivers from a specific driver branch

The driver branch can be selected in Hiera with `nvidia::driver::branch`. It will then use the latest driver version of that branch. Note that only production branches are available in the PSI package repository.

### I need an Nvidia driver of a given version

This is not recommended, but it is possible by setting the exact driver version (X.Y.Z, excluding the package iteration number) in Hiera with `nvidia::driver::version`.

If the driver version is too old, an older kernel version will be installed and you will need a second reboot to activate it.

## Versioning Mess

I did not find much information about the Nvidia driver version structure and policy. Still, I concluded that they use the following pattern.

### Driver Branches

Their drivers are organized in driver branches, as you can see for example in their [Unix Driver Archive](https://www.nvidia.com/en-us/drivers/unix/), where they are noted as e.g. `470.xx series`.

There are `Production` and `New Feature` branches (and, on the page linked above, a `Beta Version` which is not linked to any of these branches (yet?)).

Such a branch can be considered a major release, with new branches adding support for new hardware or removing support for old hardware.
The drivers within a branch are maintained for quite a long time. Individual drivers in a branch get increasing version numbers which all start with the same first "branch" number.

In the RPM repo there are more branches available than listed in the [Unix Driver Archive](https://www.nvidia.com/en-us/drivers/unix/). It is not possible to find out retrospectively to what type a branch belongs. My guess is that the "Legacy" section lists only the production/long term support branches.

Also, it is not possible to find out from the package meta information whether a driver is considered beta or not. You only find that out by googling "Nvidia $DRIVER_VERSION" and looking at the respective driver page. In my experience the first few driver versions of a branch are usually "beta".

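Since the branch is simply the first number of the driver version, it can be extracted with plain shell parameter expansion. A minimal sketch; the function name `driver_branch` is made up for illustration.

```shell
# Hypothetical helper: print the branch (first number) of a driver version
driver_branch() {
    # Remove everything from the first dot onwards
    printf '%s\n' "${1%%.*}"
}

driver_branch 460.106.00   # prints 460

# On a node with an installed driver the version could be read with e.g.
#   driver_branch "$(modinfo -F version nvidia)"
# (assumption: the kernel module is named "nvidia")
```
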
### What Driver \[Branch] for which Hardware

To figure out which driver branch to use for given hardware, go to their [Download page](https://www.nvidia.de/Download/index.aspx) and search for its Linux driver. It will then point out a driver version, and the first number of that version is the driver branch to use.

Note that this is not always the full story. For example, for the Tesla K40c it gives [driver 460.106.00](https://www.nvidia.de/Download/driverResults.aspx/182244/en-us), whereas the [470 driver](https://www.nvidia.com/Download/driverResults.aspx/194637/en-us/) still works, even though the hardware is not listed as supported there. My guess is that they somehow publicly differentiate between "Data Center Driver" and "Display Driver", but the drivers still contain everything, at least in the production/long term support branches.

Another option to figure out the driver is the third-party tool [`nvidia-detect`](http://elrepo.org/tiki/nvidia-detect) by ElRepo. It tells you which driver package from ElRepo it suggests, but it can also be used to figure out which production/long term support branch can be used (and only production/long term support branches; e.g. it would never point out the 460 branch, and this is how I figured out that the Tesla K40c works with 470 despite the Nvidia documentation not saying so).

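To search the download page you first need the exact model of the installed card. The sketch below filters `lspci -nn` output for the PCI vendor ID `10de`, which belongs to NVIDIA; the function name `nvidia_devices` is hypothetical, and it assumes that `lspci` (package `pciutils`) is available on the node.

```shell
# Hypothetical helper: keep only the lines of `lspci -nn` output (on stdin)
# that belong to PCI vendor ID 10de (NVIDIA)
nvidia_devices() {
    # "|| true" so that a node without Nvidia hardware is not an error
    grep -i '\[10de:' || true
}

# Usage on a node:
#   lspci -nn | nvidia_devices
```
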
### CUDA - Driver Compatibility

A CUDA version needs a sufficiently new driver version, but old CUDA versions are supported by newer driver versions (drivers are backwards-compatible). To figure out the newest CUDA version that runs on your installed driver, check "Table 3" of the [CUDA release notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). For each driver branch there is a major 11.x.0 release with possible further bugfix releases.

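Once the minimum driver version for a CUDA release has been looked up in the table, the comparison with the installed driver can be scripted. The following sketch relies on `sort -V` (version sort from GNU coreutils); the function name `version_ge` and both version values are only illustrative assumptions and must be replaced with the real values for your case.

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2
version_ge() {
    # sort -V orders version strings numerically per component,
    # so the smaller of the two versions comes first
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

installed="470.82.01"   # example value, e.g. from `modinfo -F version nvidia`
required="450.80.02"    # example value, look up the real minimum in Table 3

if version_ge "$installed" "$required"; then
    echo "driver $installed is new enough"
else
    echo "driver $installed is too old, need at least $required"
fi
```

A plain string comparison would not be enough here, because for example `460.106.00` is newer than `460.32.03` although it sorts lower alphabetically.
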
## Manual Operation

Instead of using Puppet/Hiera, you may also manage the drivers manually.

Note that the drivers made available by default are curated, meaning that only non-beta production drivers are included. If you want all drivers to be available, you need to use `https://repo01.psi.ch/el8/sources/cuda8/` as URL for the package repository.

### Select the Driver Branch

In the RPM package repository the driver branches are mapped to module streams, so there are different streams for different branches, and `dnf module list nvidia-driver` will tell you what is available:

```
# dnf module list nvidia-driver
Last metadata expiration check: 2:37:29 ago on Mon 28 Nov 2022 09:15:57 AM CET.
CUDA and drivers from Nvidia
Name           Stream           Profiles                  Summary
nvidia-driver  latest           default [d], fm, ks, src  Nvidia driver for latest branch
nvidia-driver  latest-dkms [d]  default [d], fm, ks       Nvidia driver for latest-dkms branch
nvidia-driver  open-dkms        default [d], fm, ks, src  Nvidia driver for open-dkms branch
nvidia-driver  418              default [d], fm, ks, src  Nvidia driver for 418 branch
nvidia-driver  418-dkms         default [d], fm, ks       Nvidia driver for 418-dkms branch
nvidia-driver  440              default [d], fm, ks, src  Nvidia driver for 440 branch
nvidia-driver  440-dkms         default [d], fm, ks       Nvidia driver for 440-dkms branch
nvidia-driver  450              default [d], fm, ks, src  Nvidia driver for 450 branch
nvidia-driver  450-dkms         default [d], fm, ks       Nvidia driver for 450-dkms branch
nvidia-driver  455              default [d], fm, ks, src  Nvidia driver for 455 branch
nvidia-driver  455-dkms         default [d], fm, ks       Nvidia driver for 455-dkms branch
nvidia-driver  460              default [d], fm, ks, src  Nvidia driver for 460 branch
nvidia-driver  460-dkms         default [d], fm, ks       Nvidia driver for 460-dkms branch
nvidia-driver  465              default [d], fm, ks, src  Nvidia driver for 465 branch
nvidia-driver  465-dkms         default [d], fm, ks       Nvidia driver for 465-dkms branch
nvidia-driver  470              default [d], fm, ks, src  Nvidia driver for 470 branch
nvidia-driver  470-dkms [e]     default [d] [i], fm, ks   Nvidia driver for 470-dkms branch
nvidia-driver  495              default [d], fm, ks, src  Nvidia driver for 495 branch
nvidia-driver  495-dkms         default [d], fm, ks       Nvidia driver for 495-dkms branch
nvidia-driver  510              default [d], fm, ks, src  Nvidia driver for 510 branch
nvidia-driver  510-dkms         default [d], fm, ks       Nvidia driver for 510-dkms branch
nvidia-driver  515              default [d], fm, ks, src  Nvidia driver for 515 branch
nvidia-driver  515-dkms         default [d], fm, ks       Nvidia driver for 515-dkms branch
nvidia-driver  515-open         default [d], fm, ks, src  Nvidia driver for 515-open branch
nvidia-driver  520              default [d], fm, ks, src  Nvidia driver for 520 branch
nvidia-driver  520-dkms         default [d], fm, ks       Nvidia driver for 520-dkms branch
nvidia-driver  520-open         default [d], fm, ks, src  Nvidia driver for 520-open branch

Hint: [d]efault, [e]nabled, [x]disabled, [i]nstalled
#
```
The first try would be to pick the number of the desired branch. Currently the `520*` and `latest` streams are empty because the drivers were removed.

The "number only" module streams contain precompiled drivers for some kernels. Note that for older branches or older drivers there may be no precompiled driver for the latest kernel version. For older branches, my experience is that the `*-dkms` module stream works better with newer kernels. But I did not manage to do "real" DKMS with them, that is, compiling the translation layer of any given driver version for whatever kernel. Feel free to update this guide or to tell the Core Linux Team if you find a working procedure.

Finally, the `*-open` module streams contain the new open source drivers, which currently do not provide the full feature set of the proprietary ones.

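The "number only" streams can be filtered out of the listing with a short `awk` filter. This is a sketch that assumes the output layout shown above (module name in the first column, stream name in the second); the function name `numeric_streams` is made up for illustration.

```shell
# Hypothetical helper: read `dnf module list nvidia-driver` output on stdin and
# print the streams named by a branch number only (no -dkms or -open suffix)
numeric_streams() {
    awk '$1 == "nvidia-driver" && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Usage:
#   dnf module list nvidia-driver | numeric_streams
```
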
### Install a Driver

It works best to install the whole module stream:
```
dnf module install "nvidia-driver:$STREAM"
```

Alternatively, the module stream can be enabled first (`dnf module enable "nvidia-driver:$STREAM"`) and the packages installed individually afterwards, but then you have to figure out yourself which packages are needed.

If the installation command complains a lot about `is filtered out by modular filtering`, then there is already a module stream enabled and some driver installed. To clean that up, run:
```
dnf remove cuda-driver nvidia-driver
dnf module reset nvidia-driver
```
Note that this will also remove installed CUDA packages.

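Before installing, it can help to check whether a stream is already enabled, which is the situation that leads to the `modular filtering` complaints. The sketch below looks for the `[e]` marker in the stream column of the `dnf module list nvidia-driver` output; the function name `enabled_stream` is hypothetical and the column layout is assumed to match the listing in the previous section.

```shell
# Hypothetical helper: read `dnf module list nvidia-driver` output on stdin and
# print the stream that carries the [e]nabled marker, if any
enabled_stream() {
    awk '$1 == "nvidia-driver" && $3 ~ /\[e\]/ { print $2 }'
}

# Usage:
#   current="$(dnf module list nvidia-driver | enabled_stream)"
#   [ -n "$current" ] && echo "stream $current is already enabled"
```
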
### Install CUDA

It is not recommended to install the `cuda` meta-package directly, because it requires the latest drivers from the "new feature" branch. It is better to install the `cuda-11-x` meta-package instead, which installs the CUDA version suitable for your driver and then keeps it updated with bugfix releases of this specific major release. Check Table 3 in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) for details.

The `cuda` meta-package is excluded by default, as explained above. If you still want to use it, run
```
dnf --disableexcludes cuda install cuda
```

After a manual CUDA installation you should think about enabling and starting `nvidia-persistenced`:
```
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
```

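The name of the versioned meta-package follows from the first two numbers of the CUDA version. A minimal sketch, assuming the `cuda-X-Y` naming pattern mentioned above; the function name `cuda_metapackage` is made up for illustration.

```shell
# Hypothetical helper: map a CUDA version X.Y[.Z] to the meta-package cuda-X-Y.
# The body runs in a subshell so that the helper variables stay local.
cuda_metapackage() (
    major="${1%%.*}"      # part before the first dot
    rest="${1#*.}"        # part after the first dot
    minor="${rest%%.*}"   # second number
    printf 'cuda-%s-%s\n' "$major" "$minor"
)

cuda_metapackage 11.4.2   # prints cuda-11-4

# The result could then be installed with e.g.
#   dnf install "$(cuda_metapackage 11.4.2)"
```
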
## Regular Tasks by the Core Linux Team
- classify new driver branches and beta versions in the [snapshot preparation script](https://git.psi.ch/linux-infra/repo01_pli-scripts/-/blob/master/libexec/fix-snapshot/20_remove_nvidia_beta_drivers#L90)
- update the latest production branch in the [Puppet managed Nvidia software installation script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/ensure-nvidia-software#L17)
- add more production/long term support branches supported by [`nvidia-detect`](http://elrepo.org/tiki/nvidia-detect) to the [Puppet managed Nvidia software installation script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/ensure-nvidia-software#L62)
- update the [driver version to CUDA version mapping script](https://git.psi.ch/linux-infra/puppet/-/blob/preprod/code/modules/profile/files/nvidia/suitable_cuda_version#L21) according to new entries in the [CUDA Release Notes](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)