CUDA and Proprietary Nvidia GPU Drivers on RHEL 8
Managing Nvidia software comes with its own set of challenges. The most common cases are covered by our Puppet configuration and are discussed in the first chapter; more details follow further below.
Hiera Configuration
Changes in Hiera are forwarded by Puppet to the node, but not applied immediately.
They are applied on the next reboot.
Alternatively, you can execute /opt/pli/libexec/ensure-nvidia-software at a safe moment: no process may be using CUDA, and the desktop will be restarted.
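One way to check for a "safe moment" before running the script by hand is to ask nvidia-smi for running compute processes. The following is only a sketch: the function names are made up for illustration, it assumes nvidia-smi is installed and working, and it does not check whether somebody is logged in to the desktop.

```shell
# Count the PIDs listed on stdin, one per line, as printed by
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader
count_cuda_processes() {
    # grep -c prints 0 and returns non-zero when nothing matches
    grep -c '[0-9]' || true
}

# Hypothetical wrapper: apply pending changes only when the GPU is idle.
apply_nvidia_changes() {
    busy=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | count_cuda_processes)
    if [ "$busy" -gt 0 ]; then
        echo "GPU is used by $busy compute process(es), not applying changes" >&2
        return 1
    fi
    /opt/pli/libexec/ensure-nvidia-software
}
```

Calling apply_nvidia_changes as root then either runs the script or refuses with a message. If nvidia-smi itself fails (for example because no driver is loaded), the count is 0 and the script runs.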
I just need the Nvidia GPU drivers
Nothing needs to be done, they are installed by default when Nvidia GPUs or accelerators are found.
I need CUDA
Set nvidia::cuda::enable: true in Hiera and it will automatically install suitable Nvidia drivers and the newest possible CUDA version.
The nvidia_persistenced service is started automatically. If you do not want it, set nvidia::cuda::nvidia_persistenced::enable: false.
I need a specific CUDA version
Then you can additionally set nvidia::cuda::version to the desired version.
The version must be fully specified (all three numbers, with X.Y.0 for the GA version).
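To avoid a typo in the version, the format can be checked in the shell before putting the value into Hiera. This is a minimal sketch with a made-up function name; it only checks the shape X.Y.Z, not whether such a CUDA release actually exists.

```shell
# Succeeds only for a fully specified version such as 11.4.0
is_full_cuda_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example: reports "ok" for 11.4.0 and "incomplete" for 11.4
for v in 11.4.0 11.4; do
    if is_full_cuda_version "$v"; then
        echo "$v: ok"
    else
        echo "$v: incomplete"
    fi
done
```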
Note that newer CUDA versions do not support older drivers, for details see Table 3 in the CUDA Release Notes.
I do not want the Nvidia drivers
Set nvidia::driver::enable: false in Hiera. Note that this is ignored if CUDA is enabled (see above).
Drivers that are already installed are not removed automatically. You need to remove them by hand.
I need the Nvidia drivers from a specific driver branch
The driver branch can be selected in Hiera with nvidia::driver::branch. It will then use the latest driver version of that branch. Note that only production branches are available in the PSI package repository.
I need an Nvidia driver of a given version
This is not recommended, but it is possible by setting the exact driver version (X.Y.Z, excluding the package iteration number) in Hiera with nvidia::driver::version.
If the driver version is too old, it will install an older kernel version and you will need a second reboot to activate it.
Versioning Mess
I did not find much information about the structure and policy of Nvidia driver versions. Still, I concluded that they use the following pattern.
Driver Branches
Their drivers are organized in driver branches. You see them, for example, in their Unix Driver Archive, noted as e.g. 470.xx series.
There are Production and New Feature branches (and, on the above linked page, a Beta Version which is not linked to any of the above branches (yet?)).
Such a branch can be considered a major release, with new branches adding support for new hardware or removing support for old hardware. The drivers within a branch are maintained for quite a long time. Individual drivers in a branch get increasing version numbers which all start with the same first "branch" number.
In the RPM repo there are more branches available than listed in the Unix Driver Archive. It is not possible to find out retrospectively to what type of branch it belongs. My guess is that the "Legacy" section lists only the production/long term support branches.
Also it is not possible to find out from the package meta information if a driver is considered beta or not. That you only find out by googling "Nvidia $DRIVER_VERSION" and looking at the respective driver page. In my experience the first few driver versions of a branch are usually "beta".
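Because the branch is simply the first number of the driver version, it can be derived in the shell. The function name below is made up for illustration; the second command assumes an Nvidia driver is already loaded, so that nvidia-smi can report its version.

```shell
# Print the branch (first number) of a driver version like 460.106.00
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

driver_branch 460.106.00    # prints 460

# On a node with a loaded driver, the installed version can be queried with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
```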
What Driver [Branch] for which Hardware
To figure out which driver branch to use for given hardware, go to their Download page and search for its Linux driver. It will then suggest a driver version, and the first number of that version is the driver branch to use.
Note that this is not always the full story. For example, the Tesla K40c gives driver 460.106.00, whereas the 470 driver still works, even though the hardware is not listed as supported there. My guess is that they somehow publicly differentiate between "Data Center Driver" and "Display Driver", but still include everything in each driver, or at least in those of the production/long term support branches.
Another option to figure out the driver is the third-party tool nvidia-detect by ElRepo. It tells which driver package from ElRepo it suggests, but it can also be used to figure out which production/long term support branch can be used (and only production/long term support branches, e.g. it would never point out the 460 branch and this is how I figured out that Tesla K40c works with 470 despite the Nvidia documentation not saying so).
CUDA - Driver Compatibility
A CUDA version needs a suitably new driver version, but old CUDA versions are supported by newer driver versions (drivers are backwards-compatible). To figure out the newest CUDA version which runs on your installed driver, check out "Table 3" of the CUDA release notes. For each driver branch there is a major 11.x.0 release with possible further bugfix releases.
Manual Operation
Instead of using Puppet/Hiera, you may also manage the drivers manually.
Note that the drivers made available by default are curated, which means that only non-beta production drivers are included. If you want all drivers available, you need to use https://repo01.psi.ch/el8/sources/cuda8/ as URL for the package repository.
Select the Driver Branch
In the RPM package repository the driver branches are mapped to module streams, so there are different streams for different branches and dnf module list nvidia-driver will tell you what is available:
# dnf module list nvidia-driver
Last metadata expiration check: 2:37:29 ago on Mon 28 Nov 2022 09:15:57 AM CET.
CUDA and drivers from Nvidia
Name Stream Profiles Summary
nvidia-driver latest default [d], fm, ks, src Nvidia driver for latest branch
nvidia-driver latest-dkms [d] default [d], fm, ks Nvidia driver for latest-dkms branch
nvidia-driver open-dkms default [d], fm, ks, src Nvidia driver for open-dkms branch
nvidia-driver 418 default [d], fm, ks, src Nvidia driver for 418 branch
nvidia-driver 418-dkms default [d], fm, ks Nvidia driver for 418-dkms branch
nvidia-driver 440 default [d], fm, ks, src Nvidia driver for 440 branch
nvidia-driver 440-dkms default [d], fm, ks Nvidia driver for 440-dkms branch
nvidia-driver 450 default [d], fm, ks, src Nvidia driver for 450 branch
nvidia-driver 450-dkms default [d], fm, ks Nvidia driver for 450-dkms branch
nvidia-driver 455 default [d], fm, ks, src Nvidia driver for 455 branch
nvidia-driver 455-dkms default [d], fm, ks Nvidia driver for 455-dkms branch
nvidia-driver 460 default [d], fm, ks, src Nvidia driver for 460 branch
nvidia-driver 460-dkms default [d], fm, ks Nvidia driver for 460-dkms branch
nvidia-driver 465 default [d], fm, ks, src Nvidia driver for 465 branch
nvidia-driver 465-dkms default [d], fm, ks Nvidia driver for 465-dkms branch
nvidia-driver 470 default [d], fm, ks, src Nvidia driver for 470 branch
nvidia-driver 470-dkms [e] default [d] [i], fm, ks Nvidia driver for 470-dkms branch
nvidia-driver 495 default [d], fm, ks, src Nvidia driver for 495 branch
nvidia-driver 495-dkms default [d], fm, ks Nvidia driver for 495-dkms branch
nvidia-driver 510 default [d], fm, ks, src Nvidia driver for 510 branch
nvidia-driver 510-dkms default [d], fm, ks Nvidia driver for 510-dkms branch
nvidia-driver 515 default [d], fm, ks, src Nvidia driver for 515 branch
nvidia-driver 515-dkms default [d], fm, ks Nvidia driver for 515-dkms branch
nvidia-driver 515-open default [d], fm, ks, src Nvidia driver for 515-open branch
nvidia-driver 520 default [d], fm, ks, src Nvidia driver for 520 branch
nvidia-driver 520-dkms default [d], fm, ks Nvidia driver for 520-dkms branch
nvidia-driver 520-open default [d], fm, ks, src Nvidia driver for 520-open branch
Hint: [d]efault, [e]nabled, [x]disabled, [i]nstalled
#
The first try would be to pick the number of the desired branch. Currently the 520* and latest streams are empty because the drivers were removed.
The "number only" module streams contain precompiled drivers for some kernels. Note that for older branches or older drivers it may not be precompiled for the latest kernel version. For older branches I had the experience that the *-dkms module stream works better for newer kernels. But I did not manage to do "real" DKMS with them, that means compiling the translation layer of any given driver version for whatever kernel. Feel free to update this guide or to tell the Core Linux Team if you found a working procedure.
Finally, the *-open module streams contain the new open source drivers, which currently do not provide the full feature set of the proprietary ones.
Install a Driver
It works best to install the whole module stream:
dnf module install "nvidia-driver:$STREAM"
Alternatively the module stream might be enabled first (dnf module enable "nvidia-driver:$STREAM") and the packages installed individually after, but then you have to figure out yourself what all is needed.
If the installation command is rather unhappy and complains a lot that packages are "filtered out by modular filtering", then there is already a module stream enabled and some driver installed. To clean that up, do:
dnf remove cuda-driver nvidia-driver
dnf module reset nvidia-driver
Note that this will also remove installed CUDA packages.
Install CUDA
It is not recommended to install the cuda meta-package directly, because it requires the latest drivers from the "new feature" branch. It is better to install the cuda-11-x meta-package instead, which installs the CUDA version suitable for your driver and then keeps it updated with bugfix releases of this specific release. Check out Table 3 in the CUDA Release Notes for details.
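The name of the cuda-11-x meta-package follows from the major and minor number of the CUDA version. As a sketch, with a made-up function name and assuming that the version was already chosen from Table 3 and has at least the form X.Y:

```shell
# Print the versioned meta-package name, e.g. 11.4.2 -> cuda-11-4
# (runs in a subshell so the helper variables do not leak)
cuda_metapackage() (
    major=${1%%.*}      # part before the first dot
    rest=${1#*.}        # part after the first dot
    minor=${rest%%.*}   # drop the bugfix number, if any
    printf 'cuda-%s-%s\n' "$major" "$minor"
)

cuda_metapackage 11.4.2    # prints cuda-11-4
# Install it as root with:
#   dnf install "$(cuda_metapackage 11.4.2)"
```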
The cuda meta-package is by default excluded as explained above. If you still want to use it, do
dnf --disableexcludes cuda install cuda
After manual CUDA installation you should think about enabling and starting nvidia-persistenced:
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
Regular Tasks by the Core Linux Team
- classify new driver branches and beta versions in the snapshot preparation script
- update the latest production branch in the Puppet managed Nvidia software installation script
- add more production/long term support branches supported by nvidia-detect to the Puppet managed Nvidia software installation script
- update the driver version to CUDA version mapping script according to new entries in the CUDA Release Notes