Installing NVIDIA CUDA on Leap
2024-10-02 00:00:00 +0000 | Egbert Eich | No License
In a blog post last year Simplify GPU Application Development with HMM on Leap we've described how to install CUDA using the NVIDIA open kernel driver built by SUSE. For this, we utilized driver packages from the graphics driver repository for openSUSE Leap hosted by NVIDIA. This happend to work at the time, as at the time the kernel driver version for the graphics driver happened to be the same as the one used by CUDA. This, however, is not usually the case and at present, therefore this method fails.
NVIDIA CUDA Runtime Installation on openSUSE Leap 15.6 Bare Metal
To recap, CUDA will work with the 'legacy' proprietary kernel driver that has existed for a long time as well as the open driver KMP provided with CUDA. There is also a pre-built driver KMP avaiable on openSUSE Leap (starting with version 15.5). The latter provides the additional perks:
- since it is fully pre-build it does not require an additional tool chain
to complete the build during installation - which the CUDA provided drivers
will install as dependencies.
This helps to keep an installation foot print small which is desirable for HPC compute nodes or AI clusters. - It is singed with the same key as the kernel and its components and thus will integrate seamlessly into a secure boot environment without enrolling an additional MOK - which usually requires access to the system console.
The NVIDIA open driver supports all recent GPUs, in fact, it is required for cutting edge platforms such as Grace Hopper or Blackwell, and recommended for all Turing, Ampere, Ada Lovelace, or Hopper architectures. The the proprietary driver is only still required for older GPU architectures.
The versions of the driver stacks for CUDA and for graphics frequently
differ. CUDA's higher level libraries will work with any version equal or
later to the one these have been released with. This means for instance,
that CUDA 12.5 will run with a driver stack version greater or equal to
555.42.06. The components of the lower level stack - ie. those packages
with the driver generation in their name (at the time of writing G06
) -
need to match the kernel driver version.
The component stack for graphics is always tightly version coupled.
To accomodate versions difference for CUDA and graphics, openSUSE Leap
is now providing two sets of NVIDIA Open Driver KMPs: the version targetting
CUDA has the string -cuda
prepended before -kmp
in the package name.
With this, installing CUDA using the open driver becomes quite simple.
CUDA comes with a number of meta-packages which provide an easy way to install certain aspects while still obeying required dependencies. A more detailed list of these meta packages and their purposes will follow in a future blog. To run applications built with CUDA the only components required are the runtime. If we plan to set up a minimal system - such as an HPC compute node - these are the only components to be installed. They consist of the low level driver stack as well as the CUDA libraries. In a containerized system, the libraries would reside inside the application container.
To install the runtime stack for CUDA 12.5, we would simply run:
# zypper ar https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
# zypper --gpg-auto-import-keys refresh
# zypper -n in -y --auto-agree-with-licenses --no-recommends \
nv-prefer-signed-open-driver cuda-runtime-12-5
- If we don't need 32-bit compatibility packages, we may should add the
option
--no-recommends
. - Note that we are deliberately choosing CUDA 12.5 and driver stack version 555.42.06 over the the latest one (12.6) as the later and its matching kernel driver version have dependency issues. To install a different version we would replace `12-5' by this version.
The command above will install the SUSE-built and signed driver. Further changes to the module configuration as described in the previous blog are no longer required.
Now we can either reboot or run
# modprobe nvidia
to make sure, the driver is loaded. Once it's loaded, we ought to check if the installation has been successful by running:
nvidia-smi -q
If the driver has loaded successfully, we should see information about the GPU, the driver and CUDA version installed:
==============NVSMI LOG==============
Timestamp : Mon Aug 12 09:31:57 2024
Driver Version : 555.42.06
CUDA Version : 12.5
Attached GPUs : 1
GPU 00000000:81:00.0
Product Name : NVIDIA L4
Product Brand : NVIDIA
Product Architecture : Ada Lovelace
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : HMM
...
NVIDIA CUDA Runtime Installation for containerized Workloads
Containerized workloads using CUDA have their highlevel CUDA runtime libraries installed inside the container. The low level libraries are made available when the container is run. This allows running containerized workloads relatively independent of the version of the driver stack as long as the CUDA version is not newer than the version of driver stack used. Depending on what container environment we plan to use, there are different ways to achieve this. We've learned in a previous blog, how to do this using the NVIDIA GPU Operator and a driver container. This does not require to install any NVIDIA components on the container host. The kernel drivers will be loaded from within the driver container, they do not need to be installed on the container host. However, there are other container systems and different ways to install CUDA for K8s.
Set up Container Host for podman
For podman
, we require a different approach. For this, kernel driver as
well as the nvidia-container-toolkit
need to be installed on the container
host. We need to use the container toolkit to configure podman to set up the
GPU device files and the low level driver stack (libraries and tools)
inside a container.
# zypper ar https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
# zypper --gpg-auto-import-keys refresh
# zypper -n in -y nvidia-compute-utils-G06 [ = <version> ] \
nv-prefer-signed-open-driver \
nvidia-container-toolkit
<version>
specify the driver stack version. If we omit this, we
will get the latest version available. Again, if we don't need 32-bit
compatibility packages, we should add --no-recommends
.
Now, we can configure podman for the devices found in our system:
# nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
This command should show INFO
as well as some WARN
messages but should
not return any lines marked ERROR
.
To run containers which use NVIDIA GPUs, we need to specify the device
using the argument: --device nvidia.com/gpu=<device>
. Here, <device>
may be on individual GPU or - on GPUs that support Multi-Instance GPU
(MIG) - an individual MIG device: ie GPU_ID
or GPU_ID
:MIG_ID
) or all
.
We are also able to specify multiple devices:
$ podman run --device nvidia.com/gpu=0 \
--device nvidia.com/gpu=1:0 \
...
This uses GPU 0 and MIG device 1 on GPU 1.
We may find the available devices by running:
$ nvidia-ctk cdi list
Set up Container Host for K8s (RKE2)
There are different ways to set up the K8s to let containers use NVIDIA devices. One has been described in a separate blog), however the driver stack can also be installed on the container host itself while still using the GPU Operator. The disadvantage of this approach is that the driver stack needs to be installed on each container host.
# zypper ar https://developer.download.nvidia.com/compute/cuda/repos/opensuse15/x86_64/cuda-opensuse15.repo
# zypper --gpg-auto-import-keys refresh
# zypper -n in -y [ --no-recommends ] nvidia-compute-utils-G06 [ = <version> ] \
nv-prefer-signed-open-driver
Here, we may specify a driver version explicitly: if we plan to use
one other than the latest. Also, we may exclude 32-bit libraries by specifying
--no-recommends
.
We need to perform above steps on each node (ie, server, agents) which
have an NVIDIA GPU installed.
How to continue depends whether we perfer to use the NVIDIA GPU Operator
(without a driver container) or just use the NVIDIA Device Plugin container.
with the NVIDIA GPU Operator
Once the previous steps have been completed, we can start the GPU Operator:
OPERATOR_VERSION="v24.6.1"
DRIVER_VERSION="555.42.06"
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install \
-n gpu-operator --generate-name \
--wait \
--create-namespace \
--version=${OPERATOR_VERSION} \
nvidia/gpu-operator \
--set driver.enabled=false \
--set driver.version=${DRIVER_VERSION} \
--set operator.defaultRuntime=containerd \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true
We need to set OPERATOR_VERSION
to the GPU Operator version we
whish like to use and DRIVER_VERSION
to the driver version
we are running.
Now, we are able to see the pods starting:
kubectl get pods -n gpu-operator
The final state should either be Running
or Completed
.
Now, we should check the logs of the nvidia-operator-validator
whether
the driver has been set up successfully
# kubectl logs -n gpu-operator -l app=nvidia-operator-validator
and the nvidia-cuda-validator
if CUDA workloads can be run successfully.
# kubectl logs -n gpu-operator -l app=nvidia-cuda-validator
Both commands should return that the validations have been successful.
without the NVIDIA GPU Operator
Alternatively, we can set up NVIDIA support without the help of the GPU
Operator. For this, we will need to create a configuration - much like
for podman - and use the NVIDIA Device
Plugin.
If not done already, we need to install the nvidia-container-toolkit
package, for this:
# zypper -n in -y nvidia-container-toolkit
and restart RKE2:
# systemctl restart rke2-server
on the K8s server or
# systemctl restart rke2-agent
on each agent with NVIDIA hardware installed. This will make sure, the container runtime is configure appropriately.
Once the server is again up and running, we should find this entry in
/var/lib/rancher/rke2/agent/etc/containerd/config.toml
:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
Which is required to use the correct runtime for workloads requiring NVIDA
harware.
Next, we need to configure the NVIDIA RuntimeClass
as an additional
K8s runtime so that any user whose pods need access to the GPU
can request them by using the runtime class nvidia
:
# kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: nvidia
handler: nvidia
EOF
Since Device Plugin version 0.15 we need to set a hardware feature label telling the DaemonSet that there is a GPU on the node. This is normally set by the Node or GPU Feature Discovery daemons both deployed as part of the GPU Operator chart. Since we don't use this, we will have to set this manually for each node with a GPU:
kubectl label nodes <node_list> nvidia.com/gpu.present="true"
and install the NVIDIA Device Plugin:
# helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
# helm repo update
# helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin --create-namespace \
--set runtimeClassName=nvidia
The pod should start up, complete the detection and tag the nodes with the number of GPUs available. To verify, we run:
# kubectl get pods -n nvidia-device-plugin
NAME READY STATUS RESTARTS AGE
nvdp-nvidia-device-plugin-tjfcg 1/1 Running 0 7m23s
# kubectl get node `cat /etc/HOSTNAME` -o json | jq .status.capacity
{
"cpu": "32",
"ephemeral-storage": "32647900Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "65295804Ki",
"nvidia.com/gpu": "1",
"pods": "110"
}
Acknowledgements
Many ideas in this post were taken from the documentation NVIDIA GPUs in SLE Micro.
Further Documentation
NVIDIA CUDA Installation Guide for Linux
Installing the NVIDIA Container Toolkit
Support for Container Device Interface