Hello, I'm currently testing the NVIDIA CDI feature.
I've enabled CDI on a host with an NVIDIA GPU, and set runtimeClassName: nvidia-cdi in my Kubernetes pod spec.
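For context, this is roughly how I enabled CDI on the node; the exact paths follow the standard NVIDIA Container Toolkit docs, and my actual setup may differ slightly:

  # generate the CDI spec for all GPUs on the host
  sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  # then enable CDI in the containerd CRI plugin (enable_cdi = true in /etc/containerd/config.toml) and restart containerd
  sudo systemctl restart containerd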
Here's what I observed:
apiVersion: v1
kind: Pod
metadata:
  name: cdi-gpu-test
spec:
  nodeName: h100
  runtimeClassName: nvidia-cdi   # <- I added this
  containers:
  - name: cuda-container
    image: ubuntu:22.04
    command: ["sh", "-c", "sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
  restartPolicy: Never
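The nvidia-cdi runtime class referenced above was created like this; the handler value matches the runtime entry I registered in containerd, so the exact name is specific to my setup:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-cdi
handler: nvidia-cdi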
# kubectl exec -it cdi-gpu-test -- bash
# after creating the pod, checking /dev/
root@cdi-gpu-test:/dev# ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Jul 22 02:38 /dev/nvidia-modeset
crw-rw-rw- 1 root root 498, 0 Jul 22 02:38 /dev/nvidia-uvm
crw-rw-rw- 1 root root 498, 1 Jul 22 02:38 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 6 Jul 22 02:38 /dev/nvidia6
crw-rw-rw- 1 root root 195, 255 Jul 22 02:38 /dev/nvidiactl
# try nvidia-smi command
root@cdi-gpu-test:/dev# nvidia-smi
Tue Jul 22 02:39:38 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:D2:00.0 Off | 0 |
| N/A 27C P0 69W / 700W | 0MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
# after running nvidia-smi, all device nodes are visible
root@cdi-gpu-test:/dev# ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195, 254 Jul 22 02:38 /dev/nvidia-modeset
crw-rw-rw- 1 root root 498, 0 Jul 22 02:38 /dev/nvidia-uvm
crw-rw-rw- 1 root root 498, 1 Jul 22 02:38 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Jul 22 02:39 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Jul 22 02:39 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Jul 22 02:39 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Jul 22 02:39 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Jul 22 02:39 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Jul 22 02:39 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Jul 22 02:38 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Jul 22 02:39 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Jul 22 02:38 /dev/nvidiactl
When the pod starts, /dev/ inside the pod contains only one GPU device node, nvidia6 (along with the control nodes nvidiactl, nvidia-uvm, etc.).
However, after running nvidia-smi inside the pod, all 8 GPU device nodes (nvidia0 through nvidia7) appear under /dev/.
This leads me to two questions regarding CDI behavior:
- Is it the intended behavior of CDI that only one device node (in my case, nvidia6) appears at first?
Personally, I expected CDI (being a device virtualization/mapping layer) to expose only one device (e.g., nvidia0) and internally manage the mapping.
- Why do all device nodes appear under /dev/ after running nvidia-smi?
It seems like nvidia-smi somehow triggers the creation of all device nodes.
This makes me wonder: do I need to run nvidia-smi before my actual container command to ensure all necessary devices are available? Is there a better way to ensure this without manually invoking nvidia-smi?
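If nvidia-smi really is required first, the workaround I'm considering looks roughly like this (/my-app is just a placeholder for my real entrypoint):

  containers:
  - name: cuda-container
    image: ubuntu:22.04
    # run nvidia-smi once so all device nodes exist, then exec the real workload
    command: ["sh", "-c", "nvidia-smi > /dev/null 2>&1; exec /my-app"]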
I'd appreciate your comments.