When I try to run the NVIDIA gpu-operator, it fails to fully initialize. From what I can tell, this is because the nvidia-validator tries to run the nvidia-smi binary from the host's /usr/bin/:
NAMESPACE     NAME                                                          READY   STATUS     RESTARTS      AGE
kube-system   coredns-85b955d87b-9cx56                                      1/1     Running    0             70m
kube-system   coredns-85b955d87b-nfdgb                                      1/1     Running    0             70m
kube-system   gpu-feature-discovery-jn6ps                                   0/1     Init:0/1   0             49m
kube-system   gpu-operator-7bbf8bb6b7-g4pd2                                 1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-gc-79d6d968bb-jkn2s       1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-master-6d9f8d497c-xvttn   1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-6cgnv              1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-tdc8j              1/1     Running    0             50m
kube-system   kube-apiserver-up                                             1/1     Running    0             69m
kube-system   kube-controller-manager-up                                    1/1     Running    1 (70m ago)   68m
kube-system   kube-flannel-ffftw                                            1/1     Running    0             69m
kube-system   kube-flannel-q972c                                            1/1     Running    0             69m
kube-system   kube-proxy-mrc75                                              1/1     Running    0             69m
kube-system   kube-proxy-n5qdc                                              1/1     Running    0             69m
kube-system   kube-scheduler-up                                             1/1     Running    2 (70m ago)   68m
kube-system   nvidia-dcgm-exporter-jlqbb                                    0/1     Init:0/1   0             49m
kube-system   nvidia-device-plugin-daemonset-q89xh                          0/1     Init:0/1   0             49m
kube-system   nvidia-operator-validator-jfs6m                               0/1     Init:0/4   0             49m
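For anyone trying to reproduce this: the validator error shown further down came from the init containers of the stuck validator pod. Something like the following pulls it up (the pod name is from my cluster, and driver-validation is the init container name I see in kubectl describe; both may differ on yours):

kubectl -n kube-system describe pod nvidia-operator-validator-jfs6m
kubectl -n kube-system logs nvidia-operator-validator-jfs6m -c driver-validation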
I installed the operator via Helm with the following values.yaml:
driver:
  enabled: false
toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/cri/conf.d/nvidia-container-runtime.part
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
This should skip installing the drivers and changing the containerd config (both already come with the extensions), but it apparently doesn't skip validating them.
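As a quick sanity check that the extension really did ship the runtime bits (the config path is taken from the values above; the second command just assumes the runtime is on the node's PATH):

ls /etc/cri/conf.d/nvidia-container-runtime.part
command -v nvidia-container-runtime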
The chart was installed with
helm install gpu-operator \
-n kube-system nvidia/gpu-operator --values values.yaml
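To double-check that the overrides actually applied, plain Helm can echo them back (nothing chart-specific here):

helm get values gpu-operator -n kube-system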
I tried manually touching the files that the validator creates, but it still attempts to execute the nvidia-smi command:
running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory
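The failing call can be reproduced directly on the GPU node, which is what makes me think the validator is looking in the wrong root. Paths here are assumed from the log above; as far as I understand, /run/nvidia/driver is only populated when the operator's own driver container runs, which I've disabled:

which nvidia-smi                        # the host install, /usr/bin/nvidia-smi here
ls /run/nvidia/driver                   # empty with driver.enabled=false
chroot /run/nvidia/driver nvidia-smi    # same failure as the validator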
More information is in the repo
https://github.com/NVIDIA/gpu-operator/tree/master
and in the installation docs
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide