When I try to run the NVIDIA gpu-operator, it fails to fully initialize. From what I can tell, this is because the nvidia-validator tries to run the nvidia-smi binary from the host's /usr/bin/:
NAMESPACE     NAME                                                          READY   STATUS     RESTARTS      AGE
kube-system   coredns-85b955d87b-9cx56                                      1/1     Running    0             70m
kube-system   coredns-85b955d87b-nfdgb                                      1/1     Running    0             70m
kube-system   gpu-feature-discovery-jn6ps                                   0/1     Init:0/1   0             49m
kube-system   gpu-operator-7bbf8bb6b7-g4pd2                                 1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-gc-79d6d968bb-jkn2s       1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-master-6d9f8d497c-xvttn   1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-6cgnv              1/1     Running    0             50m
kube-system   gpu-operator-node-feature-discovery-worker-tdc8j              1/1     Running    0             50m
kube-system   kube-apiserver-up                                             1/1     Running    0             69m
kube-system   kube-controller-manager-up                                    1/1     Running    1 (70m ago)   68m
kube-system   kube-flannel-ffftw                                            1/1     Running    0             69m
kube-system   kube-flannel-q972c                                            1/1     Running    0             69m
kube-system   kube-proxy-mrc75                                              1/1     Running    0             69m
kube-system   kube-proxy-n5qdc                                              1/1     Running    0             69m
kube-system   kube-scheduler-up                                             1/1     Running    2 (70m ago)   68m
kube-system   nvidia-dcgm-exporter-jlqbb                                    0/1     Init:0/1   0             49m
kube-system   nvidia-device-plugin-daemonset-q89xh                          0/1     Init:0/1   0             49m
kube-system   nvidia-operator-validator-jfs6m                               0/1     Init:0/4   0             49m
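For anyone trying to reproduce this: the validator error shown further down came from the init containers of the stuck validator pod. Something like the following pulls it up (the pod name is from my cluster, and driver-validation is the init container name I see in kubectl describe; both may differ on yours):

kubectl -n kube-system describe pod nvidia-operator-validator-jfs6m
kubectl -n kube-system logs nvidia-operator-validator-jfs6m -c driver-validation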
I installed the operator via Helm with the following values.yaml:
driver:
  enabled: false
toolkit:
  enabled: false
  env:
    - name: CONTAINERD_CONFIG
      value: /etc/cri/conf.d/nvidia-container-runtime.part
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
This should skip installing the drivers and changing the containerd config (both already come with the extensions), but it apparently doesn't skip validating them.
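As a quick sanity check that the extension really did ship the runtime bits (the config path is taken from the values above; the second command just assumes the runtime is on the node's PATH):

ls /etc/cri/conf.d/nvidia-container-runtime.part
command -v nvidia-container-runtime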
The chart was installed with
helm install gpu-operator \
-n kube-system nvidia/gpu-operator --values values.yaml
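To double-check that the overrides actually applied, plain Helm can echo them back (nothing chart-specific here):

helm get values gpu-operator -n kube-system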
I tried manually touching the files that the validator creates, but it still attempts to execute the nvidia-smi command:
running command chroot with args [/run/nvidia/driver nvidia-smi]
chroot: failed to run command 'nvidia-smi': No such file or directory
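The failing call can be reproduced directly on the GPU node, which is what makes me think the validator is looking in the wrong root. Paths here are assumed from the log above; as far as I understand, /run/nvidia/driver is only populated when the operator's own driver container runs, which I've disabled:

which nvidia-smi                        # the host install, /usr/bin/nvidia-smi here
ls /run/nvidia/driver                   # empty with driver.enabled=false
chroot /run/nvidia/driver nvidia-smi    # same failure as the validator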
More information is in the repo
https://github.com/NVIDIA/gpu-operator/tree/master
and in the installation docs
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#operator-install-guide