Commit 701cf2f

CI: pass --user root to nvidia-persistenced after Linux driver swap
nvidia-persistenced defaults to `--user nvidia-persistenced`, which our apt-purge of `nvidia-compute-utils-*` removed. Without that user the daemon's setuid(3) post-fork fails and the process exits silently -- the `nvidia-smi -pm 1` right after sees Persistence-M briefly On (daemon held it), then it flips back to Off (daemon gone), and the test container's NVML SET call later returns NVML_ERROR_UNKNOWN.

Pass --user root so the daemon doesn't depend on a user account that the purge deleted.

Also add a `pgrep nvidia-persistenced` + `ls -la /run/nvidia-persistenced/` diagnostic so the next CI log proves the daemon is alive when the test starts.
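The failure described above depends on whether the `nvidia-persistenced` account still exists after the purge. As a minimal sketch, a script could check for the account before choosing the `--user` argument instead of hardcoding it. The helper name `persistenced_user` is hypothetical and not part of this commit, and the fallback logic is an alternative under that assumption, not what the commit does (the commit always passes `--user root`).

```shell
#!/bin/sh
# Hypothetical helper (not in the commit): print the account the daemon
# should run as. Prefers the given account (default: nvidia-persistenced)
# if it still exists on the host, otherwise falls back to root.
persistenced_user() {
    preferred="${1:-nvidia-persistenced}"
    # `id -u NAME` exits non-zero when the account is unknown.
    if id -u "$preferred" >/dev/null 2>&1; then
        printf '%s\n' "$preferred"
    else
        printf '%s\n' root
    fi
}
```

It could then be used as `/usr/bin/nvidia-persistenced --verbose --user "$(persistenced_user)"`. Hardcoding `--user root`, as the commit does, is simpler and behaves the same on a host where the account has already been purged.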
1 parent 3dfaa84 commit 701cf2f

1 file changed

Lines changed: 20 additions & 9 deletions

ci/tools/install_gpu_driver.sh

@@ -119,16 +119,27 @@ host_install() {
 # `--silent --no-questions` .run installer drops `/usr/bin/nvidia-
 # persistenced` but does not reliably reinstall a usable systemd
 # unit -- so a previous attempt at `systemctl start nvidia-
-# persistenced.service` was a no-op (see ComputeLab repro on driver
-# 610.43.02). Exec the daemon directly; it self-daemonizes and
-# creates `/run/nvidia-persistenced/socket`, which NVML clients in
-# the test container need for state-changing calls like
-# `nvmlDeviceSetPersistenceMode` -- without it those calls return
-# NVML_ERROR_UNKNOWN. nv-gha-runners/vm-images' `nvgha-driver` has
-# the same gap; their CUDA-runtime validation workload doesn't hit
-# an NVML SET write so they haven't surfaced it yet.
+# persistenced.service` was a no-op. Exec the daemon directly; it
+# self-daemonizes and creates `/run/nvidia-persistenced/socket`,
+# which NVML clients in the test container need for state-changing
+# calls like `nvmlDeviceSetPersistenceMode` -- without it those
+# calls return NVML_ERROR_UNKNOWN.
+#
+# `--user root`: the daemon's default user is `nvidia-persistenced`,
+# which our apt purge of `nvidia-compute-utils-*` deleted. Without
+# this flag the daemon's setuid(3) call fails post-fork and the
+# process exits silently (which leaves Persistence-M flipping back
+# to Off the moment we exit the start window).
+#
+# Same latent gap exists in nv-gha-runners/vm-images' `nvgha-driver`;
+# their CUDA-runtime validation workload doesn't issue an NVML SET
+# write so they haven't surfaced it yet.
 set -x
-/usr/bin/nvidia-persistenced --verbose 2>&1 || true
+/usr/bin/nvidia-persistenced --verbose --user root 2>&1 || true
+sleep 1
+# Diagnostics: confirm the daemon is alive + socket present.
+pgrep -laf nvidia-persistenced || echo "WARN: nvidia-persistenced not running"
+ls -la /run/nvidia-persistenced/ 2>&1 || echo "WARN: /run/nvidia-persistenced missing"
 # Set persistence mode explicitly so we match the runner image's
 # `Persistence-M: On` baseline regardless of how the daemon came up.
 nvidia-smi -pm 1 || true
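
The diff waits a fixed `sleep 1` before checking for the daemon and its socket. A possible refinement, sketched here as an assumption rather than anything the commit contains, is a bounded polling loop that returns as soon as the path appears. The helper name `wait_for_path` and its argument order are hypothetical. It tests with `-e` so it works on any path; `-S` could be substituted to require a socket specifically.

```shell
#!/bin/sh
# Hypothetical helper (not in the commit): poll until a path exists or
# the attempt budget is used up.
# Returns 0 if the path appeared, 1 if it never did.
wait_for_path() {
    path="$1"
    attempts="${2:-10}"   # maximum number of existence checks
    interval="${3:-1}"    # seconds to sleep between checks
    i=0
    while :; do
        [ -e "$path" ] && return 0
        i=$((i + 1))
        # Give up without sleeping again once the budget is spent.
        [ "$i" -ge "$attempts" ] && return 1
        sleep "$interval"
    done
}
```

In the script this might look like `wait_for_path /run/nvidia-persistenced/socket 10 1 || echo "WARN: socket never appeared"`, which keeps the non-fatal style of the existing `|| true` and `|| echo` guards.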
