Commit 3dfaa84

CI: exec nvidia-persistenced directly after Linux driver swap
The `--silent --no-questions` .run installer drops /usr/bin/nvidia-persistenced but does not reliably install a usable systemd unit, so `systemctl start nvidia-persistenced.service` was a no-op (verified in CI logs: `+ true` after the start). With the daemon down, the /run/nvidia-persistenced/socket bind-mounted into the test container is stale, and NVML state-changing calls (e.g. nvmlDeviceSetPersistenceMode) made by root inside the container return NVML_ERROR_UNKNOWN -- which is what cuda.core's test_persistence_mode_enabled has been failing on.

Verified on ComputeLab with the same driver (610.43.02), same GPU arch (Ada L40S), root in container: with the daemon up, the SET call returns NVML_SUCCESS; with the daemon down it returns UnknownError.

Fix: exec /usr/bin/nvidia-persistenced directly. The binary self-daemonizes and creates the socket on its own.

(Same latent gap exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)
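Because the daemon start is followed by `|| true`, a failed launch would again pass silently. A minimal sketch of a follow-up check, assuming the socket path named above: `wait_for_socket` is a hypothetical helper (it is not part of `install_gpu_driver.sh`) that polls until a path is a Unix socket or a timeout expires.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ci/tools/install_gpu_driver.sh.
# Polls once per second until PATH is a Unix domain socket, giving up
# after TIMEOUT seconds (default 10). Returns 0 if the socket exists,
# non-zero otherwise.
wait_for_socket() {
  local path="$1" timeout="${2:-10}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -S "$path" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # Final check so that a timeout of 0 still tests the path once.
  [ -S "$path" ]
}

# Possible use after the daemon launch (left commented out; it assumes a
# host with the driver installed):
#   /usr/bin/nvidia-persistenced --verbose 2>&1 || true
#   wait_for_socket /run/nvidia-persistenced/socket 10 \
#     || echo "nvidia-persistenced socket did not appear" >&2
```

Testing with `-S` rather than `-e` matters here: a leftover regular file or directory at the same path would not satisfy the check. It only confirms that a socket exists, not that a live daemon is listening on it.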
1 parent 0d5f0e9 commit 3dfaa84

1 file changed

Lines changed: 15 additions & 14 deletions

ci/tools/install_gpu_driver.sh

@@ -115,22 +115,23 @@ host_install() {
 --accept-license --ui=none --no-cc-version-check --kernel-module-type="$KMT" )
 modprobe nvidia nvidia_uvm nvidia_modeset

-# Restore the runner image's baseline state: persistence mode ENABLED
-# plus nvidia-persistenced running. The runner-team's pre-installed
-# drivers come up with `Persistence-M: On`, but our .run install leaves
-# it Off, which breaks tests that toggle the value (cuda.core's
-# test_persistence_mode_enabled hits NVML_ERROR_UNKNOWN when setting
-# the mode to its current value on driver 610.43.02).
-#
-# `nvidia-smi -pm 1` is the load-bearing call -- it sets the kernel-
-# level persistence flag directly via NVML (equivalent to what the
-# daemon would do on startup). The systemctl block is best-effort: the
-# silent .run installer doesn't always drop the systemd unit, so we
-# daemon-reload first and tolerate failure on `start`.
+# Bring nvidia-persistenced back up. We stopped it above, and the
+# `--silent --no-questions` .run installer drops `/usr/bin/nvidia-
+# persistenced` but does not reliably reinstall a usable systemd
+# unit -- so a previous attempt at `systemctl start nvidia-
+# persistenced.service` was a no-op (see ComputeLab repro on driver
+# 610.43.02). Exec the daemon directly; it self-daemonizes and
+# creates `/run/nvidia-persistenced/socket`, which NVML clients in
+# the test container need for state-changing calls like
+# `nvmlDeviceSetPersistenceMode` -- without it those calls return
+# NVML_ERROR_UNKNOWN. nv-gha-runners/vm-images' `nvgha-driver` has
+# the same gap; their CUDA-runtime validation workload doesn't hit
+# an NVML SET write so they haven't surfaced it yet.
 set -x
+/usr/bin/nvidia-persistenced --verbose 2>&1 || true
+# Set persistence mode explicitly so we match the runner image's
+# `Persistence-M: On` baseline regardless of how the daemon came up.
 nvidia-smi -pm 1 || true
-systemctl daemon-reload 2>/dev/null || true
-systemctl start nvidia-persistenced.service 2>/dev/null || true
 set +x
 }