Commit 3dfaa84
CI: exec nvidia-persistenced directly after Linux driver swap
The `--silent --no-questions` .run installer drops /usr/bin/nvidia-
persistenced but does not reliably install a usable systemd unit, so
`systemctl start nvidia-persistenced.service` was a no-op (verified
in CI logs: `+ true` after the start). With the daemon down, the
/run/nvidia-persistenced/socket bind-mounted into the test container
is stale, and NVML state-changing calls (e.g.
nvmlDeviceSetPersistenceMode) made by root inside the container
return NVML_ERROR_UNKNOWN -- which is what cuda.core's
test_persistence_mode_enabled has been failing on.
Verified on ComputeLab with the same driver (610.43.02), same GPU
arch (Ada L40S), root in container: with the daemon up, the SET call
returns NVML_SUCCESS; with the daemon down it returns UnknownError.
Fix: exec /usr/bin/nvidia-persistenced directly. The binary
self-daemonizes and creates the socket on its own. (Same latent gap
exists in nv-gha-runners/vm-images' nvgha-driver; will flag upstream.)
1 parent 0d5f0e9 commit 3dfaa84
1 file changed
Lines changed: 15 additions & 14 deletions
(Diff body not preserved. One hunk covering lines 115-137: old lines 118-129 replaced by new lines 118-129, new lines 131-133 added, old lines 132-133 removed.)
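The fix described in the commit message can be sketched roughly as follows. This is not the actual workflow change, only an illustration under stated assumptions: the function name `start_persistenced`, its retry count, and the polling loop are hypothetical, while the binary and socket paths are the ones named in the commit message. The sketch relies on the commit's statement that the binary self-daemonizes, so it is run in the foreground and the script then polls for the socket.

```bash
#!/bin/sh
# Hypothetical helper: run a self-daemonizing binary directly (no systemd unit)
# and wait until its socket path appears.
#   $1 = path to the daemon binary (e.g. /usr/bin/nvidia-persistenced)
#   $2 = socket path to wait for   (e.g. /run/nvidia-persistenced/socket)
#   $3 = number of one-second polls before giving up (assumed default: 10)
start_persistenced() {
  local bin="$1" sock="$2" tries="${3:-10}" i=0

  if [ ! -x "$bin" ]; then
    echo "missing or non-executable binary: $bin" >&2
    return 1
  fi

  # Per the commit message the binary forks into the background on its own,
  # so it is invoked without '&'; a non-zero exit means it failed to start.
  "$bin" || return 1

  # Poll for the socket path. Caveat: this only checks that the path exists,
  # so a stale socket left by an earlier daemon would also satisfy it.
  while [ "$i" -lt "$tries" ]; do
    [ -e "$sock" ] && return 0
    sleep 1
    i=$((i + 1))
  done

  echo "socket did not appear: $sock" >&2
  return 1
}
```

In a CI step this might be called as `start_persistenced /usr/bin/nvidia-persistenced /run/nvidia-persistenced/socket`, run as root after the driver swap. Failing the step when the socket never appears avoids the silent no-op that the `systemctl start` path produced.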