Commit ece27e4

CI: require N consecutive nvidia-smi successes after Windows device cycle (#2195)
* CI: require N consecutive nvidia-smi successes after device cycle

  Multi-GPU Windows rows (observed on 2x H100 MCDM after #2176 landed) keep failing the "Ensure GPU is working" step with `Failed to initialize NVML: Not Found`.

  Root cause: after `pnputil` cycles both display devices, NVML briefly reports success mid-init then flaps back to "Not Found" a couple seconds later. The existing poll exits on the *first* `nvidia-smi` exit code 0, so the loop bails ~2 seconds in and the next workflow step hits the flap window.

  Scale the consecutive-success requirement to the number of cycled NVIDIA devices (1 for single-GPU rows, 2 for the H100 pair) and bump the inter-iteration sleep from 2 to 3 seconds. Single-GPU rows pay an extra 1-sec floor; multi-GPU rows now require ~6 sec of stable NVML before moving on. The 60-sec deadline is unchanged; the loop still bails (and the script fails loudly) if NVML doesn't settle in time.

* CI: restore the pre-#2176 5-sec unconditional settle before the poll

  Pre-#2176, every Windows row ran install_gpu_driver.ps1 unconditionally and that script ended with a fixed `Start-Sleep -Seconds 5` after the pnputil cycle. #2176 dropped that floor (the poll exits on the first nvidia-smi success at ~2 sec on single-GPU, ~2 sec mid-flap on the H100 pair). Put the 5-sec floor back, before the consecutive-success poll, so we never settle for less than the known-good baseline.

* CI: trim configure_driver_mode.ps1 comments for portability

* CI: drop redundant @(...) comment
1 parent 50ed678 commit ece27e4

1 file changed

Lines changed: 14 additions & 10 deletions

ci/tools/configure_driver_mode.ps1

@@ -30,26 +30,30 @@ function Set-DriverMode {
         exit 1
     }
 
-    # Only restart NVIDIA display adapters, not other display devices (e.g. QEMU VGA)
-    $nvidia_devices = Get-PnpDevice -Class Display -FriendlyName "NVIDIA*"
+    # Only restart NVIDIA display adapters, not other display devices (e.g. QEMU VGA).
+    $nvidia_devices = @(Get-PnpDevice -Class Display -FriendlyName "NVIDIA*")
+    $gpu_count = $nvidia_devices.Count
     foreach ($device in $nvidia_devices) {
         Write-Output "Restarting device: $($device.FriendlyName) ($($device.InstanceId))"
         pnputil /disable-device "$($device.InstanceId)"
         pnputil /enable-device "$($device.InstanceId)"
     }
 
-    # Poll nvidia-smi until NVML can initialize, or give up after ~60s.
-    # A fixed sleep is not enough on slower-coming-back-up multi-GPU rows
-    # (e.g. 2x H100 MCDM) where pnputil enable returns before NVML is
-    # ready. Pattern borrowed from the runner-team `nvgha-driver.ps1`.
+    # Initial settle after the device cycle.
+    Start-Sleep -Seconds 5
+
+    # Poll nvidia-smi for N consecutive successes (N == cycled GPUs)
+    # so a mid-init "ok" flap doesn't fool the loop; bail after ~60s.
     Write-Output "Waiting for nvidia-smi/NVML to come back up after device cycle..."
     $deadline = (Get-Date).AddSeconds(60)
+    $consecutive_ok = 0
     do {
-        Start-Sleep -Seconds 2
+        Start-Sleep -Seconds 3
         & nvidia-smi.exe 2>&1 | Out-Null
-    } while ($LASTEXITCODE -ne 0 -and (Get-Date) -lt $deadline)
-    if ($LASTEXITCODE -ne 0) {
-        Write-Error "nvidia-smi did not return cleanly within 60s of the device cycle"
+        if ($LASTEXITCODE -eq 0) { $consecutive_ok++ } else { $consecutive_ok = 0 }
+    } while ($consecutive_ok -lt $gpu_count -and (Get-Date) -lt $deadline)
+    if ($consecutive_ok -lt $gpu_count) {
+        Write-Error "nvidia-smi did not return cleanly $gpu_count times in a row within 60s of the device cycle"
         exit 1
     }
 }
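
To make the flap scenario from the commit message concrete, here is a minimal simulation of the consecutive-success poll. It is an illustrative sketch, not code from `configure_driver_mode.ps1`: the helper name `Wait-ConsecutiveSuccess`, its parameters, and the scripted probe results are all hypothetical, and an attempt budget stands in for the 60-second deadline so the example runs instantly without `nvidia-smi` or a GPU.

```powershell
# Hypothetical helper (not part of the commit). Returns the attempt number at
# which $Probe has succeeded $Required times in a row, or 0 if it never does.
function Wait-ConsecutiveSuccess {
    param(
        [scriptblock]$Probe,      # returns $true on success, $false on failure
        [int]$Required = 1,       # consecutive successes needed (GPU count in the commit)
        [int]$MaxAttempts = 20,   # assumption: stand-in for the 60-second deadline
        [int]$SleepSeconds = 0    # 0 keeps the simulation instant; the commit sleeps 3
    )
    $consecutive_ok = 0
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        if ($SleepSeconds -gt 0) { Start-Sleep -Seconds $SleepSeconds }
        # Same counting rule as the diff: any failure resets the streak.
        if (& $Probe) { $consecutive_ok++ } else { $consecutive_ok = 0 }
        if ($consecutive_ok -ge $Required) { return $attempt }
    }
    return 0
}

# Scripted probe results imitating the flap: one early "ok" at attempt 2,
# then failures, then a stable run of successes. Made-up data for illustration.
$script:results = @($false, $true, $false, $false, $true, $true, $true)
$script:i = 0
$probe = { $r = $script:results[$script:i]; $script:i++; $r }

$script:i = 0
$first = Wait-ConsecutiveSuccess -Probe $probe -Required 1
$script:i = 0
$stable = Wait-ConsecutiveSuccess -Probe $probe -Required 2

Write-Output "first-success poll exits at attempt $first"    # attempt 2, mid-flap
Write-Output "2-consecutive poll exits at attempt $stable"   # attempt 6, after the flap
```

With these scripted results, the first-success poll stops at attempt 2 and the next probe would fail, which mirrors the failure the commit describes; requiring two successes in a row skips the early flap and stops at attempt 6.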
