Commit ece27e4
CI: require N consecutive nvidia-smi successes after Windows device cycle (#2195)
* CI: require N consecutive nvidia-smi successes after device cycle
Multi-GPU Windows rows (observed on 2x H100 MCDM after #2176 landed)
keep failing the "Ensure GPU is working" step with `Failed to
initialize NVML: Not Found`. Root cause: after `pnputil` cycles both
display devices, NVML briefly reports success mid-init, then flaps back
to "Not Found" a couple of seconds later. The existing poll exits on the
*first* `nvidia-smi` exit code 0, so the loop bails ~2 seconds in and
the next workflow step hits the flap window.
Scale the consecutive-success requirement to the number of cycled
NVIDIA devices (1 for single-GPU rows, 2 for the H100 pair) and bump
the inter-iteration sleep from 2 to 3 seconds. Single-GPU rows pay one
extra second (a 3-sec floor instead of 2); multi-GPU rows now require
~6 sec of stable NVML (2 successes x 3 sec) before moving on.
The 60-sec deadline is unchanged; the loop still bails (and the script
fails loudly) if NVML doesn't settle in time.
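
A minimal sketch of the consecutive-success poll described above, written in
Python for illustration only; the actual change lives in a PowerShell script
whose diff is not reproduced here. The names `wait_for_stable_nvml` and
`nvidia_smi_ok`, and the choice to sleep before each probe, are assumptions,
not taken from the script.

```python
import subprocess
import time


def nvidia_smi_ok() -> bool:
    """Return True when `nvidia-smi` exits with code 0 (hypothetical helper)."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # Binary missing or not launchable: treat as a failed probe.
        return False
    return result.returncode == 0


def wait_for_stable_nvml(required, probe=nvidia_smi_ok, sleep=time.sleep,
                         clock=time.monotonic, interval=3.0, timeout=60.0):
    """Poll until `probe` succeeds `required` times in a row.

    `required` is assumed to be the number of cycled NVIDIA devices
    (1 for single-GPU rows, 2 for the H100 pair). Returns False if the
    streak is not reached before the deadline.
    """
    deadline = clock() + timeout
    streak = 0
    while clock() < deadline:
        sleep(interval)
        # Any failure resets the streak, so a mid-init flap cannot pass.
        streak = streak + 1 if probe() else 0
        if streak >= required:
            return True
    return False
```

With `required=2` and a 3-second interval, a success followed by a flap back
to "Not Found" restarts the count, so the caller only proceeds after two
back-to-back successes. A `False` return corresponds to the script failing
loudly at the 60-sec deadline.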
* CI: restore the pre-#2176 5-sec unconditional settle before the poll
Pre-#2176, every Windows row ran install_gpu_driver.ps1 unconditionally
and that script ended with a fixed `Start-Sleep -Seconds 5` after the
pnputil cycle. #2176 dropped that floor (the poll exits on the first
nvidia-smi success at ~2 sec on single-GPU, ~2 sec mid-flap on the
H100 pair). Put the 5-sec floor back, before the consecutive-success
poll, so we never settle for less than the known-good baseline.
* CI: trim configure_driver_mode.ps1 comments for portability
* CI: drop redundant @(...) comment
1 parent 50ed678 commit ece27e4
1 file changed
Lines changed: 14 additions & 10 deletions