ci: stress test NuGet QFE fallback fix (do not merge)#628
Conversation
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Melvin Wang <melvin.mc.wang@gmail.com>
Wraps taiki-e/install-action@v2 with a pwsh verification step that falls back to cargo install if the tool binary is missing. Works around intermittent silent bash failures on windows-11-arm where shell: bash exits 0 with zero output under WoW64 arm64 emulation. See https://github.com/actions/partner-runner-images/issues/169
Disables amd64 runners and CodeQL to focus on windows-11-arm jobs where the intermittent shell: bash silent failure occurs. This maximizes the chance of triggering the fallback path.
There was a problem hiding this comment.
Pull request overview
This PR is a CI stress-test change set intended to exercise the NuGet WDK QFE resolution fallback logic under higher parallelism, primarily by expanding the LLVM version matrix and adjusting workflow/tool installation behavior.
Changes:
- Add retry + hard-fail behavior when resolving NuGet WDK QFE versions in the
install-wdkcomposite action. - Introduce an
install-cargo-toolcomposite action to install cargo tools viataiki-e/install-actionwith a verification step andcargo installfallback. - Expand LLVM versions tested in CI and adjust several workflows/matrices (including disabling CodeQL for this run).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| README.md | Updates local LLVM install guidance and points readers at CI-tested LLVM versions. |
| .github/workflows/test.yaml | Expands LLVM matrix; switches cargo tool installs to the new install-cargo-tool; reduces matrix dimensions for this run. |
| .github/workflows/local-development-makefile.yaml | Expands LLVM matrix; switches cargo-make install to install-cargo-tool; reduces matrix dimensions for this run. |
| .github/workflows/lint.yaml | Expands LLVM matrix. |
| .github/workflows/docs.yaml | Expands LLVM matrix. |
| .github/workflows/codeql.yaml | Expands LLVM matrix; disables CodeQL job for this run; switches cargo-make install to install-cargo-tool. |
| .github/workflows/build.yaml | Expands LLVM matrix; switches cargo-make install to install-cargo-tool; reduces matrix dimensions for this run. |
| .github/actions/install-wdk/action.yaml | Implements QFE resolution retry logic and a hard error when resolution fails. |
| .github/actions/install-cargo-tool/action.yaml | New composite action wrapper for cargo tool install with verification + fallback. |
Comments suppressed due to low confidence (1)
.github/actions/install-cargo-tool/action.yaml:34
- This composite action’s PowerShell step doesn’t enable
Set-StrictModeand doesn’t validate failures from native commands. In this repo, action scripts typically run with strict mode and explicitly fail on non-zero$LASTEXITCODEfor native commands; add strict mode and check/throw after running$toolName --versionandcargo installso silent failures don’t pass CI.
- name: Verify ${{ inputs.tool }} (fallback to cargo install)
shell: pwsh
run: |
$toolInput = "${{ inputs.tool }}"
# Strip version suffix (e.g., 'cargo-expand@1.0.85' -> 'cargo-expand')
$toolName = ($toolInput -split '@')[0]
if (Get-Command $toolName -ErrorAction SilentlyContinue) {
$version = & $toolName --version 2>&1
Write-Host "$toolName is installed: $version"
} else {
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cargo install $toolName --version $version | ||
| } else { | ||
| cargo install $toolName | ||
| } |
There was a problem hiding this comment.
The fallback path uses cargo install without --locked (and without verifying the tool is now discoverable). Elsewhere in the repo tool installs are generally --locked for reproducibility. Consider using cargo install --locked (and optionally --force when a specific version is requested), then re-check Get-Command and fail with a clear error if the binary is still missing.
This issue also appears on line 24 of the same file.
| cargo install $toolName --version $version | |
| } else { | |
| cargo install $toolName | |
| } | |
| cargo install $toolName --version $version --locked --force | |
| } else { | |
| cargo install $toolName --locked | |
| } | |
| if (Get-Command $toolName -ErrorAction SilentlyContinue) { | |
| $version = & $toolName --version 2>&1 | |
| Write-Host "$toolName installed via cargo install: $version" | |
| } else { | |
| Write-Error "$toolName was not found in PATH after cargo install. Ensure that cargo's bin directory is on PATH and that installation succeeded." | |
| exit 1 | |
| } |
| * `winget install -i LLVM.LLVM` | ||
| * Ensure you select the GUI option to add LLVM to the PATH | ||
| * LLVM 18 has a bug that causes bindings to fail to generate for ARM64. Continue using LLVM 17 until LLVM 19 comes out with [the fix](https://github.com/llvm/llvm-project/pull/93235). See [this](https://github.com/rust-lang/rust-bindgen/issues/2842) for more details. | ||
| * See list of [tested/supported versions](https://github.com/microsoft/windows-drivers-rs/blob/main/.github/workflows/build.yaml#L39) in the [CI workflow](./.github/workflows/build.yaml) |
There was a problem hiding this comment.
The README links to the CI workflow using a hard-coded line anchor on the main branch (.../build.yaml#L39). This is brittle (line numbers change) and can drift from what’s actually tested in a PR branch. Prefer linking to the file without a line anchor, or to a stable section/anchor in the repo that lists supported LLVM versions.
| * See list of [tested/supported versions](https://github.com/microsoft/windows-drivers-rs/blob/main/.github/workflows/build.yaml#L39) in the [CI workflow](./.github/workflows/build.yaml) | |
| * See the list of tested/supported versions in the [CI workflow](./.github/workflows/build.yaml) |
|
|
||
| jobs: | ||
| analyze: | ||
| if: false # Disabled for ARM64-only test run |
There was a problem hiding this comment.
This hard-disables the CodeQL job for all events. If this PR (or the branch) is accidentally merged, it will silently stop code scanning. Prefer gating this on a branch name / workflow_dispatch input, or restricting the change to the stress-test branch only (so CodeQL remains enabled by default).
| # Filter versions that match the base version and find the latest QFE | ||
| $matchingVersions = @($availableVersions | Where-Object { $_.StartsWith("$inputVersion.") }) | ||
| if (@($matchingVersions).Count -eq 0) { | ||
| throw "No QFE versions found on NuGet for base version $inputVersion. NuGet WDK packages require a 4-part version (e.g. $inputVersion.1882)." | ||
| } | ||
|
|
||
| $version = $matchingVersions | Sort-Object { [System.Version]$_ } | Select-Object -Last 1 | ||
| Write-Host "Found latest QFE version: $version" | ||
| Write-Host "Resolved latest QFE version: $version" | ||
| break | ||
| } catch { | ||
| Write-Warning "Attempt $attempt failed: $_" | ||
| if ($attempt -lt $maxAttempts) { | ||
| Write-Host "Retrying in $retryDelaySec seconds..." | ||
| Start-Sleep -Seconds $retryDelaySec | ||
| $retryDelaySec *= 2 | ||
| } |
There was a problem hiding this comment.
The retry loop will also retry when no matching QFE versions are found (the throw on empty $matchingVersions is caught and treated like a transient failure). That case is deterministic and likely indicates a bad base version or feed issue, so it should fail fast without exponential backoff delays.
…d 3-part version When a 3-part WDK version like 10.0.26100 is passed to the install-wdk action with NuGet source, it must resolve a QFE suffix (e.g. 10.0.26100.1882) because NuGet WDK packages only exist with 4-part versions. The previous fallback logic silently used the bare 3-part version on timeout or API failure, which guaranteed a nuget install failure downstream with a confusing "package not found" error. Replace the silent fallback with retry logic (3 attempts with exponential backoff) and a hard error if resolution fails entirely.
7a1eb0f to
be2b562
Compare
|
Stress test complete. All 45 nuget jobs resolved QFE successfully across the expanded LLVM matrix. No NuGet API timeouts observed — the fix is validated. Closing as this was only for CI validation of PR #624. |
Stress test for the NuGet QFE fallback fix from PR #624, cherry-picked onto the
expand-llvmbranch to exercise it across a larger CI matrix with more parallel nuget jobs.The expand-llvm matrix produces ~48+ nuget jobs running simultaneously, which increases the chance of observing a NuGet API timeout and verifying the retry logic handles it correctly.
This PR is not intended to be merged. It will be closed after CI results are collected.