Skip to content

ci: stress test NuGet QFE fallback fix (do not merge)#628

Closed
wmmc88 wants to merge 8 commits into
microsoft:mainfrom
wmmc88:ci/stress-test-qfe-fix
Closed

ci: stress test NuGet QFE fallback fix (do not merge)#628
wmmc88 wants to merge 8 commits into
microsoft:mainfrom
wmmc88:ci/stress-test-qfe-fix

Conversation

@wmmc88
Copy link
Copy Markdown
Collaborator

@wmmc88 wmmc88 commented Mar 7, 2026

Stress test for the NuGet QFE fallback fix from PR #624, cherry-picked onto the expand-llvm branch to exercise it across a larger CI matrix with more parallel nuget jobs.

The expand-llvm matrix produces ~48+ nuget jobs running simultaneously, which increases the chance of observing a NuGet API timeout and verifying the retry logic handles it correctly.

This PR is not intended to be merged. It will be closed after CI results are collected.

wmmc88 and others added 7 commits March 3, 2026 19:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Melvin Wang <melvin.mc.wang@gmail.com>
Wraps taiki-e/install-action@v2 with a pwsh verification step that
falls back to cargo install if the tool binary is missing. Works
around intermittent silent bash failures on windows-11-arm where
shell: bash exits 0 with zero output under WoW64 arm64 emulation.

See https://github.com/actions/partner-runner-images/issues/169
Disables amd64 runners and CodeQL to focus on windows-11-arm jobs
where the intermittent shell: bash silent failure occurs. This
maximizes the chance of triggering the fallback path.
Copilot AI review requested due to automatic review settings March 7, 2026 20:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR is a CI stress-test change set intended to exercise the NuGet WDK QFE resolution fallback logic under higher parallelism, primarily by expanding the LLVM version matrix and adjusting workflow/tool installation behavior.

Changes:

  • Add retry + hard-fail behavior when resolving NuGet WDK QFE versions in the install-wdk composite action.
  • Introduce an install-cargo-tool composite action to install cargo tools via taiki-e/install-action with a verification step and cargo install fallback.
  • Expand LLVM versions tested in CI and adjust several workflows/matrices (including disabling CodeQL for this run).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
README.md Updates local LLVM install guidance and points readers at CI-tested LLVM versions.
.github/workflows/test.yaml Expands LLVM matrix; switches cargo tool installs to the new install-cargo-tool; reduces matrix dimensions for this run.
.github/workflows/local-development-makefile.yaml Expands LLVM matrix; switches cargo-make install to install-cargo-tool; reduces matrix dimensions for this run.
.github/workflows/lint.yaml Expands LLVM matrix.
.github/workflows/docs.yaml Expands LLVM matrix.
.github/workflows/codeql.yaml Expands LLVM matrix; disables CodeQL job for this run; switches cargo-make install to install-cargo-tool.
.github/workflows/build.yaml Expands LLVM matrix; switches cargo-make install to install-cargo-tool; reduces matrix dimensions for this run.
.github/actions/install-wdk/action.yaml Implements QFE resolution retry logic and a hard error when resolution fails.
.github/actions/install-cargo-tool/action.yaml New composite action wrapper for cargo tool install with verification + fallback.
Comments suppressed due to low confidence (1)

.github/actions/install-cargo-tool/action.yaml:34

  • This composite action’s PowerShell step doesn’t enable Set-StrictMode and doesn’t validate failures from native commands. In this repo, action scripts typically run with strict mode and explicitly fail on non-zero $LASTEXITCODE for native commands; add strict mode and check/throw after running $toolName --version and cargo install so silent failures don’t pass CI.
    - name: Verify ${{ inputs.tool }} (fallback to cargo install)
      shell: pwsh
      run: |
        $toolInput = "${{ inputs.tool }}"
        # Strip version suffix (e.g., 'cargo-expand@1.0.85' -> 'cargo-expand')
        $toolName = ($toolInput -split '@')[0]

        if (Get-Command $toolName -ErrorAction SilentlyContinue) {
          $version = & $toolName --version 2>&1
          Write-Host "$toolName is installed: $version"
        } else {

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +39 to +42
cargo install $toolName --version $version
} else {
cargo install $toolName
}
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback path uses cargo install without --locked (and without verifying the tool is now discoverable). Elsewhere in the repo tool installs are generally --locked for reproducibility. Consider using cargo install --locked (and optionally --force when a specific version is requested), then re-check Get-Command and fail with a clear error if the binary is still missing.

This issue also appears on line 24 of the same file.

Suggested change
cargo install $toolName --version $version
} else {
cargo install $toolName
}
cargo install $toolName --version $version --locked --force
} else {
cargo install $toolName --locked
}
if (Get-Command $toolName -ErrorAction SilentlyContinue) {
$version = & $toolName --version 2>&1
Write-Host "$toolName installed via cargo install: $version"
} else {
Write-Error "$toolName was not found in PATH after cargo install. Ensure that cargo's bin directory is on PATH and that installation succeeded."
exit 1
}

Copilot uses AI. Check for mistakes.
Comment thread README.md
* `winget install -i LLVM.LLVM`
* Ensure you select the GUI option to add LLVM to the PATH
* LLVM 18 has a bug that causes bindings to fail to generate for ARM64. Continue using LLVM 17 until LLVM 19 comes out with [the fix](https://github.com/llvm/llvm-project/pull/93235). See [this](https://github.com/rust-lang/rust-bindgen/issues/2842) for more details.
* See list of [tested/supported versions](https://github.com/microsoft/windows-drivers-rs/blob/main/.github/workflows/build.yaml#L39) in the [CI workflow](./.github/workflows/build.yaml)
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The README links to the CI workflow using a hard-coded line anchor on the main branch (.../build.yaml#L39). This is brittle (line numbers change) and can drift from what’s actually tested in a PR branch. Prefer linking to the file without a line anchor, or to a stable section/anchor in the repo that lists supported LLVM versions.

Suggested change
* See list of [tested/supported versions](https://github.com/microsoft/windows-drivers-rs/blob/main/.github/workflows/build.yaml#L39) in the [CI workflow](./.github/workflows/build.yaml)
* See the list of tested/supported versions in the [CI workflow](./.github/workflows/build.yaml)

Copilot uses AI. Check for mistakes.

jobs:
analyze:
if: false # Disabled for ARM64-only test run
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hard-disables the CodeQL job for all events. If this PR (or the branch) is accidentally merged, it will silently stop code scanning. Prefer gating this on a branch name / workflow_dispatch input, or restricting the change to the stress-test branch only (so CodeQL remains enabled by default).

Copilot uses AI. Check for mistakes.
Comment thread .github/actions/install-wdk/action.yaml Outdated
Comment on lines +128 to +143
# Filter versions that match the base version and find the latest QFE
$matchingVersions = @($availableVersions | Where-Object { $_.StartsWith("$inputVersion.") })
if (@($matchingVersions).Count -eq 0) {
throw "No QFE versions found on NuGet for base version $inputVersion. NuGet WDK packages require a 4-part version (e.g. $inputVersion.1882)."
}

$version = $matchingVersions | Sort-Object { [System.Version]$_ } | Select-Object -Last 1
Write-Host "Found latest QFE version: $version"
Write-Host "Resolved latest QFE version: $version"
break
} catch {
Write-Warning "Attempt $attempt failed: $_"
if ($attempt -lt $maxAttempts) {
Write-Host "Retrying in $retryDelaySec seconds..."
Start-Sleep -Seconds $retryDelaySec
$retryDelaySec *= 2
}
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry loop will also retry when no matching QFE versions are found (the throw on empty $matchingVersions is caught and treated like a transient failure). That case is deterministic and likely indicates a bad base version or feed issue, so it should fail fast without exponential backoff delays.

Copilot uses AI. Check for mistakes.
…d 3-part version

When a 3-part WDK version like 10.0.26100 is passed to the install-wdk
action with NuGet source, it must resolve a QFE suffix (e.g.
10.0.26100.1882) because NuGet WDK packages only exist with 4-part
versions. The previous fallback logic silently used the bare 3-part
version on timeout or API failure, which guaranteed a nuget install
failure downstream with a confusing "package not found" error.

Replace the silent fallback with retry logic (3 attempts with
exponential backoff) and a hard error if resolution fails entirely.
@wmmc88 wmmc88 force-pushed the ci/stress-test-qfe-fix branch from 7a1eb0f to be2b562 Compare March 7, 2026 20:59
@wmmc88
Copy link
Copy Markdown
Collaborator Author

wmmc88 commented Mar 7, 2026

Stress test complete. All 45 nuget jobs resolved QFE successfully across the expanded LLVM matrix. No NuGet API timeouts observed — the fix is validated. Closing as this was only for CI validation of PR #624.

@wmmc88 wmmc88 closed this Mar 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants