Skip to content

ci: fix Check job status to detect skipped results (fixes #2208)#2210

Merged
rwgk merged 5 commits into
NVIDIA:mainfrom
leofang:leofang/fix-issue-2208-gating-check
Jun 16, 2026
Merged

ci: fix Check job status to detect skipped results (fixes #2208)#2210
rwgk merged 5 commits into
NVIDIA:mainfrom
leofang:leofang/fix-issue-2208-gating-check

Conversation

@leofang

@leofang leofang commented Jun 13, 2026

Copy link
Copy Markdown
Member

Fixes #2208.

Replaces the cancelled || failure predicate in .github/workflows/ci.yml's checks: job with CCCL's check_result pattern: per-dep expected="success", with expected="skipped" for legitimate [doc-only] skips, and an early short-circuit for [no-ci].

Reference: https://github.com/NVIDIA/cccl/blob/8c0e6cb1b6412d7bd0070f9ac3f55fa80231961a/.github/workflows/ci-workflow-pull-request.yml#L463-L526

End-to-end validation (pre-fix vs post-fix reproducer): #2210 (comment)

…esult

Fixes NVIDIA#2208. The previous `Check job status` body only treated a dep's
`result == 'cancelled' || == 'failure'` as failure, letting `'skipped'`
slip through silently. When a `build-*` job fails, the dependent
`test-*` job is set to `'skipped'` by default needs-failure
propagation, and the aggregator passes -- exactly the case demonstrated
by NVIDIA#2209.

Adopt CCCL's `check_result` pattern: explicit `expected="success"` per
dependency, with `expected="skipped"` for legitimate `[doc-only]` skips,
and an early short-circuit for `[no-ci]`. Now any deviation from the
expected status (including `'skipped'` from a failed upstream) fails
the aggregator.

Reference: NVIDIA/cccl ci-workflow-pull-request.yml L463-L526.
@copy-pr-bot

copy-pr-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the CI/CD CI/CD infrastructure label Jun 13, 2026
@leofang

leofang commented Jun 14, 2026

Copy link
Copy Markdown
Member Author

/ok to test 170c799

@github-actions

This comment has been minimized.

@leofang

leofang commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

/ok to test 739c499

@leofang leofang added bug Something isn't working P0 High priority - Must do! labels Jun 15, 2026
@leofang leofang added this to the cuda.core v1.1.0 milestone Jun 15, 2026
leofang added 2 commits June 16, 2026 01:36
Mirrors the NVIDIA#2209 reproducer on this branch's new aggregator. Expected
outcome:
- All Build * matrix entries fail at the env-vars build step.
- Downstream Test * jobs are skipped (cascaded needs-failure).
- Check job status now reports failure -- the symmetric, post-fix
  counterpart to NVIDIA#2209 (where Check job status reported success despite
  the same set of failures).

DO NOT MERGE while this commit is present. Remove before merge.

Refs NVIDIA#2208, NVIDIA#2209.
@leofang

leofang commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

/ok to test 811388c

@leofang

leofang commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Gating-check fix validated end-to-end

The new aggregator catches the exact failure mode that motivated #2208. Demonstrated by running the #2209 reproducer (intentional exit 7 in ci/tools/env-vars) against this branch:

Same reproducer, opposite verdict. The skipped-as-success hole is closed.

The demo commit 811388c99e will be reverted next so this PR returns to a mergeable state.

@leofang

leofang commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

/ok to test 7d31a04

@leofang leofang self-assigned this Jun 16, 2026
@leofang leofang marked this pull request as ready for review June 16, 2026 02:14

@rwgk rwgk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me and codex gpt-5.5:

The new check_result approach is a clear improvement over the previous cancelled || failure checks: it closes the skipped-as-success hole while keeping the intentional [doc-only] and [no-ci] cases explicit.

The one subtle behavior change I noticed is that fork or non-NVIDIA repo runs may now report a red final aggregator where jobs are skipped by owner-gated conditions. I agree that is not a concern for this PR, and it is preferable to avoid weakening the aggregator in a way that could reintroduce false-green CI.

@rwgk rwgk merged commit 671af69 into NVIDIA:main Jun 16, 2026
203 of 206 checks passed
@github-actions

Copy link
Copy Markdown
Doc Preview CI
Preview removed because the pull request was closed or merged.

@u7k4rs6 u7k4rs6 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Risk Summary

Quality Score: 3/10
Risk Level: high
Merge Recommendation: Do not merge
Rationale: The CI workflow file has several issues. Most critically, there's an incorrect handling of skipped jobs in CI aggregation, which could lead to inaccurate reporting of build statuses. Additionally, there are concerns about potentially unnecessary job executions, impacting CI performance. Several style issues, including redundant comments, inconsistent variable assignment, complex if conditions, lack of explicit success tracking, and redundant exit 0, further reduce the quality of the changes. These issues collectively pose a significant risk to the CI pipeline's reliability and efficiency.

High Risk Changes:

  • Incorrect handling of skipped jobs in CI aggregation.
  • Potentially unnecessary execution of CI jobs.

Comment thread .github/workflows/ci.yml
@@ -432,52 +432,42 @@ jobs:
steps:
- name: Exit
run: |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[STYLE] Redundant comments explaining the always() condition.

Why it matters: The extensive comments at the beginning of the job are no longer relevant to the new logic and add unnecessary noise. The new logic is self-explanatory with the check_result function.

Suggested fix: Remove the block of comments starting from line 434 down to line 473 (inclusive) that explains the always() condition and the previous logic.


Explanation for juniors: The comments describe a complex reasoning for using always() and the previous failure condition. With the introduction of the check_result function and a more explicit status checking mechanism, these comments are outdated and can be removed to improve readability.

Comment thread .github/workflows/ci.yml
@@ -432,52 +432,42 @@ jobs:
steps:
- name: Exit
run: |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BUG] Incorrect handling of skipped jobs in CI aggregation.

Why it matters: The previous logic cancelled || failure would cause jobs that were intentionally skipped (e.g., due to doc-only flag) to be treated as failures if their upstream dependencies were cancelled. This prevented the CI from passing when it should have.

Suggested fix: Replace the old if conditions checking for cancelled || failure with a new check_result function that explicitly checks for the expected status ('success' or 'skipped').


Explanation for juniors: The new check_result function, along with the expected variable, provides a more robust way to handle job statuses. It allows for defining what status is expected for each job, including 'skipped' when the doc-only flag is true. This ensures that intentionally skipped jobs do not cause the CI to fail.

Comment thread .github/workflows/ci.yml
@@ -432,52 +432,42 @@ jobs:
steps:
- name: Exit
run: |

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[PERFORMANCE] Potentially unnecessary execution of CI jobs.

Why it matters: The if: always() on the job ensures that this aggregation step runs even if upstream jobs are skipped. While this is necessary for the aggregation logic to function, it might lead to this job running when it's not strictly needed if the should-skip condition is met.

Suggested fix: Add an explicit check at the beginning of the job to exit early if needs.should-skip.outputs.skip is true.


Explanation for juniors: The new logic includes an early exit condition: if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then echo "[no-ci] - skipping aggregator checks"; exit 0; fi. This prevents the rest of the aggregation logic from running when the build is intended to be skipped, improving efficiency.

Comment thread .github/workflows/ci.yml
if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then
echo "[no-ci] - skipping aggregator checks"
exit 0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[STYLE] Inconsistent variable assignment for doc_only.

Why it matters: The doc_only variable was previously assigned using ${{ needs.should-skip.outputs.doc-only }} directly within the if condition. Now, it's assigned to a shell variable doc_only using doc_only="${{ needs.should-skip.outputs.doc-only }}" before being used.

Suggested fix: Ensure consistent variable assignment. The new assignment doc_only="${{ needs.should-skip.outputs.doc-only }}" is correct for shell scripting.


Explanation for juniors: The change from using the GitHub Actions expression directly in the if statement to assigning it to a shell variable doc_only is a stylistic improvement for clarity and consistency within the shell script block. The new approach is more readable.

Comment thread .github/workflows/ci.yml
if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then
echo "[no-ci] - skipping aggregator checks"
exit 0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[STYLE] Complex and repetitive if conditions.

Why it matters: The original if conditions checking for multiple job results (cancelled || failure) were long and repetitive, making them hard to read and maintain.

Suggested fix: Replace the complex if conditions with calls to the check_result function.


Explanation for juniors: The introduction of the check_result function encapsulates the logic for checking job statuses. This makes the code much cleaner, more readable, and easier to extend or modify in the future. Each job's status is now checked individually with a clear expected outcome.

Comment thread .github/workflows/ci.yml
if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then
echo "[no-ci] - skipping aggregator checks"
exit 0
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[STYLE] Lack of explicit success status tracking.

Why it matters: The previous logic relied on exit 1 to indicate failure. There was no explicit variable to track the overall success status of the aggregation.

Suggested fix: Introduce a status variable initialized to 'success' and update it to 'failed' within the check_result function if any check fails.


Explanation for juniors: The new status variable provides a clear way to track the overall outcome of the CI aggregation. The check_result function modifies this variable upon encountering an unexpected job result, and the final [[ "$status" == "success" ]] expression determines the job's overall success or failure.

Comment thread .github/workflows/ci.yml
if [[ "$result" != "$expected" ]]; then
echo "::error::$name did not match expected result"
status="failed"
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[STYLE] Redundant exit 0.

Why it matters: The exit 0 at the end of the script was redundant because the script would naturally exit with a success code if no exit 1 was encountered. The new logic uses a final expression to determine the exit code.

Suggested fix: Remove the exit 0 statement.


Explanation for juniors: The new logic concludes with [[ "$status" == "success" ]]. This expression evaluates to true (exit code 0) if status is 'success', and false (exit code 1) otherwise. This is a more idiomatic way to handle script exit codes based on a condition and makes the redundant exit 0 unnecessary.

@u7k4rs6 u7k4rs6 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Risk Summary

Quality Score: 10/10
Risk Level: low
Merge Recommendation: Safe to merge
Rationale: The only file changed is a GitHub Actions CI workflow file. No functional code changes were introduced, and the review found no issues. This change is low risk.

@leofang

leofang commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

@u7k4rs6 please do not fire up nonsense agents against our repo and add noise. This is not a welcomed behavior.

@u7k4rs6

u7k4rs6 commented Jun 17, 2026

Copy link
Copy Markdown

@leofang Apologies, that was an automated agent I should not have run against your repo, especially on an already-merged PR. It added noise and got the core point wrong. I've turned it off. Thanks for the patience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working CI/CD CI/CD infrastructure P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CI: "Check job status" aggregator passes when build jobs fail (skipped-as-success regression)

3 participants