ci: fix Check job status to detect skipped results (fixes #2208)#2210
Conversation
…esult Fixes NVIDIA#2208. The previous `Check job status` body only treated a dep's `result == 'cancelled' || == 'failure'` as failure, letting `'skipped'` slip through silently. When a `build-*` job fails, the dependent `test-*` job is set to `'skipped'` by default needs-failure propagation, and the aggregator passes -- exactly the case demonstrated by NVIDIA#2209. Adopt CCCL's `check_result` pattern: explicit `expected="success"` per dependency, with `expected="skipped"` for legitimate `[doc-only]` skips, and an early short-circuit for `[no-ci]`. Now any deviation from the expected status (including `'skipped'` from a failed upstream) fails the aggregator. Reference: NVIDIA/cccl ci-workflow-pull-request.yml L463-L526.
|
/ok to test 170c799 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test 739c499 |
Mirrors the NVIDIA#2209 reproducer on this branch's new aggregator. Expected outcome: - All Build * matrix entries fail at the env-vars build step. - Downstream Test * jobs are skipped (cascaded needs-failure). - Check job status now reports failure -- the symmetric, post-fix counterpart to NVIDIA#2209 (where Check job status reported success despite the same set of failures). DO NOT MERGE while this commit is present. Remove before merge. Refs NVIDIA#2208, NVIDIA#2209.
|
/ok to test 811388c |
Gating-check fix validated end-to-endThe new aggregator catches the exact failure mode that motivated #2208. Demonstrated by running the #2209 reproducer (intentional
Same reproducer, opposite verdict. The skipped-as-success hole is closed. The demo commit |
This reverts commit 811388c.
|
/ok to test 7d31a04 |
rwgk
left a comment
There was a problem hiding this comment.
Looks good to me and codex gpt-5.5:
The new check_result approach is a clear improvement over the previous cancelled || failure checks: it closes the skipped-as-success hole while keeping the intentional [doc-only] and [no-ci] cases explicit.
The one subtle behavior change I noticed is that fork or non-NVIDIA repo runs may now report a red final aggregator where jobs are skipped by owner-gated conditions. I agree that is not a concern for this PR, and it is preferable to avoid weakening the aggregator in a way that could reintroduce false-green CI.
|
u7k4rs6
left a comment
There was a problem hiding this comment.
PR Risk Summary
Quality Score: 3/10
Risk Level: high
Merge Recommendation: Do not merge
Rationale: The CI workflow file has several issues. Most critically, there's an incorrect handling of skipped jobs in CI aggregation, which could lead to inaccurate reporting of build statuses. Additionally, there are concerns about potentially unnecessary job executions, impacting CI performance. Several style issues, including redundant comments, inconsistent variable assignment, complex if conditions, lack of explicit success tracking, and redundant exit 0, further reduce the quality of the changes. These issues collectively pose a significant risk to the CI pipeline's reliability and efficiency.
High Risk Changes:
- Incorrect handling of skipped jobs in CI aggregation.
- Potentially unnecessary execution of CI jobs.
| @@ -432,52 +432,42 @@ jobs: | |||
| steps: | |||
| - name: Exit | |||
| run: | | |||
There was a problem hiding this comment.
[STYLE] Redundant comments explaining the always() condition.
Why it matters: The extensive comments at the beginning of the job are no longer relevant to the new logic and add unnecessary noise. The new logic is self-explanatory with the check_result function.
Suggested fix: Remove the block of comments starting from line 434 down to line 473 (inclusive) that explains the always() condition and the previous logic.
Explanation for juniors: The comments describe a complex reasoning for using always() and the previous failure condition. With the introduction of the check_result function and a more explicit status checking mechanism, these comments are outdated and can be removed to improve readability.
| @@ -432,52 +432,42 @@ jobs: | |||
| steps: | |||
| - name: Exit | |||
| run: | | |||
There was a problem hiding this comment.
[BUG] Incorrect handling of skipped jobs in CI aggregation.
Why it matters: The previous logic cancelled || failure would cause jobs that were intentionally skipped (e.g., due to doc-only flag) to be treated as failures if their upstream dependencies were cancelled. This prevented the CI from passing when it should have.
Suggested fix: Replace the old if conditions checking for cancelled || failure with a new check_result function that explicitly checks for the expected status ('success' or 'skipped').
Explanation for juniors: The new check_result function, along with the expected variable, provides a more robust way to handle job statuses. It allows for defining what status is expected for each job, including 'skipped' when the doc-only flag is true. This ensures that intentionally skipped jobs do not cause the CI to fail.
| @@ -432,52 +432,42 @@ jobs: | |||
| steps: | |||
| - name: Exit | |||
| run: | | |||
There was a problem hiding this comment.
[PERFORMANCE] Potentially unnecessary execution of CI jobs.
Why it matters: The if: always() on the job ensures that this aggregation step runs even if upstream jobs are skipped. While this is necessary for the aggregation logic to function, it might lead to this job running when it's not strictly needed if the should-skip condition is met.
Suggested fix: Add an explicit check at the beginning of the job to exit early if needs.should-skip.outputs.skip is true.
Explanation for juniors: The new logic includes an early exit condition: if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then echo "[no-ci] - skipping aggregator checks"; exit 0; fi. This prevents the rest of the aggregation logic from running when the build is intended to be skipped, improving efficiency.
| if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then | ||
| echo "[no-ci] - skipping aggregator checks" | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
[STYLE] Inconsistent variable assignment for doc_only.
Why it matters: The doc_only variable was previously assigned using ${{ needs.should-skip.outputs.doc-only }} directly within the if condition. Now, it's assigned to a shell variable doc_only using doc_only="${{ needs.should-skip.outputs.doc-only }}" before being used.
Suggested fix: Ensure consistent variable assignment. The new assignment doc_only="${{ needs.should-skip.outputs.doc-only }}" is correct for shell scripting.
Explanation for juniors: The change from using the GitHub Actions expression directly in the if statement to assigning it to a shell variable doc_only is a stylistic improvement for clarity and consistency within the shell script block. The new approach is more readable.
| if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then | ||
| echo "[no-ci] - skipping aggregator checks" | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
[STYLE] Complex and repetitive if conditions.
Why it matters: The original if conditions checking for multiple job results (cancelled || failure) were long and repetitive, making them hard to read and maintain.
Suggested fix: Replace the complex if conditions with calls to the check_result function.
Explanation for juniors: The introduction of the check_result function encapsulates the logic for checking job statuses. This makes the code much cleaner, more readable, and easier to extend or modify in the future. Each job's status is now checked individually with a clear expected outcome.
| if [[ "${{ needs.should-skip.outputs.skip }}" == "true" ]]; then | ||
| echo "[no-ci] - skipping aggregator checks" | ||
| exit 0 | ||
| fi |
There was a problem hiding this comment.
[STYLE] Lack of explicit success status tracking.
Why it matters: The previous logic relied on exit 1 to indicate failure. There was no explicit variable to track the overall success status of the aggregation.
Suggested fix: Introduce a status variable initialized to 'success' and update it to 'failed' within the check_result function if any check fails.
Explanation for juniors: The new status variable provides a clear way to track the overall outcome of the CI aggregation. The check_result function modifies this variable upon encountering an unexpected job result, and the final [[ "$status" == "success" ]] expression determines the job's overall success or failure.
| if [[ "$result" != "$expected" ]]; then | ||
| echo "::error::$name did not match expected result" | ||
| status="failed" | ||
| fi |
There was a problem hiding this comment.
[STYLE] Redundant exit 0.
Why it matters: The exit 0 at the end of the script was redundant because the script would naturally exit with a success code if no exit 1 was encountered. The new logic uses a final expression to determine the exit code.
Suggested fix: Remove the exit 0 statement.
Explanation for juniors: The new logic concludes with [[ "$status" == "success" ]]. This expression evaluates to true (exit code 0) if status is 'success', and false (exit code 1) otherwise. This is a more idiomatic way to handle script exit codes based on a condition and makes the redundant exit 0 unnecessary.
u7k4rs6
left a comment
There was a problem hiding this comment.
PR Risk Summary
Quality Score: 10/10
Risk Level: low
Merge Recommendation: Safe to merge
Rationale: The only file changed is a GitHub Actions CI workflow file. No functional code changes were introduced, and the review found no issues. This change is low risk.
|
@u7k4rs6 please do not fire up nonsense agents against our repo and add noise. This is not a welcomed behavior. |
|
@leofang Apologies, that was an automated agent I should not have run against your repo, especially on an already-merged PR. It added noise and got the core point wrong. I've turned it off. Thanks for the patience. |
Fixes #2208.
Replaces the
cancelled || failurepredicate in.github/workflows/ci.yml'schecks:job with CCCL'scheck_resultpattern: per-depexpected="success", withexpected="skipped"for legitimate[doc-only]skips, and an early short-circuit for[no-ci].Reference: https://github.com/NVIDIA/cccl/blob/8c0e6cb1b6412d7bd0070f9ac3f55fa80231961a/.github/workflows/ci-workflow-pull-request.yml#L463-L526
End-to-end validation (pre-fix vs post-fix reproducer): #2210 (comment)