refactor(ci): unify GPU Chainsaw layout and validation flow #587

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:codex/pr579-chainsaw-layout-followup
Apr 16, 2026

@yuanchen8911
Contributor

@yuanchen8911 yuanchen8911 commented Apr 15, 2026

Summary

Unify the GPU Chainsaw test layout and H100 Kind validation flow between inference and training. This moves shared assertions into the right ownership layers, keeps leaf-specific checks with their leaf suites, narrows workflow triggers to those owned inputs, and makes the executed validation flow consistent across both GPU workflows.

Motivation / Context

This is the follow-up cleanup after #579. That PR made the GPU training and inference workflows much more symmetric, but the Kind GPU suites still depended on cluster/ assertions in different ways, which kept test ownership, path filters, and workflow flow partially inconsistent.

This update also removes the inference-only Dynamo smoke test from the H100 Kind workflow for now. The current nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0 image adds significant latency and friction in Kind CI, and training does not yet have a symmetric smoke path. If smoke coverage comes back later, it should be reintroduced alongside a comparable training smoke test.

Fixes: N/A
Related: #579

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: CI workflows / Chainsaw tests

Implementation Notes

  • Move cross-environment shared assertions from tests/chainsaw/ai-conformance/cluster/ to tests/chainsaw/ai-conformance/common/
  • Keep Kind-shared assertions under tests/chainsaw/ai-conformance/kind-common/
  • Move inference-only Kind assertions into tests/chainsaw/ai-conformance/kind-inference-dynamo/
  • Move the training-only Kubeflow Trainer assertion into tests/chainsaw/ai-conformance/kind-training-kubeflow/
  • Update the external cluster/ suite to reference ../common/* instead of local copies
  • Update both H100 GPU workflows to rely on common/**, kind-common/**, and the leaf suite directories rather than explicit shared cluster/assert-*.yaml file lists
  • Align both H100 GPU workflows on the same executed validation flow: Chainsaw health checks, pre-validation resource existence check, ./aicr validate --phase conformance, post-run resource snapshot, and artifact upload
  • Capture the post-run resource snapshot as conformance-evidence/resource-existence-post.txt
  • Temporarily remove the inference-only Dynamo smoke path so training and inference share the same H100 Kind flow; keep the smoke manifest in-tree for later reintroduction if needed
  • Increase kind load timeouts in aicr-build action: smoke-test image 600s→900s, validator images 300s→600s (mitigates intermittent timeout failures on H100 runners)
  • Fix referencedAssertFiles to decode all YAML documents in multi-document chainsaw-test.yaml files, matching the existing parseYAMLFile behavior

Testing

unset GITLAB_TOKEN && make qualify

make qualify passed locally on the rebased branch.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

Summary by CodeRabbit

  • Chores

    • Increased CI image load timeouts, made validation-artifact collection more robust and always-run, and removed an obsolete smoke-test execution flow.
    • Improved failure diagnostics with an on-failure debug step that collects cluster artifacts for troubleshooting.
  • Tests

    • Reorganized assertion files into shared common directories and added new conformance assertions for inference and training suites.
    • Added parsing logic and unit tests to resolve and aggregate assertion files referenced by test manifests.
  • Documentation

    • Updated conformance README to document shared assertions and ownership.

@yuanchen8911 yuanchen8911 added the enhancement New feature or request label Apr 15, 2026
@yuanchen8911 yuanchen8911 requested review from mchmarny and xdu31 April 15, 2026 21:21
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from 387663b to 5519aba Compare April 15, 2026 21:23
@yuanchen8911 yuanchen8911 requested review from a team as code owners April 15, 2026 21:23
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch 3 times, most recently from 90bca0f to 9bf488a Compare April 15, 2026 21:53
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 16, 2026 00:39
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from 9bf488a to 93133f8 Compare April 16, 2026 02:20
@yuanchen8911 yuanchen8911 changed the title from "refactor(ci): split shared and leaf GPU Chainsaw asserts" to "refactor(ci): unify GPU Chainsaw layout and validation flow" Apr 16, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from 93133f8 to b297b8f Compare April 16, 2026 03:12
@yuanchen8911 yuanchen8911 requested a review from dims April 16, 2026 03:46
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from b297b8f to 0b38c08 Compare April 16, 2026 04:03
@github-actions github-actions bot added size/XL and removed size/L labels Apr 16, 2026
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch 2 times, most recently from 8743b5e to 70e59a9 Compare April 16, 2026 04:24
Member

@mchmarny mchmarny left a comment


Clean refactor — the ownership model (common / kind-common / leaf) is well-defined and the workflow simplification is significant. The Go changes to auto-discover assert files from chainsaw-test.yaml are solid and well-tested (multi-doc support, fallback to dir scanning, dedup).

A few things to check before merging:

  1. H100 inference CI is red — training passed, inference failed. Needs a look to confirm it's not caused by this change (new --dir resolution vs old --file list).
  2. Minor path normalization inconsistency in the dedup map (see inline comment on main.go:240).
  3. Timeout bumps in aicr-build are reasonable but worth documenting the root cause so they don't drift further.

No blockers — nothing here is high/critical. Solid cleanup overall.

Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread tests/chainsaw/ai-conformance/main.go
Comment thread tests/chainsaw/ai-conformance/main.go
Comment thread tests/chainsaw/ai-conformance/main.go
Comment thread .github/actions/aicr-build/action.yml
Comment thread .github/workflows/gpu-h100-inference-test.yaml
Contributor

@ArangoGutierrez ArangoGutierrez left a comment


Solid cleanup. Key changes reviewed:

  1. common/ directory -- moving shared assertions out of cluster/ into common/ is the right call. The --dir flag replacing per-file --file flags eliminates the fragile enumeration that was previously causing silent CI gaps when new assert files were added.

  2. Evidence collection fix -- changing the condition from steps.validate-conformance.outcome == success|failure to always() && !cancelled() && steps.bundle-install.outcome == success fixes the regression where evidence was lost when earlier steps failed. Good.

  3. Dynamo smoke test removal -- intentionally disabled with a clear comment explaining why. Reasonable given the Kind CI flakiness.

  4. kind load timeouts -- 600->900s and 300->600s. These are pragmatic.

  5. main_test.go -- tests for the assert-loading logic add coverage for a path that was previously untested.

One note: the inference GPU test is still pending. The training test passed, which exercises the same code paths. LGTM.

@yuanchen8911
Contributor Author

@mchmarny Thanks for the thorough review. Responded inline to each comment. Summary:

| Comment | Reply |
| --- | --- |
| Inference CI red | Snapshot timeout, not --dir related; added snapshot diagnostics to gpu-snapshot-validate |
| filepath.Clean nit | Fixed |
| Multi-doc YAML | Acknowledged |
| Path traversal guard | Declined: .. is valid for sibling refs like ../common/assert-*.yaml; follow-up if needed |
| Timeout docs | Added comments explaining DinD bridge I/O contention |
| always() condition | Leaving as-is: diagnostic-only, better noisy than missing |

Pushed the fixes. Mind taking another look when you get a chance?

@coderabbitai

coderabbitai bot commented Apr 16, 2026

Walkthrough

Reorganized Chainsaw assertions into shared and suite-specific directories, updated GitHub Actions timeouts and added failure-only debug steps, adjusted workflows to use directory-based conformance inputs and renamed evidence collection, and extended test discovery to parse chainsaw-test.yaml with new unit tests.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| GitHub Actions - Timeouts | .github/actions/aicr-build/action.yml | Increased kind load docker-image timeouts: snapshot image attempts 600→900s, validator image attempts 300→600s; comments added about transfer size/contention. |
| GitHub Actions - Snapshot Debugging | .github/actions/gpu-snapshot-validate/action.yml | Added Debug snapshot Job composite step that runs on failure and collects kubectl outputs (job, pods, describe, logs, ConfigMap), each command allowed to fail (` |
| Workflows - GPU H100 Inference | .github/workflows/gpu-h100-inference-test.yaml | Removed Dynamo vLLM smoke-test flow and related cleanup; replaced per-component --file asserts with directory-based conformance inputs; renamed and changed evidence collection to “validation artifacts” with tee and set -o pipefail; adjusted artifact upload and cleanup steps. |
| Workflows - GPU H100 Training | .github/workflows/gpu-h100-training-test.yaml | Moved resource-existence check position; switched --file args to --dir (common & kind-common); renamed/changed post-conformance evidence step to always-run Collect validation artifacts with pipefail, tee, and continue-on-error; renamed artifact upload step. |
| Chainsaw Assertions - Common | tests/chainsaw/ai-conformance/common/* | Added shared assertion YAMLs for cert-manager, DRA driver, KAI scheduler, monitoring (Prometheus), and skyhook under common/. |
| Chainsaw Assertions - Kind: inference-dynamo | tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-crds.yaml, .../assert-dynamo.yaml, .../assert-kgateway.yaml, .../chainsaw-test.yaml | Added assert-crds.yaml (multiple CRD names), assert-dynamo.yaml (deployment Available checks in dynamo-system), assert-kgateway.yaml (deployment Available checks in kgateway-system); updated suite test to reference local and common asserts. |
| Chainsaw Assertions - Kind: training-kubeflow | tests/chainsaw/ai-conformance/kind-training-kubeflow/* | Added assert-kubeflow-trainer.yaml and updated suite test file: refs to use ../common/* for shared checks and local file for trainer. |
| Chainsaw Test Manifests & README | tests/chainsaw/ai-conformance/cluster/chainsaw-test.yaml, tests/chainsaw/ai-conformance/.../chainsaw-test.yaml, tests/chainsaw/ai-conformance/README.md | Updated assert.file references to ../common/* where appropriate, moved shared entries out of cluster/ listing, and documented common/ and ownership model in README. |
| Test Discovery & Unit Tests | tests/chainsaw/ai-conformance/main.go, tests/chainsaw/ai-conformance/main_test.go | Extended parseAssertFiles to parse chainsaw-test.yaml and collect referenced assert-*.yaml from `try |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through files both near and far,

Found asserts now shared beneath one star,
Timeouts lengthened, debug steps in tow,
Chainsaw learns where all the manifests go—
A tiny rabbit cheers the CI flow!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: unifying GPU Chainsaw test layout and validation flow between inference and training workflows by reorganizing assertions and aligning execution paths. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: The diagnostics step currently redirects stderr to
/dev/null for all kubectl invocations (e.g., the commands invoking "kubectl
--context=\"kind-${{ inputs.cluster_name }}\" -n default get job aicr -o yaml",
"get pods -l app.kubernetes.io/name=aicr -o wide", "describe job aicr",
"describe pods -l app.kubernetes.io/name=aicr", "logs ... --all-containers
--previous", and "get configmap aicr-snapshot -o yaml"), which hides useful
error output; remove the trailing "2>/dev/null" from these kubectl lines but
keep the "|| true" so the step won’t fail while preserving stderr for
troubleshooting.

In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 195-211: The "Collect validation artifacts" step currently enables
"set -o pipefail" then runs the diagnostic pipeline "go run ... | tee ...",
which can cause the pipeline to return a non-zero status and fail the job;
modify the step so the artifact collection cannot fail the workflow by disabling
pipefail for that command or swallowing the pipeline exit status—for example,
remove or override "set -o pipefail" locally for the pipeline command (e.g., run
the command in a subshell after "set +o pipefail") or append "|| true" to the
"go run ... | tee ..." line so the step always exits 0 while still writing the
conformance-evidence artifact.

In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 188-205: The "Collect validation artifacts" step currently uses
"set -o pipefail" and runs "go run ... | tee
conformance-evidence/resource-existence-post.txt", which can fail the whole job;
mark this diagnostic-only step as non-blocking by adding continue-on-error: true
to the step definition (the step named "Collect validation artifacts") so any
failure in the "go run" / "tee" pipeline won’t cause the workflow to fail.

In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 168-174: Add unit tests that exercise error conditions using
writeTestFile to create temporary files: (1) a test that writes an
invalid/malformed YAML assertion file and asserts parseAssertFiles returns a
non-nil error; (2) a test that writes an invalid/malformed chainsaw-test.yaml
and asserts the test-config parsing function (e.g., parseChainsawTest /
LoadTestConfig) returns an error; and (3) a test that writes a
chainsaw-test.yaml referencing a non-existent assertion file and asserts the
loader/runner returns an error for the missing file. Use Test* functions with
t.TempDir(), call writeTestFile to create the faulty inputs, then call the
existing parsing/loading functions and fail the test if err == nil.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 91676850-793f-4b26-b5a3-160f04fc2604

📥 Commits

Reviewing files that changed from the base of the PR and between 03b82bd and 016e07a.

📒 Files selected for processing (19)
  • .github/actions/aicr-build/action.yml
  • .github/actions/gpu-snapshot-validate/action.yml
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • tests/chainsaw/ai-conformance/README.md
  • tests/chainsaw/ai-conformance/cluster/chainsaw-test.yaml
  • tests/chainsaw/ai-conformance/common/assert-cert-manager.yaml
  • tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
  • tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml
  • tests/chainsaw/ai-conformance/common/assert-monitoring.yaml
  • tests/chainsaw/ai-conformance/common/assert-skyhook.yaml
  • tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-crds.yaml
  • tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml
  • tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-kgateway.yaml
  • tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml
  • tests/chainsaw/ai-conformance/kind-training-kubeflow/assert-kubeflow-trainer.yaml
  • tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml
  • tests/chainsaw/ai-conformance/main.go
  • tests/chainsaw/ai-conformance/main_test.go

Comment thread .github/actions/gpu-snapshot-validate/action.yml Outdated
Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread .github/workflows/gpu-h100-training-test.yaml
Comment thread tests/chainsaw/ai-conformance/main_test.go
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from 016e07a to 1806849 Compare April 16, 2026 15:21

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`:
- Around line 15-32: The test only asserts two Deployments
(dynamo-platform-dynamo-operator-controller-manager and grove-operator) but the
step claims it verifies etcd/NATS; either add explicit assertions for the etcd
and NATS components or tighten the step description. To fix, add similar status
assertion blocks for the etcd and nats resources (for example assert the etcd
StatefulSet/Deployment and the nats Deployment/StatefulSet have
(conditions[?type == 'Available']): - status: "True") using the resource names
used in your cluster, or update the `assert-dynamo` step description in the test
manifest to remove mention of etcd/NATS so it matches the current assertions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 25e4b8a6-13b3-43cf-b59d-f71466973967

📥 Commits

Reviewing files that changed from the base of the PR and between 016e07a and 1806849.


@yuanchen8911 yuanchen8911 removed the request for review from dims April 16, 2026 16:22
Move shared assertions from cluster/ to common/, keep leaf-specific
checks with their suites, align both H100 GPU workflows on the same
validation flow, and temporarily remove the Dynamo smoke path.

Decode all YAML documents in referencedAssertFiles (was only reading
the first). Increase kind-load timeouts in aicr-build action.
@yuanchen8911 yuanchen8911 force-pushed the codex/pr579-chainsaw-layout-followup branch from 1806849 to d6fd810 Compare April 16, 2026 16:44

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml (1)

15-32: ⚠️ Potential issue | 🟠 Major

Add etcd/NATS checks or narrow the fixture’s scope.

This file still only asserts the two Deployments, so the assert-dynamo step can pass even when Dynamo’s etcd or NATS backing services are unhealthy. Please add explicit assertions for those components or tighten the parent step description to match the actual coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`
around lines 15 - 32, The current fixture only asserts Deployment availability
for dynamo-platform-dynamo-operator-controller-manager and grove-operator; add
explicit checks for Dynamo’s backing services (etcd and NATS) or narrow the step
description to avoid a false sense of coverage. Update assert-dynamo.yaml to
include status assertions for the etcd and NATS components (e.g., their
StatefulSet/Deployment names or ClusterService/Service objects used in this
stack) asserting (conditions[?type == 'Available']) - or change the top
comment/description to state that only the two operator Deployments are being
checked so the step doesn’t imply etcd/NATS health is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/aicr-build/action.yml:
- Around line 57-63: Centralize the hardcoded timeout (900) by introducing a
single local variable (e.g., TIMEOUT_SECONDS) at the top of the action/job and
replace all literal 900 occurrences used with the timeout command (the `timeout
900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"`
invocations and any other `timeout 900 ...` uses) so tuning requires one change;
update references in the `kind load docker-image` retry blocks and any similar
timeout usages to reference TIMEOUT_SECONDS instead of the literal.

In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: Update the kubectl debug queries to target the actual
snapshot resources: replace references to the Job and ConfigMap named "aicr"
with "aicr-e2e-snapshot" (the job name used in snapshot-job.yaml and the
configmap in assert-configmap.yaml), and change pod selection from the label
selector "app.kubernetes.io/name=aicr" to selecting by the Job-owned pods via
"job-name=aicr-e2e-snapshot" for the logs, get pods and describe pods commands
so the snapshots actually match the deployed resources.

In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 222-228: The helper writeTestFile currently calls t.Fatalf(...)
but doesn't follow it with an explicit return; update the writeTestFile function
so that immediately after the t.Fatalf(...) call you add a return statement to
make intent explicit and satisfy static analysis tools (reference: function
writeTestFile and the t.Fatalf invocation).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: a098fcf8-1f7a-4148-af1c-3414e891373d

📥 Commits

Reviewing files that changed from the base of the PR and between 1806849 and d6fd810.


Comment thread .github/actions/aicr-build/action.yml
Comment thread .github/actions/gpu-snapshot-validate/action.yml
Comment thread tests/chainsaw/ai-conformance/main_test.go
@yuanchen8911 yuanchen8911 merged commit 1cbe5d9 into NVIDIA:main Apr 16, 2026
29 of 32 checks passed

3 participants