refactor(ci): unify GPU Chainsaw layout and validation flow #587
Force-pushed 387663b to 5519aba
Force-pushed 90bca0f to 9bf488a
Force-pushed 9bf488a to 93133f8
Force-pushed 93133f8 to b297b8f
Force-pushed b297b8f to 0b38c08
Force-pushed 8743b5e to 70e59a9
mchmarny left a comment
Clean refactor — the ownership model (common / kind-common / leaf) is well-defined and the workflow simplification is significant. The Go changes to auto-discover assert files from chainsaw-test.yaml are solid and well-tested (multi-doc support, fallback to dir scanning, dedup).
A few things to check before merging:
- H100 inference CI is red — training passed, inference failed. Needs a look to confirm it's not caused by this change (new `--dir` resolution vs old `--file` list).
- Minor path normalization inconsistency in the dedup map (see inline comment on `main.go:240`).
- Timeout bumps in `aicr-build` are reasonable but worth documenting the root cause so they don't drift further.
No blockers — nothing here is high/critical. Solid cleanup overall.
ArangoGutierrez left a comment
Solid cleanup. Key changes reviewed:
- `common/` directory -- moving shared assertions out of `cluster/` into `common/` is the right call. The `--dir` flag replacing per-file `--file` flags eliminates the fragile enumeration that was previously causing silent CI gaps when new assert files were added.
- Evidence collection fix -- changing the condition from `steps.validate-conformance.outcome == success|failure` to `always() && !cancelled() && steps.bundle-install.outcome == success` fixes the regression where evidence was lost when earlier steps failed. Good.
- Dynamo smoke test removal -- intentionally disabled with a clear comment explaining why. Reasonable given the Kind CI flakiness.
- `kind load` timeouts -- 600s→900s and 300s→600s. These are pragmatic.
- `main_test.go` -- tests for the assert-loading logic add coverage for a path that was previously untested.
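The evidence-collection condition change described in the list above could look roughly like the following fragment (step ids taken from the review comment; the `run` body and step name are assumptions):

```yaml
# Before: only ran when validate-conformance itself completed.
# - if: steps.validate-conformance.outcome == 'success' || steps.validate-conformance.outcome == 'failure'

# After: run whenever the bundle was installed, even if later steps failed,
# but never on a cancelled run.
- name: Collect validation artifacts
  if: always() && !cancelled() && steps.bundle-install.outcome == 'success'
  run: ./collect-evidence.sh   # hypothetical script name
```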
One note: the inference GPU test is still pending. The training test passed, which exercises the same code paths. LGTM.
@mchmarny Thanks for the thorough review. Responded inline to each comment. Summary:
Pushed the fixes. Mind taking another look when you get a chance?
Force-pushed 70e59a9 to 016e07a
📝 Walkthrough
Reorganized Chainsaw assertions into shared and suite-specific directories, updated GitHub Actions timeouts and added failure-only debug steps, adjusted workflows to use directory-based conformance inputs and renamed evidence collection, and extended test discovery to parse multi-document `chainsaw-test.yaml` files.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: The diagnostics step currently redirects stderr to
/dev/null for all kubectl invocations (e.g., the commands invoking "kubectl
--context=\"kind-${{ inputs.cluster_name }}\" -n default get job aicr -o yaml",
"get pods -l app.kubernetes.io/name=aicr -o wide", "describe job aicr",
"describe pods -l app.kubernetes.io/name=aicr", "logs ... --all-containers
--previous", and "get configmap aicr-snapshot -o yaml"), which hides useful
error output; remove the trailing "2>/dev/null" from these kubectl lines but
keep the "|| true" so the step won’t fail while preserving stderr for
troubleshooting.
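Applied literally, the suggestion above turns each diagnostic line into the shape below (a sketch of two of the listed commands, not the full step; the `if: failure()` gate is an assumption):

```yaml
- name: Dump snapshot diagnostics
  if: failure()
  shell: bash
  run: |
    # Keep `|| true` so diagnostics never fail the job, but stop hiding
    # stderr: error output is often the most useful part of a red run.
    kubectl --context="kind-${{ inputs.cluster_name }}" -n default get job aicr -o yaml || true
    kubectl --context="kind-${{ inputs.cluster_name }}" -n default describe pods -l app.kubernetes.io/name=aicr || true
```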
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 195-211: The "Collect validation artifacts" step currently enables
"set -o pipefail" then runs the diagnostic pipeline "go run ... | tee ...",
which can cause the pipeline to return a non-zero status and fail the job;
modify the step so the artifact collection cannot fail the workflow by disabling
pipefail for that command or swallowing the pipeline exit status—for example,
remove or override "set -o pipefail" locally for the pipeline command (e.g., run
the command in a subshell after "set +o pipefail") or append "|| true" to the
"go run ... | tee ..." line so the step always exits 0 while still writing the
conformance-evidence artifact.
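One hedged way to implement this, keeping `pipefail` active for the rest of the step while swallowing only the diagnostic pipeline's status (the exact `go run` arguments and paths are assumptions):

```yaml
- name: Collect validation artifacts
  if: always() && !cancelled()
  run: |
    set -o pipefail
    mkdir -p conformance-evidence
    # The subshell plus `|| true` means a failing diagnostic can no longer
    # turn the job red, while tee still writes the evidence file.
    (go run . --dir tests/chainsaw/ai-conformance | tee conformance-evidence/resource-existence-post.txt) || true
```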
In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 188-205: The "Collect validation artifacts" step currently uses
"set -o pipefail" and runs "go run ... | tee
conformance-evidence/resource-existence-post.txt", which can fail the whole job;
mark this diagnostic-only step as non-blocking by adding continue-on-error: true
to the step definition (the step named "Collect validation artifacts") so any
failure in the "go run" / "tee" pipeline won’t cause the workflow to fail.
In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 168-174: Add unit tests that exercise error conditions using
writeTestFile to create temporary files: (1) a test that writes an
invalid/malformed YAML assertion file and asserts parseAssertFiles returns a
non-nil error; (2) a test that writes an invalid/malformed chainsaw-test.yaml
and asserts the test-config parsing function (e.g., parseChainsawTest /
LoadTestConfig) returns an error; and (3) a test that writes a
chainsaw-test.yaml referencing a non-existent assertion file and asserts the
loader/runner returns an error for the missing file. Use Test* functions with
t.TempDir(), call writeTestFile to create the faulty inputs, then call the
existing parsing/loading functions and fail the test if err == nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 91676850-793f-4b26-b5a3-160f04fc2604
📒 Files selected for processing (19)
- .github/actions/aicr-build/action.yml
- .github/actions/gpu-snapshot-validate/action.yml
- .github/workflows/gpu-h100-inference-test.yaml
- .github/workflows/gpu-h100-training-test.yaml
- tests/chainsaw/ai-conformance/README.md
- tests/chainsaw/ai-conformance/cluster/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/common/assert-cert-manager.yaml
- tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
- tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml
- tests/chainsaw/ai-conformance/common/assert-monitoring.yaml
- tests/chainsaw/ai-conformance/common/assert-skyhook.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-crds.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-kgateway.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/kind-training-kubeflow/assert-kubeflow-trainer.yaml
- tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/main.go
- tests/chainsaw/ai-conformance/main_test.go
Force-pushed 016e07a to 1806849
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`:
- Around line 15-32: The test only asserts two Deployments
(dynamo-platform-dynamo-operator-controller-manager and grove-operator) but the
step claims it verifies etcd/NATS; either add explicit assertions for the etcd
and NATS components or tighten the step description. To fix, add similar status
assertion blocks for the etcd and nats resources (for example assert the etcd
StatefulSet/Deployment and the nats Deployment/StatefulSet have
(conditions[?type == 'Available']): - status: "True") using the resource names
used in your cluster, or update the `assert-dynamo` step description in the test
manifest to remove mention of etcd/NATS so it matches the current assertions.
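An etcd/NATS assertion of the kind suggested might look like this Chainsaw fixture fragment. Resource kinds and names such as `dynamo-platform-etcd` and `dynamo-platform-nats` are assumptions and must match what the chart actually deploys:

```yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dynamo-platform-etcd   # assumed name; verify against the cluster
status:
  readyReplicas: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dynamo-platform-nats   # assumed name; verify against the cluster
status:
  readyReplicas: 1
```

If the backing services are Deployments rather than StatefulSets, the `(conditions[?type == 'Available']): - status: "True"` form from the review applies instead.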
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 25e4b8a6-13b3-43cf-b59d-f71466973967
Move shared assertions from cluster/ to common/, keep leaf-specific checks with their suites, align both H100 GPU workflows on the same validation flow, and temporarily remove the Dynamo smoke path. Decode all YAML documents in referencedAssertFiles (was only reading the first). Increase kind-load timeouts in aicr-build action.
Force-pushed 1806849 to d6fd810
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml (1)
15-32: ⚠️ Potential issue | 🟠 Major: Add etcd/NATS checks or narrow the fixture's scope.
This file still only asserts the two Deployments, so the `assert-dynamo` step can pass even when Dynamo's etcd or NATS backing services are unhealthy. Please add explicit assertions for those components or tighten the parent step description to match the actual coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml` around lines 15-32: the current fixture only asserts Deployment availability for dynamo-platform-dynamo-operator-controller-manager and grove-operator; add explicit checks for Dynamo's backing services (etcd and NATS) or narrow the step description to avoid a false sense of coverage. Update assert-dynamo.yaml to include status assertions for the etcd and NATS components (e.g., their StatefulSet/Deployment names or ClusterService/Service objects used in this stack) asserting `(conditions[?type == 'Available']): - status: "True"`, or change the top comment/description to state that only the two operator Deployments are being checked so the step doesn't imply etcd/NATS health is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/aicr-build/action.yml:
- Around line 57-63: Centralize the hardcoded timeout (900) by introducing a
single local variable (e.g., TIMEOUT_SECONDS) at the top of the action/job and
replace all literal 900 occurrences used with the timeout command (the `timeout
900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"`
invocations and any other `timeout 900 ...` uses) so tuning requires one change;
update references in the `kind load docker-image` retry blocks and any similar
timeout usages to reference TIMEOUT_SECONDS instead of the literal.
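Centralizing the timeout as suggested could look like this fragment of the action step (the variable name `TIMEOUT_SECONDS` comes from the comment above; the retry loop is elided):

```yaml
- name: Load images into kind
  shell: bash
  run: |
    # Single knob for all kind-load timeouts; tune here, not per call site.
    TIMEOUT_SECONDS=900
    timeout "${TIMEOUT_SECONDS}" kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
```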
In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: Update the kubectl debug queries to target the actual
snapshot resources: replace references to the Job and ConfigMap named "aicr"
with "aicr-e2e-snapshot" (the job name used in snapshot-job.yaml and the
configmap in assert-configmap.yaml), and change pod selection from the label
selector "app.kubernetes.io/name=aicr" to selecting by the Job-owned pods via
"job-name=aicr-e2e-snapshot" for the logs, get pods and describe pods commands
so the snapshots actually match the deployed resources.
In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 222-228: The helper writeTestFile currently calls t.Fatalf(...)
but doesn't follow it with an explicit return; update the writeTestFile function
so that immediately after the t.Fatalf(...) call you add a return statement to
make intent explicit and satisfy static analysis tools (reference: function
writeTestFile and the t.Fatalf invocation).
---
Duplicate comments:
In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`:
- Around line 15-32: The current fixture only asserts Deployment availability
for dynamo-platform-dynamo-operator-controller-manager and grove-operator; add
explicit checks for Dynamo’s backing services (etcd and NATS) or narrow the step
description to avoid a false sense of coverage. Update assert-dynamo.yaml to
include status assertions for the etcd and NATS components (e.g., their
StatefulSet/Deployment names or ClusterService/Service objects used in this
stack) asserting (conditions[?type == 'Available']) - or change the top
comment/description to state that only the two operator Deployments are being
checked so the step doesn’t imply etcd/NATS health is validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a098fcf8-1f7a-4148-af1c-3414e891373d
Summary
Unify the GPU Chainsaw test layout and H100 Kind validation flow between inference and training. This moves shared assertions into the right ownership layers, keeps leaf-specific checks with their leaf suites, narrows workflow triggers to those owned inputs, and makes the executed validation flow consistent across both GPU workflows.
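The narrowed triggers described above would watch the owning directories rather than enumerated files; a sketch (exact filter list is an assumption):

```yaml
on:
  pull_request:
    paths:
      - "tests/chainsaw/ai-conformance/common/**"
      - "tests/chainsaw/ai-conformance/kind-common/**"
      - "tests/chainsaw/ai-conformance/kind-inference-dynamo/**"
      - "tests/chainsaw/ai-conformance/kind-training-kubeflow/**"
```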
Motivation / Context
This is the follow-up cleanup after #579. That PR made the GPU training and inference workflows much more symmetric, but the Kind GPU suites still depended on `cluster/` assertions in different ways, which kept test ownership, path filters, and workflow flow partially inconsistent.

This update also removes the inference-only Dynamo smoke test from the H100 Kind workflow for now. The current `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0` image adds significant latency and friction in Kind CI, and training does not yet have a symmetric smoke path. If smoke coverage comes back later, it should be reintroduced alongside a comparable training smoke test.

Fixes: N/A
Related: #579
Type of Change

Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
- Moved shared assertions from `tests/chainsaw/ai-conformance/cluster/` to `tests/chainsaw/ai-conformance/common/`
- Suite layout: `tests/chainsaw/ai-conformance/kind-common/`, `tests/chainsaw/ai-conformance/kind-inference-dynamo/`, `tests/chainsaw/ai-conformance/kind-training-kubeflow/`
- Updated the `cluster/` suite to reference `../common/*` instead of local copies
- Path filters now watch `common/**`, `kind-common/**`, and the leaf suite directories rather than explicit shared `cluster/assert-*.yaml` file lists
- Both workflows run the same validation flow: `./aicr validate --phase conformance`, post-run resource snapshot, and artifact upload of `conformance-evidence/resource-existence-post.txt`
- Increased `kind load` timeouts in the `aicr-build` action: smoke-test image 600s→900s, validator images 300s→600s (mitigates intermittent timeout failures on H100 runners)
- Extended `referencedAssertFiles` to decode all YAML documents in multi-document `chainsaw-test.yaml` files, matching the existing `parseYAMLFile` behavior

Testing
`make qualify` passed locally on the rebased branch.

Risk Assessment
Rollout notes: N/A
Checklist
- `make test` passes (with `-race`)
- `make lint` passes
- Commits signed (`git commit -S`) — GPG signing info

Summary by CodeRabbit
Chores
Tests
Documentation