refactor(ci): unify GPU Chainsaw layout and validation flow #587
Force-pushed 387663b to 5519aba
Force-pushed 90bca0f to 9bf488a
Force-pushed 9bf488a to 93133f8
Force-pushed 93133f8 to b297b8f
Force-pushed b297b8f to 0b38c08
Force-pushed 8743b5e to 70e59a9
mchmarny left a comment
Clean refactor — the ownership model (common / kind-common / leaf) is well-defined and the workflow simplification is significant. The Go changes to auto-discover assert files from chainsaw-test.yaml are solid and well-tested (multi-doc support, fallback to dir scanning, dedup).
A few things to check before merging:
- H100 inference CI is red — training passed, inference failed. Needs a look to confirm it's not caused by this change (new `--dir` resolution vs old `--file` list).
- Minor path normalization inconsistency in the dedup map (see inline comment on `main.go:240`).
- Timeout bumps in `aicr-build` are reasonable but worth documenting the root cause so they don't drift further.
No blockers — nothing here is high/critical. Solid cleanup overall.
ArangoGutierrez left a comment
Solid cleanup. Key changes reviewed:
- `common/` directory -- moving shared assertions out of `cluster/` into `common/` is the right call. The `--dir` flag replacing per-file `--file` flags eliminates the fragile enumeration that was previously causing silent CI gaps when new assert files were added.
- Evidence collection fix -- changing the condition from `steps.validate-conformance.outcome == success|failure` to `always() && !cancelled() && steps.bundle-install.outcome == success` fixes the regression where evidence was lost when earlier steps failed. Good.
- Dynamo smoke test removal -- intentionally disabled with a clear comment explaining why. Reasonable given the Kind CI flakiness.
- `kind load` timeouts -- 600s→900s and 300s→600s. These are pragmatic.
- `main_test.go` -- tests for the assert-loading logic add coverage for a path that was previously untested.
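The evidence-collection condition change described in the list above could look roughly like the following fragment (step ids taken from the review comment; the `run` body and step name are assumptions):

```yaml
# Before: only ran when validate-conformance itself completed.
# - if: steps.validate-conformance.outcome == 'success' || steps.validate-conformance.outcome == 'failure'

# After: run whenever the bundle was installed, even if later steps failed,
# but never on a cancelled run.
- name: Collect validation artifacts
  if: always() && !cancelled() && steps.bundle-install.outcome == 'success'
  run: ./collect-evidence.sh   # hypothetical script name
```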
One note: the inference GPU test is still pending. The training test passed, which exercises the same code paths. LGTM.
@mchmarny Thanks for the thorough review. Responded inline to each comment. Summary:
Pushed the fixes. Mind taking another look when you get a chance?
Force-pushed 70e59a9 to 016e07a
📝 Walkthrough
Reorganized Chainsaw assertions into shared and suite-specific directories, updated GitHub Actions timeouts and added failure-only debug steps, adjusted workflows to use directory-based conformance inputs and renamed evidence collection, and extended test discovery to parse multi-document `chainsaw-test.yaml` files.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: The diagnostics step currently redirects stderr to
/dev/null for all kubectl invocations (e.g., the commands invoking "kubectl
--context=\"kind-${{ inputs.cluster_name }}\" -n default get job aicr -o yaml",
"get pods -l app.kubernetes.io/name=aicr -o wide", "describe job aicr",
"describe pods -l app.kubernetes.io/name=aicr", "logs ... --all-containers
--previous", and "get configmap aicr-snapshot -o yaml"), which hides useful
error output; remove the trailing "2>/dev/null" from these kubectl lines but
keep the "|| true" so the step won’t fail while preserving stderr for
troubleshooting.
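Applied literally, the suggestion above turns each diagnostic line into the shape below (a sketch of two of the listed commands, not the full step; the `if: failure()` gate is an assumption):

```yaml
- name: Dump snapshot diagnostics
  if: failure()
  shell: bash
  run: |
    # Keep `|| true` so diagnostics never fail the job, but stop hiding
    # stderr: error output is often the most useful part of a red run.
    kubectl --context="kind-${{ inputs.cluster_name }}" -n default get job aicr -o yaml || true
    kubectl --context="kind-${{ inputs.cluster_name }}" -n default describe pods -l app.kubernetes.io/name=aicr || true
```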
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 195-211: The "Collect validation artifacts" step currently enables
"set -o pipefail" then runs the diagnostic pipeline "go run ... | tee ...",
which can cause the pipeline to return a non-zero status and fail the job;
modify the step so the artifact collection cannot fail the workflow by disabling
pipefail for that command or swallowing the pipeline exit status—for example,
remove or override "set -o pipefail" locally for the pipeline command (e.g., run
the command in a subshell after "set +o pipefail") or append "|| true" to the
"go run ... | tee ..." line so the step always exits 0 while still writing the
conformance-evidence artifact.
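One hedged way to implement this, keeping `pipefail` active for the rest of the step while swallowing only the diagnostic pipeline's status (the exact `go run` arguments and paths are assumptions):

```yaml
- name: Collect validation artifacts
  if: always() && !cancelled()
  run: |
    set -o pipefail
    mkdir -p conformance-evidence
    # The subshell plus `|| true` means a failing diagnostic can no longer
    # turn the job red, while tee still writes the evidence file.
    (go run . --dir tests/chainsaw/ai-conformance | tee conformance-evidence/resource-existence-post.txt) || true
```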
In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 188-205: The "Collect validation artifacts" step currently uses
"set -o pipefail" and runs "go run ... | tee
conformance-evidence/resource-existence-post.txt", which can fail the whole job;
mark this diagnostic-only step as non-blocking by adding continue-on-error: true
to the step definition (the step named "Collect validation artifacts") so any
failure in the "go run" / "tee" pipeline won’t cause the workflow to fail.
In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 168-174: Add unit tests that exercise error conditions using
writeTestFile to create temporary files: (1) a test that writes an
invalid/malformed YAML assertion file and asserts parseAssertFiles returns a
non-nil error; (2) a test that writes an invalid/malformed chainsaw-test.yaml
and asserts the test-config parsing function (e.g., parseChainsawTest /
LoadTestConfig) returns an error; and (3) a test that writes a
chainsaw-test.yaml referencing a non-existent assertion file and asserts the
loader/runner returns an error for the missing file. Use Test* functions with
t.TempDir(), call writeTestFile to create the faulty inputs, then call the
existing parsing/loading functions and fail the test if err == nil.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 91676850-793f-4b26-b5a3-160f04fc2604
📒 Files selected for processing (19)
- .github/actions/aicr-build/action.yml
- .github/actions/gpu-snapshot-validate/action.yml
- .github/workflows/gpu-h100-inference-test.yaml
- .github/workflows/gpu-h100-training-test.yaml
- tests/chainsaw/ai-conformance/README.md
- tests/chainsaw/ai-conformance/cluster/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/common/assert-cert-manager.yaml
- tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
- tests/chainsaw/ai-conformance/common/assert-kai-scheduler.yaml
- tests/chainsaw/ai-conformance/common/assert-monitoring.yaml
- tests/chainsaw/ai-conformance/common/assert-skyhook.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-crds.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-kgateway.yaml
- tests/chainsaw/ai-conformance/kind-inference-dynamo/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/kind-training-kubeflow/assert-kubeflow-trainer.yaml
- tests/chainsaw/ai-conformance/kind-training-kubeflow/chainsaw-test.yaml
- tests/chainsaw/ai-conformance/main.go
- tests/chainsaw/ai-conformance/main_test.go
Force-pushed 016e07a to 1806849
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`:
- Around line 15-32: The test only asserts two Deployments
(dynamo-platform-dynamo-operator-controller-manager and grove-operator) but the
step claims it verifies etcd/NATS; either add explicit assertions for the etcd
and NATS components or tighten the step description. To fix, add similar status
assertion blocks for the etcd and nats resources (for example assert the etcd
StatefulSet/Deployment and the nats Deployment/StatefulSet have
(conditions[?type == 'Available']): - status: "True") using the resource names
used in your cluster, or update the `assert-dynamo` step description in the test
manifest to remove mention of etcd/NATS so it matches the current assertions.
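An etcd/NATS assertion of the kind suggested might look like this Chainsaw fixture fragment. Resource kinds and names such as `dynamo-platform-etcd` and `dynamo-platform-nats` are assumptions and must match what the chart actually deploys:

```yaml
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dynamo-platform-etcd   # assumed name; verify against the cluster
status:
  readyReplicas: 1
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dynamo-platform-nats   # assumed name; verify against the cluster
status:
  readyReplicas: 1
```

If the backing services are Deployments rather than StatefulSets, the `(conditions[?type == 'Available']): - status: "True"` form from the review applies instead.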
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 25e4b8a6-13b3-43cf-b59d-f71466973967
Move shared assertions from cluster/ to common/, keep leaf-specific checks with their suites, align both H100 GPU workflows on the same validation flow, and temporarily remove the Dynamo smoke path. Decode all YAML documents in referencedAssertFiles (was only reading the first). Increase kind-load timeouts in aicr-build action.
Force-pushed 1806849 to d6fd810
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml (1)
15-32: ⚠️ Potential issue | 🟠 Major: Add etcd/NATS checks or narrow the fixture's scope.
This file still only asserts the two Deployments, so the `assert-dynamo` step can pass even when Dynamo's etcd or NATS backing services are unhealthy. Please add explicit assertions for those components or tighten the parent step description to match the actual coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml` around lines 15-32: the current fixture only asserts Deployment availability for dynamo-platform-dynamo-operator-controller-manager and grove-operator; add explicit checks for Dynamo's backing services (etcd and NATS) or narrow the step description to avoid a false sense of coverage. Update assert-dynamo.yaml to include status assertions for the etcd and NATS components (e.g., their StatefulSet/Deployment names or ClusterService/Service objects used in this stack) asserting `(conditions[?type == 'Available']): - status: "True"`, or change the top comment/description to state that only the two operator Deployments are being checked so the step doesn't imply etcd/NATS health is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/aicr-build/action.yml:
- Around line 57-63: Centralize the hardcoded timeout (900) by introducing a
single local variable (e.g., TIMEOUT_SECONDS) at the top of the action/job and
replace all literal 900 occurrences used with the timeout command (the `timeout
900 kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"`
invocations and any other `timeout 900 ...` uses) so tuning requires one change;
update references in the `kind load docker-image` retry blocks and any similar
timeout usages to reference TIMEOUT_SECONDS instead of the literal.
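Centralizing the timeout as suggested could look like this fragment of the action step (the variable name `TIMEOUT_SECONDS` comes from the comment above; the retry loop is elided):

```yaml
- name: Load images into kind
  shell: bash
  run: |
    # Single knob for all kind-load timeouts; tune here, not per call site.
    TIMEOUT_SECONDS=900
    timeout "${TIMEOUT_SECONDS}" kind load docker-image ko.local:smoke-test --name "${KIND_CLUSTER_NAME}"
```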
In @.github/actions/gpu-snapshot-validate/action.yml:
- Around line 68-85: Update the kubectl debug queries to target the actual
snapshot resources: replace references to the Job and ConfigMap named "aicr"
with "aicr-e2e-snapshot" (the job name used in snapshot-job.yaml and the
configmap in assert-configmap.yaml), and change pod selection from the label
selector "app.kubernetes.io/name=aicr" to selecting by the Job-owned pods via
"job-name=aicr-e2e-snapshot" for the logs, get pods and describe pods commands
so the snapshots actually match the deployed resources.
In `@tests/chainsaw/ai-conformance/main_test.go`:
- Around line 222-228: The helper writeTestFile currently calls t.Fatalf(...)
but doesn't follow it with an explicit return; update the writeTestFile function
so that immediately after the t.Fatalf(...) call you add a return statement to
make intent explicit and satisfy static analysis tools (reference: function
writeTestFile and the t.Fatalf invocation).
---
Duplicate comments:
In `@tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml`:
- Around line 15-32: The current fixture only asserts Deployment availability
for dynamo-platform-dynamo-operator-controller-manager and grove-operator; add
explicit checks for Dynamo’s backing services (etcd and NATS) or narrow the step
description to avoid a false sense of coverage. Update assert-dynamo.yaml to
include status assertions for the etcd and NATS components (e.g., their
StatefulSet/Deployment names or ClusterService/Service objects used in this
stack) asserting (conditions[?type == 'Available']) - or change the top
comment/description to state that only the two operator Deployments are being
checked so the step doesn’t imply etcd/NATS health is validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: a098fcf8-1f7a-4148-af1c-3414e891373d
Summary
Unify the GPU Chainsaw test layout and H100 Kind validation flow between inference and training. This moves shared assertions into the right ownership layers, keeps leaf-specific checks with their leaf suites, narrows workflow triggers to those owned inputs, and makes the executed validation flow consistent across both GPU workflows.
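The narrowed triggers described above would watch the owning directories rather than enumerated files; a sketch (exact filter list is an assumption):

```yaml
on:
  pull_request:
    paths:
      - "tests/chainsaw/ai-conformance/common/**"
      - "tests/chainsaw/ai-conformance/kind-common/**"
      - "tests/chainsaw/ai-conformance/kind-inference-dynamo/**"
      - "tests/chainsaw/ai-conformance/kind-training-kubeflow/**"
```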
Motivation / Context
This is the follow-up cleanup after #579. That PR made the GPU training and inference workflows much more symmetric, but the Kind GPU suites still depended on `cluster/` assertions in different ways, which kept test ownership, path filters, and workflow flow partially inconsistent.

This update also removes the inference-only Dynamo smoke test from the H100 Kind workflow for now. The current `nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0` image adds significant latency and friction in Kind CI, and training does not yet have a symmetric smoke path. If smoke coverage comes back later, it should be reintroduced alongside a comparable training smoke test.

Fixes: N/A
Related: #579
Type of Change

Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`

Implementation Notes
- Moved shared assertions from `tests/chainsaw/ai-conformance/cluster/` to `tests/chainsaw/ai-conformance/common/`
- Suite layout: `tests/chainsaw/ai-conformance/kind-common/`, `tests/chainsaw/ai-conformance/kind-inference-dynamo/`, `tests/chainsaw/ai-conformance/kind-training-kubeflow/`
- Updated the `cluster/` suite to reference `../common/*` instead of local copies
- Path filters now watch `common/**`, `kind-common/**`, and the leaf suite directories rather than explicit shared `cluster/assert-*.yaml` file lists
- Both workflows run the same validation flow: `./aicr validate --phase conformance`, post-run resource snapshot, and artifact upload of `conformance-evidence/resource-existence-post.txt`
- Increased `kind load` timeouts in the `aicr-build` action: smoke-test image 600s→900s, validator images 300s→600s (mitigates intermittent timeout failures on H100 runners)
- Extended `referencedAssertFiles` to decode all YAML documents in multi-document `chainsaw-test.yaml` files, matching the existing `parseYAMLFile` behavior

Testing
`make qualify` passed locally on the rebased branch.

Risk Assessment
Rollout notes: N/A
Checklist
- `make test` passes (with `-race`)
- `make lint` passes
- Commits signed (`git commit -S`) — GPG signing info

Summary by CodeRabbit
Chores
Tests
Documentation