Skip to content

fix(cli): --no-cluster must not deploy the snapshot-capture agent#604

Merged
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
yuanchen8911:fix/validate-no-cluster-snapshot-check
Apr 17, 2026
Merged

fix(cli): --no-cluster must not deploy the snapshot-capture agent#604
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
yuanchen8911:fix/validate-no-cluster-snapshot-check

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

@yuanchen8911 yuanchen8911 commented Apr 17, 2026

Summary

--no-cluster is documented as "dry-run, skip cluster operations", but in pkg/cli/validate.go it currently only gates phase execution — the earlier if snapshotFilePath == "" branch still calls deployAgentForValidation(), which builds a K8s client, creates a Job, and captures a real snapshot from the live API. This PR gates that branch on --no-cluster and returns a clear ErrCodeInvalidRequest when the flag is set without --snapshot.

Motivation / Context

This surfaced while investigating why make qualify passed yesterday but failed today on the same tree.

We confirmed the failure is environment-dependent:

  • Yesterday there was no active / usable EKS context, so the --no-cluster tests happened to fail early while building the Kubernetes client. The three TestValidateCmd_RecipeKindHandling subtests that assert errContain: "kubernetes client" therefore passed by accident.
  • Today there is an active kube context, so the exact same tests can successfully reach the cluster. Instead of erroring, they deploy the snapshot-capture agent, create the aicr-validation namespace and aicr-validate Job, wait for completion, and then return nil. That makes the tests fail with expected error but got nil.

So there are two bugs with one root cause:

  1. Design bug: --no-cluster does not match its contract.

  2. Test env-coupling: the three "kind check passes" subtests in pkg/cli/validate_test.go assert errContain: "kubernetes client". That only happens when cluster access fails. With a valid kube context, those tests mutate the live cluster instead of failing deterministically.

Fixes: N/A (no issue filed)
Related: #595 (added the failing subtests)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)

Implementation Notes

This is intentionally a small fix: add one explicit flag-combination guard in pkg/cli/validate.go, then update the three affected tests to assert the new deterministic error.

// --no-cluster means "do not touch the cluster". The agent-deploy
// branch below contradicts that (it creates a Job and captures a
// snapshot from the live API), so a snapshot file is the only valid
// data source in that mode. Placed after recipe.LoadFromFile so
// recipe kind-check and auto-hydration still run for CLI coverage.
if snapshotFilePath == "" && cmd.Bool("no-cluster") {
    return errors.New(errors.ErrCodeInvalidRequest,
        "--no-cluster requires --snapshot (cannot deploy the snapshot-capture agent without cluster access)")
}

Placement matters. The guard is inserted after recipe.LoadFromFile (not immediately after the --recipe is required check), so the RecipeKindHandling subtests still exercise recipe loading, kind validation, and RecipeMetadata auto-hydration through the CLI. Only the env-dependent downstream failure mode changes.

The three failing subtests' errContain is updated from "kubernetes client" (depends on whether cluster access happens to fail) to "--no-cluster requires --snapshot" (deterministic). The other subtests already fail earlier and are unaffected.

Known follow-up, deliberately out of scope: cm:// URIs for --recipe and --snapshot still imply cluster reads (via recipe.LoadFromFile / serializer.FromFileWithKubeconfig) under --no-cluster. This PR fixes the dangerous mutation path (deployAgentForValidation) but not every possible cluster read.

Testing

GOFLAGS=-mod=vendor go test -race -count=1 -run TestValidateCmd_RecipeKindHandling -v ./pkg/cli/...

Observed before this fix:

  • With a live kube context available: the test fails after actually deploying the validation Job, because the old assertion expected a kube-client error.
  • Without a usable kube context: the same test passes, because GetKubeClient() fails and the old "kubernetes client" assertion matches by accident.

That explains why make qualify passed yesterday and failed today without any relevant code change in this PR.

After this fix:

GOFLAGS=-mod=vendor go test -race -count=1 -run TestValidateCmd_RecipeKindHandling -v ./pkg/cli/...

passes deterministically, without creating cluster resources, because --no-cluster now rejects the invalid no-snapshot path before any agent deployment happens.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Tightens an input-validation check on a flag combo that did not make semantic sense. Anyone currently passing --no-cluster without --snapshot is silently deploying agents against their live cluster; they will now get a clear error telling them to provide a snapshot or drop --no-cluster.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

`--no-cluster` is documented as "dry-run, skip cluster operations", but
validate.go only propagates it into `validator.WithNoCluster(...)` which
affects *phase execution*. The earlier no-snapshot branch at
`if snapshotFilePath != ""` still calls `deployAgentForValidation()` —
which creates a K8s client, deploys a Job, and captures a snapshot from
the live API. That directly contradicts the flag's contract.

Two concrete consequences:

1. `go test ./pkg/cli/...` on any dev machine with a working ~/.kube/config
   mutates the live cluster as a side effect. Observed on aicr-cuj2:
   `TestValidateCmd_RecipeKindHandling` deployed the `aicr-validation`
   namespace and the `aicr-validate` Job before the test returned.

2. Three subtests in that test asserted `errContain: "kubernetes client"`
   — they only passed in CI because CI has no kubeconfig, so
   `GetKubeClient()` failed and the "failed to create kubernetes client"
   wrap propagated out. On dev machines the agent deploy succeeded,
   phases reported skipped, validate returned nil → `expected error but
   got nil`.

Guard the agent-deploy path on `--no-cluster` and return a clear
`ErrCodeInvalidRequest` when the user asks for no-cluster without a
snapshot (nothing to validate against). Placed after `recipe.LoadFromFile`
so the subtests continue to exercise the recipe loader, kind check, and
RecipeMetadata auto-hydration before hitting the new error.

Update the three subtests' `errContain` to the new message. The existing
`errAbsent` assertions still prove those same kind-check / auto-hydration
paths ran successfully.

Known follow-up (out of scope for this PR): `cm://` URIs for `--recipe`
and `--snapshot` still imply cluster reads under `--no-cluster`, so a
literal "no cluster access anywhere" mode needs further work (likely a
separate check in `recipe.LoadFromFile` / `serializer.FromFileWith…`
paths).
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 17, 2026

📝 Walkthrough

Walkthrough

The validate command now includes an early validation check that rejects the --no-cluster flag when the --snapshot option is not provided, returning a new error message before snapshot loading is attempted.

Changes

Cohort / File(s) Summary
Validation Logic
pkg/cli/validate.go
Added validation check that enforces --snapshot requirement when --no-cluster is specified; returns ErrCodeInvalidRequest if condition is not met.
Test Updates
pkg/cli/validate_test.go
Updated test assertions to expect --no-cluster requires --snapshot error message instead of kubernetes client for three validation failure scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 A flag and its friend must dance together true,
No cluster without snapshot? The validation's due!
Early checks prevent the chaos that ensues,
With tests now aligned, there's nothing to lose! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main fix: preventing the snapshot-capture agent deployment when --no-cluster is set without --snapshot, which is the core issue addressed by adding validation logic in pkg/cli/validate.go.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 merged commit 8015fa8 into NVIDIA:main Apr 17, 2026
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants