fix(cli): --no-cluster must not deploy the snapshot-capture agent#604
Merged
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom Apr 17, 2026
Merged
Conversation
`--no-cluster` is documented as "dry-run, skip cluster operations", but validate.go only propagates it into `validator.WithNoCluster(...)` which affects *phase execution*. The earlier no-snapshot branch at `if snapshotFilePath != ""` still calls `deployAgentForValidation()` — which creates a K8s client, deploys a Job, and captures a snapshot from the live API. That directly contradicts the flag's contract. Two concrete consequences: 1. `go test ./pkg/cli/...` on any dev machine with a working ~/.kube/config mutates the live cluster as a side effect. Observed on aicr-cuj2: `TestValidateCmd_RecipeKindHandling` deployed the `aicr-validation` namespace and the `aicr-validate` Job before the test returned. 2. Three subtests in that test asserted `errContain: "kubernetes client"` — they only passed in CI because CI has no kubeconfig, so `GetKubeClient()` failed and the "failed to create kubernetes client" wrap propagated out. On dev machines the agent deploy succeeded, phases reported skipped, validate returned nil → `expected error but got nil`. Guard the agent-deploy path on `--no-cluster` and return a clear `ErrCodeInvalidRequest` when the user asks for no-cluster without a snapshot (nothing to validate against). Placed after `recipe.LoadFromFile` so the subtests continue to exercise the recipe loader, kind check, and RecipeMetadata auto-hydration before hitting the new error. Update the three subtests' `errContain` to the new message. The existing `errAbsent` assertions still prove those same kind-check / auto-hydration paths ran successfully. Known follow-up (out of scope for this PR): `cm://` URIs for `--recipe` and `--snapshot` still imply cluster reads under `--no-cluster`, so a literal "no cluster access anywhere" mode needs further work (likely a separate check in `recipe.LoadFromFile` / `serializer.FromFileWith…` paths).
📝 WalkthroughWalkthroughThe validate command now includes an early validation check that rejects the Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
dims
approved these changes
Apr 17, 2026
njhensley
approved these changes
Apr 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--no-clusteris documented as "dry-run, skip cluster operations", but inpkg/cli/validate.goit currently only gates phase execution — the earlierif snapshotFilePath == ""branch still callsdeployAgentForValidation(), which builds a K8s client, creates a Job, and captures a real snapshot from the live API. This PR gates that branch on--no-clusterand returns a clearErrCodeInvalidRequestwhen the flag is set without--snapshot.Motivation / Context
This surfaced while investigating why
make qualifypassed yesterday but failed today on the same tree.We confirmed the failure is environment-dependent:
--no-clustertests happened to fail early while building the Kubernetes client. The threeTestValidateCmd_RecipeKindHandlingsubtests that asserterrContain: "kubernetes client"therefore passed by accident.aicr-validationnamespace andaicr-validateJob, wait for completion, and then returnnil. That makes the tests fail withexpected error but got nil.So there are two bugs with one root cause:
Design bug:
--no-clusterdoes not match its contract.pkg/cli/validate.go#L519-L538callsdeployAgentForValidationin the no-snapshot branch regardless of--no-cluster.pkg/validator/validator.go#L172-L175honorsNoClusteronly in phase execution — far too late.Test env-coupling: the three "kind check passes" subtests in
pkg/cli/validate_test.goasserterrContain: "kubernetes client". That only happens when cluster access fails. With a valid kube context, those tests mutate the live cluster instead of failing deterministically.Fixes: N/A (no issue filed)
Related: #595 (added the failing subtests)
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)Implementation Notes
This is intentionally a small fix: add one explicit flag-combination guard in
pkg/cli/validate.go, then update the three affected tests to assert the new deterministic error.Placement matters. The guard is inserted after
recipe.LoadFromFile(not immediately after the--recipe is requiredcheck), so theRecipeKindHandlingsubtests still exercise recipe loading, kind validation, andRecipeMetadataauto-hydration through the CLI. Only the env-dependent downstream failure mode changes.The three failing subtests'
errContainis updated from"kubernetes client"(depends on whether cluster access happens to fail) to"--no-cluster requires --snapshot"(deterministic). The other subtests already fail earlier and are unaffected.Known follow-up, deliberately out of scope:
cm://URIs for--recipeand--snapshotstill imply cluster reads (viarecipe.LoadFromFile/serializer.FromFileWithKubeconfig) under--no-cluster. This PR fixes the dangerous mutation path (deployAgentForValidation) but not every possible cluster read.Testing
GOFLAGS=-mod=vendor go test -race -count=1 -run TestValidateCmd_RecipeKindHandling -v ./pkg/cli/...Observed before this fix:
GetKubeClient()fails and the old"kubernetes client"assertion matches by accident.That explains why
make qualifypassed yesterday and failed today without any relevant code change in this PR.After this fix:
GOFLAGS=-mod=vendor go test -race -count=1 -run TestValidateCmd_RecipeKindHandling -v ./pkg/cli/...passes deterministically, without creating cluster resources, because
--no-clusternow rejects the invalid no-snapshot path before any agent deployment happens.Risk Assessment
Rollout notes: Tightens an input-validation check on a flag combo that did not make semantic sense. Anyone currently passing
--no-clusterwithout--snapshotis silently deploying agents against their live cluster; they will now get a clear error telling them to provide a snapshot or drop--no-cluster.Checklist
make testwith-race)make lint)git commit -S) — GPG signing info