test(envtest): wait for child CurrentImage before pausing faker (TestInPlaceRollout_StuckUntilUnstuck)#359
Conversation
The pre-pause gate used the SND-level signal (TemplateHash set + no in-flight Rollout), which fires as soon as the init plan ends at mark-ready. For a fresh InPlace SND with no Genesis, the init plan does not run observe-image — Status.CurrentImage on each child is stamped only later via the steady-state drift NodeUpdate. When the test paused the faker before that drift pass observed the StatefulSet, observe-image stalled forever on ObservedGeneration < Generation, CurrentImage stayed "", and the final assertion (CurrentImage == v1) failed intermittently (#353, #357, #358 main CI). Replace the SND-level gate with a child-level barrier that waits for Status.CurrentImage == v1 on every child before Pause(). Also correct the misleading comment claiming v1 came from "the first rollout's ObserveImage" — there is no first rollout for v1; the v1 stamp comes from drift. Test-side patch only; controller behavior is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR SummaryLow Risk Overview The pre-pause barrier no longer treats SND Comments in the test are updated to document that the v1 stamp comes from drift Reviewed by Cursor Bugbot for commit 791f2c7. Bugbot is set up for automated code reviews on this repo. Configure here. |
Problem
TestInPlaceRollout_StuckUntilUnstuck(internal/controller/nodedeployment/envtest/inplace_stuck_test.go) has been intermittently failing on main CI — observed on #353, #357, #358. Verdict from prior investigation: test orchestration bug, not a controller bug.Mechanism
inplace_stuck_test.go:71-73used the SND-level signal:TemplateHash != "" && Rollout == nil.TemplateHashis stamped immediately, so the gate fires fast.testFaker.Pause()(line 78) before any child's driftNodeUpdatehas had a chance to runobserve-image.mark-readyand does NOT includeobserve-image.Status.CurrentImage = v1is stamped only via the steady-state drift correction (internal/controller/seinode/planner.go:707).observe-imagestalls forever onObservedGeneration < Generation(internal/task/observe_image.go:65-66).Status.CurrentImagestays"". The final assertion at line 148 (CurrentImage == v1) fails.Fix
Replace the SND-level gate with a child-level barrier that waits for
Status.CurrentImage == v1on every child beforetestFaker.Pause(). Also corrects the misleading comment at lines 143-147 — it claimed v1 came from "the first rollout's ObserveImage", but there is no first rollout for v1; the stamp comes from drift.Test-side patch only. No controller code changes (
observe_image.go,planner.go,status.gountouched). The controller behavior is correct as-is.Independent of #355
#355 correctly addresses a different race in
seinode_statefulset_test.go(test mutations vs controller status flush). This PR does not extend or revert that change — it targets a separate orchestration bug ininplace_stuck_test.go.Test plan
KUBEBUILDER_ASSETS=… go test -tags=envtest -run TestInPlaceRollout_StuckUntilUnstuck -count=10 -timeout 600s ./internal/controller/nodedeployment/envtest/...— 10/10 pass locally (162s wall).KUBEBUILDER_ASSETS=… go test -tags=envtest -count=2 -timeout 600s ./internal/controller/nodedeployment/envtest/...— full envtest suite passes 2x (106s wall).🤖 Generated with Claude Code