Skip to content

test(envtest): wait for child CurrentImage before pausing faker (TestInPlaceRollout_StuckUntilUnstuck)#359

Merged
bdchatham merged 1 commit into
mainfrom
fix/envtest-inplace-stuck-currentimage-barrier
May 28, 2026
Merged

test(envtest): wait for child CurrentImage before pausing faker (TestInPlaceRollout_StuckUntilUnstuck)#359
bdchatham merged 1 commit into
mainfrom
fix/envtest-inplace-stuck-currentimage-barrier

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Problem

TestInPlaceRollout_StuckUntilUnstuck (internal/controller/nodedeployment/envtest/inplace_stuck_test.go) has been intermittently failing on main CI — observed on #353, #357, #358. Verdict from prior investigation: test orchestration bug, not a controller bug.

Mechanism

  • Pre-pause gate at inplace_stuck_test.go:71-73 used the SND-level signal: TemplateHash != "" && Rollout == nil.
  • For a fresh InPlace SND with no Genesis, no SND-side plan runs — TemplateHash is stamped immediately, so the gate fires fast.
  • The test then calls testFaker.Pause() (line 78) before any child's drift NodeUpdate has had a chance to run observe-image.
  • The init plan ends at mark-ready and does NOT include observe-image. Status.CurrentImage = v1 is stamped only via the steady-state drift correction (internal/controller/seinode/planner.go:707).
  • Once the faker is paused, observe-image stalls forever on ObservedGeneration < Generation (internal/task/observe_image.go:65-66). Status.CurrentImage stays "". The final assertion at line 148 (CurrentImage == v1) fails.

Fix

Replace the SND-level gate with a child-level barrier that waits for Status.CurrentImage == v1 on every child before testFaker.Pause(). Also corrects the misleading comment at lines 143-147 — it claimed v1 came from "the first rollout's ObserveImage", but there is no first rollout for v1; the stamp comes from drift.

Test-side patch only. No controller code changes (observe_image.go, planner.go, status.go untouched). The controller behavior is correct as-is.

Independent of #355

#355 correctly addresses a different race in seinode_statefulset_test.go (test mutations vs controller status flush). This PR does not extend or revert that change — it targets a separate orchestration bug in inplace_stuck_test.go.

Test plan

  • KUBEBUILDER_ASSETS=… go test -tags=envtest -run TestInPlaceRollout_StuckUntilUnstuck -count=10 -timeout 600s ./internal/controller/nodedeployment/envtest/... — 10/10 pass locally (162s wall).
  • KUBEBUILDER_ASSETS=… go test -tags=envtest -count=2 -timeout 600s ./internal/controller/nodedeployment/envtest/... — full envtest suite passes 2x (106s wall).

🤖 Generated with Claude Code

The pre-pause gate used the SND-level signal (TemplateHash set + no
in-flight Rollout), which fires as soon as the init plan ends at
mark-ready. For a fresh InPlace SND with no Genesis, the init plan
does not run observe-image — Status.CurrentImage on each child is
stamped only later via the steady-state drift NodeUpdate. When the
test paused the faker before that drift pass observed the
StatefulSet, observe-image stalled forever on
ObservedGeneration < Generation, CurrentImage stayed "", and the
final assertion (CurrentImage == v1) failed intermittently
(#353, #357, #358 main CI).

Replace the SND-level gate with a child-level barrier that waits for
Status.CurrentImage == v1 on every child before Pause(). Also correct
the misleading comment claiming v1 came from "the first rollout's
ObserveImage" — there is no first rollout for v1; the v1 stamp comes
from drift.

Test-side patch only; controller behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 28, 2026

PR Summary

Low Risk
Test-only orchestration and comment updates; no runtime controller or API behavior changes.

Overview
Fixes a flaky envtest in TestInPlaceRollout_StuckUntilUnstuck by tightening setup before the StatefulSet status faker is paused.

The pre-pause barrier no longer treats SND TemplateHash set and no Rollout as “steady state.” It now polls until every child SeiNode has status.currentImage == v1, so steady-state drift can run observe-image and stamp CurrentImage before testFaker.Pause(). Without that, pausing too early leaves CurrentImage empty and the stuck-rollout assertion (spec v2, status still v1) fails intermittently.

Comments in the test are updated to document that the v1 stamp comes from drift NodeUpdate, not from an initial rollout’s observe-image. No production controller changes.

Reviewed by Cursor Bugbot for commit 791f2c7. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit 12ccf3a into main May 28, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant