Skip to content

test(envtest): fix race between test mutations and controller status flush#355

Merged
bdchatham merged 2 commits into
mainfrom
fix/envtest-flake-statefulset-race
May 26, 2026
Merged

test(envtest): fix race between test mutations and controller status flush#355
bdchatham merged 2 commits into
mainfrom
fix/envtest-flake-statefulset-race

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Summary

Fixes an intermittent flake in the envtest integration suite for TestSeiNode_StatefulSetRecreatedAfterDelete and TestSeiNode_StatefulSetImpostorReplaced (the `sts-recreated` and `sts-impostor` scenarios).

Diagnosis

Both tests issue two near-simultaneous mutations on the same SeiNode:

  1. A destabilizing mutation — Delete(sts) for sts-recreated, or Status().Update forging a UID mismatch for sts-impostor.
  2. A spec patch — bumping spec.overrides["chain.touch-counter"] to fire the SeiNode predicate deterministically.

The SeiNode reconciler uses optimistic locking for its status flush (internal/controller/node/controller.go:92):

```go
statusBase := client.MergeFromWithOptions(before, client.MergeFromWithOptimisticLock{})
```

When the spec patch in step (2) lands between the in-flight reconcile's Get and its Status.Patch, the patch fails with 409:

```
"flushing status: Operation cannot be fulfilled on seinodes.sei.io "sts-recreated":
the object has been modified; please apply your changes to the latest version and try again"
```

Controller-runtime re-queues on 409, but the exponential backoff occasionally stretches convergence past the 30s pollTimeout (helpers_test.go:26) — surfacing as a flake.

History

  • 2026-05-26 — main run 26460706817 failed with this exact pattern after chore: bump sei-config to v0.0.16 #353 squash-merged into main. The PR itself (chore: bump sei-config to v0.0.16 #353) had passed test-integration in 3m51s; only the post-merge run on main caught the race.
  • 2026-05-21 — main on commit 2cc9e8c0 also failed similarly.
  • Otherwise green — low-single-digit % flake rate, consistent with timing-sensitive interleaves rather than a behavior regression.

Fix

Test-only. Insert a waitFor between the two mutations in each test:

sts-recreated

After Delete(sts), wait for the Owns watch to fire the deletion-observation reconcile — condition succeeds when the StatefulSet is briefly NotFound or has been recreated with a new UID. Both signal the Owns watch has already done its work, so the subsequent spec patch triggers a fresh reconcile rather than racing one in flight.

sts-impostor

After Status().Update forging the UID, wait for the forged value to be durable on the apiserver before issuing the spec patch. The re-Get round-trip also serves as a drain barrier against any reconcile the Status.Update may have woken (predicate configurations that don't filter status updates).

What this PR does not change

  • No controller behavior change (the MergeFromWithOptimisticLock semantics stay — they're correct as defensive checks against stale-view writes).
  • No timeout change (pollTimeout/pollInterval unchanged).
  • Pure test-side ordering fix.

Validation

Local envtest unavailable on this machine (no etcd in /usr/local/kubebuilder/bin/). Compilation + vet clean. Relying on CI to validate; recommend at least one re-run after merge to confirm the flake is gone.

Refs

…flush

The two StatefulSet recovery tests (TestSeiNode_StatefulSetRecreatedAfterDelete
and TestSeiNode_StatefulSetImpostorReplaced) issued two near-simultaneous
mutations on the same SeiNode:
  1. A destabilizing mutation — Delete the StatefulSet (sts-recreated) or
     Status.Update forging a UID mismatch (sts-impostor).
  2. A spec patch — bumping spec.overrides["chain.touch-counter"] to fire
     the SeiNode predicate deterministically.

The SeiNode reconciler uses MergeFromWithOptimisticLock for its
status flush (internal/controller/node/controller.go:92). When the spec
patch in (2) lands between the in-flight reconcile's Get and its
Status.Patch, the patch fails with 409 "the object has been modified".
Controller-runtime re-queues on 409, but the exponential backoff
occasionally stretches convergence past the 30s pollTimeout — surfacing
as a flake. We hit this on main on 2026-05-26 (run 26460706817) and
previously on 2026-05-21 (sha 2cc9e8c); other main runs pass.

Both tests are functionally correct; the race is in the test's lack of a
synchronization barrier between the two mutations. Insert a waitFor
between (1) and (2):

  sts-recreated:
    After Delete(sts), wait for the controller's Owns watch to fire its
    deletion-observation reconcile — either the StatefulSet is briefly
    NotFound or has been recreated with a new UID. Both signal the
    Owns watch has already done its work, so the subsequent spec patch
    triggers a fresh reconcile rather than racing one in flight.

  sts-impostor:
    After Status().Update forging the UID, wait for the forged value
    to be durable on the apiserver before issuing the spec patch. The
    re-Get round-trip also serves as a drain barrier against any
    reconcile the Status.Update may have woken (some predicate
    configurations fire on status changes).

No controller behavior change. No timeout change. Test-only fix that
eliminates the race window by sequencing the mutations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 26, 2026

PR Summary

Low Risk
Only envtest ordering in two integration tests; no runtime controller or API behavior changes.

Overview
Adds test-only synchronization in two SeiNode envtest cases so a follow-up spec.overrides patch does not race an in-flight reconcile’s optimistic-lock status write (409 flakes).

In TestSeiNode_StatefulSetRecreatedAfterDelete, after deleting the owned StatefulSet, the test now waitFors until deletion is observed (NotFound or recreated with a new UID) before bumping the spec.

In TestSeiNode_StatefulSetImpostorReplaced, after forging a bad UID via Status().Update, it waitFors until that status is visible on a fresh Get before the spec bump.

No production controller or timeout changes.

Reviewed by Cursor Bugbot for commit 44e8710. Bugbot is set up for automated code reviews on this repo. Configure here.

10-line and 11-line comments were oversharing context that belongs in
the PR description. Cut to 3-4 lines anchoring on the invariant the
reader needs (barrier exists to avoid a status-flush race) without
restating the full mechanic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit a377670 into main May 26, 2026
5 checks passed
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 44e8710. Configure here.

s := &appsv1.StatefulSet{}
err := testCli.Get(testCtx, stsKey, s)
return apierrors.IsNotFound(err) || (err == nil && s.UID != originalUID)
}, "controller must observe the StatefulSet deletion before next mutation")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Barrier passes immediately on NotFound, race unfixed

Medium Severity

The apierrors.IsNotFound(err) branch causes this barrier to pass on its very first poll, making it a no-op. testCli is a direct (non-cached) client, and the STS has no finalizers in envtest, so it's already gone from etcd when Delete returns. The IsNotFound branch always wins before the second branch (s.UID != originalUID) — which genuinely proves the controller finished its reconcile — can ever be reached. The Owns-watch-triggered reconcile is still async at that point, so the subsequent spec patch can still land between the controller's Get and its optimistic-locked flushStatus, producing the same 409 this barrier is meant to prevent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 44e8710. Configure here.

}
return latest.Status.StatefulSet != nil &&
latest.Status.StatefulSet.UID == "00000000-0000-0000-0000-000000000000"
}, "Status.Update forging UID must be durable before next mutation")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impostor barrier trivially satisfied, risks new timeout

Medium Severity

Because testCli is a direct (non-cached) client and Status().Update is synchronous, this barrier's Get sees the forged UID on the very first poll and returns immediately — it never actually drains any reconcile woken by the status write. The original 409 race persists. Worse, if the predicate doesn't filter status-only updates and the controller is fast enough to process the forged UID before the first poll, it nils Status.StatefulSet (impostor-deletion branch), and the barrier's condition (UID == "00000000-…") can never become true again — introducing a new 30s timeout flake that didn't exist before.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 44e8710. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant