test(supernode): wait on EL finalized before fresh-data-dir restart #20945

Merged
ajsutton merged 1 commit into develop from aj/fix/supernode-resync-post-activation-flake
May 21, 2026

ajsutton (Contributor) commented May 21, 2026

Closes #20944.

TestSupernodeResyncResumesAtActivation_PostActivation was flaking (~6 hits in CI Insights). In each failure the first seal timestamp equalled the backfill handoff timestamp, both equal to activationTimestamp, which breaks the strict Lessf(first.Timestamp, backfillHandoff) check in AssertBackfillCovers.

Root cause: the supernode CL's safety.Finalized advances in-memory before the corresponding forkchoiceUpdated is delivered to and persisted by the EL. The test only waited on the CL's view, so under CI scheduling pressure RestartWithFreshDataDir could fire inside that window. op-reth then still reported finalized = L2 genesis to the fresh op-node, which correctly reset the pipeline back to genesis, wrote the L1=0 / L2=genesis SafeDB pin in onEngineConfirmedReset, and collapsed the cold-start backfill window to empty.
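The window described above can be illustrated with a small model. This is only a sketch under assumptions: elNode and safeToWipe are hypothetical names invented for illustration and are not part of op-node, op-reth, or the test framework. It shows why checking the CL's view alone is not enough, and why the wipe should be gated on what each EL has persisted.

```go
package main

import "fmt"

// elNode is a hypothetical stand-in for an execution-layer client.
// finalized is the finalized block number the EL has actually persisted,
// i.e. what it would report to a freshly started CL.
type elNode struct {
	finalized uint64
}

// safeToWipe reports whether it is safe to restart the CL with a fresh data
// dir: every EL must have persisted a finalized block at or beyond the target
// that the CL already considers finalized in memory.
func safeToWipe(clFinalized uint64, els ...elNode) bool {
	for _, el := range els {
		if el.finalized < clFinalized {
			return false // the forkchoice update has not landed on this EL yet
		}
	}
	return true
}

func main() {
	clFinalized := uint64(12) // CL has advanced in memory

	el := elNode{finalized: 0} // EL still reports genesis as finalized
	fmt.Println(safeToWipe(clFinalized, el)) // false: restart here resets to genesis

	el.finalized = 12 // forkchoice update delivered and persisted
	fmt.Println(safeToWipe(clFinalized, el)) // true
}
```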

Fix is test-only — the supernode's reset-to-genesis behaviour is correct given op-reth's persisted state:

  • Wait on each EL's eth.Finalized label as well as the CL's safety.Finalized so op-reth has persisted the advance before we wipe the supernode data dir.
  • L2ELNode.AdvancedFn now defaults to block+30 polling attempts (was block+3) and takes a varargs WithTimeout option so callers can extend the budget. The supernode test passes WithTimeout(180) to match the CL wait (180 attempts × 2s = 360s).
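The variadic option described in the second bullet follows Go's functional-options pattern. The sketch below shows the general shape only. The real AdvancedFn signature is not shown in this PR description, so advancedConfig, AdvancedOption, attemptBudget, and wallClock are assumed names, and the 2s polling interval is taken from the arithmetic quoted above.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values: the default of block+30 attempts and the 2s interval come
// from the PR description; everything else here is hypothetical.
const (
	defaultExtraAttempts = 30
	pollInterval         = 2 * time.Second
)

// advancedConfig holds the polling budget for a wait.
type advancedConfig struct {
	attempts int
}

// AdvancedOption mutates the config; callers pass zero or more of them.
type AdvancedOption func(*advancedConfig)

// WithTimeout overrides the number of polling attempts.
func WithTimeout(attempts int) AdvancedOption {
	return func(c *advancedConfig) { c.attempts = attempts }
}

// attemptBudget returns how many polls a wait for `block` new blocks gets:
// block+30 by default, or whatever the options set.
func attemptBudget(block int, opts ...AdvancedOption) int {
	cfg := advancedConfig{attempts: block + defaultExtraAttempts}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg.attempts
}

// wallClock converts an attempt budget into an upper bound on wait time.
func wallClock(attempts int) time.Duration {
	return time.Duration(attempts) * pollInterval
}

func main() {
	fmt.Println(attemptBudget(1))                   // 31
	fmt.Println(attemptBudget(1, WithTimeout(180))) // 180
	fmt.Println(wallClock(180))                     // 6m0s
}
```

Existing callers that pass no options keep compiling and simply get the larger default, which is the usual reason to prefer this pattern over adding a positional parameter.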

No deterministic local repro — flake requires CI scheduling to land the restart inside the CL→EL FCU window.

The supernode CL's safety.Finalized advances in-memory before the
corresponding forkchoiceUpdated is delivered to and persisted by the EL.
If RestartWithFreshDataDir is called inside that window, op-reth still
reports finalized=L2 genesis to the fresh op-node, which then correctly
resets the pipeline back to L2 genesis. The resulting "L1=0 / L2=genesis"
SafeDB pin makes FirstSafeHeadTimestamp return the genesis time, the
cold-start backfill window collapses to empty, and the assertion
first sealed timestamp < FirstVerifiableTimestamp fires.
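To make the failure mode concrete, the check is a strict less-than, so an empty backfill window fails it even though nothing is ordered incorrectly. The snippet below is an illustrative sketch with made-up timestamps; backfillCovers is a hypothetical helper, not the test's actual assertion code.

```go
package main

import "fmt"

// backfillCovers mirrors the shape of the strict check: the first sealed
// timestamp must be strictly earlier than the backfill handoff timestamp.
func backfillCovers(firstSealed, handoff uint64) bool {
	return firstSealed < handoff
}

func main() {
	const activation uint64 = 1_000 // illustrative activation timestamp

	// Non-empty window: the handoff lies after the first sealed timestamp.
	fmt.Println(backfillCovers(activation, activation+12)) // true

	// Collapsed window seen in the flake: both values equal the activation
	// timestamp, so the strict comparison fails.
	fmt.Println(backfillCovers(activation, activation)) // false
}
```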

Also wait on each EL's finalized label so op-reth has persisted the
advance before we wipe the supernode data dir.

L2ELNode.AdvancedFn now defaults to block+30 polling attempts (was
block+3) and takes a varargs WithTimeout option so callers like this
test can extend the budget to match the CL wait (180 attempts = 360s).

Refs: #20944
@ajsutton ajsutton force-pushed the aj/fix/supernode-resync-post-activation-flake branch from 1f8acdc to ad1feb7 Compare May 21, 2026 05:27
@ajsutton ajsutton marked this pull request as ready for review May 21, 2026 10:57
@ajsutton ajsutton requested a review from a team as a code owner May 21, 2026 10:57
karlfloersch (Contributor) left a comment

LGTM

@ajsutton ajsutton added this pull request to the merge queue May 21, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 21, 2026
@ajsutton ajsutton added this pull request to the merge queue May 21, 2026
Merged via the queue into develop with commit 4e8a088 May 21, 2026
68 checks passed
@ajsutton ajsutton deleted the aj/fix/supernode-resync-post-activation-flake branch May 21, 2026 23:25

Development

Successfully merging this pull request may close these issues.

flaky test: TestSupernodeResyncResumesAtActivation_PostActivation

2 participants