Add op-reth proofs-history reorg crash repro #20956

Draft
karlfloersch wants to merge 6 commits into develop from karl/op-reth-reorg-repro


@karlfloersch commented May 21, 2026

Summary

Adds a minimal draft interop acceptance repro for op-reth proofs-history during a CL-driven invalid executing-message reorg.

The repro path is TestReorgInvalidExecMsgOpRethProofsHistoryTinyWindow. It is intentionally a negative (failing) test for now: a failure with the message "repro observed" means the repro worked.

Test Flow

  1. Start a two-L2 supernode interop devstack using op-reth with proofs-history enabled and --proofs-history.window=1.
  2. Reuse the existing invalid executing-message flow: chain B emits the initiating message, chain A force-includes an executing message with an invalid LogIndex.
  3. Build the invalid chain A divergence block and one child block with op-test-sequencer.
  4. Wait 12s before restarting the real chain A sequencer. This gives proofs-history time to flush and prune with the tiny retention window.
  5. Restart chain A sequencing and batching so the supernode detects the invalid exec message and initiates the CL-driven rewind/reorg.
  6. During the rewind, op-reth proofs-history receives the reorg notification and attempts to unwind past retained proof history.
  7. The test waits until chain A op-reth RPC fails, proving op-reth exited during the proofs-history unwind path.
  8. The test explicitly reaps and restarts chain A op-reth against the same data directory.
  9. If op-reth does not become RPC-ready within 15s of restart, the test intentionally fails with repro observed.
  10. If op-reth does restart, the test polls both chain A op-reth RPC and shared supernode_syncStatus for sustained post-restart non-recovery.
  11. If chain A op-reth RPC recovers and the EL unsafe head advances past the reorg point, the test fails with unexpected recovery.
  12. If supernode chain A local-safe or cross-safe advances past the reorg point, the test fails with unexpected supernode recovery.
  13. If chain A EL does not advance and supernode chain A safe does not advance for the observation window, the test intentionally fails with repro observed.
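Steps 11 to 13 amount to a three-way classification of what the test sees during the observation window. The sketch below shows that decision in isolation. It is not the test's actual code: the Sample type, its field names, and Classify are hypothetical names chosen for illustration, and the real test polls live RPC endpoints rather than a slice of samples.

```go
package main

import "fmt"

// Sample is one poll of chain A state. All field names are hypothetical.
type Sample struct {
	ELReachable bool   // whether the chain A op-reth RPC answered at all
	ELUnsafe    uint64 // EL unsafe head number; only meaningful if ELReachable
	LocalSafe   uint64 // supernode chain A local-safe block number
	CrossSafe   uint64 // supernode chain A cross-safe block number
}

// Outcome is the failure message the negative test would end with.
type Outcome string

const (
	ReproObserved        Outcome = "repro observed"
	UnexpectedRecovery   Outcome = "unexpected recovery"
	UnexpectedSNRecovery Outcome = "unexpected supernode recovery"
)

// Classify applies the rules of steps 11-13 to the samples collected
// during one observation window.
func Classify(reorgPoint uint64, samples []Sample) Outcome {
	for _, s := range samples {
		// Step 11: the EL is reachable and moved past the reorg point.
		if s.ELReachable && s.ELUnsafe > reorgPoint {
			return UnexpectedRecovery
		}
		// Step 12: the supernode advanced chain A safe past the reorg point.
		if s.LocalSafe > reorgPoint || s.CrossSafe > reorgPoint {
			return UnexpectedSNRecovery
		}
	}
	// Step 13: nothing advanced for the whole window.
	return ReproObserved
}

func main() {
	// EL RPC down, supernode pinned at the reorg point.
	stuck := []Sample{{ELReachable: false, LocalSafe: 15, CrossSafe: 14}}
	fmt.Println(Classify(15, stuck))
}
```

Every outcome is still a test failure. Only the message distinguishes a successful repro from a run where the system recovered on its own.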

Observed Repro

Latest focused restart run showed:

  • chain A included invalid block 24d174..7c1ebb:15 and child ebbd92..01431b:16
  • proofs-history pruned chain A proof storage through block 20 with --proofs-history.window=1
  • supernode invalidated block 15 and drove the CL rewind/reorg
  • op-reth received the reorg notification and exited with ExEx proofs-history crashed: Attempted to unwind to block 14 beyond earliest stored block 20
  • the test restarted op-reth against the same data directory
  • after restart, op-reth crashed again with ExEx proofs-history crashed: Parent hash mismatch at block 22
  • supernode stayed unable to derive chain A at the reorg point, including chain 901 not ready for timestamp ... failed to determine L2BlockRef of height 15
  • for the observation window, chain A EL RPC did not advance and supernode chain A safe stayed pinned around the reorg point, then the test failed intentionally with repro observed

Repro Confidence

This now proves the restart shape we care about in devstack: proofs-history can kill op-reth during a CL-driven rewind, op-reth can fail again after restart against the same data directory, and the supernode does not recover automatically during the observation window.

Important caveat: this is still a forced-retention repro. It does not prove the full production state under normal proof retention, and the restarted node currently crashes again rather than remaining live with forkchoice pinned while serving partial RPC.

Review Notes / Follow-ups

  • Removed the earlier explicit op-reth crash/restart repro and devstack crash-control plumbing.
  • Added devstack option plumbing for op-reth proofs-history window overrides.
  • Added a timed op-reth restart path so a failed restart becomes a bounded repro result instead of waiting for the global test timeout.
  • Added a deliberately failing tiny-window proofs-history repro that waits for sustained chain A EL and supernode non-recovery after restart before failing.
  • Remaining: decide whether this PR should keep the negative repro as skipped/manual-only before merge, or whether it should be used only while developing the fix.
  • Remaining: a stronger prod-faithful repro would need normal proof retention plus a deterministic way to hit the inconsistent post-restart block availability state without forcing the ExEx to crash again.

Validation

  • mise exec -- go test ./op-acceptance-tests/tests/interop/reorgs -run '^$' -count=0
  • mise exec -- go test ./op-devstack/dsl ./op-devstack/presets ./op-devstack/sysgo -run '^$' -count=0
  • Focused restart run after commit 9658dedaab: LOG_LEVEL=info mise exec -- go test -v ./op-acceptance-tests/tests/interop/reorgs -run '^TestReorgInvalidExecMsgOpRethProofsHistoryTinyWindow$' -count=1 -timeout=8m
    • Result: failed as intended after op-reth restart and sustained chain A/supernode non-recovery.
    • Log saved locally at /tmp/op-reth-reorg-tiny-proof-window-restart-bounded.log.
  • Earlier full build context: cd op-acceptance-tests && mise exec -- just build-deps built contracts forge artifacts, op-program/cannon prestates, and release Rust binaries for kona-node, kona-host, and op-reth; it then stopped at the later op-rbuilder step because this worktree has no root op-rbuilder/ directory.

@karlfloersch force-pushed the karl/op-reth-reorg-repro branch from dea0b9b to 80b548d May 21, 2026 17:03
@karlfloersch changed the title Add op-reth crash reorg acceptance repro → Add op-reth proofs-history reorg repro May 21, 2026
@karlfloersch changed the title Add op-reth proofs-history reorg repro → Add op-reth proofs-history reorg crash repro May 21, 2026
@karlfloersch
Contributor Author

Prod log comparison against this repro

I pulled the sdg-v1 Loki logs for an-sdg-v1-1-ops-reth-a-sn-3 over 2026-05-20T15:00:00Z..17:30:00Z and compared them to the local repro run from /tmp/op-reth-reorg-tiny-proof-window-restart-bounded.log.

Production signal

Relevant prod context:

  • Chain B: 420120085
  • Reorg point: 1880582
  • Bad hash: 0x8335a150a2a51f63be6211cd778659b4432421e795dab5ca5096fd64eb43a2e9
  • Replacement hash: 0x734d96109400506318ff5f7929a7714aea04c60149183c12ee8bc2f658822f67

What the logs show:

  • Before the stuck state, op-reth imported and proofs-history processed later blocks. Example: block 1880695 was added to canonical chain and proofs-history stored trie updates around 2026-05-20T15:12:19Z.
  • Afterward, proofs-history repeatedly failed on missing blocks at the reorg boundary:
    • Missing block 1880582: 4,861 matching log lines in the fetched window.
    • Missing block 1880583: 5,006 matching log lines in the fetched window.
  • op-reth status stayed pinned at the reorg point while the reference advanced. Around 2026-05-20T16:45:05Z and 16:46:20Z, op-reth logged latest_block=1880582.
  • The comparison monitor repeatedly reported the first mismatch at 1880582, with the reference hash equal to the bad hash and target hash equal to the replacement hash:
    • referenceHash=0x8335a150...43a2e9
    • targetHash=0x734d9610...f67
    • the reference height advanced from 1886144 onward while the target height remained 1880582.
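Per-block counts like the ones above can be reproduced from an exported log slice with a short tally. The sketch below assumes each relevant line contains the literal text "Missing block <number>". The exact prod log format is not shown here, so the regular expression, the CountMissingBlocks name, and the sample lines in main are all illustrative assumptions, not real log output.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// missingBlockRE assumes the log line contains "Missing block <number>".
var missingBlockRE = regexp.MustCompile(`Missing block (\d+)`)

// CountMissingBlocks tallies matching log lines per block number.
// Each line is counted at most once, keyed by the first match on it.
func CountMissingBlocks(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if m := missingBlockRE.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Synthetic lines for illustration only.
	logs := strings.Join([]string{
		"ERROR proofs-history: Missing block 1880582",
		"ERROR proofs-history: Missing block 1880583",
		"INFO status latest_block=1880582",
		"ERROR proofs-history: Missing block 1880583",
	}, "\n")

	counts := CountMissingBlocks(logs)
	fmt.Println(counts["1880582"], counts["1880583"])
}
```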

I did not find a literal SIGBUS / bus error string in this Loki slice. That may be in another stream/window, but it is not in the fetched namespace query, so I would not cite SIGBUS from this dataset.

Local repro signal

The local repro creates the same class of failure with a deliberately tiny proofs-history window:

  • Invalid exec message included on chain A at block 15:
    • Unsafe head after invalid exec msg has been included in chain A ... unsafeHead=24d174..7c1ebb:15 parent=2c9e05..925f7b:14
  • During the CL-driven invalidation/rewind, proofs-history exits:
    • ExEx proofs-history crashed: Attempted to unwind to block 14 beyond earliest stored block 20
  • The test restarts op-reth after the proofs-history unwind exit:
    • restarting op-reth after proofs-history unwind exit number=15 hash=24d174..7c1ebb
  • After restart, op-reth fails again on inconsistent history:
    • ExEx proofs-history crashed: Parent hash mismatch at block 22 ...
  • The bounded observation then shows non-recovery for 30s:
    • chain A EL RPC stopped returning an unsafe head (connection reset / timeout)
    • supernode chain A safe stayed at 14
    • pending/local safe stayed at 15
    • the test fails intentionally with repro observed because chain A EL did not advance and supernode did not advance chain A safe past the reorg point for 30s

Match / gap

The repro matches the core prod shape:

  • CL-driven invalidation/rewind around a replaced unsafe block.
  • proofs-history is active and becomes the component surfacing the bad state.
  • restart after the proofs-history/op-reth exit.
  • persistent post-restart inability to make the affected chain usable again.
  • supernode remains unable to advance the affected chain safe past the reorg point.

Known differences:

  • Prod is a long-range real network failure at 1880582/1880583; the test compresses this to a tiny deterministic history window around 14/15/20.
  • Prod shows repeated Missing block 1880582/1880583; the local test triggers deterministic proofs-history failures as unwind beyond earliest stored block and then Parent hash mismatch.
  • The fetched prod logs prove the stuck block/missing-block behavior and hash disagreement, but not the literal SIGBUS event.
  • The native engine reset branch is a useful contrast: it still triggers the proofs-history crash, then recovers and fails the repro assertion with unexpected recovery, which is exactly what we want from a candidate CL/supernode-side mitigation.
