
supernode: harden direct rewind around op-reth restart #20962

Draft
karlfloersch wants to merge 1 commit into karl/op-reth-reorg-repro from karl/op-reth-reorg-recovery

@karlfloersch
Contributor

Summary

Draft follow-up to #20956. This tries the simplest CL/supernode-side recovery hardening for the op-reth proofs-history reorg repro:

  • make VirtualNode.Stop wait until the embedded op-node has actually stopped before RewindEngine rewinds the EL
  • after the direct engine rewind, reset the op-node SafeDB to the same target block the EL was rewound to
  • expose BlockAtTimestamp on the chain-container engine-controller interface so the chain container can resolve that SafeDB target

Local result

This does not fully recover the repro yet. It improves the sequence enough that, after the first op-reth restart, chain 901 is driven forward again, but op-reth then crashes a second time from the proofs-history ExEx while ingesting the replacement chain. That suggests the remaining failure is still in EL/ExEx state handling, not just supernode reset bookkeeping.

Relevant local acceptance run:

LOG_LEVEL=info mise exec -- go test -v ./op-acceptance-tests/tests/interop/reorgs \
  -run '^TestReorgInvalidExecMsgOpRethProofsHistoryTinyWindow$' \
  -count=1 -timeout=8m 2>&1 | tee /tmp/op-reth-reorg-recovery-attempt.log

Important signals from /tmp/op-reth-reorg-recovery-attempt.log:

  • chain_container/RewindEngine: rewound safe-db target=415c48..f9f0b7:14
  • first op-reth exit: ExEx proofs-history crashed: Attempted to unwind to block 14 beyond earliest stored block 20
  • restart occurred: restarting op-reth after proofs-history unwind exit
  • after restart, chain 901 progressed to UnsafeL2 ... :27, with PendingSafeL2/LocalSafeL2 ... :15
  • second op-reth exit: ExEx proofs-history crashed: Parent hash mismatch at block 22
  • final repro assertion: repro observed

Tests

  • go test ./op-supernode/supernode/chain_container/...
  • go test ./op-supernode/supernode/activity/interop -run 'TestApplyPendingTransition|TestReset|TestRewind'
  • focused acceptance repro above, still failing with the repro assertion
