Fix inconsistent state detection and rollback #2983

alpe · 2026-01-15T21:03:59Z

Overview

Also check timestamp for Idempotency
Implement rollback for evm EngineClient

claude · 2026-01-15T21:04:15Z

Claude finished @alpe's task in 2m 6s —— View job

Code Review Complete ✓

Overview

This PR adds important improvements for handling inconsistent state between the execution layer (EL) and consensus layer (CL) during rolling restarts and Raft leadership changes. The changes introduce timestamp-based idempotency checks and implement EL rollback functionality.

Summary of Changes

1. Added Rollbackable Interface (core/execution/execution.go:106-121)

New optional interface for execution clients to support automatic rollback
Enables recovery when EL is ahead of consensus layer

2. Implemented Rollback for EVM EngineClient (execution/evm/execution.go:957-997)

Uses forkchoice update to reset EL head to target height
Resets head, safe, and finalized pointers to target block

3. Enhanced Idempotency Detection (execution/evm/execution.go:699-803)

Renamed checkIdempotency → reconcileExecutionAtHeight (better name!)
Added timestamp validation to detect dual-store conflicts
Validates payloadID still exists before resuming
Triggers automatic rollback on timestamp mismatch

Detailed Review

1. Code Quality & Architecture ⭐⭐⭐⭐

Strengths:

Clean separation of concerns via optional interface pattern
Excellent naming: reconcileExecutionAtHeight is more descriptive than checkIdempotency
Well-structured error handling with context preservation
Good use of mutex protection for concurrent state access
Comprehensive logging with appropriate levels (Info/Warn/Error)

Observations:

The rollback implementation is elegant - using forkchoice update rather than destructive operations
Comment quality is excellent, especially explaining the "why" (e.g., "Dual-Store Conflicts")

Minor Concern:

execution/evm/execution.go:788: The rollback is synchronous and could block. Consider adding timeout context or documenting expected latency.

2. Error Handling & Resilience ⭐⭐⭐⭐⭐

Strengths:

Proper error wrapping with context (fmt.Errorf("... %w", err))
Graceful degradation: idempotency check failures don't fail execution
Fallback behavior when payloadID is stale (lines 743-749)
Good use of structured logging before errors

Excellent Pattern (execution/evm/execution.go:346-359):

stateRoot, payloadID, found, err := c.reconcileExecutionAtHeight(ctx, blockHeight, timestamp, txs)
if err != nil {
    c.logger.Warn().Err(err).Uint64("height", blockHeight).Msg("ExecuteTxs: idempotency check failed")
    // Continue execution on error, as it might be transient
}

This is robust - a failure in the optimization doesn't break the primary flow.

Concern - Rollback Error Handling:

execution/evm/execution.go:788-794: If rollback fails, we return an error which will likely cause the node to crash or stall. Consider:

Should we retry rollback with backoff?
Should we expose a metric for rollback failures?
What happens if the target height block doesn't exist? (Currently returns error, which seems correct)

3. Security 🔒 ⭐⭐⭐⭐⭐

Strengths:

Timestamp validation prevents accepting stale/malicious ExecMeta
Payload validation prevents resume attacks (lines 736-742)
No exposure of sensitive data in logs
Proper context cancellation handling

Critical Security Property:
The timestamp check (lines 713-726) is essential for consensus safety. It ensures that even if local ExecMeta is compromised or stale, the EL state is verified against the authoritative Raft-replicated timestamp.

Question:

Should timestamp comparison be within a tolerance window? Unix timestamps are exact, but consider clock skew in distributed systems. However, since Raft is the source of truth, exact matching seems correct.

4. Performance & Resource Efficiency ⚡ ⭐⭐⭐⭐

Strengths:

Minimal additional RPC calls (only on mismatch detection)
Efficient early returns for idempotent cases
Proper mutex usage (lock only when needed)

Observations:

Rollback is an expensive operation (forkchoice update + reorg), but appropriately used only in exceptional cases
The payloadID validation (line 736) adds a GetPayload RPC call on every restart with pending execution. This is reasonable for correctness.

Potential Optimization:

execution/evm/execution.go:754-801: Two sequential RPC calls to getBlockInfo (lines 754 and 964). Consider caching or combining if timestamp mismatches are common during development/testing.

5. Testing & Reliability 🧪 ⭐⭐⭐ (Needs Improvement)

Critical Gap - Missing Test Coverage:

The PR adds significant new functionality but no new tests. Required test coverage:

Timestamp Mismatch Detection:

// Test: ExecMeta has stale timestamp → should ignore and proceed
// Setup: Save ExecMeta with old timestamp
// Execute: Call ExecuteTxs with new timestamp
// Assert: ExecMeta is ignored, fresh execution proceeds

Rollback on Timestamp Conflict:

// Test: EL has block with different timestamp → rollback
// Setup: EL has block at height N with timestamp T1
// Execute: ExecuteTxs at height N with timestamp T2 (T1 ≠ T2)
// Assert: Rollback called to height N-1

Stale PayloadID Handling:

// Test: ExecMeta has payloadID but EL doesn't recognize it
// Setup: Save ExecMeta with payloadID, restart EL (clears cache)
// Execute: Call ExecuteTxs
// Assert: Falls through to fresh execution

Rollback Implementation:

// Test: Rollback successfully resets EL state
// Setup: Build chain to height 5
// Execute: Rollback to height 3
// Assert: GetLatestHeight returns 3, forkchoice state correct

Recommendation:
Add comprehensive tests →

Existing Test Impact:

The existing TestEngineExecution should still pass (uses consistent timestamps)
Consider adding a variant that simulates timestamp mismatches

6. Documentation 📚 ⭐⭐⭐⭐

Strengths:

Excellent inline comments explaining the "why" behind decisions
Interface documentation is clear and comprehensive
Log messages are descriptive

Specific Praise:

Lines 776-778: Perfect explanation of dual-store conflict scenario
Lines 107-109: Clear documentation of when Rollbackable is needed

Minor Gap:

Consider adding a comment in Rollback() about what happens to blocks > targetHeight (are they pruned? orphaned? depends on EL implementation?)

Critical Issues 🚨

None - this is solid code!

Important Recommendations

Add Test Coverage (High Priority)
- Timestamp mismatch scenarios
- Rollback success/failure paths
- Stale payloadID handling
Consider Rollback Retry Logic (Medium Priority)
- If forkchoice update fails, should we retry?
- Add observability: metrics for rollback events
Document EL-Specific Behavior (Low Priority)
- How does Reth handle blocks after rollback?
- Are they permanently pruned or just orphaned?

Minor Suggestions

execution/evm/execution.go:713: Consider extracting timestamp comparison to a helper function for testability:

func (c *EngineClient) validateExecMetaTimestamp(execMeta *ExecMeta, expected time.Time) bool {
    return execMeta.Timestamp == expected.Unix()
}

execution/evm/execution.go:788: Add timeout context for rollback operation:

rollbackCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel()
if err := c.Rollback(rollbackCtx, height-1); err != nil {

core/execution/execution.go:120: Consider adding a return value for the old head height:
```
Rollback(ctx context.Context, targetHeight uint64) (oldHeight uint64, err error)
```
This would help with metrics/logging.

Verdict

Overall Assessment: ⭐⭐⭐⭐ (4/5)

This is a well-designed and well-implemented solution to a critical distributed systems problem. The code quality is high, error handling is robust, and the architecture is sound. The main gap is test coverage - this functionality deals with edge cases (rolling restarts, leadership changes) that are difficult to debug in production, so comprehensive tests are essential.

Recommendation: Request Changes - Add tests before merging.

The core logic is excellent, but we need confidence that it works correctly in the scenarios it's designed to handle. Once tests are added, this will be production-ready.

Tasks

Read modified files to understand changes
Analyze git diff to see exact modifications
Review code quality & architecture
Review error handling & resilience
Review security considerations
Review performance & resource efficiency
Review testing coverage
Review documentation
Post comprehensive review feedback

github-actions · 2026-01-15T21:04:16Z

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build	Format	Lint	Breaking	Updated (UTC)
`✅ passed`	`⏩ skipped`	`✅ passed`	`✅ passed`	Jan 15, 2026, 9:04 PM

gemini-code-assist · 2026-01-15T21:04:17Z

Summary of Changes

Hello @alpe, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the resilience and consistency of the execution layer by introducing a robust rollback mechanism and refining idempotency checks. These changes enable the system to automatically recover from state inconsistencies, such as those arising from rolling restarts or consensus layer divergences, by ensuring the execution layer can revert to a correct historical state and re-synchronize with the authoritative chain.

Highlights

New Rollbackable Interface: Introduced a new Rollbackable interface in core/execution/execution.go that execution clients can implement to support automatic state rollback when the execution layer is ahead of the target height, facilitating recovery during rolling restarts.
EngineClient Rollback Implementation: The EngineClient in execution/evm/execution.go now implements the Rollbackable interface. Its Rollback method resets the execution layer head to a specified height by performing a forkchoiceUpdate, forcing the EL to reorg its canonical chain.
Enhanced Idempotency and State Reconciliation: The checkIdempotency method in EngineClient has been renamed to reconcileExecutionAtHeight and significantly improved. It now includes timestamp verification for already-promoted blocks to detect stale records and validates the existence of payloadID for in-progress executions, preventing the use of expired or invalid payload IDs after node restarts.
Automatic Rollback for Timestamp Mismatches: A new mechanism has been added to reconcileExecutionAtHeight to automatically trigger a rollback of the execution layer when a timestamp mismatch is detected for an existing block. This addresses 'Dual-Store Conflicts' by reverting the EL to height-1 to allow re-execution with the correct, Raft-replicated block.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

alpe · 2026-01-15T21:05:57Z

execution/evm/execution.go

 		if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 {
-			c.logger.Info().
+			if execMeta.Timestamp == timestamp.Unix() {
+				c.logger.Info().


This is still a weak check without hash but it fixed the problems on restart

gemini-code-assist

Code Review

This pull request introduces important fixes for handling inconsistent states between the consensus and execution layers, particularly around idempotency checks and automatic rollback. The addition of the Rollbackable interface and its implementation in the EVM EngineClient are well-designed. The logic to handle stale metadata and trigger rollbacks on timestamp mismatches significantly improves the system's robustness. My main concern is a critical race condition in the locking strategy for forkchoiceUpdated calls, which could lead to state inconsistencies under concurrent operations.

gemini-code-assist · 2026-01-15T21:06:18Z

execution/evm/execution.go

+	c.mu.Lock()
+	c.currentHeadBlockHash = blockHash
+	c.currentHeadHeight = targetHeight
+	c.currentSafeBlockHash = blockHash
+	c.currentFinalizedBlockHash = blockHash
+	args := engine.ForkchoiceStateV1{
+		HeadBlockHash:      blockHash,
+		SafeBlockHash:      blockHash,
+		FinalizedBlockHash: blockHash,
+	}
+	c.mu.Unlock()
+
+	if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil {
+		return fmt.Errorf("forkchoice update for rollback failed: %w", err)
+	}


There is a critical race condition in how forkchoiceUpdated calls are managed. The mutex c.mu is unlocked before calling c.doForkchoiceUpdate, which means the update to the client's internal state (c.current... fields) and the RPC call that propagates this state to the execution layer are not atomic.

If another goroutine (e.g., handling SetFinal or another ExecuteTxs) modifies the state and sends its own forkchoiceUpdated call between the c.mu.Unlock() and the completion of this doForkchoiceUpdate, the state of the EngineClient and the execution layer can diverge.

For example:

Rollback locks, updates c.current... to height H-1, and unlocks.

Another goroutine calls SetFinal for height H. It locks, updates c.currentFinalizedBlockHash to H, and unlocks.

SetFinal's doForkchoiceUpdate is sent to the EL, finalizing H.

Rollback's doForkchoiceUpdate is sent, resetting head/safe/finalized to H-1.

The final state of the EL is H-1, but EngineClient's state has currentFinalizedBlockHash as H, which is a state inconsistency.

To ensure atomicity, the lock should be held over the doForkchoiceUpdate call. While this has performance implications due to holding a lock over a network call with retries, it is necessary for correctness. This atomicity should be enforced for all methods that call doForkchoiceUpdate.

Suggested change

c.mu.Lock()

c.currentHeadBlockHash = blockHash

c.currentHeadHeight = targetHeight

c.currentSafeBlockHash = blockHash

c.currentFinalizedBlockHash = blockHash

args := engine.ForkchoiceStateV1{

HeadBlockHash: blockHash,

SafeBlockHash: blockHash,

FinalizedBlockHash: blockHash,

}

c.mu.Unlock()

if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil {

return fmt.Errorf("forkchoice update for rollback failed: %w", err)

}

c.mu.Lock()

defer c.mu.Unlock()

c.currentHeadBlockHash = blockHash

c.currentHeadHeight = targetHeight

c.currentSafeBlockHash = blockHash

c.currentFinalizedBlockHash = blockHash

args := engine.ForkchoiceStateV1{

HeadBlockHash: blockHash,

SafeBlockHash: blockHash,

FinalizedBlockHash: blockHash,

}

if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil {

return fmt.Errorf("forkchoice update for rollback failed: %w", err)

}

codecov · 2026-01-15T21:07:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.72%. Comparing base (52825bf) to head (fbf49f7).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2983      +/-   ##
==========================================
- Coverage   58.77%   58.72%   -0.05%     
==========================================
  Files         101      101              
  Lines        9685     9685              
==========================================
- Hits         5692     5688       -4     
- Misses       3381     3385       +4     
  Partials      612      612

Flag	Coverage Δ
combined	`58.72% <ø> (-0.05%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

julienrbrt · 2026-01-15T21:54:50Z

execution/evm/execution.go

+// where the EL committed blocks that were not replicated to Raft).
+//
+// Implements the execution.Rollbackable interface.
+func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error {


Could we wire the command directly then?
This would fix #2967.

For consitency, a good follow-up would be to implement this in ev-abci and use it in the rollback command as well instead of directly calling: https://github.com/evstack/ev-abci/blob/e905b0b/server/rollback_cmd.go#L91-L95.

Fix inconsistent state detection and rollback

fbf49f7

github-actions bot assigned alpe Jan 15, 2026

alpe commented Jan 15, 2026

View reviewed changes

gemini-code-assist bot reviewed Jan 15, 2026

View reviewed changes

julienrbrt reviewed Jan 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix inconsistent state detection and rollback #2983

Fix inconsistent state detection and rollback #2983

alpe commented Jan 15, 2026

Uh oh!

claude bot commented Jan 15, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Jan 15, 2026

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Uh oh!

alpe Jan 15, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 15, 2026

Uh oh!

codecov bot commented Jan 15, 2026

Uh oh!

julienrbrt Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix inconsistent state detection and rollback #2983

Are you sure you want to change the base?

Fix inconsistent state detection and rollback #2983

Conversation

alpe commented Jan 15, 2026

Overview

Uh oh!

claude bot commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review Complete ✓

Overview

Summary of Changes

Detailed Review

1. Code Quality & Architecture ⭐⭐⭐⭐

2. Error Handling & Resilience ⭐⭐⭐⭐⭐

3. Security 🔒 ⭐⭐⭐⭐⭐

4. Performance & Resource Efficiency ⚡ ⭐⭐⭐⭐

5. Testing & Reliability 🧪 ⭐⭐⭐ (Needs Improvement)

6. Documentation 📚 ⭐⭐⭐⭐

Critical Issues 🚨

Important Recommendations

Minor Suggestions

Verdict

Tasks

Uh oh!

github-actions bot commented Jan 15, 2026

Uh oh!

gemini-code-assist bot commented Jan 15, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

alpe Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Jan 15, 2026

Codecov Report

Uh oh!

julienrbrt Jan 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude bot commented Jan 15, 2026 •

edited

Loading