-
Notifications
You must be signed in to change notification settings - Fork 245
Fix inconsistent state detection and rollback #2983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Claude finished @alpe's task in 2m 6s —— View job Code Review Complete ✓OverviewThis PR adds important improvements for handling inconsistent state between the execution layer (EL) and consensus layer (CL) during rolling restarts and Raft leadership changes. The changes introduce timestamp-based idempotency checks and implement EL rollback functionality. Summary of Changes1. Added
2. Implemented Rollback for EVM EngineClient (execution/evm/execution.go:957-997)
3. Enhanced Idempotency Detection (execution/evm/execution.go:699-803)
Detailed Review1. Code Quality & Architecture ⭐⭐⭐⭐Strengths:
Observations:
Minor Concern:
2. Error Handling & Resilience ⭐⭐⭐⭐⭐Strengths:
Excellent Pattern (execution/evm/execution.go:346-359): stateRoot, payloadID, found, err := c.reconcileExecutionAtHeight(ctx, blockHeight, timestamp, txs)
if err != nil {
c.logger.Warn().Err(err).Uint64("height", blockHeight).Msg("ExecuteTxs: idempotency check failed")
// Continue execution on error, as it might be transient
}This is robust - a failure in the optimization doesn't break the primary flow. Concern - Rollback Error Handling: execution/evm/execution.go:788-794: If rollback fails, we return an error which will likely cause the node to crash or stall. Consider:
3. Security 🔒 ⭐⭐⭐⭐⭐Strengths:
Critical Security Property: Question:
4. Performance & Resource Efficiency ⚡ ⭐⭐⭐⭐Strengths:
Observations:
Potential Optimization:
5. Testing & Reliability 🧪 ⭐⭐⭐ (Needs Improvement)Critical Gap - Missing Test Coverage: The PR adds significant new functionality but no new tests. Required test coverage:
Recommendation: Existing Test Impact:
6. Documentation 📚 ⭐⭐⭐⭐Strengths:
Specific Praise:
Minor Gap:
Critical Issues 🚨None - this is solid code! Important Recommendations
Minor Suggestions
VerdictOverall Assessment: ⭐⭐⭐⭐ (4/5) This is a well-designed and well-implemented solution to a critical distributed systems problem. The code quality is high, error handling is robust, and the architecture is sound. The main gap is test coverage - this functionality deals with edge cases (rolling restarts, leadership changes) that are difficult to debug in production, so comprehensive tests are essential. Recommendation: Request Changes - Add tests before merging. The core logic is excellent, but we need confidence that it works correctly in the scenarios it's designed to handle. Once tests are added, this will be production-ready. Tasks
|
|
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
|
Summary of ChangesHello @alpe, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the resilience and consistency of the execution layer by introducing a robust rollback mechanism and refining idempotency checks. These changes enable the system to automatically recover from state inconsistencies, such as those arising from rolling restarts or consensus layer divergences, by ensuring the execution layer can revert to a correct historical state and re-synchronize with the authoritative chain. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
| if execMeta.Stage == ExecStagePromoted && len(execMeta.StateRoot) > 0 { | ||
| c.logger.Info(). | ||
| if execMeta.Timestamp == timestamp.Unix() { | ||
| c.logger.Info(). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still a weak check without hash but it fixed the problems on restart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces important fixes for handling inconsistent states between the consensus and execution layers, particularly around idempotency checks and automatic rollback. The addition of the Rollbackable interface and its implementation in the EVM EngineClient are well-designed. The logic to handle stale metadata and trigger rollbacks on timestamp mismatches significantly improves the system's robustness. My main concern is a critical race condition in the locking strategy for forkchoiceUpdated calls, which could lead to state inconsistencies under concurrent operations.
| c.mu.Lock() | ||
| c.currentHeadBlockHash = blockHash | ||
| c.currentHeadHeight = targetHeight | ||
| c.currentSafeBlockHash = blockHash | ||
| c.currentFinalizedBlockHash = blockHash | ||
| args := engine.ForkchoiceStateV1{ | ||
| HeadBlockHash: blockHash, | ||
| SafeBlockHash: blockHash, | ||
| FinalizedBlockHash: blockHash, | ||
| } | ||
| c.mu.Unlock() | ||
|
|
||
| if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil { | ||
| return fmt.Errorf("forkchoice update for rollback failed: %w", err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a critical race condition in how forkchoiceUpdated calls are managed. The mutex c.mu is unlocked before calling c.doForkchoiceUpdate, which means the update to the client's internal state (c.current... fields) and the RPC call that propagates this state to the execution layer are not atomic.
If another goroutine (e.g., handling SetFinal or another ExecuteTxs) modifies the state and sends its own forkchoiceUpdated call between the c.mu.Unlock() and the completion of this doForkchoiceUpdate, the state of the EngineClient and the execution layer can diverge.
For example:
Rollbacklocks, updatesc.current...to heightH-1, and unlocks.- Another goroutine calls
SetFinalfor heightH. It locks, updatesc.currentFinalizedBlockHashtoH, and unlocks. SetFinal'sdoForkchoiceUpdateis sent to the EL, finalizingH.Rollback'sdoForkchoiceUpdateis sent, resetting head/safe/finalized toH-1.
The final state of the EL is H-1, but EngineClient's state has currentFinalizedBlockHash as H, which is a state inconsistency.
To ensure atomicity, the lock should be held over the doForkchoiceUpdate call. While this has performance implications due to holding a lock over a network call with retries, it is necessary for correctness. This atomicity should be enforced for all methods that call doForkchoiceUpdate.
| c.mu.Lock() | |
| c.currentHeadBlockHash = blockHash | |
| c.currentHeadHeight = targetHeight | |
| c.currentSafeBlockHash = blockHash | |
| c.currentFinalizedBlockHash = blockHash | |
| args := engine.ForkchoiceStateV1{ | |
| HeadBlockHash: blockHash, | |
| SafeBlockHash: blockHash, | |
| FinalizedBlockHash: blockHash, | |
| } | |
| c.mu.Unlock() | |
| if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil { | |
| return fmt.Errorf("forkchoice update for rollback failed: %w", err) | |
| } | |
| c.mu.Lock() | |
| defer c.mu.Unlock() | |
| c.currentHeadBlockHash = blockHash | |
| c.currentHeadHeight = targetHeight | |
| c.currentSafeBlockHash = blockHash | |
| c.currentFinalizedBlockHash = blockHash | |
| args := engine.ForkchoiceStateV1{ | |
| HeadBlockHash: blockHash, | |
| SafeBlockHash: blockHash, | |
| FinalizedBlockHash: blockHash, | |
| } | |
| if err := c.doForkchoiceUpdate(ctx, args, "Rollback"); err != nil { | |
| return fmt.Errorf("forkchoice update for rollback failed: %w", err) | |
| } |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2983 +/- ##
==========================================
- Coverage 58.77% 58.72% -0.05%
==========================================
Files 101 101
Lines 9685 9685
==========================================
- Hits 5692 5688 -4
- Misses 3381 3385 +4
Partials 612 612
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| // where the EL committed blocks that were not replicated to Raft). | ||
| // | ||
| // Implements the execution.Rollbackable interface. | ||
| func (c *EngineClient) Rollback(ctx context.Context, targetHeight uint64) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we wire the command directly then?
This would fix #2967.
For consitency, a good follow-up would be to implement this in ev-abci and use it in the rollback command as well instead of directly calling: https://github.com/evstack/ev-abci/blob/e905b0b/server/rollback_cmd.go#L91-L95.
Overview