fix: replace panic crash in snapshot buffer with do_send#1430
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR introduces async event dispatch capabilities and refactors event handling to remove trap-based error patterns. Key changes include adding Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment Tip You can enable review details to help with troubleshooting, context usage and more.Enable the |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/events/src/snapshot_buffer/timelock_queue.rs (1)
161-167: Verify: Popped timelocks cannot be retried ifdo_sendsilently drops.The timelock is popped from the heap before
do_send. If the BatchRouter mailbox were full,do_sendwould silently drop theFlushSeqand the batch would never flush.However, this is an acceptable trade-off because:
- BatchRouter has
MAILBOX_LIMIT(256) capacity- The previous
try_send+trapapproach crashed the entire system- During normal operation, having 256+ pending flush messages is unlikely
The current approach prevents crashes at the cost of a theoretical (but highly improbable) silent drop scenario.
💡 Optional: Add trace logging for observability
while self.timelocks.len() > 0 && self.next_timelock_lt(now_time) { if let Some(tl) = self.timelocks.pop() { let seq = tl.0.seq; debug!("Flushing seq {}", seq); self.batch_router.do_send(FlushSeq(seq)); } }Consider adding a trace log after
do_sendto aid debugging if flush issues arise in production.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/events/src/snapshot_buffer/timelock_queue.rs` around lines 161 - 167, The loop currently pops timelocks and calls self.batch_router.do_send(FlushSeq(seq)) which can silently drop messages; keep the current pop-and-fire behavior but add an observability trace: after calling do_send in the while loop inside timelock processing (references: timelocks, next_timelock_lt, batch_router, do_send, FlushSeq, MAILBOX_LIMIT) emit a trace/debug log that records the seq being flushed and current timelocks length (and optionally note MAILBOX_LIMIT) so dropped flushes are discoverable in logs without changing the send semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/events/src/snapshot_buffer/timelock_queue.rs`:
- Around line 161-167: The loop currently pops timelocks and calls
self.batch_router.do_send(FlushSeq(seq)) which can silently drop messages; keep
the current pop-and-fire behavior but add an observability trace: after calling
do_send in the while loop inside timelock processing (references: timelocks,
next_timelock_lt, batch_router, do_send, FlushSeq, MAILBOX_LIMIT) emit a
trace/debug log that records the seq being flushed and current timelocks length
(and optionally note MAILBOX_LIMIT) so dropped flushes are discoverable in logs
without changing the send semantics.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 383c1baf-0d78-4869-b9f1-7d239e995f0f
📒 Files selected for processing (7)
crates/events/src/bus_handle.rscrates/events/src/sequencer.rscrates/events/src/snapshot_buffer/batch.rscrates/events/src/snapshot_buffer/batch_router.rscrates/events/src/snapshot_buffer/snapshot_buffer.rscrates/events/src/snapshot_buffer/timelock_queue.rscrates/sync/src/sync.rs
We faced the receiver issue again on the the aggregator side and it crashes during startup sync with:
When the node starts and replays historical EVM events the snapshot buffer pipeline overwhelms downstream actor mailboxes, since they all use
try_send()wrapped intrap, when any mailbox fills up, thetry_sendreturnsSendError::Full,trapcatches it and panics, crashing the entire process.The
trap + PanicDispatcherdoes not work well here since this is not a real error condition, it's transient backpressure during burst load. The data is valid, the actors are healthy, they just can't drain fast enough during sync replay and mailbox limit.Summary by CodeRabbit
New Features
Improvements