Skip to content

fix: replace panic crash in snapshot buffer with do_send#1430

Merged
ryardley merged 2 commits into
mainfrom
fix/batch-rtr-backpressure
Mar 17, 2026
Merged

fix: replace panic crash in snapshot buffer with do_send#1430
ryardley merged 2 commits into
mainfrom
fix/batch-rtr-backpressure

Conversation

@hmzakhalid

@hmzakhalid hmzakhalid commented Mar 16, 2026

Copy link
Copy Markdown
Collaborator

We faced the receiver issue again on the the aggregator side and it crashes during startup sync with:

MAJOR ISSUE: Failure!.
The error supplied was: send failed because receiver is full
As a precaution we are crashing the system.

When the node starts and replays historical EVM events the snapshot buffer pipeline overwhelms downstream actor mailboxes, since they all use try_send() wrapped in trap, when any mailbox fills up, the try_send returns SendError::Full, trap catches it and panics, crashing the entire process.

The trap + PanicDispatcher does not work well here since this is not a real error condition, it's transient backpressure during burst load. The data is valid, the actors are healthy, they just can't drain fast enough during sync replay and mailbox limit.

Summary by CodeRabbit

  • New Features

    • Added asynchronous event dispatch with error propagation.
  • Improvements

    • Streamlined event batching and routing for improved reliability.
    • Simplified event sequencer message handling with configurable mailbox capacity.
    • Enhanced historical event publishing with async dispatch semantics.

@vercel

vercel Bot commented Mar 16, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
crisp Ready Ready Preview, Comment Mar 17, 2026 1:43am
enclave-docs Ready Ready Preview, Comment Mar 17, 2026 1:43am

Request Review

@hmzakhalid hmzakhalid requested a review from ryardley March 16, 2026 17:32
@coderabbitai

coderabbitai Bot commented Mar 16, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR introduces async event dispatch capabilities and refactors event handling to remove trap-based error patterns. Key changes include adding naked_dispatch_async to BusHandle for asynchronous event forwarding, configuring mailbox capacity in the Sequencer, and replacing error-handling wrappers with direct do_send calls throughout snapshot buffer components. The sync module is updated to use awaited async dispatch for historical event publishing.

Changes

Cohort / File(s) Summary
Event Bus & Dispatch
crates/events/src/bus_handle.rs
Added public async method naked_dispatch_async that forwards events to sequencer via send and awaits the result with error propagation, complementing the fire-and-forget naked_dispatch.
Sequencer Actor Setup
crates/events/src/sequencer.rs
Added started hook that initializes mailbox capacity to MAILBOX_LIMIT and imported the constant from e3_utils.
Snapshot Buffer Refactoring
crates/events/src/snapshot_buffer/batch.rs, crates/events/src/snapshot_buffer/batch_router.rs, crates/events/src/snapshot_buffer/snapshot_buffer.rs, crates/events/src/snapshot_buffer/timelock_queue.rs
Replaced trap-based error handling and try_send patterns with unconditional do_send calls. Removed PanicDispatcher and EType imports. Simplified control flow in Insert/EnclaveEvent/FlushSeq handlers; BatchRouter now handles missing prev_agg with error logging rather than panic traps; added trace logging for Insert forwarding.
Sync Module
crates/sync/src/sync.rs
Changed historical event publishing from synchronous naked_dispatch (fire-and-forget) to asynchronous naked_dispatch_async with awaited results and error propagation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • ryardley
  • ctrlc03

Poem

🐰 Async hops across the event bus so fast,
Traps removed, do_sends unsurpassed!
Sequencers spin with mailbox grace,
History publishes at measured pace—
Awaiting results in the async embrace! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: replace panic crash in snapshot buffer with do_send' directly and clearly summarizes the main change: replacing panic-based error handling with do_send calls throughout the snapshot buffer pipeline to avoid crashes during burst load.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/batch-rtr-backpressure
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can enable review details to help with troubleshooting, context usage and more.

Enable the reviews.review_details setting to include review details such as the model used, the time taken for each step and more in the review comments.

@hmzakhalid hmzakhalid changed the title fix: replace panic crash in snapshot buffer with do_send fix: replace panic crash in snapshot buffer with do_send Mar 16, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/events/src/snapshot_buffer/timelock_queue.rs (1)

161-167: Verify: Popped timelocks cannot be retried if do_send silently drops.

The timelock is popped from the heap before do_send. If the BatchRouter mailbox were full, do_send would silently drop the FlushSeq and the batch would never flush.

However, this is an acceptable trade-off because:

  1. BatchRouter has MAILBOX_LIMIT (256) capacity
  2. The previous try_send + trap approach crashed the entire system
  3. During normal operation, having 256+ pending flush messages is unlikely

The current approach prevents crashes at the cost of a theoretical (but highly improbable) silent drop scenario.

💡 Optional: Add trace logging for observability
         while self.timelocks.len() > 0 && self.next_timelock_lt(now_time) {
             if let Some(tl) = self.timelocks.pop() {
                 let seq = tl.0.seq;
                 debug!("Flushing seq {}", seq);
                 self.batch_router.do_send(FlushSeq(seq));
             }
         }

Consider adding a trace log after do_send to aid debugging if flush issues arise in production.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/events/src/snapshot_buffer/timelock_queue.rs` around lines 161 - 167,
The loop currently pops timelocks and calls
self.batch_router.do_send(FlushSeq(seq)) which can silently drop messages; keep
the current pop-and-fire behavior but add an observability trace: after calling
do_send in the while loop inside timelock processing (references: timelocks,
next_timelock_lt, batch_router, do_send, FlushSeq, MAILBOX_LIMIT) emit a
trace/debug log that records the seq being flushed and current timelocks length
(and optionally note MAILBOX_LIMIT) so dropped flushes are discoverable in logs
without changing the send semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/events/src/snapshot_buffer/timelock_queue.rs`:
- Around line 161-167: The loop currently pops timelocks and calls
self.batch_router.do_send(FlushSeq(seq)) which can silently drop messages; keep
the current pop-and-fire behavior but add an observability trace: after calling
do_send in the while loop inside timelock processing (references: timelocks,
next_timelock_lt, batch_router, do_send, FlushSeq, MAILBOX_LIMIT) emit a
trace/debug log that records the seq being flushed and current timelocks length
(and optionally note MAILBOX_LIMIT) so dropped flushes are discoverable in logs
without changing the send semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 383c1baf-0d78-4869-b9f1-7d239e995f0f

📥 Commits

Reviewing files that changed from the base of the PR and between e06bcbc and e48db65.

📒 Files selected for processing (7)
  • crates/events/src/bus_handle.rs
  • crates/events/src/sequencer.rs
  • crates/events/src/snapshot_buffer/batch.rs
  • crates/events/src/snapshot_buffer/batch_router.rs
  • crates/events/src/snapshot_buffer/snapshot_buffer.rs
  • crates/events/src/snapshot_buffer/timelock_queue.rs
  • crates/sync/src/sync.rs

@ryardley ryardley enabled auto-merge (squash) March 17, 2026 02:06
@ryardley ryardley merged commit b6d3cd0 into main Mar 17, 2026
27 checks passed
@github-actions github-actions Bot deleted the fix/batch-rtr-backpressure branch March 24, 2026 03:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants