feat: add node data validation and upgrade hardening [skip-line-limit]#1577
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR implements deterministic aggregator failover infrastructure with reorg-safe EVM ingestion, offline node validation, process-level instance fencing, crash-recovery for keyshare work, and deprecated-schema cleanup. It spans ~2200 lines adding 10+ new public APIs and restructuring configuration flow through confirmation depths. ChangesEnclave infrastructure changes
Sequence Diagram(s)sequenceDiagram
participant ChainConfig
participant EvmSystemBuilder
participant EvmEventConfigChain
participant Filters
participant EvmReadInterface
participant LogFetcher
participant confirmed_head_fn
ChainConfig->>EvmSystemBuilder: reorg_confirmations
EvmSystemBuilder->>EvmEventConfigChain: with_confirmations(value)
EvmEventConfigChain->>Filters: with_confirmations(value)
Filters->>EvmReadInterface: confirmations field set
EvmReadInterface->>confirmed_head_fn: confirmed_head(latest, confirmations)
confirmed_head_fn->>EvmReadInterface: safe_head (clamped)
EvmReadInterface->>LogFetcher: fetch from 0 to safe_head
LogFetcher->>confirmed_head_fn: confirmed_head(chain_head, confirmations)
confirmed_head_fn->>LogFetcher: safe_backfill_head
LogFetcher->>LogFetcher: ingest logs up to safe_backfill_head
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/evm/src/actors/evm_read_interface.rs (1)
300-355:⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy liftLive stream path bypasses confirmation gating and can commit orphanable state.
Line [301] and Line [354] clamp historical/backfill ingestion, but the live loop still processes incoming logs immediately at Line [412] and advances
last_blockat Line [409] regardless offilters.confirmations(). That allows unconfirmed logs into the append-only event stream and can also movelast_blockpast the safe head, so later backfill may skip replaying the block once it becomes confirmed.Please gate live ingestion by confirmation depth and only advance
last_blockwhen the block is confirmed (typically by buffering by block height and draining up toconfirmed_head).Also applies to: 408-414
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/evm/src/actors/evm_read_interface.rs` around lines 300 - 355, The live event loop is processing and advancing state on unconfirmed blocks (variables last_block and handling around lines using backfill_to_head and the live stream consumer), allowing orphanable logs into the append-only stream; fix by gating live ingestion with the same confirmation depth used for historical sync: compute confirmed_head = crate::domain::reorg::confirmed_head(current_head, filters.confirmations()) before consuming live logs, buffer incoming events keyed by block number, only emit/drain buffered events up to confirmed_head, and only advance last_block to confirmed_head (not to the raw latest block); apply this logic in the live consume path that follows backfill_to_head and where last_block is updated so live processing respects filters.confirmations().
🧹 Nitpick comments (1)
crates/sortition/src/domain/node_registry.rs (1)
254-266: ⚡ Quick winReturn
open_committeesin deterministic order for stable audits.Iteration over
HashMapis order-randomized, so validation output can vary run-to-run. Sort by(chain_id, committee_key)before returning.♻️ Suggested diff
pub fn open_committees(store: &HashMap<u64, NodeStateStore>) -> Vec<OpenCommittee> { let mut out = Vec::new(); for (chain_id, chain_state) in store { for (key, members) in &chain_state.e3_committees { out.push(OpenCommittee { chain_id: *chain_id, committee_key: key.clone(), members: members.clone(), }); } } + out.sort_by(|a, b| { + a.chain_id + .cmp(&b.chain_id) + .then_with(|| a.committee_key.cmp(&b.committee_key)) + }); out }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sortition/src/domain/node_registry.rs` around lines 254 - 266, open_committees currently collects OpenCommittee entries by iterating HashMap which yields nondeterministic order; modify the function to sort the resulting Vec<OpenCommittee> by (chain_id, committee_key) before returning to ensure deterministic output (e.g. after building out, call sort_by_key(|c| (c.chain_id, c.committee_key.clone())) or sort_by with chain_id then committee_key comparison); update the function open_committees to perform this sort and then return the sorted vector.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/cli/src/node.rs`:
- Around line 27-31: The Validate branch (NodeCommands::Validate) opens
persisted state without acquiring the ProcessFence, so validation can race a
running node; fix by acquiring the same ProcessFence used by the start path
before calling validate_node, perform validate_node(config) while holding the
fence (ensuring the fence is released afterwards), and log/return the result as
before—use the existing ProcessFence API (acquire/guard) to mirror how start is
fenced and ensure offline-only validation.
In `@crates/cli/src/start.rs`:
- Around line 29-35: Move the cross-host fence acquisition to before starting
the daemon socket: call
e3_entrypoint::fence::ProcessFence::acquire(&config.db_file(), &config.name())
and hold the returned _fence before invoking
launch_socket_server(config.ctrl_port()). This ensures the exclusive lock is
obtained prior to binding the control port or spawning background work; update
the order so the ProcessFence::acquire call happens immediately after owo() and
before launch_socket_server.
In `@crates/entrypoint/src/validate.rs`:
- Around line 209-214: The check_sequence_integrity function currently only
looks for internal gaps (via detect_sequence_gaps) and thus treats sequences
like [5,6,7] as healthy; change it to reject any aggregate whose first sequence
!= 0: inside check_sequence_integrity (using AggregateId, seqs, CheckResult) add
a guard after the empty check that if seqs[0] != 0 returns
CheckResult::fail(name, format!("aggregate {}: first sequence {} != 0",
agg.to_usize(), seqs[0])); leave the subsequent detect_sequence_gaps logic
intact to still catch internal gaps.
- Around line 446-450: The validator currently removes duplicate sequence
numbers with all.dedup_by_key(|e| e.seq()), which hides the corruption the
integrity check should detect; remove that dedup_by_key call so the stream
remains with duplicates after all.retain(|e| e.aggregate_id() == aggregate) and
all.sort_by_key(|e| e.seq()), allowing the integrity check logic (the code
operating on the sorted all vector) to detect repeated seq() values like
0,1,1,2.
In `@crates/keyshare/src/actors/threshold_keyshare.rs`:
- Around line 1618-1627: The resume logic currently treats
KeyshareState::ReadyForDecryption as enough to re-publish via
publish_keyshare_created, but ReadyForDecryption can be set before C4/honest-set
verification; add a persisted "publish_authorized" marker (e.g., a bool on the
persisted state or a new enum variant like
ReadyForDecryptionAuthorized/ReadyToPublish) and only call
publish_keyshare_created when that marker/variant is set. Set that marker from
the canonical C4-complete paths (the code paths that currently authorize
publishing) and from the sole-honest fast path, and remove automatic publish on
plain ReadyForDecryption; update handle_calculate_decryption_key_response and
the C4-completion/fast-path handlers to flip the persisted flag/transition to
the authorized state so resume logic can safely decide to re-publish.
In `@crates/sortition/src/actors/ciphernode_selector.rs`:
- Around line 47-51: The struct field unresponsive is never written, so
effective_aggregator(...) never sees promoted standbys; add a mutation in the
failover decision path to populate self.unresponsive (use the same E3id ->
Vec<u64> shape) when you mark a party as presumed-down, then call
update_aggregator_status(...) immediately after that mutation so is_aggregator
recomputes; also ensure the existing cleanup code that removes entries still
works (i.e., remove keys from self.unresponsive when liveness returns) and
mirror this logic wherever failover decisions are made (the function/block that
computes/promotes standbys and the code paths referenced near the existing calls
to effective_aggregator and cleanup).
In `@crates/sync/src/actors/sync.rs`:
- Around line 91-101: The comment in sync.rs mistakenly implies actor-level
restart work is entirely not auto re-driven; update the wording to scope the
claim to “not re-driven by replaying original request events in sync” and note
that some actors (e.g. the keyshare actor resumes persisted in-flight work on
EffectsEnabled in crates/keyshare/src/actors/threshold_keyshare.rs) still resume
on restart; keep the rest of the rationale about not minting fresh EventId and
reference enclave node validate (crates/entrypoint/src/validate.rs) for offline
detection of orphaned tickets.
---
Outside diff comments:
In `@crates/evm/src/actors/evm_read_interface.rs`:
- Around line 300-355: The live event loop is processing and advancing state on
unconfirmed blocks (variables last_block and handling around lines using
backfill_to_head and the live stream consumer), allowing orphanable logs into
the append-only stream; fix by gating live ingestion with the same confirmation
depth used for historical sync: compute confirmed_head =
crate::domain::reorg::confirmed_head(current_head, filters.confirmations())
before consuming live logs, buffer incoming events keyed by block number, only
emit/drain buffered events up to confirmed_head, and only advance last_block to
confirmed_head (not to the raw latest block); apply this logic in the live
consume path that follows backfill_to_head and where last_block is updated so
live processing respects filters.confirmations().
---
Nitpick comments:
In `@crates/sortition/src/domain/node_registry.rs`:
- Around line 254-266: open_committees currently collects OpenCommittee entries
by iterating HashMap which yields nondeterministic order; modify the function to
sort the resulting Vec<OpenCommittee> by (chain_id, committee_key) before
returning to ensure deterministic output (e.g. after building out, call
sort_by_key(|c| (c.chain_id, c.committee_key.clone())) or sort_by with chain_id
then committee_key comparison); update the function open_committees to perform
this sort and then return the sorted vector.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 04499484-4920-4fff-9823-8c7c50455c0c
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (37)
agent/flow-trace/05_FAILURE_REFUND_SLASHING.mdcrates/aggregator/src/domain/failover.rscrates/aggregator/src/domain/mod.rscrates/ciphernode-builder/src/evm_system.rscrates/cli/src/cli.rscrates/cli/src/config_setup.rscrates/cli/src/main.rscrates/cli/src/node.rscrates/cli/src/start.rscrates/config/src/chain_config.rscrates/data/src/in_mem_kv_store.rscrates/entrypoint/Cargo.tomlcrates/entrypoint/src/fence.rscrates/entrypoint/src/lib.rscrates/entrypoint/src/validate.rscrates/events/src/committee.rscrates/events/src/enclave_event/accusation_quorum_reached.rscrates/events/src/store_keys.rscrates/events/src/sync.rscrates/evm/src/actors/evm_read_interface.rscrates/evm/src/actors/log_fetcher.rscrates/evm/src/domain/mod.rscrates/evm/src/domain/reorg.rscrates/keyshare/src/actors/threshold_keyshare.rscrates/request/src/domain/routing.rscrates/sortition/src/actors/ciphernode_selector.rscrates/sortition/src/domain/node_registry.rscrates/sync/src/actors/sync.rscrates/sync/src/domain/mod.rscrates/sync/src/domain/schema_version.rscrates/sync/src/repo.rscrates/tests/tests/integration.rscrates/utils/Cargo.tomlcrates/utils/src/committee_hash.rscrates/utils/src/lib.rscrates/zk-helpers/src/circuits/commitments.rscrates/zk-helpers/src/circuits/threshold/pk_generation/codegen.rs
💤 Files with no reviewable changes (11)
- crates/sync/src/repo.rs
- crates/events/src/enclave_event/accusation_quorum_reached.rs
- crates/sync/src/domain/mod.rs
- crates/zk-helpers/src/circuits/threshold/pk_generation/codegen.rs
- crates/events/src/store_keys.rs
- crates/sync/src/domain/schema_version.rs
- crates/utils/src/lib.rs
- crates/utils/src/committee_hash.rs
- crates/utils/Cargo.toml
- crates/zk-helpers/src/circuits/commitments.rs
- agent/flow-trace/05_FAILURE_REFUND_SLASHING.md
Add node validate, reorg rollback, in-flight re-drive, aggregator failover, and upgrade hardening
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Chores