Skip to content

feat: add node data validation and upgrade hardening [skip-line-limit]#1577

Merged
hmzakhalid merged 4 commits into
mainfrom
feat/validate-node-upgrade
Jun 4, 2026
Merged

feat: add node data validation and upgrade hardening [skip-line-limit]#1577
hmzakhalid merged 4 commits into
mainfrom
feat/validate-node-upgrade

Conversation

@hmzakhalid

@hmzakhalid hmzakhalid commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Add node validate, reorg rollback, in-flight re-drive, aggregator failover, and upgrade hardening

Summary by CodeRabbit

  • New Features

    • Added a new CLI command for node validation and maintenance operations.
    • Introduced configurable blockchain confirmation depth for safer event ingestion under network reorganizations.
    • Implemented deterministic aggregator failover policy to improve system reliability.
  • Documentation

    • Updated slashing quorum decision logic documentation to reflect revised outcome handling.
  • Bug Fixes

    • Added process-level instance locking to prevent concurrent node execution.
  • Chores

    • Removed deprecated backward-compatibility module exports.

@vercel

vercel Bot commented Jun 4, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
crisp Ready Ready Preview, Comment Jun 4, 2026 4:02pm
enclave-docs Ready Ready Preview, Comment Jun 4, 2026 4:02pm
enclave-enclave-dashboard Ready Ready Preview, Comment Jun 4, 2026 4:02pm

Request Review

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f94d2006-a153-492f-a709-5696d3f4cf3d

📥 Commits

Reviewing files that changed from the base of the PR and between bbb602c and 155d792.

📒 Files selected for processing (2)
  • crates/evm/src/domain/reorg.rs
  • crates/sync/src/actors/sync.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/sync/src/actors/sync.rs

📝 Walkthrough

Walkthrough

This PR implements deterministic aggregator failover infrastructure with reorg-safe EVM ingestion, offline node validation, process-level instance fencing, crash-recovery for keyshare work, and deprecated-schema cleanup. It spans ~2200 lines adding 10+ new public APIs and restructuring configuration flow through confirmation depths.

Changes

Enclave infrastructure changes

Layer / File(s) Summary
Reorg-safety primitives and configuration
crates/evm/src/domain/reorg.rs, crates/events/src/sync.rs, crates/config/src/chain_config.rs
Introduces confirmed_head(chain_head, confirmations) pure function; adds confirmations: u64 field and builder methods to EvmEventConfigChain; extends ChainConfig with optional reorg_confirmations field that threads through TryFrom conversion.
EVM stream reorg-safe integration
crates/evm/src/actors/evm_read_interface.rs, crates/evm/src/actors/log_fetcher.rs, crates/ciphernode-builder/src/evm_system.rs
Wires confirmations field through Filters configuration; clamps historical sync head and live-loop backfill using confirmed_head() to prevent reorg-unsafe ingestion; updates parameter flow from chain config extraction through filter construction.
Aggregator failover policy and decision logic
crates/aggregator/src/domain/failover.rs, crates/aggregator/src/domain/mod.rs
Introduces AggregatorPhase (pending states), FailoverPolicy (timeout holder), FailoverDecision enum (Hold/Promote/Exhausted), and clock-free decide_failover() state machine with unit tests covering settled, within-budget, past-budget, promotion-skipping, and exhaustion scenarios.
Committee failover and selection APIs
crates/events/src/committee.rs
Adds active_aggregator_party_id() to select first non-skipped party, effective_aggregator() to merge expelled+unresponsive skip lists, and aggregator_standbys() to produce ordered standby candidate lists with filtering and limiting.
Ciphernode selector failover integration
crates/sortition/src/actors/ciphernode_selector.rs
Extends CiphernodeSelectorState with unresponsive map for locally-detected parties; updates aggregator status computation via effective_aggregator() using both on-chain expelled and local unresponsive signals; cleanups now remove unresponsive entries on request completion.
Node state inspection APIs
crates/sortition/src/domain/node_registry.rs
Adds open_committees() enumeration to list unreleased committees and new OpenCommittee struct capturing chain id, committee key, and members for validation/audit workflows.
Process-level fencing mechanism
crates/entrypoint/src/fence.rs, crates/entrypoint/src/lib.rs, crates/entrypoint/Cargo.toml
Implements ProcessFence guard using OS advisory locks on enclave.lock file to prevent concurrent active instances; acquires exclusive lock with parent-dir creation, records diagnostic metadata (node name, PID), and releases on drop; comprehensive tests validate exclusivity, isolation, and diagnostics.
CLI node subcommand infrastructure
crates/cli/src/cli.rs, crates/cli/src/main.rs, crates/cli/src/node.rs, crates/cli/src/start.rs
Adds new enclave node subcommand with Validate variant; wires dispatch through main Commands enum and execute() routing; cli/start.rs acquires process fence before control port binding; node module implements execute handler that runs validation and reports results.
Offline node validation system
crates/entrypoint/src/validate.rs, crates/entrypoint/src/lib.rs
Implements validate_node() async function with three validation passes: (1) per-aggregate event-store sequence integrity (contiguous, first-at-seq-1), (2) snapshot cursor consistency (cursor ≤ max event seq), (3) open-loop audit (flags orphaned committees). Returns ValidationReport with per-check CheckResult (name/severity/detail) and human-readable verdict; includes helpers for gap detection, orphan identification, terminal-key collection, and paginated event reading with comprehensive unit tests.
Keyshare crash-restart recovery
crates/keyshare/src/domain/keyshare_state.rs, crates/keyshare/src/actors/threshold_keyshare.rs, crates/request/src/domain/routing.rs
Adds keyshare_published persisted flag to gate re-emission of KeyshareCreated on resume; implements resume_in_flight_work() to conditionally re-publish keyshare and re-issue decryption-share requests after restart based on persisted state; wires EffectsEnabled boot event to trigger recovery with error logging; updates request router to broadcast both Shutdown and EffectsEnabled lifecycle events.
Deprecation removals and schema cleanup
crates/events/src/enclave_event/accusation_quorum_reached.rs, crates/sync/src/actors/sync.rs
Removes deprecated AccuserLied variant from AccusationOutcome enum and Display impl; deletes on-disk schema compatibility verification and related decision/constant imports from sync startup path; replaces removed TODO with detailed comment on crash-recovery work reconciliation via EVM re-fetch and event replay.
Configuration wiring, dependencies, and test updates
crates/utils/Cargo.toml, crates/utils/src/lib.rs, crates/utils/src/committee_hash.rs, crates/cli/src/config_setup.rs, crates/tests/tests/integration.rs, crates/data/src/in_mem_kv_store.rs, crates/zk-helpers/src/circuits/...
Removes unused e3-committee-hash workspace dependency and re-export; updates CLI config test to pass new Address::ZERO argument; adds reorg_confirmations: None to benchmark chain config; renaming/rewording doc comments for KV-store dump format; removes unused imports in test modules and adjusts codegen test imports.

Sequence Diagram(s)

sequenceDiagram
  participant ChainConfig
  participant EvmSystemBuilder
  participant EvmEventConfigChain
  participant Filters
  participant EvmReadInterface
  participant LogFetcher
  participant confirmed_head_fn
  ChainConfig->>EvmSystemBuilder: reorg_confirmations
  EvmSystemBuilder->>EvmEventConfigChain: with_confirmations(value)
  EvmEventConfigChain->>Filters: with_confirmations(value)
  Filters->>EvmReadInterface: confirmations field set
  EvmReadInterface->>confirmed_head_fn: confirmed_head(latest, confirmations)
  confirmed_head_fn->>EvmReadInterface: safe_head (clamped)
  EvmReadInterface->>LogFetcher: fetch from 0 to safe_head
  LogFetcher->>confirmed_head_fn: confirmed_head(chain_head, confirmations)
  confirmed_head_fn->>LogFetcher: safe_backfill_head
  LogFetcher->>LogFetcher: ingest logs up to safe_backfill_head
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • gnosisguild/enclave#1569: CLI config_setup test updated to match new address parameter signature introduced in referenced PR's config changes.
  • gnosisguild/enclave#1415: Updates to agent/flow-trace/05_FAILURE_REFUND_SLASHING.md slashing decision documentation directly reference flow-trace file introduced in that PR.
  • gnosisguild/enclave#1355: Removal of AccuserLied variant and changes to slashing logic directly relate to that PR's introduction of the AccusationManager and AccusationOutcome enum.

Suggested labels

ciphernode

Suggested reviewers

  • ctrlc03

Poem

🐰 Hops through failover fields with fences held tight,
Recovers lost keyshares in the reorg-safe light,
Validates the node path, confirms blocks ahead,
No AccuserLied tales in these schemas, instead!
The aggregator promotion hops onward in stride.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding node data validation and upgrade hardening, which aligns with the PR's core objectives of adding node validation, in-flight recovery, and failover logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/validate-node-upgrade

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/evm/src/actors/evm_read_interface.rs (1)

300-355: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Live stream path bypasses confirmation gating and can commit orphanable state.

Line [301] and Line [354] clamp historical/backfill ingestion, but the live loop still processes incoming logs immediately at Line [412] and advances last_block at Line [409] regardless of filters.confirmations(). That allows unconfirmed logs into the append-only event stream and can also move last_block past the safe head, so later backfill may skip replaying the block once it becomes confirmed.

Please gate live ingestion by confirmation depth and only advance last_block when the block is confirmed (typically by buffering by block height and draining up to confirmed_head).

Also applies to: 408-414

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/evm/src/actors/evm_read_interface.rs` around lines 300 - 355, The live
event loop is processing and advancing state on unconfirmed blocks (variables
last_block and handling around lines using backfill_to_head and the live stream
consumer), allowing orphanable logs into the append-only stream; fix by gating
live ingestion with the same confirmation depth used for historical sync:
compute confirmed_head = crate::domain::reorg::confirmed_head(current_head,
filters.confirmations()) before consuming live logs, buffer incoming events
keyed by block number, only emit/drain buffered events up to confirmed_head, and
only advance last_block to confirmed_head (not to the raw latest block); apply
this logic in the live consume path that follows backfill_to_head and where
last_block is updated so live processing respects filters.confirmations().
🧹 Nitpick comments (1)
crates/sortition/src/domain/node_registry.rs (1)

254-266: ⚡ Quick win

Return open_committees in deterministic order for stable audits.

Iteration over HashMap is order-randomized, so validation output can vary run-to-run. Sort by (chain_id, committee_key) before returning.

♻️ Suggested diff
 pub fn open_committees(store: &HashMap<u64, NodeStateStore>) -> Vec<OpenCommittee> {
     let mut out = Vec::new();
     for (chain_id, chain_state) in store {
         for (key, members) in &chain_state.e3_committees {
             out.push(OpenCommittee {
                 chain_id: *chain_id,
                 committee_key: key.clone(),
                 members: members.clone(),
             });
         }
     }
+    out.sort_by(|a, b| {
+        a.chain_id
+            .cmp(&b.chain_id)
+            .then_with(|| a.committee_key.cmp(&b.committee_key))
+    });
     out
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/sortition/src/domain/node_registry.rs` around lines 254 - 266,
open_committees currently collects OpenCommittee entries by iterating HashMap
which yields nondeterministic order; modify the function to sort the resulting
Vec<OpenCommittee> by (chain_id, committee_key) before returning to ensure
deterministic output (e.g. after building out, call sort_by_key(|c| (c.chain_id,
c.committee_key.clone())) or sort_by with chain_id then committee_key
comparison); update the function open_committees to perform this sort and then
return the sorted vector.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/cli/src/node.rs`:
- Around line 27-31: The Validate branch (NodeCommands::Validate) opens
persisted state without acquiring the ProcessFence, so validation can race a
running node; fix by acquiring the same ProcessFence used by the start path
before calling validate_node, perform validate_node(config) while holding the
fence (ensuring the fence is released afterwards), and log/return the result as
before—use the existing ProcessFence API (acquire/guard) to mirror how start is
fenced and ensure offline-only validation.

In `@crates/cli/src/start.rs`:
- Around line 29-35: Move the cross-host fence acquisition to before starting
the daemon socket: call
e3_entrypoint::fence::ProcessFence::acquire(&config.db_file(), &config.name())
and hold the returned _fence before invoking
launch_socket_server(config.ctrl_port()). This ensures the exclusive lock is
obtained prior to binding the control port or spawning background work; update
the order so the ProcessFence::acquire call happens immediately after owo() and
before launch_socket_server.

In `@crates/entrypoint/src/validate.rs`:
- Around line 209-214: The check_sequence_integrity function currently only
looks for internal gaps (via detect_sequence_gaps) and thus treats sequences
like [5,6,7] as healthy; change it to reject any aggregate whose first sequence
!= 0: inside check_sequence_integrity (using AggregateId, seqs, CheckResult) add
a guard after the empty check that if seqs[0] != 0 returns
CheckResult::fail(name, format!("aggregate {}: first sequence {} != 0",
agg.to_usize(), seqs[0])); leave the subsequent detect_sequence_gaps logic
intact to still catch internal gaps.
- Around line 446-450: The validator currently removes duplicate sequence
numbers with all.dedup_by_key(|e| e.seq()), which hides the corruption the
integrity check should detect; remove that dedup_by_key call so the stream
remains with duplicates after all.retain(|e| e.aggregate_id() == aggregate) and
all.sort_by_key(|e| e.seq()), allowing the integrity check logic (the code
operating on the sorted all vector) to detect repeated seq() values like
0,1,1,2.

In `@crates/keyshare/src/actors/threshold_keyshare.rs`:
- Around line 1618-1627: The resume logic currently treats
KeyshareState::ReadyForDecryption as enough to re-publish via
publish_keyshare_created, but ReadyForDecryption can be set before C4/honest-set
verification; add a persisted "publish_authorized" marker (e.g., a bool on the
persisted state or a new enum variant like
ReadyForDecryptionAuthorized/ReadyToPublish) and only call
publish_keyshare_created when that marker/variant is set. Set that marker from
the canonical C4-complete paths (the code paths that currently authorize
publishing) and from the sole-honest fast path, and remove automatic publish on
plain ReadyForDecryption; update handle_calculate_decryption_key_response and
the C4-completion/fast-path handlers to flip the persisted flag/transition to
the authorized state so resume logic can safely decide to re-publish.

In `@crates/sortition/src/actors/ciphernode_selector.rs`:
- Around line 47-51: The struct field unresponsive is never written, so
effective_aggregator(...) never sees promoted standbys; add a mutation in the
failover decision path to populate self.unresponsive (use the same E3id ->
Vec<u64> shape) when you mark a party as presumed-down, then call
update_aggregator_status(...) immediately after that mutation so is_aggregator
recomputes; also ensure the existing cleanup code that removes entries still
works (i.e., remove keys from self.unresponsive when liveness returns) and
mirror this logic wherever failover decisions are made (the function/block that
computes/promotes standbys and the code paths referenced near the existing calls
to effective_aggregator and cleanup).

In `@crates/sync/src/actors/sync.rs`:
- Around line 91-101: The comment in sync.rs mistakenly implies actor-level
restart work is entirely not auto re-driven; update the wording to scope the
claim to “not re-driven by replaying original request events in sync” and note
that some actors (e.g. the keyshare actor resumes persisted in-flight work on
EffectsEnabled in crates/keyshare/src/actors/threshold_keyshare.rs) still resume
on restart; keep the rest of the rationale about not minting fresh EventId and
reference enclave node validate (crates/entrypoint/src/validate.rs) for offline
detection of orphaned tickets.

---

Outside diff comments:
In `@crates/evm/src/actors/evm_read_interface.rs`:
- Around line 300-355: The live event loop is processing and advancing state on
unconfirmed blocks (variables last_block and handling around lines using
backfill_to_head and the live stream consumer), allowing orphanable logs into
the append-only stream; fix by gating live ingestion with the same confirmation
depth used for historical sync: compute confirmed_head =
crate::domain::reorg::confirmed_head(current_head, filters.confirmations())
before consuming live logs, buffer incoming events keyed by block number, only
emit/drain buffered events up to confirmed_head, and only advance last_block to
confirmed_head (not to the raw latest block); apply this logic in the live
consume path that follows backfill_to_head and where last_block is updated so
live processing respects filters.confirmations().

---

Nitpick comments:
In `@crates/sortition/src/domain/node_registry.rs`:
- Around line 254-266: open_committees currently collects OpenCommittee entries
by iterating HashMap which yields nondeterministic order; modify the function to
sort the resulting Vec<OpenCommittee> by (chain_id, committee_key) before
returning to ensure deterministic output (e.g. after building out, call
sort_by_key(|c| (c.chain_id, c.committee_key.clone())) or sort_by with chain_id
then committee_key comparison); update the function open_committees to perform
this sort and then return the sorted vector.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 04499484-4920-4fff-9823-8c7c50455c0c

📥 Commits

Reviewing files that changed from the base of the PR and between 85f86a1 and 0031c5e.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (37)
  • agent/flow-trace/05_FAILURE_REFUND_SLASHING.md
  • crates/aggregator/src/domain/failover.rs
  • crates/aggregator/src/domain/mod.rs
  • crates/ciphernode-builder/src/evm_system.rs
  • crates/cli/src/cli.rs
  • crates/cli/src/config_setup.rs
  • crates/cli/src/main.rs
  • crates/cli/src/node.rs
  • crates/cli/src/start.rs
  • crates/config/src/chain_config.rs
  • crates/data/src/in_mem_kv_store.rs
  • crates/entrypoint/Cargo.toml
  • crates/entrypoint/src/fence.rs
  • crates/entrypoint/src/lib.rs
  • crates/entrypoint/src/validate.rs
  • crates/events/src/committee.rs
  • crates/events/src/enclave_event/accusation_quorum_reached.rs
  • crates/events/src/store_keys.rs
  • crates/events/src/sync.rs
  • crates/evm/src/actors/evm_read_interface.rs
  • crates/evm/src/actors/log_fetcher.rs
  • crates/evm/src/domain/mod.rs
  • crates/evm/src/domain/reorg.rs
  • crates/keyshare/src/actors/threshold_keyshare.rs
  • crates/request/src/domain/routing.rs
  • crates/sortition/src/actors/ciphernode_selector.rs
  • crates/sortition/src/domain/node_registry.rs
  • crates/sync/src/actors/sync.rs
  • crates/sync/src/domain/mod.rs
  • crates/sync/src/domain/schema_version.rs
  • crates/sync/src/repo.rs
  • crates/tests/tests/integration.rs
  • crates/utils/Cargo.toml
  • crates/utils/src/committee_hash.rs
  • crates/utils/src/lib.rs
  • crates/zk-helpers/src/circuits/commitments.rs
  • crates/zk-helpers/src/circuits/threshold/pk_generation/codegen.rs
💤 Files with no reviewable changes (11)
  • crates/sync/src/repo.rs
  • crates/events/src/enclave_event/accusation_quorum_reached.rs
  • crates/sync/src/domain/mod.rs
  • crates/zk-helpers/src/circuits/threshold/pk_generation/codegen.rs
  • crates/events/src/store_keys.rs
  • crates/sync/src/domain/schema_version.rs
  • crates/utils/src/lib.rs
  • crates/utils/src/committee_hash.rs
  • crates/utils/Cargo.toml
  • crates/zk-helpers/src/circuits/commitments.rs
  • agent/flow-trace/05_FAILURE_REFUND_SLASHING.md

Comment thread crates/cli/src/node.rs
Comment thread crates/cli/src/start.rs
Comment thread crates/entrypoint/src/validate.rs
Comment thread crates/entrypoint/src/validate.rs Outdated
Comment thread crates/keyshare/src/actors/threshold_keyshare.rs
Comment thread crates/sortition/src/actors/ciphernode_selector.rs
Comment thread crates/sync/src/actors/sync.rs Outdated
Comment thread crates/evm/src/domain/reorg.rs Outdated
Comment thread crates/sync/src/actors/sync.rs

@ctrlc03 ctrlc03 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@hmzakhalid hmzakhalid merged commit e680c67 into main Jun 4, 2026
34 checks passed
@github-actions github-actions Bot deleted the feat/validate-node-upgrade branch June 12, 2026 03:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants