From b15fa560d5837c225a0fbccca5d42aca5aa79c75 Mon Sep 17 00:00:00 2001 From: 0xjei Date: Tue, 9 Jun 2026 10:44:02 +0200 Subject: [PATCH 1/2] handle E3RequestComplete for timeout failures --- agent/flow-trace/00_INDEX.md | 19 +-- .../flow-trace/05_FAILURE_REFUND_SLASHING.md | 12 +- crates/events/src/enclave_event/e3_failed.rs | 15 ++ .../keyshare/src/actors/threshold_keyshare.rs | 6 +- crates/request/src/domain/routing.rs | 146 ++++++++++++++++-- 5 files changed, 168 insertions(+), 30 deletions(-) diff --git a/agent/flow-trace/00_INDEX.md b/agent/flow-trace/00_INDEX.md index a6d5e3b4cf..9ca9edc8df 100644 --- a/agent/flow-trace/00_INDEX.md +++ b/agent/flow-trace/00_INDEX.md @@ -183,12 +183,13 @@ _Found during source-code cross-referencing of these trace documents._ ### Protocol Design Concerns -| # | Concern | Severity | Detail | -| --- | ------------------------------------------ | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 1 | **Deregister-before-slash race** | Accepted | SlashingManager Lane B (evidence+appeal) has a window during which the operator can deregister and claim their exit. If they do, the slash executes against 0 funds. The contract comments acknowledge this as an accepted tradeoff for the appeal window design. | -| 2 | **Committee publication decentralized** | Resolved | `publishCommittee()` is permissionless. Off-chain role selection chooses the active aggregator, while on-chain C5 proof verification and the single-publish guard prevent invalid or duplicate committee publication. | -| 3 | **`gracePeriod` is dead code** | Medium | `gracePeriod` is stored and validated during config updates but never actually used in any timeout check. Either the deadlines already bake in sufficient buffer, or this is a missing feature. | -| 4 | **`activate` CLI command is misleading** | Low | Named "activate" but actually calls "register" — will fail for already-registered operators. There's no standalone way to trigger re-evaluation of active status; instead, `_updateOperatorStatus()` runs automatically inside `addTicketBalance()`, `bondLicense()`, etc. | -| 5 | **Active-job load balancing bug fixed** | Info | The Rust `NodeStateStore.available_tickets()` subtracts `active_jobs` from total tickets, reducing the chance of busy nodes being selected for new E3s. Previously, the `Sortition` actor's `Handler` was missing match arms for `E3Failed` and `E3StageChanged`, causing these events to fall to the default `_ => ()` — the typed handlers for decrementing jobs were dead code. This has been fixed: E3Failed and E3StageChanged are now routed to their handlers, and `finalized_committees` is cleaned up in `decrement_jobs_for_e3` to prevent unbounded memory growth. | -| 6 | **Committee member expulsion** | Info | `SlashingManager` can call `expelCommitteeMember()` mid-DKG. The `Sortition` actor enriches the raw `CommitteeMemberExpelled` event with the expelled member's `party_id` (resolved from its stored `Committee` list) and re-publishes it. `ThresholdKeyshare` then uses the enriched `party_id` to update its collectors, potentially completing DKG with fewer parties. `ThresholdKeyshare` itself does not hold committee state. | -| 7 | **ProofRequestActor failure bridge fixed** | Info | `ProofRequestActor` no longer leaves proof publication suppressed under log-only "will not be published" exits. `ComputeRequestError` and local proof-signing failures for DKG-path proofs (`C0` through `C5`) now emit `E3Failed { failed_at_stage: CommitteeFinalized, reason: DKGInvalidShares }`, while decryption-path proofs (`C6` and `C7`) emit `E3Failed { failed_at_stage: CiphertextReady, reason: DecryptionInvalidShares }`. | +| # | Concern | Severity | Detail | +| --- | ------------------------------------------------------ | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 1 | **Deregister-before-slash race** | Accepted | SlashingManager Lane B (evidence+appeal) has a window during which the operator can deregister and claim their exit. If they do, the slash executes against 0 funds. The contract comments acknowledge this as an accepted tradeoff for the appeal window design. | +| 2 | **Committee publication decentralized** | Resolved | `publishCommittee()` is permissionless. Off-chain role selection chooses the active aggregator, while on-chain C5 proof verification and the single-publish guard prevent invalid or duplicate committee publication. | +| 3 | **`gracePeriod` is dead code** | Medium | `gracePeriod` is stored and validated during config updates but never actually used in any timeout check. Either the deadlines already bake in sufficient buffer, or this is a missing feature. | +| 4 | **`activate` CLI command is misleading** | Low | Named "activate" but actually calls "register" — will fail for already-registered operators. There's no standalone way to trigger re-evaluation of active status; instead, `_updateOperatorStatus()` runs automatically inside `addTicketBalance()`, `bondLicense()`, etc. | +| 5 | **Active-job load balancing bug fixed** | Info | The Rust `NodeStateStore.available_tickets()` subtracts `active_jobs` from total tickets, reducing the chance of busy nodes being selected for new E3s. Previously, the `Sortition` actor's `Handler` was missing match arms for `E3Failed` and `E3StageChanged`, causing these events to fall to the default `_ => ()` — the typed handlers for decrementing jobs were dead code. This has been fixed: E3Failed and E3StageChanged are now routed to their handlers, and `finalized_committees` is cleaned up in `decrement_jobs_for_e3` to prevent unbounded memory growth. | +| 6 | **Committee member expulsion** | Info | `SlashingManager` can call `expelCommitteeMember()` mid-DKG. The `Sortition` actor enriches the raw `CommitteeMemberExpelled` event with the expelled member's `party_id` (resolved from its stored `Committee` list) and re-publishes it. `ThresholdKeyshare` then uses the enriched `party_id` to update its collectors, potentially completing DKG with fewer parties. `ThresholdKeyshare` itself does not hold committee state. | +| 7 | **ProofRequestActor failure bridge fixed** | Info | `ProofRequestActor` no longer leaves proof publication suppressed under log-only "will not be published" exits. `ComputeRequestError` and local proof-signing failures for DKG-path proofs (`C0` through `C5`) now emit `E3Failed { failed_at_stage: CommitteeFinalized, reason: DKGInvalidShares }`, while decryption-path proofs (`C6` and `C7`) emit `E3Failed { failed_at_stage: CiphertextReady, reason: DecryptionInvalidShares }`. | +| 8 | **E3RequestComplete not triggered on timeout (fixed)** | Info | `E3Router` never published `E3RequestComplete` for timeout failures, leaving the per-E3 context alive indefinitely. Fixed: `E3Failed` events with a timeout reason (`CommitteeFormationTimeout`, `DKGTimeout`, `ComputeTimeout`, `DecryptionTimeout`) now trigger `PostForward::PublishComplete` in `routing.rs`, causing the router to publish `E3RequestComplete` and tear down the context. Local DKG collection timeouts in `ThresholdKeyshare` were also using `InsufficientCommitteeMembers` as the reason code; they now use `DKGTimeout` / `DecryptionTimeout` so the router correctly recognises them. `E3StageChanged(Failed)` and `E3Failed(timeout)` arriving after context teardown are now silently ignored. | diff --git a/agent/flow-trace/05_FAILURE_REFUND_SLASHING.md b/agent/flow-trace/05_FAILURE_REFUND_SLASHING.md index c22e2168ee..3c9d3768b8 100644 --- a/agent/flow-trace/05_FAILURE_REFUND_SLASHING.md +++ b/agent/flow-trace/05_FAILURE_REFUND_SLASHING.md @@ -1059,11 +1059,17 @@ When CommitteeMemberExpelled event arrives from EVM: │ ├─ Only processes raw events (party_id: None) │ └─ Removes expelled node from committee filter set │ -└─ When E3Failed / E3StageChanged(Complete|Failed) arrives: +└─ When E3Failed(timeout) / E3StageChanged(Complete) arrives: │ ├─ E3Router (central cleanup orchestrator): - │ └─ Converts E3Failed / E3StageChanged(Complete|Failed) → E3RequestComplete - │ → Single cleanup signal for all per-E3 actors + │ ├─ E3Failed with a timeout reason (CommitteeFormationTimeout, DKGTimeout, + │ │ ComputeTimeout, DecryptionTimeout) → publishes E3RequestComplete + │ │ → Single cleanup signal for all per-E3 actors + │ │ NOTE: E3Failed with a misbehaviour reason (DKGInvalidShares, etc.) does + │ │ NOT trigger E3RequestComplete — the accusation/slashing lifecycle must + │ │ complete first. + │ └─ E3StageChanged(Failed) and E3Failed(timeout) arriving after context teardown + │ are silently ignored (expected on-chain lag) │ ├─ CommitteeFinalizer (direct handler — semantic work): │ └─ Cancels any pending committee-finalization timer for this e3_id diff --git a/crates/events/src/enclave_event/e3_failed.rs b/crates/events/src/enclave_event/e3_failed.rs index ebce4b531a..a9f05977ff 100644 --- a/crates/events/src/enclave_event/e3_failed.rs +++ b/crates/events/src/enclave_event/e3_failed.rs @@ -27,6 +27,21 @@ pub enum FailureReason { VerificationFailed, } +impl FailureReason { + /// Returns true when the failure was caused purely by a deadline expiring rather + /// than by a node acting maliciously. Timeout failures have no associated + /// accusation/slashing lifecycle, so their E3 context can be torn down immediately. + pub fn is_timeout(&self) -> bool { + matches!( + self, + Self::CommitteeFormationTimeout + | Self::DKGTimeout + | Self::ComputeTimeout + | Self::DecryptionTimeout + ) + } +} + /// E3 lifecycle stage #[derive(Debug, Clone, PartialEq, Eq, Hash, Serialize, Deserialize)] pub enum E3Stage { diff --git a/crates/keyshare/src/actors/threshold_keyshare.rs b/crates/keyshare/src/actors/threshold_keyshare.rs index c41bcdcbcd..f0a17170fe 100644 --- a/crates/keyshare/src/actors/threshold_keyshare.rs +++ b/crates/keyshare/src/actors/threshold_keyshare.rs @@ -2047,7 +2047,7 @@ impl Handler for ThresholdKeyshare { self.bus.publish_without_context(E3Failed { e3_id: msg.e3_id, failed_at_stage: E3Stage::CommitteeFinalized, - reason: FailureReason::InsufficientCommitteeMembers, + reason: FailureReason::DKGTimeout, })?; // Stop this actor since we can't proceed without all encryption keys @@ -2081,7 +2081,7 @@ impl Handler for ThresholdKeyshare { self.bus.publish_without_context(E3Failed { e3_id: msg.e3_id, failed_at_stage: E3Stage::CommitteeFinalized, - reason: FailureReason::InsufficientCommitteeMembers, + reason: FailureReason::DKGTimeout, })?; ctx.stop(); @@ -2129,7 +2129,7 @@ impl Handler for ThresholdKeyshare { self.bus.publish_without_context(E3Failed { e3_id: msg.e3_id.clone(), failed_at_stage: E3Stage::CommitteeFinalized, - reason: FailureReason::InsufficientCommitteeMembers, + reason: FailureReason::DecryptionTimeout, })?; ctx.stop(); diff --git a/crates/request/src/domain/routing.rs b/crates/request/src/domain/routing.rs index cd262858b7..9cbde7b58e 100644 --- a/crates/request/src/domain/routing.rs +++ b/crates/request/src/domain/routing.rs @@ -66,14 +66,21 @@ impl RequestRouter { // If this e3 round has already been completed then this event is unexpected. if completed.contains(&e3_id) { - // Plaintext Aggregated Triggers E3RequestComplete which tears down the per-E3 context - // and mark it as completed, but the E3StageChanged(Complete) that arrives from the EVM - // after local teardown is expected and should be ignored rather than treated as an error. - if matches!( - msg.get_data(), + // On-chain confirmation events that lag behind local teardown are expected and + // should be silently ignored rather than treated as an error. + let is_late_terminal = match msg.get_data() { + // E3StageChanged(Complete) always lags local PlaintextAggregated completion. EnclaveEventData::E3StageChanged(data) - if matches!(data.new_stage, E3Stage::Complete) - ) { + if matches!(data.new_stage, E3Stage::Complete | E3Stage::Failed) => + { + true + } + // E3Failed from on-chain markE3Failed may arrive after a local timeout already + // cleaned up the context. + EnclaveEventData::E3Failed(data) if data.reason.is_timeout() => true, + _ => false, + }; + if is_late_terminal { return RoutingDecision::Ignore; } return RoutingDecision::AlreadyCompleted(e3_id); @@ -88,8 +95,12 @@ impl RequestRouter { { PostForward::PublishComplete } - // NOTE: E3Stage::Failed does NOT trigger E3RequestComplete. Failed rounds need the - // accusation/slashing lifecycle to complete before the context is torn down. + // Timeout failures have no accusation/slashing lifecycle, so the context can be + // torn down immediately. Misbehaviour failures (DKGInvalidShares, etc.) still need + // the accusation/slashing lifecycle to complete before teardown. + EnclaveEventData::E3Failed(data) if data.reason.is_timeout() => { + PostForward::PublishComplete + } EnclaveEventData::E3RequestComplete(_) => PostForward::Teardown, _ => PostForward::None, }; @@ -105,8 +116,8 @@ impl RequestRouter { mod tests { use super::*; use e3_events::{ - E3RequestComplete, E3Stage, E3StageChanged, EnclaveEvent, PlaintextAggregated, Sequenced, - Shutdown, + E3Failed, E3RequestComplete, E3Stage, E3StageChanged, EnclaveEvent, FailureReason, + PlaintextAggregated, Sequenced, Shutdown, }; fn e3id() -> E3id { @@ -190,9 +201,9 @@ mod tests { } #[test] - fn stage_changed_to_failed_still_errors_when_completed() { - // E3StageChanged(Failed) after completion IS unexpected and should still error, - // because the failed path goes through accusation/slashing, not simple completion. + fn stage_changed_to_failed_ignored_when_completed() { + // E3StageChanged(Failed) from the EVM can arrive after a local timeout already cleaned up + // the context. Treat it as a silent no-op, the same way we handle E3StageChanged(Complete). let id = e3id(); let mut completed = HashSet::new(); completed.insert(id.clone()); @@ -203,7 +214,7 @@ mod tests { }); assert_eq!( RequestRouter::route(&msg, &completed), - RoutingDecision::AlreadyCompleted(id) + RoutingDecision::Ignore ); } @@ -285,4 +296,109 @@ mod tests { } ); } + + // --- timeout-triggered E3Failed tests --- + + fn e3_failed(id: E3id, reason: FailureReason) -> EnclaveEvent { + from_data(E3Failed { + e3_id: id, + failed_at_stage: E3Stage::CommitteeFinalized, + reason, + }) + } + + #[test] + fn e3_failed_dkg_timeout_publishes_complete() { + let id = e3id(); + let msg = e3_failed(id.clone(), FailureReason::DKGTimeout); + assert_eq!( + RequestRouter::route(&msg, &HashSet::new()), + RoutingDecision::Process { + e3_id: id, + post_forward: PostForward::PublishComplete, + } + ); + } + + #[test] + fn e3_failed_committee_formation_timeout_publishes_complete() { + let id = e3id(); + let msg = e3_failed(id.clone(), FailureReason::CommitteeFormationTimeout); + assert_eq!( + RequestRouter::route(&msg, &HashSet::new()), + RoutingDecision::Process { + e3_id: id, + post_forward: PostForward::PublishComplete, + } + ); + } + + #[test] + fn e3_failed_compute_timeout_publishes_complete() { + let id = e3id(); + let msg = e3_failed(id.clone(), FailureReason::ComputeTimeout); + assert_eq!( + RequestRouter::route(&msg, &HashSet::new()), + RoutingDecision::Process { + e3_id: id, + post_forward: PostForward::PublishComplete, + } + ); + } + + #[test] + fn e3_failed_decryption_timeout_publishes_complete() { + let id = e3id(); + let msg = e3_failed(id.clone(), FailureReason::DecryptionTimeout); + assert_eq!( + RequestRouter::route(&msg, &HashSet::new()), + RoutingDecision::Process { + e3_id: id, + post_forward: PostForward::PublishComplete, + } + ); + } + + #[test] + fn e3_failed_invalid_shares_does_not_complete() { + // Slashable failures must NOT trigger E3RequestComplete — the accusation/slashing + // lifecycle must be allowed to finish first. + let id = e3id(); + let msg = e3_failed(id.clone(), FailureReason::DKGInvalidShares); + assert_eq!( + RequestRouter::route(&msg, &HashSet::new()), + RoutingDecision::Process { + e3_id: id, + post_forward: PostForward::None, + } + ); + } + + #[test] + fn e3_failed_timeout_ignored_when_already_completed() { + let id = e3id(); + let mut completed = HashSet::new(); + completed.insert(id.clone()); + let msg = e3_failed(id.clone(), FailureReason::DKGTimeout); + assert_eq!( + RequestRouter::route(&msg, &completed), + RoutingDecision::Ignore + ); + } + + #[test] + fn stage_changed_to_failed_ignored_when_already_completed() { + let id = e3id(); + let mut completed = HashSet::new(); + completed.insert(id.clone()); + let msg = from_data(E3StageChanged { + e3_id: id.clone(), + previous_stage: E3Stage::CommitteeFinalized, + new_stage: E3Stage::Failed, + }); + assert_eq!( + RequestRouter::route(&msg, &completed), + RoutingDecision::Ignore + ); + } } From 65771fc1dc929956970ce451d67d8024821e9ce2 Mon Sep 17 00:00:00 2001 From: 0xjei Date: Tue, 9 Jun 2026 11:38:36 +0200 Subject: [PATCH 2/2] fix wrong tests assertions --- crates/keyshare/src/actors/threshold_keyshare.rs | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/crates/keyshare/src/actors/threshold_keyshare.rs b/crates/keyshare/src/actors/threshold_keyshare.rs index f0a17170fe..639bc0e43c 100644 --- a/crates/keyshare/src/actors/threshold_keyshare.rs +++ b/crates/keyshare/src/actors/threshold_keyshare.rs @@ -2269,7 +2269,7 @@ mod tests { EnclaveEventData::E3Failed(data) if data.e3_id == failure.e3_id && data.failed_at_stage == E3Stage::CommitteeFinalized - && data.reason == FailureReason::InsufficientCommitteeMembers + && data.reason == FailureReason::DKGTimeout )); Ok(()) @@ -2300,7 +2300,7 @@ mod tests { EnclaveEventData::E3Failed(data) if data.e3_id == failure.e3_id && data.failed_at_stage == E3Stage::CommitteeFinalized - && data.reason == FailureReason::InsufficientCommitteeMembers + && data.reason == FailureReason::DKGTimeout )); Ok(()) @@ -2323,7 +2323,7 @@ mod tests { EnclaveEventData::E3Failed(data) if data.e3_id == failure.e3_id && data.failed_at_stage == E3Stage::CommitteeFinalized - && data.reason == FailureReason::InsufficientCommitteeMembers + && data.reason == FailureReason::DecryptionTimeout )); Ok(())