Skip to content

feat: multithread enable threadpool#1016

Merged
ctrlc03 merged 11 commits into
mainfrom
ry/multithread-enable-threadpool
Nov 15, 2025
Merged

feat: multithread enable threadpool#1016
ctrlc03 merged 11 commits into
mainfrom
ry/multithread-enable-threadpool

Conversation

@ryardley

@ryardley ryardley commented Nov 14, 2025

Copy link
Copy Markdown
Contributor

Closes: #784

  • Added a bunch of metrics to the test.
  • Enabled rayon
  • Simulated load and checked that rayon is working as expected it is
  • Looks like our trbfv code needs to use more parallelism if we can as the gains we are getting are minimal
Rayon Threads:                       1
Max Simultaneous Rayon Tasks:        1
Cores Available:                     7

Name                              Avg Duration       Runs  Total Duration
-------------------------------------------------------------------------
CalculateDecryptionKey            1.988880484s          5    9.944402424s
CalculateDecryptionShare          4.822244226s          5   24.111221132s
CalculateThresholdDecryption      6.429865015s          1    6.429865015s
GenEsiSss                         6.006048754s          5   30.030243774s
GenPkShareAndSkSss                2.966910702s          5    14.83455351s
Total time                                                  85.350285855s

Setup                                      : 1.077s
Committee Setup                            : 0.142s
Committee Finalization                     : 0.106s
All ThresholdShareCreated events           : 78.527s
ThresholdShares -> PublicKeyAggregated     : 16.036s
E3Request -> PublicKeyAggregated           : 95.503s
Application CT Gen                         : 5.958s
Running FHE Application                    : 0.101s
Ciphertext published -> PlaintextAggregated: 34.000s
Entire Test                                : 136.786s
test test_trbfv_actor ... ok
Rayon Threads:                       6
Max Simultaneous Rayon Tasks:        1
Cores Available:                     7

Name                              Avg Duration       Runs  Total Duration
-------------------------------------------------------------------------
CalculateDecryptionKey            2.094410447s          5   10.472052237s
CalculateDecryptionShare          4.759911175s          5   23.799555878s
CalculateThresholdDecryption      4.747719066s          1    4.747719066s
GenEsiSss                         4.849741052s          5    24.24870526s
GenPkShareAndSkSss                2.467378121s          5   12.336890608s
Total time                                                  75.604923049s

Setup                                      : 1.078s
Committee Setup                            : 0.133s
Committee Finalization                     : 0.104s
All ThresholdShareCreated events           : 75.257s
ThresholdShares -> PublicKeyAggregated     : 16.219s
E3Request -> PublicKeyAggregated           : 92.453s
Application CT Gen                         : 5.963s
Running FHE Application                    : 0.101s
Ciphertext published -> PlaintextAggregated: 32.225s
Entire Test                                : 131.961s
test test_trbfv_actor ... ok

Seems we might be able to leverage rayon a bit more as so far we only save 10sec from total test time and that might be within the margin of error. I expect this will become more apparent when we run proofs

Summary by CodeRabbit

  • New Features

    • Flexible multithreading controls with max-thread helpers and concurrent-job settings.
    • Performance reporting and duration tracking with optional event capture.
  • Bug Fixes

    • Improved error reporting with additional error variants and clearer messages.
  • Chores

    • Concurrency control and backpressure for multithreaded work; long-running task warnings.
  • Tests

    • Extended integration test timing and reporting to surface multithreaded performance.

@coderabbitai

coderabbitai Bot commented Nov 14, 2025

Copy link
Copy Markdown
Contributor

Warning

Rate limit exceeded

@ryardley has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 0 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between d44b93d and e937686.

📒 Files selected for processing (2)
  • crates/ciphernode-builder/src/ciphernode_builder.rs (4 hunks)
  • crates/multithread/src/lib.rs (4 hunks)

Walkthrough

Adds configurable multithreading to the ciphernode builder, wires those options into a refactored Multithread actor with per-instance Rayon pools, semaphore-based concurrency limiting, timing instrumentation and optional event reporting; updates related call sites and integration tests to collect and print timing reports.

Changes

Cohort / File(s) Summary
CiphernodeBuilder Configuration
crates/ciphernode-builder/src/ciphernode_builder.rs, crates/entrypoint/src/start/aggregator_start.rs, crates/entrypoint/src/start/start.rs
Added builder fields and methods: multithread_concurrent_jobs: Option<usize>, multithread_capture_events: bool, with_max_threads(), with_max_threads_minus(), with_multithread_concurrent_jobs(), with_multithread_capture_events(). Propagates options to Multithread::attach. Adjusted deprecation text for with_threads.
ComputeRequest Event Types
crates/events/src/enclave_event/compute_request/mod.rs
Added impl ToString for ComputeRequest. Extended ComputeRequestError with public variants RecvError(String) and SemaphoreError(String) and updated Display/source handling.
Threshold Keyshare Refactor
crates/keyshare/src/threshold_keyshare.rs
Renamed private method handle_compute_requestmultithread_request and updated all call sites (no logic change).
Multithread Core Runtime
crates/multithread/src/lib.rs
Major refactor: per-instance Rayon ThreadPool, global Rayon pool initialization, rayon_limit: Semaphore for concurrency limiting, report: Option<MultithreadReport>. Multithread::new/attach accept rayon_threads, max_simultaneous_rayon_tasks, capture_events. Tasks spawn on Rayon, return via Tokio oneshot; backpressure via semaphore permits; added long-running job warnings and timing return values. Added TrackDuration and GetReport messages/handlers.
Multithread Reporting
crates/multithread/src/report.rs
New MultithreadReport and FlattenedReport types: collect TrackDuration events, compute totals/averages/runs, and implement Display to render a formatted report including config and per-name metrics.
Integration Test Instrumentation
crates/tests/tests/integration.rs
Added timing instrumentation and a serialize_report helper, switched multithread configuration to use dynamic concurrent_jobs / thread limits, retrieve and print multithread report at test end, and adjusted event waiting/checks.

Sequence Diagram(s)

sequenceDiagram
    participant Builder as CiphernodeBuilder
    participant Multithread as Multithread Actor
    participant Rayon as Rayon Pool
    participant Semaphore as Semaphore
    participant Task as Compute Task

    Builder->>Builder: with_max_threads() / with_multithread_concurrent_jobs() / with_multithread_capture_events()
    Builder->>Multithread: attach(rayon_threads, max_simultaneous, capture_events)

    Multithread->>Rayon: initialize global pool & create per-instance ThreadPool
    Multithread->>Semaphore: create with max_simultaneous permits

    rect rgb(230,245,255)
    Note over Multithread,Task: Submission with backpressure
    Multithread->>Semaphore: acquire permit (await)
    Semaphore-->>Multithread: permit acquired
    Multithread->>Rayon: spawn task (Rayon thread)
    end

    rect rgb(240,255,240)
    Note over Task,Rayon: Parallel execution and timing
    Rayon->>Task: run compute request
    Task-->>Rayon: send (result, duration) via oneshot
    Rayon->>Multithread: deliver result
    Multithread->>Semaphore: release permit
    end

    rect rgb(255,250,220)
    Note over Multithread: optional reporting
    alt capture_events == true
        Multithread->>Multithread: TrackDuration(name, duration)
    end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus review on crates/multithread/src/lib.rs (concurrency/semaphore, oneshot handling, timing/warnings).
  • Verify crates/multithread/src/report.rs aggregation and Display formatting for correctness and edge cases.
  • Ensure builder-to-actor option propagation in crates/ciphernode-builder/src/ciphernode_builder.rs and entrypoint start files.
  • Check integration test sequencing/timing changes in crates/tests/tests/integration.rs.
  • Confirm new error variants in crates/events/src/enclave_event/compute_request/mod.rs are handled where errors are matched/propagated.

Possibly related PRs

Suggested labels

ciphernode

Suggested reviewers

  • hmzakhalid
  • 0xjei
  • auryn-macmillan
  • ozgurarmanc

Poem

🐰 I hop where threads now bound and queue,
Semaphores keep chaos true,
Timings tumble, nicely logged,
Pools and permits snugly hogged,
A rabbit cheers the multithread view.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.13% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: multithread enable threadpool' clearly summarizes the main change: enabling rayon threadpool for multithreading in the ciphernode builder, which is the primary focus across all modified files.
Linked Issues check ✅ Passed The PR implements the core objective from #784: re-enables multithreading via a semaphore-based concurrency limiter to prevent thread starvation, replaces hard-coded parameters with configurable options, and provides comprehensive timing instrumentation to verify performance improvements.
Out of Scope Changes check ✅ Passed All changes are within scope: multithreading enablement (builder configuration, Multithread struct updates), concurrency control (Semaphore-based backpressure), event reporting (TrackDuration/GetReport handlers), and integration test enhancements to verify the feature work as intended.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@vercel

vercel Bot commented Nov 14, 2025

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
crisp Ready Ready Preview Comment Nov 14, 2025 9:30pm
enclave-docs Ready Ready Preview Comment Nov 14, 2025 9:30pm

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (4)
crates/keyshare/src/threshold_keyshare.rs (1)

720-750: Centralized multithread_request helper looks good; consider richer error context

The shared helper preserves the previous pattern (build ComputeRequest, send via Multithread, then map the response) and the new call sites are clean and DRY. The res?? idiom is also a neat way to unify mailbox and compute errors into a single Result for logging.

Two minor suggestions:

  • When request_fn or response_fn fails, you currently only log {e}. Given you now have a ComputeRequest::to_string() available, consider threading a human‑readable operation name into the log line to make failures easier to correlate to a particular TrBFV step.
  • Please double‑check that e3_utils::bail(self) indeed returns a ResponseActFuture<Self, ()> that behaves as you expect in this context (e.g., no accidental actor stopping), since this path is now the sole error return from the helper.

Also applies to: 781-785, 791-795, 801-805, 811-815

crates/multithread/src/report.rs (1)

31-73: Aggregation logic is correct; consider minor refinements for efficiency/robustness.

The per-name accumulation, average computation, and mt_total sum all look logically sound and match the intended reporting. Two small, optional tweaks:

  • You clone total_dur twice (for avg_dur and mt_total). You could iterate over &total_dur instead to avoid extra allocations, though this is probably negligible on test-only paths.
  • self.total_dur.get(name).unwrap() relies on the invariant that avg_dur and total_dur share identical keys. That holds with the current construction, but if future changes ever mutate one map independently, this becomes a panic point; switching to unwrap_or/expect with a clearer message or deriving total_dur directly from avg_dur would be slightly safer.

Given the current usage, these are nice-to-haves, not blockers.

Also applies to: 77-85, 87-120

crates/multithread/src/lib.rs (2)

115-186: I need to gather more specific information about the technical feasibility of the suggested fix and whether the concern about permit lifetime under cancellation is the actual root issue.

Now let me verify whether this is a practical concern by checking the codebase for how ResponseFutures are used:

Based on my verification, the review comment's concern about semaphore permit release under cancellation is valid and addresses a real edge case that can occur in production scenarios (client disconnection, server shutdown, timeout). The suggestion to use acquire_owned() and move the permit into the rayon closure is technically sound and implementable.

Semaphore backpressure limit can be exceeded under request cancellation; consider stronger permit ownership.

When futures are dropped, which can occur at any await point due to cancellation, the semaphore permit (_permit) held by the async handler is also dropped and released. This allows a concurrent rayon task to exceed the max_simultaneous_rayon_tasks limit while the CPU-bound job still runs, particularly under scenarios like client disconnection or server shutdown.

Rayon's spawn() requires 'static closures with move semantics, but this is compatible with your suggested solution: using Semaphore::acquire_owned() returns an OwnedSemaphorePermit that owns an Arc reference and can be moved into the rayon closure, keeping it alive for the job's duration rather than just the response future's.

This refactoring strengthens backpressure under cancellation; the current behavior is correct for normal completion but allows oversubscription when callers drop the ResponseFuture early.


189-207: Vec accumulation pattern is real, but metric key naming already follows best practices.

The reporting integration is straightforward with one notable pattern:

  • MultithreadReport accumulates every TrackDuration in a Vec before aggregating. For long-lived nodes under heavy load, consider aggregating as you go rather than retaining all events.
  • ComputeRequest's to_string() implementation already returns stable operation labels (e.g., "CalculateDecryptionShare", "GenEsiSss") rather than full request formatting, so metric cardinality is well-managed and per-name averages remain useful.

This is non-blocking and can be revisited if reporting moves beyond test/benchmark usage.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9287de5 and cc56f8a.

📒 Files selected for processing (8)
  • crates/ciphernode-builder/src/ciphernode_builder.rs (4 hunks)
  • crates/entrypoint/src/start/aggregator_start.rs (1 hunks)
  • crates/entrypoint/src/start/start.rs (1 hunks)
  • crates/events/src/enclave_event/compute_request/mod.rs (3 hunks)
  • crates/keyshare/src/threshold_keyshare.rs (5 hunks)
  • crates/multithread/src/lib.rs (4 hunks)
  • crates/multithread/src/report.rs (1 hunks)
  • crates/tests/tests/integration.rs (12 hunks)
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-04-30T06:25:14.721Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 372
File: packages/ciphernode/events/src/eventbus.rs:15-15
Timestamp: 2025-04-30T06:25:14.721Z
Learning: EnclaveEvent implements Display in packages/ciphernode/events/src/enclave_event/mod.rs, which satisfies the Event trait requirement. Static analysis tools may incorrectly flag implementations as missing when they do exist.

Applied to files:

  • crates/events/src/enclave_event/compute_request/mod.rs
  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-22T02:10:34.864Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 145
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:82-83
Timestamp: 2024-10-22T02:10:34.864Z
Learning: In the file `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs`, when reviewing test functions like `generate_pk_share`, minor performance optimizations (e.g., minimizing mutex locks) are not a priority.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-10T23:24:43.341Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 143
File: packages/ciphernode/sortition/src/sortition.rs:4-9
Timestamp: 2024-10-10T23:24:43.341Z
Learning: In the `Sortition` module (`packages/ciphernode/sortition/src/sortition.rs`), errors are sent to the event bus using `self.bus.err`, which handles logging and printing. Therefore, explicit use of the `tracing` crate for logging errors may not be necessary in this context.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-09-26T03:11:29.311Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 107
File: packages/ciphernode/sortition/src/distance.rs:1-1
Timestamp: 2024-09-26T03:11:29.311Z
Learning: In `packages/ciphernode/core/src/events.rs`, the import statements use the correct and updated `alloy::primitives` module.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-08T19:45:18.209Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 133
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:137-139
Timestamp: 2024-10-08T19:45:18.209Z
Learning: In `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs`, using the same RNG instance `rng_test` for generating multiple key shares without advancing its state is acceptable.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-23T02:03:02.008Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 156
File: packages/ciphernode/keyshare/src/encryption.rs:45-45
Timestamp: 2024-10-23T02:03:02.008Z
Learning: In the `packages/ciphernode/keyshare/src/encryption.rs` file, the environment variable `CIPHERNODE_SECRET` is used for the encryption password. A secure secret management solution is not currently available, but may be considered in future iterations.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-23T01:59:42.967Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 156
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:274-274
Timestamp: 2024-10-23T01:59:42.967Z
Learning: In the `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs` file and other test files within this project, hardcoding `CIPHERNODE_SECRET` is acceptable for testing purposes.

Applied to files:

  • crates/tests/tests/integration.rs
📚 Learning: 2024-10-12T10:24:07.572Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 139
File: packages/ciphernode/aggregator/src/publickey_aggregator.rs:46-46
Timestamp: 2024-10-12T10:24:07.572Z
Learning: In `packages/ciphernode/router/src/hooks.rs`, the `src_chain_id` parameter in `PublicKeyAggregator::new` is correctly handled, even if not explicitly provided during instantiation.

Applied to files:

  • crates/entrypoint/src/start/aggregator_start.rs
  • crates/entrypoint/src/start/start.rs
📚 Learning: 2024-10-22T03:42:14.057Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 145
File: packages/ciphernode/router/src/context.rs:94-97
Timestamp: 2024-10-22T03:42:14.057Z
Learning: In `packages/ciphernode/router/src/context.rs`, avoid adding complexity for batching checkpoint operations in code; rely on the database's batching capabilities instead.

Applied to files:

  • crates/entrypoint/src/start/start.rs
🧬 Code graph analysis (4)
crates/tests/tests/integration.rs (3)
crates/multithread/src/lib.rs (2)
  • get_max_threads_minus (85-91)
  • attach (93-108)
crates/tests/tests/integration_legacy.rs (1)
  • setup_score_sortition_environment (121-156)
crates/test-helpers/src/application.rs (1)
  • run_application (56-76)
crates/multithread/src/report.rs (1)
crates/multithread/src/lib.rs (2)
  • new (44-82)
  • new (217-219)
crates/multithread/src/lib.rs (2)
crates/tests/tests/integration.rs (1)
  • report (75-85)
crates/multithread/src/report.rs (1)
  • new (19-25)
crates/ciphernode-builder/src/ciphernode_builder.rs (1)
crates/multithread/src/lib.rs (1)
  • get_max_threads_minus (85-91)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: test_net
  • GitHub Check: integration_prebuild
  • GitHub Check: build_enclave_cli
  • GitHub Check: build_sdk
  • GitHub Check: rust_unit
  • GitHub Check: test_contracts
  • GitHub Check: rust_integration
  • GitHub Check: Build & Push Image
🔇 Additional comments (5)
crates/entrypoint/src/start/start.rs (1)

38-48: Wiring .with_max_threads() into the main builder chain looks good

The max‑threads configuration is applied consistently during node startup without altering control flow, which aligns with the multithreading goals of the PR.

crates/entrypoint/src/start/aggregator_start.rs (1)

38-48: Aggregator builder now matches main startup for max‑thread config

Adding .with_max_threads() here keeps the aggregator initialization consistent with the main ciphernode startup path and should help surface the Rayon improvements under load.

crates/tests/tests/integration.rs (1)

17-17: Timing/reporting additions and multithread wiring in the TRBFV integration test look solid

The new instrumentation is well-contained and doesn’t interfere with the test’s core assertions:

  • serialize_report is simple and robust (handles empty reports, aligns keys nicely).
  • The report: Vec<(&str, Duration)> only uses string literals as keys, so there are no lifetime issues.
  • Using Multithread::get_max_threads_minus(1) with concurrent_jobs = 1 mirrors the benchmark setups described in the PR and should give realistic coverage of the Rayon pool behavior.
  • Pulling a final GetReport from multithread and printing it, followed by the serialized phase timings, provides a useful snapshot for performance debugging while keeping the test logic readable.

No functional issues spotted in these additions.

Also applies to: 27-27, 72-87, 93-95, 118-120, 160-169, 206-207, 215-215, 227-228, 272-283, 285-297, 298-318, 347-347, 349-355, 389-392, 429-434

crates/multithread/src/lib.rs (2)

226-240: Timing instrumentation is correctly implemented and verified.

The code appropriately uses std::time::Instant, which is intended to be monotonic and suitable for measuring operation durations. The pattern of Instant::now() followed by .elapsed() is the recommended approach for timing code in normal environments. The implementation in timefunc is consistent, idiomatic, and produces accurate duration measurements across all variants.


84-91: Code handling of thread count is correct and defensive.

The function correctly calls available_parallelism(), which returns Result<NonZeroUsize> and may error if the platform cannot determine available parallelism. The code defensively handles this with .unwrap_or(1), providing a sensible fallback. The saturating_sub and max(1, ...) clamping prevent underflow and enforce a minimum thread count. Since available_parallelism() is intended as a portable default for sizing worker pools, the semantics align with the function's purpose. No changes needed.

Comment thread crates/ciphernode-builder/src/ciphernode_builder.rs
Comment thread crates/events/src/enclave_event/compute_request/mod.rs
Comment thread crates/multithread/src/lib.rs

@ctrlc03 ctrlc03 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK

@ctrlc03 ctrlc03 merged commit 7048a47 into main Nov 15, 2025
24 checks passed
@ctrlc03 ctrlc03 deleted the ry/multithread-enable-threadpool branch November 15, 2025 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enclave Multithread Queue for faster TrBFV

2 participants