feat: multithread enable threadpool#1016
Conversation
|
Warning Rate limit exceeded@ryardley has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 16 minutes and 0 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (2)
WalkthroughAdds configurable multithreading to the ciphernode builder, wires those options into a refactored Multithread actor with per-instance Rayon pools, semaphore-based concurrency limiting, timing instrumentation and optional event reporting; updates related call sites and integration tests to collect and print timing reports. Changes
Sequence Diagram(s)sequenceDiagram
participant Builder as CiphernodeBuilder
participant Multithread as Multithread Actor
participant Rayon as Rayon Pool
participant Semaphore as Semaphore
participant Task as Compute Task
Builder->>Builder: with_max_threads() / with_multithread_concurrent_jobs() / with_multithread_capture_events()
Builder->>Multithread: attach(rayon_threads, max_simultaneous, capture_events)
Multithread->>Rayon: initialize global pool & create per-instance ThreadPool
Multithread->>Semaphore: create with max_simultaneous permits
rect rgb(230,245,255)
Note over Multithread,Task: Submission with backpressure
Multithread->>Semaphore: acquire permit (await)
Semaphore-->>Multithread: permit acquired
Multithread->>Rayon: spawn task (Rayon thread)
end
rect rgb(240,255,240)
Note over Task,Rayon: Parallel execution and timing
Rayon->>Task: run compute request
Task-->>Rayon: send (result, duration) via oneshot
Rayon->>Multithread: deliver result
Multithread->>Semaphore: release permit
end
rect rgb(255,250,220)
Note over Multithread: optional reporting
alt capture_events == true
Multithread->>Multithread: TrackDuration(name, duration)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (4)
crates/keyshare/src/threshold_keyshare.rs (1)
720-750: Centralizedmultithread_requesthelper looks good; consider richer error contextThe shared helper preserves the previous pattern (build
ComputeRequest, send viaMultithread, then map the response) and the new call sites are clean and DRY. Theres??idiom is also a neat way to unify mailbox and compute errors into a singleResultfor logging.Two minor suggestions:
- When
request_fnorresponse_fnfails, you currently only log{e}. Given you now have aComputeRequest::to_string()available, consider threading a human‑readable operation name into the log line to make failures easier to correlate to a particular TrBFV step.- Please double‑check that
e3_utils::bail(self)indeed returns aResponseActFuture<Self, ()>that behaves as you expect in this context (e.g., no accidental actor stopping), since this path is now the sole error return from the helper.Also applies to: 781-785, 791-795, 801-805, 811-815
crates/multithread/src/report.rs (1)
31-73: Aggregation logic is correct; consider minor refinements for efficiency/robustness.The per-name accumulation, average computation, and
mt_totalsum all look logically sound and match the intended reporting. Two small, optional tweaks:
- You clone
total_durtwice (foravg_durandmt_total). You could iterate over&total_durinstead to avoid extra allocations, though this is probably negligible on test-only paths.self.total_dur.get(name).unwrap()relies on the invariant thatavg_durandtotal_durshare identical keys. That holds with the current construction, but if future changes ever mutate one map independently, this becomes a panic point; switching tounwrap_or/expectwith a clearer message or derivingtotal_durdirectly fromavg_durwould be slightly safer.Given the current usage, these are nice-to-haves, not blockers.
Also applies to: 77-85, 87-120
crates/multithread/src/lib.rs (2)
115-186: I need to gather more specific information about the technical feasibility of the suggested fix and whether the concern about permit lifetime under cancellation is the actual root issue.Now let me verify whether this is a practical concern by checking the codebase for how ResponseFutures are used:
Based on my verification, the review comment's concern about semaphore permit release under cancellation is valid and addresses a real edge case that can occur in production scenarios (client disconnection, server shutdown, timeout). The suggestion to use
acquire_owned()and move the permit into the rayon closure is technically sound and implementable.Semaphore backpressure limit can be exceeded under request cancellation; consider stronger permit ownership.
When futures are dropped, which can occur at any await point due to cancellation, the semaphore permit (
_permit) held by the async handler is also dropped and released. This allows a concurrent rayon task to exceed themax_simultaneous_rayon_taskslimit while the CPU-bound job still runs, particularly under scenarios like client disconnection or server shutdown.Rayon's
spawn()requires 'static closures with move semantics, but this is compatible with your suggested solution: usingSemaphore::acquire_owned()returns anOwnedSemaphorePermitthat owns an Arc reference and can be moved into the rayon closure, keeping it alive for the job's duration rather than just the response future's.This refactoring strengthens backpressure under cancellation; the current behavior is correct for normal completion but allows oversubscription when callers drop the ResponseFuture early.
189-207: Vec accumulation pattern is real, but metric key naming already follows best practices.The reporting integration is straightforward with one notable pattern:
MultithreadReportaccumulates everyTrackDurationin aVecbefore aggregating. For long-lived nodes under heavy load, consider aggregating as you go rather than retaining all events.- ComputeRequest's
to_string()implementation already returns stable operation labels (e.g., "CalculateDecryptionShare", "GenEsiSss") rather than full request formatting, so metric cardinality is well-managed and per-name averages remain useful.This is non-blocking and can be revisited if reporting moves beyond test/benchmark usage.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
crates/ciphernode-builder/src/ciphernode_builder.rs(4 hunks)crates/entrypoint/src/start/aggregator_start.rs(1 hunks)crates/entrypoint/src/start/start.rs(1 hunks)crates/events/src/enclave_event/compute_request/mod.rs(3 hunks)crates/keyshare/src/threshold_keyshare.rs(5 hunks)crates/multithread/src/lib.rs(4 hunks)crates/multithread/src/report.rs(1 hunks)crates/tests/tests/integration.rs(12 hunks)
🧰 Additional context used
🧠 Learnings (9)
📚 Learning: 2025-04-30T06:25:14.721Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 372
File: packages/ciphernode/events/src/eventbus.rs:15-15
Timestamp: 2025-04-30T06:25:14.721Z
Learning: EnclaveEvent implements Display in packages/ciphernode/events/src/enclave_event/mod.rs, which satisfies the Event trait requirement. Static analysis tools may incorrectly flag implementations as missing when they do exist.
Applied to files:
crates/events/src/enclave_event/compute_request/mod.rscrates/tests/tests/integration.rs
📚 Learning: 2024-10-22T02:10:34.864Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 145
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:82-83
Timestamp: 2024-10-22T02:10:34.864Z
Learning: In the file `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs`, when reviewing test functions like `generate_pk_share`, minor performance optimizations (e.g., minimizing mutex locks) are not a priority.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-10-10T23:24:43.341Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 143
File: packages/ciphernode/sortition/src/sortition.rs:4-9
Timestamp: 2024-10-10T23:24:43.341Z
Learning: In the `Sortition` module (`packages/ciphernode/sortition/src/sortition.rs`), errors are sent to the event bus using `self.bus.err`, which handles logging and printing. Therefore, explicit use of the `tracing` crate for logging errors may not be necessary in this context.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-09-26T03:11:29.311Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 107
File: packages/ciphernode/sortition/src/distance.rs:1-1
Timestamp: 2024-09-26T03:11:29.311Z
Learning: In `packages/ciphernode/core/src/events.rs`, the import statements use the correct and updated `alloy::primitives` module.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-10-08T19:45:18.209Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 133
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:137-139
Timestamp: 2024-10-08T19:45:18.209Z
Learning: In `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs`, using the same RNG instance `rng_test` for generating multiple key shares without advancing its state is acceptable.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-10-23T02:03:02.008Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 156
File: packages/ciphernode/keyshare/src/encryption.rs:45-45
Timestamp: 2024-10-23T02:03:02.008Z
Learning: In the `packages/ciphernode/keyshare/src/encryption.rs` file, the environment variable `CIPHERNODE_SECRET` is used for the encryption password. A secure secret management solution is not currently available, but may be considered in future iterations.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-10-23T01:59:42.967Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 156
File: packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs:274-274
Timestamp: 2024-10-23T01:59:42.967Z
Learning: In the `packages/ciphernode/tests/tests/test_aggregation_and_decryption.rs` file and other test files within this project, hardcoding `CIPHERNODE_SECRET` is acceptable for testing purposes.
Applied to files:
crates/tests/tests/integration.rs
📚 Learning: 2024-10-12T10:24:07.572Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 139
File: packages/ciphernode/aggregator/src/publickey_aggregator.rs:46-46
Timestamp: 2024-10-12T10:24:07.572Z
Learning: In `packages/ciphernode/router/src/hooks.rs`, the `src_chain_id` parameter in `PublicKeyAggregator::new` is correctly handled, even if not explicitly provided during instantiation.
Applied to files:
crates/entrypoint/src/start/aggregator_start.rscrates/entrypoint/src/start/start.rs
📚 Learning: 2024-10-22T03:42:14.057Z
Learnt from: ryardley
Repo: gnosisguild/enclave PR: 145
File: packages/ciphernode/router/src/context.rs:94-97
Timestamp: 2024-10-22T03:42:14.057Z
Learning: In `packages/ciphernode/router/src/context.rs`, avoid adding complexity for batching checkpoint operations in code; rely on the database's batching capabilities instead.
Applied to files:
crates/entrypoint/src/start/start.rs
🧬 Code graph analysis (4)
crates/tests/tests/integration.rs (3)
crates/multithread/src/lib.rs (2)
get_max_threads_minus(85-91)attach(93-108)crates/tests/tests/integration_legacy.rs (1)
setup_score_sortition_environment(121-156)crates/test-helpers/src/application.rs (1)
run_application(56-76)
crates/multithread/src/report.rs (1)
crates/multithread/src/lib.rs (2)
new(44-82)new(217-219)
crates/multithread/src/lib.rs (2)
crates/tests/tests/integration.rs (1)
report(75-85)crates/multithread/src/report.rs (1)
new(19-25)
crates/ciphernode-builder/src/ciphernode_builder.rs (1)
crates/multithread/src/lib.rs (1)
get_max_threads_minus(85-91)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: test_net
- GitHub Check: integration_prebuild
- GitHub Check: build_enclave_cli
- GitHub Check: build_sdk
- GitHub Check: rust_unit
- GitHub Check: test_contracts
- GitHub Check: rust_integration
- GitHub Check: Build & Push Image
🔇 Additional comments (5)
crates/entrypoint/src/start/start.rs (1)
38-48: Wiring.with_max_threads()into the main builder chain looks goodThe max‑threads configuration is applied consistently during node startup without altering control flow, which aligns with the multithreading goals of the PR.
crates/entrypoint/src/start/aggregator_start.rs (1)
38-48: Aggregator builder now matches main startup for max‑thread configAdding
.with_max_threads()here keeps the aggregator initialization consistent with the main ciphernode startup path and should help surface the Rayon improvements under load.crates/tests/tests/integration.rs (1)
17-17: Timing/reporting additions and multithread wiring in the TRBFV integration test look solidThe new instrumentation is well-contained and doesn’t interfere with the test’s core assertions:
serialize_reportis simple and robust (handles empty reports, aligns keys nicely).- The
report: Vec<(&str, Duration)>only uses string literals as keys, so there are no lifetime issues.- Using
Multithread::get_max_threads_minus(1)withconcurrent_jobs = 1mirrors the benchmark setups described in the PR and should give realistic coverage of the Rayon pool behavior.- Pulling a final
GetReportfrommultithreadand printing it, followed by the serialized phase timings, provides a useful snapshot for performance debugging while keeping the test logic readable.No functional issues spotted in these additions.
Also applies to: 27-27, 72-87, 93-95, 118-120, 160-169, 206-207, 215-215, 227-228, 272-283, 285-297, 298-318, 347-347, 349-355, 389-392, 429-434
crates/multithread/src/lib.rs (2)
226-240: Timing instrumentation is correctly implemented and verified.The code appropriately uses
std::time::Instant, which is intended to be monotonic and suitable for measuring operation durations. The pattern ofInstant::now()followed by.elapsed()is the recommended approach for timing code in normal environments. The implementation intimefuncis consistent, idiomatic, and produces accurate duration measurements across all variants.
84-91: Code handling of thread count is correct and defensive.The function correctly calls
available_parallelism(), which returnsResult<NonZeroUsize>and may error if the platform cannot determine available parallelism. The code defensively handles this with.unwrap_or(1), providing a sensible fallback. Thesaturating_subandmax(1, ...)clamping prevent underflow and enforce a minimum thread count. Sinceavailable_parallelism()is intended as a portable default for sizing worker pools, the semantics align with the function's purpose. No changes needed.
Closes: #784
Seems we might be able to leverage rayon a bit more as so far we only save 10sec from total test time and that might be within the margin of error. I expect this will become more apparent when we run proofs
Summary by CodeRabbit
New Features
Bug Fixes
Chores
Tests