service: pin gRPC ECALL dispatch to a fixed-size worker pool by ikehara · Pull Request #144 · datachainlab/lcp

ikehara · 2026-06-11T06:36:14Z

Summary

Adds EcallPool, a long-lived pool of --max-enclave-concurrency OS threads dedicated to executing ECALLs. All gRPC handlers in modules/service/src/elc.rs dispatch their ECALLs onto the pool via tokio::task::spawn_blocking(|| pool.run(|| ...)) instead of either calling the enclave directly from a tonic async handler or going through the default tokio::task::spawn_blocking pool.
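For readers without the diff at hand, here is a minimal sketch of what a fixed-size pool with a blocking `run` could look like. It uses only the standard library. The struct layout, the channel-based dispatch, and the helper `distinct_worker_threads` are assumptions made for illustration; the actual `EcallPool` in modules/service/src/ecall_pool.rs may be built differently.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle, ThreadId};

/// A boxed unit of work handed to a worker thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool; not the PR's actual implementation.
pub struct EcallPool {
    // The Mutex makes the pool shareable (`Sync`) across caller threads.
    sender: Mutex<Option<Sender<Job>>>,
    workers: Vec<JoinHandle<()>>,
}

impl EcallPool {
    /// Spawns exactly `size` threads that live until the pool is dropped.
    pub fn new(size: usize) -> Self {
        assert!(size > 0, "pool needs at least one worker");
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..size)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || loop {
                    // The lock guard is dropped at the end of this statement,
                    // so other workers can pick up jobs while this one runs.
                    let job = receiver.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: pool was dropped
                    }
                })
            })
            .collect();
        Self { sender: Mutex::new(Some(sender)), workers }
    }

    /// Runs `f` on a pool worker and blocks the caller until it returns.
    pub fn run<F, R>(&self, f: F) -> R
    where
        F: FnOnce() -> R + Send + 'static,
        R: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(f());
        });
        self.sender
            .lock()
            .unwrap()
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("no live workers");
        rx.recv().expect("worker dropped the job without a result")
    }
}

impl Drop for EcallPool {
    fn drop(&mut self) {
        // Dropping the sender closes the channel, so every worker's recv()
        // fails and its loop exits.
        drop(self.sender.get_mut().unwrap().take());
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}

/// Runs `jobs` closures through a pool of `size` workers and returns how
/// many distinct OS threads executed them.
pub fn distinct_worker_threads(size: usize, jobs: usize) -> usize {
    let pool = EcallPool::new(size);
    (0..jobs)
        .map(|_| pool.run(|| thread::current().id()))
        .collect::<HashSet<ThreadId>>()
        .len()
}
```

Because `run` blocks its caller, the handlers wrap it in `tokio::task::spawn_blocking` as described above. In that arrangement the blocking-pool thread only waits for the result, and the closure itself executes on one of the pool's permanent workers.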

Why

Under TCSPolicy=BIND the Intel SGX SDK pins one TCS to each ECALL-issuing host thread for the thread's lifetime, releasing it only on thread termination. The existing with_ecall_permit semaphore in enclave-api only bounds concurrent ECALLs; it does not bound the cumulative set of distinct host threads that have ever ECALLed. Once that cumulative set exceeds TCSNum, any new ECALL-issuing thread fails with SGX_ERROR_OUT_OF_TCS.
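The gap between "concurrent" and "cumulative" can be reproduced without SGX. The sketch below is a model only: `Semaphore` is a hand-rolled stand-in for `with_ecall_permit`, the job body is a placeholder for an ECALL, and the barrier keeps each thread alive after its job to mimic an idle cached worker. All of these names are assumptions for illustration.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread::{self, ThreadId};

/// Minimal counting semaphore standing in for `with_ecall_permit` (assumption).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

#[derive(Default)]
struct Stats {
    in_flight: usize,
    peak_in_flight: usize,
    threads_seen: HashSet<ThreadId>,
}

/// Runs `jobs` gated jobs, one fresh thread each, and returns
/// (peak concurrent jobs, distinct threads that ran a job).
pub fn semaphore_only(permits: usize, jobs: usize) -> (usize, usize) {
    let sem = Arc::new(Semaphore::new(permits));
    let stats = Arc::new(Mutex::new(Stats::default()));
    // Keeps every thread alive after its job, like an idle cached worker.
    let barrier = Arc::new(Barrier::new(jobs));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let stats = Arc::clone(&stats);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut s = stats.lock().unwrap();
                    s.in_flight += 1;
                    s.peak_in_flight = s.peak_in_flight.max(s.in_flight);
                    s.threads_seen.insert(thread::current().id());
                }
                thread::yield_now(); // placeholder for the ECALL body
                stats.lock().unwrap().in_flight -= 1;
                sem.release();
                barrier.wait(); // stay alive until all jobs have run
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let s = stats.lock().unwrap();
    (s.peak_in_flight, s.threads_seen.len())
}
```

With `permits = 4` and `jobs = 16`, the peak never exceeds 4, yet 16 distinct threads have each run a job and are all still alive at the barrier. Under BIND each of those threads would hold its own TCS, which already exceeds the `TCSNum=8` from the incident.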

This is what was observed on bcc-dev during a long-running Arbitrum activate-client even with --max-enclave-concurrency=4 and TCSNum=8. Full root-cause analysis: datachainlab/docs#10123.

The leak sources were:

- update_client and update_client_stream dispatched ECALLs through tokio::task::spawn_blocking, whose default pool can grow to hundreds of cached threads (10 s idle retention).
- create_client, aggregate_messages, verify_membership, verify_non_membership, and client (Query) called the enclave directly from the tonic async fn, binding a TCS to every tokio worker that happened to handle the request.

With this PR, ECALLs from all gRPC handlers run on exactly size permanent threads, so the cumulative TCS bindings created by the service stay at size for the entire process lifetime.

Out of scope

The speculative scheduler in modules/service/src/speculative/scheduler.rs still uses std::thread::scope per batch. Its workers terminate at scope exit, so the SDK destructor releases their TCS bindings on each batch boundary; the speculative path therefore does not exhibit the cumulative-binding leak. Folding speculative dispatch into the same pool (and removing the now-redundant with_ecall_permit gate) is a worthwhile follow-up but kept out of this PR to limit blast radius.
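The per-batch lifetime described above follows from how `std::thread::scope` works: every thread spawned inside the scope has been joined by the time the scope returns. The sketch below shows that shape only. `run_batch` and its doubling "work" are hypothetical stand-ins and do not come from scheduler.rs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Processes one batch on short-lived scoped threads and returns
/// (results in input order, workers still running after the scope).
pub fn run_batch(inputs: &[u64]) -> (Vec<u64>, usize) {
    let live = AtomicUsize::new(0);

    let results = thread::scope(|s| {
        let handles: Vec<_> = inputs
            .iter()
            .map(|&x| {
                let live = &live;
                s.spawn(move || {
                    live.fetch_add(1, Ordering::SeqCst);
                    let out = x * 2; // placeholder for a speculative ECALL
                    live.fetch_sub(1, Ordering::SeqCst);
                    out
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect::<Vec<_>>()
    });

    // `thread::scope` returns only after all scoped threads have finished,
    // so no worker from this batch outlives the call.
    (results, live.load(Ordering::SeqCst))
}
```

Each call creates fresh threads and tears them down again, so this path gives up thread reuse in exchange for never accumulating long-lived ECALL threads.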

CLI help text

The current --max-enclave-concurrency help reads:

Set this to match the loaded enclave's TCSNum; the default assumes a conservative TCS budget of 4.

That guidance is unsafe under BIND once any unbounded ECALL-thread source exists. Updated wording describes the flag as the pool size and recommends cap < TCSNum to leave headroom for the SDK runtime and the still-unbounded speculative path.

Tests

cargo test --package service
# 37 passed; 0 failed
cargo check --workspace
# clean

Three new unit tests in modules/service/src/ecall_pool.rs:

- pool_limits_concurrent_jobs_to_worker_count: concurrency cap verification.
- pool_returns_job_result_to_caller: happy-path round-trip.
- pool_workers_have_stable_thread_ids_across_jobs: asserts that the set of OS thread ids that execute jobs is bounded by pool size, which is the property TCSPolicy=BIND relies on.

References

datachainlab/docs#10123 — root-cause analysis and discussion of mitigation options.

Add `EcallPool`, a long-lived pool of `--max-enclave-concurrency` OS threads dedicated to executing ECALLs. All gRPC handlers in `elc.rs` (`create_client`, `update_client`, `update_client_stream`, `aggregate_messages`, `verify_membership`, `verify_non_membership`, `client`) now dispatch their ECALLs onto the pool via `tokio::task::spawn_blocking(|| pool.run(|| ...))`. Previously they either called the enclave directly from a tonic async handler (which binds a TCS to every tokio worker that ever ECALLs) or went through the default `tokio::task::spawn_blocking` pool (which can grow to hundreds of cached threads, each binding its own TCS).

Under `TCSPolicy=BIND` the Intel SGX SDK pins one TCS to each ECALL-issuing host thread for the thread's lifetime. The `with_ecall_permit` semaphore in `enclave-api` only bounds *concurrent* ECALLs, not the cumulative set of distinct threads that have ever ECALLed. Once that cumulative set exceeds `TCSNum`, any new ECALL-issuing thread fails with `SGX_ERROR_OUT_OF_TCS`. This is what was observed during a long-running Arbitrum activate-client run on bcc-dev even with `--max-enclave-concurrency=4` and `TCSNum=8` (datachainlab/docs#10123).

By pinning ECALL execution to `EcallPool`'s `size` permanent workers, the cumulative TCS bindings created by the service stay at `size` for the entire process lifetime, restoring the invariant the existing CLI guidance assumed.

The speculative scheduler's own `thread::scope` workers (`speculative/scheduler.rs`) are intentionally left as-is in this change. Their threads terminate at scope exit, so the SDK destructor releases their TCS bindings on each batch. Addressing that path (folding speculative dispatch into the same pool) is a follow-up so that this PR stays small.

The CLI help text for `--max-enclave-concurrency` is updated: the prior wording recommended `cap == TCSNum`, which is unsafe under BIND once any unbounded ECALL-thread source exists. The new wording describes the flag as the pool size and recommends `cap < TCSNum` to leave headroom for the SDK runtime and the speculative path.

Tests:
- `cargo test -p service` passes (37 tests, including 3 new `ecall_pool` tests asserting the concurrency cap, result delivery, and that observed worker-thread ids are bounded by pool size).
- `cargo check --workspace` is clean.

ikehara · 2026-06-11T07:09:11Z

Closing in favor of pushing the commit directly to feature/explicit-state-parallelization to match the existing workflow on this debug branch. Same commit content; design discussion remains in datachainlab/docs#10123.

ikehara closed this Jun 11, 2026

ikehara deleted the feat/ecall-pool-fixed-workers branch June 11, 2026 07:09

ikehara wants to merge 1 commit into feature/explicit-state-parallelization from feat/ecall-pool-fixed-workers
