service: pin gRPC ECALL dispatch to a fixed-size worker pool #144

Closed
ikehara wants to merge 1 commit into feature/explicit-state-parallelization from feat/ecall-pool-fixed-workers

ikehara commented Jun 11, 2026

Summary

Adds EcallPool, a long-lived pool of --max-enclave-concurrency OS threads dedicated to executing ECALLs. All gRPC handlers in modules/service/src/elc.rs dispatch their ECALLs onto the pool via tokio::task::spawn_blocking(|| pool.run(|| ...)) instead of either calling the enclave directly from a tonic async handler or going through the default tokio::task::spawn_blocking pool.

Why

Under TCSPolicy=BIND the Intel SGX SDK pins one TCS to each ECALL-issuing host thread for the thread's lifetime, releasing it only on thread termination. The existing with_ecall_permit semaphore in enclave-api only bounds concurrent ECALLs; it does not bound the cumulative set of distinct host threads that have ever ECALLed. Once that cumulative set exceeds TCSNum, any new ECALL-issuing thread fails with SGX_ERROR_OUT_OF_TCS.

This is what was observed on bcc-dev during a long-running Arbitrum activate-client even with --max-enclave-concurrency=4 and TCSNum=8. Full root-cause analysis: datachainlab/docs#10123.
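The gap between a concurrency bound and a cumulative-thread bound can be illustrated with a small simulation. This is only a sketch: it never touches SGX, `Semaphore` is a hypothetical stand-in for the with_ecall_permit gate, and the barrier stands in for cached threads that stay alive after their ECALL, which is when a BIND-style binding would persist.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread::{self, ThreadId};
use std::time::Duration;

/// Minimal counting semaphore (hypothetical stand-in for `with_ecall_permit`).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }

    fn with_permit<R>(&self, f: impl FnOnce() -> R) -> R {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
        drop(p); // do not hold the lock while the "ECALL" runs
        let out = f();
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
        out
    }
}

#[derive(Default)]
struct Stats {
    running: usize,
    peak: usize,
    bound: HashSet<ThreadId>, // threads that have "ECALLed" at least once
}

/// Issues `calls` simulated ECALLs, each from its own host thread, gated by
/// `permits`. Returns (peak concurrency, distinct threads alive after ECALLing).
fn simulate(permits: usize, calls: usize) -> (usize, usize) {
    let sem = Arc::new(Semaphore::new(permits));
    let stats = Arc::new(Mutex::new(Stats::default()));
    let all_called = Arc::new(Barrier::new(calls + 1));
    let handles: Vec<_> = (0..calls)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let stats = Arc::clone(&stats);
            let all_called = Arc::clone(&all_called);
            thread::spawn(move || {
                sem.with_permit(|| {
                    {
                        let mut s = stats.lock().unwrap();
                        s.running += 1;
                        let running = s.running;
                        s.peak = s.peak.max(running);
                        s.bound.insert(thread::current().id());
                    }
                    thread::sleep(Duration::from_millis(1)); // simulated ECALL
                    stats.lock().unwrap().running -= 1;
                });
                // Stay alive like an idle cached thread until everyone has called.
                all_called.wait();
            })
        })
        .collect();
    all_called.wait(); // every thread has ECALLed and is still alive here
    let result = {
        let s = stats.lock().unwrap();
        (s.peak, s.bound.len())
    };
    for h in handles {
        h.join().unwrap();
    }
    result
}

fn main() {
    let tcs_num = 8; // value taken from the bcc-dev report above
    let (peak, distinct) = simulate(4, 32);
    println!("peak concurrency = {peak}, distinct ECALL threads = {distinct}");
    println!("exceeds TCSNum: {}", distinct > tcs_num);
}
```

The semaphore keeps peak concurrency at or below the permit count, yet the number of live threads that have ECALLed equals the number of calls, which is the quantity that matters under BIND.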

The leak sources were:

  1. update_client and update_client_stream dispatched ECALLs through tokio::task::spawn_blocking, whose default pool can grow to hundreds of cached threads (10 s idle retention).
  2. create_client, aggregate_messages, verify_membership, verify_non_membership, and client (Query) called the enclave directly from the tonic async fn, binding a TCS to every tokio worker that happened to handle the request.

With this PR, ECALLs from all gRPC handlers run on exactly size permanent threads, where size is the pool size given by --max-enclave-concurrency, so the cumulative TCS bindings created by the service stay at size for the entire process lifetime.
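For readers without the diff at hand, the shape of such a pool can be sketched with the standard library alone. This is not the code in modules/service/src/ecall_pool.rs: the struct layout, the channel-based queue, and the `distinct_worker_ids` helper are assumptions made for illustration, and plain caller threads replace tokio::task::spawn_blocking so the example has no dependencies.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle, ThreadId};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool; the PR's real `EcallPool` may differ.
struct EcallPool {
    tx: Option<Mutex<Sender<Job>>>,
    workers: Vec<JoinHandle<()>>,
}

impl EcallPool {
    /// Spawns `size` long-lived workers that pull jobs from a shared queue.
    fn new(size: usize) -> Self {
        let (tx, rx) = channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..size)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // The lock guard is dropped at the end of this statement,
                    // so the job itself runs without holding the queue lock.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // queue closed: worker exits
                    }
                })
            })
            .collect();
        EcallPool { tx: Some(Mutex::new(tx)), workers }
    }

    /// Blocks the caller until `f` has run on a pool worker, then returns its result.
    fn run<R, F>(&self, f: F) -> R
    where
        F: FnOnce() -> R + Send + 'static,
        R: Send + 'static,
    {
        let (rtx, rrx) = channel();
        let job: Job = Box::new(move || {
            let _ = rtx.send(f());
        });
        let tx = self.tx.as_ref().expect("pool is running");
        tx.lock().unwrap().send(job).expect("workers are alive");
        rrx.recv().expect("worker dropped the job")
    }
}

impl Drop for EcallPool {
    fn drop(&mut self) {
        drop(self.tx.take()); // closing the queue lets the workers exit
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
    }
}

/// Submits `jobs` jobs from `jobs` distinct caller threads and counts how many
/// distinct worker threads actually executed them.
fn distinct_worker_ids(size: usize, jobs: usize) -> usize {
    let pool = Arc::new(EcallPool::new(size));
    let callers: Vec<_> = (0..jobs)
        .map(|_| {
            let pool = Arc::clone(&pool);
            thread::spawn(move || pool.run(|| thread::current().id()))
        })
        .collect();
    let ids: HashSet<ThreadId> = callers.into_iter().map(|c| c.join().unwrap()).collect();
    ids.len()
}

fn main() {
    let n = distinct_worker_ids(4, 64);
    println!("64 jobs ran on {n} distinct worker threads (pool size 4)");
}
```

However many caller threads submit work, the closure passed to `run` only ever executes on one of the `size` workers, so the set of threads that would issue ECALLs is fixed at construction. The sketch does not handle a panicking job, which would take its worker down with it.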

Out of scope

The speculative scheduler in modules/service/src/speculative/scheduler.rs still uses std::thread::scope per batch. Its workers terminate at scope exit, so the SDK destructor releases their TCS bindings on each batch boundary; the speculative path therefore does not exhibit the cumulative-binding leak. Folding speculative dispatch into the same pool (and removing the now-redundant with_ecall_permit gate) is a worthwhile follow-up but kept out of this PR to limit blast radius.
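The reasoning about per-batch workers can be checked with a toy model of the std::thread::scope pattern. This is a sketch under assumptions: `run_batches` and its parameters are hypothetical, and no enclave is involved; it only shows that scoped workers are joined at scope exit, so the number of threads alive at once is bounded by the batch width even though the total number of distinct threads keeps growing.

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

/// Runs `batches` batches of `width` scoped workers (hypothetical model of a
/// per-batch scheduler). Returns (distinct worker thread ids across all
/// batches, largest number of workers seen within a single batch).
fn run_batches(batches: usize, width: usize) -> (usize, usize) {
    let seen = Mutex::new(HashSet::<ThreadId>::new());
    let mut max_per_batch = 0;
    for _ in 0..batches {
        let in_batch = Mutex::new(HashSet::<ThreadId>::new());
        thread::scope(|s| {
            for _ in 0..width {
                s.spawn(|| {
                    let id = thread::current().id();
                    in_batch.lock().unwrap().insert(id);
                    seen.lock().unwrap().insert(id);
                });
            }
        }); // every worker of this batch has been joined at this point
        max_per_batch = max_per_batch.max(in_batch.into_inner().unwrap().len());
    }
    (seen.into_inner().unwrap().len(), max_per_batch)
}

fn main() {
    let (total, per_batch) = run_batches(5, 3);
    println!("distinct threads overall = {total}, alive per batch <= {per_batch}");
}
```

Because each batch's threads have terminated before the next batch starts, bindings tied to thread lifetime are bounded by the batch width, not by the running total of threads ever created.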

CLI help text

The current --max-enclave-concurrency help reads:

Set this to match the loaded enclave's TCSNum; the default assumes a conservative TCS budget of 4.

That guidance is unsafe under BIND once any unbounded ECALL-thread source exists. The updated wording describes the flag as the pool size and recommends cap < TCSNum to leave headroom for the SDK runtime and the still-unbounded speculative path.

Tests

cargo test --package service
# 37 passed; 0 failed
cargo check --workspace
# clean

Three new unit tests in modules/service/src/ecall_pool.rs:

  • pool_limits_concurrent_jobs_to_worker_count — concurrency cap verification.
  • pool_returns_job_result_to_caller — happy-path round-trip.
  • pool_workers_have_stable_thread_ids_across_jobs — asserts that the set of OS thread ids that execute jobs is bounded by pool size, which is the property TCSPolicy=BIND relies on.

References

  • datachainlab/docs#10123 — root-cause analysis and discussion of mitigation options.

Add `EcallPool`, a long-lived pool of `--max-enclave-concurrency` OS
threads dedicated to executing ECALLs. All gRPC handlers in `elc.rs`
(`create_client`, `update_client`, `update_client_stream`,
`aggregate_messages`, `verify_membership`, `verify_non_membership`,
`client`) now dispatch their ECALLs onto the pool via
`tokio::task::spawn_blocking(|| pool.run(|| ...))` instead of either
calling the enclave directly from a tonic async handler (which would
bind a TCS to every tokio worker that ever ECALLs) or going through the
default `tokio::task::spawn_blocking` pool (which can grow to hundreds
of cached threads, each binding its own TCS).

Under `TCSPolicy=BIND` the Intel SGX SDK pins one TCS to each
ECALL-issuing host thread for the thread's lifetime. The
`with_ecall_permit` semaphore in `enclave-api` only bounds *concurrent*
ECALLs, not the cumulative set of distinct threads that have ever
ECALLed. Once that cumulative set exceeds `TCSNum`, any new
ECALL-issuing thread fails with `SGX_ERROR_OUT_OF_TCS`. This is what
was observed during a long-running Arbitrum activate-client run on
bcc-dev even with `--max-enclave-concurrency=4` and `TCSNum=8`
(datachainlab/docs#10123).

By pinning ECALL execution to `EcallPool`'s `size` permanent workers,
the cumulative TCS bindings created by the service stay at `size` for
the entire process lifetime, restoring the invariant the existing CLI
guidance assumed.

The speculative scheduler's own `thread::scope` workers
(`speculative/scheduler.rs`) are intentionally left as-is in this
change. Their threads terminate at scope exit, so the SDK destructor
releases their TCS bindings on each batch. Addressing that path
(folding speculative dispatch into the same pool) is a follow-up so
that this PR stays small.

CLI help text for `--max-enclave-concurrency` is updated: the prior
wording recommended `cap == TCSNum`, which is unsafe under BIND once
any unbounded ECALL-thread source exists. The new wording describes
the flag as the pool size and recommends `cap < TCSNum` to leave
headroom for the SDK runtime and the speculative path.

Tests:
- `cargo test -p service` (37 tests, includes 3 new `ecall_pool` tests
  asserting concurrency cap, result delivery, and that observed
  worker-thread ids are bounded by pool size).
- `cargo check --workspace` is clean.

ikehara commented Jun 11, 2026

Closing in favor of pushing the commit directly to feature/explicit-state-parallelization to match the existing workflow on this debug branch. Same commit content; design discussion remains in datachainlab/docs#10123.

ikehara closed this Jun 11, 2026
ikehara deleted the feat/ecall-pool-fixed-workers branch June 11, 2026 07:09