Explicit-state batch execution and base-state seeding#143
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends the explicit-state execution path to support dependency-aware speculative UpdateClient batch execution with bounded concurrency, plus the underlying store/enclave plumbing needed to extract and later stitch speculative write-sets. It also adds a height-qualified client query and bumps SGX enclave TCS capacity for experiments.
Changes:
- Add proto/Rust RPC + message types for explicit-state speculative
UpdateClientbatch execution, and addheighttoQueryClientRequest. - Introduce speculative/read transaction write-set extraction (
take_write_set) via an overlay store layer, and add transaction intent (TxMode) + write-set stitching helpers inenclave-api. - Add service-side batch executor, per-client serialization, and bounded enclave-entry/speculative concurrency controls; raise enclave
TCSNum.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| proto/src/prost/lcp.service.elc.v1.rs | Generated Rust types + gRPC stubs for new batch RPC and query height. |
| proto/definitions/lcp/service/elc/v1/tx.proto | Adds ExecuteSpeculativeUpdateClientBatch RPC and related messages. |
| proto/definitions/lcp/service/elc/v1/query.proto | Adds optional height to QueryClientRequest. |
| modules/store/src/transaction.rs | Adds TxMode and default take_write_set hook on CommitStore. |
| modules/store/src/rocksdb.rs | Implements overlay-backed read/speculative writes and take_write_set for RocksDB. |
| modules/store/src/overlay.rs | New overlay KV store to accumulate isolated speculative writes. |
| modules/store/src/memory.rs | Implements take_write_set for MemStore test backend. |
| modules/store/src/lib.rs | Exposes overlay module and defines WriteSet type alias. |
| modules/service/src/service.rs | Adds permit gating + per-client serialization infrastructure to AppService. |
| modules/service/src/lib.rs | Re-exports explicit-state batch types from the service crate. |
| modules/service/src/explicit_state.rs | New explicit-state speculative batch executor + conflict detection + tests. |
| modules/service/src/elc.rs | Wires new batch RPC into tonic server and applies enclave gating. |
| modules/service/Cargo.toml | Adds serde/bincode + test deps to support explicit-state batch logic/tests. |
| modules/enclave-api/src/lib.rs | Re-exports new speculative execution types. |
| modules/enclave-api/src/enclave.rs | Adds tx-mode helpers + write-set extraction and stitching (apply_write_set). |
| modules/enclave-api/src/api/primitive.rs | Adds explicit tx-mode execution and speculative command execution helper. |
| modules/enclave-api/src/api/command.rs | Adds speculative update-client execution with base-state seeding. |
| modules/enclave-api/src/api.rs | Re-exports speculative types from the API module. |
| modules/ecall-commands/src/msgs.rs | Maps new query height field into internal input type. |
| modules/ecall-commands/src/light_client.rs | Extends QueryClientInput to include optional height. |
| enclave/Enclave.config.xml | Raises TCSNum from 2 to 4. |
| enclave-modules/ecall-handler/src/light_client/query.rs | Uses requested height (or latest) when fetching consensus state. |
| Cargo.lock | Locks new dependencies pulled in by service/enclave-api changes. |
| app/src/commands/service.rs | Adds --max-enclave-concurrency and passes concurrency limits into service. |
| app/src/commands/elc.rs | Adds hidden debug/admin command to execute speculative batch from JSON. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
be0f1b6 to
c3077f5
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ecc4d85 to
deb1db3
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 38 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
modules/store/src/rocksdb.rs:285
CommitStore::commiton a speculative RocksDB transaction will currently returnOk(())and silently drop the overlay writes becauseStoreTransaction::commitonly commitsUpdatetransactions. This makes it easy for callers to accidentally lose speculative writes without noticing. Consider rejectingcommit()for speculative transactions (return an error) or making commit semantics explicit forSpeculativeto avoid silent no-ops.
Serial gRPC UpdateClient (unary + streaming) is serialized with speculative batches by the same KeyLockMap[client_id]. Calls to proto_update_client outside the service path (for example app/src/commands/elc.rs) are outside this guarantee.
7e471ca to
d2ae4dc
Compare
Add `EcallPool`, a long-lived pool of `--max-enclave-concurrency` OS threads dedicated to executing ECALLs. All gRPC handlers in `elc.rs` (`create_client`, `update_client`, `update_client_stream`, `aggregate_messages`, `verify_membership`, `verify_non_membership`, `client`) now dispatch their ECALLs onto the pool via `tokio::task::spawn_blocking(|| pool.run(|| ...))` instead of either calling the enclave directly from a tonic async handler (which would bind a TCS to every tokio worker that ever ECALLs) or going through the default `tokio::task::spawn_blocking` pool (which can grow to hundreds of cached threads, each binding its own TCS). Under `TCSPolicy=BIND` the Intel SGX SDK pins one TCS to each ECALL-issuing host thread for the thread's lifetime. The `with_ecall_permit` semaphore in `enclave-api` only bounds *concurrent* ECALLs, not the cumulative set of distinct threads that have ever ECALLed. Once that cumulative set exceeds `TCSNum`, any new ECALL-issuing thread fails with `SGX_ERROR_OUT_OF_TCS`. This is what was observed during a long-running Arbitrum activate-client run on bcc-dev even with `--max-enclave-concurrency=4` and `TCSNum=8`. By pinning ECALL execution to `EcallPool`'s `size` permanent workers, the cumulative TCS bindings created by the service stay at `size` for the entire process lifetime, restoring the invariant the existing CLI guidance assumed. Speculative scheduler's own `thread::scope` workers (`speculative/scheduler.rs`) are intentionally left as-is in this change. Their threads terminate at scope exit so the SDK destructor releases their TCS bindings on each batch; addressing that path (folding speculative dispatch into the same pool) is a follow-up so this PR stays small. CLI help text for `--max-enclave-concurrency` is updated: the prior wording recommended `cap == TCSNum`, which is unsafe under BIND once any unbounded ECALL-thread source exists. The new wording describes the flag as the pool size and recommends `cap < TCSNum` to leave headroom for the SDK runtime and the speculative path. Tests: - `cargo test -p service` (37 tests, includes 3 new `ecall_pool` tests asserting concurrency cap, result delivery, and that observed worker-thread ids are bounded by pool size). - `cargo check --workspace` is clean.
The previous fix in `9e75101` routed gRPC-handler ECALLs (update_client,
update_client_stream, create_client, aggregate_messages, verify_*,
client) through the long-lived `EcallPool` workers, but left the
speculative scheduler's `thread::scope` workers in
`modules/service/src/speculative/scheduler.rs` issuing their own ECALLs
directly on those scoped threads. Under `TCSPolicy=BIND` each scope
worker still bound a TCS for the duration of a batch, so peak TCS
occupancy was `pool_size + scope_size` whenever a speculative batch
overlapped with any non-speculative gRPC ECALL. For the deployed
configuration `pool_size = scope_size = 4` and `TCSNum = 8`, leaving no
headroom for the SDK runtime and reintroducing `SGX_ERROR_OUT_OF_TCS`
risk under service-mode peak load.
This change hands off the per-unit ECALL inside the scope worker to
`AppService.ecall_pool.run(...)`. The scope worker continues to dequeue
work, hold the per-stream `speculative_request_permit`, and record the
result; it just no longer enters the enclave on its own thread. The
ECALL executes on one of the permanent `EcallPool` workers, whose TCS
binding is stable for the lifetime of the LCP process.
Peak TCS occupancy after this change:
pool_size + 0 (scope workers no longer ECALL) = pool_size
`--max-enclave-concurrency` (= `EcallPool` size) is now the single
structural bound on host threads that ever enter the enclave,
regardless of how many speculative scope workers exist at any moment.
Test impact: `streaming_speculative_batch_parallelizes_complete_base_state_units`
now requires the `AppService` ecall_concurrency to be at least the
per-stream speculative cap, since the pool is the effective ECALL
concurrency. Bumped the test fixture from `ecall_concurrency=1` to `=3`
to match `SpeculativeService::new(3)`. All other speculative tests
already pass at `ecall_concurrency=1` because they assert
`observed_max_in_flight() >= 1` or `== 1`, which still holds.
Tests:
- `cargo test -p service` — 37 passed, 0 failed.
- `cargo check --workspace --tests` — clean.
After `6afd485` routed both gRPC handler ECALLs and speculative
scheduler ECALLs through `service::EcallPool`, the `ECallGate`
semaphore in `modules/enclave-api/src/enclave.rs` became structurally
redundant: the only host threads that ever issue an ECALL are the
permanent EcallPool workers, whose count is fixed at the configured
`--max-enclave-concurrency`. The host-side semaphore had nothing left
to gate.
Removed:
- `ECallGate`, `ECallGateState`, `ECallPermitGuard`, and their tests
in `modules/enclave-api/src/enclave.rs`.
- `Enclave::new` / `Enclave::create` no longer take an
`ecall_concurrency` argument.
- `EnclaveInfo::with_ecall_permit` trait method.
- `EnclaveLoader::load_with_ecall_concurrency` and its implementation;
the remaining `EnclaveLoader::load` no longer needs an
`ecall_concurrency` parameter.
- The semaphore acquire/release call site in
`modules/enclave-api/src/api/primitive.rs::execute_prepared_command`.
Updated:
- `app/src/commands/service.rs` switches `enclave_loader.load(...)`,
and the `--max-enclave-concurrency` CLI help text is rewritten to
describe the flag as the EcallPool size only.
- `tests/integration/src/lib.rs` drops the now-removed argument from
`Enclave::create`.
Concurrency control after this change:
EcallPool size (--max-enclave-concurrency)
= number of permanent host threads that ever ECALL
= max concurrent ECALLs in flight
= max cumulative TCS bindings under TCSPolicy=BIND
Set this to a value at most `TCSNum`; the default 4 leaves room for
the SDK runtime when `TCSNum=8`.
Tests:
- `cargo test -p service` — 37 passed.
- `cargo test -p enclave-api` — 3 passed.
- `cargo check --workspace` — clean.
- The integration-test crate compile-passes; runtime test
`tests::test_lcp` requires actual SGX hardware and is not affected
by this change.
68ab7bb to
56fcda1
Compare
The stitch-phase base verification recomputed the expected prev_state_id with gen_state_id_from_any over the raw caller-supplied Anys. Light clients derive state IDs from a canonicalized client state (e.g. latest_height/frozen reset before hashing), and that canonicalization is ELC-specific and only available inside the enclave, so the recomputed hash never matches the enclave-observed prev_state_id for any ELC whose canonicalization is not the identity. This made every explicit-state speculative batch fail with BaseStateMismatch at the stitch phase. Drop the raw recompute and rely on the remaining checks: the supplied (client_state, consensus_state) bytes are pinned to the canonical store, and the enclave-observed prev_state_id must match the height-indexed state_id stored by the in-enclave light client at create/update time.
…verify The state_id hash check already covers canonical equivalence; the bincode byte-equality checks on client_state / consensus_state are over-strict on encoding-only differences in the supplied Anys.
| if tx.kind != MemTxKind::Speculative { | ||
| return Err(crate::Error::not_supported_operation( | ||
| "take_write_set is only available for speculative transactions".to_string(), | ||
| )); | ||
| } |
| prev_height: &Height, | ||
| _client_state: &Any, | ||
| _consensus_state: &Any, | ||
| prev_state_id: Option<&[u8]>, | ||
| ) -> Result<()> |
Drop over-specific examples from the comment and the operator hint from the missing-entry error; keep the canonical-equivalence rationale.
The previous 60s budget fires during normal per-unit build pauses on slow upstream provers; raise the cap so transient producer pacing does not close the stream mid-batch.
What changed
UpdateClientexecution: run updates against a complete explicit base state supplied by the relayer in isolated transactions, then stitch effective write sets into canonical stateclient_id; units within a batch may execute concurrently only when each unit carries a complete explicit base state (prev_height+client_state+consensus_state)client_stateandconsensus_state(prev_height)under the per-client update lock before stitchingWhy
The normal
UpdateClientpath reads its base state from canonical enclave storage and commits each update in order. That is the safest default, but it prevents the service from evaluating a batch of related updates speculatively before deciding how their resulting write sets should be committed.The explicit-state path lets the service execute
UpdateClientagainst a complete base state supplied by the relayer, instead of always reading the next base state from canonical enclave storage. The execution result is kept in an isolated transaction first, validated against the expected ordered state chain, checked against the canonical first base state, and only then stitched into canonical state.This provides the substrate for bounded speculative batch execution. The service does not rebase dependent units; every speculative unit must already carry its own complete explicit base state. The relayer is responsible for precomputing those base states off-service.
Impact
--max-speculative-concurrency--max-enclave-concurrency), preventing TCS overrunUpdateClient(unary + streaming) is serialized with speculative batches by the sameKeyLockMap[client_id]; calls toproto_update_clientoutside the service path are outside this guaranteeValidation
cargo fmt --checkcargo test -p enclave-api --libcargo test -p service --libNotes
Speculative units are explicit-state only:
prev_height,client_state, andconsensus_statemust all be populated. The service validates that observed transitions form a linear chain before stitching effective write sets; it does not perform in-service rebase. Before the stitched write set is committed, the first unit's base state is also compared with canonical storage while holding the per-client update lock.Production-grade multi-lane parallelism is not claimed by this PR. The parallel dispatch path is covered by unit tests; real-enclave E2E coverage depends on downstream configuration of both
--max-speculative-concurrencyand--max-enclave-concurrency.The in-tree
enclave/Enclave.config.xmlTCSNumis the dev default; production enclave builds override it via the downstream enclaves repository.All ECALLs are guarded by the enclave ECALL gate (
--max-enclave-concurrency, default 4). This can make serial RPCs wait when the gate is saturated, but prevents exceeding the enclave TCS budget.--max-speculative-concurrencycontrols how many speculativeUpdateClientrequests are started concurrently. Setting it less than or equal to--max-enclave-concurrencyis recommended; if it is larger, extra speculative workers will wait on the ECALL gate.--threadscontrols the Tokio runtime worker count only. It does not control enclave ECALL/TCS concurrency.