Explicit-state batch execution and base-state seeding by ikehara · Pull Request #143 · datachainlab/lcp

ikehara · 2026-04-06T08:32:50Z

What changed

introduce explicit-state speculative UpdateClient execution: run updates against a complete explicit base state supplied by the relayer in isolated transactions, then stitch effective write sets into canonical state
add bounded enclave-entry control and speculative batch / stream execution infrastructure for the explicit-state path
reject non-linear same-client unit chains at stitch via state-id continuity checks
serialize speculative batches per client_id; units within a batch may execute concurrently only when each unit carries a complete explicit base state (prev_height + client_state + consensus_state)
move speculative execution off the global enclave mutex while keeping client-scoped serialization
seed explicit-state base payloads into speculative transactions and return only effective write sets with seed-equal entries removed
verify the first speculative unit's explicit base state against canonical client_state and consensus_state(prev_height) under the per-client update lock before stitching
add service-level coverage for streaming scheduling, linear-chain validation, and concurrent speculative dispatch
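
As an illustration of "return only effective write sets with seed-equal entries removed", here is a minimal sketch, not the PR's implementation. The `WriteSet` shape (a map with `None` for deletions) and the name `effective_write_set` are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Key/value bytes written by a transaction; `None` marks a deletion.
/// Hypothetical shape: the PR's `WriteSet` alias may differ.
type WriteSet = BTreeMap<Vec<u8>, Option<Vec<u8>>>;

/// Drop every write whose value equals what the seed stored under the same key.
/// `effective_write_set` is an illustrative name, not an API from the PR.
fn effective_write_set(seed: &BTreeMap<Vec<u8>, Vec<u8>>, raw: WriteSet) -> WriteSet {
    raw.into_iter()
        .filter(|(key, value)| match (seed.get(key), value) {
            // Seeded and left unchanged: not an effect of the update.
            (Some(seeded), Some(written)) => seeded != written,
            // New key, changed value, or deletion: keep it.
            _ => true,
        })
        .collect()
}
```

Filtering against the seed means the stitched result contains only what the update itself produced, so re-applying the relayer-supplied base payload never shows up as a write.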

Why

The normal UpdateClient path reads its base state from canonical enclave storage and commits each update in order. That is the safest default, but it prevents the service from evaluating a batch of related updates speculatively before deciding how their resulting write sets should be committed.

The explicit-state path lets the service execute UpdateClient against a complete base state supplied by the relayer, instead of always reading the next base state from canonical enclave storage. The execution result is kept in an isolated transaction first, validated against the expected ordered state chain, checked against the canonical first base state, and only then stitched into canonical state.

This provides the substrate for bounded speculative batch execution. The service does not rebase dependent units; every speculative unit must already carry its own complete explicit base state. The relayer is responsible for precomputing those base states off-service.

Impact

the service can execute explicit-state speculative batches with bounded concurrency and stricter safety checks
units that carry complete explicit base states are dispatched concurrently up to --max-speculative-concurrency
all ECALLs, including serial RPCs and speculative execution, share a single enclave-side ECALL gate (--max-enclave-concurrency), preventing TCS overrun
serial gRPC UpdateClient (unary + streaming) is serialized with speculative batches by the same KeyLockMap[client_id]; calls to proto_update_client outside the service path are outside this guarantee
before applying a stitched write set, the first unit's explicit base state must still match canonical storage for that client
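
The per-client serialization described above can be pictured with a small per-key lock registry. This is a sketch under assumptions: the real `KeyLockMap` in the service crate may be shaped differently, and `lock_for` / `with_client` are names invented for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-key lock registry; the PR's `KeyLockMap` may differ.
#[derive(Default)]
struct KeyLockMap {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl KeyLockMap {
    /// Return the lock for `client_id`, creating it on first use.
    fn lock_for(&self, client_id: &str) -> Arc<Mutex<()>> {
        let mut locks = self.locks.lock().unwrap_or_else(|e| e.into_inner());
        locks.entry(client_id.to_string()).or_default().clone()
    }

    /// Run `f` while holding the lock for `client_id`.
    fn with_client<T>(&self, client_id: &str, f: impl FnOnce() -> T) -> T {
        let lock = self.lock_for(client_id);
        // Recover from poisoning instead of propagating another holder's panic.
        let _guard = lock.lock().unwrap_or_else(|e| e.into_inner());
        f()
    }
}
```

Two callers that pass the same `client_id` contend on one mutex, while different clients proceed independently. Code that bypasses the map, as the description notes for direct `proto_update_client` calls, gets no such ordering.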

Validation

cargo fmt --check
cargo test -p enclave-api --lib
cargo test -p service --lib

Notes

Speculative units are explicit-state only: prev_height, client_state, and consensus_state must all be populated. The service validates that observed transitions form a linear chain before stitching effective write sets; it does not perform in-service rebase. Before the stitched write set is committed, the first unit's base state is also compared with canonical storage while holding the per-client update lock.
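
The linear-chain validation can be sketched as a state-id continuity check over adjacent units. The `UnitTransition` struct, the 32-byte ID width, and the function name are assumptions for illustration, not types from the PR.

```rust
/// Hypothetical summary of one executed unit: state IDs before and after the update.
#[derive(Debug, Clone, PartialEq, Eq)]
struct UnitTransition {
    prev_state_id: [u8; 32],
    post_state_id: [u8; 32],
}

/// Return Ok(()) if each unit starts from the state the previous unit produced,
/// otherwise Err with the index of the first unit that breaks the chain.
fn check_linear_chain(units: &[UnitTransition]) -> Result<(), usize> {
    for (i, pair) in units.windows(2).enumerate() {
        if pair[0].post_state_id != pair[1].prev_state_id {
            return Err(i + 1);
        }
    }
    Ok(())
}
```

A check like this only rejects; it never recomputes a later unit against a different base, which matches the "no in-service rebase" stance above.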

Production-grade multi-lane parallelism is not claimed by this PR. The parallel dispatch path is covered by unit tests; real-enclave E2E coverage depends on downstream configuration of both --max-speculative-concurrency and --max-enclave-concurrency.

The in-tree enclave/Enclave.config.xml TCSNum is the dev default; production enclave builds override it via the downstream enclaves repository.

All ECALLs are guarded by the enclave ECALL gate (--max-enclave-concurrency, default 4). This can make serial RPCs wait when the gate is saturated, but prevents exceeding the enclave TCS budget.
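
For readers unfamiliar with the gating idea, a counting gate of this kind can be sketched with a mutex and condition variable. This is a generic illustration, not the PR's `ECallGate`; the `Gate` type and `with_permit` method are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical counting gate; not the PR's actual ECALL gate.
struct Gate {
    free: Mutex<usize>,
    cv: Condvar,
}

impl Gate {
    fn new(permits: usize) -> Self {
        Gate { free: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Block until a permit is free, run `f`, then release the permit.
    fn with_permit<T>(&self, f: impl FnOnce() -> T) -> T {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        drop(free);

        // Release on drop so the permit is returned even if `f` panics.
        struct Release<'a>(&'a Gate);
        impl Drop for Release<'_> {
            fn drop(&mut self) {
                *self.0.free.lock().unwrap() += 1;
                self.0.cv.notify_one();
            }
        }
        let _release = Release(self);
        f()
    }
}
```

A gate like this bounds how many callers are inside `f` at once; it does not limit how many distinct threads pass through it over time.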

--max-speculative-concurrency controls how many speculative UpdateClient requests are started concurrently. Setting it less than or equal to --max-enclave-concurrency is recommended; if it is larger, extra speculative workers will wait on the ECALL gate.

--threads controls the Tokio runtime worker count only. It does not control enclave ECALL/TCS concurrency.

Copilot

Pull request overview

This PR extends the explicit-state execution path to support dependency-aware speculative UpdateClient batch execution with bounded concurrency, plus the underlying store/enclave plumbing needed to extract and later stitch speculative write-sets. It also adds a height-qualified client query and bumps SGX enclave TCS capacity for experiments.

Changes:

Add proto/Rust RPC + message types for explicit-state speculative UpdateClient batch execution, and add height to QueryClientRequest.
Introduce speculative/read transaction write-set extraction (take_write_set) via an overlay store layer, and add transaction intent (TxMode) + write-set stitching helpers in enclave-api.
Add service-side batch executor, per-client serialization, and bounded enclave-entry/speculative concurrency controls; raise enclave TCSNum.
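
To make the overlay idea concrete, here is a minimal sketch of an overlay that stages writes over a read-only base and hands them back through `take_write_set`. The `Overlay` type and its in-memory `BTreeMap` base are assumptions; the PR's `modules/store/src/overlay.rs` and RocksDB integration are not reproduced here.

```rust
use std::collections::BTreeMap;

/// Hypothetical overlay over a read-only base; `None` in `writes` marks a deletion.
struct Overlay<'a> {
    base: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    writes: BTreeMap<Vec<u8>, Option<Vec<u8>>>,
}

impl<'a> Overlay<'a> {
    fn new(base: &'a BTreeMap<Vec<u8>, Vec<u8>>) -> Self {
        Overlay { base, writes: BTreeMap::new() }
    }

    /// Reads see staged writes first and fall through to the base.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        match self.writes.get(key) {
            Some(staged) => staged.clone(),
            None => self.base.get(key).cloned(),
        }
    }

    fn set(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.writes.insert(key, Some(value));
    }

    fn remove(&mut self, key: &[u8]) {
        self.writes.insert(key.to_vec(), None);
    }

    /// Hand the accumulated writes to the caller, leaving the overlay empty.
    fn take_write_set(&mut self) -> BTreeMap<Vec<u8>, Option<Vec<u8>>> {
        std::mem::take(&mut self.writes)
    }
}
```

Because the base is only borrowed immutably, nothing staged in the overlay can reach canonical state until a caller applies the extracted write set explicitly.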

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
proto/src/prost/lcp.service.elc.v1.rs	Generated Rust types + gRPC stubs for new batch RPC and query height.
proto/definitions/lcp/service/elc/v1/tx.proto	Adds `ExecuteSpeculativeUpdateClientBatch` RPC and related messages.
proto/definitions/lcp/service/elc/v1/query.proto	Adds optional `height` to `QueryClientRequest`.
modules/store/src/transaction.rs	Adds `TxMode` and default `take_write_set` hook on `CommitStore`.
modules/store/src/rocksdb.rs	Implements overlay-backed read/speculative writes and `take_write_set` for RocksDB.
modules/store/src/overlay.rs	New overlay KV store to accumulate isolated speculative writes.
modules/store/src/memory.rs	Implements `take_write_set` for MemStore test backend.
modules/store/src/lib.rs	Exposes `overlay` module and defines `WriteSet` type alias.
modules/service/src/service.rs	Adds permit gating + per-client serialization infrastructure to AppService.
modules/service/src/lib.rs	Re-exports explicit-state batch types from the service crate.
modules/service/src/explicit_state.rs	New explicit-state speculative batch executor + conflict detection + tests.
modules/service/src/elc.rs	Wires new batch RPC into tonic server and applies enclave gating.
modules/service/Cargo.toml	Adds serde/bincode + test deps to support explicit-state batch logic/tests.
modules/enclave-api/src/lib.rs	Re-exports new speculative execution types.
modules/enclave-api/src/enclave.rs	Adds tx-mode helpers + write-set extraction and stitching (`apply_write_set`).
modules/enclave-api/src/api/primitive.rs	Adds explicit tx-mode execution and speculative command execution helper.
modules/enclave-api/src/api/command.rs	Adds speculative update-client execution with base-state seeding.
modules/enclave-api/src/api.rs	Re-exports speculative types from the API module.
modules/ecall-commands/src/msgs.rs	Maps new query `height` field into internal input type.
modules/ecall-commands/src/light_client.rs	Extends `QueryClientInput` to include optional `height`.
enclave/Enclave.config.xml	Raises `TCSNum` from 2 to 4.
enclave-modules/ecall-handler/src/light_client/query.rs	Uses requested height (or latest) when fetching consensus state.
Cargo.lock	Locks new dependencies pulled in by service/enclave-api changes.
app/src/commands/service.rs	Adds `--max-enclave-concurrency` and passes concurrency limits into service.
app/src/commands/elc.rs	Adds hidden debug/admin command to execute speculative batch from JSON.


Copilot

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.


Copilot

Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

modules/store/src/rocksdb.rs:285

CommitStore::commit on a speculative RocksDB transaction will currently return Ok(()) and silently drop the overlay writes because StoreTransaction::commit only commits Update transactions. This makes it easy for callers to accidentally lose speculative writes without noticing. Consider rejecting commit() for speculative transactions (return an error) or making commit semantics explicit for Speculative to avoid silent no-ops.
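
One way to act on this suggestion is to make `commit` fail loudly for speculative transactions. The sketch below uses toy types: `TxMode` variants, `Tx`, `CommitError`, and the `Vec`-backed store are all assumptions, not the definitions in `modules/store`.

```rust
/// Hypothetical transaction modes; the PR's `TxMode` may have different variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TxMode {
    Update,
    Read,
    Speculative,
}

#[derive(Debug, PartialEq, Eq)]
enum CommitError {
    NotSupported(&'static str),
}

/// Toy transaction carrying its staged key/value writes.
struct Tx {
    mode: TxMode,
    staged: Vec<(Vec<u8>, Vec<u8>)>,
}

/// Commit staged writes into `store`, rejecting modes whose writes must not
/// be persisted through this path.
fn commit(tx: Tx, store: &mut Vec<(Vec<u8>, Vec<u8>)>) -> Result<(), CommitError> {
    match tx.mode {
        TxMode::Update => {
            store.extend(tx.staged);
            Ok(())
        }
        // Read transactions have nothing to persist.
        TxMode::Read => Ok(()),
        // Fail instead of silently dropping the overlay writes.
        TxMode::Speculative => Err(CommitError::NotSupported(
            "commit is not available for speculative transactions; use take_write_set",
        )),
    }
}
```

Returning an error here mirrors the guard on `take_write_set`, which already refuses non-speculative transactions.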

Serial gRPC UpdateClient (unary + streaming) is serialized with speculative batches by the same KeyLockMap[client_id]. Calls to proto_update_client outside the service path (for example app/src/commands/elc.rs) are outside this guarantee.

Add `EcallPool`, a long-lived pool of `--max-enclave-concurrency` OS threads dedicated to executing ECALLs. All gRPC handlers in `elc.rs` (`create_client`, `update_client`, `update_client_stream`, `aggregate_messages`, `verify_membership`, `verify_non_membership`, `client`) now dispatch their ECALLs onto the pool via `tokio::task::spawn_blocking(|| pool.run(|| ...))` instead of either calling the enclave directly from a tonic async handler (which would bind a TCS to every tokio worker that ever ECALLs) or going through the default `tokio::task::spawn_blocking` pool (which can grow to hundreds of cached threads, each binding its own TCS).

Under `TCSPolicy=BIND` the Intel SGX SDK pins one TCS to each ECALL-issuing host thread for the thread's lifetime. The `with_ecall_permit` semaphore in `enclave-api` only bounds *concurrent* ECALLs, not the cumulative set of distinct threads that have ever ECALLed. Once that cumulative set exceeds `TCSNum`, any new ECALL-issuing thread fails with `SGX_ERROR_OUT_OF_TCS`. This is what was observed during a long-running Arbitrum activate-client run on bcc-dev even with `--max-enclave-concurrency=4` and `TCSNum=8`.

By pinning ECALL execution to `EcallPool`'s `size` permanent workers, the cumulative TCS bindings created by the service stay at `size` for the entire process lifetime, restoring the invariant the existing CLI guidance assumed.

Speculative scheduler's own `thread::scope` workers (`speculative/scheduler.rs`) are intentionally left as-is in this change. Their threads terminate at scope exit so the SDK destructor releases their TCS bindings on each batch; addressing that path (folding speculative dispatch into the same pool) is a follow-up so this PR stays small.

CLI help text for `--max-enclave-concurrency` is updated: the prior wording recommended `cap == TCSNum`, which is unsafe under BIND once any unbounded ECALL-thread source exists. The new wording describes the flag as the pool size and recommends `cap < TCSNum` to leave headroom for the SDK runtime and the speculative path.

Tests:
- `cargo test -p service` (37 tests, includes 3 new `ecall_pool` tests asserting concurrency cap, result delivery, and that observed worker-thread ids are bounded by pool size).
- `cargo check --workspace` is clean.
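
The pool idea in this commit message can be sketched with standard-library threads and channels. This is an illustration under assumptions, not the PR's `EcallPool`: the `FixedPool` name, the `Job` alias, and the shutdown behavior are invented for the example, and only the `run` call shape follows the commit text.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical fixed-size pool; the PR's `EcallPool` may differ in API and shutdown.
struct FixedPool {
    jobs: Mutex<Sender<Job>>,
}

impl FixedPool {
    fn new(size: usize) -> Self {
        let (jobs, rx) = channel::<Job>();
        let rx: Arc<Mutex<Receiver<Job>>> = Arc::new(Mutex::new(rx));
        for _ in 0..size {
            let rx = Arc::clone(&rx);
            // Workers are permanent: they exit only when the sender is dropped.
            thread::spawn(move || loop {
                let job = match rx.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break,
                };
                job();
            });
        }
        FixedPool { jobs: Mutex::new(jobs) }
    }

    /// Run `f` on a pool worker and block the caller until its result arrives.
    fn run<T: Send + 'static>(&self, f: impl FnOnce() -> T + Send + 'static) -> T {
        let (tx, rx) = channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(f());
        });
        self.jobs.lock().unwrap().send(job).expect("pool workers are gone");
        rx.recv().expect("worker dropped the job without a result")
    }
}
```

Because only the `size` worker threads ever execute jobs, the set of threads that would enter the enclave stays fixed no matter how many caller threads invoke `run`.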

The previous fix in `9e75101` routed gRPC-handler ECALLs (update_client, update_client_stream, create_client, aggregate_messages, verify_*, client) through the long-lived `EcallPool` workers, but left the speculative scheduler's `thread::scope` workers in `modules/service/src/speculative/scheduler.rs` issuing their own ECALLs directly on those scoped threads. Under `TCSPolicy=BIND` each scope worker still bound a TCS for the duration of a batch, so peak TCS occupancy was `pool_size + scope_size` whenever a speculative batch overlapped with any non-speculative gRPC ECALL. In the deployed configuration, `pool_size = scope_size = 4` and `TCSNum = 8`, which left no headroom for the SDK runtime and reintroduced the `SGX_ERROR_OUT_OF_TCS` risk under service-mode peak load.

This change hands off the per-unit ECALL inside the scope worker to `AppService.ecall_pool.run(...)`. The scope worker continues to dequeue work, hold the per-stream `speculative_request_permit`, and record the result; it just no longer enters the enclave on its own thread. The ECALL executes on one of the permanent `EcallPool` workers, whose TCS binding is stable for the lifetime of the LCP process.

Peak TCS occupancy after this change:

pool_size + 0 (scope workers no longer ECALL) = pool_size

`--max-enclave-concurrency` (= `EcallPool` size) is now the single structural bound on host threads that ever enter the enclave, regardless of how many speculative scope workers exist at any moment.

Test impact: `streaming_speculative_batch_parallelizes_complete_base_state_units` now requires the `AppService` ecall_concurrency to be at least the per-stream speculative cap, since the pool is the effective ECALL concurrency. Bumped the test fixture from `ecall_concurrency=1` to `=3` to match `SpeculativeService::new(3)`. All other speculative tests already pass at `ecall_concurrency=1` because they assert `observed_max_in_flight() >= 1` or `== 1`, which still holds.

Tests:
- `cargo test -p service`: 37 passed, 0 failed.
- `cargo check --workspace --tests`: clean.

After `6afd485` routed both gRPC handler ECALLs and speculative scheduler ECALLs through `service::EcallPool`, the `ECallGate` semaphore in `modules/enclave-api/src/enclave.rs` became structurally redundant: the only host threads that ever issue an ECALL are the permanent EcallPool workers, whose count is fixed at the configured `--max-enclave-concurrency`. The host-side semaphore had nothing left to gate.

Removed:
- `ECallGate`, `ECallGateState`, `ECallPermitGuard`, and their tests in `modules/enclave-api/src/enclave.rs`.
- The `ecall_concurrency` argument of `Enclave::new` / `Enclave::create`.
- `EnclaveInfo::with_ecall_permit` trait method.
- `EnclaveLoader::load_with_ecall_concurrency` and its implementation; the remaining `EnclaveLoader::load` no longer needs an `ecall_concurrency` parameter.
- The semaphore acquire/release call site in `modules/enclave-api/src/api/primitive.rs::execute_prepared_command`.

Updated:
- `app/src/commands/service.rs` switches to `enclave_loader.load(...)`, and the `--max-enclave-concurrency` CLI help text is rewritten to describe the flag as the EcallPool size only.
- `tests/integration/src/lib.rs` drops the now-removed argument from `Enclave::create`.

Concurrency control after this change:

EcallPool size (--max-enclave-concurrency)
= number of permanent host threads that ever ECALL
= max concurrent ECALLs in flight
= max cumulative TCS bindings under TCSPolicy=BIND

Set this to a value at most `TCSNum`; the default 4 leaves room for the SDK runtime when `TCSNum=8`.

Tests:
- `cargo test -p service`: 37 passed.
- `cargo test -p enclave-api`: 3 passed.
- `cargo check --workspace`: clean.
- The integration-test crate compile-passes; runtime test `tests::test_lcp` requires actual SGX hardware and is not affected by this change.

The stitch-phase base verification recomputed the expected prev_state_id with gen_state_id_from_any over the raw caller-supplied Anys. Light clients derive state IDs from a canonicalized client state (e.g. latest_height/frozen reset before hashing), and that canonicalization is ELC-specific and only available inside the enclave, so the recomputed hash never matches the enclave-observed prev_state_id for any ELC whose canonicalization is not the identity. This made every explicit-state speculative batch fail with BaseStateMismatch at the stitch phase. Drop the raw recompute and rely on the remaining checks: the supplied (client_state, consensus_state) bytes are pinned to the canonical store, and the enclave-observed prev_state_id must match the height-indexed state_id stored by the in-enclave light client at create/update time.
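
The mismatch described here can be reproduced with a toy model. Everything in the sketch is an assumption made for illustration: the `ClientState` layout, the `canonicalize` rule (taken only from the "latest_height/frozen reset" example in the message), and the use of `DefaultHasher` as a stand-in for the real state-id hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy client state; field names follow the commit message, the layout is hypothetical.
#[derive(Clone, Hash)]
struct ClientState {
    chain_id: String,
    latest_height: u64,
    frozen: bool,
}

/// Reset the fields that this toy light client excludes from its state ID.
fn canonicalize(mut cs: ClientState) -> ClientState {
    cs.latest_height = 0;
    cs.frozen = false;
    cs
}

/// Stand-in digest (DefaultHasher is not a cryptographic hash).
fn digest(cs: &ClientState) -> u64 {
    let mut h = DefaultHasher::new();
    cs.hash(&mut h);
    h.finish()
}

/// What a host-side recompute over raw caller-supplied state would produce.
fn raw_state_id(cs: &ClientState) -> u64 {
    digest(cs)
}

/// What the in-enclave light client would produce after canonicalization.
fn canonical_state_id(cs: &ClientState) -> u64 {
    digest(&canonicalize(cs.clone()))
}
```

In this model two states that differ only in `latest_height` share a canonical ID but not a raw one, which is why a raw recompute outside the enclave cannot be compared with the enclave-observed `prev_state_id`.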


Copilot

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 2 comments.

+        if tx.kind != MemTxKind::Speculative {
+            return Err(crate::Error::not_supported_operation(
+                "take_write_set is only available for speculative transactions".to_string(),
+            ));
+        }


+        prev_height: &Height,
+        _client_state: &Any,
+        _consensus_state: &Any,
+        prev_state_id: Option<&[u8]>,
+    ) -> Result<()>



ikehara changed the title ~~[codex] explicit-state batch execution and base-state seeding~~ prototype: Explicit-state batch execution and base-state seeding Apr 6, 2026

ikehara requested a review from Copilot April 6, 2026 08:34

Copilot started reviewing on behalf of ikehara April 6, 2026 08:35 View session

Copilot AI reviewed Apr 6, 2026


ikehara force-pushed the feature/explicit-state-parallelization branch from be0f1b6 to c3077f5 Compare April 16, 2026 09:45

ikehara requested a review from Copilot April 22, 2026 08:48

Copilot started reviewing on behalf of ikehara April 22, 2026 08:49 View session

Copilot AI reviewed Apr 22, 2026


Comment thread modules/service/src/explicit_state.rs Outdated

Comment thread modules/enclave-api/src/api/command.rs Outdated

ikehara self-assigned this Apr 23, 2026

ikehara changed the title ~~prototype: Explicit-state batch execution and base-state seeding~~ Explicit-state batch execution and base-state seeding Apr 28, 2026

Kiyoshi Nakao added 6 commits May 12, 2026 15:23

types: centralize light client store keys

48448fd

store: add speculative transaction overlays

62cdaf3

enclave-api: add speculative command support

e19a38e

proto: add speculative update client stream RPC

478f836

service: add speculative update client execution

3823fdc

app: configure speculative service concurrency

deb1db3

ikehara force-pushed the feature/explicit-state-parallelization branch from ecc4d85 to deb1db3 Compare May 12, 2026 06:31

Fix speculative service formatting

4c395d8

ikehara requested a review from Copilot May 12, 2026 09:57

Copilot started reviewing on behalf of ikehara May 12, 2026 09:57 View session

Copilot AI reviewed May 12, 2026


Comment thread modules/enclave-api/src/enclave.rs

Comment thread modules/service/src/speculative/scheduler.rs Outdated

Kiyoshi Nakao added 2 commits May 14, 2026 10:08

service: bound resident speculative headers

daad0e5

service: defer speculative header digest

3eaae1b

ikehara marked this pull request as ready for review May 18, 2026 07:27

Kiyoshi Nakao added 4 commits May 22, 2026 18:19

Reject incomplete non-leading speculative base states

98ca960

Require complete speculative base states

9de88e3

Require seeded speculative base state

6885c75

ikehara force-pushed the feature/explicit-state-parallelization branch from 7e471ca to d2ae4dc Compare May 29, 2026 06:32

Return effective speculative write sets

0ab323d

Kiyoshi Nakao added 13 commits June 2, 2026 15:00

Preserve first speculative scheduler failure

8ad0186

service: validate speculative base by stored height

8f3acd3

service: validate speculative base by stored state id

6142089

service: store light-client state id on init

bfe621d

Verify speculative base client state

55ae241

Add idle timeout to speculative stream

4f49954

Align memstore speculative write extraction

b46a2eb

Reject stale speculative base client state

a7673df

Prevent partial speculative stream commits

d9425c4

Tighten speculative stream resource handling

485e348

ikehara force-pushed the feature/explicit-state-parallelization branch from 68ab7bb to 56fcda1 Compare June 11, 2026 10:56

Kiyoshi Nakao added 9 commits June 11, 2026 19:56

fmt: apply cargo fmt to pre-existing formatting drift

a187699

service: box streaming speculative batch input unit to satisfy clippy

9189c57

service: prevent speculative stream hang on ECALL worker panic

9aa6643

service: recover per-client update lock from poisoning

586d6f5

service: time out speculative header memory reservation waits

813465c

enclave-api: report missing stored state_id distinctly from mismatch

a3b76bc

service: document speculative base binding scope for non-first units

5b7f330

enclave-api: drop inappropriate bincode checks from speculative base …

c6fa4ce

…verify

The state_id hash check already covers canonical equivalence; the bincode byte-equality checks on client_state / consensus_state are over-strict on encoding-only differences in the supplied Anys.

ikehara requested a review from siburu June 16, 2026 01:10

ikehara mentioned this pull request Jun 16, 2026

Source on-chain explicit-state base payload from LCP canonical store datachainlab/lcp-go#74

Closed

ikehara requested a review from Copilot June 16, 2026 01:29

Copilot started reviewing on behalf of ikehara June 16, 2026 01:30 View session

Copilot AI reviewed Jun 16, 2026


Kiyoshi Nakao added 2 commits June 16, 2026 15:03

enclave-api: tighten verify_expected_base_state_in_tx comments

c437e30

Drop over-specific examples from the comment and the operator hint from the missing-entry error; keep the canonical-equivalence rationale.

service: raise speculative batch stream idle timeout to 300s

c441ed7

The previous 60s budget fires during normal per-unit build pauses on slow upstream provers; raise the cap so transient producer pacing does not close the stream mid-batch.
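
An idle timeout of this kind resets after every received unit, so it bounds the gap between units rather than the total stream duration. The sketch below shows that behavior with a standard-library channel; the service's actual stream type is not shown in this page, and `StreamEnd` and `drain_with_idle_timeout` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
enum StreamEnd {
    Completed,
    IdleTimeout,
}

/// Drain `rx` into `out`; the idle budget restarts after every received unit.
fn drain_with_idle_timeout<T>(rx: &Receiver<T>, idle: Duration, out: &mut Vec<T>) -> StreamEnd {
    loop {
        match rx.recv_timeout(idle) {
            Ok(unit) => out.push(unit),
            // The producer closed the stream: normal completion.
            Err(RecvTimeoutError::Disconnected) => return StreamEnd::Completed,
            // No unit arrived within the idle budget.
            Err(RecvTimeoutError::Timeout) => return StreamEnd::IdleTimeout,
        }
    }
}
```

With this shape, a slow producer that pauses for less than the budget between units never trips the timeout, however long the whole batch takes.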

ikehara wants to merge 48 commits into main from feature/explicit-state-parallelization
