Skip to content

Explicit-state batch execution and base-state seeding#143

Open
ikehara wants to merge 48 commits into
mainfrom
feature/explicit-state-parallelization
Open

Explicit-state batch execution and base-state seeding#143
ikehara wants to merge 48 commits into
mainfrom
feature/explicit-state-parallelization

Conversation

@ikehara

@ikehara ikehara commented Apr 6, 2026

Copy link
Copy Markdown

What changed

  • introduce explicit-state speculative UpdateClient execution: run updates against a complete explicit base state supplied by the relayer in isolated transactions, then stitch effective write sets into canonical state
  • add bounded enclave-entry control and speculative batch / stream execution infrastructure for the explicit-state path
  • reject non-linear same-client unit chains at stitch via state-id continuity checks
  • serialize speculative batches per client_id; units within a batch may execute concurrently only when each unit carries a complete explicit base state (prev_height + client_state + consensus_state)
  • move speculative execution off the global enclave mutex while keeping client-scoped serialization
  • seed explicit-state base payloads into speculative transactions and return only effective write sets with seed-equal entries removed
  • verify the first speculative unit's explicit base state against canonical client_state and consensus_state(prev_height) under the per-client update lock before stitching
  • add service-level coverage for streaming scheduling, linear-chain validation, and concurrent speculative dispatch

Why

The normal UpdateClient path reads its base state from canonical enclave storage and commits each update in order. That is the safest default, but it prevents the service from evaluating a batch of related updates speculatively before deciding how their resulting write sets should be committed.

The explicit-state path lets the service execute UpdateClient against a complete base state supplied by the relayer, instead of always reading the next base state from canonical enclave storage. The execution result is kept in an isolated transaction first, validated against the expected ordered state chain, checked against the canonical first base state, and only then stitched into canonical state.

This provides the substrate for bounded speculative batch execution. The service does not rebase dependent units; every speculative unit must already carry its own complete explicit base state. The relayer is responsible for precomputing those base states off-service.

Impact

  • the service can execute explicit-state speculative batches with bounded concurrency and stricter safety checks
  • units that carry complete explicit base states are dispatched concurrently up to --max-speculative-concurrency
  • all ECALLs, including serial RPCs and speculative execution, share a single enclave-side ECALL gate (--max-enclave-concurrency), preventing TCS overrun
  • serial gRPC UpdateClient (unary + streaming) is serialized with speculative batches by the same KeyLockMap[client_id]; calls to proto_update_client outside the service path are outside this guarantee
  • before applying a stitched write set, the first unit's explicit base state must still match canonical storage for that client

Validation

  • cargo fmt --check
  • cargo test -p enclave-api --lib
  • cargo test -p service --lib

Notes

Speculative units are explicit-state only: prev_height, client_state, and consensus_state must all be populated. The service validates that observed transitions form a linear chain before stitching effective write sets; it does not perform in-service rebase. Before the stitched write set is committed, the first unit's base state is also compared with canonical storage while holding the per-client update lock.

Production-grade multi-lane parallelism is not claimed by this PR. The parallel dispatch path is covered by unit tests; real-enclave E2E coverage depends on downstream configuration of both --max-speculative-concurrency and --max-enclave-concurrency.

The in-tree enclave/Enclave.config.xml TCSNum is the dev default; production enclave builds override it via the downstream enclaves repository.

All ECALLs are guarded by the enclave ECALL gate (--max-enclave-concurrency, default 4). This can make serial RPCs wait when the gate is saturated, but prevents exceeding the enclave TCS budget.

--max-speculative-concurrency controls how many speculative UpdateClient requests are started concurrently. Setting it less than or equal to --max-enclave-concurrency is recommended; if it is larger, extra speculative workers will wait on the ECALL gate.

--threads controls the Tokio runtime worker count only. It does not control enclave ECALL/TCS concurrency.

@ikehara ikehara changed the title [codex] explicit-state batch execution and base-state seeding prototype: Explicit-state batch execution and base-state seeding Apr 6, 2026
@ikehara ikehara requested a review from Copilot April 6, 2026 08:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the explicit-state execution path to support dependency-aware speculative UpdateClient batch execution with bounded concurrency, plus the underlying store/enclave plumbing needed to extract and later stitch speculative write-sets. It also adds a height-qualified client query and bumps SGX enclave TCS capacity for experiments.

Changes:

  • Add proto/Rust RPC + message types for explicit-state speculative UpdateClient batch execution, and add height to QueryClientRequest.
  • Introduce speculative/read transaction write-set extraction (take_write_set) via an overlay store layer, and add transaction intent (TxMode) + write-set stitching helpers in enclave-api.
  • Add service-side batch executor, per-client serialization, and bounded enclave-entry/speculative concurrency controls; raise enclave TCSNum.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
proto/src/prost/lcp.service.elc.v1.rs Generated Rust types + gRPC stubs for new batch RPC and query height.
proto/definitions/lcp/service/elc/v1/tx.proto Adds ExecuteSpeculativeUpdateClientBatch RPC and related messages.
proto/definitions/lcp/service/elc/v1/query.proto Adds optional height to QueryClientRequest.
modules/store/src/transaction.rs Adds TxMode and default take_write_set hook on CommitStore.
modules/store/src/rocksdb.rs Implements overlay-backed read/speculative writes and take_write_set for RocksDB.
modules/store/src/overlay.rs New overlay KV store to accumulate isolated speculative writes.
modules/store/src/memory.rs Implements take_write_set for MemStore test backend.
modules/store/src/lib.rs Exposes overlay module and defines WriteSet type alias.
modules/service/src/service.rs Adds permit gating + per-client serialization infrastructure to AppService.
modules/service/src/lib.rs Re-exports explicit-state batch types from the service crate.
modules/service/src/explicit_state.rs New explicit-state speculative batch executor + conflict detection + tests.
modules/service/src/elc.rs Wires new batch RPC into tonic server and applies enclave gating.
modules/service/Cargo.toml Adds serde/bincode + test deps to support explicit-state batch logic/tests.
modules/enclave-api/src/lib.rs Re-exports new speculative execution types.
modules/enclave-api/src/enclave.rs Adds tx-mode helpers + write-set extraction and stitching (apply_write_set).
modules/enclave-api/src/api/primitive.rs Adds explicit tx-mode execution and speculative command execution helper.
modules/enclave-api/src/api/command.rs Adds speculative update-client execution with base-state seeding.
modules/enclave-api/src/api.rs Re-exports speculative types from the API module.
modules/ecall-commands/src/msgs.rs Maps new query height field into internal input type.
modules/ecall-commands/src/light_client.rs Extends QueryClientInput to include optional height.
enclave/Enclave.config.xml Raises TCSNum from 2 to 4.
enclave-modules/ecall-handler/src/light_client/query.rs Uses requested height (or latest) when fetching consensus state.
Cargo.lock Locks new dependencies pulled in by service/enclave-api changes.
app/src/commands/service.rs Adds --max-enclave-concurrency and passes concurrency limits into service.
app/src/commands/elc.rs Adds hidden debug/admin command to execute speculative batch from JSON.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread modules/service/src/service.rs Outdated
Comment thread modules/service/src/service.rs Outdated
Comment thread modules/service/src/service.rs Outdated
Comment thread modules/service/src/explicit_state.rs Outdated
Comment thread modules/service/src/explicit_state.rs Outdated
Comment thread modules/enclave-api/src/api/command.rs Outdated
Comment thread app/src/commands/service.rs Outdated
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from be0f1b6 to c3077f5 Compare April 16, 2026 09:45
@ikehara ikehara requested a review from Copilot April 22, 2026 08:48

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread modules/service/src/explicit_state.rs Outdated
Comment thread modules/enclave-api/src/api/command.rs Outdated
@ikehara ikehara self-assigned this Apr 23, 2026
@ikehara ikehara changed the title prototype: Explicit-state batch execution and base-state seeding Explicit-state batch execution and base-state seeding Apr 28, 2026
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from ecc4d85 to deb1db3 Compare May 12, 2026 06:31

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 38 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

modules/store/src/rocksdb.rs:285

  • CommitStore::commit on a speculative RocksDB transaction will currently return Ok(()) and silently drop the overlay writes because StoreTransaction::commit only commits Update transactions. This makes it easy for callers to accidentally lose speculative writes without noticing. Consider rejecting commit() for speculative transactions (return an error) or making commit semantics explicit for Speculative to avoid silent no-ops.

Comment thread modules/enclave-api/src/enclave.rs
Comment thread modules/service/src/speculative/scheduler.rs Outdated
@ikehara ikehara marked this pull request as ready for review May 18, 2026 07:27
Kiyoshi Nakao added 4 commits May 22, 2026 18:19
Serial gRPC UpdateClient (unary + streaming) is serialized with speculative batches by the same KeyLockMap[client_id]. Calls to proto_update_client outside the service path (for example app/src/commands/elc.rs) are outside this guarantee.
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from 7e471ca to d2ae4dc Compare May 29, 2026 06:32
Kiyoshi Nakao added 13 commits June 2, 2026 15:00
Add `EcallPool`, a long-lived pool of `--max-enclave-concurrency` OS
threads dedicated to executing ECALLs. All gRPC handlers in `elc.rs`
(`create_client`, `update_client`, `update_client_stream`,
`aggregate_messages`, `verify_membership`, `verify_non_membership`,
`client`) now dispatch their ECALLs onto the pool via
`tokio::task::spawn_blocking(|| pool.run(|| ...))` instead of either
calling the enclave directly from a tonic async handler (which would
bind a TCS to every tokio worker that ever ECALLs) or going through the
default `tokio::task::spawn_blocking` pool (which can grow to hundreds
of cached threads, each binding its own TCS).

Under `TCSPolicy=BIND` the Intel SGX SDK pins one TCS to each
ECALL-issuing host thread for the thread's lifetime. The
`with_ecall_permit` semaphore in `enclave-api` only bounds *concurrent*
ECALLs, not the cumulative set of distinct threads that have ever
ECALLed. Once that cumulative set exceeds `TCSNum`, any new
ECALL-issuing thread fails with `SGX_ERROR_OUT_OF_TCS`. This is what
was observed during a long-running Arbitrum activate-client run on
bcc-dev even with `--max-enclave-concurrency=4` and `TCSNum=8`.

By pinning ECALL execution to `EcallPool`'s `size` permanent workers,
the cumulative TCS bindings created by the service stay at `size` for
the entire process lifetime, restoring the invariant the existing CLI
guidance assumed.

Speculative scheduler's own `thread::scope` workers
(`speculative/scheduler.rs`) are intentionally left as-is in this
change. Their threads terminate at scope exit so the SDK destructor
releases their TCS bindings on each batch; addressing that path
(folding speculative dispatch into the same pool) is a follow-up so
this PR stays small.

CLI help text for `--max-enclave-concurrency` is updated: the prior
wording recommended `cap == TCSNum`, which is unsafe under BIND once
any unbounded ECALL-thread source exists. The new wording describes
the flag as the pool size and recommends `cap < TCSNum` to leave
headroom for the SDK runtime and the speculative path.

Tests:
- `cargo test -p service` (37 tests, includes 3 new `ecall_pool` tests
  asserting concurrency cap, result delivery, and that observed
  worker-thread ids are bounded by pool size).
- `cargo check --workspace` is clean.
The previous fix in `9e75101` routed gRPC-handler ECALLs (update_client,
update_client_stream, create_client, aggregate_messages, verify_*,
client) through the long-lived `EcallPool` workers, but left the
speculative scheduler's `thread::scope` workers in
`modules/service/src/speculative/scheduler.rs` issuing their own ECALLs
directly on those scoped threads. Under `TCSPolicy=BIND` each scope
worker still bound a TCS for the duration of a batch, so peak TCS
occupancy was `pool_size + scope_size` whenever a speculative batch
overlapped with any non-speculative gRPC ECALL. For the deployed
configuration `pool_size = scope_size = 4` and `TCSNum = 8`, leaving no
headroom for the SDK runtime and reintroducing `SGX_ERROR_OUT_OF_TCS`
risk under service-mode peak load.

This change hands off the per-unit ECALL inside the scope worker to
`AppService.ecall_pool.run(...)`. The scope worker continues to dequeue
work, hold the per-stream `speculative_request_permit`, and record the
result; it just no longer enters the enclave on its own thread. The
ECALL executes on one of the permanent `EcallPool` workers, whose TCS
binding is stable for the lifetime of the LCP process.

Peak TCS occupancy after this change:

    pool_size + 0 (scope workers no longer ECALL) = pool_size

`--max-enclave-concurrency` (= `EcallPool` size) is now the single
structural bound on host threads that ever enter the enclave,
regardless of how many speculative scope workers exist at any moment.

Test impact: `streaming_speculative_batch_parallelizes_complete_base_state_units`
now requires the `AppService` ecall_concurrency to be at least the
per-stream speculative cap, since the pool is the effective ECALL
concurrency. Bumped the test fixture from `ecall_concurrency=1` to `=3`
to match `SpeculativeService::new(3)`. All other speculative tests
already pass at `ecall_concurrency=1` because they assert
`observed_max_in_flight() >= 1` or `== 1`, which still holds.

Tests:
- `cargo test -p service` — 37 passed, 0 failed.
- `cargo check --workspace --tests` — clean.
After `6afd485` routed both gRPC handler ECALLs and speculative
scheduler ECALLs through `service::EcallPool`, the `ECallGate`
semaphore in `modules/enclave-api/src/enclave.rs` became structurally
redundant: the only host threads that ever issue an ECALL are the
permanent EcallPool workers, whose count is fixed at the configured
`--max-enclave-concurrency`. The host-side semaphore had nothing left
to gate.

Removed:
- `ECallGate`, `ECallGateState`, `ECallPermitGuard`, and their tests
  in `modules/enclave-api/src/enclave.rs`.
- `Enclave::new` / `Enclave::create` no longer take an
  `ecall_concurrency` argument.
- `EnclaveInfo::with_ecall_permit` trait method.
- `EnclaveLoader::load_with_ecall_concurrency` and its implementation;
  the remaining `EnclaveLoader::load` no longer needs an
  `ecall_concurrency` parameter.
- The semaphore acquire/release call site in
  `modules/enclave-api/src/api/primitive.rs::execute_prepared_command`.

Updated:
- `app/src/commands/service.rs` switches `enclave_loader.load(...)`,
  and the `--max-enclave-concurrency` CLI help text is rewritten to
  describe the flag as the EcallPool size only.
- `tests/integration/src/lib.rs` drops the now-removed argument from
  `Enclave::create`.

Concurrency control after this change:

    EcallPool size (--max-enclave-concurrency)
      = number of permanent host threads that ever ECALL
      = max concurrent ECALLs in flight
      = max cumulative TCS bindings under TCSPolicy=BIND

Set this to a value at most `TCSNum`; the default 4 leaves room for
the SDK runtime when `TCSNum=8`.

Tests:
- `cargo test -p service` — 37 passed.
- `cargo test -p enclave-api` — 3 passed.
- `cargo check --workspace` — clean.
- The integration-test crate compile-passes; runtime test
  `tests::test_lcp` requires actual SGX hardware and is not affected
  by this change.
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from 68ab7bb to 56fcda1 Compare June 11, 2026 10:56
Kiyoshi Nakao added 9 commits June 11, 2026 19:56
The stitch-phase base verification recomputed the expected prev_state_id
with gen_state_id_from_any over the raw caller-supplied Anys. Light
clients derive state IDs from a canonicalized client state (e.g.
latest_height/frozen reset before hashing), and that canonicalization is
ELC-specific and only available inside the enclave, so the recomputed
hash never matches the enclave-observed prev_state_id for any ELC whose
canonicalization is not the identity. This made every explicit-state
speculative batch fail with BaseStateMismatch at the stitch phase.

Drop the raw recompute and rely on the remaining checks: the supplied
(client_state, consensus_state) bytes are pinned to the canonical store,
and the enclave-observed prev_state_id must match the height-indexed
state_id stored by the in-enclave light client at create/update time.
…verify

The state_id hash check already covers canonical equivalence; the
bincode byte-equality checks on client_state / consensus_state are
over-strict on encoding-only differences in the supplied Anys.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 40 changed files in this pull request and generated 2 comments.

Comment on lines +178 to +182
if tx.kind != MemTxKind::Speculative {
return Err(crate::Error::not_supported_operation(
"take_write_set is only available for speculative transactions".to_string(),
));
}
Comment on lines +191 to +195
prev_height: &Height,
_client_state: &Any,
_consensus_state: &Any,
prev_state_id: Option<&[u8]>,
) -> Result<()>
Kiyoshi Nakao added 2 commits June 16, 2026 15:03
Drop over-specific examples from the comment and the operator hint
from the missing-entry error; keep the canonical-equivalence rationale.
The previous 60s budget fires during normal per-unit build pauses on
slow upstream provers; raise the cap so transient producer pacing
does not close the stream mid-batch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants