relay: stream explicit-state source units for speculative update-client#68
ikehara wants to merge 29 commits into
Pull request overview
This PR extends the relayer’s explicit-state update-client flow to properly exercise multi-header Tendermint updates (as a single-lane sequential chain), adds a Tendermint multi-header collector, and refreshes local E2E tooling to run the tm2tm explicit-state path again.
Changes:
- Add explicit-state planning + speculative batch execution support (lane/unit planning, deferred base-state resolution, and integration tests).
- Add Tendermint multi-header source collection with source-header capping and runtime diagnostics for lane limiting.
- Update E2E scripts and simapp build tooling (Go/Docker version bumps, configurable sleeps, new E2E flags).
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/e2e/cases/tm2tm/scripts/test-elc-cmd | Makes the ELC update sleep configurable via env var. |
| scripts/run_e2e_test.sh | Adds explicit-state CLI flags + configurable LCP thread/concurrency knobs. |
| simapp/go.mod | Bumps simapp module Go version and updates indirect deps. |
| simapp/go.sum | Updates checksums after dependency refresh. |
| simapp/Dockerfile | Updates Go toolchain image for simapp builds. |
| relay/prover.go | Integrates multi-header collection and explicit-state speculative batch execution into update flow. |
| relay/lcp.go | Threads counterparty context through EKI persistence/removal calls. |
| relay/db.go | Moves EKI persistence to per-client subdirectories and changes removal behavior to delete the cache root. |
| relay/client.go | Stores the underlying grpc.ClientConn for raw Invoke calls. |
| relay/explicit_state_update_client.go | Adds explicit-state lane planning, base-state resolution logic, and query helpers. |
| relay/explicit_state_source_header_units.go | Adds source-header unit collection + Tendermint multi-header collector + lane-limit diagnostics. |
| relay/explicit_state_plan.go | Adds explicit-state unit/lane planning and batch execution wrapper. |
| relay/explicit_state_batch.go | Defines speculative batch request/response message types and an Invoke helper. |
| relay/explicit_state_header_units.go | Builds header units with trusted-height extraction. |
| relay/*_test.go (new) | Adds unit + integration coverage for planning, collection, and batch invocation. |
13f1f66 to d3a5bd7
2fd975e to 6d1fdb0
Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.
52a84a6 to f324655
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
664a69b to 246ef67

    // Register standard grpc.health.v1 liveness service. The overall ("")
    // status reports SERVING for as long as this gRPC server is accepting
    // requests.
    healthSrv := health.NewServer()
    healthpb.RegisterHealthServer(grpcServer, healthSrv)

Please restore the health endpoint if the removal was not intentional.
Restored the gRPC health service registration in relay/elcupdater/module/service.go. This removal was unintentional.

    )

    type LCPServiceClient struct {
    	conn *grpc.ClientConn

conn is added but not used at all.
Removed the unused conn field from LCPServiceClient. The explicit-state path no longer depends on storing the raw client connection.

    func unitIDForError(unit *SpeculativeUpdateClientUnit) string {
    	if unit == nil {
    		return ""
    	}
    	return unit.UnitId
    }

This function is not used at all.
Removed the unused helper from relay/explicit_state_batch.go.

    sendErrAfter int
    sendErr      error

These two variables are never assigned a value.
Cleaned up this test path so the previously unset variables are no longer present.

    closeAndRecvCalled bool
    closeSendCalled    bool

These two variables are set to true but never read.
Cleaned up this test path so the write-only variables are no longer present.

    // Serial fallback is allowed only for capability failures and explicit-state
    // prediction mismatches that indicate this ELC cannot provide a stable ordered
    // chain base state for the current attempt. Ordered-chain merge/write-set
    // failures must stay visible so we do not silently mask explicit-state bugs as
    // a successful serial update-client.
    func shouldFallbackToSerialUpdateClient(err error) bool {

I think that serial fallback is unnecessary and should be avoided because it can mask critical configuration failures.
Addressed. The automatic serial fallback has been removed. Speculative batch errors now cancel/drain the source-unit producer and return the error instead of masking it with serial execution. Incomplete explicit-state base payloads now hard-fail as well.

    DefaultMaxChunkSize                = 4*1024*1024 - 1024
    MaxSpeculativeBatchHeaderChunkSize = DefaultMaxChunkSize
    MaxSpeculativeBatchUnitsLimit      = 256

256 is good as a default, but too small as a hard limit, right?
Kept 256 as the default and clarified the hard limit separately as MaxSpeculativeBatchUnitsLimit. The limit is currently the peer LCP service protocol limit, so raising it should be done together with the LCP-side MAX_SPECULATIVE_BATCH_UNITS.
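One way to picture the split between a default and a protocol-level hard limit is the sketch below. It is only an illustration: `resolveBatchUnits`, `defaultBatchUnits`, and `batchUnitsLimit` are hypothetical names, the constant mirrors the value 256 quoted in this thread, and rejecting out-of-range values (rather than clamping them) is an assumption, not a description of the relayer's actual behavior.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	// defaultBatchUnits is the default discussed in the thread.
	defaultBatchUnits = 256
	// batchUnitsLimit is a local stand-in for MaxSpeculativeBatchUnitsLimit,
	// which the thread describes as the peer LCP service protocol limit.
	batchUnitsLimit = 256
)

var errBatchUnitsOutOfRange = errors.New("speculative batch unit count out of range")

// resolveBatchUnits is a hypothetical helper. A configured value of 0 selects
// the default; anything outside 1..batchUnitsLimit is rejected instead of
// being silently clamped, so a misconfiguration stays visible.
func resolveBatchUnits(configured int) (int, error) {
	if configured == 0 {
		return defaultBatchUnits, nil
	}
	if configured < 0 || configured > batchUnitsLimit {
		return 0, fmt.Errorf("%w: got %d, limit %d",
			errBatchUnitsOutOfRange, configured, batchUnitsLimit)
	}
	return configured, nil
}

func main() {
	for _, configured := range []int{0, 64, 512} {
		units, err := resolveBatchUnits(configured)
		fmt.Println(configured, units, err)
	}
}
```

Because the limit is enforced by the peer, raising the local constant alone would only move the failure to the LCP side, which is why the reply ties it to MAX_SPECULATIVE_BATCH_UNITS.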

    		"error", err.Error(),
    	)
    }
    remainingUnits, collectErr := drainExplicitStateSourceHeaderUnitStream(sourceHeaderUnitStream)

If there is a big gap between the trusted height and the latest finalized height, draining all headers from the stream could cause an out-of-memory error.
Addressed by changing the error path to cancel the source producer and discard-drain the stream without retaining units. This lets unbuffered producers exit cleanly while avoiding memory growth from accumulating drained headers.
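The cancel-then-discard pattern described in this reply can be sketched with a plain channel and a context. This is a minimal illustration under stated assumptions: `sourceUnit`, `produceUnits`, `cancelAndDiscard`, and `runDemo` are hypothetical names and do not correspond to the relayer's real types or functions.

```go
package main

import (
	"context"
	"fmt"
)

// sourceUnit is a hypothetical stand-in for a collected source-header unit.
type sourceUnit struct{ height int }

// produceUnits sends units on an unbuffered channel until it runs out of
// heights or the context is cancelled, then closes the channel.
func produceUnits(ctx context.Context, from, to int) <-chan sourceUnit {
	out := make(chan sourceUnit) // unbuffered: each send blocks until received
	go func() {
		defer close(out)
		for h := from; h <= to; h++ {
			select {
			case out <- sourceUnit{height: h}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// cancelAndDiscard cancels the producer and reads the channel until it is
// closed. Units are counted but never stored, so memory use stays constant
// no matter how many headers were still pending.
func cancelAndDiscard(cancel context.CancelFunc, units <-chan sourceUnit) int {
	cancel()
	discarded := 0
	for range units {
		discarded++
	}
	return discarded
}

// runDemo consumes one unit, pretends the batch then failed, and cleans up.
func runDemo() (firstHeight int, discarded int) {
	ctx, cancel := context.WithCancel(context.Background())
	units := produceUnits(ctx, 1, 1000000)
	first := <-units
	discarded = cancelAndDiscard(cancel, units)
	return first.height, discarded
}

func main() {
	first, discarded := runDemo()
	fmt.Println("first height:", first)
	fmt.Println("units discarded during cleanup:", discarded)
}
```

The number of discarded units varies from run to run, because a send that is already in flight can still complete after cancellation. What matters is that the producer goroutine always exits and nothing it sent is retained.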
Added a follow-up change to make the explicit-state UpdateClient path opt-in instead of default-on.
Addressed the follow-up review items in
Validation:
a59b779 to aef0b44
- restore the ELC updater gRPC health service
- hard-fail explicit-state speculative batch failures instead of falling back to serial update-client
- cancel and discard-drain source streams on batch errors so unbuffered producers can exit
- reject nil or incomplete explicit-state base states instead of serializing those units
- remove unused explicit-state fallback/dead helper code and stale client connection storage
- clarify speculative batch unit limit as the peer LCP service protocol limit
- update tests for hard-fail behavior, blocked producer cleanup, explicit-vs-legacy parity, and removed fallback paths
- replace disable_explicit_state_update_client with enable_explicit_state_update_client
- keep the legacy serial path as the default after upgrades
- enable the explicit-state path explicitly in tests that exercise it
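A small sketch of why an enable-style flag keeps the legacy path as the default: a boolean that is absent from the config decodes to false. The example assumes a JSON config and a hypothetical `proverConfig` struct and `selectUpdateClientPath` helper; the real prover config may be defined and decoded differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// proverConfig is a hypothetical subset of the prover config. Only the flag
// name comes from the commit message; the struct itself is an assumption.
type proverConfig struct {
	EnableExplicitStateUpdateClient bool `json:"enable_explicit_state_update_client"`
}

// selectUpdateClientPath reports which path a given config would select.
// An absent flag decodes to false, so untouched configs stay on the legacy
// serial path.
func selectUpdateClientPath(rawConfig string) (string, error) {
	var cfg proverConfig
	if err := json.Unmarshal([]byte(rawConfig), &cfg); err != nil {
		return "", fmt.Errorf("parse prover config: %w", err)
	}
	if cfg.EnableExplicitStateUpdateClient {
		return "explicit-state", nil
	}
	return "legacy-serial", nil
}

func main() {
	for _, raw := range []string{
		`{}`,
		`{"disable_explicit_state_update_client": false}`, // old key, ignored here
		`{"enable_explicit_state_update_client": true}`,
	} {
		path, err := selectUpdateClientPath(raw)
		fmt.Println(path, err)
	}
}
```

With the earlier disable-style flag, the same zero value would have meant "explicit-state on", so an upgrade without config changes would have switched paths silently.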
This reverts commit d0121f1.
aef0b44 to 3a53bdd

    // The explicit-state path cannot anchor at a non-latest canonical
    // state, so it cannot heal this divergence by itself.
    return nil, fmt.Errorf(
    	"explicit-state base state mismatch persisted after %d attempts; if the LCP canonical state is ahead of the on-chain committed state, disable enable_explicit_state_update_client for one update cycle so the serial path can heal the gap: %w",

io.EOF, gRPC DeadlineExceeded/Unavailable, and LCP idle/memory timeouts are now treated as transient and reuse the existing maxExplicitStateAttempts retry budget alongside base-state mismatch.
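A shared retry budget for transient failures might look roughly like the following. This is a sketch, not the relayer's code: `maxAttempts` stands in for maxExplicitStateAttempts with an assumed value of 3, the sentinel errors are hypothetical, and the gRPC status codes are represented by `context.DeadlineExceeded` and a local `errServiceUnavailable` so the example needs only the standard library.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// maxAttempts stands in for maxExplicitStateAttempts; the value is assumed.
const maxAttempts = 3

// Hypothetical sentinels. A real implementation would inspect gRPC status
// codes and LCP error details instead.
var (
	errServiceUnavailable = errors.New("service unavailable")
	errBaseStateMismatch  = errors.New("explicit-state base state mismatch")
	errInvalidHeader      = errors.New("invalid header")
)

// isRetryable reports whether err belongs to the classes that share the
// retry budget: stream EOF, deadline/unavailable, and base-state mismatch.
func isRetryable(err error) bool {
	return errors.Is(err, io.EOF) ||
		errors.Is(err, context.DeadlineExceeded) ||
		errors.Is(err, errServiceUnavailable) ||
		errors.Is(err, errBaseStateMismatch)
}

// runWithRetry calls attempt until it succeeds, hits a non-retryable error,
// or exhausts the budget. It returns the number of attempts made.
func runWithRetry(attempt func() error) (int, error) {
	var err error
	for n := 1; n <= maxAttempts; n++ {
		if err = attempt(); err == nil {
			return n, nil
		}
		if !isRetryable(err) {
			return n, err
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

// flakyAttempt fails with err for the first `failures` calls, then succeeds.
func flakyAttempt(failures int, err error) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return err
		}
		return nil
	}
}

func main() {
	fmt.Println(runWithRetry(flakyAttempt(2, io.EOF)))
	fmt.Println(runWithRetry(flakyAttempt(5, errBaseStateMismatch)))
	fmt.Println(runWithRetry(flakyAttempt(1, errInvalidHeader)))
}
```

Sharing one budget means a run that alternates between a mismatch and a timeout still stops after the same number of attempts, rather than resetting the count per error class.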
What changed
- ExplicitStateChunkProvider to SpeculativeUpdateClientBatchStream
- enable_explicit_state_update_client prover config flag; the legacy serial path remains the default after upgrades
- UpdateClientStream path
- lcp submodule to the service/proto revision required for speculative batch execution

Why
Some update-client headers are large and expensive to prepare. Sending them only through the serial UpdateClientStream path forces the relayer to wait for each update-client result before the next dependent unit can be submitted, which leaves enclave-side execution and prover-side header preparation poorly overlapped.

The speculative batch stream lets the relayer preserve the required ordered update-client chain while allowing later units to be prepared and submitted before earlier speculative results are finalized. This gives LCP enough structure to execute independent work in parallel and validate the ordered dependency chain.
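As a rough picture of what "ordered update-client chain" could mean for a single lane, the check below requires each unit to be anchored at the height produced by the unit before it. This reading is an assumption made for illustration; `headerUnit` and `validateChain` are hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// headerUnit is a hypothetical unit: it updates the client from
// trustedHeight to height.
type headerUnit struct {
	trustedHeight uint64
	height        uint64
}

// validateChain checks that the units form one sequential chain starting at
// base: every unit must trust the height produced by its predecessor and
// must move the client forward.
func validateChain(base uint64, units []headerUnit) error {
	prev := base
	for i, u := range units {
		if u.trustedHeight != prev {
			return fmt.Errorf("unit %d trusts height %d, expected %d",
				i, u.trustedHeight, prev)
		}
		if u.height <= u.trustedHeight {
			return fmt.Errorf("unit %d does not advance past height %d",
				i, u.trustedHeight)
		}
		prev = u.height
	}
	return nil
}

func main() {
	ordered := []headerUnit{{10, 12}, {12, 15}, {15, 20}}
	fmt.Println(validateChain(10, ordered))

	gap := []headerUnit{{10, 12}, {13, 15}}
	fmt.Println(validateChain(10, gap))
}
```

A check of this kind only needs each unit's heights, not its verification result, which is what would let later units be submitted before earlier results are finalized.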
Because the explicit-state path is still experimental and can change runtime behavior, it is now opt-in. Existing configurations keep using the legacy serial path unless enable_explicit_state_update_client is set. If speculative batch execution fails, the relayer cancels the source-unit producer, discards any remaining streamed units to let the producer exit cleanly, and returns the error. There is no automatic fallback to serial execution.

Behavior
Validation
- go test ./relay
- go test ./...
- go test ./... -count=1
- go build -tags customcert ./relay/bin