
relay: stream explicit-state source units for speculative update-client #68

Open
ikehara wants to merge 29 commits into main from feature/explicit-state-parallelization

@ikehara

@ikehara ikehara commented Apr 6, 2026


What changed

  • introduce explicit-state source-unit streaming from ExplicitStateChunkProvider to SpeculativeUpdateClientBatchStream
  • add an opt-in enable_explicit_state_update_client prover config flag; the legacy serial path remains the default after upgrades
  • use speculative batch execution only for provers that both enable the flag and can provide explicit-state source units
  • remove automatic serial fallback after speculative batch execution errors; failures now cancel/drain the source-unit producer and return the error
  • reject incomplete explicit-state base payloads instead of executing those units through a per-unit serial fallback
  • keep non-explicit-state update-client flows on the existing serial UpdateClientStream path
  • update the lcp submodule to the service/proto revision required for speculative batch execution
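
The selection rule in the list above (flag enabled and prover able to supply source units) can be sketched as follows. This is a minimal illustration, not the PR's code: `ProverConfig`, `explicitStateProvider`, `legacyProver`, `chunkProver`, and `useSpeculativeBatch` are hypothetical names, and only the flag itself comes from the description.

```go
package main

import "fmt"

// ProverConfig is a hypothetical struct carrying only the opt-in flag
// named in this PR (enable_explicit_state_update_client).
type ProverConfig struct {
	EnableExplicitStateUpdateClient bool
}

// SourceUnit is a placeholder for one explicit-state source unit.
type SourceUnit struct{ ID string }

// explicitStateProvider is a hypothetical capability interface: provers
// that can stream explicit-state source units implement it.
type explicitStateProvider interface {
	SourceUnits() <-chan SourceUnit
}

// legacyProver cannot provide source units.
type legacyProver struct{}

// chunkProver can provide source units (an empty stream in this sketch).
type chunkProver struct{}

func (chunkProver) SourceUnits() <-chan SourceUnit {
	ch := make(chan SourceUnit)
	close(ch)
	return ch
}

// useSpeculativeBatch reports whether the speculative batch path applies:
// the flag must be set and the prover must provide source units.
// Everything else stays on the serial path.
func useSpeculativeBatch(cfg ProverConfig, prover any) bool {
	if !cfg.EnableExplicitStateUpdateClient {
		return false
	}
	_, ok := prover.(explicitStateProvider)
	return ok
}

func main() {
	on := ProverConfig{EnableExplicitStateUpdateClient: true}
	fmt.Println(useSpeculativeBatch(ProverConfig{}, chunkProver{}))
	fmt.Println(useSpeculativeBatch(on, legacyProver{}))
	fmt.Println(useSpeculativeBatch(on, chunkProver{}))
}
```

Because the zero value of the flag is `false`, a configuration written before the upgrade selects the serial path without any change.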

Why

Some update-client headers are large and expensive to prepare. Sending them only through the serial UpdateClientStream path forces the relayer to wait for each update-client result before the next dependent unit can be submitted, which leaves enclave-side execution and prover-side header preparation poorly overlapped.

The speculative batch stream lets the relayer preserve the required ordered update-client chain while allowing later units to be prepared and submitted before earlier speculative results are finalized. This gives LCP enough structure to execute independent work in parallel and validate the ordered dependency chain.

Because the explicit-state path is still experimental and can change runtime behavior, it is now opt-in. Existing configurations keep using the legacy serial path unless enable_explicit_state_update_client is set. If speculative batch execution fails, the relayer cancels the source-unit producer, discards any remaining streamed units to let the producer exit cleanly, and returns the error. There is no automatic fallback to serial execution.

Behavior

  • explicit-state execution treats chunks for the same ELC client as an ordered speculative chain
  • provider-produced units are streamed in order; the first unit can be sent before later units are built
  • explicit-state UpdateClient is disabled by default and must be enabled per prover config
  • non-explicit-state paths remain unchanged and continue to use serial update-client execution
  • if speculative batch execution fails, the relayer cancels/drains the source-unit stream and returns the error
  • if a provider emits an incomplete explicit-state base payload, the relayer returns an error instead of masking it with serial fallback
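
The ordered streaming behavior listed above can be pictured with a small producer/consumer sketch. It assumes an unbuffered Go channel as the stream and uses hypothetical names (`SourceUnit`, `streamUnits`, `submitInOrder`, `demo`); the real `ExplicitStateChunkProvider` and `SpeculativeUpdateClientBatchStream` types are not reproduced here.

```go
package main

import (
	"context"
	"fmt"
)

// SourceUnit is a hypothetical stand-in for one streamed unit.
type SourceUnit struct {
	Index int
	ID    string
}

// streamUnits builds n units in order on a goroutine and sends each one as
// soon as it is ready, so the first unit is available before later units
// have been built. The producer stops early when ctx is canceled.
func streamUnits(ctx context.Context, n int) <-chan SourceUnit {
	out := make(chan SourceUnit)
	go func() {
		defer close(out)
		for i := 0; i < n; i++ {
			unit := SourceUnit{Index: i, ID: fmt.Sprintf("unit-%d", i)}
			select {
			case out <- unit:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out
}

// submitInOrder consumes the stream and returns the IDs in submission
// order. A unit that arrives out of sequence is an error. On error the
// caller is expected to cancel the producer's context.
func submitInOrder(units <-chan SourceUnit) ([]string, error) {
	var submitted []string
	next := 0
	for u := range units {
		if u.Index != next {
			return submitted, fmt.Errorf("unit %q out of order: got index %d, want %d", u.ID, u.Index, next)
		}
		submitted = append(submitted, u.ID)
		next++
	}
	return submitted, nil
}

// demo wires producer and consumer together for n units.
func demo(n int) ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return submitInOrder(streamUnits(ctx, n))
}

func main() {
	ids, err := demo(3)
	fmt.Println(ids, err)
}
```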

Validation

  • go test ./relay
  • go test ./...
  • go test ./... -count=1
  • go build -tags customcert ./relay/bin

@ikehara ikehara changed the title from "[codex] explicit-state planning for multi-header tendermint updates" to "prototype: Explicit-state planning for multi-header tendermint updates" Apr 6, 2026
@ikehara ikehara requested a review from Copilot April 6, 2026 08:34

Copilot AI left a comment


Pull request overview

This PR extends the relayer’s explicit-state update-client flow to properly exercise multi-header Tendermint updates (as a single-lane sequential chain), adds a Tendermint multi-header collector, and refreshes local E2E tooling to run the tm2tm explicit-state path again.

Changes:

  • Add explicit-state planning + speculative batch execution support (lane/unit planning, deferred base-state resolution, and integration tests).
  • Add Tendermint multi-header source collection with source-header capping and runtime diagnostics for lane limiting.
  • Update E2E scripts and simapp build tooling (Go/Docker version bumps, configurable sleeps, new E2E flags).

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/e2e/cases/tm2tm/scripts/test-elc-cmd | Makes the ELC update sleep configurable via env var. |
| scripts/run_e2e_test.sh | Adds explicit-state CLI flags + configurable LCP thread/concurrency knobs. |
| simapp/go.mod | Bumps simapp module Go version and updates indirect deps. |
| simapp/go.sum | Updates checksums after dependency refresh. |
| simapp/Dockerfile | Updates Go toolchain image for simapp builds. |
| relay/prover.go | Integrates multi-header collection and explicit-state speculative batch execution into update flow. |
| relay/lcp.go | Threads counterparty context through EKI persistence/removal calls. |
| relay/db.go | Moves EKI persistence to per-client subdirectories and changes removal behavior to delete the cache root. |
| relay/client.go | Stores the underlying grpc.ClientConn for raw Invoke calls. |
| relay/explicit_state_update_client.go | Adds explicit-state lane planning, base-state resolution logic, and query helpers. |
| relay/explicit_state_source_header_units.go | Adds source-header unit collection + Tendermint multi-header collector + lane-limit diagnostics. |
| relay/explicit_state_plan.go | Adds explicit-state unit/lane planning and batch execution wrapper. |
| relay/explicit_state_batch.go | Defines speculative batch request/response message types and an Invoke helper. |
| relay/explicit_state_header_units.go | Builds header units with trusted-height extraction. |
| relay/*_test.go (new) | Adds unit + integration coverage for planning, collection, and batch invocation. |


Comment thread relay/db.go Outdated
Comment thread relay/explicit_state_source_header_units.go Outdated
Comment thread relay/explicit_state_source_header_units.go Outdated
Comment thread relay/explicit_state_batch.go Outdated
Comment thread relay/explicit_state_update_client.go Outdated
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from 13f1f66 to d3a5bd7 April 8, 2026 08:32
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch 2 times, most recently from 2fd975e to 6d1fdb0 April 16, 2026 05:40
@ikehara ikehara requested a review from Copilot April 22, 2026 08:48

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 4 comments.



Comment thread relay/config.go Outdated
Comment thread relay/explicit_state_batch.go Outdated
Comment thread relay/explicit_state_source_header_units.go Outdated
Comment thread relay/explicit_state_source_header_units.go Outdated
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from 52a84a6 to f324655 April 23, 2026 05:39
@ikehara ikehara requested a review from Copilot April 23, 2026 06:51

Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.



Comment thread relay/explicit_state_update_client.go Outdated
Comment thread relay/explicit_state_update_client.go Outdated
Comment thread relay/explicit_state_update_client.go Outdated
Comment thread relay/prover.go Outdated
@ikehara ikehara self-assigned this Apr 23, 2026
@ikehara ikehara changed the title from "prototype: Explicit-state planning for multi-header tendermint updates" to "relay: stream explicit-state source units for speculative update-client" Apr 23, 2026
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from 664a69b to 246ef67 May 12, 2026 07:00
@ikehara ikehara requested a review from Copilot May 12, 2026 09:56

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • relay/config.pb.go: Language not supported

Comment thread relay/explicit_state_update_client.go Outdated
Comment thread relay/explicit_state_update_client.go
Comment thread relay/prover.go Outdated
Comment thread relay/config.go Outdated
@ikehara ikehara marked this pull request as ready for review May 18, 2026 07:35

@siburu siburu left a comment

Contributor

nitpicks

Comment on lines -208 to -214

// Register standard grpc.health.v1 liveness service. The overall ("")
// status reports SERVING for as long as this gRPC server is accepting
// requests.
healthSrv := health.NewServer()
healthpb.RegisterHealthServer(grpcServer, healthSrv)

Contributor

Please restore the health endpoint if its removal was not intentional.

@ikehara ikehara Jun 4, 2026

Author

Restored the gRPC health service registration in relay/elcupdater/module/service.go. This removal was unintentional.

Comment thread relay/client.go Outdated
)

type LCPServiceClient struct {
conn *grpc.ClientConn

Contributor

conn is added but not used at all.

@ikehara ikehara Jun 4, 2026

Author

Removed the unused conn field from LCPServiceClient. The explicit-state path no longer depends on storing the raw client connection.

Comment thread relay/explicit_state_batch.go Outdated
Comment on lines +192 to +197
func unitIDForError(unit *SpeculativeUpdateClientUnit) string {
if unit == nil {
return ""
}
return unit.UnitId
}

Contributor

This function is not used at all.

@ikehara ikehara Jun 4, 2026

Author

Removed the unused helper from relay/explicit_state_batch.go.

Comment thread relay/explicit_state_batch_test.go Outdated
Comment on lines +71 to +72
sendErrAfter int
sendErr error

Contributor

These two fields are never assigned a value.

Author

Cleaned up this test path so the previously unset variables are no longer present.

Comment thread relay/explicit_state_batch_test.go Outdated
Comment on lines +74 to +75
closeAndRecvCalled bool
closeSendCalled bool

Contributor

These two fields are set to true but never read.

Author

Cleaned up this test path so the write-only variables are no longer present.

Comment thread relay/prover.go Outdated
Comment on lines +471 to +476
// Serial fallback is allowed only for capability failures and explicit-state
// prediction mismatches that indicate this ELC cannot provide a stable ordered
// chain base state for the current attempt. Ordered-chain merge/write-set
// failures must stay visible so we do not silently mask explicit-state bugs as
// a successful serial update-client.
func shouldFallbackToSerialUpdateClient(err error) bool {

Contributor

I think that serial fallback is unnecessary and should be avoided because it can mask critical configuration failures.

Author

Addressed. The automatic serial fallback has been removed. Speculative batch errors now cancel/drain the source-unit producer and return the error instead of masking it with serial execution. Incomplete explicit-state base payloads now hard-fail as well.
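
A minimal sketch of the hard-fail check described in this reply, assuming a base payload made of a client state and a consensus state. `BaseState`, `SourceUnit`, `ErrInvalidSourceUnit`, and `validateUnit` are hypothetical names for illustration, not the types in `relay/`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidSourceUnit marks a unit the relayer refuses to execute.
// Hypothetical sentinel, not taken from the PR.
var ErrInvalidSourceUnit = errors.New("invalid explicit-state source unit")

// BaseState is a hypothetical stand-in for the explicit-state base payload.
type BaseState struct {
	ClientState    []byte
	ConsensusState []byte
}

// SourceUnit is a hypothetical stand-in for one provider-produced unit.
type SourceUnit struct {
	ID        string
	AnyHeader []byte // stands in for the encoded header
	Base      *BaseState
}

// validateUnit returns an error for a missing header or an incomplete base
// state. There is deliberately no branch that runs the unit serially.
func validateUnit(u SourceUnit) error {
	switch {
	case len(u.AnyHeader) == 0:
		return fmt.Errorf("%s: missing header: %w", u.ID, ErrInvalidSourceUnit)
	case u.Base == nil:
		return fmt.Errorf("%s: missing base state: %w", u.ID, ErrInvalidSourceUnit)
	case len(u.Base.ClientState) == 0 || len(u.Base.ConsensusState) == 0:
		return fmt.Errorf("%s: incomplete base state: %w", u.ID, ErrInvalidSourceUnit)
	}
	return nil
}

func main() {
	complete := SourceUnit{
		ID:        "unit-0",
		AnyHeader: []byte{1},
		Base:      &BaseState{ClientState: []byte{1}, ConsensusState: []byte{1}},
	}
	partial := SourceUnit{
		ID:        "unit-1",
		AnyHeader: []byte{1},
		Base:      &BaseState{ClientState: []byte{1}},
	}
	fmt.Println(validateUnit(complete))
	fmt.Println(validateUnit(partial))
}
```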

Comment thread relay/config.go Outdated
DefaultMaxChunkSize = 4*1024*1024 - 1024
DefaultMaxChunkSize = 4*1024*1024 - 1024
MaxSpeculativeBatchHeaderChunkSize = DefaultMaxChunkSize
MaxSpeculativeBatchUnitsLimit = 256

Contributor

256 is a good default, but isn't it too small as a hard limit?

@ikehara ikehara Jun 4, 2026

Author

Kept 256 as the default and clarified the hard limit separately as MaxSpeculativeBatchUnitsLimit. The limit is currently the peer LCP service protocol limit, so raising it should be done together with the LCP-side MAX_SPECULATIVE_BATCH_UNITS.
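
One way to keep the default and the hard limit separate, as this reply describes, is sketched below. The value 256 comes from the diff; `defaultSpeculativeBatchUnits`, `resolveBatchUnits`, and the convention that zero means "unset" are assumptions for illustration.

```go
package main

import "fmt"

const (
	// maxSpeculativeBatchUnitsLimit mirrors MaxSpeculativeBatchUnitsLimit
	// from the diff: the peer LCP service protocol limit.
	maxSpeculativeBatchUnitsLimit = 256
	// defaultSpeculativeBatchUnits is a hypothetical name for the default.
	defaultSpeculativeBatchUnits = 256
)

// resolveBatchUnits applies the default when the value is unset (assumed
// here to be zero) and rejects values above the protocol limit instead of
// silently clamping them.
func resolveBatchUnits(configured uint32) (uint32, error) {
	if configured == 0 {
		return defaultSpeculativeBatchUnits, nil
	}
	if configured > maxSpeculativeBatchUnitsLimit {
		return 0, fmt.Errorf("batch units %d exceeds peer LCP limit %d",
			configured, maxSpeculativeBatchUnitsLimit)
	}
	return configured, nil
}

func main() {
	for _, v := range []uint32{0, 64, 257} {
		n, err := resolveBatchUnits(v)
		fmt.Println(v, n, err)
	}
}
```

Rejecting rather than clamping keeps a misconfigured relayer from drifting out of step with the LCP-side `MAX_SPECULATIVE_BATCH_UNITS`.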

Comment thread relay/prover.go Outdated
"error", err.Error(),
)
}
remainingUnits, collectErr := drainExplicitStateSourceHeaderUnitStream(sourceHeaderUnitStream)

Contributor

If there is a big gap between the trusted height and the latest finalized height, draining all headers from the stream could cause an out-of-memory error.

Author

Addressed by changing the error path to cancel the source producer and discard-drain the stream without retaining units. This lets unbuffered producers exit cleanly while avoiding memory growth from accumulating drained headers.
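
The cancel-then-discard pattern described here can be sketched with a context and an unbuffered channel. This is an illustration under those assumptions, with hypothetical names (`produce`, `cancelAndDiscard`, `demo`); it is not the code in `relay/prover.go`.

```go
package main

import (
	"context"
	"fmt"
)

// SourceUnit is a hypothetical stand-in for one streamed unit.
type SourceUnit struct{ ID string }

// produce sends up to n units on an unbuffered channel and stops as soon
// as ctx is canceled. done is closed when the goroutine has exited.
func produce(ctx context.Context, n int) (<-chan SourceUnit, <-chan struct{}) {
	out := make(chan SourceUnit)
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer close(out)
		for i := 0; i < n; i++ {
			if ctx.Err() != nil {
				return
			}
			select {
			case out <- SourceUnit{ID: fmt.Sprintf("unit-%d", i)}:
			case <-ctx.Done():
				return
			}
		}
	}()
	return out, done
}

// cancelAndDiscard cancels the producer, then reads the stream until it is
// closed without keeping any unit, so memory use does not grow with the
// number of pending headers. It returns how many units were dropped.
func cancelAndDiscard(cancel context.CancelFunc, units <-chan SourceUnit) int {
	cancel()
	discarded := 0
	for range units {
		discarded++
	}
	return discarded
}

// demo consumes one unit, simulates a batch failure, and cleans up.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	units, done := produce(ctx, 1000000)
	<-units // first unit was submitted before the simulated failure
	discarded := cancelAndDiscard(cancel, units)
	<-done // the producer goroutine has exited
	return discarded
}

func main() {
	fmt.Println("discarded units:", demo())
}
```

Draining after the cancel matters for an unbuffered producer: a goroutine blocked on a send is released either by the cancellation or by the drain loop receiving the value, and nothing it sends is retained.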

@ikehara

ikehara commented Jun 4, 2026

Author

Added a follow-up change to make the explicit-state UpdateClient path opt-in instead of default-on.

  • replaced disable_explicit_state_update_client with enable_explicit_state_update_client
  • default behavior remains the legacy serial UpdateClient path after upgrades
  • explicit-state speculative execution is used only when the prover config explicitly sets enable_explicit_state_update_client: true
  • tests that exercise the explicit-state path now opt in explicitly

@ikehara

ikehara commented Jun 4, 2026

Author

Addressed the follow-up review items in 3cf81e6:

  • removed the serial fallback path for speculative batch failures; failures now return an error
  • cancel and discard-drain the source-unit stream on batch errors so unbuffered producers can exit without retaining drained units
  • changed incomplete explicit-state base payloads to hard-fail instead of per-unit serial fallback
  • restored the ELC updater gRPC health service
  • removed unused/dead helper code and stale test-only helpers
  • added a stricter blocked-producer cleanup test for the batch-error path
  • added an explicit-state vs legacy parity test for UpdateClient results
  • switched cloneAny to proto.Clone
  • changed speculative unit IDs to unit-%d and clarified the 256-unit limit as the peer LCP service protocol limit
  • documented the provider contract for non-nil AnyHeader / complete base state
  • updated the PR description to reflect hard-fail behavior instead of fallback

Validation: go test ./relay, go test ./..., and go test ./... -count=1 pass locally.

@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from a59b779 to aef0b44 June 4, 2026 12:01
Kiyoshi Nakao added 5 commits June 4, 2026 21:07
- restore the ELC updater gRPC health service
- hard-fail explicit-state speculative batch failures instead of falling back to serial update-client
- cancel and discard-drain source streams on batch errors so unbuffered producers can exit
- reject nil or incomplete explicit-state base states instead of serializing those units
- remove unused explicit-state fallback/dead helper code and stale client connection storage
- clarify speculative batch unit limit as the peer LCP service protocol limit
- update tests for hard-fail behavior, blocked producer cleanup, explicit-vs-legacy parity, and removed fallback paths
- replace disable_explicit_state_update_client with enable_explicit_state_update_client
- keep the legacy serial path as the default after upgrades
- enable the explicit-state path explicitly in tests that exercise it
@ikehara ikehara force-pushed the feature/explicit-state-parallelization branch from aef0b44 to 3a53bdd June 4, 2026 12:08

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • relay/config.pb.go: Language not supported

Comment thread relay/prover.go Outdated
// The explicit-state path cannot anchor at a non-latest canonical
// state, so it cannot heal this divergence by itself.
return nil, fmt.Errorf(
"explicit-state base state mismatch persisted after %d attempts; if the LCP canonical state is ahead of the on-chain committed state, disable enable_explicit_state_update_client for one update cycle so the serial path can heal the gap: %w",
Kiyoshi Nakao added 5 commits June 11, 2026 10:14
io.EOF, gRPC DeadlineExceeded/Unavailable, and LCP idle/memory
timeouts are now treated as transient and reuse the existing
maxExplicitStateAttempts retry budget alongside base-state mismatch.
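
The shared retry budget in this commit message could look roughly like the sketch below. It is stdlib-only on purpose: `context.DeadlineExceeded` stands in for the gRPC conditions, which real code would more likely detect with `status.Code(err)` from `google.golang.org/grpc/status` compared against `codes.DeadlineExceeded` and `codes.Unavailable`. The value of `maxExplicitStateAttempts`, the `errBaseStateMismatch` sentinel, and the helper names are assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// maxExplicitStateAttempts is named in the commit message; the value 3 is
// an assumption for this sketch.
const maxExplicitStateAttempts = 3

// errBaseStateMismatch is a hypothetical sentinel for the mismatch case.
var errBaseStateMismatch = errors.New("explicit-state base state mismatch")

// isRetryable reports whether err should consume one attempt from the
// budget rather than fail immediately.
func isRetryable(err error) bool {
	return errors.Is(err, io.EOF) ||
		errors.Is(err, context.DeadlineExceeded) ||
		errors.Is(err, errBaseStateMismatch)
}

// runWithRetry calls attempt until it succeeds, returns a non-retryable
// error, or exhausts the shared budget.
func runWithRetry(attempt func(n int) error) error {
	var err error
	for n := 1; n <= maxExplicitStateAttempts; n++ {
		if err = attempt(n); err == nil {
			return nil
		}
		if !isRetryable(err) {
			return err
		}
	}
	return fmt.Errorf("still failing after %d attempts: %w", maxExplicitStateAttempts, err)
}

// demoFlaky fails twice with io.EOF, then succeeds; it returns the call count.
func demoFlaky() (int, error) {
	calls := 0
	err := runWithRetry(func(int) error {
		calls++
		if calls < 3 {
			return io.EOF
		}
		return nil
	})
	return calls, err
}

// demoFatal fails with a non-retryable error; it returns the call count.
func demoFatal() (int, error) {
	calls := 0
	err := runWithRetry(func(int) error {
		calls++
		return errors.New("write-set conflict")
	})
	return calls, err
}

func main() {
	fmt.Println(demoFlaky())
	fmt.Println(demoFatal())
}
```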


3 participants