Skip to content

refactor(kona/service): move cancellation out of actors and other testability changes#20948

Open
op-will wants to merge 10 commits into
developfrom
op-will/kona/consolidateCancellationAndTestabilityUpdates
Open

refactor(kona/service): move cancellation out of actors and other testability changes#20948
op-will wants to merge 10 commits into
developfrom
op-will/kona/consolidateCancellationAndTestabilityUpdates

Conversation

@op-will
Copy link
Copy Markdown
Contributor

@op-will op-will commented May 21, 2026

Summary

Selectively port the cancellation-token ownership changes from
ethereum-optimism/optimism#19141
into the kona node service, without adopting that PR's Builder /
InboundData / trait-level init() pattern. Pulls in the within-actor
testability and restructuring improvements; keeps direct dependency
injection into constructors instead of routing everything through
builder structs.

NodeActor trait

  • Replace start(self, ctx) -> Result<(), Error> with step(&mut self) -> Result<(), Error>.
  • Drop the StartData associated type and the CancellableContext trait entirely.
  • The orchestrator macro (spawn_and_wait!) now owns the umbrella CancellationToken, the loop, and the select! against cancelled(); no actor sees a cancellation primitive.

Actor migrations

  • Derivation / DelegateDerivation / L1Watcher / Sequencer: drop the cancellation_token field and CancellableContext impl; move loop-local state onto self (tickers, payload-to-seal, last seal duration, etc).
  • DelegateDerivation (deps): take the sync-status fetcher as a new DerivationDelegateProvider trait, and the L1 provider as any kona_derive::ChainProvider, instead of the concrete DerivationDelegateClient and AlloyChainProvider. Unlocks unit-testing of sync-status validation; production wiring unchanged.
  • Network: take a live NetworkHandler at construction; build/start the libp2p swarm in RollupNode::start so the actor constructor stays sync. The NetworkInboundData bundle is deleted in favor of injecting senders individually.
  • Rpc: module assembly and the initial server launch move upstream; the actor holds the launcher + modules + handle and relaunches up to RpcBuilder::restart_count on stop. A Drop impl calls RpcServerHandle::stop() so graceful cancellation shuts the jsonrpsee server cleanly. RpcActor is generic over a new RpcServerLauncher trait (with a production JsonrpseeServerLauncher impl) so the restart/stop logic can be unit-tested with a controllable mock handle.

Engine actor restructure

  • The fan-out EngineActor (which only routed EngineActorRequest variants into one of two sub-tasks) is deleted.
  • EngineProcessorEngineActor; EngineRpcProcessorEngineRpcActor. Both run as first-class peers under spawn_and_wait!.
  • EngineProcessingRequest takes over the freed name EngineActorRequest; the old fan-out EngineActorRequest enum and its RpcRequest variant are deleted (the rpc client sends EngineRpcRequest directly).
  • The EngineRequestReceiver and EngineRpcRequestReceiver placeholder traits (testing-only scaffolding per their own doc comments) are deleted.
  • No more JoinHandle polling, is_finished() checks, or PhantomData generics in the engine actors.
  • EngineRpcActor is constrained to a new read-only EngineRpcClient trait (in kona-engine, blanket-impl'd for every T: EngineClient) instead of the full EngineClient. The trait exposes only l2_block_by_label and a narrowed get_storage_hash projection of get_proof. The actor cannot reach Engine API mutation methods at the type level. Field renamed engine_clientengine_rpc_client to match.

RollupNode::start

  • Single CancellationToken owned by the macro; no actor receives it.
  • All cross-actor channels (mpsc + watch) created at the top of start() in one visually-grouped block.
  • Actor construction broken out into five build_* helpers: build_engine_actors, build_derivation_actor, build_l1_watcher, build_sequencer, build_rpc_actor — plus four private type aliases.
  • create_engine_actor helper deleted (inlined into build_engine_actors).
  • start() docs gain a short "Shutdown" section: shutdown is unordered, channel-closed errors at teardown are expected.

External callers

  • bin/node and examples/gossip: build the swarm handler upstream and spawn a step loop on a tokio task. Fixes a pre-existing bug in bin/node's net subcommand where the prior network.start(()).await? blocked forever, making subsequent interval-poll code unreachable.
  • Network/sequencer integration tests: TestNetworkBuilder::build is now async; TestNetwork holds the four inbound senders individually.
  • SequencerActor::new tests pass block_time: 2 to avoid tokio::time::interval(0) panicking.

Follow-ups (to file as separate issues)

  • RPC restart budget semantics. RpcBuilder::restart_count is currently interpreted as a lifetime cap across the actor's lifetime; a long-running node whose RPC port flaps occasionally over months will eventually exhaust it and shut down. The more useful shape is a rate (e.g. N restarts per unit time), so transient flapping doesn't permanently disable the RPC. Track separately; out of scope here.
  • DelegateDerivationActor unit tests. The DI seam landed in this PR but no tests were added with it — adding tests would balloon the diff and the test matrix deserves separate thought. Validation logic worth covering: sync-status fetch failure (silent-skip), L1 inconsistency (warn-and-skip), happy-path forward to engine, and the "ignore irrelevant requests in delegated mode" handler.

Testing

  • cargo build -p kona-node-service — 0 errors
  • cargo check --all-targets -p kona-node-service -p kona-node -p example-gossip -p kona-engine — 0 errors
  • cargo test --lib -p kona-node-service — 112 passed
  • cargo test --test integration test_p2p_network_conn — passed
  • cargo test --test integration test_sequencer_network_conn — passed
  • cargo +nightly fmt -p kona-node-service -p kona-engine — clean
  • cargo clippy -p kona-node-service -p kona-engine --all-targets — no new warnings
  • RUSTDOCFLAGS="--cfg docsrs -D warnings" cargo +nightly doc -p kona-node-service -p kona-engine --all-features --no-deps --document-private-items — clean

@op-will op-will force-pushed the op-will/kona/consolidateCancellationAndTestabilityUpdates branch from ca6da49 to f538932 Compare May 21, 2026 05:38
…t with step; split engine actor

Selectively port the cancellation-token ownership changes from
#19141 without adopting that PR's Builder /
InboundData / init() trait pattern.

NodeActor trait:
- Replace `start(self, ctx) -> Result<(), Error>` with
  `step(&mut self) -> Result<(), Error>`.
- Drop the StartData associated type and the CancellableContext trait.
- The orchestrator macro (`spawn_and_wait!`) now owns the umbrella
  CancellationToken, the loop, and the select! against `cancelled()`;
  no actor sees a cancellation primitive.

Actor migrations:
- Derivation, DelegateDerivation, L1Watcher, Sequencer: drop the
  cancellation_token field + CancellableContext impl; move loop-local
  state onto self (tickers, payload-to-seal, last seal duration, etc).
- Network: take a live NetworkHandler at construction; build/start the
  libp2p swarm in RollupNode::start so the constructor stays sync. The
  NetworkInboundData bundle is deleted in favor of injecting senders
  individually.
- Rpc: become non-generic. Module assembly and the initial server
  launch move upstream; the actor holds config + modules + handle and
  relaunches up to RpcBuilder::restart_count on stop. A Drop impl calls
  ServerHandle::stop() so graceful cancellation shuts the jsonrpsee
  server cleanly.

Engine actor restructure:
- The fan-out EngineActor (which only routed EngineActorRequest
  variants into one of two sub-tasks) is deleted.
- EngineProcessor is promoted to EngineActor; EngineRpcProcessor is
  promoted to EngineRpcActor. Both run as first-class peers under
  spawn_and_wait!.
- EngineProcessingRequest takes over the freed name EngineActorRequest;
  the old fan-out EngineActorRequest enum and its RpcRequest variant
  are deleted (the rpc client sends EngineRpcRequest directly).
- EngineRequestReceiver and EngineRpcRequestReceiver placeholder traits
  (testing-only scaffolding per their own doc comments) are deleted.
- No more JoinHandle polling, is_finished() checks, or PhantomData
  generics in the engine actors.

RollupNode::start:
- Single CancellationToken owned by the macro; no actor receives it.
- All cross-actor channels (mpsc + watch) created at the top of start()
  in one visually-grouped block.
- Actor construction broken out into five build_* helpers
  (build_engine_actors, build_derivation_actor, build_l1_watcher,
  build_sequencer, build_rpc_actor) plus three private type aliases.
- create_engine_actor helper deleted (inlined into build_engine_actors).

External callers:
- bin/node and examples/gossip: build the swarm handler upstream and
  spawn a step loop on a tokio task. Fixes a pre-existing bug in
  bin/node's net subcommand where the prior `network.start(()).await?`
  blocked forever, making subsequent interval-poll code unreachable.
- Network/sequencer integration tests: TestNetworkBuilder::build is now
  async; TestNetwork holds the four inbound senders individually.
- SequencerActor::new tests pass `block_time: 2` to avoid
  tokio::time::interval(0) panicking.

Verification:
  cargo build -p kona-node-service:                                            0 errors
  cargo check --all-targets -p kona-node-service -p kona-node -p example-gossip: 0 errors
  cargo test --lib -p kona-node-service:                                       109 passed
  cargo test --test integration test_p2p_network_conn:                         passed
  cargo test --test integration test_sequencer_network_conn:                   passed
@op-will op-will force-pushed the op-will/kona/consolidateCancellationAndTestabilityUpdates branch from f538932 to 9ef3414 Compare May 21, 2026 12:09
op-will added 3 commits May 21, 2026 14:11
Introduces RpcServerLauncher + RpcServerHandle traits so the actor's
relaunch and shutdown logic can be unit-tested with a controllable mock
instead of a real jsonrpsee server. The production path is unchanged: a
new JsonrpseeServerLauncher wraps RpcBuilder and produces a real
ServerHandle. Adds six unit tests covering the restart budget, failed
relaunches, and the Drop-stops-handle path. Also documents that RollupNode
shutdown is unordered, and that L1WatcherActor's builder intentionally
returns impl NodeActor because its block-stream type is unnameable.
The accessor's name shadowed the field of the same name, making
self.engine_config() and &self.engine_config visually ambiguous at
the only call site. Since the accessor only cloned the field, callers
can clone inline.
Constructing the actor with an unstarted handler causes step() to hang
or fail on the first gossip poll. The constructor stays sync to keep
NodeActor minimal, so the live-handler invariant lives in the caller —
document that explicitly.
/// Returning `Ok(())` indicates the actor is ready to be stepped again.
/// Returning `Err(_)` indicates the actor has encountered a fatal
/// condition and should not be stepped further.
async fn step(&mut self) -> Result<(), Self::Error>;
Copy link
Copy Markdown
Contributor Author

@op-will op-will May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Note: I kept step() from the reference PR, as it's good enough for now.

If/when we separate actors from tasks, we can change this trait to more closely fit actors.

op-will added 3 commits May 21, 2026 15:04
The doc comment referenced crate::service::node::RollupNode, but service
is a pub(crate) module so the path is not part of the documented graph.
RollupNode is re-exported from the crate root; link there instead.
DelegateDerivationActor used to take its two external dependencies —
the sync-status fetch client and the L1 chain provider — by concrete
type, so the actor's validation logic (sync-status fetch, L1
consistency check, conditional forwarding) could not be exercised
without standing up real HTTP and RPC clients. That is why the actor
has no unit tests today.

Introduces a one-method DerivationDelegateProvider trait, generalizes
the L1 provider to any kona_derive::ChainProvider, and threads the two
generics through the actor and its enum wrapper in RollupNode. The
production path is unchanged: DerivationDelegateClient implements the
new trait and AlloyChainProvider already implements ChainProvider.

No new tests in this commit — the DI seam alone unlocks future test
work without committing to a specific test matrix here.
EngineRpcActor previously held an Arc<EngineClient_>, giving it access
to the full Engine API surface — including mutation methods like
forkchoiceUpdated, newPayload, and getPayload that an RPC query actor
must never call. Constrain the actor to a new EngineRpcClient trait
exposing only the two methods EngineQueries::handle actually needs:
l2_block_by_label and get_storage_hash (a narrowed projection of
get_proof that returns just the storage hash field used to compute the
L2-to-L1 message-passer storage root pre-Isthmus).

A blanket impl of EngineRpcClient for every T: EngineClient keeps
production wiring (OpEngineClient) unchanged; only the actor's static
type bound has tightened. Tests can now implement the two-method trait
directly instead of the entire EngineClient/OpEngineApi surface.

Also rename the field engine_client -> engine_rpc_client to match the
new narrowed responsibility.
Comment on lines +60 to +61
let mut delegated_derivation_ticker =
time::interval(Self::DERIVATION_DELEGATE_POLL_INTERVAL);
Copy link
Copy Markdown
Contributor Author

@op-will op-will May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could argue this should be a new(...) argument, but unit tests can just create DelegateDerivationActor via struct literal if they need to change it.

}
}
async fn step(&mut self) -> Result<(), Self::Error> {
let request = self.inbound_request_rx.recv().await.ok_or_else(|| {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: removing info!(target: "derivation", "Starting derivation"); because

  1. It doesn't fit into this step(...) redesign
  2. We are not necessarily starting derivation, as we need to wait for a request (i.e. ELSyncCompleted)

{
type Error = EngineError;

async fn step(&mut self) -> Result<(), Self::Error> {
Copy link
Copy Markdown
Contributor Author

@op-will op-will May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design note: start(...) used to create two subtasks:

  • a "processing handler" that is the actual engine (in engine_request_processor.rs)
  • an RPC handler (in rpc_request_processor.rs)

Instead of having start(...) / step(...) proxy requests to the subtasks, I just spun out EngineRpcActor as its own actor since it doesn't share any meaningful state with EngineActor. This has the added benefit of removing request queue contention between RPC requests and actual engine requests.

This looks like a big delta, but really it's just inlining the contents of engine_request_processor.rs as the new EngineActor and removing the proxy logic.

}
L1WatcherQueries::L1State(sender) => {
let current_l1 = *latest_head.borrow();
let current_l1 = *self.latest_head.borrow();
Copy link
Copy Markdown
Contributor Author

@op-will op-will May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to subscribe to its own latest_head, which is unnecessary because we can just borrow the value from the publisher

op-will added 3 commits May 21, 2026 16:40
…tency

All receiver fields now share the _rx suffix:
- signer        -> unsafe_block_signer_rx
- p2p_rpc       -> p2p_rpc_rx
- admin_rpc     -> admin_query_rx

publish_rx and unsafe_block_rx already followed the convention.

Also rename the local select! binding signer -> unsafe_block_signer
so it no longer collides with self.handler.signer, which represents a
different concept (the local block-signing key, not an address).

Call sites are positional so no caller updates required.
…ractly

Rewrite the constructor doc comment so it describes the precondition
(the libp2p swarm must already be built and started) rather than naming
the specific NetworkBuilder method chain. The trade-off rationale —
sync constructor over an init() trait method — is retained.
test_launch_no_modules, test_launch_with_modules, and
test_real_launcher_smoke all bound real localhost sockets via
jsonrpsee. Unit tests should not spin up actual servers; the mock-driven
RpcActor tests still cover the restart/stop logic, and the production
JsonrpseeServerLauncher is exercised end-to-end by integration tests at
the RollupNode level.

With the only remaining users of the free `launch` function now inside
launcher.rs itself, drop its pub(crate) visibility too.
@op-will op-will marked this pull request as ready for review May 21, 2026 22:07
@op-will op-will requested a review from a team as a code owner May 21, 2026 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant