
fix(op-devstack): eliminate rollup-boost RPC port-bind race #20938

Open
hdcesario-op wants to merge 2 commits into develop from cesario/fix-rollup-boost-rpc-port-race


hdcesario-op commented May 20, 2026

Summary

  • The rollup-boost RPC port is pre-allocated in Go (net.Listen(":0") → Close() → return the port number) and then handed to the Rust subprocess via --rpc-port=N. Between the Go-side Close() and the Rust-side bind() (hundreds of ms later, after Tokio init), any other net.Listen(":0") in the same test process can be handed the same port. Result: intermittent EADDRINUSE on memory-all-opn-op-{reth,geth}, surfacing as either fast-fail (TestFlashblocksStream, TestFlashblocksTransfer with "TCP endpoint not ready within 5s") or slow-hang (TestFlashblocksTransfer with 30-minute context timeout terminating at op-devstack/sysgo/rollup_boost.go:132).
  • This binary's other ports (flashblocks WS, debug server) already avoid the race by passing 0 and logging the bound address. The RPC port was the lone outlier; the existing block comment at op-devstack/sysgo/rollup_boost.go:119–122 flagged exactly this gap as a TODO.
  • This PR closes the gap by adopting the same pattern for RPC. Pre-allocation is removed entirely.
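The two patterns contrasted in the summary above can be sketched in Go as follows. This is a minimal illustration, not the op-devstack code: the helper names racyFreePort and bindAndReport are hypothetical, and the second helper only stands in for what the subprocess does when it is given port 0 and reports the address it actually bound.

```go
package main

import (
	"fmt"
	"net"
)

// racyFreePort (hypothetical name) mirrors the pre-allocation pattern:
// bind to port 0, read the port the kernel chose, then close the listener.
// After Close returns nothing reserves the port, so another Listen(":0")
// may receive the same number before the intended owner binds it.
func racyFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	port := l.Addr().(*net.TCPAddr).Port
	if err := l.Close(); err != nil {
		return 0, err
	}
	return port, nil
}

// bindAndReport (hypothetical name) keeps the listener open and reports the
// bound address, so the port is never released between choosing and using it.
func bindAndReport() (net.Listener, string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, "", err
	}
	return l, l.Addr().String(), nil
}

func main() {
	port, err := racyFreePort()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("pre-allocated port (no longer reserved):", port)

	l, addr, err := bindAndReport()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer l.Close()
	fmt.Println("listening on", addr)
}
```

The first helper returns a number that is only a hint; the second returns an address that is guaranteed to be held for as long as the listener stays open.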

Closes #19883
Closes #19934

The rollup-boost RPC port was pre-allocated by the Go harness
(net.Listen("127.0.0.1:0") -> Close() -> return port number), then
handed to the Rust subprocess via --rpc-port=N. The Rust binary had to
bind that port hundreds of milliseconds later (after Tokio init), during
which any other net.Listen(":0") in the test process could be handed
the same port by the kernel. Result: intermittent EADDRINUSE on
memory-all-opn-op-{reth,geth}, most visible on TestFlashblocksTransfer.

Adopt the same port-discovery pattern this binary already uses for its
flashblocks and debug ports:
  - rollup-boost (cli.rs) logs "RPC server listening on <addr>" after
    Server::build() returns, using the existing local_addr() API
    (already used at proxy.rs:210, 842).
  - op-devstack passes --rpc-port=0 and parses the bound address from
    the log stream, mirroring the existing flashblocks parser.
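A minimal Go sketch of the log-parsing side, assuming the log line contains the literal text "RPC server listening on " followed by a host:port address. The function name parseRPCAddr and the timestamp/level prefix in the example line are illustrative assumptions; the real parser in op-devstack may differ.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// rpcListenMarker is the message quoted in the commit text; everything
// around it in the log line (timestamps, levels) is assumed, not specified.
const rpcListenMarker = "RPC server listening on "

// parseRPCAddr (hypothetical name) extracts host and port from a log line.
// It reports ok=false for unrelated lines, malformed addresses, and port 0.
func parseRPCAddr(line string) (host string, port int, ok bool) {
	idx := strings.Index(line, rpcListenMarker)
	if idx < 0 {
		return "", 0, false
	}
	// The address is the first whitespace-delimited token after the marker.
	fields := strings.Fields(line[idx+len(rpcListenMarker):])
	if len(fields) == 0 {
		return "", 0, false
	}
	h, p, err := net.SplitHostPort(fields[0])
	if err != nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(p)
	if err != nil || n <= 0 || n > 65535 {
		return "", 0, false
	}
	return h, n, true
}

func main() {
	line := "2026-05-20T10:00:00Z INFO RPC server listening on 127.0.0.1:54321"
	if host, port, ok := parseRPCAddr(line); ok {
		fmt.Printf("host=%s port=%d\n", host, port)
	}
}
```

Rejecting port 0 matters here: a line that echoes the requested address rather than the bound one would otherwise be mistaken for a usable endpoint.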

Pre-allocation is removed entirely; cfg.RPCPort > 0 still pins a
specific port for callers that need it.
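The port selection described above can be sketched as a small helper. The name rpcPortArg is hypothetical; only the --rpc-port flag and the "RPCPort > 0 pins, otherwise pass 0" rule come from the text.

```go
package main

import "fmt"

// rpcPortArg (hypothetical name) builds the --rpc-port argument.
// A positive pinned value is passed through; anything else requests port 0,
// letting the subprocess bind an OS-assigned port and log the result.
func rpcPortArg(pinned int) string {
	if pinned > 0 {
		return fmt.Sprintf("--rpc-port=%d", pinned)
	}
	return "--rpc-port=0"
}

func main() {
	fmt.Println(rpcPortArg(0))
	fmt.Println(rpcPortArg(8545))
}
```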

Resolves the TODO at op-devstack/sysgo/rollup_boost.go:119-122.

Follow-up: op-devstack/sysgo/mixed_runtime.go:486 (kona-node
KONA_METRICS_PORT) uses the same pre-allocation pattern and should be
migrated separately once kona-node logs its bound metrics address.

The stdout parser callback is held by r.sub (via NewSubProcess) and can
fire after Start() returns. tasks.Await does not require channel
closure: it returns on the first value or when ctx is done. The
select-default already drops duplicate emits while the channel is open,
but if the channel is closed first, a late send would panic.

Drop the defers and document the lifecycle. No functional change for
the success path; eliminates a latent panic if a late/duplicate log
line ever races the deferred close.
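The channel lifecycle described above can be sketched as follows, assuming a buffered channel of capacity 1 (the actual buffer size is not stated here). The type addrNotifier and its emit method are hypothetical names used only for illustration.

```go
package main

import "fmt"

// addrNotifier (hypothetical type) carries the parsed address from the log
// callback to the waiter. The channel is intentionally never closed: the
// callback may fire after the waiter has returned, and a send on a closed
// channel panics.
type addrNotifier struct {
	ch chan string
}

func newAddrNotifier() *addrNotifier {
	// Capacity 1 is an assumption; it lets the first emit succeed even if
	// no receiver is waiting yet.
	return &addrNotifier{ch: make(chan string, 1)}
}

// emit performs a non-blocking send. It returns false when the value was
// dropped because the buffer was already full.
func (n *addrNotifier) emit(addr string) bool {
	select {
	case n.ch <- addr:
		return true
	default:
		return false
	}
}

func main() {
	n := newAddrNotifier()
	fmt.Println(n.emit("127.0.0.1:54321")) // first emit is delivered
	fmt.Println(n.emit("127.0.0.1:54321")) // duplicate is dropped
	fmt.Println(<-n.ch)                    // waiter reads the first value
	fmt.Println(n.emit("127.0.0.1:54321")) // late emit is safe: channel still open
}
```

Leaving the channel open costs nothing once both sides drop their references, since the garbage collector reclaims it; closing it only adds a way for a late sender to panic.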
hdcesario-op marked this pull request as ready for review May 21, 2026 09:03
hdcesario-op requested a review from a team as a code owner May 21, 2026 09:03
pcw109550 added this pull request to the merge queue May 21, 2026
github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 21, 2026


Development

Successfully merging this pull request may close these issues.

flaky test: TestFlashblocksTransfer
flaky test: TestFlashblocksStream
