Skip to content

PlainNettyTransport.close: use quietPeriod=0 in shutdownGracefully#474

Open
codexcoder21 wants to merge 1 commit into
libp2p:developfrom
CodexCoder21Organization:fix-plain-netty-transport-quiet-period
Open

PlainNettyTransport.close: use quietPeriod=0 in shutdownGracefully#474
codexcoder21 wants to merge 1 commit into
libp2p:developfrom
CodexCoder21Organization:fix-plain-netty-transport-quiet-period

Conversation

@codexcoder21

Copy link
Copy Markdown

Summary

PlainNettyTransport.close() currently calls workerGroup.shutdownGracefully() and bossGroup.shutdownGracefully() with no arguments. That falls through to Netty's defaults of quietPeriod = 2 SECONDS, timeout = 15 SECONDS. After PR #412 chained the returned futures via CompletableFuture.allOf(...).thenApply { }, the caller's close() future no longer resolves until those shutdownGracefully futures resolve — which means every close() blocks for the full 2-second quiet period, even when there is no traffic in flight to drain.

This patch passes an explicit quietPeriod = 0 to both calls. The event loop terminates as soon as its run-loop notices the shutdown flag, the returned future resolves in tens of milliseconds, and the caller-can-wait correctness contract introduced in #412 is preserved.

Why the current behaviour is wrong for typical callers

The quiet period is the time shutdownGracefully keeps the event loop alive AFTER its last task finished, in case more work is submitted. Once close() has been called on this transport:

  • Every existing channel has had .close() invoked on it.
  • closed = true is set, so listen() and dial() throw Libp2pException.
  • There is by construction no code path that submits new work to either workerGroup or bossGroup.

Waiting 2 seconds for "more work that isn't coming" is pure latency. The timeout argument (which I keep at 5 seconds) is the meaningful upper bound — it caps how long shutdownGracefully will wait for the loop to actually exit once shutdown has been requested.

Cumulative impact downstream

The 2-second per-close latency compounds badly for callers that drive many short-lived host lifecycles back-to-back, e.g. test fixtures that create-and-tear-down a host per iteration. A consumer @Timeout(30) test of 20 host create/close cycles times out after 12-13 iterations (each ~2 s of quiet-period wait plus iteration overhead), even though every individual close() is doing exactly what it was asked to do.

That is what surfaced as 12 timed-out tests in CodexCoder21Organization/UrlResolver run 7f19e875, once a downstream classpath change started loading post-#412 classes instead of the pre-#412 classes for the first time. The full investigation is in the report on the run, but the short version is: every test that creates many UrlProtocol2 instances in a loop hits the cumulative quiet-period budget exhaustion this PR fixes.

Regression test

PlainNettyTransportShutdownTimingTest exercises the contract directly. Both tests pass with this PR's fix and FAIL on develop HEAD:

Test On develop With fix
single close should resolve well below the default 2 second quiet period FAIL: returns in ~2023 ms PASS: returns in ~30 ms
20 sequential listen-then-close cycles complete in under 5 seconds FAIL: 41,146 ms total (~2,057 ms / iter) PASS: < 5 s total

The 1-second budget on the single-close test is deliberately conservative against any future change that might restore a smaller-but-nonzero quiet period — even a hypothetical 1 s default would fail this test decisively.

Verification

$ ./gradlew :libp2p:test --tests "io.libp2p.transport.PlainNettyTransportShutdownTimingTest"
...
BUILD SUCCESSFUL

(Pre-fix: 2 tests, 2 failed — matching the timing data above.)

Risk

Minimal. The change tightens shutdown latency without changing what state shutdown leaves behind. Any caller that genuinely relies on a 2-second quiet period to absorb in-flight tasks should already be calling close() only after explicitly draining its traffic — and if it does have an unflushed task at close time, the 5-second timeout argument still gives the loop time to drain it before the future completes.

🤖 Generated with Claude Code

@codexcoder21 codexcoder21 force-pushed the fix-plain-netty-transport-quiet-period branch 2 times, most recently from c0cffe2 to 8d636b3 Compare May 7, 2026 15:53
…efully

PlainNettyTransport.close() and QuicTransport.close() chain the
EventLoopGroup.shutdownGracefully() futures through CompletableFuture.allOf,
so the caller's close() future doesn't resolve until shutdownGracefully does.
The no-arg call falls through to Netty's defaults (quietPeriod=2s,
timeout=15s). The 2-second quiet period keeps the loop alive in case more
work is submitted -- but by the time close() runs, every channel is closed
and the transport is marked closed, so no further work can be submitted to
either group. The 2-second wait is pure latency on every close.

Pass quietPeriod=0 explicitly so the loop terminates as soon as the run-loop
sees the shutdown flag. The 5-second timeout argument still bounds how long
shutdownGracefully will wait for the loop to actually exit.

Adds PlainNettyTransportShutdownTimingTest covering both single-close and
20-cycle scenarios. Both tests fail on the unfixed code (single close
~2 s, 20-cycle loop >40 s) and pass after the fix (single close ~5 ms,
20-cycle loop ~330 ms).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codexcoder21 codexcoder21 force-pushed the fix-plain-netty-transport-quiet-period branch from 8d636b3 to 117377c Compare May 7, 2026 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants