fix: reuse HTTP connections in fallback service to prevent port exhaustion#6115
Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #6115      +/-   ##
    ==========================================
    - Coverage   70.61%   70.60%   -0.01%
    ==========================================
      Files         409      409
      Lines      109295   109442     +147
      Branches    18072    18037      -35
    ==========================================
    + Hits        77178    77276      +98
    - Misses      21269    21359      +90
    + Partials    10848    10807      -41
fix: reuse HTTP connections in fallback service to prevent port exhaustion

The FallbackServiceClient uses a persistent background thread with a long-lived KJ HTTP client to serve all module resolution requests, but connections were not actually being reused because:

1. 301 redirect response bodies were never drained. KJ's HTTP client calls abortRead() on unread bodies, setting broken=true on the connection, which makes canReuse() return false. Every subsequent request opened a new TCP socket.

2. Between requests, the background thread blocks on MutexGuarded::when() (an OS-level mutex), not the KJ event loop. KJ's watchForClose() can only detect server-side connection closure via the event loop, so stale pooled connections caused DISCONNECTED exceptions.

Fixes:
- Drain response bodies on 301 redirects (readAllBytes().wait())
- Retry once on kj::Exception::Type::DISCONNECTED for stale connections
- Wire up FallbackServiceClient in both V1 (server.c++) and V2 (workerd-api.c++)

Without this fix, each test run creates ~4000 TIME_WAIT sockets, causing port exhaustion after 2-4 consecutive runs of vitest-pool-workers test suites. With the fix, TIME_WAIT stays at ~1000-1300 across 20+ runs.
Performance Changes

Merging this PR will degrade performance by 15.89%.
Code under review:

```cpp
}
// Block until the background thread has processed our request.
return state.when([](const SharedState& s) { return s.responseReady || s.shutdown; },
```
I'll need to experiment with it a bit to verify everything but this is good. One thing we might consider, however, is having a small pool of these clients so that multiple concurrent loads aren't blocking each other and bottlenecking things quite as much.
Code under review:

```cpp
// Retry once on disconnect (stale pooled connection).
for (int attempt = 0; attempt < 2; attempt++) {
  try {
```
Suggested change:

```diff
-  try {
+  KJ_TRY {
```
Harris recently introduced a new utility here, KJ_TRY { } KJ_CATCH(exception) { }, that reduces the boilerplate for this pattern. Using it, you won't need the auto exception = kj::getCaughtExceptionAsKj(); pattern.
Recommend making the change here and elsewhere in this PR where you use this pattern.
…eptionAsKj

Replace the boilerplate pattern of catch(...) followed by kj::getCaughtExceptionAsKj() with the newer KJ_TRY/KJ_CATCH macros that handle the conversion automatically.
Summary

The FallbackServiceClient (used for module resolution in local dev via vitest-pool-workers) maintains a persistent background thread with a long-lived KJ HTTP client, but connections were never actually reused. This caused ~4000 TIME_WAIT sockets per test run, leading to port exhaustion (EADDRNOTAVAIL) after 2-4 consecutive runs.

Root Cause
Two issues prevented connection reuse:

1. 301 redirect response bodies were never drained. KJ's HTTP client calls abortRead() on unread response bodies, which sets broken = true on the connection. This makes canReuse() return false, so the connection pool discards the connection and opens a new TCP socket for every subsequent request. The fallback service returns 301 redirects frequently for module aliasing.

2. Stale pooled connections caused DISCONNECTED errors. Between requests, the background thread blocks on kj::MutexGuarded::when() (an OS-level mutex wait), not the KJ event loop. KJ's watchForClose() can only detect server-side connection closure via the event loop, so if the server closes an idle connection during the mutex wait, the next request uses a dead connection.

Changes
- Drain 301 redirect response bodies (readAllBytes().wait()) in both V1 and V2 paths so KJ can return the connection to the pool
- Retry once on kj::Exception::Type::DISCONNECTED to handle stale pooled connections transparently
- Wire up FallbackServiceClient in both V1 (server.c++) and V2 (workerd-api.c++) code paths; previously both paths were using per-request threads via the free tryResolve() function

Validation
- No port exhaustion across 20+ consecutive runs (previously failed after 2-4 runs with ~4,000 TIME_WAIT sockets)
- TIME_WAIT stable at ~1,000-1,300
- Tested with singleWorker: false, stressing concurrent module resolution

Fixes cloudflare/workers-sdk#7954