
fix: reuse HTTP connections in fallback service to prevent port exhaustion #6115

Draft — penalosa wants to merge 2 commits into main from penalosa/fallback-service-thread-pool

Conversation

penalosa (Contributor) commented Feb 19, 2026

Summary

The FallbackServiceClient (used for module resolution in local dev via vitest-pool-workers) maintains a persistent background thread with a long-lived KJ HTTP client, but connections were never actually reused. This caused ~4000 TIME_WAIT sockets per test run, leading to port exhaustion (EADDRNOTAVAIL) after 2-4 consecutive runs.

Root Cause

Two issues prevented connection reuse:

  1. 301 redirect response bodies were never drained. KJ's HTTP client calls abortRead() on unread response bodies, which sets broken = true on the connection. This makes canReuse() return false, so the connection pool discards the connection and opens a new TCP socket for every subsequent request. The fallback service returns 301 redirects frequently for module aliasing.

  2. Stale pooled connections caused DISCONNECTED errors. Between requests, the background thread blocks on kj::MutexGuarded::when() (OS-level mutex wait), not the KJ event loop. KJ's watchForClose() can only detect server-side connection closure via the event loop, so if the server closes an idle connection during the mutex wait, the next request uses a dead connection.

Changes

  • Drain response bodies on 301 redirects (readAllBytes().wait()) in both V1 and V2 paths so KJ can return the connection to the pool
  • Retry once on kj::Exception::Type::DISCONNECTED to handle stale pooled connections transparently
  • Wire up FallbackServiceClient in both V1 (server.c++) and V2 (workerd-api.c++) code paths — previously both paths were using per-request threads via the free tryResolve() function

Validation

  • Before fix (npm workerd): Port exhaustion after 2 runs (16,273 TIME_WAIT sockets)
  • After fix: 20 consecutive runs all passing (51/51 tests each), TIME_WAIT stable at ~1,000-1,300
  • Test suite: 51 vitest files with singleWorker: false, stressing concurrent module resolution

Fixes cloudflare/workers-sdk#7954


codecov-commenter commented Feb 19, 2026

Codecov Report

❌ Patch coverage is 1.43541% with 206 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.60%. Comparing base (4f0e03e) to head (574c0da).
⚠️ Report is 24 commits behind head on main.

Files with missing lines                    Patch %   Lines
src/workerd/server/fallback-service.c++       0.00%   193 Missing ⚠️
src/workerd/server/workerd-api.c++            0.00%     7 Missing ⚠️
src/workerd/server/server.c++                 0.00%     6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6115      +/-   ##
==========================================
- Coverage   70.61%   70.60%   -0.01%     
==========================================
  Files         409      409              
  Lines      109295   109442     +147     
  Branches    18072    18037      -35     
==========================================
+ Hits        77178    77276      +98     
- Misses      21269    21359      +90     
+ Partials    10848    10807      -41     

☔ View full report in Codecov by Sentry.

Commit: fix: reuse HTTP connections in fallback service to prevent port exhaustion

The FallbackServiceClient uses a persistent background thread with a
long-lived KJ HTTP client to serve all module resolution requests, but
connections were not actually being reused because:

1. 301 redirect response bodies were never drained. KJ's HTTP client
   calls abortRead() on unread bodies, setting broken=true on the
   connection, which makes canReuse() return false. Every subsequent
   request opened a new TCP socket.

2. Between requests, the background thread blocks on MutexGuarded::when()
   (OS-level mutex), not the KJ event loop. KJ's watchForClose() can only
   detect server-side connection closure via the event loop, so stale
   pooled connections caused DISCONNECTED exceptions.

Fixes:
- Drain response bodies on 301 redirects (readAllBytes().wait())
- Add retry-once on kj::Exception::Type::DISCONNECTED for stale connections
- Wire up FallbackServiceClient in both V1 (server.c++) and V2 (workerd-api.c++)

Without this fix, each test run creates ~4000 TIME_WAIT sockets, causing
port exhaustion after 2-4 consecutive runs of vitest-pool-workers test
suites. With the fix, TIME_WAIT stays at ~1000-1300 across 20+ runs.
penalosa force-pushed the penalosa/fallback-service-thread-pool branch from 51abe5b to ae5977b on February 19, 2026 at 16:47.

codspeed-hq bot commented Feb 19, 2026

Merging this PR will degrade performance by 15.89%

❌ 1 regressed benchmark
✅ 69 untouched benchmarks
⏩ 129 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark                 BASE      HEAD      Efficiency
jsonResponse[Response]    39.8 µs   47.4 µs   -15.89%

Comparing penalosa/fallback-service-thread-pool (574c0da) with main (02c8669)


Footnotes

  1. 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived on CodSpeed to remove them from the performance reports.

penalosa force-pushed the penalosa/fallback-service-thread-pool branch from ae5977b to e148d48 on February 20, 2026 at 11:49.
}

// Block until the background thread has processed our request.
return state.when([](const SharedState& s) { return s.responseReady || s.shutdown; },
A collaborator commented:

I'll need to experiment with it a bit to verify everything, but this is good. One thing we might consider, however, is having a small pool of these clients so that multiple concurrent loads aren't blocking each other and bottlenecking things quite as much.


// Retry once on disconnect (stale pooled connection).
for (int attempt = 0; attempt < 2; attempt++) {
try {
jasnell (Collaborator) commented Feb 20, 2026:

Suggested change:
-    try {
+    KJ_TRY {

Harris recently introduced a new utility here, KJ_TRY { } KJ_CATCH(exception) { }, that reduces the boilerplate for this pattern. Using this, you won't need to worry about the auto exception = kj::getCaughtExceptionAsKj(); pattern.

Recommend making the change here and elsewhere in this PR where you use this pattern.



github-actions bot commented Feb 20, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

penalosa force-pushed the penalosa/fallback-service-thread-pool branch from a537526 to 7fca872 on February 20, 2026 at 21:32.
Commit (title truncated): …eptionAsKj

Replace the boilerplate pattern of catch(...) followed by
kj::getCaughtExceptionAsKj() with the newer KJ_TRY/KJ_CATCH macros
that handle the conversion automatically.
penalosa force-pushed the penalosa/fallback-service-thread-pool branch from 7fca872 to 574c0da on February 20, 2026 at 21:35.
Successfully merging this pull request may close: 🐛 BUG: Error running tests when using singleWorker: false
3 participants