
feat(bitswap/httpnet): dedupe traffic across peer IDs sharing one HTTP gateway#1151

Draft
lidel wants to merge 6 commits into main from feat/httpnet-host-dedup

Conversation

@lidel lidel commented Apr 30, 2026

Important

WIP. This is also prework for features like Webseeds: even if this code is not reused there, the same problems have to be solved.

I want to test on rainbow staging in an A/B configuration before rolling this out broadly. The intent is to cut redundant requests when multiple peer IDs resolve to the same HTTP gateway, and the behaviour changes around throttling are real enough to warrant traffic-shaped validation. Better safe than sorry.

Background

Delegated routing routinely returns multiple peer IDs for one HTTP gateway. As of today, /dns/a-fil-http.aur.lu/tcp/443/https is advertised under three peer IDs in cid.contact responses. Bitswap creates one MessageQueue per peer ID, so a single want-have broadcast for one CID becomes three identical HEAD requests against the same upstream. Want-have probes that bitswap sends to non-best peers fan out the same way. The pattern will only compound as HTTP retrieval adoption grows unless we clean this up.

Changes

Six commits, each reviewable in isolation.

feat(httpnet): coalesce concurrent requests per HTTP endpoint

Concurrent identical requests against the same (scheme, host, sni, method, cid) share one round trip via a small singleflight tracker. Each waiter still updates its own per-URL cooldown and bitswap accounting from the shared response.

feat(httpnet): refcount pinger per HTTP endpoint

Multiple peer IDs sharing a gateway share one 5s ticker and one EWMA latency reading. Idle traffic for a three-peer-on-one-gateway record drops from ~36 to ~12 HEAD /ipfs/bafkqaaa requests per minute.
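The refcounting can be sketched roughly as below; endpointKey, hostPing, and the method names are assumptions drawn from the commit description, and the EWMA latency bookkeeping is omitted:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// endpointKey: one entry per physical gateway, not per peer ID.
type endpointKey struct{ scheme, host, sni string }

type hostPing struct {
	refs int
	stop chan struct{}
}

// pinger keys its state by endpoint: peers sharing a gateway bump a
// refcount on the existing ticker instead of starting their own.
type pinger struct {
	mu    sync.Mutex
	hosts map[endpointKey]*hostPing
}

func newPinger() *pinger { return &pinger{hosts: make(map[endpointKey]*hostPing)} }

func (p *pinger) startPinging(k endpointKey, probe func()) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if hp, ok := p.hosts[k]; ok {
		hp.refs++ // share the ticker that is already running
		return
	}
	hp := &hostPing{refs: 1, stop: make(chan struct{})}
	p.hosts[k] = hp
	go func() {
		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				probe() // one HEAD /ipfs/bafkqaaa per gateway per tick
			case <-hp.stop:
				return
			}
		}
	}()
}

func (p *pinger) stopPinging(k endpointKey) {
	p.mu.Lock()
	defer p.mu.Unlock()
	hp, ok := p.hosts[k]
	if !ok {
		return
	}
	if hp.refs--; hp.refs == 0 {
		close(hp.stop)
		delete(p.hosts, k) // last peer gone: drop cached state, re-probe on reconnect
	}
}

func (p *pinger) activeTickers() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.hosts)
}

func main() {
	p := newPinger()
	k := endpointKey{"https", "a-fil-http.aur.lu", ""}
	for i := 0; i < 3; i++ { // three peer IDs, one gateway
		p.startPinging(k, func() {})
	}
	fmt.Println("tickers:", p.activeTickers()) // one shared ticker, not three
	p.stopPinging(k)
	p.stopPinging(k)
	fmt.Println("after two disconnects:", p.activeTickers())
	p.stopPinging(k)
	fmt.Println("after last disconnect:", p.activeTickers())
}
```

This same refcounted host map is what lets the Connect-probe commit below inherit a cached HEAD-support decision while at least one peer is still registered.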

feat(httpnet): skip Connect probe for known HTTP endpoints

The second peer to resolve to a gateway already proven working skips the HEAD probe and inherits the cached HEAD-support decision.

feat(httpnet): per-host breaker and error counters

serverErrors and errorTracker.counts are shared per endpointKey. One peer's real server error trips the breaker for every peer using the gateway; the 404-storm threshold trips on combined volume rather than N × threshold.

The same commit also:

  • stops the worker from incrementing the breaker on typeRetryLater second-hit. Throttling (429/502/503/504) means "I am busy", not "I am broken".
  • gives tryURL a time.Now().Before(dl) guard so a long-lived senderURL resumes naturally once Retry-After elapses.
  • has the periodic pinger ticker skip probes inside the cooldown window.
  • drops the per-host breaker and errorTracker entries when the last peer using a host disconnects, so the next reconnect starts fresh.
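The shared-counter mechanics amount to handing every peer on a host the same atomic. A minimal sketch, with endpointKey and breaker as illustrative names rather than the branch's actual API:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// endpointKey identifies one physical gateway.
type endpointKey struct {
	scheme, host, sni string
}

// breaker hands out one shared error counter per endpoint, so every
// peer ID resolving to the same gateway charges the same budget.
type breaker struct {
	mu       sync.Mutex
	counters map[endpointKey]*atomic.Int64
}

func newBreaker() *breaker {
	return &breaker{counters: make(map[endpointKey]*atomic.Int64)}
}

// counter returns the shared counter for k, allocating it on first use.
func (b *breaker) counter(k endpointKey) *atomic.Int64 {
	b.mu.Lock()
	defer b.mu.Unlock()
	c, ok := b.counters[k]
	if !ok {
		c = &atomic.Int64{}
		b.counters[k] = c
	}
	return c
}

// reset drops the entry; the next lookup starts from a fresh zero,
// mirroring "one good answer forgives accrued breaker state".
func (b *breaker) reset(k endpointKey) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.counters, k)
}

func main() {
	b := newBreaker()
	k := endpointKey{"https", "a-fil-http.aur.lu", ""}

	// two peer IDs sharing the gateway alias the same counter
	b.counter(k).Add(1) // peer A sees a 500
	b.counter(k).Add(1) // peer B sees a 500
	fmt.Println(b.counter(k).Load()) // combined count, not per-peer

	b.reset(k) // e.g. a 200 cleared the cooldown
	fmt.Println(b.counter(k).Load())
}
```

Keying by endpoint rather than peer ID is what turns N × MaxRetries of wasted attempts into a single shared budget per gateway.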

feat(httpnet): respect Retry-After in on-demand Ping

The on-demand Network.Ping path also skips probes for hosts in their cooldown window, matching the periodic ticker.

test(httpnet): cover per-host breaker, cooldown, and throttle semantics

Unit and integration tests pinning the behaviours the rest of the branch rests on, including an explicit assertion in TestBackOff that the per-host breaker counter stayed at 0 through a Retry-After cycle.

Behaviour changes

For a single peer on a unique host:

  • Real server errors (500 default arm, network failure, body read failure, malformed response): disconnect on MaxRetries, exactly as today.
  • 404 storms: errorTracker thresholds unchanged.
  • Throttling (429 / 502 / 503 / 504): changed. The peer stays connected, returns DONT_HAVE during the cooldown window, and resumes once Retry-After elapses. Today these disconnect after one retry cycle.

For multiple peer IDs sharing one gateway:

  • Throttle: one peer's 429 sets the host cooldown. Every peer sharing it returns DONT_HAVE without HTTP, and none disconnects. All resume at once after the window.
  • Real server error: one peer's typeServer increments the shared breaker. Every peer using the host disconnects on its next SendMsg.
  • 404 storm: combined error count trips at the threshold instead of N × threshold.
  • Connect probe: one HEAD per gateway instead of N.
  • Idle pings: one ticker per gateway instead of N.

For the live a-fil-http.aur.lu case (three peer IDs, Retry-After: 5), HTTP traffic during the throttle window drops from ~15 requests (3 initial 429s plus ~12 provider-rediscovery probe attempts) to ~4 (3 initial 429s plus 1 pinger tick). Recovery time drops from ~6s (cooldown plus provider re-discovery cycle) to ~5s (cooldown alone). Honouring Retry-After also means we stop probing the gateway during the very window it asked us to wait.

Defaults

  • DefaultSendErrorBackoff = 1s: fallback when 429/504 arrive without a Retry-After header.
  • DefaultMaxBackoff = 1m: hard cap on any cooldown.
    • Feels low; I'm tempted to raise it to 5m, but we can start with 1m.

Test plan

  • A/B on rainbow staging against rainbow v1.23
  • Verify retrieval success rate is unchanged or improved.
  • Confirm gateways no longer see probe traffic during their Retry-After windows.
  • If safe, ramp; otherwise iterate.
  • Add a changelog entry.

lidel added 6 commits April 30, 2026 15:14

feat(httpnet): coalesce concurrent requests per HTTP endpoint

multiple peer ids regularly resolve to the same http gateway. for
example, /dns/a-fil-http.aur.lu/tcp/443/https is currently advertised
under three peer ids by delegated-ipfs.dev. bitswap creates one
messagequeue per peer id, so a single want-have broadcast for one cid
fans out into n identical head requests against the same upstream;
want-have probes to non-best peers fan out the same way. the pattern
will compound as http retrieval adoption grows.

introduce a small singleflight tracker keyed on
(scheme, host, sni, method, cid). concurrent identical requests share
one round trip; each caller still updates its own per-url cooldown
from the shared response so bitswap's per-peer accounting is preserved.

- inflight.go: tracker plus key builder
- msg_sender.go: split tryurl into executerequest (wire i/o, runs once)
  and handleresponse (per-caller classification + cooldown)
- factor clearcooldown / applybackoff helpers out of the response switch

feat(httpnet): refcount pinger per HTTP endpoint

multiple peer ids that resolve to the same http gateway each ran their
own 5s ping ticker. for a record like
/dns/a-fil-http.aur.lu/tcp/443/https that is currently advertised under
three peer ids, that meant ~36 head /ipfs/bafkqaaa requests per minute
to one upstream while idle. the gateway is a single physical resource;
ping it once.

key pinger state by (scheme, host, sni) instead of peer.id. peers
sharing an endpoint bump a refcount on the existing hostping; the
ticker runs while refcount > 0. ewma latency is stored per host;
latency(peer.id) averages across that peer's hosts.

extract endpointkey out of inflightkey so the inflight tracker and the
pinger share one helper.

feat(httpnet): skip Connect probe for known HTTP endpoints

when delegated routing returns multiple peer ids that share one http
gateway (currently /dns/a-fil-http.aur.lu/tcp/443/https is advertised
under three peer ids), each peer's connect issued its own
head /ipfs/bafkqaaa probe to the same upstream. n peers => n probes.

before probing, look up the url in the pinger's host map. if an
earlier connect already proved the endpoint working, add the url
straight to workingaddrs and inherit the cached head-support decision.
the cache is valid as long as at least one peer is still registered
against the endpoint; once refcount hits zero the entry is dropped and
the next connect re-probes.

while here, replace the hardcoded "HEAD" / "GET" strings in the
package with http.MethodHead / http.MethodGet from net/http.

feat(httpnet): per-host breaker and error counters

multiple peer ids regularly resolve to one http gateway. for example,
/dns/a-fil-http.aur.lu/tcp/443/https is currently advertised under three
peer ids by delegated-ipfs.dev. the per-peer breaker and 404 counter
each let a misbehaving gateway burn n x maxretries / n x maxdonthave
attempts before everyone gives up; sharing them trips once for the
whole gateway.

key changes:
- senderURL.serverErrors becomes *atomic.Int64 shared per (scheme, host,
  sni) via a new breaker
- errorTracker keys on endpointkey, takes []*senderurl, distributes
  client error counts across each request's hosts
- clearcooldown also resets the per-host serverErrors on 200/4xx; one
  good answer forgives accrued breaker state for all peers using the
  host
- pinger.stopPinging drops breaker and errortracker entries when the
  last peer using a host disconnects, so reconnects start fresh

throttle handling matches http semantics:
- httpworker no longer increments serverErrors on the typeretrylater
  2nd-hit path. retry-after is "i am busy", not "i am broken"; charging
  the breaker on throttle would disconnect peers when the gateway is
  alive and just slow.
- tryurl auto-expires its cooldown deadline so a long-lived senderurl
  resumes naturally once retry-after elapses, instead of staying stuck
  on a stale dl
- pinger ticker skips probes while the host is on cooldown so we honour
  retry-after instead of polling inside the wait window

at 50 req/sec against one popular gateway with retry-after: 5, this
takes http traffic during the wait window from ~15 connect-probe
attempts down to ~4 (3 initial 429s + 1 pinger-skip-aware tick), and
recovery from cooldown + provider-rediscovery (~6s) down to cooldown
alone (~5s). on retry-after-less 429/504 the default backoff is 1s
(SendErrorBackoff), capped at maxbackoff=1m.

real server errors (500 default arm, network failures, body read
failures, malformed responses) still increment the breaker via
typeserver as before, so single-peer behavior on genuine breakage is
unchanged.

feat(httpnet): respect Retry-After in on-demand Ping

the periodic ticker already skips probes while a host is on cooldown.
the on-demand Network.Ping path (called once per messagequeue start
by donthavetimeoutmgr to bootstrap its dont-have timeout) still issued
probes during the wait window.

apply the same rule everywhere: if the host has an active cooldown,
skip the probe and report errcooldownactive. donthavetimeoutmgr already
treats any ping error by falling back to its default timeout, which
real message latency refines as soon as traffic flows; no metric or
behaviour regresses.

test(httpnet): cover per-host breaker, cooldown, and throttle semantics

add unit tests and integration tests pinning down the behaviours that
the host-dedup branch's correctness rests on:

- breaker_test.go: counter aliasing for the same key, independence
  across keys, reset drops the entry and a fresh allocation is handed
  out next, concurrent counter/reset is data-race clean
- cooldown_test.go: ondemand reports active-vs-expired-vs-missing
  correctly, setbyduration clamps at maxbackoff
- testbackoff: assert the per-host serverErrors counter stayed at 0
  after a retry-after cycle. throttling must not feed the breaker;
  without this assertion a regression that re-introduces the worker's
  retrylater 2nd-hit increment would still pass on dont-have counts.
- testsharedbreakerdisconnectsacrosspeers: peer a's typeserver
  increment via 500 must trip peer b's bestUrl on its next sendmsg.
  with two peers connected to the same host, both end up disconnected
  from one round of failures rather than one round per peer.
- testtryurl_cooldownautoexpiry: a senderurl carrying a stale (already
  elapsed) cooldown deadline must let the request through. fillsenderurls
  only checks the deadline at construction; tryurl needs the
  time.now().before(dl) guard to avoid sticking forever.
@lidel lidel added the status/blocked Unable to be worked further until needs are met label Apr 30, 2026
@lidel lidel self-assigned this Apr 30, 2026

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 68.91892% with 92 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.45%. Comparing base (23c380b) to head (eac3972).

Files with missing lines Patch % Lines
bitswap/network/httpnet/pinger.go 51.56% 57 Missing and 5 partials ⚠️
bitswap/network/httpnet/msg_sender.go 64.38% 22 Missing and 4 partials ⚠️
bitswap/network/httpnet/error_tracker.go 90.47% 1 Missing and 1 partial ⚠️
bitswap/network/httpnet/httpnet.go 87.50% 1 Missing and 1 partial ⚠️

@@            Coverage Diff             @@
##             main    #1151      +/-   ##
==========================================
+ Coverage   63.10%   63.45%   +0.35%     
==========================================
  Files         267      269       +2     
  Lines       26777    26918     +141     
==========================================
+ Hits        16897    17081     +184     
+ Misses       8158     8114      -44     
- Partials     1722     1723       +1     
Files with missing lines Coverage Δ
bitswap/network/httpnet/breaker.go 100.00% <100.00%> (ø)
bitswap/network/httpnet/cooldown.go 85.96% <100.00%> (+28.27%) ⬆️
bitswap/network/httpnet/inflight.go 100.00% <100.00%> (ø)
bitswap/network/httpnet/error_tracker.go 91.66% <90.47%> (-8.34%) ⬇️
bitswap/network/httpnet/httpnet.go 62.50% <87.50%> (+3.01%) ⬆️
bitswap/network/httpnet/msg_sender.go 71.93% <64.38%> (+11.08%) ⬆️
bitswap/network/httpnet/pinger.go 54.71% <51.56%> (+18.15%) ⬆️

... and 7 files with indirect coverage changes
