fix: bound Connection::abandoned_paths (prune on discard)#683
Open
caiogondim wants to merge 1 commit into
Open
fix: bound Connection::abandoned_paths (prune on discard)#683caiogondim wants to merge 1 commit into
caiogondim wants to merge 1 commit into
Conversation
`Connection::abandoned_paths` (a `FxHashSet<PathId>`) grew without bound: when
a path is fully discarded, `discard_path` reclaims `paths`, `number_spaces` and
`path_stats` (and the `PathDrained` timer clears `local_cid_state`), but the id
was never removed from `abandoned_paths`. The set was kept only as the path-id
monotonicity guard — `open_path` computed the next id from
`abandoned_paths.iter().max()` to avoid reusing a discarded id — which is also
what the existing `TODO(flub)` on the field anticipated ("a set together with a
minimum").
Replace the lifetime-growing set with a `max_abandoned_path_id` scalar (updated
where ids are inserted into the set), read that in `open_path`, and prune the
set in `discard_path`. The set is now bounded by the number of concurrently
abandoned-but-not-yet-drained paths, not by the lifetime abandon count, while
path-id allocation stays strictly monotonic.
Adds a deterministic regression test (`abandon_cycle_bounded`) that open/abandons
a path each cycle while keeping one live anchor and asserts every per-path
collection stays bounded, plus `connection_freed_after_peer_vanishes` showing a
connection is fully reclaimed when the peer vanishes without a graceful close.
Test-only `path_collection_sizes` / `PathStatsMap::len` accessors back the gauge.
flub
requested changes
Jun 1, 2026
| /// above this. It is kept as a scalar — rather than read off | ||
| /// [`Connection::abandoned_paths`] — so the set can be pruned on discard | ||
| /// without losing the reuse guard. `None` until the first abandon. | ||
| max_abandoned_path_id: Option<PathId>, |
Collaborator
There was a problem hiding this comment.
Fixing this by adding more state to the Connection struct is not the way to go. It spreads out the logic of what is happening over the many places that touch it and increase the overall complexity. The logical behaviour of the FxHashSet should be preserved, this should be all that the rest of the application deals with.
So instead the fix for this should look at replacing the type of abandoned_paths with a custom type that has a fixed amount of storage, directly proportional to the configured max_concurrent_multipath_paths. The external interface then stays the same, or can be equivalent but better e.g. in the case of the .iter().max().copied() which isn't exactly great.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
First, thanks for the awesome work 🙇.
Found while profiling a churn-proportional memory climb in an iroh-gossip mesh.
Reproduced deterministically against
noq 1.0.0-rc.0(6ee7cf2f) and confirmedstill unfixed on
main(1.0.0-rc.1).This is part of a memory leak I could reproduce on a local and distributed mesh using iroh-gossip.
Note
Related: n0-computer/iroh-gossip#147 and @drlukeangel's
#682 — a separate noq-level connection
leak (a
Connectingdropped before the handshake completes is never closed).We hit that one too and have adopted @drlukeangel's code for it alongside this
abandoned_pathsfix in our downstream fork.The bug
Connection::abandoned_paths: FxHashSet<PathId>(
noq-proto/src/connection/mod.rs) is only ever inserted into and read — neverpruned. When a path is fully torn down,
discard_pathreclaims the heavyper-path state:
…but it never removes
path_idfromabandoned_paths. So for a long-livedconnection that opens and abandons paths over time (e.g. repeated NAT
rebinds / path migration),
abandoned_pathsgrows by one entry per abandon forthe life of the connection, even though every other per-path collection is
correctly reclaimed.
The field already carries a
TODOacknowledging the shape is wrong:Why it was retained
It is kept solely as the path-id monotonicity guard.
open_pathpicks thenext id as one past the highest id ever seen, including discarded ones, to avoid
reusing a
PathId:Once a path is discarded it leaves
self.paths, so the only reason its idstayed in
abandoned_pathswas to keepmax_abandonedcorrect. Everycontains()check (open_path_ensure,close_path_inner, the last-open-pathtest) concerns live paths in
self.paths, so none of them needs discardedids.
Deterministic reproduction
noq-protois sans-IO, so this reproduces with the existingConnPairtestharness — no network. The added test
abandon_cycle_bounded(
noq-proto/src/tests/multipath.rs) opens a fresh path and abandons theprevious one each cycle, always keeping one live anchor (so
PATH_ABANDONisdeliverable and the path drains the normal way), and gauges every per-path
collection after each cycle via a test-only
Connection::path_collection_sizes().On
1.0.0-rc.0, before the fix:abandoned_pathsclimbs 1-per-cycle without bound; every other collectionstays flat. The test asserts each collection is
<= 6and so fails onabandoned_pathsat cycle 6 against unfixed code, and passes with the fix.The fix
Replace the lifetime-growing set's role-as-high-water with an explicit scalar,
and prune the set on discard:
max_abandoned_path_id: Option<PathId>toConnection.abandon_path, where the id is inserted intoabandoned_paths, alsoself.max_abandoned_path_id = self.max_abandoned_path_id.max(Some(path_id));open_path, use that scalar instead of scanning the set:let max_abandoned = self.max_abandoned_path_id;discard_path,self.abandoned_paths.remove(&path_id);Result:
abandoned_pathsis bounded by the number of concurrentlyabandoned-but-not-yet-drained paths (small), path-id allocation stays strictly
monotonic, and the
contains()semantics for live paths are unchanged. This isexactly the "set together with a minimum/maximum" the
TODO(flub)suggested.Patch + tests: branch
fix/abandoned-path-leak(commit1a9d2b03) off6ee7cf2fin our fork — touchesnoq-proto/src/connection/mod.rs(+ test-onlypath_collection_sizes),noq-proto/src/connection/stats.rs(test-onlylen),noq-proto/src/tests/multipath.rs(abandon_cycle_bounded). Fullcargo test -p noq-protostays green (363 tests), includingopen_path_ensure_after_abandonand the other reuse/abandon tests that provemonotonicity is preserved.
What this is not
Our original motivation was a churn-proportional ~18–23 MB/min RSS climb on a
multi-machine iroh-gossip soak (a NAT'd cross-continent peer driving ~60–70
connection flaps/min).
abandoned_pathsentries arePathId-sized, so this bugcannot account for that magnitude, and a fleet re-run with this fix confirmed
the climb persists. We also verified noq's connection teardown is clean — a
connection_freed_after_peer_vanishestest (peer disappears with no gracefulCONNECTION_CLOSE) shows the driver task terminates andConnectionInnerisreclaimed via the idle timeout. So the dominant leak is above noq, in the
iroh / iroh-gossip per-node/per-dial bookkeeping under that churn — a separate
investigation. This report is just the bounded-memory fix for
abandoned_paths.Reproduce