Skip to content

fix: bound Connection::abandoned_paths (prune on discard)#683

Open
caiogondim wants to merge 1 commit into
n0-computer:mainfrom
agent-habilis:fix/abandoned-path-leak-pr
Open

fix: bound Connection::abandoned_paths (prune on discard)#683
caiogondim wants to merge 1 commit into
n0-computer:mainfrom
agent-habilis:fix/abandoned-path-leak-pr

Conversation

@caiogondim
Copy link
Copy Markdown

@caiogondim caiogondim commented May 30, 2026

First, thanks for the awesome work 🙇.

Found while profiling a churn-proportional memory climb in an iroh-gossip mesh.
Reproduced deterministically against noq 1.0.0-rc.0 (6ee7cf2f) and confirmed
still unfixed on main (1.0.0-rc.1).

This is part of a memory leak I could reproduce on a local and distributed mesh using iroh-gossip.

Note

Related: n0-computer/iroh-gossip#147 and @drlukeangel's
#682 — a separate noq-level connection
leak (a Connecting dropped before the handshake completes is never closed).
We hit that one too and have adopted @drlukeangel's code for it alongside this
abandoned_paths fix in our downstream fork.

The bug

Connection::abandoned_paths: FxHashSet<PathId>
(noq-proto/src/connection/mod.rs) is only ever inserted into and read — never
pruned. When a path is fully torn down, discard_path reclaims the heavy
per-path state:

// discard_path(...)
self.paths.remove(&path_id);
self.spaces[SpaceId::Data].number_spaces.remove(&path_id);
// path_stats.discard(&path_id) already called above;
// PathDrained timer handler removes local_cid_state

…but it never removes path_id from abandoned_paths. So for a long-lived
connection that opens and abandons paths over time (e.g. repeated NAT
rebinds / path migration), abandoned_paths grows by one entry per abandon for
the life of the connection, even though every other per-path collection is
correctly reclaimed.

The field already carries a TODO acknowledging the shape is wrong:

// TODO(flub): Make this a more efficient data structure.  Like ranges of
//    abandoned paths.  Or a set together with a minimum.  Or something.
abandoned_paths: FxHashSet<PathId>,

Why it was retained

It is kept solely as the path-id monotonicity guard. open_path picks the
next id as one past the highest id ever seen, including discarded ones, to avoid
reusing a PathId:

let max_abandoned = self.abandoned_paths.iter().max().copied();
let max_used = self.paths.keys().last().copied();
let path_id = max_abandoned.max(max_used).unwrap_or(PathId::ZERO).saturating_add(1u8);

Once a path is discarded it leaves self.paths, so the only reason its id
stayed in abandoned_paths was to keep max_abandoned correct. Every
contains() check (open_path_ensure, close_path_inner, the last-open-path
test) concerns live paths in self.paths, so none of them needs discarded
ids.

Deterministic reproduction

noq-proto is sans-IO, so this reproduces with the existing ConnPair test
harness — no network. The added test abandon_cycle_bounded
(noq-proto/src/tests/multipath.rs) opens a fresh path and abandons the
previous one each cycle, always keeping one live anchor (so PATH_ABANDON is
deliverable and the path drains the normal way), and gauges every per-path
collection after each cycle via a test-only Connection::path_collection_sizes().

On 1.0.0-rc.0, before the fix:

cycle paths abandoned_paths remote_cids local_cid_state path_stats number_spaces
0 1 1 6 6 1 1
1 1 2 6 6 1 1
1 6 6 1 1
6 1 7 6 6 1 1

abandoned_paths climbs 1-per-cycle without bound; every other collection
stays flat
. The test asserts each collection is <= 6 and so fails on
abandoned_paths at cycle 6 against unfixed code, and passes with the fix.

The fix

Replace the lifetime-growing set's role-as-high-water with an explicit scalar,
and prune the set on discard:

  1. Add max_abandoned_path_id: Option<PathId> to Connection.
  2. In abandon_path, where the id is inserted into abandoned_paths, also
    self.max_abandoned_path_id = self.max_abandoned_path_id.max(Some(path_id));
  3. In open_path, use that scalar instead of scanning the set:
    let max_abandoned = self.max_abandoned_path_id;
  4. In discard_path, self.abandoned_paths.remove(&path_id);

Result: abandoned_paths is bounded by the number of concurrently
abandoned-but-not-yet-drained paths (small), path-id allocation stays strictly
monotonic, and the contains() semantics for live paths are unchanged. This is
exactly the "set together with a minimum/maximum" the TODO(flub) suggested.

Patch + tests: branch fix/abandoned-path-leak (commit 1a9d2b03) off
6ee7cf2f in our fork — touches noq-proto/src/connection/mod.rs (+ test-only
path_collection_sizes), noq-proto/src/connection/stats.rs (test-only
len), noq-proto/src/tests/multipath.rs (abandon_cycle_bounded). Full
cargo test -p noq-proto stays green (363 tests), including
open_path_ensure_after_abandon and the other reuse/abandon tests that prove
monotonicity is preserved.

What this is not

Our original motivation was a churn-proportional ~18–23 MB/min RSS climb on a
multi-machine iroh-gossip soak (a NAT'd cross-continent peer driving ~60–70
connection flaps/min). abandoned_paths entries are PathId-sized, so this bug
cannot account for that magnitude, and a fleet re-run with this fix confirmed
the climb persists. We also verified noq's connection teardown is clean — a
connection_freed_after_peer_vanishes test (peer disappears with no graceful
CONNECTION_CLOSE) shows the driver task terminates and ConnectionInner is
reclaimed via the idle timeout. So the dominant leak is above noq, in the
iroh / iroh-gossip per-node/per-dial bookkeeping under that churn — a separate
investigation. This report is just the bounded-memory fix for abandoned_paths.

Reproduce

git clone https://github.com/n0-computer/noq && cd noq
git checkout 6ee7cf2f            # noq 1.0.0-rc.0
# add the abandon_cycle_bounded test (see the fork branch) and:
cargo test -p noq-proto abandon_cycle_bounded   # fails (abandoned_paths grows)
# apply the fix → passes

`Connection::abandoned_paths` (a `FxHashSet<PathId>`) grew without bound: when
a path is fully discarded, `discard_path` reclaims `paths`, `number_spaces` and
`path_stats` (and the `PathDrained` timer clears `local_cid_state`), but the id
was never removed from `abandoned_paths`. The set was kept only as the path-id
monotonicity guard — `open_path` computed the next id from
`abandoned_paths.iter().max()` to avoid reusing a discarded id — which is also
what the existing `TODO(flub)` on the field anticipated ("a set together with a
minimum").

Replace the lifetime-growing set with a `max_abandoned_path_id` scalar (updated
where ids are inserted into the set), read that in `open_path`, and prune the
set in `discard_path`. The set is now bounded by the number of concurrently
abandoned-but-not-yet-drained paths, not by the lifetime abandon count, while
path-id allocation stays strictly monotonic.

Adds a deterministic regression test (`abandon_cycle_bounded`) that open/abandons
a path each cycle while keeping one live anchor and asserts every per-path
collection stays bounded, plus `connection_freed_after_peer_vanishes` showing a
connection is fully reclaimed when the peer vanishes without a graceful close.
Test-only `path_collection_sizes` / `PathStatsMap::len` accessors back the gauge.
/// above this. It is kept as a scalar — rather than read off
/// [`Connection::abandoned_paths`] — so the set can be pruned on discard
/// without losing the reuse guard. `None` until the first abandon.
max_abandoned_path_id: Option<PathId>,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing this by adding more state to the Connection struct is not the way to go. It spreads out the logic of what is happening over the many places that touch it and increase the overall complexity. The logical behaviour of the FxHashSet should be preserved, this should be all that the rest of the application deals with.

So instead the fix for this should look at replacing the type of abandoned_paths with a custom type that has a fixed amount of storage, directly proportional to the configured max_concurrent_multipath_paths. The external interface then stays the same, or can be equivalent but better e.g. in the case of the .iter().max().copied() which isn't exactly great.

@github-project-automation github-project-automation Bot moved this from 🚑 Needs Triage to 🏗 In progress in iroh Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🏗 In progress

Development

Successfully merging this pull request may close these issues.

2 participants