Skip to content

fix(iroh): evict cached mapped addrs when a RemoteStateActor shuts down#4294

Open
drlukeangel wants to merge 1 commit into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak
Open

fix(iroh): evict cached mapped addrs when a RemoteStateActor shuts down#4294
drlukeangel wants to merge 1 commit into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak

Conversation

@drlukeangel
Copy link
Copy Markdown

@drlukeangel drlukeangel commented May 29, 2026

Description

While running iroh under sustained remote churn (remotes continuously appearing and going away) in
a downstream project, we saw the per-remote address cache grow without bound. AddrMap — the cache
of mapped endpoint/relay addresses behind mapped_addrs — has no eviction path and is never pruned.
When a RemoteStateActor shuts down cleanly, RemoteMap::remove_or_restart_actor removes its
senders entry but leaves that remote's cached addresses in mapped_addrs forever, so
endpoint_addrs/relay_addrs grow once per remote-ever-seen.

(Also: thank you for iroh — the endpoint/discovery/QUIC stack let us build a multi-node mesh without
hand-rolling NAT traversal or relay handling, which is exactly why we could focus on this one cache.)

Fix

Add remove() and retain() to AddrMap (keeping the forward addrs and reverse lookup maps
consistent), and call them in the clean-shutdown branch of remove_or_restart_actor to drop the
gone remote's cached endpoint and relay addresses. Eviction happens only on clean shutdown, so
active remotes are untouched. relay_addrs is keyed by (addr, remote_id), so it uses retain to
drop all entries for the departing remote.

Breaking Changes

None. remove/retain are pub(super); behavior change is limited to releasing state for remotes
that have already gone away.

Notes & open questions

  • Test: added a unit test for AddrMap::remove/retain covering forward removal and
    reverse-lookup consistency (the subtle part — keeping addrs and lookup in sync). The
    end-to-end "cache tracks only live remotes" behavior is churn-dependent; we validated it via a
    downstream heap profile under continuous remote kill/respawn, where the mapped_addrs entries
    stopped growing with remotes-ever-seen. A before/after dhat capture on your churn harness is the
    cleanest way to quantify it.
  • Open question: are there other intended lifetimes for mapped_addrs entries (e.g. deliberately
    cached across reconnects)? If so, the eviction point may want to be narrower — guidance welcome.

Checklist

  • Self-review
  • Documented the disjoint-borrow reasoning in retain
  • Tests — unit test for remove/retain; integration validation noted above
  • No breaking changes

Related

Found together with two other churn-related leak fixes in the iroh stack:

Resolves #4293

AddrMap (the per-remote endpoint/relay address cache behind mapped_addrs) has no
eviction path and is never pruned. On clean RemoteStateActor shutdown,
remove_or_restart_actor removed the senders entry but left the remote's cached
addresses in endpoint_addrs/relay_addrs forever, so they grew once per
remote-ever-seen under churn.

Add remove()/retain() to AddrMap (keeping the forward addrs and reverse lookup
maps consistent) and evict the departing remote's cached addresses on the
clean-shutdown path. Add a unit test for remove/retain forward+reverse
consistency.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🚑 Needs Triage

Development

Successfully merging this pull request may close these issues.

mapped_addrs cache is never evicted, grows per remote-ever-seen under churn

1 participant