Skip to content

fix(noq): close the connection when a Connecting is dropped before handshake#682

Open
drlukeangel wants to merge 1 commit into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak
Open

fix(noq): close the connection when a Connecting is dropped before handshake#682
drlukeangel wants to merge 1 commit into
n0-computer:mainfrom
drlukeangel:fix/churn-memory-leak

Conversation

@drlukeangel
Copy link
Copy Markdown

@drlukeangel drlukeangel commented May 29, 2026

Description

While running noq (via iroh) under sustained churn in a downstream project — peers abruptly killed
without a graceful CONNECTION_CLOSE — we found in-progress connections were never reclaimed.
Connecting had no Drop implementation, so dropping a Connecting future before the handshake
completed didn't signal the connection task. Before handshake completion the idle timeout isn't yet
in effect and there's no handshake timeout, and the background ConnectionDriver plus the
EndpointInner event-channel sender keep the connection alive — so the half-open connection state
lingered in the endpoint maps indefinitely, once per aborted inbound handshake.

(Thanks for noq, too — having a workable multipath QUIC we could build on underneath iroh is a big
deal, and the Connecting/driver/endpoint separation made the right fix point clear.)

Fix

Implement Drop for Connecting: on drop, if the handshake hasn't completed, take the inner
connection reference, and if it isn't already closed, call implicit_close(). That transitions the
connection to drained, dispatches the close to the endpoint (which removes the handle from its
active maps), and wakes the driver to exit cleanly — releasing the connection, its channels, and its
packet spaces.

To satisfy the borrow checker (E0509: cannot move out of a type implementing Drop), the oneshot
connected receiver is wrapped in Option, and into_0rtt / Future::poll take() it (poll
restores it on Pending).

Breaking Changes

None. Connecting's public API is unchanged; this only adds cleanup on the drop path.

Notes & open questions

  • Test: a focused regression test (start an inbound handshake, drop the Connecting, assert the
    connection is removed from the endpoint maps) needs a connected-endpoint harness; glad to add one
    in whatever style fits your test suite. We validated the behavior via a downstream chaos soak
    (continuous kill/respawn): with the fix, the previously monotonic per-connection growth flattens
    and connection structures no longer accumulate in heap profiles.
  • Open question: would you prefer this paired with an explicit pre-handshake timeout, or is
    drop-driven cleanup the right scope here? Happy to follow your lead.

Checklist

  • Self-review
  • Documented the Option-wrap reasoning (E0509) and the drop-path intent
  • Tests — see note above; glad to add a harness-based regression test on request
  • No breaking changes

Related

Found together with two other churn-related leak fixes in the iroh stack:

Resolves #681

A Connecting dropped before handshake completion left its connection task, packet spaces, and channels stranded in EndpointInner forever (before handshake completion max_idle_timeout is not yet enabled and there is no handshake timeout). Drop now takes the inner connection and calls implicit_close() (guarded by !is_closed) to drain and release. The connected oneshot Receiver is wrapped in Option so poll/into_0rtt can take() it without moving out of a type implementing Drop (E0509).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: 🚑 Needs Triage

Development

Successfully merging this pull request may close these issues.

Connecting dropped before handshake leaks the connection (no Drop cleanup)

1 participant