Skip to content

fix(netwatch): reconcile interface state periodically#158

Closed
cbenhagen wants to merge 2 commits into
n0-computer:mainfrom
cbenhagen:netmon-periodic-reconcile
Closed

fix(netwatch): reconcile interface state periodically#158
cbenhagen wants to merge 2 commits into
n0-computer:mainfrom
cbenhagen:netmon-periodic-reconcile

Conversation

@cbenhagen

@cbenhagen cbenhagen commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Description

The netmon actor only re-enumerates interfaces in reaction to events from the OS route monitor (plus a detected wall-time jump). The OS route monitors are best-effort and can miss events or stop delivering them entirely, and when that happens the actor never recomputes State, so the interface state stays frozen forever at whatever it was when the last event was processed.

This was observed in production on macOS: a long-lived endpoint silently lost one of its interfaces (and that interface's address) and never recovered, even though the interface was up and carrying traffic the whole time. It stopped publishing/announcing that address until the process restarted.

Root cause

Two things compose:

  1. The route monitor can miss or stop delivering events, without any error. Routing sockets are best-effort by design: when the per-socket receive buffer overflows, the kernel drops the overflowing RTM_* messages. On macOS (XNU bsd/net/raw_usrreq.c, raw_input -> sbappendaddr) this drop is silent: so_error is not set and there is no SO_RERROR socket option at all, so a subsequent recv() just returns the next message that fit and the reader cannot tell anything was lost. (route(4) documents the loss; the macOS-silent specifics are in XNU source. FreeBSD only surfaces ENOBUFS with the opt-in SO_RERROR, off by default; Linux netlink surfaces ENOBUFS by default. So on macOS a change can be missed on an otherwise-working socket with no signal at all.)

    Separately, I observed something stronger in production: after a burst of route messages the AF_ROUTE socket.read(&mut buffer) in netmon/bsd.rs produced no further reads and no error for the rest of the process lifetime. I did not determine the exact cause and am not asserting one: macOS PF_ROUTE is independently documented to be unreliable (it can fail to emit RTM_* events for some interface transitions at all, see Apple Developer Forums 112068, where Apple steers callers to SystemConfiguration), and the socket is read through a tokio UnixStream (a stream abstraction over a SOCK_RAW fd), so a lost edge-triggered readiness wakeup is also possible. The read loop only recovers on a read error or when the actor channel closes, so for a no-error stall neither path fires.

  2. The actor has no fallback. Actor::run reacts to NetworkMessage::Change from the monitor and to wall-time jumps, but otherwise never re-enumerates. So a single missed or absent event freezes interface_state indefinitely. This looks like the same failure as iroh #3449 (macOS, AF_ROUTE: failed to parse rib: MessageTooShort after sleep, "iroh appeared defunct... restarting fixed it").

Reproduction

A standalone program on a multi-homed macOS host: build a netmon::Monitor and, every few seconds, log its interface_state() alongside a fresh interfaces::State::new(), netdev::get_interfaces(), and ifconfig. A brief real interface address change (DHCP renew / link blip) is enough to trigger it. After the blip:

Monitor.interface_state()  = {}                 <-- frozen, interface gone
interfaces::State::new()   = {192.168.x.y}      <-- the SAME call the actor rebuilds from
netdev::get_interfaces()   = {192.168.x.y}
ifconfig                   = {192.168.x.y}

The interface and its address are back everywhere except in the long-lived monitor, and the monitor's route socket records no further reads for the rest of the process lifetime.

Changes

Re-enumerate on the existing periodic wake-up in addition to wall-time jumps, so a missed or absent event self-heals within one interval (15s on desktop; unchanged 1h on mobile to preserve the existing battery trade-off). handle_potential_change already diffs the freshly enumerated state against the current one, so this only publishes a change when something actually changed; the steady-state cost is one interface enumeration per interval.

The change is intentionally at the actor level rather than trying to detect the stalled socket: the route monitors are best-effort on every platform (messages can be silently dropped on a perfectly healthy socket), so reconciling periodically makes netmon correct regardless of how or why an event was missed. The reconcile reuses the existing periodic wake-up rather than adding a new timer, and because handle_potential_change diffs against the current state it only emits a change when something actually changed.

This complements the existing error-path hardening (AF_ROUTE socket rebind on read errors, iroh #2909 / iroh #2913 / net-tools #105) by covering the case where events stop without an error.

To make the recovery path testable, the interface enumeration is made injectable (StateFn, defaulting to State::new) and the poll interval is made a field. The OS route monitor becomes Option<RouteMonitor> so a test can run the actor with no monitor at all.

Test recovers_state_without_route_events: drives the actor with a stub enumeration that reports a changed interface set, sends no route-monitor events whatsoever, and asserts interface_state still converges to the new value. It fails (times out) without the periodic reconcile and passes with it. cargo test -p netwatch, cargo clippy -p netwatch --all-targets, and cargo fmt are clean.

Breaking Changes

None. All changes are within the crate-private netmon module; the public Monitor API is unchanged.

Notes & open questions

  • The poll interval reuses the existing wall-time-poll cadence (15s desktop / 1h mobile). Happy to split them into separate constants if you would prefer a different reconcile cadence, particularly on mobile.
  • Should this reference / close iroh #3449? It looks like the same failure.
  • Follow-up worth considering (separate PR): the bsd.rs route monitor reads a SOCK_RAW AF_ROUTE fd through a tokio UnixStream. The documented primitive for an arbitrary fd is AsyncFd with a poll_read_ready + try_io drain-to-WouldBlock loop (this is what netlink-sys's TokioSocket does), which avoids stream-typing the socket and makes the edge-triggered re-arm explicit. This would reduce post-stall event latency below the poll interval, but it is not needed for correctness once the periodic reconcile is in place, and it cannot be unit-tested deterministically.

Change checklist

  • Self-review.
  • Documentation updates following the style guide, if relevant.
  • Tests if relevant.
  • All breaking changes documented. (None.)

@n0bot n0bot Bot added this to iroh Jun 12, 2026
@github-project-automation github-project-automation Bot moved this to 🚑 Needs Triage in iroh Jun 12, 2026
The netmon actor only re-enumerated interfaces in response to events
from the OS route monitor (or a detected wall-time jump). Those events
can be missed or stop entirely with no error:

- Routing sockets are best-effort. On receive-buffer overflow the
  kernel drops the overflowing RTM_* messages, and on macOS (XNU
  raw_input/sbappendaddr) the drop is silent: so_error is not set and
  there is no SO_RERROR option, so the reader cannot tell a message was
  lost. A change can be missed on an otherwise-working socket.
- In production we additionally observed the macOS AF_ROUTE read stop
  delivering events entirely (no further reads, no error) for the rest
  of the process lifetime. macOS PF_ROUTE is independently documented
  to be unreliable (it can fail to emit some interface-change events at
  all), and the socket is read through a tokio UnixStream over a
  SOCK_RAW fd, so a lost edge-triggered readiness wakeup is also
  possible; the exact cause was not determined and the fix does not
  depend on it.

With no fallback the actor never recomputes State, so the interface
state freezes at whatever it was when the last event was processed. In
practice a brief interface change (a DHCP renew or link blip) made an
endpoint permanently lose an interface and its addresses even though
the interface was up and its address returned immediately; a fresh
enumeration reflected the correct state the whole time. This matches
iroh#3449.

Re-enumerate on the existing periodic wake-up in addition to wall-time
jumps, so a missed or absent event self-heals within one interval (15s
on desktop). handle_potential_change already diffs against the current
state, so this only emits a change when something actually changed.
This is the same event-plus-periodic-reconcile pattern Tailscale uses
(15s pollWallTimeInterval backstop) and complements the existing
error-path socket rebind (net-tools#105) by covering the no-error case.

The interface enumeration is made injectable and the poll interval
configurable so the recovery path can be tested deterministically
without a route monitor. Adds a regression test that drives the actor
with no route-monitor events and asserts the state still converges to
a changed enumeration.
@cbenhagen cbenhagen force-pushed the netmon-periodic-reconcile branch from 79b33b0 to c0b2e65 Compare June 12, 2026 14:42
@rklaehn

rklaehn commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Unfortunately this won't make it into iroh 1.0.

In general I am a big fan of having a brute force periodic poll fallback to event based mechanisms, so this looks good.

// current state, so this is a no-op (beyond the enumeration
// itself) whenever nothing actually changed.
pending_change = true;
debounce.as_mut().reset(Instant::now() + DEBOUNCE);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now at every interval this always does a full enumeration. What does the check_wall_time_advance logic still achieve at this point? Because it basically does a full enumeration every 15s now. Which IIUC the wall-clock jump was trying to optimise to only do if there is a clock jump. And they both trigger by the same timer, so it is not like one triggers faster.

Could you point to the exact code in tailscale that does a full network enumeration every 15s? Because they also have this time-jump detection and I'd like to understand how they let these two interact.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, just resume-from-sleep detection now: stamps last_unsuspend and force-publishes on wake even when interfaces are unchanged (the reconcile treats that as a no-op). Though last_unsuspend is write-only in netwatch atm, so the only live effect is that wake notification.

Sorry to send you down the wrong trail here and wasting your time. 🙈 I didn't fully verify this claim and it proved to be wrong. Tailscale doesn't reconcile every 15s.

Comment thread netwatch/src/netmon/actor.rs Outdated
get_state: StateFn,
/// Interval at which the actor re-enumerates interfaces and checks wall
/// time. Defaults to [`POLL_INTERVAL`].
poll_interval: Duration,

@flub flub Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW all these changes to the struct are only to support a test. I'm not really liking this currently, it's a bunch of complexity for little gain. Maybe there are better testing strategies to think off. Starting and running with paused tokio time is probably part of that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the route monitor and interface enumeration into Actor::run parameters and driving the test under paused tokio time.

@flub

flub commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

WRT message loss due to the kernel's buffer for the socket overflowing and silent drops: Is this a real problem in the current design? IIUC we only receive some of the netlink events, but then we don't actually parse the events: instead we wait for a debounce timer to expire and than construct a new interfaces table. It is unlikely that the lost event is therefore critical.

The 2nd problem of the socket failing to produce events at all for the remaining lifetime is more worrying. That would indeed stop updates from happening. Just to be clear: this is without any wall-clock timejumps occurring?

The linked Apple Developers Forum thread doesn't really sound like the responder knows much about what they're talking I'm afraid.

Generally I have the impression the bug we need to find and fix is somewhere lower in the stack than where this work-around is happening. Possibly in the bsd.rs RouteMonitor impl. I'm not sure if your observation of a STREAM socket vs raw has much impact.

Move the route monitor and interface enumeration into Actor::run
parameters instead of struct fields, and drive the recovery test under
paused tokio time so the poll interval need not be configurable. The
Actor struct no longer carries any field that exists only for testing.
@cbenhagen

cbenhagen commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review!

Do we want to have this periodic fallback at all? I am happy to close this PR and concentrate on fixing the root issue. Yes this is happening without any wall-clock jumps.

I agree that only if the socket fails to produce any events for the remaining lifetime is a real issue. I am currently trying to find faster ways to reproduce the issue and will come up with another PR if I find a good way too prove and fix it.

@cbenhagen

Copy link
Copy Markdown
Contributor Author

@flub #165 should fix the root issue.

@flub

flub commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

I think for now I'm more comfortable not doing this yet, until there is more evidence that this is really needed. Thanks for finding the root cause in #165!

@flub flub closed this Jun 16, 2026
@github-project-automation github-project-automation Bot moved this from 🚑 Needs Triage to ✅ Done in iroh Jun 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

bug: netwatch::netmon::bsd: AF_ROUTE: failed to parse rib: MessageTooShort

3 participants