fix(netwatch): reconcile interface state periodically#158
Conversation
The netmon actor only re-enumerated interfaces in response to events from the OS route monitor (or a detected wall-time jump). Those events can be missed or stop entirely with no error: - Routing sockets are best-effort. On receive-buffer overflow the kernel drops the overflowing RTM_* messages, and on macOS (XNU raw_input/sbappendaddr) the drop is silent: so_error is not set and there is no SO_RERROR option, so the reader cannot tell a message was lost. A change can be missed on an otherwise-working socket. - In production we additionally observed the macOS AF_ROUTE read stop delivering events entirely (no further reads, no error) for the rest of the process lifetime. macOS PF_ROUTE is independently documented to be unreliable (it can fail to emit some interface-change events at all), and the socket is read through a tokio UnixStream over a SOCK_RAW fd, so a lost edge-triggered readiness wakeup is also possible; the exact cause was not determined and the fix does not depend on it. With no fallback the actor never recomputes State, so the interface state freezes at whatever it was when the last event was processed. In practice a brief interface change (a DHCP renew or link blip) made an endpoint permanently lose an interface and its addresses even though the interface was up and its address returned immediately; a fresh enumeration reflected the correct state the whole time. This matches iroh#3449. Re-enumerate on the existing periodic wake-up in addition to wall-time jumps, so a missed or absent event self-heals within one interval (15s on desktop). handle_potential_change already diffs against the current state, so this only emits a change when something actually changed. This is the same event-plus-periodic-reconcile pattern Tailscale uses (15s pollWallTimeInterval backstop) and complements the existing error-path socket rebind (net-tools#105) by covering the no-error case. The interface enumeration is made injectable and the poll interval configurable so the recovery path can be tested deterministically without a route monitor. Adds a regression test that drives the actor with no route-monitor events and asserts the state still converges to a changed enumeration.
79b33b0 to
c0b2e65
Compare
|
Unfortunately this won't make it into iroh 1.0. In general I am a big fan of having a brute force periodic poll fallback to event based mechanisms, so this looks good. |
| // current state, so this is a no-op (beyond the enumeration | ||
| // itself) whenever nothing actually changed. | ||
| pending_change = true; | ||
| debounce.as_mut().reset(Instant::now() + DEBOUNCE); |
There was a problem hiding this comment.
So now at every interval this always does a full enumeration. What does the check_wall_time_advance logic still achieve at this point? Because it basically does a full enumeration every 15s now. Which IIUC the wall-clock jump was trying to optimise to only do if there is a clock jump. And they both trigger by the same timer, so it is not like one triggers faster.
Could you point to the exact code in tailscale that does a full network enumeration every 15s? Because they also have this time-jump detection and I'd like to understand how they let these two interact.
There was a problem hiding this comment.
Right, just resume-from-sleep detection now: stamps last_unsuspend and force-publishes on wake even when interfaces are unchanged (the reconcile treats that as a no-op). Though last_unsuspend is write-only in netwatch atm, so the only live effect is that wake notification.
Sorry to send you down the wrong trail here and wasting your time. 🙈 I didn't fully verify this claim and it proved to be wrong. Tailscale doesn't reconcile every 15s.
| get_state: StateFn, | ||
| /// Interval at which the actor re-enumerates interfaces and checks wall | ||
| /// time. Defaults to [`POLL_INTERVAL`]. | ||
| poll_interval: Duration, |
There was a problem hiding this comment.
FWIW all these changes to the struct are only to support a test. I'm not really liking this currently, it's a bunch of complexity for little gain. Maybe there are better testing strategies to think off. Starting and running with paused tokio time is probably part of that.
There was a problem hiding this comment.
Moved the route monitor and interface enumeration into Actor::run parameters and driving the test under paused tokio time.
|
WRT message loss due to the kernel's buffer for the socket overflowing and silent drops: Is this a real problem in the current design? IIUC we only receive some of the netlink events, but then we don't actually parse the events: instead we wait for a debounce timer to expire and than construct a new interfaces table. It is unlikely that the lost event is therefore critical. The 2nd problem of the socket failing to produce events at all for the remaining lifetime is more worrying. That would indeed stop updates from happening. Just to be clear: this is without any wall-clock timejumps occurring? The linked Apple Developers Forum thread doesn't really sound like the responder knows much about what they're talking I'm afraid. Generally I have the impression the bug we need to find and fix is somewhere lower in the stack than where this work-around is happening. Possibly in the bsd.rs RouteMonitor impl. I'm not sure if your observation of a STREAM socket vs raw has much impact. |
Move the route monitor and interface enumeration into Actor::run parameters instead of struct fields, and drive the recovery test under paused tokio time so the poll interval need not be configurable. The Actor struct no longer carries any field that exists only for testing.
|
Thanks for the review! Do we want to have this periodic fallback at all? I am happy to close this PR and concentrate on fixing the root issue. Yes this is happening without any wall-clock jumps. I agree that only if the socket fails to produce any events for the remaining lifetime is a real issue. I am currently trying to find faster ways to reproduce the issue and will come up with another PR if I find a good way too prove and fix it. |
|
I think for now I'm more comfortable not doing this yet, until there is more evidence that this is really needed. Thanks for finding the root cause in #165! |
Description
The
netmonactor only re-enumerates interfaces in reaction to events from the OS route monitor (plus a detected wall-time jump). The OS route monitors are best-effort and can miss events or stop delivering them entirely, and when that happens the actor never recomputesState, so the interface state stays frozen forever at whatever it was when the last event was processed.This was observed in production on macOS: a long-lived endpoint silently lost one of its interfaces (and that interface's address) and never recovered, even though the interface was up and carrying traffic the whole time. It stopped publishing/announcing that address until the process restarted.
Root cause
Two things compose:
The route monitor can miss or stop delivering events, without any error. Routing sockets are best-effort by design: when the per-socket receive buffer overflows, the kernel drops the overflowing
RTM_*messages. On macOS (XNUbsd/net/raw_usrreq.c,raw_input->sbappendaddr) this drop is silent:so_erroris not set and there is noSO_RERRORsocket option at all, so a subsequentrecv()just returns the next message that fit and the reader cannot tell anything was lost. (route(4)documents the loss; the macOS-silent specifics are in XNU source. FreeBSD only surfacesENOBUFSwith the opt-inSO_RERROR, off by default; Linux netlink surfacesENOBUFSby default. So on macOS a change can be missed on an otherwise-working socket with no signal at all.)Separately, I observed something stronger in production: after a burst of route messages the
AF_ROUTEsocket.read(&mut buffer)innetmon/bsd.rsproduced no further reads and no error for the rest of the process lifetime. I did not determine the exact cause and am not asserting one: macOS PF_ROUTE is independently documented to be unreliable (it can fail to emitRTM_*events for some interface transitions at all, see Apple Developer Forums 112068, where Apple steers callers to SystemConfiguration), and the socket is read through a tokioUnixStream(a stream abstraction over aSOCK_RAWfd), so a lost edge-triggered readiness wakeup is also possible. The read loop only recovers on a read error or when the actor channel closes, so for a no-error stall neither path fires.The actor has no fallback.
Actor::runreacts toNetworkMessage::Changefrom the monitor and to wall-time jumps, but otherwise never re-enumerates. So a single missed or absent event freezesinterface_stateindefinitely. This looks like the same failure as iroh #3449 (macOS,AF_ROUTE: failed to parse rib: MessageTooShortafter sleep, "iroh appeared defunct... restarting fixed it").Reproduction
A standalone program on a multi-homed macOS host: build a
netmon::Monitorand, every few seconds, log itsinterface_state()alongside a freshinterfaces::State::new(),netdev::get_interfaces(), andifconfig. A brief real interface address change (DHCP renew / link blip) is enough to trigger it. After the blip:The interface and its address are back everywhere except in the long-lived monitor, and the monitor's route socket records no further reads for the rest of the process lifetime.
Changes
Re-enumerate on the existing periodic wake-up in addition to wall-time jumps, so a missed or absent event self-heals within one interval (15s on desktop; unchanged 1h on mobile to preserve the existing battery trade-off).
handle_potential_changealready diffs the freshly enumerated state against the current one, so this only publishes a change when something actually changed; the steady-state cost is one interface enumeration per interval.The change is intentionally at the actor level rather than trying to detect the stalled socket: the route monitors are best-effort on every platform (messages can be silently dropped on a perfectly healthy socket), so reconciling periodically makes
netmoncorrect regardless of how or why an event was missed. The reconcile reuses the existing periodic wake-up rather than adding a new timer, and becausehandle_potential_changediffs against the current state it only emits a change when something actually changed.This complements the existing error-path hardening (
AF_ROUTEsocket rebind on read errors, iroh #2909 / iroh #2913 / net-tools #105) by covering the case where events stop without an error.To make the recovery path testable, the interface enumeration is made injectable (
StateFn, defaulting toState::new) and the poll interval is made a field. The OS route monitor becomesOption<RouteMonitor>so a test can run the actor with no monitor at all.Test
recovers_state_without_route_events: drives the actor with a stub enumeration that reports a changed interface set, sends no route-monitor events whatsoever, and assertsinterface_statestill converges to the new value. It fails (times out) without the periodic reconcile and passes with it.cargo test -p netwatch,cargo clippy -p netwatch --all-targets, andcargo fmtare clean.Breaking Changes
None. All changes are within the crate-private
netmonmodule; the publicMonitorAPI is unchanged.Notes & open questions
bsd.rsroute monitor reads aSOCK_RAWAF_ROUTEfd through a tokioUnixStream. The documented primitive for an arbitrary fd isAsyncFdwith apoll_read_ready+try_iodrain-to-WouldBlockloop (this is whatnetlink-sys'sTokioSocketdoes), which avoids stream-typing the socket and makes the edge-triggered re-arm explicit. This would reduce post-stall event latency below the poll interval, but it is not needed for correctness once the periodic reconcile is in place, and it cannot be unit-tested deterministically.Change checklist