fix(hid): replay a node's last inventory through transient probe failures#222
Conversation
Greptile SummaryThis PR introduces a
Confidence Score: 5/5Safe to merge — the ledger logic is well-reasoned, all edge cases (first probe, grace expiry, saturated failure counter, vanished nodes, open-failure replay) behave correctly, and 6 targeted unit tests cover the decision table. The NodeLedger state machine is straightforward and the code handles every path traced: nodes that never had a healthy probe correctly surface the live result immediately (no stale replay), grace expiry cleanly removes last_good so a recovered node re-arms from a fresh baseline, saturating_add prevents counter overflow, and retain_nodes is called before probing so vanished nodes are pruned before any settle call can touch stale state. The open_failures path correctly ignores evict_channel since there is no cached channel to drop. No correctness issues were found beyond what is already documented in the previous review thread. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[OS enumerates HID candidates] --> B[Build seen_nodes]
B --> C{Channel cached?}
C -- Yes --> D[Add to active]
C -- No --> E[open_hidpp_channel]
E -- Ok Some --> F[Insert channel + add to active]
E -- Ok None --> G[Ignore - not HID++]
E -- Err --> H[open_failures list]
B --> I[retain: drop vanished channels & ledger entries]
D --> J[Concurrent probe with PROBE_BUDGET timeout]
F --> J
J -- timeout --> K[NodeProbe::failed\nhealthy=false, inventory=None]
J -- Ok --> L[NodeProbe\nhealthy=true/false, inventory=Some/None]
K --> M[ledger.settle]
L --> M
H --> N[ledger.settle\nhealthy=false, live=None]
M --> O{healthy?}
O -- Yes --> P[Update last_good\nreset failures]
O -- No --> Q{failures <= NODE_MISS_GRACE=3?}
Q -- Yes --> R[Replay last_good]
Q -- No --> S[Surface live result\nclear last_good]
M --> T{failures >= CHANNEL_EVICT_AFTER=2?}
T -- Yes --> U[evict_channel=true\nremove from channels map]
T -- No --> V[Keep channel]
P --> W[Output inventory]
R --> W
S --> W
N --> X[Output replayed/live inventory]
Reviews (3): Last reviewed commit: "fix(hid): replay a node's last inventory..." | Re-trigger Greptile |
|
This is a cleaner take on the same #218 root cause than my #221 (now closed in favor of this) — the per-node ledger handles the multi-receiver "one bad receiver suppresses the others" case and the cache-eviction starvation that a whole-cycle guard doesn't. One small gap the ledger doesn't reach: the one-shot |
…ures A HID node the OS still enumerates can stop answering HID++ — receiver register reads time out, or the transport read loop parks on a Disconnected handle. Such a tick used to yield an Ok(empty/partial) snapshot that is indistinguishable from "checked, no devices", so the GUI flapped between the device list and the empty state (#218), and a parked channel — only ever evicted when its node vanishes — wedged enumeration until the agent was restarted. Per-node ledger in the Enumerator: - probe_one now reports whether the node actually answered (receiver pairing-count register + complete slot walk; direct probes require a completed feature walk before their verdict counts) - a failed node replays its last completed inventory for a bounded grace (NODE_MISS_GRACE = 3 ticks) before surfacing the live result - from CHANNEL_EVICT_AFTER = 2 consecutive failures the node's cached channel is dropped and reopened fresh, completing the recovery path the transport read-park contract expects Also splits the pure HID++→core mappings out of inventory.rs (mappings.rs) to keep it under the size limit.
39e8a43 to
0fc51f9
Compare
|
Yes please — send the follow-up! You've read the gap exactly right: a fresh
And thanks for the gracious close on #221 — the diagnosis there matched what landed here almost line for line, and the One more ask, honestly the most valuable one: you have the only known repro rig (macOS 27 beta, MX Master 4 on Bolt). Now that this is merged, would you run a master build against it for a minute? Expected: the device list stays populated (a bad cycle logs a replay instead of flapping to "No devices"), and powering a device off still drops it within a cycle or two. If the flap survives on your stack, just reopen #218. |
…ss (AprilNEA#218) Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's last good inventory across watcher ticks, but the one-shot `enumerate()` free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is empty — so a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two isolated runs read 3 devices and 0" case from AprilNEA#218. - openlogi-hid: split the enumerate pass into `enumerate_reporting_health`, which also reports whether every probed node answered cleanly this pass (a probe timeout, open failure, or a receiver read short of its pairing count is unhealthy). The polling watcher ignores the flag (the ledger handles replay); the one-shot `enumerate()` retries on it, reusing the same enumerator so its ledger accumulates a snapshot to replay and the channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap. - openlogi-agent-core: factor the watcher tick->event decision into `WatchState::classify` (no behaviour change — `Err` already kept the last snapshot) and unit-test it: an enumerate failure after a success keeps the last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and persistent initial failure reports `Unavailable` once then recovers. (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Following up here, since the live re-test on #218 lands squarely on this PR's code. On the repro rig (macOS 27 beta, MX Master 4 on Bolt) the ledger absorbs short glitches exactly as designed — but it surfaced a sustained failure: a paired online device's deep The telling part: the Unifying path already guards this with I've opened #251 to mirror that guard onto Bolt (a |
… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ss (#218) (#237) Follow-up to #222. The per-node `NodeLedger` from #222 replays a node's last good inventory across watcher ticks, but the one-shot `enumerate()` free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is empty — so a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two isolated runs read 3 devices and 0" case from #218. - openlogi-hid: split the enumerate pass into `enumerate_reporting_health`, which also reports whether every probed node answered cleanly this pass (a probe timeout, open failure, or a receiver read short of its pairing count is unhealthy). The polling watcher ignores the flag (the ledger handles replay); the one-shot `enumerate()` retries on it, reusing the same enumerator so its ledger accumulates a snapshot to replay and the channel stays warm. #226's 5 s `HidppChannel::send` timeout keeps a dead probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap. - openlogi-agent-core: factor the watcher tick->event decision into `WatchState::classify` (no behaviour change — `Err` already kept the last snapshot) and unit-test it: an enumerate failure after a success keeps the last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and persistent initial failure reports `Unavailable` once then recovers. (Invited in the #222 review; the watcher was untouched by #222.)
Problem
Enumerator::enumerate()returnsOkwith an empty or partial device list when a receiver probe fails or times out — the inventory watcher only guards againstErrfrom OS-level enumeration. A transient HID++ hiccup (busy receiver, timed-out feature walk, failed open) therefore masquerades as "checked: no devices", and the agent wipes live bindings/DPI state for devices that never went anywhere.Worse, a sick channel was only ever evicted when its OS node vanished.
read_reportparks forever onHidError::Disconnectedexpecting the watcher to evict the channel (see the comment intransport.rs), but for a node that stays visible while its channel is wedged, that eviction never came — the probe failed every tick from then on. This is the agent-side half of the 0.6.6 device-loss reports (#218).Fix
New
NodeLedger(per-HID-node probe-health ledger) inopenlogi-hid:paired.len() == pairing_count); a direct probe is healthy when its capabilities walk succeeded.NODE_MISS_GRACE = 3consecutive misses, so a transient failure no longer erases the node's devices. Recovery resets the count; after the grace expires, whatever the live (possibly partial) probe returned is surfaced.CHANNEL_EVICT_AFTER = 2consecutive misses, the node's HID++ channel is dropped and reopened on the next tick — completing the recovery looptransport.rsdocuments ("until the inventory watcher evicts the channel"). A wedged channel can now actually heal.Pure device-classification helpers moved out of
inventory.rsinto a newmappings.rs(file-size limit); no behavior change there.Notes
async_hid::DeviceIdis#[non_exhaustive]and cfg-gated per OS), which is also what makes it unit-testable.Fixes #218
Tests
Noneclearing, vanished-node pruning.cargo fmt/clippy --workspace --all-targets -D warnings/ full test suite green.