Skip to content

fix(hid): replay a node's last inventory through transient probe failures#222

Merged
AprilNEA merged 1 commit into
masterfrom
fix/receiver-probe-resilience
Jun 13, 2026
Merged

fix(hid): replay a node's last inventory through transient probe failures#222
AprilNEA merged 1 commit into
masterfrom
fix/receiver-probe-resilience

Conversation

@AprilNEA

Copy link
Copy Markdown
Owner

Problem

Enumerator::enumerate() returns Ok with an empty or partial device list when a receiver probe fails or times out — the inventory watcher only guards against Err from OS-level enumeration. A transient HID++ hiccup (busy receiver, timed-out feature walk, failed open) therefore masquerades as "checked: no devices", and the agent wipes live bindings/DPI state for devices that never went anywhere.

Worse, a sick channel was only ever evicted when its OS node vanished. read_report parks forever on HidError::Disconnected expecting the watcher to evict the channel (see the comment in transport.rs), but for a node that stays visible while its channel is wedged, that eviction never came — the probe failed every tick from then on. This is the agent-side half of the 0.6.6 device-loss reports (#218).

Fix

New NodeLedger (per-HID-node probe-health ledger) in openlogi-hid:

  • Each probe now reports a health verdict alongside its inventory: a Bolt probe is healthy when the pairing enumeration is complete (paired.len() == pairing_count); a direct probe is healthy when its capabilities walk succeeded.
  • On an unhealthy or timed-out probe (and on a failed open), the ledger replays the node's last good inventory for up to NODE_MISS_GRACE = 3 consecutive misses, so a transient failure no longer erases the node's devices. Recovery resets the count; after the grace expires, whatever the live (possibly partial) probe returned is surfaced.
  • After CHANNEL_EVICT_AFTER = 2 consecutive misses, the node's HID++ channel is dropped and reopened on the next tick — completing the recovery loop transport.rs documents ("until the inventory watcher evicts the channel"). A wedged channel can now actually heal.
  • Vanished nodes are pruned from the ledger so it can't grow unboundedly.

Pure device-classification helpers moved out of inventory.rs into a new mappings.rs (file-size limit); no behavior change there.

Notes

  • The ledger is generic over the node key (async_hid::DeviceId is #[non_exhaustive] and cfg-gated per OS), which is also what makes it unit-testable.
  • Agent-internal only — no IPC wire-format change, no protocol bump.

Fixes #218

Tests

  • 6 new ledger tests: replay-within-grace, grace expiry, recovery reset, eviction threshold, healthy-None clearing, vanished-node pruning.
  • cargo fmt / clippy --workspace --all-targets -D warnings / full test suite green.

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR introduces a NodeLedger to fix transient HID++ probe failures masquerading as "no devices" and a channel-wedge scenario that blocked recovery without an agent restart (#218).

  • node_ledger.rs: New NodeLedger<K> replays a node's last-good inventory for up to NODE_MISS_GRACE=3 consecutive failed probes and signals channel eviction after CHANNEL_EVICT_AFTER=2 failures, completing the recovery loop documented in transport.rs.
  • inventory.rs: Probe functions are refactored from returning Result<(Option<DeviceInventory>, Vec<CacheOutcome>), InventoryError> to a NodeProbe struct carrying a healthy flag; open_failures are tracked separately so channel-open errors also benefit from replay; the enumerator acts on settled.evict_channel to drop and reopen wedged channels.
  • mappings.rs: Pure extraction of device-kind and battery mapping helpers — no behaviour change.

Confidence Score: 5/5

Safe to merge — the ledger logic is well-reasoned, all edge cases (first probe, grace expiry, saturated failure counter, vanished nodes, open-failure replay) behave correctly, and 6 targeted unit tests cover the decision table.

The NodeLedger state machine is straightforward and the code handles every path traced: nodes that never had a healthy probe correctly surface the live result immediately (no stale replay), grace expiry cleanly removes last_good so a recovered node re-arms from a fresh baseline, saturating_add prevents counter overflow, and retain_nodes is called before probing so vanished nodes are pruned before any settle call can touch stale state. The open_failures path correctly ignores evict_channel since there is no cached channel to drop. No correctness issues were found beyond what is already documented in the previous review thread.

No files require special attention.

Important Files Changed

Filename Overview
crates/openlogi-hid/src/node_ledger.rs New NodeLedger that tracks per-node probe health, replays last-good inventory within NODE_MISS_GRACE=3 consecutive failures, and signals channel eviction after CHANNEL_EVICT_AFTER=2 failures; well-structured with 6 unit tests covering all branches.
crates/openlogi-hid/src/inventory.rs Integrates NodeLedger into Enumerator::enumerate(): probe functions now return NodeProbe (healthy flag + inventory + outcomes); open_failures tracked separately for replay; channel eviction wired to settled.evict_channel; logic is sound.
crates/openlogi-hid/src/mappings.rs Pure extraction of device-kind and battery mapping helpers from inventory.rs into a new module; no behavior changes, tests moved alongside.
crates/openlogi-hid/src/lib.rs Adds mod declarations for the two new modules (mappings and node_ledger); trivial change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[OS enumerates HID candidates] --> B[Build seen_nodes]
    B --> C{Channel cached?}
    C -- Yes --> D[Add to active]
    C -- No --> E[open_hidpp_channel]
    E -- Ok Some --> F[Insert channel + add to active]
    E -- Ok None --> G[Ignore - not HID++]
    E -- Err --> H[open_failures list]

    B --> I[retain: drop vanished channels & ledger entries]

    D --> J[Concurrent probe with PROBE_BUDGET timeout]
    F --> J
    J -- timeout --> K[NodeProbe::failed\nhealthy=false, inventory=None]
    J -- Ok --> L[NodeProbe\nhealthy=true/false, inventory=Some/None]

    K --> M[ledger.settle]
    L --> M
    H --> N[ledger.settle\nhealthy=false, live=None]

    M --> O{healthy?}
    O -- Yes --> P[Update last_good\nreset failures]
    O -- No --> Q{failures <= NODE_MISS_GRACE=3?}
    Q -- Yes --> R[Replay last_good]
    Q -- No --> S[Surface live result\nclear last_good]

    M --> T{failures >= CHANNEL_EVICT_AFTER=2?}
    T -- Yes --> U[evict_channel=true\nremove from channels map]
    T -- No --> V[Keep channel]

    P --> W[Output inventory]
    R --> W
    S --> W
    N --> X[Output replayed/live inventory]
Loading

Reviews (3): Last reviewed commit: "fix(hid): replay a node's last inventory..." | Re-trigger Greptile

Comment thread crates/openlogi-hid/src/inventory.rs
@buliwyf42

Copy link
Copy Markdown
Contributor

This is a cleaner take on the same #218 root cause than my #221 (now closed in favor of this) — the per-node ledger handles the multi-receiver "one bad receiver suppresses the others" case and the cache-eviction starvation that a whole-cycle guard doesn't.

One small gap the ledger doesn't reach: the one-shot enumerate() free fn (CLI list / diag) builds a fresh Enumerator per call, so its ledger has no prior snapshot to replay — a transient hiccup there still surfaces as an empty/partial list. A small bounded retry on that path (re-running the enumerate a couple of times on an unhealthy/empty result) would cover it. Happy to send that as a tiny follow-up on top of this if useful.

…ures

A HID node the OS still enumerates can stop answering HID++ — receiver
register reads time out, or the transport read loop parks on a
Disconnected handle. Such a tick used to yield an Ok(empty/partial)
snapshot that is indistinguishable from "checked, no devices", so the
GUI flapped between the device list and the empty state (#218), and a
parked channel — only ever evicted when its node vanishes — wedged
enumeration until the agent was restarted.

Per-node ledger in the Enumerator:
- probe_one now reports whether the node actually answered (receiver
  pairing-count register + complete slot walk; direct probes require a
  completed feature walk before their verdict counts)
- a failed node replays its last completed inventory for a bounded
  grace (NODE_MISS_GRACE = 3 ticks) before surfacing the live result
- from CHANNEL_EVICT_AFTER = 2 consecutive failures the node's cached
  channel is dropped and reopened fresh, completing the recovery path
  the transport read-park contract expects

Also splits the pure HID++→core mappings out of inventory.rs
(mappings.rs) to keep it under the size limit.
@AprilNEA AprilNEA force-pushed the fix/receiver-probe-resilience branch from 39e8a43 to 0fc51f9 Compare June 13, 2026 12:55
@AprilNEA AprilNEA merged commit 76d5870 into master Jun 13, 2026
8 checks passed
@AprilNEA AprilNEA deleted the fix/receiver-probe-resilience branch June 13, 2026 12:59
@AprilNEA

Copy link
Copy Markdown
Owner Author

Yes please — send the follow-up! You've read the gap exactly right: a fresh Enumerator has no snapshot to replay, and your 3-then-0 isolated-runs observation in #218 is precisely that case. Two notes for it:

And thanks for the gracious close on #221 — the diagnosis there matched what landed here almost line for line, and the receiver_read_complete test matrix was genuinely good.

One more ask, honestly the most valuable one: you have the only known repro rig (macOS 27 beta, MX Master 4 on Bolt). Now that this is merged, would you run a master build against it for a minute? Expected: the device list stays populated (a bad cycle logs a replay instead of flapping to "No devices"), and powering a device off still drops it within a cycle or two. If the flap survives on your stack, just reopen #218.

@aprilnea aprilnea Bot mentioned this pull request Jun 13, 2026
buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request Jun 13, 2026
…ss (AprilNEA#218)

Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's
last good inventory across watcher ticks, but the one-shot `enumerate()`
free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is
empty — so a transient probe miss has nothing to replay and still surfaces
as an empty/partial list. That's exactly the "two isolated runs read 3
devices and 0" case from AprilNEA#218.

- openlogi-hid: split the enumerate pass into `enumerate_reporting_health`,
  which also reports whether every probed node answered cleanly this pass
  (a probe timeout, open failure, or a receiver read short of its pairing
  count is unhealthy). The polling watcher ignores the flag (the ledger
  handles replay); the one-shot `enumerate()` retries on it, reusing the
  same enumerator so its ledger accumulates a snapshot to replay and the
  channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead
  probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap.

- openlogi-agent-core: factor the watcher tick->event decision into
  `WatchState::classify` (no behaviour change — `Err` already kept the last
  snapshot) and unit-test it: an enumerate failure after a success keeps the
  last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and
  persistent initial failure reports `Unavailable` once then recovers.
  (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@buliwyf42

Copy link
Copy Markdown
Contributor

Following up here, since the live re-test on #218 lands squarely on this PR's code.

On the repro rig (macOS 27 beta, MX Master 4 on Bolt) the ledger absorbs short glitches exactly as designed — but it surfaced a sustained failure: a paired online device's deep probe_features hangs every cycle, burning the full 5 s PROBE_BUDGET. Since probe_one is the unit wrapped in timeout(PROBE_BUDGET), firing it discards the whole node — including the pairing-register reads that succeeded — so the receiver yields nothing → count=0 "No devices" (confirmed in the GUI; only a physical receiver replug cleared it; a standalone openlogi list hangs identically, so it's device/OS-level, not capture contention).

The telling part: the Unifying path already guards this with UNIFYING_SLOT_PROBE (3.5 s per-slot cap → partial instead of starving the budget). The Bolt slot probe has no equivalent — probe_or_reuse is awaited bare in probe_bolt_slot, so one hung Bolt device blows the whole-receiver budget.

I've opened #251 to mirror that guard onto Bolt (a BOLT_SLOT_PROBE cap → the hung device falls back to cached/identity data while the rest of the receiver still enumerates). Writeup + logs on #218.

buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request Jun 13, 2026
… the whole receiver (AprilNEA#218)

On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.

The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.

Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).

Refs AprilNEA#218, AprilNEA#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
buliwyf42 added a commit to buliwyf42/OpenLogi that referenced this pull request Jun 13, 2026
… the whole receiver (AprilNEA#218)

On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.

The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.

Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).

Refs AprilNEA#218, AprilNEA#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AprilNEA pushed a commit that referenced this pull request Jun 15, 2026
…ss (#218) (#237)

Follow-up to #222. The per-node `NodeLedger` from #222 replays a node's
last good inventory across watcher ticks, but the one-shot `enumerate()`
free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is
empty — so a transient probe miss has nothing to replay and still surfaces
as an empty/partial list. That's exactly the "two isolated runs read 3
devices and 0" case from #218.

- openlogi-hid: split the enumerate pass into `enumerate_reporting_health`,
  which also reports whether every probed node answered cleanly this pass
  (a probe timeout, open failure, or a receiver read short of its pairing
  count is unhealthy). The polling watcher ignores the flag (the ledger
  handles replay); the one-shot `enumerate()` retries on it, reusing the
  same enumerator so its ledger accumulates a snapshot to replay and the
  channel stays warm. #226's 5 s `HidppChannel::send` timeout keeps a dead
  probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap.

- openlogi-agent-core: factor the watcher tick->event decision into
  `WatchState::classify` (no behaviour change — `Err` already kept the last
  snapshot) and unit-test it: an enumerate failure after a success keeps the
  last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and
  persistent initial failure reports `Unavailable` once then recovers.
  (Invited in the #222 review; the watcher was untouched by #222.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Device list flaps (appears then vanishes ~2s) + device images never load — Bolt receiver, macOS 27 beta

2 participants