fix(hid): replay a node's last inventory through transient probe failures by AprilNEA · Pull Request #222 · AprilNEA/OpenLogi

AprilNEA · 2026-06-12T07:10:04Z

Problem

Enumerator::enumerate() returns Ok with an empty or partial device list when a receiver probe fails or times out — the inventory watcher only guards against Err from OS-level enumeration. A transient HID++ hiccup (busy receiver, timed-out feature walk, failed open) therefore masquerades as "checked: no devices", and the agent wipes live bindings/DPI state for devices that never went anywhere.

Worse, a sick channel was only ever evicted when its OS node vanished. read_report parks forever on HidError::Disconnected expecting the watcher to evict the channel (see the comment in transport.rs), but for a node that stays visible while its channel is wedged, that eviction never came — the probe failed every tick from then on. This is the agent-side half of the 0.6.6 device-loss reports (#218).

Fix

New NodeLedger (per-HID-node probe-health ledger) in openlogi-hid:

Each probe now reports a health verdict alongside its inventory: a Bolt probe is healthy when the pairing enumeration is complete (paired.len() == pairing_count); a direct probe is healthy when its capabilities walk succeeded.
On an unhealthy or timed-out probe (and on a failed open), the ledger replays the node's last good inventory for up to NODE_MISS_GRACE = 3 consecutive misses, so a transient failure no longer erases the node's devices. Recovery resets the count; after the grace expires, whatever the live (possibly partial) probe returned is surfaced.
After CHANNEL_EVICT_AFTER = 2 consecutive misses, the node's HID++ channel is dropped and reopened on the next tick — completing the recovery loop transport.rs documents ("until the inventory watcher evicts the channel"). A wedged channel can now actually heal.
Vanished nodes are pruned from the ledger so it can't grow unboundedly.

Pure device-classification helpers moved out of inventory.rs into a new mappings.rs (file-size limit); no behavior change there.

Notes

The ledger is generic over the node key (async_hid::DeviceId is #[non_exhaustive] and cfg-gated per OS), which is also what makes it unit-testable.
Agent-internal only — no IPC wire-format change, no protocol bump.

Fixes #218

Tests

6 new ledger tests: replay-within-grace, grace expiry, recovery reset, eviction threshold, healthy-None clearing, vanished-node pruning.
cargo fmt / clippy --workspace --all-targets -D warnings / full test suite green.

greptile-apps · 2026-06-12T07:18:35Z

Greptile Summary

This PR introduces a NodeLedger to fix transient HID++ probe failures masquerading as "no devices" and a channel-wedge scenario that blocked recovery without an agent restart (#218).

node_ledger.rs: New NodeLedger<K> replays a node's last-good inventory for up to NODE_MISS_GRACE=3 consecutive failed probes and signals channel eviction after CHANNEL_EVICT_AFTER=2 failures, completing the recovery loop documented in transport.rs.
inventory.rs: Probe functions are refactored from returning Result<(Option<DeviceInventory>, Vec<CacheOutcome>), InventoryError> to a NodeProbe struct carrying a healthy flag; open_failures are tracked separately so channel-open errors also benefit from replay; the enumerator acts on settled.evict_channel to drop and reopen wedged channels.
mappings.rs: Pure extraction of device-kind and battery mapping helpers — no behaviour change.

Confidence Score: 5/5

Safe to merge — the ledger logic is well-reasoned, all edge cases (first probe, grace expiry, saturated failure counter, vanished nodes, open-failure replay) behave correctly, and 6 targeted unit tests cover the decision table.

The NodeLedger state machine is straightforward and the code handles every path traced: nodes that never had a healthy probe correctly surface the live result immediately (no stale replay), grace expiry cleanly removes last_good so a recovered node re-arms from a fresh baseline, saturating_add prevents counter overflow, and retain_nodes is called before probing so vanished nodes are pruned before any settle call can touch stale state. The open_failures path correctly ignores evict_channel since there is no cached channel to drop. No correctness issues were found beyond what is already documented in the previous review thread.

No files require special attention.

Important Files Changed

Filename	Overview
crates/openlogi-hid/src/node_ledger.rs	New NodeLedger that tracks per-node probe health, replays last-good inventory within NODE_MISS_GRACE=3 consecutive failures, and signals channel eviction after CHANNEL_EVICT_AFTER=2 failures; well-structured with 6 unit tests covering all branches.
crates/openlogi-hid/src/inventory.rs	Integrates NodeLedger into Enumerator::enumerate(): probe functions now return NodeProbe (healthy flag + inventory + outcomes); open_failures tracked separately for replay; channel eviction wired to settled.evict_channel; logic is sound.
crates/openlogi-hid/src/mappings.rs	Pure extraction of device-kind and battery mapping helpers from inventory.rs into a new module; no behavior changes, tests moved alongside.
crates/openlogi-hid/src/lib.rs	Adds mod declarations for the two new modules (mappings and node_ledger); trivial change.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[OS enumerates HID candidates] --> B[Build seen_nodes]
    B --> C{Channel cached?}
    C -- Yes --> D[Add to active]
    C -- No --> E[open_hidpp_channel]
    E -- Ok Some --> F[Insert channel + add to active]
    E -- Ok None --> G[Ignore - not HID++]
    E -- Err --> H[open_failures list]

    B --> I[retain: drop vanished channels & ledger entries]

    D --> J[Concurrent probe with PROBE_BUDGET timeout]
    F --> J
    J -- timeout --> K[NodeProbe::failed\nhealthy=false, inventory=None]
    J -- Ok --> L[NodeProbe\nhealthy=true/false, inventory=Some/None]

    K --> M[ledger.settle]
    L --> M
    H --> N[ledger.settle\nhealthy=false, live=None]

    M --> O{healthy?}
    O -- Yes --> P[Update last_good\nreset failures]
    O -- No --> Q{failures <= NODE_MISS_GRACE=3?}
    Q -- Yes --> R[Replay last_good]
    Q -- No --> S[Surface live result\nclear last_good]

    M --> T{failures >= CHANNEL_EVICT_AFTER=2?}
    T -- Yes --> U[evict_channel=true\nremove from channels map]
    T -- No --> V[Keep channel]

    P --> W[Output inventory]
    R --> W
    S --> W
    N --> X[Output replayed/live inventory]

_{Reviews (3): Last reviewed commit: "fix(hid): replay a node's last inventory..." | Re-trigger Greptile}

buliwyf42 · 2026-06-12T13:06:36Z

This is a cleaner take on the same #218 root cause than my #221 (now closed in favor of this) — the per-node ledger handles the multi-receiver "one bad receiver suppresses the others" case and the cache-eviction starvation that a whole-cycle guard doesn't.

One small gap the ledger doesn't reach: the one-shot enumerate() free fn (CLI list / diag) builds a fresh Enumerator per call, so its ledger has no prior snapshot to replay — a transient hiccup there still surfaces as an empty/partial list. A small bounded retry on that path (re-running the enumerate a couple of times on an unhealthy/empty result) would cover it. Happy to send that as a tiny follow-up on top of this if useful.

…ures A HID node the OS still enumerates can stop answering HID++ — receiver register reads time out, or the transport read loop parks on a Disconnected handle. Such a tick used to yield an Ok(empty/partial) snapshot that is indistinguishable from "checked, no devices", so the GUI flapped between the device list and the empty state (#218), and a parked channel — only ever evicted when its node vanishes — wedged enumeration until the agent was restarted. Per-node ledger in the Enumerator: - probe_one now reports whether the node actually answered (receiver pairing-count register + complete slot walk; direct probes require a completed feature walk before their verdict counts) - a failed node replays its last completed inventory for a bounded grace (NODE_MISS_GRACE = 3 ticks) before surfacing the live result - from CHANNEL_EVICT_AFTER = 2 consecutive failures the node's cached channel is dropped and reopened fresh, completing the recovery path the transport read-park contract expects Also splits the pure HID++→core mappings out of inventory.rs (mappings.rs) to keep it under the size limit.

AprilNEA · 2026-06-13T13:00:06Z

Yes please — send the follow-up! You've read the gap exactly right: a fresh Enumerator has no snapshot to replay, and your 3-then-0 isolated-runs observation in #218 is precisely that case. Two notes for it:

Master just gained fix: add a request timeout inside HidppChannel::send (fix hang root at the right layer) #226 (a 5 s request timeout inside HidppChannel::send), so a dead probe now fails fast instead of burning the whole PROBE_BUDGET — a short bounded retry on the one-shot path should be cheap.
If you'd like to bring over the WatchState::classify refactor + watcher tests from fix(hid): keep last-known-good devices when a receiver read is incomplete (#218) #221 in the same PR, they're conflict-free (this PR didn't touch the watcher) and the testability win stands.

And thanks for the gracious close on #221 — the diagnosis there matched what landed here almost line for line, and the receiver_read_complete test matrix was genuinely good.

One more ask, honestly the most valuable one: you have the only known repro rig (macOS 27 beta, MX Master 4 on Bolt). Now that this is merged, would you run a master build against it for a minute? Expected: the device list stays populated (a bad cycle logs a replay instead of flapping to "No devices"), and powering a device off still drops it within a cycle or two. If the flap survives on your stack, just reopen #218.

…ss (AprilNEA#218) Follow-up to AprilNEA#222. The per-node `NodeLedger` from AprilNEA#222 replays a node's last good inventory across watcher ticks, but the one-shot `enumerate()` free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is empty — so a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two isolated runs read 3 devices and 0" case from AprilNEA#218. - openlogi-hid: split the enumerate pass into `enumerate_reporting_health`, which also reports whether every probed node answered cleanly this pass (a probe timeout, open failure, or a receiver read short of its pairing count is unhealthy). The polling watcher ignores the flag (the ledger handles replay); the one-shot `enumerate()` retries on it, reusing the same enumerator so its ledger accumulates a snapshot to replay and the channel stays warm. AprilNEA#226's 5 s `HidppChannel::send` timeout keeps a dead probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap. - openlogi-agent-core: factor the watcher tick->event decision into `WatchState::classify` (no behaviour change — `Err` already kept the last snapshot) and unit-test it: an enumerate failure after a success keeps the last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and persistent initial failure reports `Unavailable` once then recovers. (Invited in the AprilNEA#222 review; the watcher was untouched by AprilNEA#222.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

buliwyf42 · 2026-06-13T16:47:54Z

Following up here, since the live re-test on #218 lands squarely on this PR's code.

On the repro rig (macOS 27 beta, MX Master 4 on Bolt) the ledger absorbs short glitches exactly as designed — but it surfaced a sustained failure: a paired online device's deep probe_features hangs every cycle, burning the full 5 s PROBE_BUDGET. Since probe_one is the unit wrapped in timeout(PROBE_BUDGET), firing it discards the whole node — including the pairing-register reads that succeeded — so the receiver yields nothing → count=0 "No devices" (confirmed in the GUI; only a physical receiver replug cleared it; a standalone openlogi list hangs identically, so it's device/OS-level, not capture contention).

The telling part: the Unifying path already guards this with UNIFYING_SLOT_PROBE (3.5 s per-slot cap → partial instead of starving the budget). The Bolt slot probe has no equivalent — probe_or_reuse is awaited bare in probe_bolt_slot, so one hung Bolt device blows the whole-receiver budget.

I've opened #251 to mirror that guard onto Bolt (a BOLT_SLOT_PROBE cap → the hung device falls back to cached/identity data while the rest of the receiver still enumerates). Writeup + logs on #218.

… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ss (#218) (#237) Follow-up to #222. The per-node `NodeLedger` from #222 replays a node's last good inventory across watcher ticks, but the one-shot `enumerate()` free fn (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is empty — so a transient probe miss has nothing to replay and still surfaces as an empty/partial list. That's exactly the "two isolated runs read 3 devices and 0" case from #218. - openlogi-hid: split the enumerate pass into `enumerate_reporting_health`, which also reports whether every probed node answered cleanly this pass (a probe timeout, open failure, or a receiver read short of its pairing count is unhealthy). The polling watcher ignores the flag (the ledger handles replay); the one-shot `enumerate()` retries on it, reusing the same enumerator so its ledger accumulates a snapshot to replay and the channel stays warm. #226's 5 s `HidppChannel::send` timeout keeps a dead probe fast, so the bounded retry (4 attempts, 300 ms apart) is cheap. - openlogi-agent-core: factor the watcher tick->event decision into `WatchState::classify` (no behaviour change — `Err` already kept the last snapshot) and unit-test it: an enumerate failure after a success keeps the last snapshot, a clean empty `Ok` is still forwarded as a disconnect, and persistent initial failure reports `Unavailable` once then recovers. (Invited in the #222 review; the watcher was untouched by #222.)

greptile-apps Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread crates/openlogi-hid/src/inventory.rs

buliwyf42 mentioned this pull request Jun 12, 2026

fix(hid): keep last-known-good devices when a receiver read is incomplete (#218) #221

Closed

AprilNEA force-pushed the fix/receiver-probe-resilience branch from 39e8a43 to 0fc51f9 Compare June 13, 2026 12:55

AprilNEA merged commit 76d5870 into master Jun 13, 2026
8 checks passed

AprilNEA deleted the fix/receiver-probe-resilience branch June 13, 2026 12:59

aprilnea Bot mentioned this pull request Jun 13, 2026

chore: release v0.6.11 #236

Merged

This was referenced Jun 13, 2026

fix(hid): retry the one-shot enumerate() through a transient probe miss (#218) #237

Merged

[Bug]: Device list flaps (appears then vanishes ~2s) + device images never load — Bolt receiver, macOS 27 beta #218

Closed

buliwyf42 mentioned this pull request Jun 13, 2026

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218) #251

Open

This was referenced Jun 16, 2026

One-shot enumerate() retry is a no-op for Unifying partial arrival-drain misses #277

Open

Bolt partial-walk recovery does a cold re-probe: ledger replay grace and probe-cache grace expire on the same tick #278

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(hid): replay a node's last inventory through transient probe failures#222

fix(hid): replay a node's last inventory through transient probe failures#222
AprilNEA merged 1 commit into
masterfrom
fix/receiver-probe-resilience

AprilNEA commented Jun 12, 2026

Uh oh!

greptile-apps Bot commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

buliwyf42 commented Jun 12, 2026

Uh oh!

Uh oh!

AprilNEA commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

AprilNEA commented Jun 12, 2026

Problem

Fix

Notes

Tests

Uh oh!

greptile-apps Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

buliwyf42 commented Jun 12, 2026

Uh oh!

Uh oh!

AprilNEA commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Jun 12, 2026 •

edited

Loading