Skip to content

fix(hid): keep last-known-good devices when a receiver read is incomplete (#218)#221

Closed
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/218-inventory-flap-transient-reads
Closed

fix(hid): keep last-known-good devices when a receiver read is incomplete (#218)#221
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/218-inventory-flap-transient-reads

Conversation

@buliwyf42

Copy link
Copy Markdown
Contributor

Summary

Fixes the device-list flapping half of #218: on a Logi Bolt receiver the GUI oscillated between showing all paired devices and "No devices connected" roughly every 2 s, so they never stayed long enough to configure. Reported on 0.6.6, macOS 27 beta, Apple Silicon, with an MX Master 4 + MX Master 3S + MX Mechanical Mini on one receiver.

This PR is scoped to the flap only. The device-images half of #218 is already addressed by #142 (Settings → Assets runtime sync, no CLI needed), #137 (match every modelId), and #140 (retry sync with backoff), so I've left it to those.

Root cause

The agent's inventory resilience rests on one rule, stated in watchers/inventory.rs:

enumerate() returning Err = "couldn't check" → keep the last good snapshot.
enumerate() returning Ok = authoritative → forward it (an empty Ok is a genuine disconnect).

A transient HID++ register read broke that rule. count_pairings() and the per-slot get_device_pairing_information() are independent HID++ 1.0 register reads; on the macOS 27-beta IOHID stack (with the brand-new MX Master 4 adding notification traffic) they time out intermittently. A failed slot read was silently skipped, so probe_one returned Ok with a truncated paired list — possibly empty — and a whole-receiver probe timeout dropped the receiver entirely. enumerate() already detected the shortfall (paired.len() != pairing_count) but only logged a warning.

That truncated Ok reached Orchestrator::refresh_inventory, which unconditionally overwrites InventoryState::Ready; the IPC inventory() poll served the short/empty list and the GUI rendered "No devices connected" until the next good cycle ~2 s later. The GUI's own INVENTORY_MISS_GRACE = 2 rides out 1–2 transient empties, but a run of ≥3 (plausible on this stack) drops the devices anyway — and the agent-side snapshot flapped regardless.

I also checked the duplicate-agent spawn race from #207's review (the open -n path in ipc_client.rs). It's bounded — just_disconnected skips the spawn gate on the drop tick, SPAWN_RETRY_PERIOD = 30 s rate-limits respawns, and the agent's single_instance::acquire("agent.lock") makes any duplicate exit — so it can't produce a 2 s flap, and the report's own evidence (one agent process; isolated cold agent runs flap) rules it out. This PR doesn't touch ipc_client.rs, so it won't conflict with #207.

The fix

Make enumerate() honest about partial reads so the existing keep-last-snapshot machinery engages, instead of papering over it in every consumer.

  • openlogi-hid/inventory.rs
    • New InventoryError::Incomplete.
    • receiver_read_complete(pairing_count, paired_len): a Bolt read is complete only when count_pairings() succeeded and at least that many slots read. An unreadable count or a short read is incomplete; reading more than the count is still complete.
    • probe_one reports that completeness; a probe timeout counts as incomplete too.
    • On an incomplete cycle enumerate() skips cache eviction (a device merely unread this cycle keeps its entry) and returns Incomplete rather than publishing a truncated list.
    • The one-shot enumerate() free fn (CLI / diag) has no last-known-good to fall back on, so it retries an Incomplete cycle a few times before giving up — the report's two isolated runs read 3 devices and 0; a retry would have read 3 both times.
  • openlogi-agent-core/watchers/inventory.rs — factor the tick→event decision into WatchState::classify (no behaviour change; the Err arm already kept the last snapshot) so it's unit-testable.

A genuine unpair/disconnect still reads cleanly as an Ok carrying the receiver's real, smaller count, so the list still shrinks when devices actually leave.

Tests

New unit tests (host-only, no hardware):

  • receiver_read_complete: Some(3)/3 and Some(0)/0 complete; Some(3)/2, Some(3)/0, None/* incomplete.
  • WatchState::classify: an incomplete read after a success keeps the last snapshot; a clean empty Ok is still forwarded as a disconnect; persistent initial failure reports Unavailable once, then recovers.
cargo fmt --all -- --check                                                                          # clean
cargo clippy -p openlogi-hid -p openlogi-agent-core -p openlogi-cli --all-targets -- -D warnings    # clean
cargo test  -p openlogi-hid -p openlogi-agent-core                                                   # 46 + 21 pass

I couldn't build openlogi-gui locally (it compiles Metal shaders against a full Xcode toolchain I don't have), but the GUI is untouched, InventoryError has no exhaustive match anywhere, and the new variant is additive — so the GUI build is unaffected. CI covers it.

Manual hardware verification still needed

I don't have the Bolt-receiver hardware, so the live repro needs a hand (ideally @buliwyf42, who filed it):

  1. Build the agent from this branch and run it against the 3-device Bolt setup on macOS 27 beta.
  2. Watch the Devices view / agent logs for ~60 s. Expected: the list stays populated; on a bad cycle the agent logs receiver read incomplete this cycle … keeping last snapshot and emits no snapshot, instead of flapping to "No devices".
  3. Power one device off → it should drop from the list within a cycle or two (a clean Ok with the smaller count), confirming genuine disconnects still propagate.

Closes the flapping half of #218.

@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes the device-list flapping bug (#218) on Logi Bolt receivers by introducing InventoryError::Incomplete to distinguish transient partial reads — where the receiver answers but some paired-device slots time out — from genuine disconnects, which still produce a clean Ok with a smaller count.

  • receiver_read_complete() in openlogi-hid/src/inventory.rs encodes the completeness rule (pairing_count must be known and paired_len must meet it); probe_one propagates a complete flag up to Enumerator::enumerate(), which skips cache eviction and returns Err(Incomplete) instead of a truncated list when any receiver read is partial.
  • WatchState::classify() in openlogi-agent-core/src/watchers/inventory.rs factors the tick→event decision out of the poll loop, preserving the rule that Err(…) suppresses the tick (keeping the last good snapshot) while Ok(…) — even empty — is always forwarded as authoritative.
  • The one-shot enumerate() free function (CLI / diagnostics) retries up to ONESHOT_ATTEMPTS times on Incomplete before giving up, since it has no last-known-good state to fall back on.

Confidence Score: 5/5

Safe to merge — the change is narrowly scoped to inventory polling, adds no new externally-visible states that existing callers need to handle exhaustively, and all affected code paths are covered by new unit tests.

The completeness check (receiver_read_complete) is simple, deterministic, and well-tested; the Enumerator correctly updates the probe cache for successfully-read devices even on incomplete cycles while skipping eviction; and WatchState::classify preserves the pre-existing Err→suppress / Ok→forward contract while making it testable. The one-shot retry loop in the free enumerate() function is bounded and reuses the warmed channel. No regressions are apparent for the direct-device or multi-receiver paths.

No files require special attention.

Important Files Changed

Filename Overview
crates/openlogi-hid/src/inventory.rs Adds InventoryError::Incomplete, receiver_read_complete(), and the complete flag in probe_one; Enumerator::enumerate() returns Err(Incomplete) instead of a truncated list when any receiver read is partial, and skips cache eviction so unseen (but still-paired) devices retain their cache entries through the next good cycle.
crates/openlogi-agent-core/src/watchers/inventory.rs Factors cross-tick watcher state into WatchState::classify() for testability; existing Err→keep-snapshot / Ok→forward semantics are preserved and now covered by three new unit tests (empty snapshot, incomplete suppression, Unavailable once then recovery).

Sequence Diagram

sequenceDiagram
    participant W as Watcher Thread
    participant E as Enumerator
    participant R as Bolt Receiver (HID++)
    participant A as Agent Orchestrator

    loop Every ~2 s
        W->>E: enumerate()
        E->>R: count_pairings()

        alt count_pairings succeeds → N
            E->>R: probe_bolt_slots 1..6
            alt "paired.len() >= N (complete)"
                E-->>W: Ok(full inventory)
                W->>W: classify → Some(Snapshot)
                W->>A: Snapshot(inventory)
                Note over A: Overwrites inventory state ✓
            else "paired.len() < N (slot timeout)"
                Note over E: cache updated for read slots, evict_unseen skipped
                E-->>W: Err(Incomplete)
                W->>W: "classify → None (succeeded=true)"
                Note over A: Keeps last-known-good snapshot ✓
            end
        else count_pairings fails → None
            Note over E: receiver_read_complete(None, _) = false
            E-->>W: Err(Incomplete)
            W->>W: "classify → None (succeeded=true)"
            Note over A: Keeps last-known-good snapshot ✓
        end
    end

    Note over W,A: Genuine disconnect → clean Ok(smaller count) → forwarded as Snapshot
Loading

Reviews (2): Last reviewed commit: "fix(hid): keep last-known-good devices w..." | Re-trigger Greptile

Comment thread crates/openlogi-hid/src/inventory.rs
Comment thread crates/openlogi-hid/src/inventory.rs
@buliwyf42 buliwyf42 marked this pull request as draft June 12, 2026 06:54
@buliwyf42 buliwyf42 marked this pull request as ready for review June 12, 2026 06:55
…lete (AprilNEA#218)

The device list flapped between all paired devices and "No devices
connected" roughly every 2s on a Logi Bolt receiver (macOS 27 beta,
MX Master 4 + MX Master 3S + MX Mechanical Mini).

Root cause: a transient HID++ register read — count_pairings() or a
per-slot get_device_pairing_information() timing out, or a whole device
probe exceeding PROBE_BUDGET — left enumerate() returning Ok with a
*truncated* device list. The agent's resilience is built on the rule
"Err = couldn't check, keep the last snapshot; Ok = authoritative",
but a partial read was mis-classified as an authoritative Ok. So
Orchestrator::refresh_inventory overwrote the last-known-good set with
the short/empty list, the IPC inventory() poll served it, and the GUI
rendered "No devices connected" — then a good cycle restored all three.

Fix: surface a partial read as InventoryError::Incomplete so the
existing keep-last-snapshot machinery engages instead of publishing it.

- inventory.rs: a Bolt receiver read is "complete" only when
  count_pairings() succeeds and at least that many slots read
  (receiver_read_complete); a probe timeout counts as incomplete too.
  On an incomplete cycle enumerate() skips cache eviction and returns
  Incomplete rather than a truncated list.
- The one-shot enumerate() free fn (CLI / diagnostics) has no
  last-known-good to fall back on, so it retries an Incomplete cycle a
  few times before giving up — the two isolated runs in the report read
  3 devices and 0; a retry would have read 3 both times.
- watchers/inventory.rs: factor the tick->event decision into
  WatchState::classify (no behaviour change — Err already kept the last
  snapshot) and unit-test that an incomplete read keeps the last
  snapshot while a clean empty Ok is still forwarded as a disconnect.

A genuine unpair/disconnect still reads cleanly as an Ok carrying the
receiver's real, smaller count and propagates exactly as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@buliwyf42 buliwyf42 force-pushed the fix/218-inventory-flap-transient-reads branch from 0bf7384 to 560efef Compare June 12, 2026 06:56
@buliwyf42

Copy link
Copy Markdown
Contributor Author

Closing in favor of #222 — it fixes the same transient-probe-failure root cause behind #218, and more thoroughly than this PR.

We independently landed on the same core signal (a Bolt read is healthy only when paired.len() == pairing_count), but #222's per-node NodeLedger is the better design: it replays each node's last good inventory for a grace window instead of dropping the whole enumerate cycle, so a partial read on one receiver no longer suppresses a cleanly-read second one, and cache eviction keeps running. It also evicts a wedged channel, which this PR didn't touch.

The only bit here that #222 doesn't cover is the one-shot enumerate() free fn used by the CLI list/diag commands — I left a note about that on #222. Thanks for the fast turnaround!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant