Skip to content

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251

Open
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout
Open

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout

Conversation

@buliwyf42

Copy link
Copy Markdown
Contributor

Context

Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (NodeLedger) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices" — details in #218. This PR fixes that remaining case.

Root cause

A single paired online device's deep feature walk (probe_features / Device::new) hangs on every cycle, burning the whole 5 s PROBE_BUDGET. Each failing cycle still reads pairing_count=Some(3) and all three slot codenames — only the deep walk hangs — but because probe_one is the unit wrapped in timeout(PROBE_BUDGET), when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUI inventory refreshed count=0.

The tell: the Unifying slot probe already guards exactly this with UNIFYING_SLOT_PROBE (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — probe_or_reuse was awaited bare in probe_bolt_slot, so one hung Bolt device blew the whole-receiver budget.

Fix

Mirror the Unifying guard onto the Bolt path: a new BOLT_SLOT_PROBE (1.5 s) wraps the Bolt slot's probe_or_reuse; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the whole probe_one time out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit PROBE_BUDGET after the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).

Verification

cargo fmt --all -- --check                                   # clean
cargo clippy -p openlogi-hid --all-targets -- -D warnings     # clean
cargo test  -p openlogi-hid                                   # 61 pass

I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.

Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs Bolt slot probe timed out; using cached data if available and the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".

Refs #218, #222.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes the sustained "No devices" regression (#218) on a Bolt receiver by adding a per-slot timeout (BOLT_SLOT_PROBE = 1 s) around probe_or_reuse in probe_bolt_slot, mirroring the existing UNIFYING_SLOT_PROBE guard on the Unifying path. A hung device's feature walk now times out gracefully and falls back to its cached/identity-only data instead of consuming the entire PROBE_BUDGET and causing probe_one to discard all paired devices.

  • New constant BOLT_SLOT_PROBE (1 s): Wraps probe_or_reuse with tokio::time::timeout; on expiry the slot returns cached probe data (or ProbedFeatures::default()) and a Seen/Unkeyed cache outcome, so the hung device stays in the inventory with its last-known capabilities.
  • Budget math holds for the 3-device repro rig: ARRIVAL_DRAIN (1.5 s) + 3×1 s = 4.5 s < PROBE_BUDGET (5 s); the ledger replay from fix(hid): replay a node's last inventory through transient probe failures #222 remains the backstop for the edge case of 4+ simultaneously-hung online slots (possible since MAX_BOLT_SLOTS = 6)."

Confidence Score: 5/5

The change is a focused, additive guard that strictly narrows the blast radius of a hung Bolt slot walk; it cannot regress the happy path.

The fix is a direct mirror of the already-proven Unifying per-slot guard. The fallback path (cached data → Seen outcome → device stays in inventory) is consistent with how probe_or_reuse itself handles a failed re-probe, and the budget arithmetic (1.5 + 3×1 = 4.5 s) fits within PROBE_BUDGET for the described repro rig. The only open edge (4+ simultaneously-hung online slots) was already handled by the NodeLedger before this PR, so this change is strictly an improvement with no new failure modes.

No files require special attention; the single changed file is well-scoped and the new constant is clearly documented.

Important Files Changed

Filename Overview
crates/openlogi-hid/src/inventory.rs Adds BOLT_SLOT_PROBE (1 s) timeout around probe_or_reuse in probe_bolt_slot, mirroring the existing UNIFYING_SLOT_PROBE guard; on timeout, falls back to cached/identity-only data via seen(id), preventing a single hung Bolt device from consuming PROBE_BUDGET and dropping all devices.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_bolt_receiver] -->|for slot in 1..=MAX_BOLT_SLOTS| B[probe_bolt_slot]
    B --> C[get_device_pairing_information]
    C -->|Err| D[return None – slot skipped]
    C -->|Ok| E[read_codename]
    E --> F["timeout(BOLT_SLOT_PROBE=1s, probe_or_reuse(...)"]
    F -->|Ok – probe succeeded| G[use fresh/cached probe\nCacheOutcome::Fresh/Update/Seen]
    F -->|Err – timed out| H[log: Bolt slot probe timed out\nfallback: cached probe or default\nCacheOutcome::Seen/Unkeyed]
    G --> I[build PairedDevice]
    H --> I
    I --> J[return Some – slot counted]
    J --> K{all slots done?}
    K -->|no| B
    K -->|yes| L{paired.len == pairing_count?}
    L -->|yes: complete=true| M[NodeProbe – ledger accepts fresh data]
    L -->|no: complete=false| N[NodeProbe – ledger replays last good snapshot]
Loading

Reviews (2): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..." | Re-trigger Greptile

Comment thread crates/openlogi-hid/src/inventory.rs Outdated
@buliwyf42 buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 13b3fc8 to 1d42ddf Compare June 13, 2026 16:58
… the whole receiver (AprilNEA#218)

On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the
device list still dropped to "No devices" under a *sustained* failure that
AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's
deep feature walk (`probe_features` / `Device::new`) hangs every cycle,
burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped
in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the
pairing-register reads that succeeded — so the receiver yields nothing.
Confirmed device/OS-level (a standalone `openlogi list` hangs identically);
only a physical receiver replug cleared it.

The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE`
(a per-slot cap, so a slow walk returns partial instead of starving the
budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited
bare, so one hung device blew the whole-receiver budget.

Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt
slot's `probe_or_reuse`; on timeout the slot falls back to its cached /
identity data (codename + kind + online come from the pairing register, which
reads fine every cycle) instead of the whole `probe_one` timing out. A hung
device keeps showing with its last-known capabilities while the rest of the
receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and
small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after
the arrival drain; the rare all-slots-hung case degrades to the existing
per-node ledger replay (AprilNEA#222).

Refs AprilNEA#218, AprilNEA#222.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@buliwyf42 buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 1d42ddf to 34314ba Compare June 13, 2026 16:59
@buliwyf42

Copy link
Copy Markdown
Contributor Author

Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result.

The per-slot cap works: when a slot's deep walk hung, Bolt slot probe timed out … slot=N fired and that device kept its cached card instead of the whole receiver vanishing — confirmed live for 1–2 hung slots.

But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once, 1.5 s arrival drain + 3 × 1.5 s = 6 s overran the 5 s PROBE_BUDGET, so the outer timeout(PROBE_BUDGET, probe_one) still fired and dropped the whole node. Tightened to BOLT_SLOT_PROBE = 1 s so all three online slots fit (1.5 + 3×1 = 4.5 s); pushed. (The all-three-at-1s case is budget-derived — the wedge is non-deterministic, so I couldn't re-trigger it on demand without more churn.)

One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge — during it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track.

@buliwyf42

Copy link
Copy Markdown
Contributor Author

Thanks — already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head (34314ba) tightens BOLT_SLOT_PROBE to Duration::from_secs(1), so 1.5 s drain + 3×1 s = 4.5 s fits PROBE_BUDGET — your review picked up the earlier 1.5 s push. (Empty slots fail fast on UnknownDevice without a deep probe, so the worst realistic case is the 3 paired devices, not 6 slots.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant