fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218) by buliwyf42 · Pull Request #251 · AprilNEA/OpenLogi

buliwyf42 · 2026-06-13T16:47:36Z

Context

Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (NodeLedger) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices" — details in #218. This PR fixes that remaining case.

Root cause

A single paired online device's deep feature walk (probe_features / Device::new) hangs on every cycle, burning the whole 5 s PROBE_BUDGET. Each failing cycle still reads pairing_count=Some(3) and all three slot codenames — only the deep walk hangs — but because probe_one is the unit wrapped in timeout(PROBE_BUDGET), when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUI inventory refreshed count=0.

The tell: the Unifying slot probe already guards exactly this with UNIFYING_SLOT_PROBE (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — probe_or_reuse was awaited bare in probe_bolt_slot, so one hung Bolt device blew the whole-receiver budget.

Fix

Mirror the Unifying guard onto the Bolt path: a new BOLT_SLOT_PROBE (1.5 s) wraps the Bolt slot's probe_or_reuse; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the whole probe_one time out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.

1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit PROBE_BUDGET after the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).

Verification

cargo fmt --all -- --check                                   # clean
cargo clippy -p openlogi-hid --all-targets -- -D warnings     # clean
cargo test  -p openlogi-hid                                   # 61 pass

I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.

Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs Bolt slot probe timed out; using cached data if available and the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".

Refs #218, #222.

greptile-apps · 2026-06-13T16:51:20Z

Greptile Summary

This PR fixes the sustained "No devices" regression (#218) on a Bolt receiver by adding a per-slot timeout (BOLT_SLOT_PROBE = 1 s) around probe_or_reuse in probe_bolt_slot, mirroring the existing UNIFYING_SLOT_PROBE guard on the Unifying path. A hung device's feature walk now times out gracefully and falls back to its cached/identity-only data instead of consuming the entire PROBE_BUDGET and causing probe_one to discard all paired devices.

New constant BOLT_SLOT_PROBE (1 s): Wraps probe_or_reuse with tokio::time::timeout; on expiry the slot returns cached probe data (or ProbedFeatures::default()) and a Seen/Unkeyed cache outcome, so the hung device stays in the inventory with its last-known capabilities.
Budget math holds for the 3-device repro rig: ARRIVAL_DRAIN (1.5 s) + 3×1 s = 4.5 s < PROBE_BUDGET (5 s); the ledger replay from fix(hid): replay a node's last inventory through transient probe failures #222 remains the backstop for the edge case of 4+ simultaneously-hung online slots (possible since MAX_BOLT_SLOTS = 6)."

Confidence Score: 5/5

The change is a focused, additive guard that strictly narrows the blast radius of a hung Bolt slot walk; it cannot regress the happy path.

The fix is a direct mirror of the already-proven Unifying per-slot guard. The fallback path (cached data → Seen outcome → device stays in inventory) is consistent with how probe_or_reuse itself handles a failed re-probe, and the budget arithmetic (1.5 + 3×1 = 4.5 s) fits within PROBE_BUDGET for the described repro rig. The only open edge (4+ simultaneously-hung online slots) was already handled by the NodeLedger before this PR, so this change is strictly an improvement with no new failure modes.

No files require special attention; the single changed file is well-scoped and the new constant is clearly documented.

Important Files Changed

Filename	Overview
crates/openlogi-hid/src/inventory.rs	Adds BOLT_SLOT_PROBE (1 s) timeout around probe_or_reuse in probe_bolt_slot, mirroring the existing UNIFYING_SLOT_PROBE guard; on timeout, falls back to cached/identity-only data via seen(id), preventing a single hung Bolt device from consuming PROBE_BUDGET and dropping all devices.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_bolt_receiver] -->|for slot in 1..=MAX_BOLT_SLOTS| B[probe_bolt_slot]
    B --> C[get_device_pairing_information]
    C -->|Err| D[return None – slot skipped]
    C -->|Ok| E[read_codename]
    E --> F["timeout(BOLT_SLOT_PROBE=1s, probe_or_reuse(...)"]
    F -->|Ok – probe succeeded| G[use fresh/cached probe\nCacheOutcome::Fresh/Update/Seen]
    F -->|Err – timed out| H[log: Bolt slot probe timed out\nfallback: cached probe or default\nCacheOutcome::Seen/Unkeyed]
    G --> I[build PairedDevice]
    H --> I
    I --> J[return Some – slot counted]
    J --> K{all slots done?}
    K -->|no| B
    K -->|yes| L{paired.len == pairing_count?}
    L -->|yes: complete=true| M[NodeProbe – ledger accepts fresh data]
    L -->|no: complete=false| N[NodeProbe – ledger replays last good snapshot]

_{Reviews (2): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..." | Re-trigger Greptile}

… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

buliwyf42 · 2026-06-13T17:01:09Z

Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result.

The per-slot cap works: when a slot's deep walk hung, Bolt slot probe timed out … slot=N fired and that device kept its cached card instead of the whole receiver vanishing — confirmed live for 1–2 hung slots.

But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once, 1.5 s arrival drain + 3 × 1.5 s = 6 s overran the 5 s PROBE_BUDGET, so the outer timeout(PROBE_BUDGET, probe_one) still fired and dropped the whole node. Tightened to BOLT_SLOT_PROBE = 1 s so all three online slots fit (1.5 + 3×1 = 4.5 s); pushed. (The all-three-at-1s case is budget-derived — the wedge is non-deterministic, so I couldn't re-trigger it on demand without more churn.)

One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge — during it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track.

buliwyf42 · 2026-06-14T06:38:54Z

Thanks — already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head (34314ba) tightens BOLT_SLOT_PROBE to Duration::from_secs(1), so 1.5 s drain + 3×1 s = 4.5 s fits PROBE_BUDGET — your review picked up the earlier 1.5 s push. (Empty slots fail fast on UnknownDevice without a deep probe, so the worst realistic case is the 3 paired devices, not 6 slots.)

buliwyf42 mentioned this pull request Jun 13, 2026

fix(hid): replay a node's last inventory through transient probe failures #222

Merged

greptile-apps Bot reviewed Jun 13, 2026

View reviewed changes

Comment thread crates/openlogi-hid/src/inventory.rs Outdated

buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 13b3fc8 to 1d42ddf Compare June 13, 2026 16:58

buliwyf42 force-pushed the fix/bolt-per-slot-probe-timeout branch from 1d42ddf to 34314ba Compare June 13, 2026 16:59

AprilNEA mentioned this pull request Jun 16, 2026

Bolt partial-walk recovery does a cold re-probe: ledger replay grace and probe-cache grace expire on the same tick #278

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251

fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251
buliwyf42 wants to merge 1 commit into
AprilNEA:masterfrom
buliwyf42:fix/bolt-per-slot-probe-timeout

buliwyf42 commented Jun 13, 2026

Uh oh!

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

buliwyf42 commented Jun 13, 2026

Context

Root cause

Fix

Verification

Uh oh!

greptile-apps Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

buliwyf42 commented Jun 13, 2026

Uh oh!

buliwyf42 commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 13, 2026 •

edited

Loading