fix(hid): bound the Bolt per-slot probe so one hung device can't drop the whole receiver (#218)#251
Conversation
Greptile SummaryThis PR fixes the sustained "No devices" regression (#218) on a Bolt receiver by adding a per-slot timeout (
Confidence Score: 5/5The change is a focused, additive guard that strictly narrows the blast radius of a hung Bolt slot walk; it cannot regress the happy path. The fix is a direct mirror of the already-proven Unifying per-slot guard. The fallback path (cached data → Seen outcome → device stays in inventory) is consistent with how probe_or_reuse itself handles a failed re-probe, and the budget arithmetic (1.5 + 3×1 = 4.5 s) fits within PROBE_BUDGET for the described repro rig. The only open edge (4+ simultaneously-hung online slots) was already handled by the NodeLedger before this PR, so this change is strictly an improvement with no new failure modes. No files require special attention; the single changed file is well-scoped and the new constant is clearly documented. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[probe_bolt_receiver] -->|for slot in 1..=MAX_BOLT_SLOTS| B[probe_bolt_slot]
B --> C[get_device_pairing_information]
C -->|Err| D[return None – slot skipped]
C -->|Ok| E[read_codename]
E --> F["timeout(BOLT_SLOT_PROBE=1s, probe_or_reuse(...)"]
F -->|Ok – probe succeeded| G[use fresh/cached probe\nCacheOutcome::Fresh/Update/Seen]
F -->|Err – timed out| H[log: Bolt slot probe timed out\nfallback: cached probe or default\nCacheOutcome::Seen/Unkeyed]
G --> I[build PairedDevice]
H --> I
I --> J[return Some – slot counted]
J --> K{all slots done?}
K -->|no| B
K -->|yes| L{paired.len == pairing_count?}
L -->|yes: complete=true| M[NodeProbe – ledger accepts fresh data]
L -->|no: complete=false| N[NodeProbe – ledger replays last good snapshot]
Reviews (2): Last reviewed commit: "fix(hid): bound the Bolt per-slot probe ..." | Re-trigger Greptile |
13b3fc8 to
1d42ddf
Compare
… the whole receiver (AprilNEA#218) On the AprilNEA#218 repro rig (macOS 27 beta, MX Master 4 on a Bolt receiver) the device list still dropped to "No devices" under a *sustained* failure that AprilNEA#222's ledger and AprilNEA#237's retry don't cover: a single paired online device's deep feature walk (`probe_features` / `Device::new`) hangs every cycle, burning the whole 5 s `PROBE_BUDGET`. Because `probe_one` is the unit wrapped in `timeout(PROBE_BUDGET)`, firing it discards the entire node — including the pairing-register reads that succeeded — so the receiver yields nothing. Confirmed device/OS-level (a standalone `openlogi list` hangs identically); only a physical receiver replug cleared it. The Unifying slot probe already guards exactly this with `UNIFYING_SLOT_PROBE` (a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent — `probe_or_reuse` was awaited bare, so one hung device blew the whole-receiver budget. Mirror that guard to Bolt: a new `BOLT_SLOT_PROBE` (1.5 s) wraps the Bolt slot's `probe_or_reuse`; on timeout the slot falls back to its cached / identity data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of the whole `probe_one` timing out. A hung device keeps showing with its last-known capabilities while the rest of the receiver enumerates. 1.5 s is well above a healthy BTLE walk (sub-second) and small enough that two simultaneously-hung slots still fit `PROBE_BUDGET` after the arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (AprilNEA#222). Refs AprilNEA#218, AprilNEA#222. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1d42ddf to
34314ba
Compare
|
Live-tested this on the #218 rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on a Bolt receiver) and tightened it as a result. The per-slot cap works: when a slot's deep walk hung, But I hit a budget bug at my first value (1.5 s). When the MX Master 3S woke so all three slots were online and hung at once, One honest caveat: this keeps the device list populated (via the cached fallback), but it doesn't cure the underlying wedge — during it the device genuinely stops answering HID++ (writes like DPI/SmartShift would still time out), and only a physical receiver replug cleared it on my rig. So #251 is graceful degradation for the symptom; the root macOS-27 IOHID wedge stays the separate #218 track. |
|
Thanks — already addressed. After live-testing on the rig I hit exactly this: with the MX Master 3S awake, all three slots online and hung at once overran the budget at 1.5 s/slot. The current head ( |
Context
Live re-testing #218 on the repro rig (macOS 27 beta, MX Master 4 + MX Master 3S + MX Mechanical Mini on one Bolt receiver) showed that #222 (
NodeLedger) and #237 (one-shot retry) fix the short transient misses, but a sustained failure still drops the device list to "No devices" — details in #218. This PR fixes that remaining case.Root cause
A single paired online device's deep feature walk (
probe_features/Device::new) hangs on every cycle, burning the whole 5 sPROBE_BUDGET. Each failing cycle still readspairing_count=Some(3)and all three slot codenames — only the deep walk hangs — but becauseprobe_oneis the unit wrapped intimeout(PROBE_BUDGET), when it fires the entire node is discarded, including the pairing-register reads that succeeded. The receiver yields nothing → GUIinventory refreshed count=0.The tell: the Unifying slot probe already guards exactly this with
UNIFYING_SLOT_PROBE(a per-slot cap, so a slow walk returns partial instead of starving the budget). The Bolt slot probe had no equivalent —probe_or_reusewas awaited bare inprobe_bolt_slot, so one hung Bolt device blew the whole-receiver budget.Fix
Mirror the Unifying guard onto the Bolt path: a new
BOLT_SLOT_PROBE(1.5 s) wraps the Bolt slot'sprobe_or_reuse; on timeout the slot falls back to its cached / identity-only data (codename + kind + online come from the pairing register, which reads fine every cycle) instead of letting the wholeprobe_onetime out. A hung device keeps showing with its last-known capabilities while the rest of the receiver still enumerates.1.5 s is well above a healthy BTLE walk (sub-second) yet small enough that two simultaneously-hung slots still fit
PROBE_BUDGETafter the 1.5 s arrival drain; the rare all-slots-hung case degrades to the existing per-node ledger replay (#222).Verification
I couldn't add a unit test (the per-slot timeout needs a live/mock HID++ channel, same as the existing Unifying guard, which also has none). The change mirrors that proven pattern exactly.
Manual hardware verification still recommended: I have the repro rig but the wedge cleared on a receiver replug and I can't trigger the sustained hang on demand. The expected behaviour with this patch: when a device's deep probe hangs, the agent logs
Bolt slot probe timed out; using cached data if availableand the device list stays populated (the hung device keeps its last-known card) instead of dropping to "No devices".Refs #218, #222.