Skip to content

One-shot enumerate() retry is a no-op for Unifying partial arrival-drain misses #277

@AprilNEA

Description

@AprilNEA

Summary

The one-shot openlogi_hid::enumerate() free fn — used only by the CLI (openlogi list, openlogi diag *) — retries through a transient probe miss (added in #237), but its early-exit gate never trips for a Unifying receiver. A Unifying probe that drops a device because its arrival event landed after the drain window reports healthy = true, so the retry loop returns the partial list immediately instead of retrying. #237's retry therefore covers the Bolt path (the #218 reporter's case) but is effectively a no-op on Unifying.

The continuous consumer (the GUI/agent inventory watcher) is not affected: it re-enumerates every ~2 s forever, so a single partial drain self-corrects on a later tick. This gap is specific to the single-shot CLI path.

Mechanism

#237 made the one-shot enumerate() retry while any probed node is unhealthy:

pub async fn enumerate() -> Result<Vec<DeviceInventory>, InventoryError> {
// The polling [`Enumerator`] keeps a per-node ledger across ticks, so a
// transient probe miss replays the node's last good inventory. A one-shot
// caller (CLI `list` / `diag`) builds a fresh `Enumerator` whose ledger is
// empty, so a miss has nothing to replay and would surface as an empty or
// partial list — the two isolated runs in #218 read 3 devices and 0. Retry a
// few times instead, reusing the same enumerator so its ledger accumulates a
// snapshot a later attempt can replay and the opened channel stays warm.
// #226's 5 s request timeout inside `HidppChannel::send` makes a dead probe
// fail fast, so a short bounded retry is cheap.
let mut enumerator = Enumerator::default();
let mut attempt = 1u8;
loop {
let (inventories, all_healthy) = enumerator.enumerate_reporting_health().await?;
if all_healthy || attempt >= ONESHOT_ATTEMPTS {
return Ok(inventories);
}
debug!(
attempt,
"one-shot enumerate saw an unhealthy node — retrying"
);
tokio::time::sleep(ONESHOT_RETRY_DELAY).await;
attempt += 1;
}
}
/// Attempts a one-shot [`enumerate`] makes before returning whatever it last
/// read, when a node keeps coming back unhealthy.
const ONESHOT_ATTEMPTS: u8 = 4;
/// Delay between one-shot [`enumerate`] retries. A first probe usually wakes an
/// asleep device, so a short pause lets the next attempt read it cleanly.
const ONESHOT_RETRY_DELAY: Duration = Duration::from_millis(300);

The exit condition is all_healthy || attempt >= ONESHOT_ATTEMPTS (inventory.rs:271), where all_healthy is the logical AND of each probed node's probe.healthy. The two receiver kinds define healthy very differently:

  • Bolt (inventory.rs:537, probe_bolt_receiver): healthy = pairing_count.is_some_and(|count| paired.len() == count). A device missing from the arrival drain lowers paired.len() below counthealthy = falsethe retry fires.
  • Unifying (inventory.rs:622, probe_unifying_receiver): healthy = pairing_count.is_some(). Offline Unifying devices aren't enumerable yet (the 0xB5/0x5N pairing-info sub-register format isn't resolved), so paired.len() != count is expected and cannot be the health signal. The arrival drain is the only device source, so a device whose arrival event arrives after ARRIVAL_DRAIN (1.5 s) is silently dropped — yet pairing_count.is_some() is still true → healthy = truethe retry never fires.

So on Unifying, a single timing-induced partial drain in openlogi list / diag returns a short device set with no retry, and (because that partial list is written to the per-node ledger's last_good) it is returned as-is.

Originally flagged by Greptile as a P2 on #237 and acknowledged as a known limitation in the probe_unifying_receiver code comment.

Impact

Proposed fix

Make the retry's early-exit transport-agnostic: stop when all_healthy OR the returned inventory is unchanged from the previous attempt.

  • A Unifying partial drain → the missing device appears on a later attempt → the inventory changes → keep retrying until it stabilizes.
  • Also removes a smaller wart: a chronically-unhealthy node (e.g. an asleep BT-direct device that never answers) keeps all_healthy = false forever, forcing every one-shot call to burn all ONESHOT_ATTEMPTS even when the receiver's devices read cleanly on attempt 1. A "stable inventory" stop returns as soon as the set stops growing.

Cost: needs DeviceInventory / PairedDevice equality + a stable ordering for the comparison, and (if all_healthy is dropped as the fast path) a minimum of two passes on the happy path. Keeping all_healthy as the fast-exit and adding "unchanged inventory" as the fallback stop preserves the one-pass happy path while closing the Unifying gap.

Rejected alternative: a Unifying-specific "online-device count vs a known-good baseline" health signal — stateful, needs persistence, and has its own flapping failure modes.

Refs #237, #222, #218.

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: cliCommand-line interfacearea: hidHID device discovery, permissions, reads, or writestype: bugSomething is broken or behaves incorrectly

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions