
wait_for_text: two-call state/capture race within a single poll tick #50

@tony

Type: architecture · Tier: deferred · Tool: wait_for_text

What's happening

Each poll in wait_for_text runs two tmux subprocess calls in sequence:

  1. _read_pane_state issues display-message to read history_size, cursor_y, pane_height, pane_pid, pane_dead.
  2. pane.capture_pane(start=start_line, end=None, join_wrapped=True) issues capture-pane, where start_line = baseline_abs - state.history_size + 1.

Between (1) and (2), tmux can scroll more lines into history. tmux's capture-pane computes top = gd->hsize + n against the live hsize at capture time (cmd-capture-pane.c#L158), not the hsize we sampled in step 1. So when N new rows scroll between the two calls:

  • We pass n = baseline_abs - hsize_at_step1 + 1
  • tmux computes top = hsize_at_step2 + n = baseline_abs + 1 + (hsize_at_step2 - hsize_at_step1)
  • The captured window starts N rows past the row we wanted; those N rows are invisible to the wait this tick.
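The arithmetic above can be checked with a small worked example. This is only a sketch: the helper names `requested_start` and `tmux_top` and the sample numbers are illustrative, not part of the tool or of tmux.

```python
def requested_start(baseline_abs: int, hsize_sampled: int) -> int:
    """Relative start line we pass to capture-pane (the `n` above)."""
    return baseline_abs - hsize_sampled + 1


def tmux_top(hsize_live: int, n: int) -> int:
    """Absolute row tmux starts from: top = hsize + n, using the live hsize."""
    return hsize_live + n


baseline_abs = 120   # illustrative absolute row of the baseline
hsize_step1 = 100    # history_size sampled by display-message
hsize_step2 = 103    # live hsize at capture time: 3 rows scrolled in between

n = requested_start(baseline_abs, hsize_step1)   # 120 - 100 + 1 = 21
top = tmux_top(hsize_step2, n)                   # 103 + 21 = 124
wanted = baseline_abs + 1                        # 121
skipped = top - wanted                           # 3 rows missed this tick
```

With no scrolling between the calls (`hsize_step2 == hsize_step1`), `top` equals `wanted` and nothing is skipped.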

When it matters

Single-tick latency under bursty output. The next poll usually picks the missed rows back up — unless the missed rows have already scrolled past the visible region and been collected by grid_collect_history, at which point the rollover guard fires and the wait raises. So the bug surface is:

  • One extra poll interval of latency on transient bursts (default 50 ms; bounded).
  • Permanent miss only at the moment of history rollover — but rollover now raises.

In other words: the race exists but its impact is bounded by interval and capped at "raise" rather than "silently wrong" thanks to the rollover guard.

Options under consideration

1. Re-read after capture, retry on drift

# Sample history_size before the capture.
state_pre = await asyncio.to_thread(_read_pane_state, pane)
start_line = baseline_abs - state_pre.history_size + 1
lines = await asyncio.to_thread(
    pane.capture_pane, start=start_line, end=None, join_wrapped=True
)
# Sample again afterwards; any growth means rows scrolled into history mid-tick.
state_post = await asyncio.to_thread(_read_pane_state, pane)
delta = state_post.history_size - state_pre.history_size
if delta > 0:
    # capture may have started up to `delta` rows too late; re-issue with adjusted start
    ...

Raises per-tick subprocess cost from 2 tmux calls to 3 on every tick, and to 4 or more (double the current cost) when drift is detected and the capture is re-issued. Complicates the _PaneState invariant set: now we track two state reads per tick. Test matrix grows.
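A bounded version of the retry could look roughly like the sketch below. It is synchronous and takes the two tmux calls as injected callables so it stays self-contained. The function name `capture_with_drift_retry`, its parameters, and the `max_retries` cap are assumptions for illustration, not existing code.

```python
from typing import Callable, List, Tuple


def capture_with_drift_retry(
    read_hsize: Callable[[], int],          # stand-in for _read_pane_state(...).history_size
    capture: Callable[[int], List[str]],    # stand-in for pane.capture_pane(start=...)
    baseline_abs: int,
    max_retries: int = 2,                   # hypothetical cap so a busy pane cannot spin forever
) -> Tuple[List[str], int]:
    """Capture from baseline_abs + 1, re-issuing while history_size drifts forward.

    Returns (lines, remaining_drift); remaining_drift is 0 for a clean capture.
    """
    hsize_pre = read_hsize()
    lines: List[str] = []
    delta = 0
    for _ in range(max_retries + 1):
        lines = capture(baseline_abs - hsize_pre + 1)
        hsize_post = read_hsize()
        delta = hsize_post - hsize_pre
        if delta <= 0:
            # No forward drift: the capture started at (or before) the wanted row.
            return lines, 0
        # Rows scrolled in around the capture; retry against the newer sample.
        hsize_pre = hsize_post
    # Still drifting after the cap: hand back the last capture and the drift size.
    return lines, delta
```

Each retry costs one capture plus one state read, so the caller can trade `max_retries` against per-tick latency. When drift persists, the returned count lets the caller fall back to next-tick recovery.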

2. Chain in a single tmux command

Build one pane.cmd(...) invocation that runs display-message and capture-pane back to back using tmux's \; command separator. The caller then has to split a single stdout stream into the state line and the captured rows. This drops out of libtmux's typed API and couples the tool tightly to tmux's chaining quirks.
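The stdout split might be sketched as follows. It assumes the display-message format prints exactly one tab-separated line with the five fields from step 1, and that every remaining line comes from capture-pane. The `PaneState` tuple and `split_chained_output` are hypothetical names, not libtmux API.

```python
from typing import List, NamedTuple, Tuple


class PaneState(NamedTuple):
    history_size: int
    cursor_y: int
    pane_height: int
    pane_pid: int
    pane_dead: bool


def split_chained_output(stdout: List[str]) -> Tuple[PaneState, List[str]]:
    """Split combined stdout into (state, captured lines).

    Assumes line 0 is the display-message output and the rest is capture-pane.
    """
    if not stdout:
        raise ValueError("empty output from chained tmux command")
    fields = stdout[0].split("\t")
    if len(fields) != 5:
        raise ValueError(f"unexpected state line: {stdout[0]!r}")
    hsize, cursor_y, height, pid, dead = fields
    state = PaneState(int(hsize), int(cursor_y), int(height), int(pid), dead == "1")
    return state, stdout[1:]
```

The field-count check is what keeps this from silently misparsing if a pane row ever lands where the state line was expected, which is the kind of chaining quirk this option would have to guard against.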

3. Document, rely on next-tick recovery (current behavior)

Acceptable because:

  • The miss is bounded by interval (default 50 ms).
  • Permanent misses now raise rather than silently return wrong results, courtesy of the rollover guard.
  • The deterministic alternative for command-completion synchronization is wait_for_channel composed with tmux wait-for -S — zero polling, zero races.
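As an illustration of the last bullet, the command sent to the pane can carry its own completion signal. This is a minimal sketch: `with_completion_signal` is a hypothetical helper, and it only builds the shell string. Waiting on the channel is left to wait_for_channel, or to `tmux wait-for <channel>` run elsewhere.

```python
import shlex


def with_completion_signal(command: str, channel: str) -> str:
    """Append a `tmux wait-for -S` signal that fires once `command` exits.

    Uses `;` rather than `&&` so the channel is signalled even if the
    command fails; the waiter then never hangs on a failed command.
    """
    return f"{command}; tmux wait-for -S {shlex.quote(channel)}"


line = with_completion_signal("make test", "build-done")
```

Here `line` is `make test; tmux wait-for -S build-done`. Because the signal is sent by the shell after the command returns, no pane content needs to be polled or captured at all.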

Recommendation

Stay on option 3 until real-world telemetry shows flaky single-tick misses. The blast radius is small and the agent-facing escape hatch (wait_for_channel) is already documented in the wait_for_text "When NOT to use this" section. Re-evaluate if a stress-test fixture starts catching missed transitions.

Labels: enhancement
