fix(test): stop stress_resize_extremes hanging CI forever (phux-s2iw)#49

Merged: phall1 merged 1 commit into main from fix/s2iw-stress-resize-screenshot-hang on Jun 3, 2026

@phall1 phall1 commented Jun 3, 2026

Summary

The e2e CI lane intermittently hung for more than 100 minutes on both_axes_shrink_storm_under_output_does_not_panic (and latently on resize_degenerate_viewports_do_not_panic) until the job was cancelled. This was the main cause of the flaky/failing e2e lane.

Root cause (it was never a deadlock)

The final cap.attach_screen(client.screenshot().await...) call drains output "until 20ms of quiet." But the seeds in these tests emit forever, every 5ms (printf; sleep 0.005) and every 20ms (stty size; sleep 0.02), so screenshot()'s drain loop never sees a quiet gap and never returns. The hang is intermittent (not 100%) only because occasional scheduler jitter yields a >20ms gap that lets the loop break.

Confirmed by a stack sample of a hung process: the runtime thread is stuck in ClientHandle::screenshot (builder.rs:430) at ~4.6% CPU (parked, not spinning the server). Two earlier hypotheses (client-read backpressure and a server resize deadlock) were disproved: draining the client didn't help, so the hang is purely the screenshot loop.
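The termination condition described above can be illustrated with a small model in virtual time. This is a sketch only: `frames_drained_before_quiet` and its parameters are hypothetical names, not code from builder.rs, and it assumes the loop exits the first time the gap before the next frame exceeds the quiet window.

```rust
/// Hypothetical model of a "drain until quiet" loop (not the harness code).
/// `gaps_ms[i]` is the delay in milliseconds before frame `i` arrives.
/// Returns `Some(n)` if the loop would exit after draining `n` frames because
/// the next gap exceeded `quiet_ms`. Returns `None` if no gap in the sample
/// was long enough; against an endless seed the loop would still be waiting.
fn frames_drained_before_quiet(gaps_ms: &[u64], quiet_ms: u64) -> Option<usize> {
    // The index of the first over-long gap equals the number of frames
    // that were drained before the quiet window finally elapsed.
    gaps_ms.iter().position(|&gap| gap > quiet_ms)
}
```

With a steady 5ms seed, every gap is below the 20ms window, so the model never reports an exit. A single jittered gap (say 27ms) is enough to end the loop, which is consistent with the hang being intermittent rather than constant.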

Fix (test-harness only, no production code)

  • Add ClientHandle::drain_output_bounded(max_frames) — a count-bounded drain, safe against a continuously-emitting seed where screenshot() cannot terminate.
  • Add ClientHandle::snapshot_text() — read the oracle's text without draining.
  • Replace the hanging screenshot() in both tests with drain_output_bounded(32) + snapshot_text().
  • The resize storm loops are untouched — they never deadlocked (resize_raw is send-only and always reached the final screenshot).
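The idea behind a count-bounded drain can be sketched as follows. This is not the PR's implementation of drain_output_bounded: the function name `drain_bounded_sketch`, the `per_frame_wait` parameter, and the use of a std `mpsc` channel as the frame source are all assumptions made for illustration.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Hypothetical count-bounded drain (illustrative, not the harness helper).
/// Consumes at most `max_frames` frames, waiting up to `per_frame_wait` for
/// each one, and returns how many frames were consumed.
fn drain_bounded_sketch<T>(
    rx: &Receiver<T>,
    max_frames: usize,
    per_frame_wait: Duration,
) -> usize {
    let mut drained = 0;
    // The loop runs at most `max_frames` times, so it terminates even if
    // the producer never pauses.
    while drained < max_frames {
        match rx.recv_timeout(per_frame_wait) {
            Ok(_frame) => drained += 1,
            // Timeout (output went quiet) or disconnected sender: stop early.
            Err(_) => break,
        }
    }
    drained
}
```

The design point is that the exit condition depends on a counter the caller controls rather than on the producer going quiet, so a seed that emits forever can delay the drain but cannot keep it from returning.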

Validation

  • both_axes_shrink_storm: 25/25 pass, slowest 1s (was a 100-min hang).
  • Full file (both tests): 10/10, slowest 2s.
  • just ci green.

Follow-up (noted in bead phux-s2iw)

The same screenshot()-on-a-fast-continuous-seed anti-pattern is latent in stress_resize_storm.rs and stress_lifecycle_churn.rs (not currently failing CI). This run also surfaced two CI issues, tracked separately: the workflow recompiles the workspace ~4× across jobs, and the FlakeHub/nix cache auth is failing.

🤖 Generated with Claude Code

The e2e lane intermittently hung for >100min on
`both_axes_shrink_storm_under_output_does_not_panic` (and latently
`resize_degenerate_viewports_do_not_panic`). Root cause is NOT a server
deadlock: the final `cap.attach_screen(client.screenshot()...)` call
drains output "until 20ms of quiet", but these tests' seeds emit every
5ms / 20ms forever, so screenshot's drain loop never sees a gap and
spins indefinitely. Flaky only because occasional scheduler jitter
yields a >20ms gap. Confirmed by stack sample: runtime thread stuck in
`ClientHandle::screenshot` (builder.rs).

Add two harness helpers: `drain_output_bounded(max_frames)` (a
count-bounded drain that is safe against a continuously-emitting seed)
and `snapshot_text()` (read the oracle without draining). Replace the
hanging `screenshot()` in both tests with a bounded drain + snapshot.
The resize storm loops are left untouched -- they never deadlocked
(`resize_raw` is send-only and always reached the final screenshot).

Validation: both_axes 25/25 pass, slowest 1s (was a 100min hang); full
file both tests 10/10, slowest 2s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@phall1 phall1 merged commit a1c6b18 into main Jun 3, 2026
5 of 6 checks passed
@phall1 phall1 deleted the fix/s2iw-stress-resize-screenshot-hang branch June 3, 2026 23:38