fix(test): stop stress_resize_extremes hanging CI forever (phux-s2iw) #49
Merged
The e2e lane intermittently hung for >100min on `both_axes_shrink_storm_under_output_does_not_panic` (and latently `resize_degenerate_viewports_do_not_panic`).

Root cause is NOT a server deadlock: the final `cap.attach_screen(client.screenshot()...)` call drains output "until 20ms of quiet", but these tests' seeds emit every 5ms / 20ms forever, so screenshot's drain loop never sees a gap and spins indefinitely. Flaky only because occasional scheduler jitter yields a >20ms gap. Confirmed by stack sample: runtime thread stuck in `ClientHandle::screenshot` (builder.rs).

Add two harness helpers: `drain_output_bounded(max_frames)` (a count-bounded drain that is safe against a continuously-emitting seed) and `snapshot_text()` (read the oracle without draining). Replace the hanging `screenshot()` in both tests with a bounded drain + snapshot. The resize storm loops are left untouched -- they never deadlocked (`resize_raw` is send-only and always reached the final screenshot).

Validation: both_axes 25/25 pass, slowest 1s (was a 100min hang); full file both tests 10/10, slowest 2s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
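To make the root cause concrete, here is a minimal, deterministic sketch of a quiet-window drain. It is a simulation, not the code in builder.rs: `drain_until_quiet`, the `Drain` enum, and the `budget` parameter are hypothetical names introduced only for illustration, and frame arrival is modelled as an iterator of inter-frame gaps rather than real I/O.

```rust
use std::time::Duration;

/// Outcome of a simulated drain (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Drain {
    /// The loop exited on its own after consuming this many frames.
    Finished(usize),
    /// The simulation budget ran out; a real loop would still be running.
    StillRunning(usize),
}

/// Quiet-window drain: consume frames until the gap before the next frame
/// is at least `quiet`. `gaps` yields the time between consecutive frames.
/// `budget` exists only so the simulation itself terminates.
fn drain_until_quiet(
    mut gaps: impl Iterator<Item = Duration>,
    quiet: Duration,
    budget: usize,
) -> Drain {
    let mut frames = 0;
    while frames < budget {
        match gaps.next() {
            // Another frame arrived inside the quiet window: keep draining.
            Some(gap) if gap < quiet => frames += 1,
            // A gap of at least `quiet`, or end of output: the loop exits.
            _ => return Drain::Finished(frames),
        }
    }
    Drain::StillRunning(frames)
}

fn main() {
    let quiet = Duration::from_millis(20);
    // A seed that prints every 5ms forever never leaves a 20ms gap.
    let seed = || std::iter::repeat(Duration::from_millis(5));
    println!("{:?}", drain_until_quiet(seed(), quiet, 100_000));

    // One 25ms gap (standing in for scheduler jitter) lets the loop exit,
    // which is why the hang was intermittent rather than constant.
    let jittery = seed()
        .take(10)
        .chain(std::iter::once(Duration::from_millis(25)))
        .chain(seed());
    println!("{:?}", drain_until_quiet(jittery, quiet, 100_000));
}
```

In this model the steady 5ms seed exhausts any budget it is given, while a single gap above the quiet threshold ends the drain immediately.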
Summary
The `e2e` CI lane intermittently hung for >100 minutes on `both_axes_shrink_storm_under_output_does_not_panic` (and latently on `resize_degenerate_viewports_do_not_panic`), then got cancelled. This was the main cause of the flaky/failing e2e lane.

Root cause (it was never a deadlock)

The final `cap.attach_screen(client.screenshot().await...)` drains output "until 20ms of quiet." But these tests' seeds emit forever, every 5ms (`printf; sleep 0.005`) and 20ms (`stty size; sleep 0.02`), so `screenshot()`'s drain loop never sees a gap and spins indefinitely. It's flaky (not 100%) only because occasional scheduler jitter yields a >20ms gap that lets it break.

Confirmed by stack sample of a hung process: the runtime thread is stuck in `ClientHandle::screenshot` (`builder.rs:430`), ~4.6% CPU (parked, not spinning the server). Two earlier hypotheses (client-read backpressure, server resize deadlock) were disproved: draining the client didn't help; the hang is purely the screenshot loop.

Fix (test-harness only, no production code)

- Add `ClientHandle::drain_output_bounded(max_frames)`: a count-bounded drain, safe against a continuously-emitting seed where `screenshot()` cannot terminate.
- Add `ClientHandle::snapshot_text()`: read the oracle's text without draining.
- Replace the hanging `screenshot()` in both tests with `drain_output_bounded(32)` + `snapshot_text()`.
- The resize storm loops are left untouched: they never deadlocked (`resize_raw` is send-only and always reached the final screenshot).

Validation

- `both_axes_shrink_storm`: 25/25 pass, slowest 1s (was a 100-min hang).
- `just ci` green.

Follow-up (noted in bead phux-s2iw)

The same `screenshot()`-on-a-fast-continuous-seed anti-pattern is latent in `stress_resize_storm.rs` and `stress_lifecycle_churn.rs` (not currently failing CI). Separately, this run surfaced that the CI workflow recompiles the workspace ~4× across jobs and the FlakeHub/nix cache auth is failing; both are tracked separately.

🤖 Generated with Claude Code
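The shape of the fix can be sketched with a small mock. This is an illustration under stated assumptions, not the harness code: `MockClient` and its fields are hypothetical, the real helpers live on `ClientHandle` and are presumably async, and their actual signatures and return types are not shown in this PR. The point is only the split between a drain that stops after a fixed number of frames and a snapshot that reads state without draining.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the harness client; not the real `ClientHandle`.
struct MockClient {
    /// Frames the server has sent that the client has not read yet.
    pending: VecDeque<String>,
    /// Text the oracle has accumulated from frames already read.
    oracle: String,
}

impl MockClient {
    /// Read at most `max_frames` pending frames into the oracle and
    /// return how many were read. Terminates even if output never stops.
    fn drain_output_bounded(&mut self, max_frames: usize) -> usize {
        let mut read = 0;
        while read < max_frames {
            match self.pending.pop_front() {
                Some(frame) => {
                    self.oracle.push_str(&frame);
                    read += 1;
                }
                // Nothing left to read: stop early.
                None => break,
            }
        }
        read
    }

    /// Return the oracle's current text without reading any pending frames.
    fn snapshot_text(&self) -> &str {
        &self.oracle
    }
}

fn main() {
    // 1000 queued frames stand in for a seed that keeps printing.
    let mut client = MockClient {
        pending: (0..1000).map(|i| format!("frame{i}\n")).collect(),
        oracle: String::new(),
    };

    // Test tail: bounded drain, then a non-draining snapshot.
    let read = client.drain_output_bounded(32);
    let text = client.snapshot_text();
    println!("read {read} frames, oracle holds {} lines", text.lines().count());
}
```

Because the bound is a frame count rather than a quiet interval, the drain's termination does not depend on how fast the seed emits.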