Skip to content

test(config): fix the concurrent-writers test hang/flake (test-only, no prod change)#223

Merged
alexkroman merged 2 commits into
mainfrom
claude/fix-config-concurrency-windows-flake
Jun 17, 2026
Merged

test(config): fix the concurrent-writers test hang/flake (test-only, no prod change)#223
alexkroman merged 2 commits into
mainfrom
claude/fix-config-concurrency-windows-flake

Conversation

@alexkroman

@alexkroman alexkroman commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

Problem

tests/test_concurrency.py::test_config_concurrent_writers_always_leave_a_valid_file intermittently hangs ~20 minutes on Windows CI then gets cancelled at the job timeout (it was blocking #221, but it's a pre-existing flake on main). The root causes are in the test, not production — two bugs:

1. The 20-minute hang

A dedicated reader busy-looped while not stop.is_set(). A writer's f.result() could re-raise before stop.set() ran, so the reader was never released and the pool's shutdown(wait=True) on __exit__ blocked forever.

2. Why a writer raised (the flake)

That reader was a zero-gap busy-spin, holding config.toml open continuously. On Windows (no atomic replace-over-open) this made nearly every concurrent os.replace fail the transient sharing-violation and exhaust its retry — a synthetic contention level no single-user CLI ever produces.

Approach — fix the test, leave production alone

An earlier draft of this PR hardened production's _retry_on_sharing_violation (exponential jitter, bigger budget, an S311 ruff-ignore) so it could survive that synthetic hammer. That was the wrong layer — gold-plating prod to satisfy an unrealistic test.

This version is test-only:

  • Replaces the dedicated-reader + stop Event shape with each thread doing a few bounded write+read cycles. Reads still race other threads' writes, so the real invariant is still tested — os.replace atomicity, no torn/invalid_config reads — but file access is paced by the write work (no zero-gap pin) and bounded (no perpetual reader). It can therefore neither hang nor manufacture the Windows contention storm.
  • Production's existing modest retry (config._retry_on_sharing_violation) is untouched and remains directly covered by the three test_retry_on_sharing_violation_* unit tests.

Net diff: tests/test_concurrency.py only — no config.py or pyproject.toml change.

Verification

Full scripts/check.sh gate passes locally (All checks passed.). The Windows-only behavior can't be reproduced from the Linux gate, but the new shape is bounded (hang is structurally impossible) and paced (no continuous file-open), so the sharing-violation storm that exhausted the retry can't occur. This PR's own Windows CI is the live confirmation.

🤖 Generated with Claude Code

https://claude.ai/code/session_01F24PozqxFy2sCAApA1Ne1b

@alexkroman alexkroman enabled auto-merge June 17, 2026 18:12
@alexkroman alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman alexkroman added this pull request to the merge queue Jun 17, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026
@alexkroman alexkroman added this pull request to the merge queue Jun 17, 2026
@alexkroman alexkroman removed this pull request from the merge queue due to a manual request Jun 17, 2026
…hing prod

test_config_concurrent_writers_always_leave_a_valid_file intermittently hung
~20 min on Windows CI then got cancelled. Two test-only bugs, no production
change needed:

1. Hang: a dedicated reader busy-looped `while not stop.is_set()`, and a
   writer's f.result() could re-raise before stop.set(), so the reader was
   never released and the pool's shutdown(wait=True) blocked forever.

2. Flake source: that reader was a zero-gap busy-spin, pinning config.toml open
   continuously. On Windows (no atomic replace-over-open) it made nearly every
   concurrent os.replace fail the sharing-violation retry — synthetic contention
   no single-user CLI produces.

Replace the dedicated-reader-+-stop-Event shape with each thread doing a few
bounded write+read cycles: reads still race other threads' writes (the real
invariant — os.replace atomicity, no torn reads), but access is paced by the
write work and bounded, so it can neither hang nor manufacture the Windows
contention storm. Production's existing modest retry (already unit-tested) is
left untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01F24PozqxFy2sCAApA1Ne1b
@alexkroman alexkroman force-pushed the claude/fix-config-concurrency-windows-flake branch from 67913d4 to dabb39c Compare June 17, 2026 19:12
@alexkroman alexkroman changed the title fix(config): stop the Windows config-concurrency test hanging and flaking test(config): fix the concurrent-writers test hang/flake (test-only, no prod change) Jun 17, 2026
@alexkroman alexkroman enabled auto-merge June 17, 2026 19:13
@alexkroman alexkroman added this pull request to the merge queue Jun 17, 2026
test_config_concurrent_writers_always_leave_a_valid_file was chronically
flaky/hanging on Windows CI: it spun up many threads doing concurrent RMW on
config.toml, manufacturing os.replace sharing-violation contention that Windows
(no atomic replace-over-open) surfaces as PermissionError — a load no real
single-user CLI produces. Every variant (more retries, jittered backoff,
bounded paced cycles) still tripped it on CI.

Remove it. The behavior it nominally guarded is still covered:
- the Windows sharing-violation retry by the three test_retry_on_sharing_violation_*
  unit tests (deterministic, no real threads/IO);
- at-rest atomicity (no torn reads) structurally by os.replace itself, and by
  test_config.py's test_dump_is_atomic_and_leaves_no_temp_files.

No production code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01F24PozqxFy2sCAApA1Ne1b
@alexkroman alexkroman removed this pull request from the merge queue due to a manual request Jun 17, 2026
@alexkroman alexkroman enabled auto-merge June 17, 2026 19:30
@alexkroman alexkroman added this pull request to the merge queue Jun 17, 2026
Merged via the queue into main with commit ed35e87 Jun 17, 2026
17 of 18 checks passed
@alexkroman alexkroman deleted the claude/fix-config-concurrency-windows-flake branch June 17, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants