Skip to content

test: stop codex app-server runtime tests flaking under parallel load#376

Merged
danshapiro merged 1 commit into
mainfrom
fix/codex-runtime-startup-flake
May 30, 2026
Merged

test: stop codex app-server runtime tests flaking under parallel load#376
danshapiro merged 1 commit into
mainfrom
fix/codex-runtime-startup-flake

Conversation

@danshapiro
Copy link
Copy Markdown
Owner

Problem

test/unit/server/coding-cli/codex-app-server/runtime.test.ts flakes in the broad server-config run (it passes in isolation). The hand-off blamed the per-test 3000 ms wall-clock timeout, but a contention storm (60 CPU-busy loops + 16 concurrent copies of the file) showed that's only a small slice:

Failure mode Share
ensureReady() rejecteddid not finish initialize within Nms / connect ECONNREFUSED dominant (~32/34)
Test timed out in 3000ms (wall-clock) ~2/34

Root cause: these tests spawn real node sidecars (the fake-app-server fixture) and require them to spawn + bind a WebSocket + answer initialize. The server-config suite runs with fileParallelism + maxConcurrency (vitest.server.config.ts), so freshly-spawned sidecars are CPU-starved and can't meet the tests' tight per-attempt budgets (startupAttemptTimeoutMs of 200/500 ms), so ensureReady rejects at the await. It's a family, not one test: retries startup when the preallocated loopback port is lost (200 ms), keeps the same sidecar when wrapper identity is transiently incomplete (500 ms), uses the startup attempt timeout to tear down an initialize hang (500 ms + a Date.now()-start < 1500 assertion), and both tears down the owned process group before retry… tests (500 ms).

Production is unaffectedDEFAULT_STARTUP_ATTEMPT_TIMEOUT_MS = 3000; only the tests shrank it for speed. This is test-config fragility, not a product bug, so the change is test-only.

Fix

  • Real-success startup tests → contention-tolerant per-attempt budget (REAL_STARTUP_ATTEMPT_TIMEOUT_MS = 5000, above production's own 3000 default) and drop the redundant }, 3_000) wall-clock overrides so they inherit the 30 s global.
  • Initialize-hang test → keep its tight budget (it verifies the per-attempt timeout fires), but turn the absolute < 1500 ms into a boundedness check tied to the configured budget (limit * timeout + slack).
  • Wait helpers (waitForPidFile / waitForProcessExit / waitForMetadataRecord) → raise the poll deadline (WAIT_HELPER_TIMEOUT_MS = 15000); they observe second-order fixture spawns (e.g. a native child) that lag under load.
  • writes ownership metadata immediately after spawn… → wait for the identity-complete metadata write via a predicate instead of racing the runtime's first (empty-identity) write.

Verification

  • In isolation: 53/53.
  • Under the storm (16 concurrent copies × 53 tests, 60-process CPU load, load avg ~135 — heavier than the pre-fix run): 16/16 copies fully green, 0 failures, vs 5 distinct tests flaking pre-fix.
  • typecheck:server and typecheck:client clean; targeted tsc on the test file clean.

No production code touched; change is confined to one test file.

🤖 Generated with Claude Code

These tests spawn real `node` sidecars (the fake-app-server fixture) and
require them to reach `initialize`. The server-config suite runs with
fileParallelism + maxConcurrency, so freshly-spawned sidecars are CPU-starved.
Under that contention the dominant failure was not the per-test wall-clock
timeout but `ensureReady()` rejecting (`did not finish initialize within Nms`
/ `ECONNREFUSED`): a real spawn+bind+initialize could not complete inside the
tests' tight per-attempt budgets (200/500ms). A whole family flaked, not one
test.

Production is unaffected (DEFAULT_STARTUP_ATTEMPT_TIMEOUT_MS = 3000); only the
tests shrank it for speed. Fix is test-only:

- Give real-success startup tests a contention-tolerant per-attempt budget
  (REAL_STARTUP_ATTEMPT_TIMEOUT_MS) and drop their redundant 3000ms wall-clock
  overrides so they inherit the 30s global.
- Keep the initialize-hang test's tight budget (it verifies the timeout FIRES)
  but make its absolute timing assertion a contention-tolerant boundedness
  check tied to the configured budget.
- Raise the wait-helper poll deadline (pid file / process exit / metadata
  record) which observes second-order fixture spawns that lag under load.
- Wait for the identity-complete metadata write instead of racing the runtime's
  first (empty-identity) write.

Verified: 16 concurrent copies x 53 tests fully green under a 60-process CPU
storm (load avg ~135), versus 5 distinct tests flaking pre-fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@danshapiro danshapiro merged commit d2b2004 into main May 30, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants