test: stop codex app-server runtime tests flaking under parallel load#376
Merged
Conversation
These tests spawn real `node` sidecars (the fake-app-server fixture) and require them to reach `initialize`. The server-config suite runs with fileParallelism + maxConcurrency, so freshly-spawned sidecars are CPU-starved. Under that contention the dominant failure was not the per-test wall-clock timeout but `ensureReady()` rejecting (`did not finish initialize within Nms` / `ECONNREFUSED`): a real spawn+bind+initialize could not complete inside the tests' tight per-attempt budgets (200/500ms). A whole family flaked, not one test. Production is unaffected (DEFAULT_STARTUP_ATTEMPT_TIMEOUT_MS = 3000); only the tests shrank it for speed. Fix is test-only: - Give real-success startup tests a contention-tolerant per-attempt budget (REAL_STARTUP_ATTEMPT_TIMEOUT_MS) and drop their redundant 3000ms wall-clock overrides so they inherit the 30s global. - Keep the initialize-hang test's tight budget (it verifies the timeout FIRES) but make its absolute timing assertion a contention-tolerant boundedness check tied to the configured budget. - Raise the wait-helper poll deadline (pid file / process exit / metadata record) which observes second-order fixture spawns that lag under load. - Wait for the identity-complete metadata write instead of racing the runtime's first (empty-identity) write. Verified: 16 concurrent copies x 53 tests fully green under a 60-process CPU storm (load avg ~135), versus 5 distinct tests flaking pre-fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
test/unit/server/coding-cli/codex-app-server/runtime.test.tsflakes in the broad server-config run (it passes in isolation). The hand-off blamed the per-test 3000 ms wall-clock timeout, but a contention storm (60 CPU-busy loops + 16 concurrent copies of the file) showed that's only a small slice:ensureReady()rejected —did not finish initialize within Nms/connect ECONNREFUSEDTest timed out in 3000ms(wall-clock)Root cause: these tests spawn real
nodesidecars (thefake-app-serverfixture) and require them to spawn + bind a WebSocket + answerinitialize. The server-config suite runs withfileParallelism + maxConcurrency(vitest.server.config.ts), so freshly-spawned sidecars are CPU-starved and can't meet the tests' tight per-attempt budgets (startupAttemptTimeoutMsof 200/500 ms), soensureReadyrejects at theawait. It's a family, not one test:retries startup when the preallocated loopback port is lost(200 ms),keeps the same sidecar when wrapper identity is transiently incomplete(500 ms),uses the startup attempt timeout to tear down an initialize hang(500 ms + aDate.now()-start < 1500assertion), and bothtears down the owned process group before retry…tests (500 ms).Production is unaffected —
DEFAULT_STARTUP_ATTEMPT_TIMEOUT_MS = 3000; only the tests shrank it for speed. This is test-config fragility, not a product bug, so the change is test-only.Fix
REAL_STARTUP_ATTEMPT_TIMEOUT_MS = 5000, above production's own 3000 default) and drop the redundant}, 3_000)wall-clock overrides so they inherit the 30 s global.< 1500 msinto a boundedness check tied to the configured budget (limit * timeout + slack).waitForPidFile/waitForProcessExit/waitForMetadataRecord) → raise the poll deadline (WAIT_HELPER_TIMEOUT_MS = 15000); they observe second-order fixture spawns (e.g. a native child) that lag under load.writes ownership metadata immediately after spawn…→ wait for the identity-complete metadata write via a predicate instead of racing the runtime's first (empty-identity) write.Verification
typecheck:serverandtypecheck:clientclean; targetedtscon the test file clean.No production code touched; change is confined to one test file.
🤖 Generated with Claude Code