
fix: multi-worker startup + 23 bugs from full codebase audit #2

Merged
johnm-dta merged 15 commits into main from fix/multi-worker-uvicorn-startup
Mar 23, 2026

Conversation

@johnm-dta
Owner

Summary

  • Fix multi-worker uvicorn startup — all presets and the default config set workers=4, but the CLI passed the app as a Python object which only supports single-worker mode. Implemented factory pattern with env var serialization.
  • Fix 22 additional bugs discovered through a systematic 14-agent deep audit of the entire codebase (9,400 lines across 28 source files). Bugs were found via test execution, integration tracing, linting, type checking, and multi-perspective analysis (architect, systems thinker, Python engineer, quality engineer).

Bug clusters resolved

| Cluster | Bugs | Root cause |
|---|---|---|
| Metrics misclassification | 7 | `_classify_outcome` gated `connection_error` on `status_code is None`, but servers record some connection errors with non-None status codes (timeout → 504, incomplete → 200). Fixed to check `error_type` first. |
| CLI duplication bugs | 6 | Same 3 bugs in both `llm/cli.py` and `web/cli.py`: `!!python/tuple` in YAML output, `--format` accepts invalid values, env var leak after `uvicorn.run()`. |
| MCP analysis logic | 4 | Percentile off-by-one (`int(n*p)` → `ceil(n*p)-1`), trailing burst silently dropped, false recovery time for unfinished bursts, anomaly detection only checked 2 of 6 error columns. |
| OpenAI API fidelity | 3 | Missing `param` field in error responses, inconsistent timeout 504 body format, echo mode crashes on multi-modal content (list instead of string). |
| Thread safety | 2 | `ContentGenerator._get_preset_bank()` had no locking (ported double-checked locking from the LLM counterpart); corrected `InjectionEngine` docstring on RNG safety. |
| Multi-worker startup | 1 | Original bug report: factory pattern for workers > 1. |

Audit also filed 18 additional bugs not resolved in this PR

Tracked in filigree with audit/* labels. Key remaining items: memory scalability (unbounded fetchall), validation gaps, finish_reason variation (feature), warning stacklevel (cosmetic).

Test plan

  • All 1112 tests pass (up from 1087; 25 new tests added)
  • `uv run ruff check src tests`: clean
  • `uv run mypy src`: clean
  • Each fix developed with TDD (RED → GREEN → verify)
  • Full regression suite run after every commit
  • Manual smoke test: `chaosllm serve --preset=realistic` with `workers=4` starts successfully
  • Manual smoke test: `chaosweb serve --preset=realistic` with `workers=4` starts successfully

🤖 Generated with Claude Code

johnm-dta and others added 15 commits March 22, 2026 20:58
All presets and the ServerConfig default set workers=4, but the CLI
passed the app as a Python object to uvicorn.run(), which only supports
single-worker mode. Uvicorn requires an import string when workers > 1
so it can re-import the app in each forked child process.

When workers > 1, the config is now serialized to an environment
variable and an import string with factory=True is passed to uvicorn.
Each worker calls the factory to deserialize config and build its own
app instance. Single-worker mode continues to pass the app object
directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
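The factory pattern described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the env var name `_ERRORWORKS_LLM_CONFIG` comes from a later commit message, while `build_app` and the use of JSON for serialization are assumptions. The real code would hand uvicorn an import string with `factory=True`; here one worker boot is simulated directly.

```python
import json
import os

CONFIG_ENV = "_ERRORWORKS_LLM_CONFIG"  # env var name taken from the commit messages


def build_app(config: dict) -> dict:
    # Stand-in for the real ASGI app constructor.
    return {"workers": config["workers"], "preset": config["preset"]}


def create_app() -> dict:
    """Factory target for uvicorn: each forked worker re-imports the
    module and calls this, so config must travel via the environment."""
    config = json.loads(os.environ[CONFIG_ENV])
    return build_app(config)


def serve(config: dict) -> dict:
    """Sketch of the multi-worker CLI path: serialize config to the env,
    run the factory, and clean the env var up in finally (the missing
    cleanup was one of the CLI bugs fixed in this PR)."""
    os.environ[CONFIG_ENV] = json.dumps(config)
    try:
        # Real code would instead call something like:
        # uvicorn.run("pkg.app:create_app", factory=True, workers=config["workers"])
        return create_app()  # simulate one worker booting
    finally:
        os.environ.pop(CONFIG_ENV, None)
```

Single-worker mode can keep passing the app object directly, since no re-import happens.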
Both _classify_outcome (LLM) and _classify_web_outcome (Web) gated
connection_error on status_code being None. But the servers record some
connection errors with non-None status codes:
- timeout with 504 (LLM: 50% of timeouts, Web: always)
- incomplete_response with 200 (Web: sends headers then drops)

These were misclassified as server_error or fell through all categories,
silently corrupting timeseries data.

Fix: check error_type membership first (the authoritative injection
record), then fall back to status_code for HTTP-level errors. Also:
- Remove slow_response from web connection error set (it's a successful
  response with extra delay, correctly recorded as outcome="success")
- Match redirect_loop_terminated in the redirect category
- Add tests using actual server inputs instead of synthetic combinations

Resolves 7 audit bugs: errorworks-297db2ec7d, errorworks-653ef0e10c,
errorworks-12ca406f5e, errorworks-7b802d3807, errorworks-3a4450f89b,
errorworks-be94e4edd9, errorworks-2ce7661e21.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. show-config YAML output contained !!python/tuple tags because
   model_dump() preserves Python tuples. Fixed by using
   model_dump(mode="json") which converts tuples to lists, producing
   portable YAML parseable by yaml.safe_load and non-Python parsers.

2. show-config --format accepted arbitrary values (e.g. --format=xml)
   and silently fell through to YAML output. Now validates and exits
   with an error for unsupported formats.

3. Multi-worker env var (_ERRORWORKS_LLM_CONFIG / _ERRORWORKS_WEB_CONFIG)
   was never cleaned up after uvicorn.run() returned. Wrapped in
   try/finally to ensure cleanup.

Resolves 6 audit bugs: errorworks-32b301fff6, errorworks-2ab67e57c5,
errorworks-60629d33be, errorworks-d4d064025e, errorworks-9a5548de3a,
errorworks-284efc9456.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
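The tuple problem in bug 1 can be shown with a small stand-in for what `model_dump(mode="json")` does to containers: Python tuples become plain lists, so a YAML dumper emits ordinary sequences instead of `!!python/tuple` tags.

```python
def jsonify(value):
    """Minimal stand-in (assumption, not Pydantic's code) for the
    tuple handling of model_dump(mode="json"): recursively convert
    tuples to lists so yaml.dump emits portable sequences."""
    if isinstance(value, (tuple, list)):
        return [jsonify(v) for v in value]
    if isinstance(value, dict):
        return {k: jsonify(v) for k, v in value.items()}
    return value
```

After this conversion the output round-trips through `yaml.safe_load` and non-Python parsers.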
1. Add missing "param" field (always null) to all error responses.
   Real OpenAI errors always include {message, type, param, code} but
   the server only returned {type, message, code}.

2. Fix timeout 504 response body to use standard format. Previously
   used {type: "timeout", message: ...} with no code/param fields.
   Now matches all other error responses with type: "server_error",
   code: "timeout", and the param field.

3. Fix echo mode and token estimation to handle multi-modal message
   content. OpenAI messages can have content as a list of parts
   (text + image_url) instead of a plain string. Previously this
   dumped the raw list representation. Now extracts text parts
   properly via _extract_text_content().

Note: finish_reason hardcoded to "stop" (errorworks-3461e3c975) is
left as a tracked feature request — varying it requires config
changes and could break existing users.

Resolves: errorworks-f3107a555c, errorworks-1b57330c19,
errorworks-f7844b1e6e.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
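A possible shape for the multi-modal handling in fix 3, assuming the OpenAI content-part convention (`{"type": "text", "text": ...}` vs. `{"type": "image_url", ...}`); the real `_extract_text_content()` may join parts differently:

```python
def extract_text_content(content) -> str:
    """Sketch: OpenAI message content is either a plain string or a
    list of typed parts. Keep only the text parts instead of dumping
    the raw list representation."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        return " ".join(
            part.get("text", "")
            for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        )
    return ""
```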
1. Percentile off-by-one: int(n*p) returns the max value for p99 with
   n=100 (index 99 = 100th percentile). Fixed to use ceil(n*p)-1
   (nearest-rank method), so p99 of [1..100] correctly returns 99.

2. Trailing burst dropped: get_burst_events silently discarded any
   burst still active at end of timeseries data. Now appends it with
   still_active=True and end_bucket=None.

3. False recovery time: analyze_aimd_behavior calculated recovery time
   for unfinished bursts (using len(buckets)-1 as synthetic end),
   producing misleading avg_recovery_buckets. Now excludes unfinished
   bursts from recovery time calculation.

4. Incomplete anomaly detection: find_anomalies error clustering check
   only queried requests_rate_limited and requests_capacity_error,
   ignoring server_error, client_error, connection_error, and
   malformed. Now checks all 6 error columns.

Resolves: errorworks-abeacc785b, errorworks-25e3c0fd07,
errorworks-d2c087f550, errorworks-a814ca59eb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
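Fix 1 (nearest-rank percentile) in isolation:

```python
import math


def percentile(values, p: float):
    """Nearest-rank percentile as described in the commit: index
    ceil(n*p)-1 instead of int(n*p), so p99 of 100 values returns the
    99th value rather than the maximum."""
    data = sorted(values)
    idx = max(math.ceil(len(data) * p) - 1, 0)
    return data[idx]
```

With `[1..100]`, the old `int(n*p)` gave index 99 (the 100th value, i.e. the max) for p99; the fixed version gives index 98.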
ContentGenerator._get_preset_bank() used a bare check-then-act pattern
with no synchronization. Concurrent requests could load the JSONL file
multiple times and silently replace the PresetBank instance, losing
sequential index state. The LLM counterpart (ResponseGenerator) already
had the correct double-checked locking pattern.

Also updated InjectionEngine docstring to accurately describe why its
RNG is safe in the current ASGI architecture (single-threaded event
loop per worker, multi-worker forks processes) rather than incorrectly
claiming the config snapshot pattern solves thread safety.

Resolves: errorworks-8bebd5472f, errorworks-98431b8a79.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
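The double-checked locking pattern referenced here, sketched with a stand-in loader (the real `_load` reads a JSONL file and builds a `PresetBank`):

```python
import threading


class ContentGenerator:
    """Sketch of the pattern ported from the LLM ResponseGenerator."""

    def __init__(self):
        self._preset_bank = None
        self._lock = threading.Lock()

    def _get_preset_bank(self):
        if self._preset_bank is None:            # fast path, no lock
            with self._lock:
                if self._preset_bank is None:    # re-check under the lock
                    self._preset_bank = self._load_preset_bank()
        return self._preset_bank

    def _load_preset_bank(self):
        # Stand-in for the JSONL load; loading twice would silently
        # replace the instance and lose sequential index state.
        return object()
```

Without the lock and re-check, two concurrent first requests could each pass the `None` check and load the bank twice.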
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
get_stats() loaded all latency values into Python memory via fetchall()
to compute percentiles. For long-running chaos tests with hundreds of
thousands of requests, this caused O(N) memory spikes. Replaced with
SQL LIMIT 1 OFFSET queries that return a single value per percentile,
mirroring the pattern already used in update_bucket_latency().

export_data() loaded all requests and timeseries rows via fetchall()
with no bound. Added optional limit/offset parameters (default: all,
for backward compatibility) so callers can paginate large exports.
Propagated parameters through LLM and Web metrics recorder wrappers.

Resolves: errorworks-cee800e603, errorworks-c9f14f9cd9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hema)

1. deep_merge: override mutable values (lists, nested dicts) were
   aliased into the result dict without copying. Mutating the override
   after merge would silently corrupt the result. Now deepcopy'd.

2. validate_error_decision: success path only checked category=None
   but allowed status_code, delay_sec, retry_after_sec, malformed_type
   to be set on success decisions. Now validates all fields are None.

3. MetricsSchema: validated duplicate column names and index column
   references, but not duplicate index names. Two indexes with the
   same name would fail at SQLite CREATE INDEX time with an opaque
   error. Now caught at schema validation with a clear message.

Resolves: errorworks-ecc051b180, errorworks-a342ac07a2,
errorworks-9ec900b8f8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
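The aliasing bug in fix 1, sketched (signature assumed):

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Sketch of the fixed deep_merge: nested dicts merge recursively;
    every other override value is deep-copied, so mutating the override
    after the merge cannot corrupt the result."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)  # previously aliased
    return result
```

Before the fix, `result[key] = value` shared the list/dict object with the override, so a later `override[...].append(...)` silently changed the merged config.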
…emplates

1. Template override exceeding max_template_length now returns error
   content ("[template_override_error: exceeds max length ...]") instead
   of raising ValueError that propagated as an unhandled 500. This
   matches the existing pattern for Jinja syntax errors on the same
   code path — external data from request headers should never crash
   the server.

2. Config-sourced template body is now validated against
   max_template_length at ResponseGenerator construction time (fail
   fast). Previously only the header override path checked length,
   allowing arbitrarily large templates from config/presets.

Resolves: errorworks-f2e2b59860, errorworks-de0eb95ec6.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
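The two paths can be contrasted in a sketch. The limit value and function names here are illustrative assumptions; the substance from the commit is that header-sourced input degrades to error content while config-sourced input fails fast:

```python
MAX_TEMPLATE_LENGTH = 4096  # assumed value of max_template_length


def apply_template_override(template: str) -> str:
    """Header override path: external data must never crash the server,
    so an oversized template becomes error content, not a ValueError."""
    if len(template) > MAX_TEMPLATE_LENGTH:
        return (f"[template_override_error: exceeds max length "
                f"{MAX_TEMPLATE_LENGTH}]")
    return template  # the real code renders via Jinja here


def validate_config_template(template: str) -> str:
    """Config path: fail fast at ResponseGenerator construction time."""
    if len(template) > MAX_TEMPLATE_LENGTH:
        raise ValueError("template exceeds max_template_length")
    return template
```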
…ng stacklevel

Admin endpoints (stats, reset, export) included raw sqlite3.Error
messages in HTTP response bodies, potentially exposing table names,
column names, file paths, and query fragments. Now returns a generic
error message and logs the actual error server-side via structlog.

Also removed misleading stacklevel=2 from warnings.warn() calls in
both LLM and Web config validators. Inside Pydantic model_validators,
stacklevel=2 points to Pydantic internals rather than the user's code.
Suppressed B028 lint rule with explanation since no correct stacklevel
exists for Pydantic validator context.

Resolves: errorworks-8c144e8704, errorworks-0edc821730,
errorworks-874e295b0a.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
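The hardening pattern for the admin endpoints, sketched with stdlib `logging` standing in for structlog and a hypothetical response shape:

```python
import logging
import sqlite3

logger = logging.getLogger("admin")  # structlog in the real code


def stats_response(fetch_stats):
    """Sketch: log the real sqlite3 error server-side, return only a
    generic message to the client so table/column names and paths are
    never exposed in the HTTP body."""
    try:
        return {"status": 200, "body": fetch_stats()}
    except sqlite3.Error:
        logger.exception("stats query failed")
        return {"status": 500, "body": {"error": "internal metrics error"}}
```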
1. Unused logger in engine/admin.py — already resolved by prior commit
   (logger.exception calls added for sqlite error handling).

2. _select_weighted defensive return incorrectly marked pragma: no cover
   — removed pragma since the line IS reachable via floating-point
   accumulation edge cases. Updated comment to explain when.

3. Root __init__.py __all__ omitted llm_mcp subpackage — added it.
   Also updated __version__ to match 0.1.2.

4. validators.py and vocabulary.py low test coverage — added direct
   unit tests for get_vocabulary() (3 tests). Validator coverage was
   already improved by the earlier validation cluster commit.

Resolves: errorworks-a70817c373, errorworks-9ea9c814bb,
errorworks-8c7d697788, errorworks-a532ae71b1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
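Item 2 (the reachable defensive return in `_select_weighted`) can be illustrated with a generic weighted-selection sketch; the real signature is not shown in this PR:

```python
import random


def select_weighted(items, weights, rng=None):
    """Sketch of weighted selection. The final return is genuinely
    reachable: floating-point accumulation can leave acc slightly
    below sum(weights) while r lands in that gap, so it must not be
    excluded from coverage."""
    rng = rng or random.Random(0)
    r = rng.random() * sum(weights)
    acc = 0.0
    for item, weight in zip(items, weights):
        acc += weight
        if r < acc:
            return item
    return items[-1]  # defensive: FP rounding edge case, not dead code
```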
@johnm-dta johnm-dta merged commit 218c12f into main Mar 23, 2026
4 checks passed