fix: multi-worker startup + 23 bugs from full codebase audit#2
Merged
All presets and the ServerConfig default set workers=4, but the CLI
passed the app as a Python object to uvicorn.run(), which only supports
single-worker mode. Uvicorn requires an import string when workers > 1
so it can re-import the app in each forked child process.

When workers > 1, the config is now serialized to an environment
variable and an import string with factory=True is passed to uvicorn.
Each worker calls the factory to deserialize the config and build its
own app instance. Single-worker mode continues to pass the app object
directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
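A minimal sketch of the branching this commit describes. The env var name matches the one given later in this PR; `serve`, `app_factory`, the `"mypkg.cli:app_factory"` import string, and the dict "app" are hypothetical stand-ins for the real CLI wiring.

```python
import json
import os

CONFIG_ENV_VAR = "_ERRORWORKS_LLM_CONFIG"  # env var named in this PR


def serialize_config_to_env(config: dict) -> None:
    """Stash the config where forked workers can find it."""
    os.environ[CONFIG_ENV_VAR] = json.dumps(config)


def app_factory():
    """Each forked uvicorn worker calls this to rebuild its own app."""
    config = json.loads(os.environ[CONFIG_ENV_VAR])
    return {"app_for": config}  # stand-in for the real app builder


def serve(config: dict, app=None) -> None:
    import uvicorn  # imported lazily; hypothetical wiring

    if config.get("workers", 1) > 1:
        serialize_config_to_env(config)
        try:
            # An import string plus factory=True lets each child process
            # re-import and call app_factory; a live app object cannot
            # cross the fork boundary.
            uvicorn.run("mypkg.cli:app_factory", factory=True,
                        workers=config["workers"])
        finally:
            # Clean up so the env var does not leak into later runs.
            os.environ.pop(CONFIG_ENV_VAR, None)
    else:
        uvicorn.run(app)  # single worker: pass the object directly
```

The single-worker path is unchanged, so existing callers that pass an app object keep working.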
Both _classify_outcome (LLM) and _classify_web_outcome (Web) gated
connection_error on status_code being None. But the servers record some
connection errors with non-None status codes:

- timeout with 504 (LLM: 50% of timeouts, Web: always)
- incomplete_response with 200 (Web: sends headers then drops)

These were misclassified as server_error or fell through all
categories, silently corrupting timeseries data.

Fix: check error_type membership first (the authoritative injection
record), then fall back to status_code for HTTP-level errors.

Also:
- Remove slow_response from the web connection-error set (it is a
  successful response with extra delay, correctly recorded as
  outcome="success")
- Match redirect_loop_terminated in the redirect category
- Add tests using actual server inputs instead of synthetic
  combinations

Resolves 7 audit bugs: errorworks-297db2ec7d, errorworks-653ef0e10c,
errorworks-12ca406f5e, errorworks-7b802d3807, errorworks-3a4450f89b,
errorworks-be94e4edd9, errorworks-2ce7661e21.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. show-config YAML output contained !!python/tuple tags because
   model_dump() preserves Python tuples. Fixed by using
   model_dump(mode="json"), which converts tuples to lists, producing
   portable YAML parseable by yaml.safe_load and non-Python parsers.
2. show-config --format accepted arbitrary values (e.g. --format=xml)
   and silently fell through to YAML output. Now validates the value
   and exits with an error for unsupported formats.
3. The multi-worker env var (_ERRORWORKS_LLM_CONFIG /
   _ERRORWORKS_WEB_CONFIG) was never cleaned up after uvicorn.run()
   returned. Wrapped in try/finally to ensure cleanup.

Resolves 6 audit bugs: errorworks-32b301fff6, errorworks-2ab67e57c5,
errorworks-60629d33be, errorworks-d4d064025e, errorworks-9a5548de3a,
errorworks-284efc9456.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
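The tuple-to-list behavior behind fix 1 can be demonstrated with a minimal Pydantic v2 model; `ServerConfig` and the `latency_range` field here are illustrative, not the project's actual schema.

```python
from pydantic import BaseModel


class ServerConfig(BaseModel):
    # A tuple-typed field, as in the configs that produced
    # !!python/tuple tags in the YAML output.
    latency_range: tuple[float, float] = (0.1, 2.5)


cfg = ServerConfig()

# Default dump keeps the Python tuple, which PyYAML serializes as a
# non-portable !!python/tuple node:
assert isinstance(cfg.model_dump()["latency_range"], tuple)

# mode="json" coerces values to JSON-safe types, so YAML emits a plain
# list that yaml.safe_load and non-Python parsers can read:
assert cfg.model_dump(mode="json")["latency_range"] == [0.1, 2.5]
```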
1. Add missing "param" field (always null) to all error responses.
Real OpenAI errors always include {message, type, param, code} but
the server only returned {type, message, code}.
2. Fix timeout 504 response body to use standard format. Previously
used {type: "timeout", message: ...} with no code/param fields.
Now matches all other error responses with type: "server_error",
code: "timeout", and the param field.
3. Fix echo mode and token estimation to handle multi-modal message
content. OpenAI messages can have content as a list of parts
(text + image_url) instead of a plain string. Previously this
dumped the raw list representation. Now extracts text parts
properly via _extract_text_content().
Note: finish_reason hardcoded to "stop" (errorworks-3461e3c975) is
left as a tracked feature request — varying it requires config
changes and could break existing users.
Resolves: errorworks-f3107a555c, errorworks-1b57330c19,
errorworks-f7844b1e6e.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
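A hedged sketch of what a helper like the `_extract_text_content()` mentioned in fix 3 could look like; the exact signature and join behavior in the codebase may differ.

```python
def extract_text_content(content) -> str:
    """Pull the text out of OpenAI-style message content, which may be
    a plain string or a list of parts (text + image_url)."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Keep only the text parts; image_url and other part types
        # contribute nothing to echo output or token estimation.
        parts = [p.get("text", "") for p in content
                 if isinstance(p, dict) and p.get("type") == "text"]
        return " ".join(parts)
    return ""
```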
1. Percentile off-by-one: int(n*p) returns the max value for p99 with
   n=100 (index 99 = 100th percentile). Fixed to use ceil(n*p)-1 (the
   nearest-rank method), so p99 of [1..100] correctly returns 99.
2. Trailing burst dropped: get_burst_events silently discarded any
   burst still active at the end of the timeseries data. Now appends
   it with still_active=True and end_bucket=None.
3. False recovery time: analyze_aimd_behavior calculated recovery time
   for unfinished bursts (using len(buckets)-1 as a synthetic end),
   producing a misleading avg_recovery_buckets. Now excludes
   unfinished bursts from the recovery time calculation.
4. Incomplete anomaly detection: the find_anomalies error clustering
   check only queried requests_rate_limited and
   requests_capacity_error, ignoring server_error, client_error,
   connection_error, and malformed. Now checks all 6 error columns.

Resolves: errorworks-abeacc785b, errorworks-25e3c0fd07,
errorworks-d2c087f550, errorworks-a814ca59eb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
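The nearest-rank fix in item 1 can be sketched as follows; function name and signature are illustrative.

```python
import math


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile: index ceil(n*p) - 1 on a sorted list.

    The old int(n*p) indexing returned index 99 for p99 with n=100,
    i.e. the maximum (the 100th percentile), not the 99th.
    """
    n = len(sorted_values)
    idx = max(math.ceil(n * p) - 1, 0)  # clamp for very small p
    return sorted_values[idx]
```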
ContentGenerator._get_preset_bank() used a bare check-then-act pattern
with no synchronization. Concurrent requests could load the JSONL file
multiple times and silently replace the PresetBank instance, losing
sequential index state. The LLM counterpart (ResponseGenerator)
already had the correct double-checked locking pattern.

Also updated the InjectionEngine docstring to accurately describe why
its RNG is safe in the current ASGI architecture (single-threaded
event loop per worker; multi-worker mode forks processes) rather than
incorrectly claiming the config snapshot pattern solves thread safety.

Resolves: errorworks-8bebd5472f, errorworks-98431b8a79.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
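The double-checked locking pattern ported from the LLM counterpart looks roughly like this; `_load_bank` is a stand-in for reading the JSONL preset file.

```python
import threading


class ContentGenerator:
    def __init__(self) -> None:
        self._bank = None
        self._lock = threading.Lock()

    def _load_bank(self):
        # Stand-in for parsing the JSONL preset file into a PresetBank.
        return object()

    def _get_preset_bank(self):
        # Fast path: once loaded, no lock is taken.
        if self._bank is None:
            with self._lock:
                # Re-check under the lock: another thread may have
                # loaded the bank between the outer check and here.
                if self._bank is None:
                    self._bank = self._load_bank()
        return self._bank
```

Without the locked re-check, two concurrent first requests could each load the file and one would silently replace the other's bank, losing sequential index state.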
get_stats() loaded all latency values into Python memory via
fetchall() to compute percentiles. For long-running chaos tests with
hundreds of thousands of requests, this caused O(N) memory spikes.
Replaced with SQL LIMIT 1 OFFSET queries that return a single value
per percentile, mirroring the pattern already used in
update_bucket_latency().

export_data() loaded all requests and timeseries rows via fetchall()
with no bound. Added optional limit/offset parameters (default: all,
for backward compatibility) so callers can paginate large exports.
Propagated the parameters through the LLM and Web metrics recorder
wrappers.

Resolves: errorworks-cee800e603, errorworks-c9f14f9cd9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
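The LIMIT 1 OFFSET pattern can be sketched against SQLite as below; the `requests` table and `latency_ms` column are hypothetical names, not necessarily the project's schema.

```python
import math
import sqlite3


def latency_percentile(conn: sqlite3.Connection, p: float):
    """Compute a nearest-rank percentile entirely in SQL.

    Returns one row per query instead of fetchall()-ing every latency
    value into Python memory.
    """
    (n,) = conn.execute("SELECT COUNT(*) FROM requests").fetchone()
    if n == 0:
        return None
    offset = max(math.ceil(n * p) - 1, 0)  # nearest-rank index
    row = conn.execute(
        "SELECT latency_ms FROM requests ORDER BY latency_ms "
        "LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()
    return row[0]
```

SQLite sorts server-side and returns a single value, so peak Python memory stays O(1) regardless of how many requests the chaos test recorded.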
…hema)

1. deep_merge: override mutable values (lists, nested dicts) were
   aliased into the result dict without copying. Mutating the override
   after merge would silently corrupt the result. Now deepcopy'd.
2. validate_error_decision: the success path only checked
   category=None but allowed status_code, delay_sec, retry_after_sec,
   and malformed_type to be set on success decisions. Now validates
   that all fields are None.
3. MetricsSchema: validated duplicate column names and index column
   references, but not duplicate index names. Two indexes with the
   same name would fail at SQLite CREATE INDEX time with an opaque
   error. Now caught at schema validation with a clear message.

Resolves: errorworks-ecc051b180, errorworks-a342ac07a2,
errorworks-9ec900b8f8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
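A minimal sketch of the deep_merge fix from item 1; the function shape is illustrative, but it shows why deepcopy is needed for aliased override values.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base.

    Override values are deep-copied so that mutating the override dict
    after the merge cannot silently corrupt the merged result (the bug
    was aliasing lists and nested dicts into the result).
    """
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result
```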
…emplates
1. Template override exceeding max_template_length now returns error
content ("[template_override_error: exceeds max length ...]") instead
of raising ValueError that propagated as an unhandled 500. This
matches the existing pattern for Jinja syntax errors on the same
code path — external data from request headers should never crash
the server.
2. Config-sourced template body is now validated against
max_template_length at ResponseGenerator construction time (fail
fast). Previously only the header override path checked length,
allowing arbitrarily large templates from config/presets.
Resolves: errorworks-f2e2b59860, errorworks-de0eb95ec6.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
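The two validation points above can be sketched as follows. The error-content format and the max_template_length name follow the commit text; the class wiring and the pass-through "rendering" are hypothetical stand-ins for the real Jinja path.

```python
class ResponseGenerator:
    def __init__(self, template: str, max_template_length: int = 10_000):
        # Fail fast: config-sourced templates are validated once, at
        # construction time, instead of only on the header path.
        if len(template) > max_template_length:
            raise ValueError(
                f"config template exceeds max length {max_template_length}"
            )
        self.template = template
        self.max_template_length = max_template_length

    def render_override(self, override: str) -> str:
        # Header-sourced overrides are external input: never raise.
        # Return error content in-band so the request still gets a
        # response instead of an unhandled 500.
        if len(override) > self.max_template_length:
            return (
                "[template_override_error: exceeds max length "
                f"{self.max_template_length}]"
            )
        return override  # stand-in for actual Jinja rendering
```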
…ng stacklevel

Admin endpoints (stats, reset, export) included raw sqlite3.Error
messages in HTTP response bodies, potentially exposing table names,
column names, file paths, and query fragments. Now returns a generic
error message and logs the actual error server-side via structlog.

Also removed misleading stacklevel=2 from warnings.warn() calls in
both LLM and Web config validators. Inside Pydantic model_validators,
stacklevel=2 points to Pydantic internals rather than the user's code.
Suppressed the B028 lint rule with an explanation, since no correct
stacklevel exists for the Pydantic validator context.

Resolves: errorworks-8c144e8704, errorworks-0edc821730,
errorworks-874e295b0a.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
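The redaction pattern can be sketched with stdlib logging standing in for structlog; the endpoint shape, table name, and messages are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("errorworks.admin")  # stand-in for structlog


def get_stats(conn: sqlite3.Connection) -> dict:
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM requests").fetchone()
        return {"requests": count}
    except sqlite3.Error:
        # Log full detail (including traceback) server-side; return
        # only a generic message so table/column names, file paths, and
        # query fragments never reach the HTTP response body.
        logger.exception("stats query failed")
        return {"error": "internal metrics error"}
```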
1. Unused logger in engine/admin.py — already resolved by a prior
   commit (logger.exception calls added for sqlite error handling).
2. _select_weighted defensive return incorrectly marked pragma: no
   cover — removed the pragma since the line IS reachable via
   floating-point accumulation edge cases. Updated the comment to
   explain when.
3. Root __init__.py __all__ omitted the llm_mcp subpackage — added it.
   Also updated __version__ to match 0.1.2.
4. validators.py and vocabulary.py low test coverage — added direct
   unit tests for get_vocabulary() (3 tests). Validator coverage was
   already improved by the earlier validation-cluster commit.

Resolves: errorworks-a70817c373, errorworks-9ea9c814bb,
errorworks-8c7d697788, errorworks-a532ae71b1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
All presets default to workers=4, but the CLI passed the app as a
Python object, which only supports single-worker mode. Implemented a
factory pattern with env var serialization.

Bug clusters resolved
- _classify_outcome gated connection_error on status_code is None, but
  the servers record some connection errors with non-None status codes
  (timeout→504, incomplete→200). Fixed to check error_type first.
- llm/cli.py and web/cli.py: !!python/tuple in YAML output, --format
  accepts invalid values, env var leak after uvicorn.run().
- Percentile off-by-one (int(n*p) → ceil(n*p)-1), trailing burst
  silently dropped, false recovery time for unfinished bursts, anomaly
  detection only checked 2/7 error columns.
- Missing param field in error responses, inconsistent timeout 504
  body format, echo mode crashes on multi-modal content (list instead
  of string).
- ContentGenerator._get_preset_bank() had no locking (ported
  double-checked locking from LLM counterpart); corrected
  InjectionEngine docstring on RNG safety.
- The CLI now passes uvicorn an import string when workers > 1.

Audit also filed 18 additional bugs not resolved in this PR
Tracked in filigree with audit/* labels. Key remaining items: memory
scalability (unbounded fetchall), validation gaps, finish_reason
variation (feature), warning stacklevel (cosmetic).

Test plan
- uv run ruff check src tests — clean
- uv run mypy src — clean
- chaosllm serve --preset=realistic with workers=4 starts successfully
- chaosweb serve --preset=realistic with workers=4 starts successfully

🤖 Generated with Claude Code