fix: multi-worker startup + 23 bugs from full codebase audit#2
Merged
All presets and the ServerConfig default set workers=4, but the CLI
passed the app as a Python object to uvicorn.run(), which only supports
single-worker mode. Uvicorn requires an import string when workers > 1
so it can re-import the app in each forked child process.

When workers > 1, the config is now serialized to an environment
variable and an import string with factory=True is passed to uvicorn.
Each worker calls the factory to deserialize the config and build its
own app instance. Single-worker mode continues to pass the app object
directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
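A minimal sketch of the branching this commit describes. The env var name matches the one given later in this PR; `serve`, `app_factory`, the `"mypkg.cli:app_factory"` import string, and the dict "app" are hypothetical stand-ins for the real CLI wiring.

```python
import json
import os

CONFIG_ENV_VAR = "_ERRORWORKS_LLM_CONFIG"  # env var named in this PR


def serialize_config_to_env(config: dict) -> None:
    """Stash the config where forked workers can find it."""
    os.environ[CONFIG_ENV_VAR] = json.dumps(config)


def app_factory():
    """Each forked uvicorn worker calls this to rebuild its own app."""
    config = json.loads(os.environ[CONFIG_ENV_VAR])
    return {"app_for": config}  # stand-in for the real app builder


def serve(config: dict, app=None) -> None:
    import uvicorn  # imported lazily; hypothetical wiring

    if config.get("workers", 1) > 1:
        serialize_config_to_env(config)
        try:
            # An import string plus factory=True lets each child process
            # re-import and call app_factory; a live app object cannot
            # cross the fork boundary.
            uvicorn.run("mypkg.cli:app_factory", factory=True,
                        workers=config["workers"])
        finally:
            # Clean up so the env var does not leak into later runs.
            os.environ.pop(CONFIG_ENV_VAR, None)
    else:
        uvicorn.run(app)  # single worker: pass the object directly
```

The single-worker path is unchanged, so existing callers that pass an app object keep working.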
Both _classify_outcome (LLM) and _classify_web_outcome (Web) gated
connection_error on status_code being None. But the servers record some
connection errors with non-None status codes:

- timeout with 504 (LLM: 50% of timeouts, Web: always)
- incomplete_response with 200 (Web: sends headers then drops)

These were misclassified as server_error or fell through all
categories, silently corrupting timeseries data.

Fix: check error_type membership first (the authoritative injection
record), then fall back to status_code for HTTP-level errors.

Also:
- Remove slow_response from the web connection-error set (it is a
  successful response with extra delay, correctly recorded as
  outcome="success")
- Match redirect_loop_terminated in the redirect category
- Add tests using actual server inputs instead of synthetic
  combinations

Resolves 7 audit bugs: errorworks-297db2ec7d, errorworks-653ef0e10c,
errorworks-12ca406f5e, errorworks-7b802d3807, errorworks-3a4450f89b,
errorworks-be94e4edd9, errorworks-2ce7661e21.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. show-config YAML output contained !!python/tuple tags because
   model_dump() preserves Python tuples. Fixed by using
   model_dump(mode="json"), which converts tuples to lists, producing
   portable YAML parseable by yaml.safe_load and non-Python parsers.
2. show-config --format accepted arbitrary values (e.g. --format=xml)
   and silently fell through to YAML output. Now validates the value
   and exits with an error for unsupported formats.
3. The multi-worker env var (_ERRORWORKS_LLM_CONFIG /
   _ERRORWORKS_WEB_CONFIG) was never cleaned up after uvicorn.run()
   returned. Wrapped in try/finally to ensure cleanup.

Resolves 6 audit bugs: errorworks-32b301fff6, errorworks-2ab67e57c5,
errorworks-60629d33be, errorworks-d4d064025e, errorworks-9a5548de3a,
errorworks-284efc9456.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
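The tuple-to-list behavior behind fix 1 can be demonstrated with a minimal Pydantic v2 model; `ServerConfig` and the `latency_range` field here are illustrative, not the project's actual schema.

```python
from pydantic import BaseModel


class ServerConfig(BaseModel):
    # A tuple-typed field, as in the configs that produced
    # !!python/tuple tags in the YAML output.
    latency_range: tuple[float, float] = (0.1, 2.5)


cfg = ServerConfig()

# Default dump keeps the Python tuple, which PyYAML serializes as a
# non-portable !!python/tuple node:
assert isinstance(cfg.model_dump()["latency_range"], tuple)

# mode="json" coerces values to JSON-safe types, so YAML emits a plain
# list that yaml.safe_load and non-Python parsers can read:
assert cfg.model_dump(mode="json")["latency_range"] == [0.1, 2.5]
```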
1. Add missing "param" field (always null) to all error responses.
Real OpenAI errors always include {message, type, param, code} but
the server only returned {type, message, code}.
2. Fix timeout 504 response body to use standard format. Previously
used {type: "timeout", message: ...} with no code/param fields.
Now matches all other error responses with type: "server_error",
code: "timeout", and the param field.
3. Fix echo mode and token estimation to handle multi-modal message
content. OpenAI messages can have content as a list of parts
(text + image_url) instead of a plain string. Previously this
dumped the raw list representation. Now extracts text parts
properly via _extract_text_content().
Note: finish_reason hardcoded to "stop" (errorworks-3461e3c975) is
left as a tracked feature request — varying it requires config
changes and could break existing users.
Resolves: errorworks-f3107a555c, errorworks-1b57330c19,
errorworks-f7844b1e6e.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
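A hedged sketch of what a helper like the `_extract_text_content()` mentioned in fix 3 could look like; the exact signature and join behavior in the codebase may differ.

```python
def extract_text_content(content) -> str:
    """Pull the text out of OpenAI-style message content, which may be
    a plain string or a list of parts (text + image_url)."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        # Keep only the text parts; image_url and other part types
        # contribute nothing to echo output or token estimation.
        parts = [p.get("text", "") for p in content
                 if isinstance(p, dict) and p.get("type") == "text"]
        return " ".join(parts)
    return ""
```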
1. Percentile off-by-one: int(n*p) returns the max value for p99 with
   n=100 (index 99 = 100th percentile). Fixed to use ceil(n*p)-1 (the
   nearest-rank method), so p99 of [1..100] correctly returns 99.
2. Trailing burst dropped: get_burst_events silently discarded any
   burst still active at the end of the timeseries data. Now appends
   it with still_active=True and end_bucket=None.
3. False recovery time: analyze_aimd_behavior calculated recovery time
   for unfinished bursts (using len(buckets)-1 as a synthetic end),
   producing a misleading avg_recovery_buckets. Now excludes
   unfinished bursts from the recovery time calculation.
4. Incomplete anomaly detection: the find_anomalies error clustering
   check only queried requests_rate_limited and
   requests_capacity_error, ignoring server_error, client_error,
   connection_error, and malformed. Now checks all 6 error columns.

Resolves: errorworks-abeacc785b, errorworks-25e3c0fd07,
errorworks-d2c087f550, errorworks-a814ca59eb.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
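The nearest-rank fix in item 1 can be sketched as follows; function name and signature are illustrative.

```python
import math


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile: index ceil(n*p) - 1 on a sorted list.

    The old int(n*p) indexing returned index 99 for p99 with n=100,
    i.e. the maximum (the 100th percentile), not the 99th.
    """
    n = len(sorted_values)
    idx = max(math.ceil(n * p) - 1, 0)  # clamp for very small p
    return sorted_values[idx]
```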
ContentGenerator._get_preset_bank() used a bare check-then-act pattern
with no synchronization. Concurrent requests could load the JSONL file
multiple times and silently replace the PresetBank instance, losing
sequential index state. The LLM counterpart (ResponseGenerator)
already had the correct double-checked locking pattern.

Also updated the InjectionEngine docstring to accurately describe why
its RNG is safe in the current ASGI architecture (single-threaded
event loop per worker; multi-worker mode forks processes) rather than
incorrectly claiming the config snapshot pattern solves thread safety.

Resolves: errorworks-8bebd5472f, errorworks-98431b8a79.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
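The double-checked locking pattern ported from the LLM counterpart looks roughly like this; `_load_bank` is a stand-in for reading the JSONL preset file.

```python
import threading


class ContentGenerator:
    def __init__(self) -> None:
        self._bank = None
        self._lock = threading.Lock()

    def _load_bank(self):
        # Stand-in for parsing the JSONL preset file into a PresetBank.
        return object()

    def _get_preset_bank(self):
        # Fast path: once loaded, no lock is taken.
        if self._bank is None:
            with self._lock:
                # Re-check under the lock: another thread may have
                # loaded the bank between the outer check and here.
                if self._bank is None:
                    self._bank = self._load_bank()
        return self._bank
```

Without the locked re-check, two concurrent first requests could each load the file and one would silently replace the other's bank, losing sequential index state.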
get_stats() loaded all latency values into Python memory via
fetchall() to compute percentiles. For long-running chaos tests with
hundreds of thousands of requests, this caused O(N) memory spikes.
Replaced with SQL LIMIT 1 OFFSET queries that return a single value
per percentile, mirroring the pattern already used in
update_bucket_latency().

export_data() loaded all requests and timeseries rows via fetchall()
with no bound. Added optional limit/offset parameters (default: all,
for backward compatibility) so callers can paginate large exports.
Propagated the parameters through the LLM and Web metrics recorder
wrappers.

Resolves: errorworks-cee800e603, errorworks-c9f14f9cd9.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
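The LIMIT 1 OFFSET pattern can be sketched against SQLite as below; the `requests` table and `latency_ms` column are hypothetical names, not necessarily the project's schema.

```python
import math
import sqlite3


def latency_percentile(conn: sqlite3.Connection, p: float):
    """Compute a nearest-rank percentile entirely in SQL.

    Returns one row per query instead of fetchall()-ing every latency
    value into Python memory.
    """
    (n,) = conn.execute("SELECT COUNT(*) FROM requests").fetchone()
    if n == 0:
        return None
    offset = max(math.ceil(n * p) - 1, 0)  # nearest-rank index
    row = conn.execute(
        "SELECT latency_ms FROM requests ORDER BY latency_ms "
        "LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()
    return row[0]
```

SQLite sorts server-side and returns a single value, so peak Python memory stays O(1) regardless of how many requests the chaos test recorded.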
…hema)

1. deep_merge: override mutable values (lists, nested dicts) were
   aliased into the result dict without copying. Mutating the override
   after merge would silently corrupt the result. Now deepcopy'd.
2. validate_error_decision: the success path only checked
   category=None but allowed status_code, delay_sec, retry_after_sec,
   and malformed_type to be set on success decisions. Now validates
   that all fields are None.
3. MetricsSchema: validated duplicate column names and index column
   references, but not duplicate index names. Two indexes with the
   same name would fail at SQLite CREATE INDEX time with an opaque
   error. Now caught at schema validation with a clear message.

Resolves: errorworks-ecc051b180, errorworks-a342ac07a2,
errorworks-9ec900b8f8.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
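A minimal sketch of the deep_merge fix from item 1; the function shape is illustrative, but it shows why deepcopy is needed for aliased override values.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base.

    Override values are deep-copied so that mutating the override dict
    after the merge cannot silently corrupt the merged result (the bug
    was aliasing lists and nested dicts into the result).
    """
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result
```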
…emplates
1. Template override exceeding max_template_length now returns error
content ("[template_override_error: exceeds max length ...]") instead
of raising ValueError that propagated as an unhandled 500. This
matches the existing pattern for Jinja syntax errors on the same
code path — external data from request headers should never crash
the server.
2. Config-sourced template body is now validated against
max_template_length at ResponseGenerator construction time (fail
fast). Previously only the header override path checked length,
allowing arbitrarily large templates from config/presets.
Resolves: errorworks-f2e2b59860, errorworks-de0eb95ec6.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
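The two validation points above can be sketched as follows. The error-content format and the max_template_length name follow the commit text; the class wiring and the pass-through "rendering" are hypothetical stand-ins for the real Jinja path.

```python
class ResponseGenerator:
    def __init__(self, template: str, max_template_length: int = 10_000):
        # Fail fast: config-sourced templates are validated once, at
        # construction time, instead of only on the header path.
        if len(template) > max_template_length:
            raise ValueError(
                f"config template exceeds max length {max_template_length}"
            )
        self.template = template
        self.max_template_length = max_template_length

    def render_override(self, override: str) -> str:
        # Header-sourced overrides are external input: never raise.
        # Return error content in-band so the request still gets a
        # response instead of an unhandled 500.
        if len(override) > self.max_template_length:
            return (
                "[template_override_error: exceeds max length "
                f"{self.max_template_length}]"
            )
        return override  # stand-in for actual Jinja rendering
```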
…ng stacklevel

Admin endpoints (stats, reset, export) included raw sqlite3.Error
messages in HTTP response bodies, potentially exposing table names,
column names, file paths, and query fragments. Now returns a generic
error message and logs the actual error server-side via structlog.

Also removed misleading stacklevel=2 from warnings.warn() calls in
both LLM and Web config validators. Inside Pydantic model_validators,
stacklevel=2 points to Pydantic internals rather than the user's code.
Suppressed the B028 lint rule with an explanation, since no correct
stacklevel exists for the Pydantic validator context.

Resolves: errorworks-8c144e8704, errorworks-0edc821730,
errorworks-874e295b0a.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
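The redaction pattern can be sketched with stdlib logging standing in for structlog; the endpoint shape, table name, and messages are hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("errorworks.admin")  # stand-in for structlog


def get_stats(conn: sqlite3.Connection) -> dict:
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM requests").fetchone()
        return {"requests": count}
    except sqlite3.Error:
        # Log full detail (including traceback) server-side; return
        # only a generic message so table/column names, file paths, and
        # query fragments never reach the HTTP response body.
        logger.exception("stats query failed")
        return {"error": "internal metrics error"}
```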
1. Unused logger in engine/admin.py — already resolved by a prior
   commit (logger.exception calls added for sqlite error handling).
2. _select_weighted defensive return incorrectly marked pragma: no
   cover — removed the pragma since the line IS reachable via
   floating-point accumulation edge cases. Updated the comment to
   explain when.
3. Root __init__.py __all__ omitted the llm_mcp subpackage — added it.
   Also updated __version__ to match 0.1.2.
4. validators.py and vocabulary.py low test coverage — added direct
   unit tests for get_vocabulary() (3 tests). Validator coverage was
   already improved by the earlier validation-cluster commit.

Resolves: errorworks-a70817c373, errorworks-9ea9c814bb,
errorworks-8c7d697788, errorworks-a532ae71b1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
All presets default to workers=4, but the CLI passed the app as a
Python object, which only supports single-worker mode. Implemented a
factory pattern with env var serialization.

Bug clusters resolved
- _classify_outcome gated connection_error on status_code is None, but
  the servers record some connection errors with non-None status codes
  (timeout→504, incomplete→200). Fixed to check error_type first.
- llm/cli.py and web/cli.py: !!python/tuple in YAML output, --format
  accepts invalid values, env var leak after uvicorn.run().
- Percentile off-by-one (int(n*p) → ceil(n*p)-1), trailing burst
  silently dropped, false recovery time for unfinished bursts, anomaly
  detection only checked 2/7 error columns.
- Missing param field in error responses, inconsistent timeout 504
  body format, echo mode crashes on multi-modal content (list instead
  of string).
- ContentGenerator._get_preset_bank() had no locking (ported
  double-checked locking from LLM counterpart); corrected
  InjectionEngine docstring on RNG safety.
- The CLI now passes uvicorn an import string when workers > 1.

Audit also filed 18 additional bugs not resolved in this PR
Tracked in filigree with audit/* labels. Key remaining items: memory
scalability (unbounded fetchall), validation gaps, finish_reason
variation (feature), warning stacklevel (cosmetic).

Test plan
- uv run ruff check src tests — clean
- uv run mypy src — clean
- chaosllm serve --preset=realistic with workers=4 starts successfully
- chaosweb serve --preset=realistic with workers=4 starts successfully

🤖 Generated with Claude Code