Cycle-2 audit: safety + honesty reconciliation (W0 + W1 partial, 19 commits)#10
Cycle-2 audit: safety + honesty reconciliation (W0 + W1 partial, 19 commits)#10ChunkyTortoise wants to merge 25 commits into
Conversation
Wave 0 safety pre-work (audit cycle 2): - .maintenance/ (local scan reports/triage) gitignored — was untracked AND unignored; a `git add -A` would have committed a 14k secret-scan report - .gitleaks.toml: extends default ruleset, allowlists the test-fixture and adversarial-corpus paths (eval/gate test secrets, injection payloads) - SECURITY.md: documents the 16-match scan triage — all intentional fixtures, zero live credentials, no rotation/history-rewrite needed gitleaks: 16 -> 0 (no leaks found) with the scoped allowlist. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…026.json Wave 1.2 (audit cycle 2 — honesty reconciliation). multi_llm/pricing.py held a hardcoded table that contradicted the canonical pricing/2026.json on every shared model (gpt-5.5 5/30 vs 2/8, gpt-5.5-pro 30/180 vs 12/48, grok 2/6 vs 3/15). Every emitted cost_usd was wrong, and the test suite enshrined the wrong numbers. - multi_llm/pricing.py now derives PRICING from the canonical loaded table (mcp_toolkit.framework.costing) — one in-memory source shared with CostTracker; estimate_cost() signature unchanged (no call-site churn). - test_pricing.py corrected to canonical values + a delegation-contract assertion (no duplicated price list). Verified: full suite 598 passed / 2 skipped; no hardcoded price table remains. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… link Wave 1.3 (audit cycle 2 — honesty reconciliation). The TL;DR claimed $0.0018/call -> $1.80/day -> $54/month, internally self-consistent but ~4.9x the doc's own canonical cost table ($0.000368/call, $0.21/1000 @ 42% hit), the span JSON (0.0003675), and the scale table. A reviewer doing the one multiplication the doc invites catches it in a minute. - TL;DR realigned to the canonical table: $0.000368/call (cache-miss), $0.21/day, ~$6/month @ 42% hit; qualifiers added so per-call vs per-1000 no longer appear to contradict. - Fixed a broken in-repo link: framework/rate_limit.py -> rate_limiter.py (actual filename) — a reviewer clicking it hit a 404. Left untouched: the live-Jaeger link (Wave 2 deploys it for real). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.6 (audit cycle 2 — honesty reconciliation). ADR-0005 documented a RateLimiter that does not exist: a `RateLimiter(backend=RedisSlidingWindow(...))` constructor and a caller_id/client_id/user_id resolution-with-WARNING path — neither is in the code (verified: zero grep hits; RateLimiter takes only default_config and a generic `key`). Rewritten to describe the real in-process generic-key limiter, its honest known limitations (no cross-process coord, caller-supplied key, unsynchronized check()), with Redis backend + caller-ID resolution moved to an explicit "Future work (NOT implemented)" section. ADR-0003 said circuit-breaker state lived in providers.py as _consecutive_failures/_circuit_open; it is actually a CircuitBreaker class in models.py with _failures/_open_until. Corrected. Acceptance: no ADR cites a missing path; every symbol ADRs now claim exists in the cited file; the only RedisSlidingWindow mention is the "not implemented" deferral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…drop oversell Wave 1.11 (audit cycle 2 — honesty reconciliation). The README markets rate limiting and caching as differentiators, but: - RateLimiter.check() did an unsynchronized read-modify-write on the timestamp list; concurrent awaits on one key could both pass the limit check. Guarded with an asyncio.Lock (the resolution ADR-0005 already documents). - cached_tool had a cache-stampede gap: N concurrent misses all executed the expensive call. Added per-key single-flight (double-checked under a lazily created per-cache-key lock). - pyproject.toml description claimed "Production-ready" (prior-audit P1, an unbacked claim on the PyPI page) — dropped. Added concurrency regression tests proving both fixes (100 concurrent checks admit exactly the limit; 20 concurrent misses execute the tool once). Verified: full suite green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.10 (audit cycle 2 — honesty reconciliation). handle_task() and stream_task() catch a failing tool, return a structured `failed` A2ATaskStatus, and ALSO called logger.exception() — dumping a full traceback to stderr for an expected, correctly-handled outcome. In CI logs (and the eval gate's unknown_tool case) this reads as an uncaught crash even though run_evals.py exits 0 and 26/26 pass. Both sites now logger.warning() (matching the existing pattern at the push- notification handler); the error detail is still preserved in status.message=str(exc). No behavior change — status/message identical; no test asserts log level. Verified: run_evals.py exit 0, 26/26, stderr now a clean one-line warning; full suite 600 passed / 2 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.1 (audit cycle 2 — honesty reconciliation). Verified firsthand: PyPI has exactly one release, 0.1.0. The repo claimed 0.3.0 everywhere — `pip install mcp-server-toolkit==0.3.0` (in .claude/ CLAUDE.md and a spec doc) 404s, and the dynamic PyPI badge renders "0.1.0" beside prose asserting 0.3.0. - pyproject.toml + mcp_toolkit/__init__.py: 0.3.0 -> 0.3.0.dev0. (PEP 440 dev marker: honest — the code IS past 0.1.0 — while `pip install ==0.3.0` still correctly fails, preserving the immutable-publish guard. The gated publish task 1.12 flips this to 0.3.0 at release. Chosen over literal 0.1.0 to avoid a worse dishonesty: shipping 0.3.0-level code stamped as the published 0.1.0.) - .claude/CLAUDE.md, docs/specs/codex-roi-evaluation-prompt.md: removed the false `==0.3.0` install claims; marked 0.3.0 unreleased. - SECURITY.md: Supported Versions table fixed (claimed 0.2.x supported / 0.1.x not — i.e. nothing published was supported). Now 0.1.x published, 0.3.x in dev. - README: added a release-status note so the dynamic PyPI badge (0.1.0) no longer silently contradicts the documented 0.3.0 feature set. Verified: package builds + imports as 0.3.0.dev0; full suite 600 passed / 2 skipped; no installable ==0.3.0 claim remains. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14f1a6aa2e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| lock = asyncio.Lock() | ||
| self._inflight_locks[key] = lock | ||
| return lock |
There was a problem hiding this comment.
Remove stale single-flight locks after cache fill
The new single-flight cache path stores a lock per computed cache_key in _inflight_locks, but those entries are never evicted. In workloads with high-cardinality keys (for example, user- or query-specific arguments), this dictionary grows for the lifetime of the process even after cache entries expire, causing unbounded memory growth over time. Please add lifecycle cleanup for per-key locks (or use a weak/evicting structure) once a key is no longer in flight.
Useful? React with 👍 / 👎.
| async with self._lock: | ||
| now = time.monotonic() |
There was a problem hiding this comment.
Avoid global lock contention across independent rate keys
Making check() atomic is good, but guarding every key with one shared asyncio.Lock serializes all rate-limit checks, including unrelated callers and tools. Under concurrent traffic this creates head-of-line blocking and can significantly reduce throughput/raise latency compared to per-key locking, even when keys are independent. A per-key lock strategy would preserve correctness without turning the limiter into a single global bottleneck.
Useful? React with 👍 / 👎.
…ing/db as mock-default - email/server.py main(): reads SMTP_HOST/USER/PASS/PORT/IMAP_HOST; wires SMTPEmailClient when set, falls through to MockEmailClient otherwise - calendar/server.py main(): reads GOOGLE_CALENDAR_CREDENTIALS + CALENDAR_ID; wires GoogleCalendarProvider when set (skips gracefully if [gcal] not installed), falls through to MockCalendarProvider otherwise - README: top-line "9 ready-to-run" → "9 pre-built; 6 zero-config, 3 wire from env"; table rows for all 9 servers updated to match actual main() behaviour; email/calendar detail sections document env vars; crm_ghl documents no bundled GHL HTTP client; web_scraping documents extract_data needs a wired LLM; database_query documents DefaultLLMProvider is a reference stub and sqlglot validation is the real asset - tests: main()-wiring tests for email (2) and calendar (2); DefaultLLMProvider stub behaviour tests for database_query (2) — 5 net new passing, 1 new skip ([gcal] extra)
…es not exist PyJWT 2.12.1 publishes extras: crypto, dev, docs, tests. The extra named 'cryptography' does not exist; uv warns on every install. Update pyproject.toml and README to use the correct pyjwt[crypto] form.
…ocument loud SMTP_PORT failure
607 passed / 2 skipped (was 600), 84% coverage (was 82.87%), bench P50 miss 0.024ms / speedup 3.1x (was 0.022ms / 2.9x). gh repo card shows stale 598/88% — edit command prepared but not run.
bench_cache.py:79 printed "499 of 1020" by dividing a hit-loop numerator by a warmup+hit+miss denominator, understating the cache ~2x. Replace with split counters: 519 of 520 cacheable calls served from cache (99.8%); hit loop made 0 of 500 backend calls. RESULTS.md L15 now matches the script output verbatim; added a run-to-run variance note.
README:481 linked tests/test_benchmarks.py::test_cache_hit_latency_p95 — that path 404s on GitHub and no such function exists. Point to the real assertion: tests/gates/test_gate_scale.py TestCacheScaleGate::test_cache_hit_p95_under_1ms (asserts hit p95 < 1ms, which bounds the P50 0.008 ms claim).
RESULTS.md:15 claimed "verbatim" but dropped the "Cache effectiveness:" prefix — change to "from bench_cache.py stdout" (numbers already match; only the word was wrong). README cache latency/speedup Method cells now disclose "median of 3 runs" to match RESULTS.md so the primary surface is self-consistent. bench_cache.py: comment the cacheable-surface scope (miss loop excluded as intrinsically uncacheable) and drop the unused backend_calls_total return key.
Wave 1.5 (audit cycle 2 — honesty reconciliation). Closes the audit's
harshest finding.
`requires_scope` read the credential from `kwargs.get("token") or
kwargs.get("api_key") or kwargs.get("authorization")` — i.e. from a public
tool-schema parameter. A `token: str = ""` param surfaces in list_tools()
inputSchema, so the "credential" was caller-supplied public input. It was
also wired into zero servers (dead code).
The MCP SDK (1.26.0, bundled FastMCP) already ships the correct mechanism;
hand-rolling a contextvar/middleware would duplicate it. So:
- auth.py: add `JWTTokenVerifier(TokenVerifier)` adapting the existing
JWTAuth/APIKeyAuth `.authenticate()` contract to the SDK's
`verify_token() -> AccessToken | None`. Rewrite `requires_scope` to SHRINK:
it now reads the SDK-verified token via `get_access_token()` (the per-request
contextvar set by the SDK's AuthContextMiddleware), returns Unauthorized if
absent and Forbidden if the required scope is missing. All
kwargs.get("token"/"api_key"/"authorization") paths removed — no fallback
(a fallback IS the audit's failure mode). Signature is now
`requires_scope(scope)`; credential is never a tool argument.
- base_server.py: `auth_tool` follows the new signature/docstring.
`EnhancedMCP.__init__` already forwards **kwargs to FastMCP, so
token_verifier/auth pass through with no change (verified at runtime).
- database_query + crm_ghl servers: wired with
`JWTTokenVerifier(JWTAuth(secret=MCP_JWT_SECRET))` + `AuthSettings`;
per-tool `@requires_scope` (db:read; crm:read/crm:write). Default
transport stays stdio (no `python -m ...server` break); opt-in
streamable-HTTP via `MCP_HTTP_PORT`, which refuses to start without an
explicit `MCP_JWT_SECRET` so auth can never silently no-op. Under stdio
(no Authorization channel) every decorated tool hard-rejects by design —
see ADR-0008.
Tests (no mocking of the SDK auth path — it is exercised live):
- test_auth_wiring.py: the gating test (no auth-wired tool's inputSchema
exposes token/api_key/authorization) + live streamable_http_app TestClient
proving no header -> 401 (+WWW-Authenticate), wrong scope -> 403, valid
bearer -> 200 and the tool body runs.
- TestRequiresScope / gate tests rewritten to the contextvar contract via a
shared `grant_scopes` helper (real SDK AuthenticatedUser/AccessToken). The
old tests asserted the vulnerable kwarg path; keeping them green would have
kept the bug. Added regression guards that the gate still fires when the
auth context is absent.
Verified: full suite 620 passed / 2 skipped (was 607/2); ruff clean;
gating test + 401/403/200 green in isolation.
Wave 1.5 (audit cycle 2 — honesty reconciliation). Documentation half of the auth fix. - README Authentication section rewritten to the wired model: the JWTTokenVerifier + AuthSettings wiring, credential-never-a-tool-arg, the stdio-is-unauthed-by-design caveat, the MCP_HTTP_PORT opt-in (and its refuse-to-start-without-MCP_JWT_SECRET guard), an honest statement of what the resource server checks (sig/exp/iss/aud/scope; no issuance), and a link to ADR-0008. Removed the old `requires_scope(auth, "db:read")` line, which used the now-removed signature and implied auth "just works" (it does not under the default stdio transport). Capability + hiring-signal tables updated to name JWTTokenVerifier and point at the live 401/403/200 proof test. - ADR-0008-auth-boundary.md: new. Narrates flaw (credential in public tool schema, wired nowhere) -> why hand-rolling middleware was rejected (SDK ships it) -> native SDK bearer auth with RFC-compliant 401/403 -> stdio's missing header channel as an honest engineering bound -> the deliberate stdio hard-reject design call (soft-skip would make auth optional in practice). - ADR-0006: its `requires_scope`/`auth_tool` description (credential from a token/api_key/authorization kwarg) was now false; marked superseded by ADR-0008 on the credential-source point, with the corrected model inline. Every symbol/API named in ADR-0008 was grep-verified against the code (JWTTokenVerifier.verify_token, requires_scope, AccessToken/AuthSettings fields, FastMCP token_verifier/auth params, get_access_token, the SDK middleware classes, MCP_HTTP_PORT/MCP_JWT_SECRET, mcp 1.26.0). README import line verified to resolve. Full suite unchanged at 620 passed / 2 skipped.
…_tool Wave 1.5 polish (audit cycle 2). Post-review hardening; the accepted core (requires_scope, JWTTokenVerifier, server wiring, live 401/403/200) is untouched. The per-file autouse `_auth_context` fixture grants scopes to every logic test, so the previous single-tool `TestAuthGateStillEnforced` only guarded 1 of N tools — removing `@requires_scope` from explain_query / list_tables / get_pipeline_summary / create_opportunity would have passed silently (4 of 7 auth tools unguarded). Replaced with a dynamic canary that discovers tools via list_tools(), drops the auth contextvar to None, and asserts every tool hard-rejects with the Unauthorized string (message names the tool). New tools are auto-covered; a pinned name-set assertion forces deliberate review if the tool surface changes. Verified each of the 7 tools returns the exact requires_scope reject string (not an arg-validation error masquerading as it). `stub_tool_args` (shared in conftest) builds minimal required kwargs so the call reaches the decorator. crm read-vs-write scope test retained. `EnhancedMCP.auth_tool` is public and documented but was untested and its signature changed (auth_tool(auth, scope) -> auth_tool(scope)). Added one end-to-end test through the real MCP dispatch + real auth contextvar: with scope -> runs; no context -> Unauthorized; wrong scope -> Forbidden. Not removed/deprecated (a semver call above this task) — just proven. Suite: 621 passed / 2 skipped (was 620/2); ruff clean.
Wave 1.5 polish (audit cycle 2). - ADR-0008 §3: one sentence preempting the sharp-reviewer question about the apparent double scope check — AuthSettings.required_scopes (RequireAuthMiddleware, server-wide admission) is intentionally complemented by @requires_scope for per-tool granularity the server-wide setting cannot express (crm:read reads vs crm:write mutations on one server). - database_query/server.py: the 11-line inline "Auth model" comment block trimmed to 5, matching the crm side's shape — ADR-0008 carries the full narrative, the inline comment just states the model + backref. crm side's now-pointless "the matching comment in database_query" backref dropped (both are equally short now); both point straight at ADR-0008. No code/behavior change. Suite unchanged at 621 passed / 2 skipped.
crm_ghl, web_scraping, gemini_embedding, multi_llm, and database_query silently ran in-process mock/placeholder mode while the README advertised them as ready-to-run. main() now emits an explicit logger.warning() that names the mock, what is fake, and how to wire the real client via configure()/env vars. Mirrors the calendar pattern (65c610b). Closes the 1.7/1.8 acceptance bullet ("no server silently runs mock while advertised ready-to-run") for the 5 servers not covered by the concurrent wave1 pass. email/calendar excluded here: actively modified on wave1 — their residual silent paths are flagged separately to avoid a merge conflict. +5 caplog regression tests. Suite 611 passed / 3 skipped; ruff clean.
email.main() (no SMTP_HOST) and calendar.main() (no GOOGLE_CALENDAR_CREDENTIALS)
silently fell back to the in-memory mock. calendar previously warned only on
the [gcal]-ImportError sub-case, not the common unset-creds case. Both now emit
an explicit logger.warning() naming the mock and the env fix.
Closes the final residual of the 1.7 acceptance bullet ("no server silently
runs mock while advertised ready-to-run") — all 9 servers now warn loudly in
mock/degraded mode. +2 caplog tests. Suite 613 passed / 3 skipped; ruff clean.
Brings in the concurrent wave1 pass that makes every server warn loudly when it runs in mock/placeholder mode (no server silently advertised as ready-to-run while faking its backend). Conflicts in crm_ghl and database_query main() and their tests resolved by keeping BOTH sides: the 1.5 bearer-auth/transport wiring AND the mock-mode logger.warning; test suites take the union (auth canary + caplog mock-warning tests).
C2: README test-suite count was 607 passed; the merged tree (1.5 auth tests + the parallel mock-warning caplog tests) measures 628 passed / 2 skipped via `uv run python -m pytest tests/ -q`. Updated both README occurrences (overview metrics table + measured-results table). I1: ADR-0006 listed the JWTAuth dependency as PyJWT[cryptography]; [cryptography] is not a real PyJWT extra (ae43e20 corrected this everywhere else). Fixed to PyJWT[crypto]. M1: ADR-0008 cited `mcp` 1.26.0 as a hard version while pyproject pins mcp>=1.23.0 — reworded to "developed against `mcp` 1.26.0" so there is no apparent version contradiction.
0e41a22 to
f183a87
Compare
Summary
Cycle-2 hiring-readiness audit — 19 commits vs
main(Wave 0 safety pre-work + Wave 1 honesty reconciliation, partial).Status: partial cycle. Wave 1 tasks 1.5 (auth, own session), 1.9 (validated judge, own session), 1.12 (gated publish) and Waves 2–4 are NOT in this PR. This ships the verified honesty/correctness fixes completed so far.
What's in this PR
Wave 0 — safety pre-work
.maintenance/gitignored;.gitleaks.tomlallowlists test-fixture secrets; SECURITY.md triage.Wave 1 — honesty reconciliation
2026.json; CASE_STUDY math fixed; ADR-0003/0005 reconciled to real code.PyJWT[crypto]extra corrected ([cryptography]was invalid).emailwires SMTP from env;calendarwires Google Calendar from env; README relabeled to the precise reality (5 zero-config + 4 wire-from-env;database_query/web_scrapingdefault LLM is an explicit reference stub).logger.warning()when running in mock/placeholder/degraded mode (no key, no SMTP, no creds, default stub LLM) — no server silently pretends to be real. +caplog regression tests.Note: intentional version-strategy deviation
Repo version is
0.3.0.dev0+ explicit "published PyPI = 0.1.0" disclosure (PEP 440 dev marker), not a literal down-stamp. The gated final-publish task must flip both version sources to0.3.0at real release.Test plan
pytest tests/— 613 passed / 3 skipped locallytest (3.10 / 3.11 / 3.12)all pass on the PR tip🤖 Generated with Claude Code