Skip to content

Cycle-2 audit: safety + honesty reconciliation (W0 + W1 partial, 19 commits)#10

Open
ChunkyTortoise wants to merge 25 commits into
mainfrom
wave1/honesty-reconciliation
Open

Cycle-2 audit: safety + honesty reconciliation (W0 + W1 partial, 19 commits)#10
ChunkyTortoise wants to merge 25 commits into
mainfrom
wave1/honesty-reconciliation

Conversation

@ChunkyTortoise
Copy link
Copy Markdown
Owner

@ChunkyTortoise ChunkyTortoise commented May 19, 2026

Summary

Cycle-2 hiring-readiness audit — 19 commits vs main (Wave 0 safety pre-work + Wave 1 honesty reconciliation, partial).

Status: partial cycle. Wave 1 tasks 1.5 (auth, own session), 1.9 (validated judge, own session), 1.12 (gated publish) and Waves 2–4 are NOT in this PR. This ships the verified honesty/correctness fixes completed so far.

What's in this PR

Wave 0 — safety pre-work

  • .maintenance/ gitignored; .gitleaks.toml allowlists test-fixture secrets; SECURITY.md triage.

Wave 1 — honesty reconciliation

  • Pricing single-sourced from 2026.json; CASE_STUDY math fixed; ADR-0003/0005 reconciled to real code.
  • Concurrency-safe rate limiter + single-flight cache; A2A handled failures log WARNING not exception.
  • Version claims reconciled (see note); PyJWT[crypto] extra corrected ([cryptography] was invalid).
  • Servers honesty (1.7/1.8): email wires SMTP from env; calendar wires Google Calendar from env; README relabeled to the precise reality (5 zero-config + 4 wire-from-env; database_query/web_scraping default LLM is an explicit reference stub).
  • Mock-mode warning layer: every one of the 9 servers now emits a loud logger.warning() when running in mock/placeholder/degraded mode (no key, no SMTP, no creds, default stub LLM) — no server silently pretends to be real. +caplog regression tests.
  • Benchmark/measured numbers reconciled across README/RESULTS (1.4).

Note: intentional version-strategy deviation

Repo version is 0.3.0.dev0 + explicit "published PyPI = 0.1.0" disclosure (PEP 440 dev marker), not a literal down-stamp. The gated final-publish task must flip both version sources to 0.3.0 at real release.

Test plan

  • pytest tests/ — 613 passed / 3 skipped locally
  • CI green — test (3.10 / 3.11 / 3.12) all pass on the PR tip

Repo-wide evals-nightly.yml 0s push-failure is pre-existing (fails on every push incl. main); schedule/workflow_dispatch-only, unrelated to this PR, tracked separately.

🤖 Generated with Claude Code

ChunkyTortoise and others added 7 commits May 18, 2026 22:54
Wave 0 safety pre-work (audit cycle 2):
- .maintenance/ (local scan reports/triage) gitignored — was untracked AND
  unignored; a `git add -A` would have committed a 14k secret-scan report
- .gitleaks.toml: extends default ruleset, allowlists the test-fixture and
  adversarial-corpus paths (eval/gate test secrets, injection payloads)
- SECURITY.md: documents the 16-match scan triage — all intentional fixtures,
  zero live credentials, no rotation/history-rewrite needed

gitleaks: 16 -> 0 (no leaks found) with the scoped allowlist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…026.json

Wave 1.2 (audit cycle 2 — honesty reconciliation).

multi_llm/pricing.py held a hardcoded table that contradicted the canonical
pricing/2026.json on every shared model (gpt-5.5 5/30 vs 2/8, gpt-5.5-pro
30/180 vs 12/48, grok 2/6 vs 3/15). Every emitted cost_usd was wrong, and
the test suite enshrined the wrong numbers.

- multi_llm/pricing.py now derives PRICING from the canonical loaded table
  (mcp_toolkit.framework.costing) — one in-memory source shared with
  CostTracker; estimate_cost() signature unchanged (no call-site churn).
- test_pricing.py corrected to canonical values + a delegation-contract
  assertion (no duplicated price list).

Verified: full suite 598 passed / 2 skipped; no hardcoded price table remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… link

Wave 1.3 (audit cycle 2 — honesty reconciliation).

The TL;DR claimed $0.0018/call -> $1.80/day -> $54/month, internally
self-consistent but ~4.9x the doc's own canonical cost table ($0.000368/call,
$0.21/1000 @ 42% hit), the span JSON (0.0003675), and the scale table. A
reviewer doing the one multiplication the doc invites catches it in a minute.

- TL;DR realigned to the canonical table: $0.000368/call (cache-miss),
  $0.21/day, ~$6/month @ 42% hit; qualifiers added so per-call vs per-1000
  no longer appear to contradict.
- Fixed a broken in-repo link: framework/rate_limit.py -> rate_limiter.py
  (actual filename) — a reviewer clicking it hit a 404.

Left untouched: the live-Jaeger link (Wave 2 deploys it for real).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.6 (audit cycle 2 — honesty reconciliation).

ADR-0005 documented a RateLimiter that does not exist: a
`RateLimiter(backend=RedisSlidingWindow(...))` constructor and a
caller_id/client_id/user_id resolution-with-WARNING path — neither is in the
code (verified: zero grep hits; RateLimiter takes only default_config and a
generic `key`). Rewritten to describe the real in-process generic-key limiter,
its honest known limitations (no cross-process coord, caller-supplied key,
unsynchronized check()), with Redis backend + caller-ID resolution moved to an
explicit "Future work (NOT implemented)" section.

ADR-0003 said circuit-breaker state lived in providers.py as
_consecutive_failures/_circuit_open; it is actually a CircuitBreaker class in
models.py with _failures/_open_until. Corrected.

Acceptance: no ADR cites a missing path; every symbol ADRs now claim exists in
the cited file; the only RedisSlidingWindow mention is the "not implemented"
deferral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…drop oversell

Wave 1.11 (audit cycle 2 — honesty reconciliation).

The README markets rate limiting and caching as differentiators, but:
- RateLimiter.check() did an unsynchronized read-modify-write on the timestamp
  list; concurrent awaits on one key could both pass the limit check. Guarded
  with an asyncio.Lock (the resolution ADR-0005 already documents).
- cached_tool had a cache-stampede gap: N concurrent misses all executed the
  expensive call. Added per-key single-flight (double-checked under a lazily
  created per-cache-key lock).
- pyproject.toml description claimed "Production-ready" (prior-audit P1, an
  unbacked claim on the PyPI page) — dropped.

Added concurrency regression tests proving both fixes (100 concurrent checks
admit exactly the limit; 20 concurrent misses execute the tool once).

Verified: full suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.10 (audit cycle 2 — honesty reconciliation).

handle_task() and stream_task() catch a failing tool, return a structured
`failed` A2ATaskStatus, and ALSO called logger.exception() — dumping a full
traceback to stderr for an expected, correctly-handled outcome. In CI logs
(and the eval gate's unknown_tool case) this reads as an uncaught crash even
though run_evals.py exits 0 and 26/26 pass.

Both sites now logger.warning() (matching the existing pattern at the push-
notification handler); the error detail is still preserved in
status.message=str(exc). No behavior change — status/message identical;
no test asserts log level.

Verified: run_evals.py exit 0, 26/26, stderr now a clean one-line warning;
full suite 600 passed / 2 skipped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wave 1.1 (audit cycle 2 — honesty reconciliation).

Verified firsthand: PyPI has exactly one release, 0.1.0. The repo claimed
0.3.0 everywhere — `pip install mcp-server-toolkit==0.3.0` (in .claude/
CLAUDE.md and a spec doc) 404s, and the dynamic PyPI badge renders "0.1.0"
beside prose asserting 0.3.0.

- pyproject.toml + mcp_toolkit/__init__.py: 0.3.0 -> 0.3.0.dev0. (PEP 440 dev
  marker: honest — the code IS past 0.1.0 — while `pip install ==0.3.0` still
  correctly fails, preserving the immutable-publish guard. The gated publish
  task 1.12 flips this to 0.3.0 at release. Chosen over literal 0.1.0 to avoid
  a worse dishonesty: shipping 0.3.0-level code stamped as the published 0.1.0.)
- .claude/CLAUDE.md, docs/specs/codex-roi-evaluation-prompt.md: removed the
  false `==0.3.0` install claims; marked 0.3.0 unreleased.
- SECURITY.md: Supported Versions table fixed (claimed 0.2.x supported / 0.1.x
  not — i.e. nothing published was supported). Now 0.1.x published, 0.3.x in dev.
- README: added a release-status note so the dynamic PyPI badge (0.1.0) no
  longer silently contradicts the documented 0.3.0 feature set.

Verified: package builds + imports as 0.3.0.dev0; full suite 600 passed / 2
skipped; no installable ==0.3.0 claim remains.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14f1a6aa2e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +61 to +63
lock = asyncio.Lock()
self._inflight_locks[key] = lock
return lock
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove stale single-flight locks after cache fill

The new single-flight cache path stores a lock per computed cache_key in _inflight_locks, but those entries are never evicted. In workloads with high-cardinality keys (for example, user- or query-specific arguments), this dictionary grows for the lifetime of the process even after cache entries expire, causing unbounded memory growth over time. Please add lifecycle cleanup for per-key locks (or use a weak/evicting structure) once a key is no longer in flight.

Useful? React with 👍 / 👎.

Comment on lines +67 to +68
async with self._lock:
now = time.monotonic()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid global lock contention across independent rate keys

Making check() atomic is good, but guarding every key with one shared asyncio.Lock serializes all rate-limit checks, including unrelated callers and tools. Under concurrent traffic this creates head-of-line blocking and can significantly reduce throughput/raise latency compared to per-key locking, even when keys are independent. A per-key lock strategy would preserve correctness without turning the limiter into a single global bottleneck.

Useful? React with 👍 / 👎.

…ing/db as mock-default

- email/server.py main(): reads SMTP_HOST/USER/PASS/PORT/IMAP_HOST; wires SMTPEmailClient
  when set, falls through to MockEmailClient otherwise
- calendar/server.py main(): reads GOOGLE_CALENDAR_CREDENTIALS + CALENDAR_ID; wires
  GoogleCalendarProvider when set (skips gracefully if [gcal] not installed), falls through
  to MockCalendarProvider otherwise
- README: top-line "9 ready-to-run" → "9 pre-built; 6 zero-config, 3 wire from env";
  table rows for all 9 servers updated to match actual main() behaviour; email/calendar
  detail sections document env vars; crm_ghl documents no bundled GHL HTTP client;
  web_scraping documents extract_data needs a wired LLM; database_query documents
  DefaultLLMProvider is a reference stub and sqlglot validation is the real asset
- tests: main()-wiring tests for email (2) and calendar (2); DefaultLLMProvider stub
  behaviour tests for database_query (2) — 5 net new passing, 1 new skip ([gcal] extra)
…es not exist

PyJWT 2.12.1 publishes extras: crypto, dev, docs, tests.
The extra named 'cryptography' does not exist; uv warns on every install.
Update pyproject.toml and README to use the correct pyjwt[crypto] form.
607 passed / 2 skipped (was 600), 84% coverage (was 82.87%),
bench P50 miss 0.024ms / speedup 3.1x (was 0.022ms / 2.9x).
gh repo card shows stale 598/88% — edit command prepared but not run.
bench_cache.py:79 printed "499 of 1020" by dividing a hit-loop
numerator by a warmup+hit+miss denominator, understating the cache
~2x. Replace with split counters: 519 of 520 cacheable calls served
from cache (99.8%); hit loop made 0 of 500 backend calls. RESULTS.md
L15 now matches the script output verbatim; added a run-to-run
variance note.
README:481 linked tests/test_benchmarks.py::test_cache_hit_latency_p95
— that path 404s on GitHub and no such function exists. Point to the
real assertion: tests/gates/test_gate_scale.py
TestCacheScaleGate::test_cache_hit_p95_under_1ms (asserts hit p95 <
1ms, which bounds the P50 0.008 ms claim).
RESULTS.md:15 claimed "verbatim" but dropped the "Cache
effectiveness:" prefix — change to "from bench_cache.py stdout"
(numbers already match; only the word was wrong). README cache
latency/speedup Method cells now disclose "median of 3 runs" to match
RESULTS.md so the primary surface is self-consistent. bench_cache.py:
comment the cacheable-surface scope (miss loop excluded as
intrinsically uncacheable) and drop the unused backend_calls_total
return key.
@ChunkyTortoise ChunkyTortoise changed the title Cycle-2 audit: safety pre-work + honesty reconciliation (7 fixes) Cycle-2 audit: safety + honesty reconciliation (W0 + W1 partial, 19 commits) May 19, 2026
Wave 1.5 (audit cycle 2 — honesty reconciliation). Closes the audit's
harshest finding.

`requires_scope` read the credential from `kwargs.get("token") or
kwargs.get("api_key") or kwargs.get("authorization")` — i.e. from a public
tool-schema parameter. A `token: str = ""` param surfaces in list_tools()
inputSchema, so the "credential" was caller-supplied public input. It was
also wired into zero servers (dead code).

The MCP SDK (1.26.0, bundled FastMCP) already ships the correct mechanism;
hand-rolling a contextvar/middleware would duplicate it. So:

- auth.py: add `JWTTokenVerifier(TokenVerifier)` adapting the existing
  JWTAuth/APIKeyAuth `.authenticate()` contract to the SDK's
  `verify_token() -> AccessToken | None`. Rewrite `requires_scope` to SHRINK:
  it now reads the SDK-verified token via `get_access_token()` (the per-request
  contextvar set by the SDK's AuthContextMiddleware), returns Unauthorized if
  absent and Forbidden if the required scope is missing. All
  kwargs.get("token"/"api_key"/"authorization") paths removed — no fallback
  (a fallback IS the audit's failure mode). Signature is now
  `requires_scope(scope)`; credential is never a tool argument.
- base_server.py: `auth_tool` follows the new signature/docstring.
  `EnhancedMCP.__init__` already forwards **kwargs to FastMCP, so
  token_verifier/auth pass through with no change (verified at runtime).
- database_query + crm_ghl servers: wired with
  `JWTTokenVerifier(JWTAuth(secret=MCP_JWT_SECRET))` + `AuthSettings`;
  per-tool `@requires_scope` (db:read; crm:read/crm:write). Default
  transport stays stdio (no `python -m ...server` break); opt-in
  streamable-HTTP via `MCP_HTTP_PORT`, which refuses to start without an
  explicit `MCP_JWT_SECRET` so auth can never silently no-op. Under stdio
  (no Authorization channel) every decorated tool hard-rejects by design —
  see ADR-0008.

Tests (no mocking of the SDK auth path — it is exercised live):
- test_auth_wiring.py: the gating test (no auth-wired tool's inputSchema
  exposes token/api_key/authorization) + live streamable_http_app TestClient
  proving no header -> 401 (+WWW-Authenticate), wrong scope -> 403, valid
  bearer -> 200 and the tool body runs.
- TestRequiresScope / gate tests rewritten to the contextvar contract via a
  shared `grant_scopes` helper (real SDK AuthenticatedUser/AccessToken). The
  old tests asserted the vulnerable kwarg path; keeping them green would have
  kept the bug. Added regression guards that the gate still fires when the
  auth context is absent.

Verified: full suite 620 passed / 2 skipped (was 607/2); ruff clean;
gating test + 401/403/200 green in isolation.
Wave 1.5 (audit cycle 2 — honesty reconciliation). Documentation half of
the auth fix.

- README Authentication section rewritten to the wired model: the
  JWTTokenVerifier + AuthSettings wiring, credential-never-a-tool-arg, the
  stdio-is-unauthed-by-design caveat, the MCP_HTTP_PORT opt-in (and its
  refuse-to-start-without-MCP_JWT_SECRET guard), an honest statement of what
  the resource server checks (sig/exp/iss/aud/scope; no issuance), and a link
  to ADR-0008. Removed the old `requires_scope(auth, "db:read")` line, which
  used the now-removed signature and implied auth "just works" (it does not
  under the default stdio transport). Capability + hiring-signal tables updated
  to name JWTTokenVerifier and point at the live 401/403/200 proof test.
- ADR-0008-auth-boundary.md: new. Narrates flaw (credential in public tool
  schema, wired nowhere) -> why hand-rolling middleware was rejected (SDK ships
  it) -> native SDK bearer auth with RFC-compliant 401/403 -> stdio's missing
  header channel as an honest engineering bound -> the deliberate stdio
  hard-reject design call (soft-skip would make auth optional in practice).
- ADR-0006: its `requires_scope`/`auth_tool` description (credential from a
  token/api_key/authorization kwarg) was now false; marked superseded by
  ADR-0008 on the credential-source point, with the corrected model inline.

Every symbol/API named in ADR-0008 was grep-verified against the code
(JWTTokenVerifier.verify_token, requires_scope, AccessToken/AuthSettings
fields, FastMCP token_verifier/auth params, get_access_token, the SDK
middleware classes, MCP_HTTP_PORT/MCP_JWT_SECRET, mcp 1.26.0). README import
line verified to resolve. Full suite unchanged at 620 passed / 2 skipped.
…_tool

Wave 1.5 polish (audit cycle 2). Post-review hardening; the accepted core
(requires_scope, JWTTokenVerifier, server wiring, live 401/403/200) is
untouched.

The per-file autouse `_auth_context` fixture grants scopes to every logic
test, so the previous single-tool `TestAuthGateStillEnforced` only guarded 1
of N tools — removing `@requires_scope` from explain_query / list_tables /
get_pipeline_summary / create_opportunity would have passed silently (4 of 7
auth tools unguarded). Replaced with a dynamic canary that discovers tools via
list_tools(), drops the auth contextvar to None, and asserts every tool
hard-rejects with the Unauthorized string (message names the tool). New tools
are auto-covered; a pinned name-set assertion forces deliberate review if the
tool surface changes. Verified each of the 7 tools returns the exact
requires_scope reject string (not an arg-validation error masquerading as it).
`stub_tool_args` (shared in conftest) builds minimal required kwargs so the
call reaches the decorator. crm read-vs-write scope test retained.

`EnhancedMCP.auth_tool` is public and documented but was untested and its
signature changed (auth_tool(auth, scope) -> auth_tool(scope)). Added one
end-to-end test through the real MCP dispatch + real auth contextvar: with
scope -> runs; no context -> Unauthorized; wrong scope -> Forbidden. Not
removed/deprecated (a semver call above this task) — just proven.

Suite: 621 passed / 2 skipped (was 620/2); ruff clean.
Wave 1.5 polish (audit cycle 2).

- ADR-0008 §3: one sentence preempting the sharp-reviewer question about the
  apparent double scope check — AuthSettings.required_scopes
  (RequireAuthMiddleware, server-wide admission) is intentionally complemented
  by @requires_scope for per-tool granularity the server-wide setting cannot
  express (crm:read reads vs crm:write mutations on one server).
- database_query/server.py: the 11-line inline "Auth model" comment block
  trimmed to 5, matching the crm side's shape — ADR-0008 carries the full
  narrative, the inline comment just states the model + backref. crm side's
  now-pointless "the matching comment in database_query" backref dropped (both
  are equally short now); both point straight at ADR-0008.

No code/behavior change. Suite unchanged at 621 passed / 2 skipped.
crm_ghl, web_scraping, gemini_embedding, multi_llm, and database_query
silently ran in-process mock/placeholder mode while the README advertised
them as ready-to-run. main() now emits an explicit logger.warning() that
names the mock, what is fake, and how to wire the real client via
configure()/env vars.

Mirrors the calendar pattern (65c610b). Closes the 1.7/1.8 acceptance
bullet ("no server silently runs mock while advertised ready-to-run") for
the 5 servers not covered by the concurrent wave1 pass. email/calendar
excluded here: actively modified on wave1 — their residual silent paths
are flagged separately to avoid a merge conflict.

+5 caplog regression tests. Suite 611 passed / 3 skipped; ruff clean.
email.main() (no SMTP_HOST) and calendar.main() (no GOOGLE_CALENDAR_CREDENTIALS)
silently fell back to the in-memory mock. calendar previously warned only on
the [gcal]-ImportError sub-case, not the common unset-creds case. Both now emit
an explicit logger.warning() naming the mock and the env fix.

Closes the final residual of the 1.7 acceptance bullet ("no server silently
runs mock while advertised ready-to-run") — all 9 servers now warn loudly in
mock/degraded mode. +2 caplog tests. Suite 613 passed / 3 skipped; ruff clean.
Brings in the concurrent wave1 pass that makes every server warn loudly
when it runs in mock/placeholder mode (no server silently advertised as
ready-to-run while faking its backend). Conflicts in crm_ghl and
database_query main() and their tests resolved by keeping BOTH sides:
the 1.5 bearer-auth/transport wiring AND the mock-mode logger.warning;
test suites take the union (auth canary + caplog mock-warning tests).
C2: README test-suite count was 607 passed; the merged tree (1.5 auth
tests + the parallel mock-warning caplog tests) measures 628 passed /
2 skipped via `uv run python -m pytest tests/ -q`. Updated both README
occurrences (overview metrics table + measured-results table).

I1: ADR-0006 listed the JWTAuth dependency as PyJWT[cryptography];
[cryptography] is not a real PyJWT extra (ae43e20 corrected this
everywhere else). Fixed to PyJWT[crypto].

M1: ADR-0008 cited `mcp` 1.26.0 as a hard version while pyproject pins
mcp>=1.23.0 — reworded to "developed against `mcp` 1.26.0" so there is
no apparent version contradiction.
@ChunkyTortoise ChunkyTortoise force-pushed the wave1/honesty-reconciliation branch from 0e41a22 to f183a87 Compare May 19, 2026 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant