test(perf): episode A2 10k scale verification — scale-break detected #2

Merged
kitfunso merged 1 commit into master from feat/a2-10k-perf-benchmark
May 23, 2026

@kitfunso

Owner

Summary

Episode A2 (scale verification + latency profile) generated a 10k synthetic corpus and ran the A1 eval gate against it. Result: A1's NDCG@3 of 0.7296 collapses to 0.3139 at 10k scale. Single-tool top-1 hit rate goes from 10/14 → 0/14. Per the plan's own rule ("NDCG@3 < 0.60 at 10k is retrieval-broken territory, not a tolerance to relax"), this episode does NOT pin a new CI gate at the degraded numbers.

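| Metric | 434 (A1) | 1k (subset) | 10k (full) |
|---|---|---|---|
| NDCG@3 mean | 0.7296 | 0.0530 | 0.3139 |
| Recall@3 mean | 0.4250 | 0.0267 | 0.1133 |
| single-tool top-1 | 10/14 | 0/14 | 0/14 |
| p95 latency | unmeasured | 4-18 ms | 60-87 ms |
| stddev across N=5 | 0.0000 | 0.0000 | 0.0000 |

The 1k column is not a representative midpoint: that subset is the alphabetic-first 1k entries, which contain only synthetic tools.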
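
For readers unfamiliar with the headline metric, here is a minimal sketch of NDCG@k with linear gains. The helper names are hypothetical and the repo's own implementation (covered by the ndcg unit test) may differ, for example by using exponential gains.

```typescript
// Hypothetical helper names; the repo's own NDCG code may differ.

/** Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1). */
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, gain, index) => sum + gain / Math.log2(index + 2), 0);
}

/**
 * NDCG@k for one query.
 * ranked: result ids, best first. relevance: graded relevance per id (missing ids count as 0).
 */
function ndcgAtK(ranked: string[], relevance: Map<string, number>, k: number): number {
  const gains = ranked.map((id) => relevance.get(id) ?? 0);
  const idealGains = Array.from(relevance.values()).sort((a, b) => b - a);
  const idealDcg = dcgAtK(idealGains, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// Single-tool query: one relevant id, so the score depends only on its rank.
const singleToolRelevance = new Map<string, number>([["expected-tool", 1]]);
console.log(ndcgAtK(["expected-tool", "other-a", "other-b"], singleToolRelevance, 3)); // 1
console.log(ndcgAtK(["other-a", "other-b", "other-c"], singleToolRelevance, 3)); // 0
```

Under this definition a single-tool query scores 1 only when the expected tool is at rank 1, which is why a 0/14 top-1 hit rate drags the mean down so sharply.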

Diversity gate passed cleanly: 100% of 200 random pairs have cosine < 0.97 on both the 1k pilot and the full 10k corpus. The dilution problem is not near-duplicates; it is the absolute pool size relative to RRF's top-K window.
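
A minimal sketch of what a pair-sampling diversity gate like this could look like. The function names, the seeded PRNG, and the in-memory embedding layout are assumptions for illustration; the actual gate lives in scripts/eval/build-10k-corpus.ts and may be implemented differently.

```typescript
// Hypothetical sketch; not the code in build-10k-corpus.ts.

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Small seeded PRNG (mulberry32) so the sampled pairs are reproducible.
// Seeding is an assumption; the PR only says the pairs are random.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Fraction of sampled distinct pairs whose cosine similarity is below the threshold. */
function diversityPassRate(
  embeddings: number[][],
  pairs: number,
  threshold: number,
  seed: number,
): number {
  if (embeddings.length < 2) throw new Error("need at least two embeddings");
  const rand = mulberry32(seed);
  let passed = 0;
  for (let n = 0; n < pairs; n++) {
    const i = Math.floor(rand() * embeddings.length);
    let j = Math.floor(rand() * (embeddings.length - 1));
    if (j >= i) j += 1; // shift so that i and j are always different entries
    const a = embeddings[i];
    const b = embeddings[j];
    if (a === undefined || b === undefined) throw new Error("index out of range");
    if (cosine(a, b) < threshold) passed++;
  }
  return passed / pairs;
}
```

With the numbers from this PR the call would be along the lines of `diversityPassRate(embeddings, 200, 0.97, seed)`, and the gate passes when the result is 1.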

Root causes (probed)

Two distinct issues, neither fixable by RRF weight tuning:

  1. Skill/subagent contamination. A query like "search npm registry" ranks the publish-repo skill above every tool, and "transcribe audio whisper" returns the freeze skill at top-1, because discover() does not filter by kind=tool.
  2. Synthetic-tool dilution. Strategy-A expansion (47 base templates × vendor and scenario permutations) crowds the top-K for any query that matches the underlying base capability.
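
To illustrate why weight tuning cannot help here, the following is a sketch of generic weighted reciprocal rank fusion over a fixed candidate window. The `Arm` shape, the window parameter, and the constant 60 are assumptions taken from the common RRF formulation, not from this repo's code.

```typescript
// Generic weighted RRF sketch; names and the k = 60 constant are assumptions.

interface Arm {
  weight: number;
  ranking: string[]; // candidate ids, best first
}

/** Fuse the top `windowSize` candidates of each arm: score = sum of weight / (k + rank). */
function rrfFuse(arms: Arm[], windowSize: number, k = 60): string[] {
  const scores = new Map<string, number>();
  for (const arm of arms) {
    arm.ranking.slice(0, windowSize).forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + arm.weight / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([id]) => id);
}

// Synthetic variants fill the window in both arms; the real tool sits just outside it.
const armA: Arm = { weight: 1, ranking: ["synth-1", "synth-2", "synth-3", "real-tool"] };
const armB: Arm = { weight: 100, ranking: ["synth-2", "synth-3", "synth-1", "real-tool"] };

console.log(rrfFuse([armA, armB], 3).includes("real-tool")); // false, whatever the weights
console.log(rrfFuse([armA, armB], 4).includes("real-tool")); // true once the window is wide enough
```

A candidate that falls outside every arm's window contributes no score at all, so no choice of weights can bring it back. That is consistent with both root causes needing a change upstream of fusion.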

See docs/perf/10k-benchmark.md for diagnostic queries, hypotheses, and open items.

Plan trail

Plan-eng-critic rev 1 FAIL (score 28): killed HNSW + BM25 + sqlite-tuning fictions (don't exist in stack).
Plan-eng-critic rev 2 FAIL (score 52): killed wrong file path + fictional env-var plumbing + MCP count error.
Plan-eng-critic rev 3 PASS (score 78): rev 4 inline patches applied for 5 advisory items.

The outside-voice loop killed two whole tranches of fictional work before they cost execute time. Lesson recorded for future plans: grep source first, draft second.

What shipped (all diagnostic, no CI gate change)

  • tests/fixtures/v2-corpus-10k-snapshot.json — 10,167 entries, content-addressed sha256
  • tests/fixtures/v2-stress.json — 50 queries × 5 strata
  • tests/fixtures/v2-baseline-10k.json + v2-baseline-1k.json — diagnostic baselines
  • src/fixtures/generated-expanded.ts — Strategy A expansion generator
  • scripts/eval/build-10k-corpus.ts, build-1k-subset.ts
  • scripts/smoke/v2-golden-v2native.ts extended with --corpus-path, --include-stress, --baseline-out, --skip-hash-check (backward-compatible; default eval:golden unchanged — verified 0.7296)
  • docs/perf/10k-benchmark.md
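
As a rough illustration of the content-addressed snapshot and the --skip-hash-check escape hatch, a check of this kind could look like the sketch below. The function names are hypothetical, and whether the real script hashes raw file bytes or a canonicalized form is an assumption.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helpers; the real check lives in scripts/smoke/v2-golden-v2native.ts.

/** Hex-encoded SHA-256 of the given bytes or string. */
function sha256Hex(data: string | Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

/** Throws if the snapshot content does not match the pinned digest, unless the check is skipped. */
function verifySnapshot(
  data: string | Uint8Array,
  expectedSha256: string,
  skipHashCheck = false,
): void {
  if (skipHashCheck) return;
  const actual = sha256Hex(data);
  if (actual !== expectedSha256.toLowerCase()) {
    throw new Error(`snapshot hash mismatch: expected ${expectedSha256}, got ${actual}`);
  }
}
```

In practice the `data` argument would come from reading the corpus file given by --corpus-path, so an edited or regenerated snapshot fails fast instead of silently shifting the metrics.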

Test plan

  • CI green on this PR (Ajv2020 lint + ndcg unit test + cross-OS install)
  • Locally: npm run typecheck clean
  • Locally: npm run eval:golden against 434 corpus still 0.7296 (backward-compat verified)
  • Diagnostic numbers in v2-baseline-10k.json reproduce N=5 deterministic
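
The determinism claim can be checked with a small repeat-and-compare harness. This is a sketch only: `runEval` stands in for whatever produces the mean NDCG@3 for one run, and the use of population (rather than sample) standard deviation is an assumption.

```typescript
// Hypothetical harness; `runEval` is a placeholder for one full evaluation run.

function populationStddev(values: number[]): number {
  if (values.length === 0) throw new Error("no values");
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / values.length;
  return Math.sqrt(variance);
}

interface RepeatResult {
  scores: number[];
  stddev: number;
  identical: boolean; // true when every run returned exactly the same score
}

function repeatEval(runEval: () => number, runs: number): RepeatResult {
  const scores = Array.from({ length: runs }, () => runEval());
  const first = scores[0];
  return {
    scores,
    stddev: populationStddev(scores),
    identical: scores.every((score) => score === first),
  };
}
```

Checking `identical` alongside `stddev` avoids relying on floating-point rounding when the mean of equal scores is not exactly representable.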

Open items (next episodes)

  1. kind=tool filter at the discover surface — the cheapest likely-effective fix.
  2. Reliability score as a third RRF arm or as a post-filter.
  3. Phase 2 D (Postgres + pgvector) picks up retrieval restructuring with real HNSW tuning.
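
For open item 1, a minimal sketch of what a kind filter at the discover surface might do. The `Kind` union, the `Candidate` shape, and the function name are assumptions for illustration; the real discover() signature is not shown in this PR.

```typescript
// Hypothetical types; the real corpus entry shape and discover() API may differ.

type Kind = "tool" | "skill" | "subagent";

interface Candidate {
  id: string;
  kind: Kind;
  score: number; // fused retrieval score, higher is better
}

/** Keep only candidates of the requested kind, then take the top K by score. */
function topKOfKind(candidates: Candidate[], kind: Kind, topK: number): Candidate[] {
  return candidates
    .filter((candidate) => candidate.kind === kind)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Mirrors the "search npm registry" probe: a skill outranks the tool before filtering.
// The tool id below is made up for the example.
const probe: Candidate[] = [
  { id: "publish-repo", kind: "skill", score: 0.91 },
  { id: "example-npm-search-tool", kind: "tool", score: 0.84 },
  { id: "freeze", kind: "skill", score: 0.4 },
];

console.log(topKOfKind(probe, "tool", 3).map((c) => c.id)); // ["example-npm-search-tool"]
```

Filtering before truncation matters: applying the filter after the top-K cut would still lose tools that skills pushed out of the window. This addresses root cause 1 only; synthetic-tool dilution would remain.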

🤖 Generated with Claude Code

…(NDCG@3 0.73→0.31)

Episode A2 (rev 4 plan, plan-eng-critic rounds 1+2 FAIL → rev 3 PASS score 78)
generated a 10k synthetic corpus and ran v2-golden + a 50-query stress set against
it. Result: A1's NDCG@3 baseline of 0.7296 (434 corpus) drops to 0.3139 at 10k,
single-tool top-1 drops from 10/14 to 0/14. The plan's 0.60 floor is NOT met.
Per the plan's own rule, this is "retrieval scaled wrong" territory and the
0.31 baseline is NOT pinned as a new gate.

Two root causes identified (probed via diagnostic queries):
1. Skills outrank tools for queries whose phrasing matches skill names
   (e.g. "publish-repo" skill wins "search npm registry"; "freeze" skill
   wins "transcribe audio"). discover() doesn't filter by kind=tool.
2. Synthetic tools (Strategy A: vendor + scenario permutations of 47 base
   templates) crowd top-K for queries matching their base capability.

Diversity gate passed cleanly: 100% of 200 random pairs cosine<0.97 on both
1k pilot and full 10k corpus. The dilution problem isn't near-duplicates,
it's the absolute pool size relative to RRF's top-K candidate window.

Files shipped (all diagnostic, none gating):
  src/fixtures/generated-expanded.ts — Strategy A expansion generator
  scripts/eval/build-10k-corpus.ts   — corpus builder with diversity gate
  scripts/eval/build-1k-subset.ts    — deterministic 1k subset
  scripts/smoke/v2-golden-v2native.ts — extended with --corpus-path,
    --include-stress, --baseline-out, --skip-hash-check (backward-compat
    verified: eval:golden against 434 corpus still 0.7296)
  tests/fixtures/v2-corpus-10k-snapshot.json — 10,167 entries, sha256 5e77a6e1...
  tests/fixtures/v2-stress.json — 50 stress queries × 5 strata
  tests/fixtures/v2-baseline-10k.json — diagnostic (NOT a gate)
  tests/fixtures/v2-baseline-1k.json — diagnostic; alphabetic-first-1k is
    unrepresentative (only synthetic tools), use seeded-random for any
    future CI gate
  docs/perf/10k-benchmark.md — write-up with root-cause hypothesis + open items

Decision (per plan rev 4 Step 4 "CI gate decision"):
  - No new CI gate pinned at degraded numbers.
  - A1's v2-baseline-native.json remains the only retrieval gate.
  - Next episodes: (1) kind=tool filter at discover surface;
    (2) reliability_score as third RRF arm or post-filter;
    (3) Phase 2 D (pgvector) picks up retrieval restructuring with HNSW.

src/fixtures/generated.ts: DOMAINS + DomainTpl exports added (was private)
so the expansion generator can reuse the source templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kitfunso kitfunso merged commit 1e0e24f into master May 23, 2026
1 check passed
@kitfunso kitfunso deleted the feat/a2-10k-perf-benchmark branch May 23, 2026 18:27