test(perf): episode A2 10k scale verification — scale-break detected#2
Merged
…(NDCG@3 0.73→0.31)
Episode A2 (rev 4 plan, plan-eng-critic rounds 1+2 FAIL → rev 3 PASS score 78)
generated a 10k synthetic corpus and ran v2-golden + a 50-query stress set against
it. Result: A1's NDCG@3 baseline of 0.7296 (434 corpus) drops to 0.3139 at 10k,
single-tool top-1 drops from 10/14 to 0/14. The plan's 0.60 floor is NOT met.
Per the plan's own rule, this is "retrieval scaled wrong" territory and the
0.31 baseline is NOT pinned as a new gate.
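For reference, a minimal sketch of how an NDCG@3 number and the 0.60 floor check could be computed. It assumes linear gain and the standard log2 rank discount; the function names (`dcgAtK`, `ndcgAtK`, `meetsFloor`) and the input layout are hypothetical, and the real harness in scripts/smoke/v2-golden-v2native.ts may differ.

```typescript
// Hypothetical sketch, not the project's eval code.
// `ranked`: relevance grade of each returned result, in rank order.
// `allRelevant`: grades of every relevant item for the query (ideal ranking).
function dcgAtK(grades: number[], k: number): number {
  // Rank i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
  return grades
    .slice(0, k)
    .reduce((sum, rel, i) => sum + rel / Math.log2(i + 2), 0);
}

function ndcgAtK(ranked: number[], allRelevant: number[], k: number): number {
  const ideal = [...allRelevant].sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(ranked, k) / idcg;
}

// Mean per-query NDCG compared against a floor (0.60 in the plan).
function meetsFloor(perQuery: number[], floor: number): boolean {
  if (perQuery.length === 0) return false;
  const mean = perQuery.reduce((a, b) => a + b, 0) / perQuery.length;
  return mean >= floor;
}
```

Under these assumptions a single-tool query whose expected tool lands at rank 3 scores 0.5, and a query whose expected tool misses the top 3 scores 0.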
Two root causes identified (probed via diagnostic queries):
1. Skills outrank tools for queries whose phrasing matches skill names
   (e.g. "publish-repo" skill wins "search npm registry"; "freeze" skill
   wins "transcribe audio"). discover() doesn't filter by kind=tool.
2. Synthetic tools (Strategy A: vendor + scenario permutations of 47 base
   templates) crowd top-K for queries matching their base capability.
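Root cause 1 could be addressed with a kind filter applied before top-K truncation, sketched below. The entry shape, the `discoverFiltered` name, the `npm-search` tool id, and the scores are all hypothetical; the real discover() signature is not shown in this PR.

```typescript
// Hypothetical entry shape; the real corpus schema may differ.
type Kind = "tool" | "skill";

interface RankedEntry {
  id: string;
  kind: Kind;
  score: number;
}

// Filter by kind first, then truncate. Truncating first would let skills
// consume top-K slots and push matching tools out of the result.
function discoverFiltered(
  ranked: RankedEntry[],
  kind: Kind,
  topK: number,
): RankedEntry[] {
  return ranked.filter((e) => e.kind === kind).slice(0, topK);
}

// Made-up ranking for a query like "search npm registry".
const sampleRanking: RankedEntry[] = [
  { id: "publish-repo", kind: "skill", score: 0.91 },
  { id: "npm-search", kind: "tool", score: 0.88 }, // hypothetical tool id
  { id: "freeze", kind: "skill", score: 0.4 },
];

const topTool = discoverFiltered(sampleRanking, "tool", 1);
```

With the made-up ranking above, the unfiltered top-1 is the publish-repo skill, while `topTool` holds only the tool entry.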
Diversity gate passed cleanly: 100% of 200 random pairs cosine<0.97 on both
1k pilot and full 10k corpus. The dilution problem isn't near-duplicates,
it's the absolute pool size relative to RRF's top-K candidate window.
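A diversity gate of this shape can be sketched as follows, assuming it samples random pairs of embedding vectors and checks each pair's cosine similarity against the 0.97 threshold. The function names, the seeding scheme (mulberry32, a small well-known PRNG), and the vector representation are assumptions, not the code in scripts/eval/build-10k-corpus.ts.

```typescript
// Small seeded PRNG (mulberry32) so sampled pairs are reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Fraction of sampled pairs whose cosine similarity is below `threshold`.
function diversityPassRate(
  vectors: number[][],
  pairs: number,
  threshold: number,
  seed: number,
): number {
  if (vectors.length < 2 || pairs <= 0) {
    throw new Error("need at least two vectors and one pair");
  }
  const rand = mulberry32(seed);
  let passed = 0;
  for (let p = 0; p < pairs; p++) {
    const i = Math.floor(rand() * vectors.length);
    // Draw j from the remaining n - 1 indices so that j !== i.
    let j = Math.floor(rand() * (vectors.length - 1));
    if (j >= i) j++;
    if (cosine(vectors[i], vectors[j]) < threshold) passed++;
  }
  return passed / pairs;
}
```

A call such as `diversityPassRate(embeddings, 200, 0.97, seed)` returning 1 would correspond to the "100% of 200 random pairs" result reported above. A pass only rules out near-duplicates; it says nothing about how many distinct entries compete for the same top-K slots.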
Files shipped (all diagnostic, none gating):
- src/fixtures/generated-expanded.ts: Strategy A expansion generator
- scripts/eval/build-10k-corpus.ts: corpus builder with diversity gate
- scripts/eval/build-1k-subset.ts: deterministic 1k subset
- scripts/smoke/v2-golden-v2native.ts: extended with --corpus-path,
  --include-stress, --baseline-out, --skip-hash-check (backward-compat
  verified: eval:golden against 434 corpus still 0.7296)
- tests/fixtures/v2-corpus-10k-snapshot.json: 10,167 entries, sha256 5e77a6e1...
- tests/fixtures/v2-stress.json: 50 stress queries across 5 strata
- tests/fixtures/v2-baseline-10k.json: diagnostic (NOT a gate)
- tests/fixtures/v2-baseline-1k.json: diagnostic; the alphabetic-first-1k
  subset is unrepresentative (it contains only synthetic tools), so use a
  seeded-random subset for any future CI gate
- docs/perf/10k-benchmark.md: write-up with root-cause hypothesis + open items
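One way to get a seeded-random subset that stays deterministic is to rank every id by a seeded hash and keep the n smallest, sketched below. The helper names and the choice of FNV-1a are assumptions for illustration; scripts/eval/build-1k-subset.ts may do this differently.

```typescript
// 32-bit FNV-1a string hash. Not cryptographic; used only to derive a
// stable pseudo-random ordering key per id.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Alphabetic-first selection: deterministic but biased toward whatever
// sorts first (in this corpus, only synthetic tools).
function alphabeticFirst(ids: string[], n: number): string[] {
  return [...ids].sort().slice(0, n);
}

// Seeded selection: order ids by hash(seed + id), break ties by id, and
// take the first n. The result does not depend on input order.
function seededSubset(ids: string[], n: number, seed: string): string[] {
  return ids
    .map((id) => ({ id, key: fnv1a(`${seed}:${id}`) }))
    .sort((a, b) => a.key - b.key || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
    .slice(0, n)
    .map((e) => e.id);
}
```

Because the ordering key depends only on the seed and the id, adding or reordering corpus entries does not reshuffle the relative order of the ids that were already present.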
Decision (per plan rev 4 Step 4 "CI gate decision"):
- No new CI gate pinned at degraded numbers.
- A1's v2-baseline-native.json remains the only retrieval gate.
- Next episodes: (1) kind=tool filter at discover surface;
(2) reliability_score as third RRF arm or post-filter;
(3) Phase 2 D (pgvector) picks up retrieval restructuring with HNSW.
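To make item (2) concrete, here is a minimal sketch of reciprocal rank fusion over any number of ranked lists, where a reliability ordering is simply one more arm. It assumes the commonly used constant k = 60 and 1-based ranks; the `rrfFuse` name, the `window` parameter, and the ids are hypothetical, and the arms used by the actual discover() path are not described in this PR.

```typescript
// Reciprocal rank fusion: score(d) = sum over arms of 1 / (k + rank(d)),
// where rank is 1-based and an arm contributes nothing for ids it omits.
// `window` caps how many candidates each arm contributes.
function rrfFuse(
  arms: string[][],
  k = 60,
  window = Infinity,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const arm of arms) {
    arm.slice(0, window).forEach((id, idx) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + idx + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort(
      (a, b) => b.score - a.score || (a.id < b.id ? -1 : a.id > b.id ? 1 : 0),
    );
}

// Two hypothetical arms in which a synthetic entry outranks the real tool.
const armA = ["synthetic-a", "real-tool"];
const armB = ["synthetic-a", "real-tool"];
// Hypothetical third arm: entries ordered by reliability_score, in which
// the synthetic entry does not appear.
const reliabilityArm = ["real-tool"];

const twoArm = rrfFuse([armA, armB]);
const threeArm = rrfFuse([armA, armB, reliabilityArm]);
const narrowWindow = rrfFuse([armA, armB], 60, 1);
```

In this toy case the two-arm fusion ranks the synthetic entry first, the third arm lifts the real tool above it, and a per-arm window of 1 drops the real tool from the candidates entirely. That last case is the window-dilution effect described above.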
src/fixtures/generated.ts: DOMAINS + DomainTpl exports added (was private)
so the expansion generator can reuse the source templates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 23, 2026
Summary
Episode A2 (scale verification + latency profile) generated a 10k synthetic corpus and ran the A1 eval gate against it. Result: A1's NDCG@3 of 0.7296 collapses to 0.3139 at 10k scale. Single-tool top-1 hit rate goes from 10/14 → 0/14. Per the plan's own rule ("NDCG@3 < 0.60 at 10k is retrieval-broken territory, not a tolerance to relax"), this episode does NOT pin a new CI gate at the degraded numbers.
Diversity gate passed cleanly: 100% of 200 random pairs cosine<0.97 on both 1k pilot and full 10k. The dilution problem isn't near-duplicates, it's the absolute pool size relative to RRF's top-K window.
Root causes (probed)
Two distinct issues, neither fixable by RRF weight tuning:
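1. Skills outrank tools for queries whose phrasing matches skill names: "search npm registry" → publish-repo skill above any tool. "transcribe audio whisper" → freeze skill at top-1. discover() doesn't filter by kind=tool.
2. Synthetic tools (Strategy A: vendor + scenario permutations of 47 base templates) crowd top-K for queries matching their base capability.
See docs/perf/10k-benchmark.md for diagnostic queries, hypotheses, and open items.
Plan trail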
Plan-eng-critic rev 1 FAIL (score 28): killed HNSW + BM25 + sqlite-tuning fictions (don't exist in stack).
Plan-eng-critic rev 2 FAIL (score 52): killed wrong file path + fictional env-var plumbing + MCP count error.
Plan-eng-critic rev 3 PASS (score 78): rev 4 inline patches applied for 5 advisory items.
The outside-voice loop killed two whole tranches of fictional work before they cost execute time. Lesson recorded for future plans: grep source first, draft second.
What shipped (all diagnostic, no CI gate change)
- tests/fixtures/v2-corpus-10k-snapshot.json: 10,167 entries, content-addressed sha256
- tests/fixtures/v2-stress.json: 50 queries × 5 strata
- tests/fixtures/v2-baseline-10k.json + v2-baseline-1k.json: diagnostic baselines
- src/fixtures/generated-expanded.ts: Strategy A expansion generator
- scripts/eval/build-10k-corpus.ts, build-1k-subset.ts
- scripts/smoke/v2-golden-v2native.ts extended with --corpus-path, --include-stress, --baseline-out, --skip-hash-check (backward-compatible; default eval:golden unchanged, verified 0.7296)
- docs/perf/10k-benchmark.md
Test plan
- npm run typecheck clean
- npm run eval:golden against 434 corpus still 0.7296 (backward-compat verified)
- v2-baseline-10k.json reproduce N=5 deterministic
Open items (next episodes)
- kind=tool filter at the discover surface: the cheapest likely-effective fix.
🤖 Generated with Claude Code