test(perf): episode A2 10k scale verification — scale-break detected by kitfunso · Pull Request #2 · kitfunso/2chain

kitfunso · 2026-05-23T18:20:28Z

Summary

Episode A2 (scale verification + latency profile) generated a 10k synthetic corpus and ran the A1 eval gate against it. Result: A1's NDCG@3 of 0.7296 collapses to 0.3139 at 10k scale, and the single-tool top-1 hit rate drops from 10/14 to 0/14. Per the plan's own rule ("NDCG@3 < 0.60 at 10k is retrieval-broken territory, not a tolerance to relax"), this episode does NOT pin a new CI gate at the degraded numbers.

Metric	434 (A1)	1k (subset)	10k (full)
NDCG@3 mean	0.7296	0.0530	0.3139
Recall@3 mean	0.4250	0.0267	0.1133
single-tool top-1	10/14	0/14	0/14
p95 latency	unmeasured	4-18 ms	60-87 ms
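stddev across N=5	0.0000	0.0000	0.0000

The 1k column is not a midpoint on a scaling curve: that subset is the alphabetic-first 1k entries, which contain only synthetic tools, so it is unrepresentative.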
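For readers less familiar with the headline metric, here is a minimal sketch of NDCG@k under binary relevance. The repository's own ndcg implementation is not shown in this PR, so the function name `ndcgAtK`, its signature, the binary-gain assumption, and the `npm-search` tool id are all illustrative assumptions.

```typescript
// Hypothetical helper, NOT the repository's ndcg implementation.
// Assumes binary relevance: a result counts as relevant iff its id is in `relevant`.
function ndcgAtK(ranked: string[], relevant: Set<string>, k = 3): number {
  // DCG: gain 1 per relevant hit, discounted by log2(rank + 1) with 1-based rank.
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  // Ideal DCG: every relevant item packed at the top, capped at k positions.
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// A single-tool query whose expected tool slides from rank 1 to rank 3, then out of the top 3.
const relevantTools = new Set(["npm-search"]); // hypothetical tool id
console.log(ndcgAtK(["npm-search", "a", "b"], relevantTools)); // 1
console.log(ndcgAtK(["a", "b", "npm-search"], relevantTools)); // 0.5
console.log(ndcgAtK(["a", "b", "c"], relevantTools)); // 0
```

Under this definition a single-tool query contributes nothing once its expected tool falls outside the top 3, which is consistent with the top-1 count and the NDCG@3 mean falling together.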

Diversity gate passed cleanly: 100% of 200 random pairs had cosine < 0.97 on both the 1k pilot and the full 10k corpus. The dilution problem isn't near-duplicates, it's the absolute pool size relative to RRF's top-K window.
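The pairwise check described above can be sketched as follows. This is an assumption-laden illustration, not the code in scripts/eval/build-10k-corpus.ts: the function names, the seeded PRNG, and the choice to report the fraction of pairs under the threshold are all hypothetical.

```typescript
// Hypothetical sketch of a pairwise diversity check; not the repository's gate.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Small deterministic PRNG (mulberry32-style) so the sampled pairs are reproducible.
function mulberry32(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fraction of randomly sampled distinct pairs whose cosine similarity is below `threshold`.
function fractionBelow(embeddings: number[][], pairs: number, threshold: number, seed: number): number {
  const n = embeddings.length;
  if (n < 2) throw new Error("need at least two embeddings");
  const rand = mulberry32(seed);
  let below = 0;
  for (let p = 0; p < pairs; p++) {
    const i = Math.floor(rand() * n);
    let j = Math.floor(rand() * (n - 1));
    if (j >= i) j += 1; // shift so that j never equals i
    if (cosine(embeddings[i] ?? [], embeddings[j] ?? []) < threshold) below++;
  }
  return below / pairs;
}

// Toy vectors: every pair has cosine <= ~0.71, so all 200 sampled pairs pass.
console.log(fractionBelow([[1, 0], [0, 1], [1, 1]], 200, 0.97, 42)); // 1
```

A result of 1 corresponds to the "100% of 200 random pairs" reading in this PR. Note that such a check only rules out near-duplicates; it says nothing about how many distinct-but-related entries compete for the same top-K slots.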

Root causes (probed)

Two distinct issues, neither fixable by RRF weight tuning:

1. Skill/subagent contamination. Queries like "search npm registry" rank the publish-repo skill above any tool, and "transcribe audio whisper" puts the freeze skill at top-1. discover() doesn't filter by kind=tool.
2. Synthetic-tool dilution. Strategy-A expansion (vendor + scenario permutations of 47 base templates) crowds top-K for queries matching the underlying base capability.
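To make the contamination case concrete, here is a minimal sketch of reciprocal rank fusion with an optional kind filter. It is not the repository's discover() code: the `Entry` shape, the `rrfFuse` name, the constant k = 60, the two unnamed retrieval arms, and the `npm-search` / `npm-info` tool ids are assumptions for illustration. Only the publish-repo skill comes from the diagnostic queries above.

```typescript
// Hypothetical types and fusion function; the real discover() surface is not shown in this PR.
type Kind = "tool" | "skill" | "subagent";
interface Entry {
  id: string;
  kind: Kind;
}

// Reciprocal rank fusion: score(d) = sum over arms of 1 / (k + rank), rank is 1-based.
// k = 60 is a commonly used default, assumed here.
function rrfFuse(rankings: Entry[][], opts: { k?: number; kind?: Kind } = {}): Entry[] {
  const k = opts.k ?? 60;
  const scores = new Map<string, { entry: Entry; score: number }>();
  for (const ranking of rankings) {
    // Filter before ranks are assigned, so excluded kinds do not consume rank positions.
    const kept = opts.kind ? ranking.filter((e) => e.kind === opts.kind) : ranking;
    kept.forEach((entry, i) => {
      const add = 1 / (k + i + 1);
      const prev = scores.get(entry.id);
      if (prev) prev.score += add;
      else scores.set(entry.id, { entry, score: add });
    });
  }
  return Array.from(scores.values())
    .sort((a, b) => b.score - a.score)
    .map((s) => s.entry);
}

// Two hypothetical arms for the query "search npm registry"; a skill leads both.
const skill: Entry = { id: "publish-repo", kind: "skill" };
const search: Entry = { id: "npm-search", kind: "tool" };
const info: Entry = { id: "npm-info", kind: "tool" };
const armA = [skill, search, info];
const armB = [skill, search, info];

console.log(rrfFuse([armA, armB])[0]?.id); // "publish-repo"
console.log(rrfFuse([armA, armB], { kind: "tool" })[0]?.id); // "npm-search"
```

The sketch also shows why weight tuning alone cannot help in this case: when a skill leads every arm, any positive weighting of the arms still puts it first, whereas a filter removes it from the candidate set entirely. Whether a filter improves NDCG on the real corpus is an empirical question this sketch does not answer.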

See docs/perf/10k-benchmark.md for diagnostic queries, hypotheses, and open items.

Plan trail

Plan-eng-critic rev 1 FAIL (score 28): killed HNSW + BM25 + sqlite-tuning fictions (don't exist in stack).
Plan-eng-critic rev 2 FAIL (score 52): killed wrong file path + fictional env-var plumbing + MCP count error.
Plan-eng-critic rev 3 PASS (score 78): rev 4 inline patches applied for 5 advisory items.

The outside-voice loop killed two whole tranches of fictional work before they cost execute time. Lesson recorded for future plans: grep source first, draft second.

What shipped (all diagnostic, no CI gate change)

tests/fixtures/v2-corpus-10k-snapshot.json — 10,167 entries, content-addressed sha256
tests/fixtures/v2-stress.json — 50 queries × 5 strata
tests/fixtures/v2-baseline-10k.json + v2-baseline-1k.json — diagnostic baselines
src/fixtures/generated-expanded.ts — Strategy A expansion generator
scripts/eval/build-10k-corpus.ts, build-1k-subset.ts
scripts/smoke/v2-golden-v2native.ts extended with --corpus-path, --include-stress, --baseline-out, --skip-hash-check (backward-compatible; default eval:golden unchanged — verified 0.7296)
docs/perf/10k-benchmark.md
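The commit message for this PR notes that the alphabetic-first 1k subset is unrepresentative and recommends a seeded-random subset for any future CI gate. One way that could look is sketched below; it is not the contents of build-1k-subset.ts, and the function name, PRNG, and seed value are assumptions.

```typescript
// Hypothetical sketch of a deterministic seeded subset; not the repository's build-1k-subset.ts.

// Small deterministic PRNG (mulberry32-style) returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick `n` ids with a partial Fisher-Yates shuffle driven by the seeded PRNG.
function seededSample(ids: string[], n: number, seed: number): string[] {
  // Sort a copy first so the result depends only on the id set and the seed,
  // not on the order in which the corpus happened to be loaded.
  const pool = ids.slice().sort();
  const take = Math.min(n, pool.length);
  const rand = mulberry32(seed);
  for (let i = 0; i < take; i++) {
    const j = i + Math.floor(rand() * (pool.length - i));
    const tmp = pool[i] as string;
    pool[i] = pool[j] as string;
    pool[j] = tmp;
  }
  return pool.slice(0, take);
}

// Same ids in a different order and the same seed give the same subset.
const first = seededSample(["a", "b", "c", "d", "e"], 3, 7);
const second = seededSample(["e", "d", "c", "b", "a"], 3, 7);
console.log(first.join(",") === second.join(",")); // true
```

Unlike taking the first 1k entries in alphabetical order, a seeded draw samples across the whole id range while staying reproducible between runs.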

Test plan

CI green on this PR (Ajv2020 lint + ndcg unit test + cross-OS install)
Locally: npm run typecheck clean
Locally: npm run eval:golden against 434 corpus still 0.7296 (backward-compat verified)
Diagnostic numbers in v2-baseline-10k.json reproduce N=5 deterministic

Open items (next episodes)

kind=tool filter at the discover surface — the cheapest likely-effective fix.
Reliability-score post-filter as a third RRF arm.
Phase 2 D (Postgres + pgvector) picks up retrieval restructuring with real HNSW tuning.

🤖 Generated with Claude Code

…(NDCG@3 0.73→0.31)

Episode A2 (rev 4 plan, plan-eng-critic rounds 1+2 FAIL → rev 3 PASS score 78) generated a 10k synthetic corpus and ran v2-golden + a 50-query stress set against it.

Result: A1's NDCG@3 baseline of 0.7296 (434 corpus) drops to 0.3139 at 10k, single-tool top-1 drops from 10/14 to 0/14. The plan's 0.60 floor is NOT met. Per the plan's own rule, this is "retrieval scaled wrong" territory and the 0.31 baseline is NOT pinned as a new gate.

Two root causes identified (probed via diagnostic queries):

1. Skills outrank tools for queries whose phrasing matches skill names (e.g. "publish-repo" skill wins "search npm registry"; "freeze" skill wins "transcribe audio"). discover() doesn't filter by kind=tool.
2. Synthetic tools (Strategy A: vendor + scenario permutations of 47 base templates) crowd top-K for queries matching their base capability.

Diversity gate passed cleanly: 100% of 200 random pairs cosine<0.97 on both 1k pilot and full 10k corpus. The dilution problem isn't near-duplicates, it's the absolute pool size relative to RRF's top-K candidate window.

Files shipped (all diagnostic, none gating):

src/fixtures/generated-expanded.ts: Strategy A expansion generator
scripts/eval/build-10k-corpus.ts: corpus builder with diversity gate
scripts/eval/build-1k-subset.ts: deterministic 1k subset
scripts/smoke/v2-golden-v2native.ts: extended with --corpus-path, --include-stress, --baseline-out, --skip-hash-check (backward-compat verified: eval:golden against 434 corpus still 0.7296)
tests/fixtures/v2-corpus-10k-snapshot.json: 10,167 entries, sha256 5e77a6e1...
tests/fixtures/v2-stress.json: 50 stress queries × 5 strata
tests/fixtures/v2-baseline-10k.json: diagnostic (NOT a gate)
tests/fixtures/v2-baseline-1k.json: diagnostic; alphabetic-first-1k is unrepresentative (only synthetic tools), use seeded-random for any future CI gate
docs/perf/10k-benchmark.md: write-up with root-cause hypothesis + open items

Decision (per plan rev 4 Step 4 "CI gate decision"):

- No new CI gate pinned at degraded numbers.
- A1's v2-baseline-native.json remains the only retrieval gate.
- Next episodes: (1) kind=tool filter at discover surface; (2) reliability_score as third RRF arm or post-filter; (3) Phase 2 D (pgvector) picks up retrieval restructuring with HNSW.

src/fixtures/generated.ts: DOMAINS + DomainTpl exports added (was private) so the expansion generator can reuse the source templates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kitfunso merged commit 1e0e24f into master May 23, 2026
1 check passed

kitfunso deleted the feat/a2-10k-perf-benchmark branch May 23, 2026 18:27

This was referenced May 23, 2026

feat(eval): per-query kind targeting infra (opt-in; honest negative on NDCG) #4

Merged

test(perf): reliability arm — negative; three-strikes on simple remediations #5

Merged
