feat(benchmarks): add Suite C precision-under-noise benchmark #64
Builds seeded deterministic corpora (100/1k/10k/100k records) in a temp DB using the real schema and write paths, then measures the real FTS5 search() path against a ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions). Reports P@5, R@5, MRR@5, and latency p50/p95 per corpus size, with breakdowns by query category, target table, and provenance (#42). Near-duplicate noise exercises the gap dedup lineage (#45) exists to close; dedup is intentionally not run so the baseline records the unmitigated behavior. Baseline-first per the issue: no pass/fail threshold. Env overrides RECALL_BENCH_C_SIZES / RECALL_BENCH_C_REPEATS keep CI fast.
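As a rough illustration of the ranking metrics named above, the sketch below scores a single query's ranked result ids against its ground-truth ids. The names `scoreAtK`, `RankMetrics`, and `meanReciprocalRank` are hypothetical and are not taken from the benchmark source. It also assumes precision is divided by k rather than by the number of results returned, and that ranked ids are unique.

```typescript
// Hypothetical helper, not the benchmark's actual code: scores one query's
// ranked result ids against its ground-truth relevant ids at a cutoff k.
interface RankMetrics {
  precision: number;
  recall: number;
  reciprocalRank: number;
}

function scoreAtK(ranked: string[], relevant: Set<string>, k = 5): RankMetrics {
  const topK = ranked.slice(0, k);
  // Number of relevant ids that appear in the top k (ids assumed unique).
  const hits = topK.filter((id) => relevant.has(id)).length;
  // Zero-based position of the first relevant id, or -1 if none made the cut.
  const firstHit = topK.findIndex((id) => relevant.has(id));
  return {
    precision: hits / k, // assumption: denominator is k, not results returned
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
    reciprocalRank: firstHit === -1 ? 0 : 1 / (firstHit + 1),
  };
}

// Averaging the per-query reciprocal ranks gives MRR@k for a query set.
function meanReciprocalRank(perQuery: RankMetrics[]): number {
  if (perQuery.length === 0) return 0;
  return perQuery.reduce((sum, m) => sum + m.reciprocalRank, 0) / perQuery.length;
}
```

Under this definition, a query whose original record is pushed out of the top 5 by near-duplicates contributes a reciprocal rank of 0, which is the failure mode the exact-lookup category is meant to expose.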
First honest baseline: exact-lookup MRR degrades 1.0 -> 0 from 100 to 100k records as unmarked near-duplicates crowd originals out of the top-5; latency p95 grows 0.5ms -> 36ms. Future regression gating diffs against this JSONL.
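For readers unfamiliar with how p50/p95 figures like these are usually derived, here is a minimal sketch. It assumes a nearest-rank percentile and a simple warmup-then-repeat timing loop; the benchmark's actual percentile method and protocol may differ, and `percentile` and `timeRepeated` are hypothetical names.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: the real benchmark may interpolate or use another definition.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p percent of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length, Math.max(1, rank)) - 1];
}

// Runs fn() `warmup` times untimed, then `repeats` times timed,
// and returns the timed durations in milliseconds.
function timeRepeated(fn: () => void, repeats: number, warmup = 1): number[] {
  for (let i = 0; i < warmup; i++) fn();
  const samples: number[] = [];
  for (let i = 0; i < repeats; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  return samples;
}
```

With few repeats, a nearest-rank p95 is effectively the slowest sample, so a low repeat count (for example via RECALL_BENCH_C_REPEATS in CI) makes p95 noisier than p50.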
Review of PR #64: Suite C precision-under-noise benchmark
Reviewed head
Independent verification (all passed)
Issue #47 acceptance criteria
All met: corpus ladder 100/1k/10k/100k; seeded deterministic fixtures (asserted by tests and verified end-to-end above); ground truth labeled with table/project/provenance; all four query categories with typed name/project/topic collision labels on ambiguous queries; P@5/R@5/MRR + latency p50/p95 with the warmup/repeat protocol and embedding availability documented in the caveats; table and provenance breakdowns; results in
Findings: none blocking; recommended follow-ups
Strengths
Benchmarks the real write/search paths instead of a strawman; determinism holds end-to-end across machines and runs (verified, not just asserted); the baseline is honest: paraphrase queries scoring ~0 on keyword search and the 100k collapse are reported, caveated, and explained rather than tuned away; isolation of the user DB is exactly right and test-pinned; this is precisely the measurable gap that dedup lineage (#45) exists to close, now with a committed JSONL to diff against.
Verdict: APPROVE
Closes #47
What
Adds Suite C to the benchmark harness, the final item of the MemPalace roadmap. It measures whether search() retrieves the right high-signal memory when the database contains many irrelevant records.
Corpora are built in a temp DB with the real initDb() schema and the real src/lib/memory.ts write paths (FTS triggers fire as in production). The user's real DB is never touched; the env override is saved and restored.
dispatchSuite case C; recall benchmark list shows C as built; README methodology section added.
Baseline (committed)
First honest run, no pass/fail threshold per the issue. Headline: exact-lookup MRR degrades 1.0 → 0 from 100 → 100k records as unmarked near-duplicates crowd originals out of the top-5 (LoA near-dups double-count terms across title+extract; message near-dups win bm25 length normalization). Latency p95 grows 0.5ms → 36ms. This is the measurable gap dedup lineage (#45) exists to close — a post-dedup run can be compared directly against this JSONL.
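One way a post-dedup run could be compared against the committed JSONL is sketched below. The row shape (`size`, `category`, `mrr`) is an assumption made for illustration; the actual field names in the baseline file are not shown in this PR and may differ. `parseJsonl` and `diffMrr` are hypothetical helpers.

```typescript
// Hypothetical row shape; the real JSONL field names may differ.
interface BenchRow {
  size: number;
  category: string;
  mrr: number;
}

// Parses JSON Lines text: one JSON object per non-empty line.
function parseJsonl(text: string): BenchRow[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as BenchRow);
}

// Returns MRR deltas (candidate minus baseline) keyed by "size/category".
// Rows present in only one of the two runs are skipped.
function diffMrr(baseline: BenchRow[], candidate: BenchRow[]): Map<string, number> {
  const key = (r: BenchRow) => `${r.size}/${r.category}`;
  const base = new Map(baseline.map((r) => [key(r), r.mrr] as const));
  const deltas = new Map<string, number>();
  for (const row of candidate) {
    const before = base.get(key(row));
    if (before !== undefined) deltas.set(key(row), row.mrr - before);
  }
  return deltas;
}
```

A positive delta for a given corpus size and query category would indicate that the candidate run ranks the labeled originals higher than the baseline did. A future regression gate could fail on negative deltas beyond some tolerance, though this PR defines no such threshold.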
Tests
Gate
bun run lint: clean
bun test: 652 pass / 0 fail
recall benchmark run C: end-to-end produces the committed baseline (full ladder, ~5s)