From da403b8d7667112f8df83db56dae5be4c9b381cf Mon Sep 17 00:00:00 2001 From: Ed Heltzel <402910+edheltzel@users.noreply.github.com> Date: Thu, 11 Jun 2026 05:38:34 -0400 Subject: [PATCH 1/2] feat(benchmarks): add Suite C precision-under-noise benchmark Builds seeded deterministic corpora (100/1k/10k/100k records) in a temp DB using the real schema and write paths, then measures the real FTS5 search() path against a ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions). Reports P@5, R@5, MRR@5, and latency p50/p95 per corpus size, with breakdowns by query category, target table, and provenance (#42). Near-duplicate noise exercises the gap dedup lineage (#45) exists to close; dedup is intentionally not run so the baseline records the unmitigated behavior. Baseline-first per the issue: no pass/fail threshold. Env overrides RECALL_BENCH_C_SIZES / RECALL_BENCH_C_REPEATS keep CI fast. --- benchmarks/README.md | 17 +- benchmarks/runner.ts | 6 +- benchmarks/suites/suite-c-internals.ts | 666 +++++++++++++++++++ benchmarks/suites/suite-c-precision-noise.ts | 230 +++++++ src/commands/benchmark.ts | 2 +- tests/benchmarks/suite-b.test.ts | 15 +- tests/benchmarks/suite-c.test.ts | 329 +++++++++ 7 files changed, 1257 insertions(+), 8 deletions(-) create mode 100644 benchmarks/suites/suite-c-internals.ts create mode 100644 benchmarks/suites/suite-c-precision-noise.ts create mode 100644 tests/benchmarks/suite-c.test.ts diff --git a/benchmarks/README.md b/benchmarks/README.md index 4a8256b..a5d75db 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -1,6 +1,6 @@ # Recall Benchmarks (Phase 2) -> Status: Suite B (token efficiency) implemented. Suites A / C / D / E are scaffolded in the runner but not yet built. See `.atlas/plans/2026-04-17-mempalace-research-borrow-list.md` for the full Phase 2 design. +> Status: Suite B (token efficiency) and Suite C (precision under noise) implemented. 
Suites A / D / E are scaffolded in the runner but not yet built. See `.atlas/plans/2026-04-17-mempalace-research-borrow-list.md` for the full Phase 2 design. ## Why this exists @@ -51,10 +51,23 @@ Each run writes two files to `benchmarks/results/`: |---|---|---|---| | A | Cross-session recall | Planned | Retrieval@5 + answer accuracy across N-session synthetic gaps | | B | Token efficiency | **Built** | Wake-up bundle char/token cost vs v1 baseline and CLAUDE.md | -| C | Precision under noise | Planned | Precision@5 and latency at corpus sizes 100 / 1k / 10k / 100k | +| C | Precision under noise | **Built** | P@5 / R@5 / MRR@5 + latency p50/p95 at corpus sizes 100 / 1k / 10k / 100k | | D | Structured-knowledge fidelity | Planned | Supersession correctness, LoA elevation in mixed results | | E | Real-world replay | Planned | Help-rate and wrong-direction-rate on anonymized session history | +## Suite C methodology — precision under noise + +Suite C answers one question: **when the database is full of junk, does `search()` still surface the right record?** + +- **Corpus.** For each size in the ladder (default 100 / 1,000 / 10,000 / 100,000 records), a synthetic corpus is built in a temporary DB using the real schema (`initDb()`) and the real write paths from `src/lib/memory.ts`, so FTS triggers populate exactly as in production. The user's real DB is never touched. +- **Determinism.** Fixture generation is seeded (mulberry32, default seed 47). The same seed and size produce byte-identical record content, so runs are comparable across machines and over time. Tests assert this. +- **Ground truth.** A fixed set of target records (constant across sizes) carries labels: table, project, and provenance. The rest of the corpus is noise in three roles: near-duplicates of targets (the precision trap), entity-name collisions, and low-signal filler. Noise spans all five searchable tables, including messages. 
+- **Queries.** Four labeled categories: exact project/name lookup, paraphrased decision lookup, learning/problem lookup, and noisy ambiguous queries. Ambiguous queries carry explicit collision labels (name / project / topic) so failures can be attributed to entity ambiguity vs generic ranking noise. +- **Metrics.** Precision@5, Recall@5, and MRR@5 per corpus size, plus breakdowns by query category, by ground-truth table (`r_at_5_table_*`), and by provenance (`r_at_5_prov_*`). No composite scores, per the methodology rules. +- **Latency.** One unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. The report caveats state the protocol and whether the embedding service was available — Suite C exercises the FTS5 keyword path only. +- **Baseline-first.** The first run records an honest baseline; there is no pass/fail threshold. Later regression gating can diff runs against the checked-in baseline JSONL in `benchmarks/results/`. +- **Overrides.** `RECALL_BENCH_C_SIZES` (comma-separated) and `RECALL_BENCH_C_REPEATS` override the corpus ladder and repeat count — used by tests to keep CI fast; leave unset for comparable real runs. + ## Adding a new suite 1. Create `benchmarks/suites/suite--.ts` exporting `runSuite(): Promise`. 
diff --git a/benchmarks/runner.ts b/benchmarks/runner.ts index 366a457..9e81d4b 100644 --- a/benchmarks/runner.ts +++ b/benchmarks/runner.ts @@ -11,6 +11,7 @@ import { mkdirSync, writeFileSync, existsSync } from 'fs'; import { join, dirname } from 'path'; import { runSuiteB } from './suites/suite-b-token-efficiency.js'; +import { runSuiteC } from './suites/suite-c-precision-noise.js'; import type { RunResult, SuiteResult, SuiteId } from './types.js'; const RESULTS_DIR = join(import.meta.dir, 'results'); @@ -38,8 +39,11 @@ async function dispatchSuite(suite: SuiteId, project?: string): Promise number { + let a = seed >>> 0; + return () => { + a |= 0; + a = (a + 0x6d2b79f5) | 0; + let t = Math.imul(a ^ (a >>> 15), 1 | a); + t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t; + return ((t ^ (t >>> 14)) >>> 0) / 4294967296; + }; +} + +// ── Fixture types ──────────────────────────────────────────────────── + +export type FixtureTable = 'decisions' | 'learnings' | 'breadcrumbs' | 'loa_entries' | 'messages'; + +export type FixtureRole = 'target' | 'near_duplicate' | 'entity_collision' | 'low_signal'; + +export interface FixtureRecord { + /** Stable key — ground-truth queries reference target keys. */ + key: string; + table: FixtureTable; + project: string; + provenance: Provenance; + /** Primary text (decision / problem / content / LoA title). */ + text: string; + /** Secondary text (reasoning / solution / LoA fabric_extract). */ + detail?: string; + importance: number; + /** Decisions/learnings only — search() orders decisions by confidence before rank. */ + confidence?: 'high' | 'medium' | 'low'; + role: FixtureRole; +} + +export type QueryCategory = 'exact_lookup' | 'paraphrase' | 'problem_lookup' | 'ambiguous'; + +export interface FixtureQuery { + id: string; + text: string; + category: QueryCategory; + /** Optional project filter passed to search(), mirroring scoped lookups. 
*/ + project?: string; + /** Keys of target records that are relevant to this query (ground truth). */ + expected: string[]; + /** Ambiguous queries only — labels the collision so failures can be attributed. */ + collision?: { kind: 'name' | 'project' | 'topic'; note: string }; +} + +export interface FixtureSpec { + seed: number; + size: number; + records: FixtureRecord[]; + queries: FixtureQuery[]; +} + +export interface SeededRecord { + key: string; + table: FixtureTable; + id: number; + project: string; + provenance: Provenance; +} + +// ── Ground-truth targets ───────────────────────────────────────────── +// Fixed records present at EVERY corpus size, so every query has answers and +// scores are comparable across sizes. Entity names are invented (ZephyrQueue, +// GlacierStore, HeliosParser, Atlas) so real-world content can't collide. +// Confidence stays 'medium' on decision targets — search() ranks decisions by +// confidence before FTS rank, and inflating our own targets would flatter the +// baseline. 
+ +const PROJECT_PHOENIX = 'phoenix-api'; +const PROJECT_ATLAS = 'atlas-web'; +const PROJECT_NIMBUS = 'nimbus-cli'; + +const TARGETS: FixtureRecord[] = [ + { + key: 't_zephyr_backoff', + table: 'decisions', + project: PROJECT_PHOENIX, + provenance: 'user_authored', + text: 'Use exponential backoff with jitter for ZephyrQueue retry handling', + detail: 'Fixed-interval retries hammered the broker during outages; jitter spreads the reconnect storm.', + importance: 7, + confidence: 'medium', + role: 'target', + }, + { + key: 't_glacier_compaction', + table: 'decisions', + project: PROJECT_NIMBUS, + provenance: 'extracted', + text: 'Schedule GlacierStore compaction nightly at 03:00 UTC', + detail: 'Daytime compaction competed with interactive workloads for disk I/O.', + importance: 6, + confidence: 'medium', + role: 'target', + }, + { + key: 't_wal_journaling', + table: 'decisions', + project: PROJECT_PHOENIX, + provenance: 'user_authored', + text: 'Adopt SQLite WAL journal mode for the persistence layer', + detail: 'Allows concurrent readers while a single writer appends; the rollback journal blocked readers during writes.', + importance: 7, + confidence: 'medium', + role: 'target', + }, + { + key: 't_atlas_phoenix', + table: 'decisions', + project: PROJECT_PHOENIX, + provenance: 'extracted', + text: 'Deploy Atlas gateway to the edge POPs before regional rollout', + detail: 'Atlas at the edge cuts p95 latency for auth handshakes.', + importance: 6, + confidence: 'medium', + role: 'target', + }, + { + key: 't_atlas_web', + table: 'decisions', + project: PROJECT_ATLAS, + provenance: 'user_authored', + text: 'Rename the Atlas design tokens package to atlas-tokens', + detail: 'The old package name collided with the Atlas gateway component.', + importance: 5, + confidence: 'medium', + role: 'target', + }, + { + key: 't_sqlite_busy', + table: 'learnings', + project: PROJECT_PHOENIX, + provenance: 'extracted', + text: 'SQLITE_BUSY errors under concurrent WAL writers in 
bun:sqlite', + detail: 'Serialize writes through a single connection; WAL allows many readers but only one writer.', + importance: 7, + confidence: 'medium', + role: 'target', + }, + { + key: 't_orphan_worktree', + table: 'learnings', + project: PROJECT_NIMBUS, + provenance: 'extracted', + text: 'Orphaned git worktree branches accumulate after subagent merges', + detail: 'Delete the per-agent branch right after merging; lowercase -d refuses unmerged deletes.', + importance: 6, + confidence: 'medium', + role: 'target', + }, + { + key: 't_timeout_ingest', + table: 'learnings', + project: PROJECT_PHOENIX, + provenance: 'derived', + text: 'Ingest pipeline timeout when the embedding service is cold', + detail: 'Warm the embedding service at startup and fail open to keyword search.', + importance: 6, + confidence: 'medium', + role: 'target', + }, + { + key: 't_timeout_ui', + table: 'learnings', + project: PROJECT_ATLAS, + provenance: 'extracted', + text: 'Modal dismiss timeout races the navigation transition', + detail: 'Await the transition promise before starting the dismiss timer.', + importance: 5, + confidence: 'medium', + role: 'target', + }, + { + key: 't_helios_tokenizer', + table: 'loa_entries', + project: PROJECT_PHOENIX, + provenance: 'verbatim', + text: 'HeliosParser streaming tokenizer design', + detail: + 'HeliosParser tokenizes input incrementally so multi-megabyte payloads never buffer fully in memory. The lookahead window is bounded at 4KB and backpressure propagates to the source stream.', + importance: 8, + role: 'target', + }, + { + key: 't_loa_retro', + table: 'loa_entries', + project: PROJECT_ATLAS, + provenance: 'extracted', + text: 'Q2 atlas-web performance retro', + detail: + 'Bundle splitting halved initial load time. 
The Atlas tokens rename unblocked the design system release train.', + importance: 8, + role: 'target', + }, + { + key: 't_breadcrumb_release', + table: 'breadcrumbs', + project: PROJECT_NIMBUS, + provenance: 'verbatim', + text: 'Release 0.9.3 tagged; GlacierStore migration gate passed on staging', + importance: 6, + role: 'target', + }, +]; + +// ── Ground-truth query set ─────────────────────────────────────────── +// Four categories per the issue spec. Query texts avoid FTS5 syntax +// characters (quotes, colons, parens) — they go into MATCH verbatim, exactly +// as search() receives them from callers. + +const QUERIES: FixtureQuery[] = [ + // Exact project/name lookups — entity name + topic terms, sometimes scoped. + { + id: 'q_exact_zephyr', + text: 'ZephyrQueue retry backoff', + category: 'exact_lookup', + project: PROJECT_PHOENIX, + expected: ['t_zephyr_backoff'], + }, + { + id: 'q_exact_glacier', + text: 'GlacierStore compaction', + category: 'exact_lookup', + project: PROJECT_NIMBUS, + expected: ['t_glacier_compaction'], + }, + { + id: 'q_exact_helios', + text: 'HeliosParser tokenizer', + category: 'exact_lookup', + expected: ['t_helios_tokenizer'], + }, + { + id: 'q_exact_release', + text: 'GlacierStore migration gate', + category: 'exact_lookup', + project: PROJECT_NIMBUS, + expected: ['t_breadcrumb_release'], + }, + // Paraphrased decision lookups — reworded intent. FTS5 MATCH is implicit + // AND with no stemming, so some of these are EXPECTED to miss on keyword + // search. That gap is part of the baseline this suite records. 
+ { + id: 'q_para_wal', + text: 'journal mode concurrent readers', + category: 'paraphrase', + expected: ['t_wal_journaling'], + }, + { + id: 'q_para_retry', + text: 'spread reconnect attempts after broker outages', + category: 'paraphrase', + expected: ['t_zephyr_backoff'], + }, + { + id: 'q_para_compact', + text: 'when to run storage compaction', + category: 'paraphrase', + expected: ['t_glacier_compaction'], + }, + // Learning/problem lookups — phrased the way an agent reports a failure. + { + id: 'q_prob_busy', + text: 'SQLITE_BUSY concurrent writers', + category: 'problem_lookup', + expected: ['t_sqlite_busy'], + }, + { + id: 'q_prob_worktree', + text: 'orphaned worktree branches', + category: 'problem_lookup', + expected: ['t_orphan_worktree'], + }, + { + id: 'q_prob_cold', + text: 'embedding service cold timeout', + category: 'problem_lookup', + expected: ['t_timeout_ingest'], + }, + // Noisy ambiguous queries — labeled collisions so failures can be + // attributed to entity/name ambiguity vs generic ranking noise. 
+ { + id: 'q_amb_atlas_name', + text: 'Atlas', + category: 'ambiguous', + expected: ['t_atlas_phoenix', 't_atlas_web', 't_loa_retro'], + collision: { + kind: 'name', + note: 'Atlas is both a gateway (phoenix-api) and a design-tokens package (atlas-web); the atlas-web PROJECT name also matches the term because project is an indexed FTS column, and collision noise mentions Atlas in unrelated content.', + }, + }, + { + id: 'q_amb_atlas_scoped', + text: 'Atlas gateway edge', + category: 'ambiguous', + project: PROJECT_PHOENIX, + expected: ['t_atlas_phoenix'], + collision: { + kind: 'name', + note: 'Same Atlas name collision, disambiguated by a project filter plus topic terms.', + }, + }, + { + id: 'q_amb_timeout', + text: 'timeout', + category: 'ambiguous', + expected: ['t_timeout_ingest', 't_timeout_ui'], + collision: { + kind: 'topic', + note: 'Two unrelated timeout learnings (ingest pipeline vs UI modal) plus collision noise mentioning timeouts.', + }, + }, + { + id: 'q_amb_timeout_scoped', + text: 'timeout', + category: 'ambiguous', + project: PROJECT_ATLAS, + expected: ['t_timeout_ui'], + collision: { + kind: 'project', + note: 'Same bare term as q_amb_timeout; the project filter is the only disambiguator.', + }, + }, +]; + +// ── Noise generation ───────────────────────────────────────────────── +// Three noise roles: +// near_duplicate — deterministic variants of target text. They compete +// directly with targets in ranking but are NOT labeled +// relevant: surfacing the variant instead of the record +// the query asks about is a precision failure. +// entity_collision — target entity names embedded in unrelated content. +// low_signal — generic filler an extraction pipeline accumulates. 
+ +const NOISE_PROJECTS = [PROJECT_PHOENIX, PROJECT_ATLAS, PROJECT_NIMBUS, 'quartz-docs', 'ember-infra']; + +const ENTITY_NAMES = ['Atlas', 'ZephyrQueue', 'GlacierStore', 'HeliosParser', 'timeout', 'compaction']; + +const COLLISION_TEMPLATES = [ + 'Filed a ticket about ENTITY color contrast on the marketing splash page', + 'Renamed the ENTITY spreadsheet tab in the quarterly planning doc', + 'Standup note - ENTITY demo moved to Thursday', + 'Asked design for new ENTITY stickers for the offsite', + 'The ENTITY conference talk recording is up on the wiki', +]; + +const LOW_SIGNAL_SUBJECTS = [ + 'build pipeline', 'cache layer', 'release notes', 'standup notes', + 'dependency bump', 'flaky test', 'lint warning', 'onboarding doc', + 'dashboard widget', 'feature flag', 'log rotation', 'pager schedule', +]; + +const LOW_SIGNAL_VERBS = ['Investigated', 'Reviewed', 'Skimmed', 'Touched', 'Noted', 'Parked']; + +const LOW_SIGNAL_OUTCOMES = [ + 'no conclusion', 'follow-up later', 'seems fine', 'needs owner', 'low priority', 'waiting on infra', +]; + +const NEAR_DUP_SUFFIXES = [ + 'revisited after incident review', + 'as discussed in the planning sync', + 'pending final sign-off', + 'copied from the old tracker', + 'second occurrence this quarter', +]; + +function pick<T>(rng: () => number, pool: readonly T[]): T { + return pool[Math.floor(rng() * pool.length)]; +} + +function pickWeighted<T extends string>(rng: () => number, weights: Record<T, number>): T { + const entries = Object.entries(weights) as Array<[T, number]>; + let roll = rng(); + for (const [value, weight] of entries) { + roll -= weight; + if (roll < 0) return value; + } + return entries[entries.length - 1][0]; +} + +const NOISE_TABLE_WEIGHTS: Record<FixtureTable, number> = { + messages: 0.4, + breadcrumbs: 0.25, + decisions: 0.15, + learnings: 0.15, + loa_entries: 0.05, +}; + +const NOISE_ROLE_WEIGHTS: Record<Exclude<FixtureRole, 'target'>, number> = { + low_signal: 0.7, + entity_collision: 0.2, + near_duplicate: 0.1, +}; + +const NOISE_PROVENANCE_WEIGHTS: Record<Provenance, number> = { + extracted: 0.6, + derived: 
0.2, + verbatim: 0.1, + user_authored: 0.1, +}; + +function makeNoiseRecord(rng: () => number, index: number): FixtureRecord { + const table = pickWeighted(rng, NOISE_TABLE_WEIGHTS); + const role = pickWeighted(rng, NOISE_ROLE_WEIGHTS); + const provenance = pickWeighted(rng, NOISE_PROVENANCE_WEIGHTS); + const project = pick(rng, NOISE_PROJECTS); + const importance = 2 + Math.floor(rng() * 6); // 2..7 + + let text: string; + let detail: string | undefined; + + if (role === 'near_duplicate') { + const target = TARGETS[Math.floor(rng() * TARGETS.length)]; + text = `${target.text} - ${pick(rng, NEAR_DUP_SUFFIXES)}`; + detail = target.detail; + } else if (role === 'entity_collision') { + const entity = pick(rng, ENTITY_NAMES); + text = `${pick(rng, COLLISION_TEMPLATES).replace('ENTITY', entity)} (n${index})`; + } else { + text = `${pick(rng, LOW_SIGNAL_VERBS)} ${pick(rng, LOW_SIGNAL_SUBJECTS)} drift; ${pick(rng, LOW_SIGNAL_OUTCOMES)} (n${index})`; + } + + const record: FixtureRecord = { + key: `noise_${index}`, + table, + project, + provenance, + text, + detail, + importance, + role, + }; + + if (table === 'decisions' || table === 'learnings') { + const roll = rng(); + record.confidence = roll < 0.15 ? 'high' : roll < 0.85 ? 'medium' : 'low'; + } + + return record; +} + +/** + * Build the full deterministic fixture spec for one corpus size. + * + * Targets are constant across sizes; noise fills the remainder. Records are + * shuffled (seeded Fisher-Yates) so target rows scatter through the ID space + * instead of clustering at the start. 
+ */ +export function generateFixtureSpec(seed: number, size: number): FixtureSpec { + if (size < TARGETS.length + QUERIES.length) { + throw new Error(`Suite C corpus size must be at least ${TARGETS.length + QUERIES.length}, got ${size}`); + } + const rng = mulberry32(seed); + + const records: FixtureRecord[] = [...TARGETS]; + const noiseCount = size - TARGETS.length; + for (let i = 0; i < noiseCount; i++) { + records.push(makeNoiseRecord(rng, i)); + } + + // Seeded Fisher-Yates shuffle. + for (let i = records.length - 1; i > 0; i--) { + const j = Math.floor(rng() * (i + 1)); + [records[i], records[j]] = [records[j], records[i]]; + } + + return { seed, size, records, queries: QUERIES }; +} + +// ── Seeding ────────────────────────────────────────────────────────── + +const FIXTURE_SESSION_ID = 'suite-c-fixture'; +const SEED_EPOCH_MS = Date.UTC(2026, 0, 1); // fixed epoch — deterministic timestamps + +/** + * Insert a fixture spec into the CURRENT database (RECALL_DB_PATH must already + * point at an initDb()-initialized fixture DB). Uses the real write paths so + * FTS triggers populate exactly as they do in production. + * + * Returns target key → seeded row reference for ground-truth resolution. + */ +export function seedFixture(spec: FixtureSpec): Map<string, SeededRecord> { + const db = getDb(); + const targets = new Map<string, SeededRecord>(); + + createSession({ + session_id: FIXTURE_SESSION_ID, + started_at: new Date(SEED_EPOCH_MS).toISOString(), + project: PROJECT_PHOENIX, + source: 'suite-c-benchmark', + }); + + const messages: Array> = []; + const structured = spec.records.filter((r) => { + if (r.table !== 'messages') return true; + messages.push({ + session_id: FIXTURE_SESSION_ID, + timestamp: new Date(SEED_EPOCH_MS + messages.length * 60_000).toISOString(), + role: messages.length % 2 === 0 ? 
'user' : 'assistant', + content: r.text, + project: r.project, + importance: r.importance, + provenance: r.provenance, + }); + return false; + }); + + // Structured tables in one transaction — 100k single inserts without one + // would pay a COMMIT per row and take minutes. + const insertStructured = db.transaction(() => { + for (const r of structured) { + let id: number; + switch (r.table) { + case 'decisions': + id = addDecision({ + decision: r.text, + reasoning: r.detail, + project: r.project, + status: 'active', + confidence: r.confidence ?? 'medium', + importance: r.importance, + provenance: r.provenance, + }); + break; + case 'learnings': + id = addLearning({ + problem: r.text, + solution: r.detail, + project: r.project, + confidence: r.confidence ?? 'medium', + importance: r.importance, + provenance: r.provenance, + }); + break; + case 'breadcrumbs': + id = addBreadcrumb({ + content: r.text, + project: r.project, + importance: r.importance, + provenance: r.provenance, + }); + break; + case 'loa_entries': + id = createLoaEntry({ + title: r.text, + description: r.detail ? r.detail.slice(0, 120) : undefined, + fabric_extract: r.detail ?? r.text, + project: r.project, + importance: r.importance, + provenance: r.provenance, + }); + break; + default: + continue; + } + if (r.role === 'target') { + targets.set(r.key, { key: r.key, table: r.table, id, project: r.project, provenance: r.provenance }); + } + } + }); + insertStructured(); + + if (messages.length > 0) { + addMessagesBatch(messages); // manages its own transaction + } + + return targets; +} + +// ── Metrics ────────────────────────────────────────────────────────── +// Pure functions over (retrieved, relevant) so they are trivially testable. +// Relevance keys are `${table}#${id}` — the same identity search() returns. 
+ +export interface RetrievedRef { + table: string; + id: number; +} + +export function refKey(table: string, id: number): string { + return `${table}#${id}`; +} + +/** + * search() reports loa_entries rows under the logical table name 'loa' + * (see SEARCH_TABLES in src/lib/memory.ts). Ground-truth records carry the + * physical table name — map it before comparing identities. + */ +export function searchTableName(table: FixtureTable): string { + return table === 'loa_entries' ? 'loa' : table; +} + +/** Fraction of the top-k that is relevant. Standard P@k: divisor is k. */ +export function precisionAtK(retrieved: RetrievedRef[], relevant: Set<string>, k: number): number { + if (k <= 0) return 0; + const hits = retrieved.slice(0, k).filter((r) => relevant.has(refKey(r.table, r.id))).length; + return hits / k; +} + +/** Fraction of all relevant records that appear in the top-k. */ +export function recallAtK(retrieved: RetrievedRef[], relevant: Set<string>, k: number): number { + if (relevant.size === 0) return 0; + const top = retrieved.slice(0, k); + let hits = 0; + for (const key of relevant) { + if (top.some((r) => refKey(r.table, r.id) === key)) hits++; + } + return hits / relevant.size; +} + +/** 1/rank of the first relevant result within the top-k; 0 if none. */ +export function reciprocalRank(retrieved: RetrievedRef[], relevant: Set<string>, k: number): number { + const top = retrieved.slice(0, k); + for (let i = 0; i < top.length; i++) { + if (relevant.has(refKey(top[i].table, top[i].id))) return 1 / (i + 1); + } + return 0; +} + +/** Nearest-rank percentile (p in 0..100) over an unsorted sample. 
*/ +export function percentile(values: number[], p: number): number { + if (values.length === 0) return 0; + const sorted = [...values].sort((a, b) => a - b); + const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)); + return sorted[idx]; +} + +export function mean(values: number[]): number { + if (values.length === 0) return 0; + return values.reduce((a, b) => a + b, 0) / values.length; +} diff --git a/benchmarks/suites/suite-c-precision-noise.ts b/benchmarks/suites/suite-c-precision-noise.ts new file mode 100644 index 0000000..0db5742 --- /dev/null +++ b/benchmarks/suites/suite-c-precision-noise.ts @@ -0,0 +1,230 @@ +// Suite C — Precision under noise. +// +// Measures whether search() retrieves the right high-signal memory when the +// database contains many irrelevant records. Builds seeded synthetic corpora +// (100 / 1k / 10k / 100k records by default) in a temporary DB using the real +// schema and write paths, then runs a ground-truth-labeled query set through +// the real FTS5 search path and reports P@5, R@5, MRR@5, and latency p50/p95 +// per corpus size, with breakdowns by table, provenance, and query category. +// +// What this suite is NOT: +// - It does NOT measure the wake-up bundle cost (Suite B). +// - It does NOT measure answer accuracy (Suite A's grader). +// - It does NOT exercise semantic/hybrid retrieval — FTS5 keyword path only. +// First implementation records an honest baseline; there is no pass/fail +// threshold. Regression gating compares future runs against the checked-in +// baseline JSONL. 
+ +import { mkdtempSync, rmSync } from 'fs'; +import { tmpdir } from 'os'; +import { join } from 'path'; +import { search } from '../../src/lib/memory.js'; +import { initDb, closeDb } from '../../src/db/connection.js'; +import { checkEmbeddingService } from '../../src/lib/embeddings.js'; +import type { SuiteResult, MetricSample } from '../types.js'; +import { + K, + DEFAULT_SEED, + generateFixtureSpec, + seedFixture, + precisionAtK, + recallAtK, + reciprocalRank, + percentile, + mean, + refKey, + searchTableName, + type FixtureQuery, + type SeededRecord, +} from './suite-c-internals.js'; + +export interface SuiteCOptions { + /** Corpus sizes to run. Default 100/1k/10k/100k; env RECALL_BENCH_C_SIZES overrides. */ + sizes?: number[]; + /** PRNG seed for fixture generation. Default 47. */ + seed?: number; + /** Measured repeats per query (after 1 unmeasured warmup pass). Default 5; env RECALL_BENCH_C_REPEATS overrides. */ + repeats?: number; +} + +const DEFAULT_SIZES = [100, 1_000, 10_000, 100_000]; +const DEFAULT_REPEATS = 5; +const WARMUP_PASSES = 1; + +function parseEnvInts(name: string): number[] | undefined { + const raw = process.env[name]; + if (!raw) return undefined; + const values = raw.split(',').map((s) => parseInt(s.trim(), 10)).filter((n) => Number.isFinite(n) && n > 0); + return values.length > 0 ? values : undefined; +} + +const round = (value: number, places: number): number => { + const f = 10 ** places; + return Math.round(value * f) / f; +}; + +interface QueryOutcome { + query: FixtureQuery; + p5: number; + r5: number; + rr: number; + /** Per-expected-record retrieval flags for table/provenance breakdowns. 
*/ + expectedHits: Array<{ record: SeededRecord; retrieved: boolean }>; +} + +function runQuery(query: FixtureQuery, targets: Map<string, SeededRecord>): QueryOutcome { + const results = search(query.text, { project: query.project, limit: K }); + const retrieved = results.map((r) => ({ table: r.table, id: r.id })); + + const expected = query.expected.map((key) => { + const record = targets.get(key); + if (!record) throw new Error(`Suite C ground-truth key ${key} did not resolve to a seeded record`); + return record; + }); + const relevant = new Set(expected.map((r) => refKey(searchTableName(r.table), r.id))); + const topKeys = new Set(retrieved.slice(0, K).map((r) => refKey(r.table, r.id))); + + return { + query, + p5: precisionAtK(retrieved, relevant, K), + r5: recallAtK(retrieved, relevant, K), + rr: reciprocalRank(retrieved, relevant, K), + expectedHits: expected.map((record) => ({ + record, + retrieved: topKeys.has(refKey(searchTableName(record.table), record.id)), + })), + }; +} + +function pushBreakdown( + samples: MetricSample[], + outcomes: QueryOutcome[], + scope: string, +): void { + // By query category — P@5 and MRR@5 per category, so ambiguous-query + // failures are attributable separately from exact-lookup failures. + const categories = [...new Set(outcomes.map((o) => o.query.category))]; + for (const category of categories) { + const group = outcomes.filter((o) => o.query.category === category); + samples.push({ + name: `p_at_5_cat_${category}`, + value: round(mean(group.map((o) => o.p5)), 4), + unit: 'ratio', + scope, + }); + samples.push({ + name: `mrr_cat_${category}`, + value: round(mean(group.map((o) => o.rr)), 4), + unit: 'ratio', + scope, + }); + } + + // By table and provenance — recall of ground-truth records grouped by the + // record's own table/provenance: of the labeled-relevant records in this + // dimension, what fraction surfaced in a top-5? 
+ const hits = outcomes.flatMap((o) => o.expectedHits); + const byDimension = (dim: 'table' | 'provenance', prefix: string) => { + const values = [...new Set(hits.map((h) => h.record[dim]))]; + for (const value of values) { + const group = hits.filter((h) => h.record[dim] === value); + samples.push({ + name: `${prefix}_${value}`, + value: round(group.filter((h) => h.retrieved).length / group.length, 4), + unit: 'ratio', + scope, + }); + } + }; + byDimension('table', 'r_at_5_table'); + byDimension('provenance', 'r_at_5_prov'); +} + +export async function runSuiteC(options: SuiteCOptions = {}): Promise<SuiteResult> { + const t0 = performance.now(); + + const sizes = options.sizes ?? parseEnvInts('RECALL_BENCH_C_SIZES') ?? DEFAULT_SIZES; + const seed = options.seed ?? DEFAULT_SEED; + const repeats = options.repeats ?? parseEnvInts('RECALL_BENCH_C_REPEATS')?.[0] ?? DEFAULT_REPEATS; + + const embedding = await checkEmbeddingService().catch(() => ({ available: false, model: 'unknown', url: 'unknown' })); + + const samples: MetricSample[] = []; + + // Suite C must never touch the user's real DB: every corpus lives in a + // temp dir, and the env override + module connection are restored after. + const savedRecallPath = process.env.RECALL_DB_PATH; + const savedMemPath = process.env.MEM_DB_PATH; + const tempRoot = mkdtempSync(join(tmpdir(), 'recall-suite-c-')); + + try { + for (const size of sizes) { + const spec = generateFixtureSpec(seed, size); + + closeDb(); + process.env.RECALL_DB_PATH = join(tempRoot, `corpus-${size}.db`); + delete process.env.MEM_DB_PATH; + initDb(); + const targets = seedFixture(spec); + const scope = `corpus=${size}`; + + // Warmup — unmeasured pass(es) so first-touch page cache and statement + // compilation don't pollute the latency distribution. 
+ for (let w = 0; w < WARMUP_PASSES; w++) { + for (const query of spec.queries) runQuery(query, targets); + } + + const latencies: number[] = []; + let outcomes: QueryOutcome[] = []; + for (let r = 0; r < repeats; r++) { + const pass: QueryOutcome[] = []; + for (const query of spec.queries) { + const tq = performance.now(); + pass.push(runQuery(query, targets)); + latencies.push(performance.now() - tq); + } + // Retrieval is deterministic for a fixed corpus — relevance metrics + // come from the first measured pass; later passes only feed latency. + if (r === 0) outcomes = pass; + } + closeDb(); + + samples.push({ name: 'p_at_5', value: round(mean(outcomes.map((o) => o.p5)), 4), unit: 'ratio', scope }); + samples.push({ name: 'r_at_5', value: round(mean(outcomes.map((o) => o.r5)), 4), unit: 'ratio', scope }); + samples.push({ name: 'mrr', value: round(mean(outcomes.map((o) => o.rr)), 4), unit: 'ratio (MRR@5)', scope }); + samples.push({ name: 'latency_p50_ms', value: round(percentile(latencies, 50), 3), unit: 'ms', scope }); + samples.push({ name: 'latency_p95_ms', value: round(percentile(latencies, 95), 3), unit: 'ms', scope }); + pushBreakdown(samples, outcomes, scope); + } + } finally { + closeDb(); + if (savedRecallPath !== undefined) process.env.RECALL_DB_PATH = savedRecallPath; + else delete process.env.RECALL_DB_PATH; + if (savedMemPath !== undefined) process.env.MEM_DB_PATH = savedMemPath; + else delete process.env.MEM_DB_PATH; + rmSync(tempRoot, { recursive: true, force: true }); + } + + const durationMs = Math.round(performance.now() - t0); + + return { + suite: 'C', + name: 'Precision under noise', + description: + `Measures FTS5 search() precision against seeded synthetic corpora (sizes: ${sizes.join(', ')}; seed: ${seed}). 
` + + `A ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions) runs at each size; ` + + `reports P@5, R@5, MRR@5, latency p50/p95, and breakdowns by query category, target table, and provenance.`, + ranAt: new Date().toISOString(), + durationMs, + samples, + caveats: [ + `Synthetic corpus: deterministic seeded fixtures (seed ${seed}). Absolute scores do not transfer to real-world corpora; compare runs only against this same fixture set.`, + `Latency protocol: ${WARMUP_PASSES} unmeasured warmup pass per corpus size, then ${repeats} measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. Relevance metrics come from the first measured pass (retrieval is deterministic for a fixed corpus).`, + `Embedding service available: ${embedding.available ? `yes (${embedding.model})` : 'no'}. Suite C exercises the FTS5 keyword path (search()) only — semantic/hybrid retrieval is NOT measured in this baseline either way.`, + 'FTS5 MATCH is implicit AND with no stemming — paraphrase-category queries are expected to score near zero on keyword search. That gap is part of the honest baseline this suite records.', + 'Dedup was NOT run before measurement: the corpus contains unmarked near-duplicates that legitimately compete in ranking. search() excludes only records already marked in dedup_lineage.', + 'Ground truth never includes messages-table records — messages are noise-only in this corpus. The project column is part of every FTS index, so unscoped queries can match records via their project name alone.', + 'No pass/fail threshold — baseline-first. 
Later regression gating can diff future runs against the checked-in baseline JSONL.', + ], + }; +} diff --git a/src/commands/benchmark.ts b/src/commands/benchmark.ts index e732e76..0aa0168 100644 --- a/src/commands/benchmark.ts +++ b/src/commands/benchmark.ts @@ -11,7 +11,7 @@ import { join } from 'path'; const SUITES_AVAILABLE = [ { id: 'A', name: 'Cross-session recall', status: 'planned' }, { id: 'B', name: 'Token efficiency', status: 'built' }, - { id: 'C', name: 'Precision under noise', status: 'planned' }, + { id: 'C', name: 'Precision under noise', status: 'built' }, { id: 'D', name: 'Structured-knowledge fidelity', status: 'planned' }, { id: 'E', name: 'Real-world replay', status: 'planned' }, ] as const; diff --git a/tests/benchmarks/suite-b.test.ts b/tests/benchmarks/suite-b.test.ts index 27517c4..0e67647 100644 --- a/tests/benchmarks/suite-b.test.ts +++ b/tests/benchmarks/suite-b.test.ts @@ -241,10 +241,17 @@ describe('Runner — runBenchmarks', () => { }); test('skips planned suites cleanly when running all', async () => { - const out = await runBenchmarks({ project: 'bench-test', dryRun: true }); - // Only Suite B is built — others return null and are skipped - expect(out.result.suites.length).toBe(1); - expect(out.result.suites[0].suite).toBe('B'); + // Suite C is built — cap its corpus ladder so this stays a fast test. 
+ process.env.RECALL_BENCH_C_SIZES = '100'; + process.env.RECALL_BENCH_C_REPEATS = '1'; + try { + const out = await runBenchmarks({ project: 'bench-test', dryRun: true }); + // Suites B and C are built — A / D / E return null and are skipped + expect(out.result.suites.map(s => s.suite)).toEqual(['B', 'C']); + } finally { + delete process.env.RECALL_BENCH_C_SIZES; + delete process.env.RECALL_BENCH_C_REPEATS; + } }); }); diff --git a/tests/benchmarks/suite-c.test.ts b/tests/benchmarks/suite-c.test.ts new file mode 100644 index 0000000..f30d1f0 --- /dev/null +++ b/tests/benchmarks/suite-c.test.ts @@ -0,0 +1,329 @@ +// Tests for Suite C — precision under noise. +// +// Coverage per the issue spec: fixture determinism, ground-truth label +// consistency, metric calculation, and report generation. Mirrors the +// suite-b.test.ts approach: assert SHAPE and invariants, not absolute scores — +// the whole point of Suite C is that scores are an honest measurement, so the +// tests must not encode expectations about how well search ranks. 
+ +import { describe, test, expect, beforeEach, afterEach } from 'bun:test'; +import { mkdtempSync, rmSync, existsSync } from 'fs'; +import { join } from 'path'; +import { tmpdir } from 'os'; +import { Database } from 'bun:sqlite'; +import { + K, + mulberry32, + generateFixtureSpec, + seedFixture, + precisionAtK, + recallAtK, + reciprocalRank, + percentile, + mean, + refKey, + searchTableName, +} from '../../benchmarks/suites/suite-c-internals'; +import { runSuiteC } from '../../benchmarks/suites/suite-c-precision-noise'; +import { renderMarkdown } from '../../benchmarks/runner'; +import { initDb, closeDb } from '../../src/db/connection'; + +const QUERY_CATEGORIES = ['exact_lookup', 'paraphrase', 'problem_lookup', 'ambiguous'] as const; + +let savedDbPath: string | undefined; +let savedMemPath: string | undefined; +let tempDirs: string[] = []; + +beforeEach(() => { + savedDbPath = process.env.RECALL_DB_PATH; + savedMemPath = process.env.MEM_DB_PATH; +}); + +afterEach(() => { + closeDb(); + if (savedDbPath !== undefined) process.env.RECALL_DB_PATH = savedDbPath; + else delete process.env.RECALL_DB_PATH; + if (savedMemPath !== undefined) process.env.MEM_DB_PATH = savedMemPath; + else delete process.env.MEM_DB_PATH; + for (const dir of tempDirs) { + if (existsSync(dir)) rmSync(dir, { recursive: true, force: true }); + } + tempDirs = []; +}); + +/** Point RECALL_DB_PATH at a fresh initialized temp DB, ready for seedFixture. */ +function seedIntoTempDb(): { dbPath: string } { + const dir = mkdtempSync(join(tmpdir(), 'recall-suite-c-test-')); + tempDirs.push(dir); + const dbPath = join(dir, 'fixture.db'); + closeDb(); + process.env.RECALL_DB_PATH = dbPath; + delete process.env.MEM_DB_PATH; + initDb(); + return { dbPath }; +} + +/** Stable digest of corpus text content, independent of row IDs. 
*/ +function corpusDigest(dbPath: string): string { + const db = new Database(dbPath, { readonly: true }); + try { + const texts: string[] = []; + const pull = (sql: string) => { + for (const row of db.prepare(sql).all() as Array<{ t: string }>) texts.push(row.t); + }; + pull('SELECT decision || COALESCE(reasoning, \'\') AS t FROM decisions'); + pull('SELECT problem || COALESCE(solution, \'\') AS t FROM learnings'); + pull('SELECT content AS t FROM breadcrumbs'); + pull('SELECT title || fabric_extract AS t FROM loa_entries'); + pull('SELECT content AS t FROM messages'); + texts.sort(); + return String(Bun.hash(texts.join('\u0000'))); + } finally { + db.close(); + } +} + +describe('Metric helpers', () => { + const retrieved = [ + { table: 'decisions', id: 1 }, + { table: 'learnings', id: 2 }, + { table: 'decisions', id: 3 }, + { table: 'breadcrumbs', id: 4 }, + { table: 'loa', id: 5 }, + ]; + + test('precisionAtK divides hits in top-k by k', () => { + const relevant = new Set(['decisions#1', 'decisions#3']); + expect(precisionAtK(retrieved, relevant, 5)).toBe(2 / 5); + expect(precisionAtK(retrieved, relevant, 1)).toBe(1); + expect(precisionAtK(retrieved, new Set(), 5)).toBe(0); + expect(precisionAtK([], relevant, 5)).toBe(0); + expect(precisionAtK(retrieved, relevant, 0)).toBe(0); + }); + + test('recallAtK divides retrieved relevant by total relevant', () => { + const relevant = new Set(['decisions#1', 'learnings#2', 'messages#99']); + expect(recallAtK(retrieved, relevant, 5)).toBe(2 / 3); + expect(recallAtK(retrieved, relevant, 1)).toBe(1 / 3); + expect(recallAtK(retrieved, new Set(), 5)).toBe(0); + }); + + test('reciprocalRank returns 1/rank of first relevant within k', () => { + expect(reciprocalRank(retrieved, new Set(['decisions#1']), 5)).toBe(1); + expect(reciprocalRank(retrieved, new Set(['decisions#3']), 5)).toBe(1 / 3); + expect(reciprocalRank(retrieved, new Set(['loa#5']), 5)).toBe(1 / 5); + // Relevant exists but is outside the top-k cutoff + 
expect(reciprocalRank(retrieved, new Set(['loa#5']), 4)).toBe(0); + expect(reciprocalRank(retrieved, new Set(['messages#99']), 5)).toBe(0); + }); + + test('percentile uses nearest-rank on a sorted copy', () => { + const values = [5, 1, 4, 2, 3]; + expect(percentile(values, 50)).toBe(3); + expect(percentile(values, 95)).toBe(5); + expect(percentile(values, 100)).toBe(5); + expect(percentile([7], 95)).toBe(7); + expect(percentile([], 50)).toBe(0); + // Input must not be mutated + expect(values).toEqual([5, 1, 4, 2, 3]); + }); + + test('mean averages, empty is 0', () => { + expect(mean([1, 2, 3])).toBe(2); + expect(mean([])).toBe(0); + }); + + test('refKey and searchTableName align ground truth with search() identities', () => { + expect(refKey('decisions', 7)).toBe('decisions#7'); + expect(searchTableName('loa_entries')).toBe('loa'); + expect(searchTableName('decisions')).toBe('decisions'); + expect(searchTableName('messages')).toBe('messages'); + }); +}); + +describe('mulberry32 PRNG', () => { + test('same seed produces the same sequence', () => { + const a = mulberry32(47); + const b = mulberry32(47); + for (let i = 0; i < 10; i++) expect(a()).toBe(b()); + }); + + test('different seeds diverge and values stay in [0,1)', () => { + const a = mulberry32(47); + const b = mulberry32(48); + const av = Array.from({ length: 5 }, () => a()); + const bv = Array.from({ length: 5 }, () => b()); + expect(av).not.toEqual(bv); + for (const v of [...av, ...bv]) { + expect(v).toBeGreaterThanOrEqual(0); + expect(v).toBeLessThan(1); + } + }); +}); + +describe('Fixture determinism', () => { + test('same seed and size produce an identical spec', () => { + const a = generateFixtureSpec(47, 500); + const b = generateFixtureSpec(47, 500); + expect(JSON.stringify(a)).toBe(JSON.stringify(b)); + }); + + test('different seeds produce different noise', () => { + const a = generateFixtureSpec(47, 500); + const b = generateFixtureSpec(48, 500); + 
expect(JSON.stringify(a.records)).not.toBe(JSON.stringify(b.records)); + }); + + test('record count matches the requested size', () => { + expect(generateFixtureSpec(47, 100).records.length).toBe(100); + expect(generateFixtureSpec(47, 1000).records.length).toBe(1000); + }); + + test('rejects sizes too small to hold targets and queries', () => { + expect(() => generateFixtureSpec(47, 10)).toThrow(); + }); + + test('seeding the same spec yields byte-identical corpus content', () => { + const spec = generateFixtureSpec(47, 100); + const first = seedIntoTempDb(); + seedFixture(spec); + closeDb(); + const second = seedIntoTempDb(); + seedFixture(spec); + closeDb(); + expect(corpusDigest(first.dbPath)).toBe(corpusDigest(second.dbPath)); + }); +}); + +describe('Ground-truth label consistency', () => { + const spec = generateFixtureSpec(47, 100); + + test('every expected key references a target record in the spec', () => { + const targetKeys = new Set(spec.records.filter((r) => r.role === 'target').map((r) => r.key)); + for (const query of spec.queries) { + expect(query.expected.length).toBeGreaterThan(0); + for (const key of query.expected) { + expect(targetKeys.has(key)).toBe(true); + } + } + }); + + test('project-scoped queries only expect records from that project', () => { + const byKey = new Map(spec.records.map((r) => [r.key, r])); + const scoped = spec.queries.filter((q) => q.project); + expect(scoped.length).toBeGreaterThan(0); + for (const query of scoped) { + for (const key of query.expected) { + expect(byKey.get(key)!.project).toBe(query.project!); + } + } + }); + + test('all four query categories are present and ambiguous queries carry collision labels', () => { + const categories = new Set(spec.queries.map((q) => q.category)); + for (const category of QUERY_CATEGORIES) expect(categories.has(category)).toBe(true); + for (const query of spec.queries) { + if (query.category === 'ambiguous') { + expect(query.collision).toBeDefined(); + expect(['name', 'project', 
'topic']).toContain(query.collision!.kind); + } + } + }); + + test('targets are constant across corpus sizes', () => { + const targetsAt = (size: number) => + JSON.stringify(generateFixtureSpec(47, size).records.filter((r) => r.role === 'target').sort((a, b) => a.key.localeCompare(b.key))); + expect(targetsAt(100)).toBe(targetsAt(1000)); + }); + + test('seeded targets resolve with matching table, project, and provenance', () => { + seedIntoTempDb(); + const targets = seedFixture(spec); + closeDb(); + const byKey = new Map(spec.records.map((r) => [r.key, r])); + for (const query of spec.queries) { + for (const key of query.expected) { + const seeded = targets.get(key); + expect(seeded).toBeDefined(); + const declared = byKey.get(key)!; + expect(seeded!.table).toBe(declared.table); + expect(seeded!.project).toBe(declared.project); + expect(seeded!.provenance).toBe(declared.provenance); + expect(seeded!.id).toBeGreaterThan(0); + } + } + }); +}); + +describe('Suite C — runSuiteC report generation', () => { + test('returns the documented metric set per corpus size', async () => { + const result = await runSuiteC({ sizes: [100], repeats: 2 }); + expect(result.suite).toBe('C'); + expect(result.name).toBe('Precision under noise'); + expect(result.caveats.length).toBeGreaterThan(0); + + const at = (name: string) => result.samples.find((s) => s.name === name && s.scope === 'corpus=100'); + for (const required of ['p_at_5', 'r_at_5', 'mrr', 'latency_p50_ms', 'latency_p95_ms']) { + expect(at(required)).toBeDefined(); + } + + // Ratios stay in [0,1]; latencies are non-negative. 
+ for (const name of ['p_at_5', 'r_at_5', 'mrr']) { + const sample = at(name)!; + expect(sample.value).toBeGreaterThanOrEqual(0); + expect(sample.value).toBeLessThanOrEqual(1); + } + expect(at('latency_p50_ms')!.value).toBeGreaterThanOrEqual(0); + expect(at('latency_p95_ms')!.value).toBeGreaterThanOrEqual(at('latency_p50_ms')!.value); + + // Breakdowns: every query category, plus table and provenance dimensions. + const names = result.samples.map((s) => s.name); + for (const category of QUERY_CATEGORIES) { + expect(names).toContain(`p_at_5_cat_${category}`); + expect(names).toContain(`mrr_cat_${category}`); + } + expect(names.some((n) => n.startsWith('r_at_5_table_'))).toBe(true); + expect(names.some((n) => n.startsWith('r_at_5_prov_'))).toBe(true); + }); + + test('one sample group per requested corpus size', async () => { + const result = await runSuiteC({ sizes: [100, 150], repeats: 1 }); + const scopes = new Set(result.samples.map((s) => s.scope)); + expect(scopes.has('corpus=100')).toBe(true); + expect(scopes.has('corpus=150')).toBe(true); + }); + + test('restores the DB env override and never touches the previous DB path', async () => { + process.env.RECALL_DB_PATH = '/tmp/suite-c-sentinel-does-not-exist.db'; + await runSuiteC({ sizes: [100], repeats: 1 }); + expect(process.env.RECALL_DB_PATH).toBe('/tmp/suite-c-sentinel-does-not-exist.db'); + expect(existsSync('/tmp/suite-c-sentinel-does-not-exist.db')).toBe(false); + }); + + test('latency caveat documents the warmup and repeat protocol', async () => { + const result = await runSuiteC({ sizes: [100], repeats: 3 }); + const latencyCaveat = result.caveats.find((c) => c.includes('warmup')); + expect(latencyCaveat).toBeDefined(); + expect(latencyCaveat).toContain('3 measured repeats'); + const embeddingCaveat = result.caveats.find((c) => c.includes('Embedding service available')); + expect(embeddingCaveat).toBeDefined(); + }); + + test('renderMarkdown renders the Suite C section', async () => { + const suite = 
await runSuiteC({ sizes: [100], repeats: 1 }); + const md = renderMarkdown({ + startedAt: '', + finishedAt: '', + recallVersion: 'test', + hostInfo: { platform: 'test', bunVersion: 'test' }, + suites: [suite], + }); + expect(md).toContain('## Suite C — Precision under noise'); + expect(md).toContain('| p_at_5 |'); + expect(md).toContain('### Caveats'); + }); + + test('K cutoff is 5 — the metric names match the measurement', () => { + expect(K).toBe(5); + }); +}); From 575a60ee7b70c3463880a781ebd996563672ea05 Mon Sep 17 00:00:00 2001 From: Ed Heltzel <402910+edheltzel@users.noreply.github.com> Date: Thu, 11 Jun 2026 05:38:34 -0400 Subject: [PATCH 2/2] test(benchmarks): record Suite C baseline run (seed 47, full ladder) First honest baseline: exact-lookup MRR degrades 1.0 -> 0 from 100 to 100k records as unmarked near-duplicates crowd originals out of the top-5; latency p95 grows 0.5ms -> 36ms. Future regression gating diffs against this JSONL. --- .../results/2026-06-11T09-36-53-suite-C.jsonl | 1 + .../results/2026-06-11T09-36-53-suite-C.md | 112 ++++++++++++++++++ 2 files changed, 113 insertions(+) create mode 100644 benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl create mode 100644 benchmarks/results/2026-06-11T09-36-53-suite-C.md diff --git a/benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl b/benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl new file mode 100644 index 0000000..68975e4 --- /dev/null +++ b/benchmarks/results/2026-06-11T09-36-53-suite-C.jsonl @@ -0,0 +1 @@ +{"startedAt":"2026-06-11T09:36:48.827Z","finishedAt":"2026-06-11T09:36:53.977Z","recallVersion":"1.0.0","hostInfo":{"platform":"darwin-arm64","bunVersion":"1.3.14"},"suites":[{"suite":"C","name":"Precision under noise","description":"Measures FTS5 search() precision against seeded synthetic corpora (sizes: 100, 1000, 10000, 100000; seed: 47). 
A ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions) runs at each size; reports P@5, R@5, MRR@5, latency p50/p95, and breakdowns by query category, target table, and provenance.","ranAt":"2026-06-11T09:36:53.977Z","durationMs":5149,"samples":[{"name":"p_at_5","value":0.1857,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5","value":0.8095,"unit":"ratio","scope":"corpus=100"},{"name":"mrr","value":0.7381,"unit":"ratio (MRR@5)","scope":"corpus=100"},{"name":"latency_p50_ms","value":0.338,"unit":"ms","scope":"corpus=100"},{"name":"latency_p95_ms","value":0.475,"unit":"ms","scope":"corpus=100"},{"name":"p_at_5_cat_exact_lookup","value":0.2,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_exact_lookup","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_paraphrase","value":0.0667,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_paraphrase","value":0.3333,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_problem_lookup","value":0.2,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_problem_lookup","value":0.8333,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5_cat_ambiguous","value":0.25,"unit":"ratio","scope":"corpus=100"},{"name":"mrr_cat_ambiguous","value":0.7083,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_decisions","value":0.625,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_loa_entries","value":0.5,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_table_learnings","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_user_authored","value":0.75,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_extracted","value":0.6667,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_verbatim","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"r_at_5_prov_derived","value":1,"unit":"ratio","scope":"corpus=100"},{"name":"p_at_5","value":0.0714,"unit
":"ratio","scope":"corpus=1000"},{"name":"r_at_5","value":0.3571,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr","value":0.2857,"unit":"ratio (MRR@5)","scope":"corpus=1000"},{"name":"latency_p50_ms","value":0.442,"unit":"ms","scope":"corpus=1000"},{"name":"latency_p95_ms","value":0.714,"unit":"ms","scope":"corpus=1000"},{"name":"p_at_5_cat_exact_lookup","value":0.15,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_exact_lookup","value":0.75,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_paraphrase","value":0.0667,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_paraphrase","value":0.1667,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5_cat_ambiguous","value":0.05,"unit":"ratio","scope":"corpus=1000"},{"name":"mrr_cat_ambiguous","value":0.125,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_decisions","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_user_authored","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_extracted","value":0.2222,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_verbatim","value":0.5,"unit":"ratio","scope":"corpus=1000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=1000"},{"name":"p_at_5","value":0.0429,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5","value":0.2143,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr","value":0.1095,"unit":"ratio 
(MRR@5)","scope":"corpus=10000"},{"name":"latency_p50_ms","value":1.246,"unit":"ms","scope":"corpus=10000"},{"name":"latency_p95_ms","value":5.236,"unit":"ms","scope":"corpus=10000"},{"name":"p_at_5_cat_exact_lookup","value":0.1,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_exact_lookup","value":0.1333,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5_cat_ambiguous","value":0.05,"unit":"ratio","scope":"corpus=10000"},{"name":"mrr_cat_ambiguous","value":0.25,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_decisions","value":0.25,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_breadcrumbs","value":1,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_user_authored","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_extracted","value":0.2222,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_verbatim","value":0.5,"unit":"ratio","scope":"corpus=10000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=10000"},{"name":"p_at_5","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr","value":0,"unit":"ratio 
(MRR@5)","scope":"corpus=100000"},{"name":"latency_p50_ms","value":2.626,"unit":"ms","scope":"corpus=100000"},{"name":"latency_p95_ms","value":36.182,"unit":"ms","scope":"corpus=100000"},{"name":"p_at_5_cat_exact_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_exact_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_paraphrase","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_problem_lookup","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"p_at_5_cat_ambiguous","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"mrr_cat_ambiguous","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_decisions","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_loa_entries","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_breadcrumbs","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_table_learnings","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_user_authored","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_extracted","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_verbatim","value":0,"unit":"ratio","scope":"corpus=100000"},{"name":"r_at_5_prov_derived","value":0,"unit":"ratio","scope":"corpus=100000"}],"caveats":["Synthetic corpus: deterministic seeded fixtures (seed 47). Absolute scores do not transfer to real-world corpora; compare runs only against this same fixture set.","Latency protocol: 1 unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. Relevance metrics come from the first measured pass (retrieval is deterministic for a fixed corpus).","Embedding service available: no. 
Suite C exercises the FTS5 keyword path (search()) only — semantic/hybrid retrieval is NOT measured in this baseline either way.","FTS5 MATCH is implicit AND with no stemming — paraphrase-category queries are expected to score near zero on keyword search. That gap is part of the honest baseline this suite records.","Dedup was NOT run before measurement: the corpus contains unmarked near-duplicates that legitimately compete in ranking. search() excludes only records already marked in dedup_lineage.","Ground truth never includes messages-table records — messages are noise-only in this corpus. The project column is part of every FTS index, so unscoped queries can match records via their project name alone.","No pass/fail threshold — baseline-first. Later regression gating can diff future runs against the checked-in baseline JSONL."]}]} diff --git a/benchmarks/results/2026-06-11T09-36-53-suite-C.md b/benchmarks/results/2026-06-11T09-36-53-suite-C.md new file mode 100644 index 0000000..aedfd7b --- /dev/null +++ b/benchmarks/results/2026-06-11T09-36-53-suite-C.md @@ -0,0 +1,112 @@ +# Recall Benchmark Run + +- **Started:** 2026-06-11T09:36:48.827Z +- **Finished:** 2026-06-11T09:36:53.977Z +- **Recall version:** 1.0.0 +- **Host:** darwin-arm64 (Bun 1.3.14) + +## Suite C — Precision under noise + +Measures FTS5 search() precision against seeded synthetic corpora (sizes: 100, 1000, 10000, 100000; seed: 47). A ground-truth-labeled query set (exact lookup, paraphrase, problem lookup, ambiguous-with-collisions) runs at each size; reports P@5, R@5, MRR@5, latency p50/p95, and breakdowns by query category, target table, and provenance. 
+
+_Ran in 5149 ms at 2026-06-11T09:36:53.977Z._
+
+| Metric | Value | Unit | Scope | vs Baseline |
+|---|---:|---|---|---|
+| p_at_5 | 0.1857 | ratio | corpus=100 | — |
+| r_at_5 | 0.8095 | ratio | corpus=100 | — |
+| mrr | 0.7381 | ratio (MRR@5) | corpus=100 | — |
+| latency_p50_ms | 0.338 | ms | corpus=100 | — |
+| latency_p95_ms | 0.475 | ms | corpus=100 | — |
+| p_at_5_cat_exact_lookup | 0.2 | ratio | corpus=100 | — |
+| mrr_cat_exact_lookup | 1 | ratio | corpus=100 | — |
+| p_at_5_cat_paraphrase | 0.0667 | ratio | corpus=100 | — |
+| mrr_cat_paraphrase | 0.3333 | ratio | corpus=100 | — |
+| p_at_5_cat_problem_lookup | 0.2 | ratio | corpus=100 | — |
+| mrr_cat_problem_lookup | 0.8333 | ratio | corpus=100 | — |
+| p_at_5_cat_ambiguous | 0.25 | ratio | corpus=100 | — |
+| mrr_cat_ambiguous | 0.7083 | ratio | corpus=100 | — |
+| r_at_5_table_decisions | 0.625 | ratio | corpus=100 | — |
+| r_at_5_table_loa_entries | 0.5 | ratio | corpus=100 | — |
+| r_at_5_table_breadcrumbs | 1 | ratio | corpus=100 | — |
+| r_at_5_table_learnings | 1 | ratio | corpus=100 | — |
+| r_at_5_prov_user_authored | 0.75 | ratio | corpus=100 | — |
+| r_at_5_prov_extracted | 0.6667 | ratio | corpus=100 | — |
+| r_at_5_prov_verbatim | 1 | ratio | corpus=100 | — |
+| r_at_5_prov_derived | 1 | ratio | corpus=100 | — |
+| p_at_5 | 0.0714 | ratio | corpus=1000 | — |
+| r_at_5 | 0.3571 | ratio | corpus=1000 | — |
+| mrr | 0.2857 | ratio (MRR@5) | corpus=1000 | — |
+| latency_p50_ms | 0.442 | ms | corpus=1000 | — |
+| latency_p95_ms | 0.714 | ms | corpus=1000 | — |
+| p_at_5_cat_exact_lookup | 0.15 | ratio | corpus=1000 | — |
+| mrr_cat_exact_lookup | 0.75 | ratio | corpus=1000 | — |
+| p_at_5_cat_paraphrase | 0.0667 | ratio | corpus=1000 | — |
+| mrr_cat_paraphrase | 0.1667 | ratio | corpus=1000 | — |
+| p_at_5_cat_problem_lookup | 0 | ratio | corpus=1000 | — |
+| mrr_cat_problem_lookup | 0 | ratio | corpus=1000 | — |
+| p_at_5_cat_ambiguous | 0.05 | ratio | corpus=1000 | — |
+| mrr_cat_ambiguous | 0.125 | ratio | corpus=1000 | — |
+| r_at_5_table_decisions | 0.5 | ratio | corpus=1000 | — |
+| r_at_5_table_loa_entries | 0 | ratio | corpus=1000 | — |
+| r_at_5_table_breadcrumbs | 1 | ratio | corpus=1000 | — |
+| r_at_5_table_learnings | 0 | ratio | corpus=1000 | — |
+| r_at_5_prov_user_authored | 0.5 | ratio | corpus=1000 | — |
+| r_at_5_prov_extracted | 0.2222 | ratio | corpus=1000 | — |
+| r_at_5_prov_verbatim | 0.5 | ratio | corpus=1000 | — |
+| r_at_5_prov_derived | 0 | ratio | corpus=1000 | — |
+| p_at_5 | 0.0429 | ratio | corpus=10000 | — |
+| r_at_5 | 0.2143 | ratio | corpus=10000 | — |
+| mrr | 0.1095 | ratio (MRR@5) | corpus=10000 | — |
+| latency_p50_ms | 1.246 | ms | corpus=10000 | — |
+| latency_p95_ms | 5.236 | ms | corpus=10000 | — |
+| p_at_5_cat_exact_lookup | 0.1 | ratio | corpus=10000 | — |
+| mrr_cat_exact_lookup | 0.1333 | ratio | corpus=10000 | — |
+| p_at_5_cat_paraphrase | 0 | ratio | corpus=10000 | — |
+| mrr_cat_paraphrase | 0 | ratio | corpus=10000 | — |
+| p_at_5_cat_problem_lookup | 0 | ratio | corpus=10000 | — |
+| mrr_cat_problem_lookup | 0 | ratio | corpus=10000 | — |
+| p_at_5_cat_ambiguous | 0.05 | ratio | corpus=10000 | — |
+| mrr_cat_ambiguous | 0.25 | ratio | corpus=10000 | — |
+| r_at_5_table_decisions | 0.25 | ratio | corpus=10000 | — |
+| r_at_5_table_loa_entries | 0 | ratio | corpus=10000 | — |
+| r_at_5_table_breadcrumbs | 1 | ratio | corpus=10000 | — |
+| r_at_5_table_learnings | 0 | ratio | corpus=10000 | — |
+| r_at_5_prov_user_authored | 0 | ratio | corpus=10000 | — |
+| r_at_5_prov_extracted | 0.2222 | ratio | corpus=10000 | — |
+| r_at_5_prov_verbatim | 0.5 | ratio | corpus=10000 | — |
+| r_at_5_prov_derived | 0 | ratio | corpus=10000 | — |
+| p_at_5 | 0 | ratio | corpus=100000 | — |
+| r_at_5 | 0 | ratio | corpus=100000 | — |
+| mrr | 0 | ratio (MRR@5) | corpus=100000 | — |
+| latency_p50_ms | 2.626 | ms | corpus=100000 | — |
+| latency_p95_ms | 36.182 | ms | corpus=100000 | — |
+| p_at_5_cat_exact_lookup | 0 | ratio | corpus=100000 | — |
+| mrr_cat_exact_lookup | 0 | ratio | corpus=100000 | — |
+| p_at_5_cat_paraphrase | 0 | ratio | corpus=100000 | — |
+| mrr_cat_paraphrase | 0 | ratio | corpus=100000 | — |
+| p_at_5_cat_problem_lookup | 0 | ratio | corpus=100000 | — |
+| mrr_cat_problem_lookup | 0 | ratio | corpus=100000 | — |
+| p_at_5_cat_ambiguous | 0 | ratio | corpus=100000 | — |
+| mrr_cat_ambiguous | 0 | ratio | corpus=100000 | — |
+| r_at_5_table_decisions | 0 | ratio | corpus=100000 | — |
+| r_at_5_table_loa_entries | 0 | ratio | corpus=100000 | — |
+| r_at_5_table_breadcrumbs | 0 | ratio | corpus=100000 | — |
+| r_at_5_table_learnings | 0 | ratio | corpus=100000 | — |
+| r_at_5_prov_user_authored | 0 | ratio | corpus=100000 | — |
+| r_at_5_prov_extracted | 0 | ratio | corpus=100000 | — |
+| r_at_5_prov_verbatim | 0 | ratio | corpus=100000 | — |
+| r_at_5_prov_derived | 0 | ratio | corpus=100000 | — |
+
+### Caveats
+
+- Synthetic corpus: deterministic seeded fixtures (seed 47). Absolute scores do not transfer to real-world corpora; compare runs only against this same fixture set.
+- Latency protocol: 1 unmeasured warmup pass per corpus size, then 5 measured repeats per query on a warm connection; p50/p95 are computed across all measured calls at that size. Relevance metrics come from the first measured pass (retrieval is deterministic for a fixed corpus).
+- Embedding service available: no. Suite C exercises the FTS5 keyword path (search()) only — semantic/hybrid retrieval is NOT measured in this baseline either way.
+- FTS5 MATCH is implicit AND with no stemming — paraphrase-category queries are expected to score near zero on keyword search. That gap is part of the honest baseline this suite records.
+- Dedup was NOT run before measurement: the corpus contains unmarked near-duplicates that legitimately compete in ranking. search() excludes only records already marked in dedup_lineage.
+- Ground truth never includes messages-table records — messages are noise-only in this corpus. The project column is part of every FTS index, so unscoped queries can match records via their project name alone.
+- No pass/fail threshold — baseline-first. Later regression gating can diff future runs against the checked-in baseline JSONL.
+
+---
+_All metrics are unblended. We do not publish composite scores. See the per-suite caveats before drawing conclusions._
\ No newline at end of file
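
Note on reading the table: the sketch below shows how the headline metrics could be computed, following the behavior asserted in `tests/benchmarks/suite-c.test.ts` (hits in the top k divided by k, hits divided by total relevant, 1/rank of the first relevant result within k, and a nearest-rank percentile on a sorted copy). It is not the code from `benchmarks/suites/suite-c-internals.ts`, which this patch excerpt does not show. The `Ref` type, `keyOf`, and the `Sketch`-suffixed function names are hypothetical.

```typescript
// Illustrative sketch only. NOT the implementation in suite-c-internals.ts.
// `Ref`, `keyOf`, and the *Sketch names are assumptions made for this example.
type Ref = { table: string; id: number };

// Ground-truth keys use the `table#id` shape seen in the tests (e.g. 'decisions#7').
const keyOf = (r: Ref): string => `${r.table}#${r.id}`;

// P@k: relevant results among the first k, divided by k (not by the result count).
function precisionAtKSketch(retrieved: Ref[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = retrieved.slice(0, k).filter((r) => relevant.has(keyOf(r))).length;
  return hits / k;
}

// R@k: relevant results among the first k, divided by the total number of relevant records.
function recallAtKSketch(retrieved: Ref[], relevant: Set<string>, k: number): number {
  if (k <= 0 || relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((r) => relevant.has(keyOf(r))).length;
  return hits / relevant.size;
}

// Reciprocal rank: 1 / (1-based rank of the first relevant result), or 0 if none within k.
function reciprocalRankSketch(retrieved: Ref[], relevant: Set<string>, k: number): number {
  const idx = retrieved.slice(0, Math.max(k, 0)).findIndex((r) => relevant.has(keyOf(r)));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Nearest-rank percentile on a sorted copy, so the caller's array is not mutated.
function percentileSketch(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(Math.max(rank, 1), sorted.length) - 1];
}

// Usage: five results, two of which are labeled relevant.
const top5: Ref[] = [
  { table: 'decisions', id: 1 },
  { table: 'learnings', id: 2 },
  { table: 'decisions', id: 3 },
  { table: 'breadcrumbs', id: 4 },
  { table: 'loa', id: 5 },
];
const truth = new Set(['decisions#1', 'decisions#3']);
console.log(precisionAtKSketch(top5, truth, 5)); // 0.4
console.log(recallAtKSketch(top5, truth, 5)); // 1
console.log(reciprocalRankSketch(top5, truth, 5)); // 1
console.log(percentileSketch([5, 1, 4, 2, 3], 95)); // 5
```

Under these definitions, a query with exactly one expected record that is returned at rank 1 scores P@5 = 0.2 and reciprocal rank = 1. A low `p_at_5` next to a high `mrr` is therefore not a contradiction.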