LearningCircuit · May 2, 2026
diff --git a/METHODOLOGY.md b/METHODOLOGY.md
@@ -0,0 +1,228 @@
+# Benchmark Methodology
+
+This document explains how to interpret LDR benchmark results, how
+much to trust a reported number, and what conditions must hold for a
+comparison between two runs to be valid.
+
+For a quick orientation, the README's
+[Considerations section](README.md#considerations-for-using-the-data)
+has a condensed version of the same material.
+
+---
+
+## What is being measured
+
+LDR benchmarks measure how accurately the system answers factual
+questions when it has access to a live search engine. The score is the
+fraction of questions where an LLM grader judges the system's final
+answer to match the reference answer.
+
+This is **not** a measure of the underlying language model's knowledge.
+It measures the full pipeline: query generation, search retrieval,
+context synthesis, and final answer extraction — all in combination.
+Changing any one of those components (model, strategy, search engine,
+prompt template) changes what the score measures.
+
+---
+
+## Confidence intervals for a single run
+
+A reported accuracy of "91%" is a point estimate. The true accuracy
+could be higher or lower depending on which questions happened to be
+sampled. The Wilson score interval gives a range that stays reliable
+for small samples and for accuracies near 100%; at 95% confidence:
+
+```
+center     = (p̂ + z²/(2n)) / (1 + z²/n)
+half-width = z × sqrt(p̂(1−p̂)/n + z²/(4n²)) / (1 + z²/n)
+
+where  p̂ = observed accuracy (e.g. 0.91)
+       n  = number of examples tested
+       z  = 1.96 for 95% confidence
+```
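+
+A minimal sketch of the formula above in Python, using only the standard
+library. The function name `wilson_interval` is illustrative and is not
+part of LDR:
+
+```python
+import math
+
+
+def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
+    """Return the (lower, upper) Wilson score bounds for an observed accuracy."""
+    if n <= 0:
+        raise ValueError("n must be positive")
+    denom = 1 + z**2 / n
+    center = (p_hat + z**2 / (2 * n)) / denom
+    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
+    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
+    return max(0.0, center - half_width), min(1.0, center + half_width)
+
+
+low, high = wilson_interval(0.91, 100)
+print(f"91% on 100 questions: {low:.1%} to {high:.1%}")
+```
+
+Note that the interval is not centred on the observed 91%: the Wilson
+centre is pulled slightly toward 50%, which is what keeps the bounds
+sensible when accuracy is close to 100%.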
+
+Approximate 95% margins of error at common sample sizes:
+
+| Examples (n) | ~70% accuracy | ~85% accuracy | ~91% accuracy | ~95% accuracy |
+|---|---|---|---|---|
+| 20  | ±21% | ±17% | ±14% | ±10% |
+| 50  | ±13% | ±10% | ±8%  | ±6%  |
+| 100 | ±9%  | ±7%  | ±6%  | ±4%  |
+| 200 | ±6%  | ±5%  | ±4%  | ±3%  |
+| 500 | ±4%  | ±3%  | ±3%  | ±2%  |
+
+**Practical guidance:**
+- A run of 20 questions has a margin of roughly ±10–21 pp. A "91%"
+  result plausibly spans 77–100%. Treat it as a rough sanity check only.
+- 100 examples is the minimum before drawing any conclusion from a run.
+- 200+ examples is the minimum before comparing two configurations.
+- Results clustered above 90% (as most LDR submissions are) have tighter
+  absolute intervals: at n=300 with 91% accuracy the margin is roughly
+  ±3 pp.
+
+---
+
+## Comparing two configurations
+
+To tell whether configuration A is genuinely better than configuration
+B, the observed difference needs to be larger than the noise from both
+runs combined. The table below shows how many examples each
+configuration needs (run independently on the same question set via the
+same seed) to reliably detect a given absolute accuracy difference at
+80% statistical power (α = 0.05, two-sided):
+
+| Difference to detect       | Examples needed per config |
+|---|---|
+| 5 pp (e.g., 85% vs 90%)   | ~680 |
+| 10 pp (e.g., 80% vs 90%)  | ~200 |
+| 15 pp (e.g., 75% vs 90%)  | ~100 |
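+
+The figures in this table can be approximated with the standard
+normal-approximation formula for comparing two independent proportions.
+The sketch below is an assumption about how such numbers are typically
+derived, not the script used to produce the table, and
+`examples_per_config` is an illustrative name:
+
+```python
+import math
+from statistics import NormalDist
+
+
+def examples_per_config(p_a: float, p_b: float,
+                        alpha: float = 0.05, power: float = 0.80) -> int:
+    """Approximate per-configuration sample size for a two-sided test."""
+    if p_a == p_b:
+        raise ValueError("the two accuracies must differ")
+    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
+    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
+    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
+    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2)
+
+
+for p_a, p_b in [(0.85, 0.90), (0.80, 0.90), (0.75, 0.90)]:
+    print(f"{p_a:.0%} vs {p_b:.0%}: {examples_per_config(p_a, p_b)} per config")
+```
+
+Because the required size grows with the inverse square of the gap,
+halving the difference you want to detect roughly quadruples the number
+of questions needed.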
+
+**Rule of thumb:** If the observed gap between two runs is smaller than
+the margin of error for either run (see the table above), treat the
+results as a tie.
+
+**Practical implication for high-accuracy runs:** When results are
+clustered around 90–95%, even a 200-question run can only reliably
+detect a gap of roughly 10 pp. Distinguishing between, say, 91% and
+93% takes roughly 2,900 questions per configuration by the same
+calculation, and even then grader noise (see below) will obscure
+differences smaller than ~2–3 pp.
+
+---
+
+## When two runs cannot be compared
+
+Statistical power is only one requirement. Even with large sample sizes,
+a comparison is unreliable if **any** of the following differ between
+the two runs:
+
+| Factor | Why it matters |
+|---|---|
+| **LDR version** | Search logic, prompt templates, and result filtering change between releases. Two runs on different versions measure different pipelines. |
+| **Strategy** | `focused_iteration` and `source_based` answer questions differently by design. Their scores measure different things and are not interchangeable. |
+| **Grader model** | Changing the evaluation LLM changes what "correct" means. The same system response may grade differently under a different grader. |
+| **Random seed / question sample** | Some subsets of SimpleQA are inherently easier than others. Always use `--seed 42` (or any fixed seed) consistently across compared runs. |
+| **Search engine** | Tavily, SearXNG, Serper, and Brave retrieve different content. Engine latency also affects what gets retrieved within per-query time limits. |
+
+Treat each combination of `(ldr_version, strategy, search_engine,
+grader_model, seed)` as a distinct experimental condition. Only compare
+runs within the same condition.
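+
+One way to enforce this in a script is to build a condition key per run
+and refuse to compare runs whose keys differ. This is a sketch under
+assumptions: the nested dictionary layout and the `seed` key are
+hypothetical stand-ins for however a run's metadata is stored, and
+`condition_key` and `comparable` are not LDR functions:
+
+```python
+def condition_key(run: dict) -> tuple:
+    """Collect the five factors that define an experimental condition."""
+    return (
+        run["ldr_version"],
+        run["strategy"],
+        run["search_engine"],
+        run["evaluator"]["model"],
+        run["seed"],  # hypothetical key; use wherever the seed is recorded
+    )
+
+
+def comparable(run_a: dict, run_b: dict) -> bool:
+    """Two runs may be compared only if every factor matches."""
+    return condition_key(run_a) == condition_key(run_b)
+
+
+base = {
+    "ldr_version": "x.y.z",  # placeholder values for illustration only
+    "strategy": "focused_iteration",
+    "search_engine": "searxng",
+    "evaluator": {"model": "grader-model"},
+    "seed": 42,
+}
+other = {**base, "strategy": "source_based"}
+print(comparable(base, base), comparable(base, other))
+```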
+
+---
+
+## Evaluator LLM error
+
+The grader LLM (default: Claude 3.7 Sonnet via OpenRouter) is not
+perfect. On SimpleQA-style questions it mis-grades approximately 1% of
+responses, consistent with calibration results reported in the original
+SimpleQA paper for similarly capable graders.
+
+What this means in practice:
+
+| Run size | Expected grading errors | Smallest detectable real difference |
+|---|---|---|
+| 100 examples | ~1 question | ~3–4 pp (1 pp is pure noise) |
+| 200 examples | ~2 questions | ~2–3 pp |
+| 500 examples | ~5 questions | ~2 pp |
+
+The grader tends to be conservative — it marks ambiguous or partially
+correct matches as incorrect — so reported accuracy is a slight
+underestimate of true accuracy.
+
+**Do not optimize for differences smaller than ~2–3 pp on runs under
+500 examples.** The signal is not there.
+
+---
+
+## Hands-on advice
+
+### Starting a new benchmark run
+
+1. **Fix your seed.** Pass the same constant seed (e.g., `--seed 42`)
+   to every run. An unfixed seed means each run samples a different
+   subset of questions, making reruns incomparable.
+2. **Start small.** Run 20–50 questions first to confirm your search
+   engine is returning results and the grader is producing sensible
+   output. Look at a handful of graded examples, not just the summary
+   score.
+3. **Check search retrieval before trusting accuracy numbers.** If
+   `average_results_per_query` is 0 or very low, the model is answering
+   from memory, not from search. The accuracy number then measures the
+   model and not LDR.
+4. **Scale up only after sanity checks pass.** 100 examples for a
+   single-configuration result; 200+ if you plan to compare
+   configurations.
+
+### Interpreting a submitted result
+
+- Look at `total_questions`. A run of fewer than 100 questions
+  should be read as "approximately X%" with wide error bars, not as a
+  precise figure.
+- Check `ldr_version`, `strategy`, `search_engine`, and
+  `evaluator.model`. Together with the seed, these four fields define
+  the experimental condition. Only compare rows where all of them match.
+- If `hardware` fields are blank, timing numbers (`avg_time_per_question`)
+  cannot be meaningfully compared to other submissions.
+- A result with `total_tokens_used` filled in is more reproducible: you
+  can estimate cost and check whether the run hit context limits.
+
+### Comparing two configurations yourself
+
+1. Decide in advance what difference size you care about (e.g., "I want
+   to know if strategy B is more than 10 pp better than strategy A").
+2. Look up the required sample size from the table above (~200 per
+   config for 10 pp).
+3. Run both configurations on the **same question set** (same seed).
+4. Check: is the observed difference greater than the margin of error
+   for each run individually? If not, treat it as noise.
+5. Check: is the observed difference greater than ~2–3 pp? If not, it
+   may be grader noise even if statistically "significant".
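+
+Steps 4 and 5 can be sketched as a small helper. It assumes both runs
+used the same question set, takes 2.5 pp as a midpoint of the ~2–3 pp
+grader-noise range, and uses illustrative names (`margin`, `verdict`)
+that are not part of LDR:
+
+```python
+import math
+
+
+def margin(p_hat: float, n: int, z: float = 1.96) -> float:
+    """Wilson half-width for an observed accuracy on n questions."""
+    denom = 1 + z**2 / n
+    return z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
+
+
+def verdict(acc_a: float, acc_b: float, n_a: int, n_b: int,
+            grader_floor: float = 0.025) -> str:
+    """Apply the margin-of-error and grader-noise checks to two runs."""
+    gap = abs(acc_a - acc_b)
+    # Step 4: the gap must exceed the margin of error of each run.
+    if gap <= max(margin(acc_a, n_a), margin(acc_b, n_b)):
+        return "tie: gap is within sampling noise"
+    # Step 5: the gap must also clear the grader-noise floor.
+    if gap <= grader_floor:
+        return "tie: gap is within grader noise"
+    return "difference worth investigating"
+
+
+print(verdict(0.91, 0.93, 200, 200))
+print(verdict(0.80, 0.92, 200, 200))
+```
+
+The sampling check runs first on purpose: a gap that fails it is
+already a tie, so the grader-noise floor only matters for runs large
+enough to make small gaps look significant.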
+
+### When a result looks surprisingly good or bad
+
+- **Suspiciously high accuracy (>95%):** Check `total_questions`. A
+  small sample can produce any score by chance. Also verify the grader
+  model — a lenient grader inflates scores.
+- **Suspiciously low accuracy (<70%):** Check that search results are
+  actually being retrieved. Zero or near-zero `average_results_per_query`
+  or repeated search failures in `test_details` are the most common
+  cause.
+- **Very fast processing times:** Usually indicates the search step is
+  being skipped or timing out silently.
+- **Score drops sharply between LDR versions:** Check the changelog for
+  that version. Prompt template changes and result-filtering changes have
+  historically caused 5–10 pp swings that have nothing to do with the
+  model or search engine.
+
+---
+
+## Pre-flight checklist
+
+Before acting on a benchmark result or publishing a comparison:
+
+- [ ] `total_questions` ≥ 100 for a single-configuration result
+- [ ] `total_questions` ≥ 200 for a head-to-head comparison
+- [ ] Same `--seed` used across all compared runs
+- [ ] Same `ldr_version`, `strategy`, `search_engine`, and
+      `evaluator.model` across compared runs
+- [ ] Observed difference > margin of error for each individual run
+- [ ] Observed difference > ~2–3 pp (minimum above grader noise floor)
+- [ ] `average_results_per_query` > 0 (search is actually running)
+- [ ] Reviewed a sample of graded examples, not just the headline score
+
+---
+
+## Adding a new benchmark dataset
+
+When LDR adds support for a new dataset, update the `BENCHMARKS`
+whitelist in **both** `scripts/validate_yamls.py` and
+`scripts/build_leaderboards.py`, and document here:
+
+- Whether per-question examples may be shared (see sharing policy in
+  README)
+- The canonical dataset ID used in LDR's registry
+- Any dataset-specific interpretation notes (e.g., question difficulty
+  distribution, known contamination risks)
+
+Keep `canonical_id` in sync with LDR's
+`src/local_deep_research/benchmarks/datasets/__init__.py`.
diff --git a/README.md b/README.md
@@ -183,20 +183,49 @@ Keep the `canonical_id` in sync with LDR's
 
 ## Considerations for using the data
 
-This is a community-submitted leaderboard, not a controlled experiment.
-
-- **Self-reported.** CI validates schema but not that a run actually
-  happened as described.
-- **Evaluator bias.** Most submissions use an LLM grader (Claude 3.7
-  Sonnet by default). Expect ~1% grading error.
-- **Small sample sizes.** Typical runs use 50–200 questions. Confidence
-  intervals are wide; small differences are usually not significant.
-- **Timing is environment-dependent.** Compare `avg_time_per_question`
-  with caution across different hardware/network setups.
-- **Contamination risk.** SimpleQA is publicly distributed. BrowseComp
-  and xbench mitigate this with encryption.
-- **Strategy semantics drift** between LDR versions — prefer comparing
-  runs tagged with the same `ldr_version`.
+This is a community-submitted leaderboard, so the numbers here are estimates, not ground truth. This section explains how much to trust a result and when a difference between two runs is meaningful.
+
+**Self-reported results.** CI validates schema and path conventions but
+cannot verify that a run happened exactly as described. Treat every
+submission as coming from a good-faith contributor, but apply the
+guardrails below before drawing conclusions.
+
+**Uncertainty scales with sample size.** A reported accuracy is a point estimate with a
+confidence interval that depends on how many questions were tested. The Wilson score interval
+gives the range at 95% confidence; see [`METHODOLOGY.md`](METHODOLOGY.md) for the formula.
+Use at least 100 examples before drawing any conclusions, and 200+ before comparing two configurations.
+
+**Small observed differences are likely noise.** To reliably detect a
+real accuracy difference between two configurations (80% statistical
+power, α = 0.05), each configuration needs roughly:
+
+| Difference to detect | Examples needed per config |
+|---|---|
+| 5 pp (e.g., 85% vs 90%)  | ~680 |
+| 10 pp (e.g., 80% vs 90%) | ~200 |
+| 15 pp (e.g., 75% vs 90%) | ~100 |
+
+If the observed gap between two runs is smaller than the margin of error
+for either run, treat the results as a tie.
+
+**Evaluator noise sets a practical floor.** The grader LLM mis-grades
+approximately 1% of responses. This means a 1–2 percentage-point
+difference is indistinguishable from grader error alone, regardless of
+sample size. Differences smaller than ~2–3 pp are not actionable.
+
+**Cross-run comparison requires identical conditions.** Large
+sample sizes cannot save a comparison where any of the following differ
+between runs: LDR version, strategy, grader model, random seed, or
+search engine. Each combination of these factors is a distinct
+experimental condition. See [`METHODOLOGY.md`](METHODOLOGY.md) for
+detail and a pre-flight checklist.
+
+**Timing is environment-dependent.** Compare `avg_time_per_question`
+with caution across different hardware and network setups.
+
+**Contamination risk.** SimpleQA is publicly distributed and model
+providers may have trained on it. BrowseComp and xbench-DeepSearch
+mitigate this with encryption and per-question canary strings.
 
 ## Contributor attribution