From f591b620b1b2c24cc328fb9d8903f0f1255f3661 Mon Sep 17 00:00:00 2001
From: Gale W <mail@galewilliams.com>
Date: Sun, 31 May 2026 12:51:18 -0400
Subject: [PATCH] docs: record Hugging Face corpus audit findings

---
 ROADMAP.md                                    |  6 +--
 docs/maintainers/fixture-corpus.md            |  2 +
 .../huggingface-corpus-audit-findings.md      | 46 +++++++++++++++++++
 3 files changed, 51 insertions(+), 3 deletions(-)
 create mode 100644 docs/maintainers/huggingface-corpus-audit-findings.md

diff --git a/ROADMAP.md b/ROADMAP.md
index bc05588..b334ef8 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -178,7 +178,7 @@ In Progress
 
 - [x] Refine conventional-search ranking and snippet behavior now that the first SearchKit backend works end to end.
 - [x] Validate the current refinement pass against a broader checked-in fixture corpus with near-miss ranking and longer-body snippet cases.
-- [ ] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
+- [x] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
 - [ ] Keep the public `FetchKitLibrary` surface polished as the conventional-search side moves from foundation into quality work.
 
 ### Tickets
@@ -192,10 +192,10 @@ In Progress
 - [x] Add a second checked-in text source for corpus-based tests so fixture coverage is not only Gutenberg-derived.
 - [x] Add a Hugging Face-derived audit micro-corpus that combines short stories, markdown reference records, and line-oriented literary text across the default in-memory and macOS SearchKit-backed paths.
 - [x] Add an opt-in Hugging Face corpus audit lane that downloads bounded Dataset Viewer slices, indexes a larger temporary corpus locally, and reports ranking/snippet checks without making default CI network-dependent.
-- [ ] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
+- [x] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
 - [ ] Keep the persistent `FetchKitLibrary` construction and search API surface under review as real callers exercise the current design.
 - [ ] Explore an opt-in extended snippet surface that can use idle time to precompute short document summaries for larger records, with Apple's [`FoundationModels`](https://developer.apple.com/documentation/foundationmodels) or another local summarization path as the first candidate instead of making foreground full-text search wait on summarization.
-- [ ] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.
+- [x] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.
 
 ### Exit Criteria
 
diff --git a/docs/maintainers/fixture-corpus.md b/docs/maintainers/fixture-corpus.md
index 8320be8..e106553 100644
--- a/docs/maintainers/fixture-corpus.md
+++ b/docs/maintainers/fixture-corpus.md
@@ -69,6 +69,8 @@ scripts/repo-maintenance/run-huggingface-corpus-audit.sh
 
 The Dataset Viewer `/rows` endpoint caps `length` at 100, so the audit tool also caps each configured slice length at 100. If a private or rate-limited dataset is added later, the lane will use `HF_TOKEN` when present.
 
+The first larger bounded run requested the maximum of 100 rows from each configured dataset, indexed 209 usable records, and passed all five ranking/snippet probes. See [`huggingface-corpus-audit-findings.md`](huggingface-corpus-audit-findings.md) for the recorded output and maintainer decision.
+
 ## Hugging Face Dependency Boundary
 
 Do not add a Hugging Face Swift dependency for the default fixture lane yet. The current checked-in fixture keeps CI deterministic and avoids adding a network, token, cache, or package-resolution requirement to ordinary tests.
diff --git a/docs/maintainers/huggingface-corpus-audit-findings.md b/docs/maintainers/huggingface-corpus-audit-findings.md
new file mode 100644
index 0000000..0e93f60
--- /dev/null
+++ b/docs/maintainers/huggingface-corpus-audit-findings.md
@@ -0,0 +1,46 @@
+# Hugging Face Corpus Audit Findings
+
+## 2026-05-31 Larger Bounded Slice
+
+### Command
+
+```bash
+HF_CORPUS_AUDIT_TINYSTORIES_LENGTH=100 \
+HF_CORPUS_AUDIT_SIMPLEWIKI_LENGTH=100 \
+HF_CORPUS_AUDIT_POETRY_LENGTH=100 \
+scripts/repo-maintenance/run-huggingface-corpus-audit.sh
+```
+
+### Corpus
+
+The live audit lane downloaded the largest currently supported bounded Dataset Viewer slices from the three configured Hugging Face corpus families:
+
+- `roneneldan/TinyStories`, `default`, `train`, offset `0`, length `100`
+- `juno-labs/simple_wikipedia`, `default`, `train`, offset `0`, length `100`
+- `biglam/gutenberg-poetry-corpus`, `default`, `train`, offset `0`, length `100`
+
+The audit indexed `209` temporary `FetchDocumentRecord` values. That count is lower than the 300 rows requested (100 from each of the three datasets) because the importer intentionally skips rows that cannot produce a usable title/body search record from the available dataset fields.
+
+### Result
+
+All five larger-slice quality checks passed:
+
+```text
+[pass] TinyStories sewing retrieval: hf-tinystories hf-tinystories-0 score=0.903 field=body snippet="...we can share the needle and fix your shirt."  Together, they shared the needle and sewed the button on Lily's shirt. It"
+[pass] TinyStories toy retrieval: hf-tinystories hf-tinystories-6 score=0.881 field=body snippet="...always sad because she lost her favorite toy, a triangle. She looked everywhere in her house but could not find it.  On"
+[pass] Simple Wikipedia calendar retrieval: hf-simplewiki hf-simplewiki-0 score=0.882 field=body snippet="...and in years immediately before leap years, [June](401) of the following year. In years immediately before common years"
+[pass] Simple Wikipedia rhetoric retrieval: hf-simplewiki hf-simplewiki-18 score=0.885 field=body snippet="...Translated to English, _ad hominem_ means _against the person_. In other words, when someone makes an ad hominem, they "
+[pass] Gutenberg poetry northland retrieval: hf-poetry hf-poetry-19-lines-36-47 score=0.942 field=body snippet="...the forests and the prairies, From the great lakes of the Northland, From the land of the Ojibways, From the land of th"
+```
+
+### Decision
+
+The current `FetchKitLibrary` ranking and snippet behavior is good enough for the v1 conventional-search refinement milestone against this bounded live corpus. This audit alone does not justify implementing a ranking change, a snippet redesign, or an extended-snippet API.
+
+Keep the live Hugging Face lane as an opt-in maintainer audit. Do not move it into default `swift test` or default GitHub CI while it depends on live network access, Hugging Face Dataset Viewer availability, and dataset field stability.
+
+### Limits
+
+This is a quality smoke audit, not a full relevance benchmark. It covers the first 100 rows requested from each configured dataset, the current five hand-authored probes, and the current importer field mapping. It does not stand in for a real app's private corpus, localized content, attachment-heavy records, or user-specific query logs.
+
+The better next signal is a caller-owned corpus once a real app starts exercising the `FetchKitLibrary` facade. Until then, keep public API polish and construction/search ergonomics under review without adding a larger ranking or snippet surface speculatively.