From f591b620b1b2c24cc328fb9d8903f0f1255f3661 Mon Sep 17 00:00:00 2001
From: Gale W
Date: Sun, 31 May 2026 12:51:18 -0400
Subject: [PATCH] docs: record Hugging Face corpus audit findings

---
 ROADMAP.md                                    |  6 +--
 docs/maintainers/fixture-corpus.md            |  2 +
 .../huggingface-corpus-audit-findings.md      | 46 +++++++++++++++++++
 3 files changed, 51 insertions(+), 3 deletions(-)
 create mode 100644 docs/maintainers/huggingface-corpus-audit-findings.md

diff --git a/ROADMAP.md b/ROADMAP.md
index bc05588..b334ef8 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -178,7 +178,7 @@ In Progress
 
 - [x] Refine conventional-search ranking and snippet behavior now that the first SearchKit backend works end to end.
 - [x] Validate the current refinement pass against a broader checked-in fixture corpus with near-miss ranking and longer-body snippet cases.
-- [ ] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
+- [x] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
 - [ ] Keep the public `FetchKitLibrary` surface polished as the conventional-search side moves from foundation into quality work.
 
 ### Tickets
@@ -192,10 +192,10 @@ In Progress
 - [x] Add a second checked-in text source for corpus-based tests so fixture coverage is not only Gutenberg-derived.
 - [x] Add a Hugging Face-derived audit micro-corpus that combines short stories, markdown reference records, and line-oriented literary text across the default in-memory and macOS SearchKit-backed paths.
 - [x] Add an opt-in Hugging Face corpus audit lane that downloads bounded Dataset Viewer slices, indexes a larger temporary corpus locally, and reports ranking/snippet checks without making default CI network-dependent.
-- [ ] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
+- [x] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
 - [ ] Keep the persistent `FetchKitLibrary` construction and search API surface under review as real callers exercise the current design.
 - [ ] Explore an opt-in extended snippet surface that can use idle time to precompute short document summaries for larger records, with Apple's [`FoundationModels`](https://developer.apple.com/documentation/foundationmodels) or another local summarization path as the first candidate instead of making foreground full-text search wait on summarization.
-- [ ] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.
+- [x] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.
 
 ### Exit Criteria
 
diff --git a/docs/maintainers/fixture-corpus.md b/docs/maintainers/fixture-corpus.md
index 8320be8..e106553 100644
--- a/docs/maintainers/fixture-corpus.md
+++ b/docs/maintainers/fixture-corpus.md
@@ -69,6 +69,8 @@ scripts/repo-maintenance/run-huggingface-corpus-audit.sh
 
 The Dataset Viewer `/rows` endpoint caps `length` at 100, so the audit tool also caps each configured slice length at 100. If a private or rate-limited dataset is added later, the lane will use `HF_TOKEN` when present.
 
+The first larger bounded run requested the cap of 100 rows from each configured dataset, indexed 209 usable records, and passed all five ranking/snippet probes. See [`huggingface-corpus-audit-findings.md`](huggingface-corpus-audit-findings.md) for the recorded output and maintainer decision.
+
 ## Hugging Face Dependency Boundary
 
 Do not add a Hugging Face Swift dependency for the default fixture lane yet. The current checked-in fixture keeps CI deterministic and avoids adding a network, token, cache, or package-resolution requirement to ordinary tests.
diff --git a/docs/maintainers/huggingface-corpus-audit-findings.md b/docs/maintainers/huggingface-corpus-audit-findings.md
new file mode 100644
index 0000000..0e93f60
--- /dev/null
+++ b/docs/maintainers/huggingface-corpus-audit-findings.md
@@ -0,0 +1,46 @@
+# Hugging Face Corpus Audit Findings
+
+## 2026-05-31 Larger Bounded Slice
+
+### Command
+
+```bash
+HF_CORPUS_AUDIT_TINYSTORIES_LENGTH=100 \
+HF_CORPUS_AUDIT_SIMPLEWIKI_LENGTH=100 \
+HF_CORPUS_AUDIT_POETRY_LENGTH=100 \
+scripts/repo-maintenance/run-huggingface-corpus-audit.sh
+```
+
+### Corpus
+
+The live audit lane downloaded the largest currently supported bounded Dataset Viewer slices from the three configured Hugging Face corpus families:
+
+- `roneneldan/TinyStories`, `default`, `train`, offset `0`, length `100`
+- `juno-labs/simple_wikipedia`, `default`, `train`, offset `0`, length `100`
+- `biglam/gutenberg-poetry-corpus`, `default`, `train`, offset `0`, length `100`
+
+The audit indexed `209` temporary `FetchDocumentRecord` values. The final document count is lower than the requested row count because the importer intentionally skips rows that cannot produce a usable title/body search record from the available dataset fields.
+
+### Result
+
+All five larger-slice quality checks passed:
+
+```text
+[pass] TinyStories sewing retrieval: hf-tinystories hf-tinystories-0 score=0.903 field=body snippet="...we can share the needle and fix your shirt." Together, they shared the needle and sewed the button on Lily's shirt. It"
+[pass] TinyStories toy retrieval: hf-tinystories hf-tinystories-6 score=0.881 field=body snippet="...always sad because she lost her favorite toy, a triangle. She looked everywhere in her house but could not find it. On"
+[pass] Simple Wikipedia calendar retrieval: hf-simplewiki hf-simplewiki-0 score=0.882 field=body snippet="...and in years immediately before leap years, [June](401) of the following year. In years immediately before common years"
+[pass] Simple Wikipedia rhetoric retrieval: hf-simplewiki hf-simplewiki-18 score=0.885 field=body snippet="...Translated to English, _ad hominem_ means _against the person_. In other words, when someone makes an ad hominem, they "
+[pass] Gutenberg poetry northland retrieval: hf-poetry hf-poetry-19-lines-36-47 score=0.942 field=body snippet="...the forests and the prairies, From the great lakes of the Northland, From the land of the Ojibways, From the land of th"
+```
+
+### Decision
+
+The current `FetchKitLibrary` ranking and snippet behavior is good enough for the v1 conventional-search refinement milestone against this bounded live corpus. No ranking change, snippet redesign, or extended-snippet API has earned implementation from this audit alone.
+
+Keep the live Hugging Face lane as an opt-in maintainer audit. Do not move it into default `swift test` or default GitHub CI while it depends on live network access, Hugging Face Dataset Viewer availability, and dataset field stability.
+
+### Limits
+
+This is a quality smoke audit, not a full relevance benchmark. It covers the first 100 rows requested from each configured dataset, the current five hand-authored probes, and the current importer field mapping. It does not stand in for a real app's private corpus, localized content, attachment-heavy records, or user-specific query logs.
+
+The better next signal is a caller-owned corpus once a real app starts exercising the `FetchKitLibrary` facade. Until then, keep public API polish and construction/search ergonomics under review without adding a larger ranking or snippet surface speculatively.
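
Note on the `/rows` cap mentioned in `fixture-corpus.md`: a minimal sketch of how a slice length could be clamped to 100 before building a Dataset Viewer request. The function names `clamp_length` and `rows_url` and the variable `MAX_ROWS` are hypothetical and are not taken from `run-huggingface-corpus-audit.sh`. The URL shape assumes the public Dataset Viewer endpoint at `datasets-server.huggingface.co`, and the `default`/`train` values mirror the config and split listed in the findings file.

```bash
# Hypothetical helpers for illustration; the real audit script may differ.
MAX_ROWS=100  # cap on `length`, as described in fixture-corpus.md

# Clamp a requested slice length into the range 1..MAX_ROWS.
# Non-numeric or non-positive input falls back to 1.
clamp_length() {
  local requested="$1"
  if ! [[ "$requested" =~ ^[0-9]+$ ]] || (( 10#$requested < 1 )); then
    echo 1
  elif (( 10#$requested > MAX_ROWS )); then
    echo "$MAX_ROWS"
  else
    echo "$(( 10#$requested ))"
  fi
}

# Build a /rows URL for one dataset slice (config and split are assumed).
rows_url() {
  local dataset="$1" offset="$2" length
  length="$(clamp_length "$3")"
  # Percent-encode the "/" in the dataset id for use in a query string.
  printf 'https://datasets-server.huggingface.co/rows?dataset=%s&config=default&split=train&offset=%s&length=%s\n' \
    "${dataset//\//%2F}" "$offset" "$length"
}
```

For example, `rows_url roneneldan/TinyStories 0 500` yields a URL ending in `length=100`. If `HF_TOKEN` is set, the result could be fetched with `curl -fsSL -H "Authorization: Bearer $HF_TOKEN"`, which is one way to cover the private or rate-limited case the docs describe.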