Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions ROADMAP.md
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,7 @@ In Progress

- [x] Refine conventional-search ranking and snippet behavior now that the first SearchKit backend works end to end.
- [x] Validate the current refinement pass against a broader checked-in fixture corpus with near-miss ranking and longer-body snippet cases.
- [ ] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.
- [x] Validate whether the current refinement pass is enough for ordinary app callers against larger real app corpora.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep real-app corpus validation open

This newly checked roadmap item claims the refinement pass has been validated against larger real app corpora, but the recorded audit only uses the first 100 rows from three public Hugging Face datasets and the new findings doc explicitly says it does not stand in for a real app's private corpus. When maintainers use the roadmap to decide what Milestone 4 work remains, this marks a validation gap as completed even though the documented evidence says the next signal is still a caller-owned corpus.

Useful? React with 👍 / 👎.

- [ ] Keep the public `FetchKitLibrary` surface polished as the conventional-search side moves from foundation into quality work.

### Tickets
Expand All @@ -192,10 +192,10 @@ In Progress
- [x] Add a second checked-in text source for corpus-based tests so fixture coverage is not only Gutenberg-derived.
- [x] Add a Hugging Face-derived audit micro-corpus that combines short stories, markdown reference records, and line-oriented literary text across the default in-memory and macOS SearchKit-backed paths.
- [x] Add an opt-in Hugging Face corpus audit lane that downloads bounded Dataset Viewer slices, indexes a larger temporary corpus locally, and reports ranking/snippet checks without making default CI network-dependent.
- [ ] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
- [x] Audit larger app-like corpus result quality now that field-aware ranking, compact all-term evidence, phrase weighting, truncation cues, multi-term snippets, and field-evidence metadata are in place.
- [ ] Keep the persistent `FetchKitLibrary` construction and search API surface under review as real callers exercise the current design.
- [ ] Explore an opt-in extended snippet surface that can use idle time to precompute short document summaries for larger records, with Apple's [`FoundationModels`](https://developer.apple.com/documentation/foundationmodels) or another local summarization path as the first candidate instead of making foreground full-text search wait on summarization.
- [ ] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.
- [x] Decide whether Core Data-backed test helpers should adopt explicit temporary-directory cleanup or keep relying on unique system temporary directories for short-lived local and CI runs.

### Exit Criteria

Expand Down
2 changes: 2 additions & 0 deletions docs/maintainers/fixture-corpus.md
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,8 @@ scripts/repo-maintenance/run-huggingface-corpus-audit.sh

The Dataset Viewer `/rows` endpoint caps `length` at 100, so the audit tool also caps each configured slice length at 100. If a private or rate-limited dataset is added later, the lane will use `HF_TOKEN` when present.

The first larger bounded run requested the cap of 100 rows from each configured dataset, indexed 209 usable records, and passed all five ranking/snippet probes. See [`huggingface-corpus-audit-findings.md`](huggingface-corpus-audit-findings.md) for the recorded output and maintainer decision.

## Hugging Face Dependency Boundary

Do not add a Hugging Face Swift dependency for the default fixture lane yet. The current checked-in fixture keeps CI deterministic and avoids adding a network, token, cache, or package-resolution requirement to ordinary tests.
Expand Down
46 changes: 46 additions & 0 deletions docs/maintainers/huggingface-corpus-audit-findings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
# Hugging Face Corpus Audit Findings

## 2026-05-31 Larger Bounded Slice

### Command

```bash
HF_CORPUS_AUDIT_TINYSTORIES_LENGTH=100 \
HF_CORPUS_AUDIT_SIMPLEWIKI_LENGTH=100 \
HF_CORPUS_AUDIT_POETRY_LENGTH=100 \
scripts/repo-maintenance/run-huggingface-corpus-audit.sh
```

### Corpus

The live audit lane downloaded the largest currently supported bounded Dataset Viewer slices from the three configured Hugging Face corpus families:

- `roneneldan/TinyStories`, `default`, `train`, offset `0`, length `100`
- `juno-labs/simple_wikipedia`, `default`, `train`, offset `0`, length `100`
- `biglam/gutenberg-poetry-corpus`, `default`, `train`, offset `0`, length `100`

The audit indexed `209` temporary `FetchDocumentRecord` values. The final document count is lower than the requested row count because the importer intentionally skips rows that cannot produce a usable title/body search record from the available dataset fields.

### Result

All five larger-slice quality checks passed:

```text
[pass] TinyStories sewing retrieval: hf-tinystories hf-tinystories-0 score=0.903 field=body snippet="...we can share the needle and fix your shirt." Together, they shared the needle and sewed the button on Lily's shirt. It"
[pass] TinyStories toy retrieval: hf-tinystories hf-tinystories-6 score=0.881 field=body snippet="...always sad because she lost her favorite toy, a triangle. She looked everywhere in her house but could not find it. On"
[pass] Simple Wikipedia calendar retrieval: hf-simplewiki hf-simplewiki-0 score=0.882 field=body snippet="...and in years immediately before leap years, [June](401) of the following year. In years immediately before common years"
[pass] Simple Wikipedia rhetoric retrieval: hf-simplewiki hf-simplewiki-18 score=0.885 field=body snippet="...Translated to English, _ad hominem_ means _against the person_. In other words, when someone makes an ad hominem, they "
[pass] Gutenberg poetry northland retrieval: hf-poetry hf-poetry-19-lines-36-47 score=0.942 field=body snippet="...the forests and the prairies, From the great lakes of the Northland, From the land of the Ojibways, From the land of th"
```

### Decision

The current `FetchKitLibrary` ranking and snippet behavior is good enough for the v1 conventional-search refinement milestone against this bounded live corpus. No ranking change, snippet redesign, or extended-snippet API has earned implementation from this audit alone.

Keep the live Hugging Face lane as an opt-in maintainer audit. Do not move it into default `swift test` or default GitHub CI while it depends on live network access, Hugging Face Dataset Viewer availability, and dataset field stability.

### Limits

This is a quality smoke audit, not a full relevance benchmark. It covers the first 100 rows requested from each configured dataset, the current five hand-authored probes, and the current importer field mapping. It does not stand in for a real app's private corpus, localized content, attachment-heavy records, or user-specific query logs.

The better next signal is a caller-owned corpus once a real app starts exercising the `FetchKitLibrary` facade. Until then, keep public API polish and construction/search ergonomics under review without adding a larger ranking or snippet surface speculatively.
Loading