docs: record Hugging Face corpus audit findings #24
Merged
@@ -0,0 +1,46 @@

# Hugging Face Corpus Audit Findings

## 2026-05-31 Larger Bounded Slice

### Command

```bash
HF_CORPUS_AUDIT_TINYSTORIES_LENGTH=100 \
HF_CORPUS_AUDIT_SIMPLEWIKI_LENGTH=100 \
HF_CORPUS_AUDIT_POETRY_LENGTH=100 \
scripts/repo-maintenance/run-huggingface-corpus-audit.sh
```

### Corpus

The live audit lane downloaded the largest currently supported bounded Dataset Viewer slices from the three configured Hugging Face corpus families:

- `roneneldan/TinyStories`, `default`, `train`, offset `0`, length `100`
- `juno-labs/simple_wikipedia`, `default`, `train`, offset `0`, length `100`
- `biglam/gutenberg-poetry-corpus`, `default`, `train`, offset `0`, length `100`

The audit indexed `209` temporary `FetchDocumentRecord` values. The final document count is lower than the requested row count because the importer intentionally skips rows that cannot produce a usable title/body search record from the available dataset fields.

### Result

All five larger-slice quality checks passed:

```text
[pass] TinyStories sewing retrieval: hf-tinystories hf-tinystories-0 score=0.903 field=body snippet="...we can share the needle and fix your shirt." Together, they shared the needle and sewed the button on Lily's shirt. It"
[pass] TinyStories toy retrieval: hf-tinystories hf-tinystories-6 score=0.881 field=body snippet="...always sad because she lost her favorite toy, a triangle. She looked everywhere in her house but could not find it. On"
[pass] Simple Wikipedia calendar retrieval: hf-simplewiki hf-simplewiki-0 score=0.882 field=body snippet="...and in years immediately before leap years, [June](401) of the following year. In years immediately before common years"
[pass] Simple Wikipedia rhetoric retrieval: hf-simplewiki hf-simplewiki-18 score=0.885 field=body snippet="...Translated to English, _ad hominem_ means _against the person_. In other words, when someone makes an ad hominem, they "
[pass] Gutenberg poetry northland retrieval: hf-poetry hf-poetry-19-lines-36-47 score=0.942 field=body snippet="...the forests and the prairies, From the great lakes of the Northland, From the land of the Ojibways, From the land of th"
```

### Decision

The current `FetchKitLibrary` ranking and snippet behavior is good enough for the v1 conventional-search refinement milestone against this bounded live corpus. No ranking change, snippet redesign, or extended-snippet API has earned implementation from this audit alone.

Keep the live Hugging Face lane as an opt-in maintainer audit. Do not move it into default `swift test` or default GitHub CI while it depends on live network access, Hugging Face Dataset Viewer availability, and dataset field stability.

### Limits

This is a quality smoke audit, not a full relevance benchmark. It covers the first 100 rows requested from each configured dataset, the current five hand-authored probes, and the current importer field mapping. It does not stand in for a real app's private corpus, localized content, attachment-heavy records, or user-specific query logs.

The better next signal is a caller-owned corpus once a real app starts exercising the `FetchKitLibrary` facade. Until then, keep public API polish and construction/search ergonomics under review without adding a larger ranking or snippet surface speculatively.
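
For readers unfamiliar with the bounded slices listed under Corpus, the sketch below shows how a dataset, config, split, offset, and length tuple could map onto a Hugging Face Dataset Viewer `/rows` request. It is an illustration only: the PR does not show how `run-huggingface-corpus-audit.sh` issues its requests, and the helper `rows_url` and the variables `HF_VIEWER_BASE` and `HF_VIEWER_MAX_LENGTH` are hypothetical names introduced here. The 100-row cap reflects the Dataset Viewer's per-request limit, which matches the "largest currently supported" slice length recorded above.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not taken from the audit script in this PR.

# Assumed values: public Dataset Viewer endpoint and its per-request row cap.
HF_VIEWER_BASE="https://datasets-server.huggingface.co"
HF_VIEWER_MAX_LENGTH=100

# rows_url DATASET CONFIG SPLIT OFFSET LENGTH
# Prints the /rows URL for one bounded slice, or fails if LENGTH is out of range.
rows_url() {
  local dataset="$1" config="$2" split="$3" offset="$4" length="$5"

  # Reject non-numeric lengths and anything outside 1..HF_VIEWER_MAX_LENGTH.
  if ! [[ "$length" =~ ^[0-9]+$ ]] ||
     (( 10#$length < 1 || 10#$length > HF_VIEWER_MAX_LENGTH )); then
    echo "length must be between 1 and ${HF_VIEWER_MAX_LENGTH}" >&2
    return 1
  fi

  # Percent-encode the slash in "owner/name" dataset identifiers.
  local encoded="${dataset//\//%2F}"

  printf '%s/rows?dataset=%s&config=%s&split=%s&offset=%s&length=%s\n' \
    "$HF_VIEWER_BASE" "$encoded" "$config" "$split" "$offset" "$length"
}

# Print the URL for each of the three slices recorded in the findings doc.
for dataset in \
  roneneldan/TinyStories \
  juno-labs/simple_wikipedia \
  biglam/gutenberg-poetry-corpus; do
  rows_url "$dataset" default train 0 100
done
```

Validating the length before building the URL keeps an over-large request from reaching the network at all, which suits a lane that is deliberately opt-in and dependent on live service availability.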
This newly checked roadmap item claims the refinement pass has been validated against larger real app corpora, but the recorded audit only uses the first 100 rows from three public Hugging Face datasets and the new findings doc explicitly says it does not stand in for a real app's private corpus. When maintainers use the roadmap to decide what Milestone 4 work remains, this marks a validation gap as completed even though the documented evidence says the next signal is still a caller-owned corpus.