feat(drift): retire fixtures whose bytes left upstream docs #108
Merged
Adds the second half of corpus lifecycle: fixtures are retired when their exact bytes are no longer present at the .source URL we recorded.

How it works:

- Step 5b runs after candidate processing. It walks every entry in CORPUS_SHAS, skipping shas that were observed in some fetched doc block this run (those get marked in SEEN_CORPUS_SHAS as a side effect of process_candidate's dedupe early-return).
- For each unseen corpus fixture, it reads the sibling .source file and extracts the recorded `url:`. It only retires when that URL appears in FETCHED_URLS: the page must have actually been fetched this run, otherwise a vendor docs URL restructure would false-retire every fixture from the old path.
- Vendor docs serve content at both `…/foo` and `…/foo.md` (llms.txt uses the .md form; humans use the no-suffix form), so a trailing .md is normalized away on both sides before comparison.
- Retired fixtures are `git rm`'d, so the same commit that adds new fixtures also drops the dead ones.

Skip-creation guard now fires only when both ADDED_FILES and RETIRED_FILES are empty.

PR body gains a `### Retired` section listing the dropped fixtures with their last `verified:` date.

Triage rules add a "default to accepting the retirement" rule with an escape hatch for fixtures the vendor still accepts at runtime but no longer documents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
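The retirement decision described in the commit message can be sketched roughly as follows. This is an illustration, not the script from this PR: `normalize_url`, `FETCHED_URLS`, `SEEN_CORPUS_SHAS`, and the `url:` field come from the description, while `source_url`, `should_retire`, the newline-separated list layout, and the exact `url: <value>` line format are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the Step 5b retirement check.
# Assumption: both lists are newline-separated strings.
SEEN_CORPUS_SHAS=${SEEN_CORPUS_SHAS:-}
FETCHED_URLS=${FETCHED_URLS:-}

# Strip one trailing ".md" so .../foo and .../foo.md compare equal.
normalize_url() {
  printf '%s\n' "${1%.md}"
}

# Print the first recorded url: value from a .source file.
# Assumption: the field is stored as a line of the form "url: <value>".
source_url() {
  sed -n 's/^url:[[:space:]]*//p' "$1" | head -n 1
}

# should_retire SHA SOURCE_FILE
# Returns 0 only when the sha was not seen this run AND its recorded
# URL was actually fetched this run; returns 1 otherwise.
should_retire() {
  local sha=$1 source_file=$2 url fetched

  # Seen in a fetched doc block this run: the bytes are still upstream.
  if printf '%s\n' "$SEEN_CORPUS_SHAS" | grep -Fxq -- "$sha"; then
    return 1
  fi

  # No recorded URL: nothing to compare against, so keep the fixture.
  url=$(source_url "$source_file")
  [ -n "$url" ] || return 1
  url=$(normalize_url "$url")

  # Retire only if the page itself was fetched this run. This is the
  # guard against a docs restructure false-retiring the old path.
  while IFS= read -r fetched; do
    [ -n "$fetched" ] || continue
    if [ "$(normalize_url "$fetched")" = "$url" ]; then
      return 0
    fi
  done <<EOF
$FETCHED_URLS
EOF
  return 1
}
```

A caller would loop over the corpus shas and run `git rm` on each fixture for which `should_retire` succeeds. Defaulting to "keep" whenever the URL is missing or was not fetched mirrors the conservative behaviour the description asks for.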
Summary
Adds the second half of corpus lifecycle: fixtures get retired when their exact bytes are no longer present at the `.source` URL we recorded.

- Step 5b walks `CORPUS_SHAS`, skipping the ones marked seen during candidate processing. For unseen shas, it reads the sibling `.source` file's `url:`, normalizes it (strips a trailing `.md`, since vendor docs serve both forms), and retires only when the URL appears in `FETCHED_URLS`. The URL guard prevents false retirement when vendor docs restructure (the bytes might still exist at a new URL we haven't taught the fetcher about yet).
- Retired fixtures are `git rm`'d, so the drift PR is one atomic reconciliation.
- The skip-creation guard now fires only when both `ADDED_FILES` and `RETIRED_FILES` are empty.
- The PR body gains a `### Retired` section with each dropped fixture's last `verified:` date + source URL. Triage rules grow a default-accept rule with an escape hatch.

Test plan

- `bash -n` + `shellcheck` clean
- `normalize_url` + matching: no-suffix source URL matches `.md`-suffixed fetched URL ✓; missing URL correctly returns no match ✓

🤖 Generated with Claude Code
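For reviewers, the skip-creation guard and the `### Retired` section from the summary might look something like the sketch below. Only `ADDED_FILES`, `RETIRED_FILES`, and the `url:` / `verified:` fields are taken from the description. The helper names, the newline-separated list layout, the `<fixture>.source` sibling naming, and the bullet wording are all assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the PR-body side of retirement.
# Assumption: both lists are newline-separated fixture paths.
ADDED_FILES=${ADDED_FILES:-}
RETIRED_FILES=${RETIRED_FILES:-}

# Succeeds only when there is nothing to add and nothing to retire,
# i.e. when creating a drift PR should be skipped.
nothing_to_do() {
  [ -z "$ADDED_FILES" ] && [ -z "$RETIRED_FILES" ]
}

# source_field FIELD FILE: print the first "FIELD: value" entry.
# Assumption: .source files store one "key: value" pair per line.
source_field() {
  sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1
}

# Emit the "### Retired" section; prints nothing if no fixture retired.
# Assumption: metadata is read before the sibling .source file is removed.
render_retired_section() {
  local fixture verified url
  [ -n "$RETIRED_FILES" ] || return 0
  printf '%s\n\n' '### Retired'
  while IFS= read -r fixture; do
    [ -n "$fixture" ] || continue
    verified=$(source_field verified "$fixture.source")
    url=$(source_field url "$fixture.source")
    printf '%s\n' "- \`$fixture\` (last verified: $verified, source: $url)"
  done <<EOF
$RETIRED_FILES
EOF
}
```

Rendering the section from the `.source` metadata keeps the retirement evidence (last verification date and source URL) next to each dropped fixture, which is what the triage rule needs in order to default to accepting the retirement.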