feat(drift): retire fixtures whose bytes left upstream docs #108

Merged
paultyng merged 1 commit into main from feat/drift-retirement on May 26, 2026
@paultyng
Owner

Summary

Adds the second half of corpus lifecycle: fixtures get retired when their exact bytes are no longer present at the .source URL we recorded.

  • Step 5b walks every entry in CORPUS_SHAS, skipping those marked as seen during candidate processing. For each unseen sha, it reads the url: field from the sibling .source file, normalizes it (stripping a trailing .md, since vendor docs serve both forms), and retires the fixture only when that URL appears in FETCHED_URLS. The URL guard prevents false retirement when vendor docs restructure: the bytes might still exist at a new URL that the fetcher has not been taught about yet.
  • The same commit adds the new fixtures and `git rm`s the retired ones, so the drift PR is one atomic reconciliation.
  • Skip-creation guard now fires only when both ADDED_FILES and RETIRED_FILES are empty.
  • The PR body gains a ### Retired section listing each dropped fixture's last verified: date and source URL. Triage rules gain a default-accept rule for retirements, with an escape hatch for fixtures the vendor still accepts at runtime but no longer documents.
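
The skip-creation guard and the ### Retired section can be sketched roughly as follows. This is an illustration under assumptions, not the workflow's actual code: the helper names (`field`, `should_skip_pr`, `render_retired`), the newline-separated list arguments, the `<fixture>.source` naming, and the bullet layout are all hypothetical. Only ADDED_FILES, RETIRED_FILES, `url:`, and `verified:` come from this PR.

```shell
# field FILE KEY: print the first value recorded for "KEY:" in FILE.
# Assumes the .source file holds simple "key: value" lines.
field() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1
}

# should_skip_pr ADDED RETIRED: succeeds only when both lists are empty,
# mirroring the "both ADDED_FILES and RETIRED_FILES are empty" guard.
should_skip_pr() {
  [ -z "$1" ] && [ -z "$2" ]
}

# render_retired RETIRED: RETIRED is a newline-separated list of fixture paths.
# Prints nothing when the list is empty. The sibling .source file is assumed
# to live at "<fixture>.source" and must still exist when this runs.
render_retired() {
  local fixture verified url
  [ -n "$1" ] || return 0
  printf '### Retired\n\n'
  printf '%s\n' "$1" | while read -r fixture; do
    [ -n "$fixture" ] || continue
    verified="$(field "${fixture}.source" verified)"
    url="$(field "${fixture}.source" url)"
    printf '%s\n' "- $fixture (last verified: $verified, source: $url)"
  done
}
```

Keeping the guard as a single predicate over both lists means a run that only retires fixtures still opens a PR, which is the behavior change described above.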

Test plan

  • bash -n + shellcheck clean
  • Smoke-tested normalize_url + matching: no-suffix source URL matches .md-suffixed fetched URL ✓; missing URL correctly returns no match ✓
  • After merge: close chore(testdata): upstream-config drift detected (cursor) #107 (its corpus state is now stale), re-dispatch the drift workflow. Expect cursor to surface both Added (from upstream docs that drifted since 2026-05-20) and Retired (Phase 1 fixtures whose bytes no longer appear verbatim).
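
A smoke test along the lines of the second item might look like the sketch below. `normalize_url` is named in this PR, but its body here is a guess at the described behavior (strip one trailing `.md`); `url_was_fetched` and the example.com URLs are hypothetical.

```shell
# Strip a single trailing ".md" so the llms.txt form and the
# no-suffix form of the same page compare equal.
normalize_url() {
  printf '%s\n' "${1%.md}"
}

# url_was_fetched SOURCE_URL FETCHED_URL...: succeeds when SOURCE_URL matches
# any fetched URL after both sides are normalized. (Hypothetical helper.)
url_was_fetched() {
  local want fetched
  want="$(normalize_url "$1")"
  shift
  for fetched in "$@"; do
    if [ "$(normalize_url "$fetched")" = "$want" ]; then
      return 0
    fi
  done
  return 1
}

# No-suffix source URL against a .md-suffixed fetched URL: should match.
url_was_fetched 'https://docs.example.com/foo' 'https://docs.example.com/foo.md' \
  && echo 'match'

# URL that was never fetched: should not match.
url_was_fetched 'https://docs.example.com/gone' 'https://docs.example.com/foo.md' \
  || echo 'no match'
```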

🤖 Generated with Claude Code

Adds the second half of corpus lifecycle: fixtures are retired
when their exact bytes are no longer present at the .source URL
we recorded.

How it works:

- Step 5b runs after candidate processing. Walks every entry in
  CORPUS_SHAS, skipping shas that were observed in some fetched
  doc block this run (those get marked in SEEN_CORPUS_SHAS as a
  side effect of process_candidate's dedupe early-return).
- For each unseen corpus fixture, reads its sibling .source file,
  extracts the recorded `url:`. Only retires when that URL appears
  in FETCHED_URLS — the page must have actually been fetched this
  run, otherwise a vendor docs URL restructure would false-retire
  every fixture from the old path.
- Vendor docs serve content at both `…/foo` and `…/foo.md`
  (llms.txt uses the .md form; humans use the no-suffix form);
  normalize trailing .md on both sides before comparison.
- Retired fixtures are `git rm`'d, so the same commit that adds
  new fixtures also drops the dead ones. Skip-creation guard now
  fires only when both ADDED_FILES and RETIRED_FILES are empty.
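
The selection logic in the list above can be sketched as a small function. This is an illustration under stated assumptions rather than the script's real implementation: the corpus, seen, and fetched sets are passed as newline-separated strings (the real CORPUS_SHAS, SEEN_CORPUS_SHAS, and FETCHED_URLS may be stored differently), the sibling file is assumed to be `<fixture>.source`, and `in_list` and `find_retirable` are hypothetical names. The sketch only prints the paths; the real step would `git rm` them.

```shell
normalize_url() { printf '%s\n' "${1%.md}"; }

# in_list NEEDLE LIST: succeeds when NEEDLE is an exact line of LIST.
in_list() { printf '%s\n' "$2" | grep -Fxq -- "$1"; }

# find_retirable CORPUS SEEN FETCHED
#   CORPUS:  lines of "<sha> <fixture path>"
#   SEEN:    lines of shas observed in a fetched doc block this run
#   FETCHED: lines of already-normalized URLs fetched this run
# Prints one fixture path per line for every fixture that is safe to retire.
find_retirable() {
  local corpus="$1" seen="$2" fetched="$3" sha fixture url
  printf '%s\n' "$corpus" | while read -r sha fixture; do
    [ -n "$sha" ] || continue
    # Bytes still present upstream: keep the fixture.
    in_list "$sha" "$seen" && continue
    [ -f "${fixture}.source" ] || continue
    url="$(sed -n 's/^url:[[:space:]]*//p' "${fixture}.source" | head -n 1)"
    [ -n "$url" ] || continue
    # URL guard: retire only if the recorded page was actually fetched.
    in_list "$(normalize_url "$url")" "$fetched" || continue
    printf '%s\n' "$fixture"
  done
}
```

A fixture whose recorded URL was not fetched this run is deliberately left alone, since its absence from the fetched content says nothing about whether the bytes still exist upstream.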

PR body gains a `### Retired` section listing the dropped
fixtures with their last `verified:` date. Triage rules add a
"default to accepting the retirement" rule with an escape hatch
for fixtures the vendor still accepts at runtime but no longer
documents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paultyng paultyng marked this pull request as ready for review May 26, 2026 17:32
@paultyng paultyng enabled auto-merge (squash) May 26, 2026 17:33
@paultyng paultyng merged commit 8cb5309 into main May 26, 2026
3 checks passed
@paultyng paultyng deleted the feat/drift-retirement branch May 26, 2026 17:34