gitronald · May 31, 2026
diff --git a/docs/guides/fixture-corpus.md b/docs/guides/fixture-corpus.md
@@ -101,13 +101,15 @@ is unseen elsewhere. It is the conservative middle ground between the pair-set l
 
 ## Tooling
 
-Two scripts implement the analysis. Both run through `uv`.
+Three scripts, all run through `uv`. They are **report-only**: none mutates the
+corpus or auto-recommends deletions. The plan-032 drop decision is already applied, and
+`scripts/build_fixture_corpus.py` is the one-time builder that produced `serps.json.bz2`.
 
 ### `scripts/profile_fixture_corpus.py`
 
 Parses every record and reports per-record provenance, `main_layout`, fired feature
 flags, the pair-set signature, parse errors, and corpus-wide rarity (which pairs /
-layouts have only 1–2 carriers) plus subset-coverage.
+layouts have only 1–2 carriers).
 
 ```bash
 uv run python scripts/profile_fixture_corpus.py            # human-readable report
@@ -116,15 +118,29 @@ uv run python scripts/profile_fixture_corpus.py --json     # machine-readable
 
 ### `scripts/compare_drop_signatures.py`
 
-Applies the signature readings above to the drop candidates, answers "does any
-candidate have a unique signature?" under all three readings, and prints the **final
-drop list under the distinct-type bar** — each dropped record annotated with the
-surviving record that preserves its signature.
+Reports the three signature readings and surfaces **distinct-type signature
+clusters** (records sharing a component sequence) for human review.
+
+It does NOT recommend drops, by design. A shared `(type, sub_type)` signature is
+blind to details-level structure: e.g. two `ai_overview/sectioned` records can
+differ in section *count* (1 vs 3), and `test_ai_overview_legacy_sge.py` depends on
+the multi-section one specifically. Always confirm at the details level — and check
+the query-keyed tests — before treating a cluster as redundant.
 
 ```bash
 uv run python scripts/compare_drop_signatures.py
 ```
 
+### `scripts/verify_drops.py`
+
+Corpus-integrity guard: confirms the 8 plan-032 drops are absent, serp_ids are
+unique, every record carries a `note`, the three witnessed layouts survive, and
+every parsed `(type, sub_type)` has a carrier. Exits non-zero on failure (CI-usable).
+
+```bash
+uv run python scripts/verify_drops.py
+```
+
 ## The `note` field
 
 Surviving records should carry a `note` mirroring the curated format — a provenance
@@ -135,19 +151,23 @@ clause plus what the record contributes:
 ```
 
 e.g. `"Bulk corpus capture, v0.6.7a0 crawl 2026-02-06. Sole carrier of
-knowledge/unit_converter."` Keep the provenance clause free of private-repo names
-and any embedded crawl tokens / IPs (scrub `GOOGLE_ABUSE_EXEMPTION` URLs).
+knowledge/unit_converter."` Keep the provenance clause free of private-repo names.
+One record (`7049404a2dd6`) intentionally retains a `GOOGLE_ABUSE_EXEMPTION` URL
+token as an artifact of how the crawler obtained an abuse exemption.
 
 ## Reproducing the corpus assessment
 
 ```bash
-# 1. Profile every record and find unique contributors + redundancy
+# 1. Profile every record: provenance, layouts, unique contributors, rarity
 uv run python scripts/profile_fixture_corpus.py
 
-# 2. Compute the drop list under the distinct-type-order bar
+# 2. Review distinct-type signature clusters (potential redundancy)
 uv run python scripts/compare_drop_signatures.py
 
-# 3. After any change, confirm tests + that no unique pair/layout was lost
+# 3. Corpus-integrity guard (drops absent, notes present, layouts + coverage intact)
+uv run python scripts/verify_drops.py
+
+# 4. After any change, confirm the tests pass
 uv run pytest tests/test_parse_serp.py tests/test_parser_coverage.py \
   tests/test_ai_overview_legacy_sge.py -q
 ```
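
Review note: the guide describes `scripts/verify_drops.py` only in prose, and the script itself is not part of this diff. As a rough illustration of three of the listed checks (drops absent, unique `serp_id`s, a `note` on every record), here is a minimal sketch. The function name `check_corpus`, the in-memory sample records, and the message wording are all hypothetical, and the real script may be structured quite differently.

```python
from collections import Counter

# Two of the plan-032 drop IDs listed in build_fixture_corpus.py (12-char prefixes).
DROPPED_IDS = {"97404b7b7c61", "45b6e019bfa2"}


def check_corpus(records: list[dict], dropped: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    short_ids = [r["serp_id"][:12] for r in records]

    # 1. No dropped record may reappear in the corpus.
    resurrected = sorted(set(short_ids) & dropped)
    if resurrected:
        failures.append(f"dropped records present: {resurrected}")

    # 2. Full serp_ids must be unique.
    counts = Counter(r["serp_id"] for r in records)
    dupes = sorted(sid for sid, n in counts.items() if n > 1)
    if dupes:
        failures.append(f"duplicate serp_ids: {dupes}")

    # 3. Every record needs a non-empty note.
    missing = [r["serp_id"][:12] for r in records if not r.get("note")]
    if missing:
        failures.append(f"records without a note: {missing}")

    return failures


# Hypothetical records standing in for lines of serps.json.bz2.
sample = [
    {"serp_id": "0123456789abcdef", "note": "Corpus capture. Standard organic-results SERP."},
    {"serp_id": "fedcba9876543210", "note": ""},
]
for problem in check_corpus(sample, DROPPED_IDS):
    print("FAIL:", problem)
```

A CI wrapper would exit non-zero whenever the returned list is non-empty, which matches the "exits non-zero on failure" behavior the guide describes.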
diff --git a/docs/plans/032-fixture-corpus-notes-and-pruning.md b/docs/plans/032-fixture-corpus-notes-and-pruning.md
@@ -1,9 +1,9 @@
 ---
-status: draft
+status: done
 branch: feature/fixture-corpus-notes
 created: 2026-05-31T11:32:26-07:00
-completed:
-pr:
+completed: 2026-05-31T12:42:13-07:00
+pr: https://github.com/gitronald/WebSearcher/pull/143
 ---
 
 # Annotate and prune the bulk SERP fixture corpus
@@ -244,8 +244,9 @@ These records are committed to a **public** repo. While rewriting the files:
 
 - Drop the private-repo name from any note text (the curated v0.7.2 notes currently
   say "Captured via SearchAudits ...") — use a neutral provenance clause.
-- Scrub the `GOOGLE_ABUSE_EXEMPTION` token + embedded IP from the `url` of
-  `serps-v0.6.7` record `7049404a2dd6` (mooted if that file is retired).
+- **Decided: keep** the `GOOGLE_ABUSE_EXEMPTION` token (with its embedded client IP)
+  on record `7049404a2dd6` as a deliberate artifact — it documents how the crawler
+  obtained an abuse exemption, which is worth preserving. Not scrubbed.
 
 ### 5. Consolidate into `serps.json.bz2` and update loaders
 
@@ -309,3 +310,46 @@ than retiring the file wholesale; consolidate into one version-less file.
 - **Should the freed budget be backfilled now** with new diverse captures, or left
   for a follow-up plan? Recommended: follow-up — keep this plan scoped to
   annotate + prune.
+
+## Log
+
+- **2026-05-31** — Implemented as specified across five commits:
+  - `ca43a31` — committed the methodology artifacts (fixture-corpus guide, this plan,
+    `scripts/profile_fixture_corpus.py`, `scripts/compare_drop_signatures.py`).
+  - `f5b6eb4` — recorded the Pass-B drop verification and added `scripts/verify_drops.py`
+    (per-drop coverage check against the whole surviving corpus: 0 drop-only items).
+  - `30dcaef` — consolidated all seven version-named fixtures into a single
+    `tests/fixtures/serps.json.bz2` (renamed from `serps-v0.6.8`, the six others deleted),
+    added a per-record `note` to every survivor via `scripts/build_fixture_corpus.py`,
+    dropped the 8 redundant records from §3, deleted their orphaned snapshots, and
+    `--snapshot-update`'d the ~22 newly-included parser-coverage/sge records. Repointed
+    the three test loaders at the single file.
+  - `7426854` — repointed `profile_fixture_corpus.py` / `compare_drop_signatures.py` at
+    `serps.json.bz2` and made them report-only.
+  - `4cc6aca` — clarified the specialized-components label in the cluster report.
+- **Final corpus:** 80 records in one `serps.json.bz2` (~21 MB). The 8 drops match the
+  §3 table exactly (5 sky-blue, 1 streaming, 1 paragraph-query, 1 college).
+- **Verify suite (§6) green** on close: `149 passed, 80 snapshots passed` across
+  `test_parse_serp`, `test_parser_coverage`, `test_ai_overview_legacy_sge`,
+  `test_extractor_serp_features`; `test_features_expose_main_layout` included, no
+  orphaned or missing snapshots, every single-carrier `(type, sub_type)` pair retained.
+- **2026-05-31** — Closed: PR #143 merged into `feature/v0.9.0` (the 0.9.0 integration
+  branch).
+
+## Retrospective
+
+- The two-pass drop policy (mechanical signature screen, then adversarial value-level
+  `details` rescue) earned its keep: the unordered pair-set screen flagged ~25 subset
+  candidates, but the decided distinct-type-order bar plus the rescue pass cut that to
+  exactly 8, protecting roughly 17 records the cruder screen would have over-dropped.
+- The one genuinely load-bearing constraint was mechanical, not structural: keeping
+  `f6fae1c9a96e` because it is the sole `standard-overview` carrier *within the
+  `serps-v*` glob* that `test_features_expose_main_layout` asserts against. Worth
+  re-checking that assertion's source set first on any future fixture surgery.
+- Consolidating to one version-less file widened `test_parse_serp` from the `serps-v*`
+  subset to the whole corpus (+~22 snapshots). That was the right default — provenance
+  already lives per-record in the `version` field, so the filename carried no
+  information the JSON didn't.
+- This branch sat fully implemented but with `status: draft` and no PR — the work and
+  the bookkeeping had drifted apart. Closing it promptly after the last impl commit
+  would have avoided the "did we execute this?" ambiguity.
diff --git a/scripts/build_fixture_corpus.py b/scripts/build_fixture_corpus.py
@@ -0,0 +1,154 @@
+"""Build the consolidated SERP fixture corpus: tests/fixtures/serps.json.bz2.
+
+Merges the seven version-named bz2 fixtures into one, dropping the 8 verified-
+redundant records (plan 032), scrubbing private-repo names from notes, and adding a
+`note` to every survivor. Notes are generated from the POST-DROP corpus so "sole
+carrier" claims are accurate; curated notes are preserved (only scrubbed). The one
+surviving google_abuse exemption token is kept intact as a deliberate artifact. The
+HTML field is written last on each line for readability.
+
+Usage:
+    uv run python scripts/build_fixture_corpus.py            # write the file
+    uv run python scripts/build_fixture_corpus.py --check    # report, do not write
+"""
+
+import argparse
+import bz2
+from collections import Counter, defaultdict
+from pathlib import Path
+
+import orjson
+
+import WebSearcher as ws
+
+FIXTURES_DIR = Path("tests/fixtures")
+OUT = FIXTURES_DIR / "serps.json.bz2"
+
+# Input files in the order they should appear in the consolidated file.
+SOURCES = [
+    "serps-parser-coverage.json.bz2",
+    "serps-sge-2024.json.bz2",
+    "serps-v0.6.7.json.bz2",
+    "serps-v0.6.8.json.bz2",
+    "serps-v0.7.2-ads.json.bz2",
+    "serps-v0.7.2-jobs.json.bz2",
+    "serps-v0.7.2-knowledge-subcards.json.bz2",
+]
+
+DROPS = {
+    "97404b7b7c61",
+    "45b6e019bfa2",
+    "c9ab650f5bda",
+    "032572e185d3",
+    "be99c971b8f7",
+    "cad43c3268a8",
+    "3c09a0f0c92f",
+    "984065877aad",
+}
+
+UBIQUITOUS = ("general/", "searches_related/", "people_also_ask/")
+
+
+def scrub_note(note: str) -> str:
+    return note.replace("SearchAudits directives crawl", "a directives crawl")
+
+
+def signature(parsed: dict) -> list[str]:
+    pairs = sorted(
+        {(r["type"], r["sub_type"]) for r in parsed["results"]},
+        key=lambda ts: (ts[0] or "", ts[1] or ""),
+    )
+    return [f"{t}/{s}" for t, s in pairs]
+
+
+def load_sources() -> list[dict]:
+    survivors = []
+    for name in SOURCES:
+        path = FIXTURES_DIR / name
+        with bz2.open(path, "rt", encoding="utf-8") as f:
+            for line in f:
+                r = orjson.loads(line)
+                if r["serp_id"][:12] in DROPS:
+                    continue
+                r["_parsed"] = ws.parse_serp(r["html"])
+                r["_sig"] = signature(r["_parsed"])
+                r["_layout"] = r["_parsed"]["features"].get("main_layout")
+                survivors.append(r)
+    return survivors
+
+
+def build_notes(survivors: list[dict]) -> None:
+    pair_carriers: dict[str, set] = defaultdict(set)
+    layout_carriers: dict[str, set] = defaultdict(set)
+    for r in survivors:
+        sid = r["serp_id"][:12]
+        for p in r["_sig"]:
+            pair_carriers[p].add(sid)
+        layout_carriers[str(r["_layout"])].add(sid)
+
+    for r in survivors:
+        if r.get("note"):  # curated -> preserve, scrub only
+            r["note"] = scrub_note(r["note"])
+            continue
+        sid = r["serp_id"][:12]
+        unique = [p for p in r["_sig"] if len(pair_carriers[p]) == 1]
+        rare = [p for p in r["_sig"] if len(pair_carriers[p]) == 2]
+        layout = r["_layout"]
+        clauses = []
+        if unique:
+            clauses.append("sole carrier of " + ", ".join(unique))
+        if layout and layout != "standard" and len(layout_carriers[str(layout)]) == 1:
+            clauses.append(f"only {layout} layout in the corpus")
+        if not unique and rare:
+            clauses.append("one of two carriers of " + ", ".join(rare))
+        if not clauses:
+            notable = [p for p in r["_sig"] if not p.startswith(UBIQUITOUS)]
+            if notable:
+                clauses.append("coverage for " + ", ".join(notable[:4]))
+            else:
+                clauses.append("standard organic-results SERP")
+        contribution = "; ".join(clauses)
+        contribution = contribution[0].upper() + contribution[1:] + "."
+        prov = f"Corpus capture, WebSearcher {r.get('version')}, {(r.get('timestamp') or '')[:10]}."
+        r["note"] = f"{prov} {contribution}"
+
+
+def emit(survivors: list[dict]) -> None:
+    with bz2.open(OUT, "wt", encoding="utf-8") as f:
+        for r in survivors:
+            rec = {k: v for k, v in r.items() if not k.startswith("_")}
+            note = rec.pop("note")  # pop and re-insert: note, then html, end up last
+            html = rec.pop("html")
+            rec["note"] = note
+            rec["html"] = html  # html last for readability
+            f.write(orjson.dumps(rec).decode() + "\n")
+
+
+def main() -> None:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--check", action="store_true", help="report only, do not write")
+    args = ap.parse_args()
+
+    survivors = load_sources()
+    build_notes(survivors)
+
+    print(f"survivors: {len(survivors)}  (dropped {len(DROPS)})")
+    print(f"all have notes: {all(r.get('note') for r in survivors)}")
+    print(f"versions: {dict(Counter(r.get('version') for r in survivors))}")
+    print(f"layouts: {dict(Counter(str(r['_layout']) for r in survivors))}")
+    tokens = [r["serp_id"][:12] for r in survivors if "google_abuse" in r.get("url", "")]
+    print(f"urls with google_abuse token (kept as artifact): {tokens}")
+    print("\nsample generated notes:")
+    for r in survivors[:3] + survivors[-4:]:
+        print(f"  [{r['serp_id'][:12]}] {r['note']}")
+
+    if args.check:
+        print("\n--check: not writing.")
+        return
+    emit(survivors)
+    size = OUT.stat().st_size
+    print(f"\nwrote {OUT} ({size / 1e6:.1f} MB)")
+
+
+if __name__ == "__main__":
+    main()
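
Review note: to make the "sole carrier" bookkeeping in `build_notes` easier to follow, here is a small self-contained sketch that runs the script's `signature()` logic over toy data. The record IDs (`aaa`, `bbb`) and the parsed dictionaries are made up for illustration. They assume only the shape the script itself relies on: a `results` list whose items have `type` and `sub_type` keys.

```python
from collections import defaultdict


def signature(parsed: dict) -> list[str]:
    """Sorted, de-duplicated "type/sub_type" pairs, as in build_fixture_corpus.py."""
    pairs = sorted(
        {(r["type"], r["sub_type"]) for r in parsed["results"]},
        key=lambda ts: (ts[0] or "", ts[1] or ""),
    )
    return [f"{t}/{s}" for t, s in pairs]


# Hypothetical parsed outputs keyed by made-up short IDs.
toy = {
    "aaa": {"results": [
        {"type": "general", "sub_type": None},
        {"type": "knowledge", "sub_type": "unit_converter"},
    ]},
    "bbb": {"results": [
        {"type": "general", "sub_type": None},
        {"type": "general", "sub_type": None},
    ]},
}

# Map each pair to the set of records that carry it.
carriers: dict[str, set] = defaultdict(set)
for sid, parsed in toy.items():
    for pair in signature(parsed):
        carriers[pair].add(sid)

# A pair with exactly one carrier is what the notes call a "sole carrier".
sole = {pair: next(iter(ids)) for pair, ids in carriers.items() if len(ids) == 1}
print(sole)  # {'knowledge/unit_converter': 'aaa'}
```

Note that a missing `sub_type` is rendered as the literal string `None` (for example `general/None`), which still matches the `general/` prefix in `UBIQUITOUS`.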