Merged
42 changes: 31 additions & 11 deletions docs/guides/fixture-corpus.md
@@ -101,13 +101,15 @@ is unseen elsewhere. It is the conservative middle ground between the pair-set l

## Tooling

-Two scripts implement the analysis. Both run through `uv`.
+Three scripts, all run through `uv`. They are **report-only** — none mutates the
+corpus or auto-recommends deletions (the plan-032 drop decision is already applied;
+`scripts/build_fixture_corpus.py` is the one-time builder that produced the file).

### `scripts/profile_fixture_corpus.py`

Parses every record and reports per-record provenance, `main_layout`, fired feature
flags, the pair-set signature, parse errors, and corpus-wide rarity (which pairs /
-layouts have only 1–2 carriers) plus subset-coverage.
+layouts have only 1–2 carriers).

```bash
uv run python scripts/profile_fixture_corpus.py # human-readable report
@@ -116,15 +118,29 @@ uv run python scripts/profile_fixture_corpus.py --json # machine-readable

### `scripts/compare_drop_signatures.py`

-Applies the signature readings above to the drop candidates, answers "does any
-candidate have a unique signature?" under all three readings, and prints the **final
-drop list under the distinct-type bar** — each dropped record annotated with the
-surviving record that preserves its signature.
+Reports the three signature readings and surfaces **distinct-type signature
+clusters** (records sharing a component sequence) for human review.
+
+It does NOT recommend drops, by design. A shared `(type, sub_type)` signature is
+blind to details-level structure: e.g. two `ai_overview/sectioned` records can
+differ in section *count* (1 vs 3), and `test_ai_overview_legacy_sge.py` depends on
+the multi-section one specifically. Always confirm at the details level — and check
+the query-keyed tests — before treating a cluster as redundant.

```bash
uv run python scripts/compare_drop_signatures.py
```
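
To make that blind spot concrete, here is a minimal sketch. The `pair_signature` helper mirrors the `signature` function in `scripts/build_fixture_corpus.py`; the two records, and the assumption that `details` holds a list of sections, are hypothetical and exist only to illustrate the point.

```python
def pair_signature(parsed: dict) -> list[str]:
    """Sorted, de-duplicated (type, sub_type) pairs rendered as 'type/sub_type'."""
    pairs = sorted(
        {(r["type"], r["sub_type"]) for r in parsed["results"]},
        key=lambda ts: (ts[0] or "", ts[1] or ""),
    )
    return [f"{t}/{s}" for t, s in pairs]


# Hypothetical parsed records; `details` as a list of sections is an assumption.
one_section = {"results": [
    {"type": "ai_overview", "sub_type": "sectioned", "details": [{"heading": "A"}]},
    {"type": "general", "sub_type": None, "details": None},
]}
three_sections = {"results": [
    {"type": "ai_overview", "sub_type": "sectioned",
     "details": [{"heading": "A"}, {"heading": "B"}, {"heading": "C"}]},
    {"type": "general", "sub_type": None, "details": None},
]}

# The signatures match even though the records differ at the details level.
same_signature = pair_signature(one_section) == pair_signature(three_sections)
section_counts = [len(rec["results"][0]["details"]) for rec in (one_section, three_sections)]
print(same_signature, section_counts)
```

A signature match therefore only says that two records exercise the same component types, not that one can stand in for the other.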

+### `scripts/verify_drops.py`
+
+Corpus-integrity guard: confirms the 8 plan-032 drops are absent, serp_ids are
+unique, every record carries a `note`, the three witnessed layouts survive, and
+every parsed `(type, sub_type)` has a carrier. Exits non-zero on failure (CI-usable).
+
+```bash
+uv run python scripts/verify_drops.py
+```
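
As a rough illustration of what such a guard checks, the sketch below covers three of the listed conditions: drops absent, unique serp_ids, and a `note` on every record. It is not the contents of `scripts/verify_drops.py`. The `check_corpus` function and the sample records are hypothetical, and the layout and carrier checks are left out because they require parsing the HTML.

```python
from collections import Counter


def check_corpus(records: list[dict], drops: set[str]) -> list[str]:
    """Return human-readable failures; an empty list means the corpus passes."""
    failures = []
    ids = [r["serp_id"][:12] for r in records]

    present = sorted(drops.intersection(ids))
    if present:
        failures.append(f"dropped records still present: {present}")

    dupes = sorted(sid for sid, n in Counter(ids).items() if n > 1)
    if dupes:
        failures.append(f"duplicate serp_ids: {dupes}")

    missing = [sid for sid, r in zip(ids, records) if not r.get("note")]
    if missing:
        failures.append(f"records without a note: {missing}")
    return failures


# Hypothetical records: one duplicate id, one empty note.
records = [
    {"serp_id": "aaaaaaaaaaaa", "note": "Corpus capture. Standard organic-results SERP."},
    {"serp_id": "bbbbbbbbbbbb", "note": ""},
    {"serp_id": "bbbbbbbbbbbb", "note": "Corpus capture. Coverage for knowledge/panel."},
]
failures = check_corpus(records, drops={"97404b7b7c61"})
for line in failures:
    print("FAIL:", line)
# A CI wrapper would typically exit non-zero when `failures` is non-empty.
```

Collecting every failure before exiting, rather than stopping at the first, keeps a single CI run informative when several invariants break at once.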

## The `note` field

Surviving records should carry a `note` mirroring the curated format — a provenance
@@ -135,19 +151,23 @@ clause plus what the record contributes:
```

e.g. `"Bulk corpus capture, v0.6.7a0 crawl 2026-02-06. Sole carrier of
-knowledge/unit_converter."` Keep the provenance clause free of private-repo names
-and any embedded crawl tokens / IPs (scrub `GOOGLE_ABUSE_EXEMPTION` URLs).
+knowledge/unit_converter."` Keep the provenance clause free of private-repo names.
+One record (`7049404a2dd6`) intentionally retains a `GOOGLE_ABUSE_EXEMPTION` URL
+token as an artifact of how the crawler obtained an abuse exemption.

## Reproducing the corpus assessment

```bash
-# 1. Profile every record and find unique contributors + redundancy
+# 1. Profile every record: provenance, layouts, unique contributors, rarity
uv run python scripts/profile_fixture_corpus.py

-# 2. Compute the drop list under the distinct-type-order bar
+# 2. Review distinct-type signature clusters (potential redundancy)
uv run python scripts/compare_drop_signatures.py

-# 3. After any change, confirm tests + that no unique pair/layout was lost
+# 3. Corpus-integrity guard (drops absent, notes present, layouts + coverage intact)
+uv run python scripts/verify_drops.py
+
+# 4. After any change, confirm the tests pass
uv run pytest tests/test_parse_serp.py tests/test_parser_coverage.py \
tests/test_ai_overview_legacy_sge.py -q
```
54 changes: 49 additions & 5 deletions docs/plans/032-fixture-corpus-notes-and-pruning.md
@@ -1,9 +1,9 @@
---
-status: draft
+status: done
branch: feature/fixture-corpus-notes
created: 2026-05-31T11:32:26-07:00
-completed:
-pr:
+completed: 2026-05-31T12:42:13-07:00
+pr: https://github.com/gitronald/WebSearcher/pull/143
---

# Annotate and prune the bulk SERP fixture corpus
@@ -244,8 +244,9 @@ These records are committed to a **public** repo. While rewriting the files:

- Drop the private-repo name from any note text (the curated v0.7.2 notes currently
say "Captured via SearchAudits ...") — use a neutral provenance clause.
-- Scrub the `GOOGLE_ABUSE_EXEMPTION` token + embedded IP from the `url` of
-`serps-v0.6.7` record `7049404a2dd6` (mooted if that file is retired).
+- **Decided: keep** the `GOOGLE_ABUSE_EXEMPTION` token (with its embedded client IP)
+on record `7049404a2dd6` as a deliberate artifact — it documents how the crawler
+obtained an abuse exemption, which is worth preserving. Not scrubbed.

### 5. Consolidate into `serps.json.bz2` and update loaders

@@ -309,3 +310,46 @@ than retiring the file wholesale; consolidate into one version-less file.
- **Should the freed budget be backfilled now** with new diverse captures, or left
for a follow-up plan? Recommended: follow-up — keep this plan scoped to
annotate + prune.

## Log

- **2026-05-31** — Implemented as specified across five commits:
- `ca43a31` — committed the methodology artifacts (fixture-corpus guide, this plan,
`scripts/profile_fixture_corpus.py`, `scripts/compare_drop_signatures.py`).
- `f5b6eb4` — recorded the Pass-B drop verification and added `scripts/verify_drops.py`
(per-drop coverage check against the whole surviving corpus: 0 drop-only items).
- `30dcaef` — consolidated all seven version-named fixtures into a single
`tests/fixtures/serps.json.bz2` (renamed from `serps-v0.6.8`, the six others deleted),
added a per-record `note` to every survivor via `scripts/build_fixture_corpus.py`,
dropped the 8 redundant records from §3, deleted their orphaned snapshots, and
`--snapshot-update`'d the ~22 newly-included parser-coverage/sge records. Repointed
the three test loaders at the single file.
- `7426854` — repointed `profile_fixture_corpus.py` / `compare_drop_signatures.py` at
`serps.json.bz2` and made them report-only.
- `4cc6aca` — clarified the specialized-components label in the cluster report.
- **Final corpus:** 80 records in one `serps.json.bz2` (~21 MB). The 8 drops match the
§3 table exactly (5 sky-blue, 1 streaming, 1 paragraph-query, 1 college).
- **Verify suite (§6) green** on close: `149 passed, 80 snapshots passed` across
`test_parse_serp`, `test_parser_coverage`, `test_ai_overview_legacy_sge`,
`test_extractor_serp_features`; `test_features_expose_main_layout` included, no
orphaned or missing snapshots, every single-carrier `(type, sub_type)` pair retained.
- **2026-05-31** — Closed: PR #143 merged into `feature/v0.9.0` (the 0.9.0 integration
branch).

## Retrospective

- The two-pass drop policy (mechanical signature screen, then adversarial value-level
`details` rescue) earned its keep: the unordered pair-set screen flagged ~25 subset
candidates, but the decided distinct-type-order bar plus the rescue pass cut that to
exactly 8 — protecting 17 records the cruder screen would have over-dropped.
- The one genuinely load-bearing constraint was mechanical, not structural: keeping
`f6fae1c9a96e` because it is the sole `standard-overview` carrier *within the
`serps-v*` glob* that `test_features_expose_main_layout` asserts against. Worth
re-checking that assertion's source set first on any future fixture surgery.
- Consolidating to one version-less file widened `test_parse_serp` from the `serps-v*`
subset to the whole corpus (+~22 snapshots). That was the right default — provenance
already lives per-record in the `version` field, so the filename carried no
information the JSON didn't.
- This branch sat fully implemented but with `status: draft` and no PR — the work and
the bookkeeping had drifted apart. Closing it promptly after the last impl commit
would have avoided the "did we execute this?" ambiguity.
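
For readers unfamiliar with the "unordered pair-set screen" mentioned in the first retrospective point, a minimal sketch of one plausible reading follows: a record is a candidate when every `type/sub_type` pair it carries also appears in some other single record. The `subset_candidates` function and the sample signatures are hypothetical; the actual screen in `scripts/compare_drop_signatures.py` may differ.

```python
def subset_candidates(signatures: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each record to the other records whose pair set contains all of its pairs."""
    out = {}
    for sid, pairs in signatures.items():
        covers = sorted(
            other for other, other_pairs in signatures.items()
            if other != sid and pairs <= other_pairs
        )
        if covers:
            out[sid] = covers
    return out


# Hypothetical signatures keyed by record id.
signatures = {
    "rec_a": {"general/None", "knowledge/panel"},
    "rec_b": {"general/None"},
    "rec_c": {"general/None", "ad/standard"},
}
candidates = subset_candidates(signatures)
print(candidates)
```

Because this screen ignores component order and `details` content, it over-flags by construction, which is why a stricter bar and a value-level rescue pass were applied afterwards.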
154 changes: 154 additions & 0 deletions scripts/build_fixture_corpus.py
@@ -0,0 +1,154 @@
"""Build the consolidated SERP fixture corpus: tests/fixtures/serps.json.bz2.

Merges the seven version-named bz2 fixtures into one, dropping the 8 verified-
redundant records (plan 032), scrubbing private-repo names from notes, and adding a
`note` to every survivor. Notes are generated from the POST-DROP corpus so "sole
carrier" claims are accurate; curated notes are preserved (only scrubbed). The one
surviving google_abuse exemption token is kept intact as a deliberate artifact. The
HTML field is written last on each line for readability.

Usage:
    uv run python scripts/build_fixture_corpus.py          # write the file
    uv run python scripts/build_fixture_corpus.py --check  # report, do not write
"""

import argparse
import bz2
from collections import Counter, defaultdict
from pathlib import Path

import orjson

import WebSearcher as ws

FIXTURES_DIR = Path("tests/fixtures")
OUT = FIXTURES_DIR / "serps.json.bz2"

# Input files in the order they should appear in the consolidated file.
SOURCES = [
    "serps-parser-coverage.json.bz2",
    "serps-sge-2024.json.bz2",
    "serps-v0.6.7.json.bz2",
    "serps-v0.6.8.json.bz2",
    "serps-v0.7.2-ads.json.bz2",
    "serps-v0.7.2-jobs.json.bz2",
    "serps-v0.7.2-knowledge-subcards.json.bz2",
]

DROPS = {
    "97404b7b7c61",
    "45b6e019bfa2",
    "c9ab650f5bda",
    "032572e185d3",
    "be99c971b8f7",
    "cad43c3268a8",
    "3c09a0f0c92f",
    "984065877aad",
}

UBIQUITOUS = ("general/", "searches_related/", "people_also_ask/")


def scrub_note(note: str) -> str:
    return note.replace("SearchAudits directives crawl", "a directives crawl")


def signature(parsed: dict) -> list[str]:
    pairs = sorted(
        {(r["type"], r["sub_type"]) for r in parsed["results"]},
        key=lambda ts: (ts[0] or "", ts[1] or ""),
    )
    return [f"{t}/{s}" for t, s in pairs]


def load_sources() -> list[dict]:
    survivors = []
    for name in SOURCES:
        path = FIXTURES_DIR / name
        with bz2.open(path, "rt") as f:
            for line in f:
                r = orjson.loads(line)
                if r["serp_id"][:12] in DROPS:
                    continue
                r["_parsed"] = ws.parse_serp(r["html"])
                r["_sig"] = signature(r["_parsed"])
                r["_layout"] = r["_parsed"]["features"].get("main_layout")
                survivors.append(r)
    return survivors


def build_notes(survivors: list[dict]) -> None:
    pair_carriers: dict[str, set] = defaultdict(set)
    layout_carriers: dict[str, set] = defaultdict(set)
    for r in survivors:
        sid = r["serp_id"][:12]
        for p in r["_sig"]:
            pair_carriers[p].add(sid)
        layout_carriers[str(r["_layout"])].add(sid)

    for r in survivors:
        if r.get("note"):  # curated -> preserve, scrub only
            r["note"] = scrub_note(r["note"])
            continue
        sid = r["serp_id"][:12]
        unique = [p for p in r["_sig"] if len(pair_carriers[p]) == 1]
        rare = [p for p in r["_sig"] if len(pair_carriers[p]) == 2]
        layout = r["_layout"]
        clauses = []
        if unique:
            clauses.append("sole carrier of " + ", ".join(unique))
        if layout and layout != "standard" and len(layout_carriers[str(layout)]) == 1:
            clauses.append(f"only {layout} layout in the corpus")
        if not unique and rare:
            clauses.append("one of two carriers of " + ", ".join(rare))
        if not clauses:
            notable = [p for p in r["_sig"] if not p.startswith(UBIQUITOUS)]
            if notable:
                clauses.append("coverage for " + ", ".join(notable[:4]))
            else:
                clauses.append("standard organic-results SERP")
        contribution = "; ".join(clauses)
        contribution = contribution[0].upper() + contribution[1:] + "."
        prov = f"Corpus capture, WebSearcher {r.get('version')}, {r.get('timestamp', '')[:10]}."
        r["note"] = f"{prov} {contribution}"


def emit(survivors: list[dict]) -> None:
    with bz2.open(OUT, "wt") as f:
        for r in survivors:
            rec = {k: v for k, v in r.items() if not k.startswith("_")}
            html = rec.pop("html")
            note = rec.pop("note")
            rec["note"] = note
            rec["html"] = html  # html last for readability
            f.write(orjson.dumps(rec).decode() + "\n")


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--check", action="store_true", help="report only, do not write")
    args = ap.parse_args()

    survivors = load_sources()
    build_notes(survivors)

    print(f"survivors: {len(survivors)} (dropped {len(DROPS)})")
    print(f"all have notes: {all(r.get('note') for r in survivors)}")
    print(f"versions: {dict(Counter(r.get('version') for r in survivors))}")
    print(f"layouts: {dict(Counter(str(r['_layout']) for r in survivors))}")
    tokens = [r["serp_id"][:12] for r in survivors if "google_abuse" in r.get("url", "")]
    print(f"urls with google_abuse token (kept as artifact): {tokens}")
    print("\nsample generated notes:")
    for r in survivors[:3] + survivors[-4:]:
        print(f" [{r['serp_id'][:12]}] {r['note']}")

    if args.check:
        print("\n--check: not writing.")
        return
    emit(survivors)
    size = OUT.stat().st_size
    print(f"\nwrote {OUT} ({size / 1e6:.1f} MB)")


if __name__ == "__main__":
    main()
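
The file that `emit` writes is one JSON object per line inside a bz2 stream, with `html` as the last key. A minimal sketch of reading that layout back with the standard library follows. `read_corpus` is a hypothetical helper, not one of the repository's test loaders, and the sample record is invented so the sketch runs without the real fixture.

```python
import bz2
import json
import tempfile
from pathlib import Path


def read_corpus(path: Path) -> list[dict]:
    """Load one JSON record per line from a bz2-compressed file."""
    with bz2.open(path, "rt") as f:
        return [json.loads(line) for line in f if line.strip()]


# Invented stand-in record, written in the same one-object-per-line layout.
sample = [
    {"serp_id": "aaaaaaaaaaaa", "version": "0.6.7a0",
     "note": "Example note.", "html": "<html></html>"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "serps.json.bz2"
    with bz2.open(path, "wt") as f:
        for rec in sample:
            f.write(json.dumps(rec) + "\n")
    loaded = read_corpus(path)

# Key order survives the round trip, so `html` is still the last field.
print([(r["serp_id"], list(r)[-1]) for r in loaded])
```

Keeping the bulky `html` field last means the identifying fields and the `note` appear at the start of each decompressed line, which is what makes the file skimmable.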