gitronald · Jun 1, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
 - Fixed a regression on *complementary* `kp-wholepage` panels — a music/entity panel that sits beside an intact organic column instead of collapsing it (e.g. `this is my life billy joel youtube`): the sub-column gate mis-keyed on `div.tF2Cxc` and dropped the 10 video organics, and the panel mistyped as `searches_related`/`or_give_general_feedback` from an inner "People also search for" heading. The collapse gate now recognizes any titled-link `div.g` (`h3` + `a[href]`) as organic, and both the header classifier and the knowledge sub-typer defer for `kp-wholepage-osrp` panels so they classify as `knowledge`/`panel`. The true collapsed case (`30 year mortgage rates`) is unaffected
 - Split bare-`tF2Cxc` organic bundles in `general` components into one result per organic (excluding People-Also-Ask sources), instead of collapsing them to a single result — recovers the organics on finance-style panels (e.g. an `aapl stock price` panel went from 1 to 9 organics)
 - Added `election_dates`, `election_results`, and `election_resources` component types for the election widgets embedded in whole-page election panels
+- Optimized `get_text` (the most-called parse helper) with a native-`text()` fast path: when a subtree contains no `script`/`style`/`template` and the separator semantics permit, it delegates to selectolax's C `text()` instead of the pure-Python fragment walker. Byte-identical output (verified over 315k nodes in the fixture corpus; all 87 snapshots unchanged) for ~7% faster `parse_serp` (corpus 3872 -> 3590 ms, median 39.9 -> 36.7 ms/SERP back-to-back, Python 3.13)
 
 ## [0.8.6] - 2026-05-26
 

diff --git a/WebSearcher/_slx.py b/WebSearcher/_slx.py
@@ -106,6 +106,25 @@ def get_text(node: Node | None, separator: str = "", strip: bool = False) -> str
     + ``get_text``."""
     if node is None:
         return None
+    # Fast path: selectolax's native C ``text()`` is byte-identical to the Python
+    # fragment walker below EXCEPT for two differences -- native includes
+    # script/style/template text and native keeps empty fragments (the
+    # ``strip=True`` walker drops them). On the script/style/template side, the
+    # walker skips those tags only as *descendants* (``_iter_text_fragments``
+    # does not special-case a root that is itself one), so two guards are needed:
+    # ``node.tag not in _SKIP_TEXT_TAGS`` (root) and the ``css_first`` probe
+    # (descendants). The empty-fragment difference vanishes when ``separator`` is
+    # empty (empties add nothing to a ""-join, so kept-vs-dropped is invisible)
+    # or ``strip`` is False (both keep empties identically). Together these
+    # cover ~95% of calls on this corpus; the one remaining case (``strip=True``
+    # with a non-empty separator) keeps the Python walker. Verified
+    # byte-identical over the full fixture corpus (315k nodes).
+    if (
+        (separator == "" or not strip)
+        and node.tag not in _SKIP_TEXT_TAGS
+        and node.css_first("script,style,template") is None
+    ):
+        return node.text(deep=True, separator=separator, strip=strip)
     frags = _iter_text_fragments(node)
     if strip:
         parts = [s for s in (f.strip() for f in frags) if s]

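Review note: the equivalence argument in the `get_text` comment above can be checked on a toy model. The sketch below does not call selectolax. `walker_text` and `native_text` are hypothetical stand-ins that model only the empty-fragment difference the comment describes (script/style/template handling is ignored), and `FRAGS` is made-up input.

```python
def walker_text(frags, separator="", strip=False):
    """Model of the Python walker: with strip=True, empty fragments are dropped."""
    if strip:
        parts = [s for s in (f.strip() for f in frags) if s]
    else:
        parts = list(frags)
    return separator.join(parts)


def native_text(frags, separator="", strip=False):
    """Assumed model of native text(): with strip=True, empty fragments are kept."""
    parts = [f.strip() for f in frags] if strip else list(frags)
    return separator.join(parts)


# Hypothetical fragments, including a whitespace-only and an empty one.
FRAGS = ["Title", "  ", "body ", ""]

# The four fast-path-eligible signatures agree on this input.
for sep, strip in [("", False), ("", True), (" ", False), ("<|>", False)]:
    assert walker_text(FRAGS, sep, strip) == native_text(FRAGS, sep, strip)

# The excluded signature (strip=True with a visible separator) differs.
print(repr(walker_text(FRAGS, " ", True)))  # 'Title body'
print(repr(native_text(FRAGS, " ", True)))  # 'Title  body '
```

The last two lines show why the guard is `separator == "" or not strip`: kept empties only become visible once a non-empty separator is joined around them.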
diff --git a/docs/plans/035-get-text-native-fastpath.md b/docs/plans/035-get-text-native-fastpath.md
@@ -0,0 +1,95 @@
+---
+status: done
+branch: claude/new-benchmark-PMegd
+created: 2026-06-01
+---
+
+# `get_text` native-`text()` fast path (post-selectolax benchmark)
+
+A fresh benchmark of the current parse pipeline (the first since the selectolax
+native rewrite in plan 026 and the parser additions in plans 027-034) found that
+the pure-Python `get_text` fragment walker had become the single largest
+*optimizable* cost. This plan records that benchmark and a byte-identical fast
+path that recovers ~7% of `parse_serp` latency.
+
+Methodology follows plan 023: per-SERP median + MAD, gate on the run-to-run noise
+floor, and only trust **back-to-back same-session A/B** numbers (never chain
+deltas across sessions or machines).
+
+## Environment
+
+`scripts/bench_parse.py` now records the interpreter at the top of every run
+(`platform.python_version()` / implementation / platform + `WebSearcher`
+version) -- parse timings are only comparable within one Python build, and the
+interpreter pinned in the repo's `.python-version` (`3.14.0rc2`) currently can't
+import the package (`pydantic 2.13.4` breaks on the 3.14 RC
+`typing._eval_type` signature). All numbers below are **Python 3.13.12,
+CPython, linux**, fixture corpus `tests/fixtures/serps.json.bz2` (87 SERPs).
+
+## Baseline benchmark (current HEAD)
+
+`bench_parse.py --iterations 10 --runs 3`:
+
+- median **39.5 ms/SERP**, MAD 13.6 ms; min 17.3 / p90 75.7 / max 115.7 ms
+- corpus 3760.6 ms/pass; inter-run MAD 5.7 ms -> **noise floor ~0.3%** (idle box).
+
+cProfile (`--profile`, 870 parses, 52.6 s, tottime) top buckets:
+
+| Frame | self | nature |
+|---|---|---|
+| `make_soup` (lexbor parse) | 10.5 s (20%) | structural -- one parse/SERP, unavoidable |
+| `_iter_text_fragments` (get_text walker) | 5.8 s; 9.6 s cum (**18%**) | pure-Python, hot -- the target |
+| `_ComponentSignals.__init__` | 5.3 s; 6.9 s cum (13%) | pure-Python (one `css('*')` walk/component) |
+| `_extract_from_html` (serp-features regex) | 2.0 s | |
+| `_get_dom_positions` | 1.9 s | |
+
+`get_text` is called ~176x/parse (153,580 over 870), each walking a subtree via a
+Python stack of `.iter()` generators -- 824k fragment visits, 5.4M `next()` calls.
+
+## The fast path
+
+Plan 026 flagged native selectolax `.text()` as the next lever but "unsafe"
+because (a) native includes `script`/`style`/`template` text (the walker skips
+those subtrees) and (b) native `strip=True` keeps empty fragments (the walker
+drops them). Both differences are **observable only under specific conditions**,
+and outside them native C `text()` is byte-identical:
+
+- (a) vanishes when the subtree has no `script`/`style`/`template`.
+- (b) vanishes when `separator == ""` (an empty fragment adds nothing to a
+  `""`-join, so kept-vs-dropped is invisible) **or** `strip is False` (both keep
+  empties identically).
+
+So `get_text` delegates to `node.text(deep=True, separator=separator, strip=strip)`
+when `(separator == "" or not strip)`, the node is not itself a script/style/template,
+and its subtree holds none (one `css_first("script,style,template")` C probe); every
+other call keeps the Python walker. The only call signature that always stays on
+the walker is `strip=True` with a non-empty separator (`get_text(x, " ",
+strip=True)`, 38 sites) -- exactly the drop-empties-with-visible-separator case.
+
+**Correctness verification.** Over the full corpus (315,095 element nodes, 95.2%
+fast-path-eligible) every fast-pathable signature -- `("", False)`, `("", True)`,
+`(" ", False)`, `("<|>", False)` -- produced **0 mismatches** against the walker.
+The snapshot suite stays green without updates (`uv run pytest`: 336 passed, 4
+skipped, **87 snapshots unchanged**).
+
+## Result (back-to-back A/B, same machine state)
+
+Stash/pop A/B with `--iterations 10 --runs 3`:
+
+| Metric | Baseline | Fast path | Delta |
+|---|---|---|---|
+| Corpus total | 3872.0 ms | 3590.3 ms | **-7.3%** |
+| Per-SERP median | 39.9 ms | 36.7 ms | **-8.0%** |
+
+Far above the ~0.4-0.5% noise floor. Post-change profile: `_iter_text_fragments`
+self 5.8 -> 2.2 s, cum 9.6 -> 3.5 s, fragment visits 824k -> 276k (the remainder
+is the `strip=True`/non-empty-sep walker, script-bearing subtrees, and
+`has_text`/`knowledge_box`); the displaced work moved into lexbor's C `text()`.
+
+## Left for follow-up
+
+`_ComponentSignals.__init__` is now the largest pure-Python cost (~5 s self,
+~13%): one `css('*')` walk per component building class/id/tag presence sets. A
+shared scan feeding both it and `_get_dom_positions`/`reorder_by_dom_position`
+(each of which also `css('*')`-walks) is the next structural lever -- deferred to
+keep this change small and byte-identical.
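Review note: the back-to-back A/B gate that plan 035 relies on can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in `scripts/bench_parse.py`: the noise floor is taken as the larger inter-run MAD relative to the baseline median, the `factor` threshold is hypothetical, and the run timings are made-up inputs rather than measurements.

```python
import statistics


def mad(values):
    """Median absolute deviation around the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def ab_gate(baseline_runs, candidate_runs, factor=2.0):
    """Return (relative delta, relative noise floor, clears-the-floor flag).

    Assumption: both run lists come from the same session, back to back.
    """
    base_med = statistics.median(baseline_runs)
    cand_med = statistics.median(candidate_runs)
    noise = max(mad(baseline_runs), mad(candidate_runs)) / base_med
    delta = (cand_med - base_med) / base_med
    return delta, noise, abs(delta) > factor * noise


# Illustrative corpus totals in ms (hypothetical, three runs per side).
quiet = ab_gate([1000.0, 1004.0, 998.0], [930.0, 934.0, 927.0])
noisy = ab_gate([1000.0, 1100.0, 900.0], [990.0, 1080.0, 910.0])
print(quiet)
print(noisy)
```

On the quiet inputs a 7% improvement clears a 0.3% floor easily; on the noisy inputs a 1% change is indistinguishable from run-to-run variation, which is why deltas are never chained across sessions or machines.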
diff --git a/docs/plans/036-component-signals-and-extractor-hotpath.md b/docs/plans/036-component-signals-and-extractor-hotpath.md
@@ -0,0 +1,96 @@
+---
+status: proposed
+branch: TBD
+created: 2026-06-01
+---
+
+# `_ComponentSignals` consolidation + extractor hot-path review
+
+Follow-up to [plan 035](035-get-text-native-fastpath.md). With the `get_text`
+native fast path banked (~7% off `parse_serp`), the profile's largest
+*optimizable* cost is now `classifiers/main.py:_ComponentSignals.__init__`, and
+the extractor phase is the largest bucket not yet profiled in detail. This plan
+scopes both. Same methodology as plans 023/035: per-SERP median + MAD, gate on
+the run-to-run noise floor (~0.3-0.5% on the current idle box), and trust only
+**back-to-back same-session A/B** numbers.
+
+## Current baseline (Python 3.13.12, 87-SERP corpus)
+
+From plan 035's profile (`bench_parse.py --profile`, 870 parses, post-fast-path):
+
+| Frame | self | nature |
+|---|---|---|
+| `make_soup` (lexbor parse) | 10.1 s (~20%) | structural -- one parse/SERP |
+| `_ComponentSignals.__init__` | 5.1 s; 6.7 s cum (**~13%**) | pure-Python, **this plan** |
+| `_iter_text_fragments` (residual walker) | 2.2 s; 3.5 s cum | already optimized in 035 |
+| `_get_dom_positions` | 1.5 s | full-document `css('*')` walk |
+| `_ai_overview_payloads._iter_payload_blobs` | 1.6 s; 2.1 s cum | recent addition |
+| `is_valid` (`extractor_main.py:568`) | 1.0 s; 1.8 s cum | extraction |
+| `extract_from_standard` (`extractor_main.py:333`) | 0.8 s; 2.4 s cum | extraction |
+
+## Lever 1: `_ComponentSignals` (primary)
+
+`ClassifyMain.classify` builds a `_ComponentSignals` per main component -- one
+`cmpt.css('*')` descendant walk that fills three sets (class tokens, ids, tag
+names), feeding the necessary-signal preconditions on the classifier chain
+(plan 023 item 3a). It is called ~13x/parse (11,390 over 870 parses) and is now
+~13% of parse time. The cost is the per-element Python loop:
+`set.update(cls.split())` (1.67M updates), `set.add(tag)` (2.47M adds),
+`str.split` (1.72M), `el.attrs.get`/`el.id`/`el.tag` per element.
+
+Candidate directions (each must stay byte-identical -- preconditions are
+*necessary* conditions, so any change must not drop a real signal; pin with the
+87-snapshot suite, no updates):
+
+1. **Build only the signals the chain actually consults.** The `names` set is
+   queried for ~8 tags, mostly custom elements (`g-scrolling-carousel`,
+   `g-tray-header`, `block-component`, `h2`, `promo-throttler`,
+   `product-viewer-group`, `g-more-link`), yet every element's tag is added
+   (2.47M adds). The `ids` set is queried for a similarly small fixed set.
+   Restricting `names`/`ids` to a precomputed interest set (membership-test on
+   add) trades 2.47M unconditional adds for 2.47M cheap `in` checks against a
+   small frozenset -- measure whether that nets out, since the `in` check isn't
+   free either. Classes are broadly consulted and likely must stay full.
+2. **Lexbor-side signal extraction.** Investigate whether a small number of
+   targeted `css_first(...)` C probes for the gated signals beat one Python
+   `css('*')` walk that materializes three sets -- i.e. revisit whether the 3a
+   "presence set" is still the right shape now that the walker (not `find`
+   misses) is the cost. This is the inverse of the 023 decision and must be
+   re-measured, not assumed.
+3. **Share one document walk.** `_get_dom_positions` (1.5 s) already walks the
+   whole document with `css('*')`, and `reorder_by_dom_position._range` walks
+   each main-component subtree with `css('*')` again. A single document walk that
+   yields both the position map and per-component signal sets would remove
+   redundant traversals -- but it couples currently-independent phases and risks
+   the byte-identical contract; scope carefully and gate hard.
+
+Recommended first step: option 1 (lowest risk, local to `_ComponentSignals`),
+A/B it, then decide whether option 3's shared walk is worth the coupling.
+
+## Lever 2: extractor hot-path review (investigate)
+
+`ExtractorMain` (`extract_from_standard`, `get_layout`, `is_valid`,
+`_ads_bottom`) is the largest phase after `make_soup` once classify is addressed,
+and it grew ~648 lines since plan 023 (new layouts, kp-wholepage sub-columns,
+complementary panels). It has not had a dedicated profiling pass on the
+selectolax backend. Worth investigating:
+
+- `is_valid` (1.8 s cum, 25,460 calls) runs per candidate component -- re-check
+  the bad-label text scan and survey-throttler probe on lexbor nodes (the 023
+  bounds were tuned for bs4).
+- `extract_from_standard` / the `_StandardLayout` dispatch -- look for repeated
+  `css`/`css_first` over the same subtrees across layout detection and block
+  collection that could be hoisted or shared.
+- `_iter_payload_blobs` (ai_overview, 2.1 s cum) -- a recent addition; confirm it
+  isn't re-walking payload subtrees.
+
+No commitment yet -- this lever is a profiling/scoping task that may or may not
+surface a gateable win. Capture a `--profile-sort cumulative` split by phase
+(as plan 023 did) before touching extractor code.
+
+## Verification gate (per change)
+
+- `uv run pytest` -- 87 snapshots green **without updates**, full suite passing.
+- Back-to-back A/B of `scripts/bench_parse.py` over the fixture corpus, same
+  session, clearing the noise floor; record numbers in this plan's Log.
+- `ruff check` / `ruff format --check` clean.
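Review note: Lever 1, option 1 in plan 036 (restrict the `names` set to an interest set) can be sketched as below. The real `_ComponentSignals` internals are not shown in this diff, so `collect_all`, `collect_interesting`, and `INTEREST_TAGS` are hypothetical names; the tag list is the one the plan quotes, and the input tags are made up.

```python
# Hypothetical interest set: the tags plan 036 lists as consulted by the chain.
INTEREST_TAGS = frozenset({
    "g-scrolling-carousel", "g-tray-header", "block-component", "h2",
    "promo-throttler", "product-viewer-group", "g-more-link",
})


def collect_all(tags):
    """Current shape (as described in the plan): add every element's tag."""
    names = set()
    for tag in tags:
        names.add(tag)
    return names


def collect_interesting(tags, interest=INTEREST_TAGS):
    """Option 1: membership-test on add, keep only tags that are ever queried."""
    names = set()
    for tag in tags:
        if tag in interest:
            names.add(tag)
    return names


# Made-up component: mostly tags nobody queries, plus two gated ones.
tags = ["div", "span", "h2", "a", "div", "g-more-link", "span"]
full, slim = collect_all(tags), collect_interesting(tags)

# Every query the chain can make (a tag in the interest set) answers the same.
assert all((q in full) == (q in slim) for q in INTEREST_TAGS)
print(sorted(slim))  # ['g-more-link', 'h2']
```

The equivalence holds only while every precondition queries a tag inside the interest set. A classifier that later consults a new tag without extending the set would silently lose a necessary signal, so the set would need to be derived from, or checked against, the chain itself.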
diff --git a/scripts/bench_parse.py b/scripts/bench_parse.py
@@ -13,8 +13,10 @@
 import cProfile
 import gc
 import logging
+import platform
 import pstats
 import statistics
+import sys
 import time
 from pathlib import Path
 
@@ -72,6 +74,12 @@ def main(
     """Benchmark or profile parse_serp over the fixture corpus."""
     logging.getLogger("WebSearcher").setLevel(logging.WARNING)
 
+    # Record the interpreter: timings are only comparable within one Python build.
+    typer.echo(
+        f"Python {platform.python_version()} "
+        f"({platform.python_implementation()}, {sys.platform}) | WebSearcher {ws.__version__}"
+    )
+
     paths = fixtures or sorted(FIXTURES_DIR.glob("serps-v*.json.bz2"))
     if not paths:
         typer.echo(f"No fixtures found in {FIXTURES_DIR}")