Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/).
- Fixed a regression on *complementary* `kp-wholepage` panels — a music/entity panel that sits beside an intact organic column instead of collapsing it (e.g. `this is my life billy joel youtube`): the sub-column gate mis-keyed on `div.tF2Cxc` and dropped the 10 video organics, and the panel mistyped as `searches_related`/`or_give_general_feedback` from an inner "People also search for" heading. The collapse gate now recognizes any titled-link `div.g` (`h3` + `a[href]`) as organic, and both the header classifier and the knowledge sub-typer defer for `kp-wholepage-osrp` panels so they classify as `knowledge`/`panel`. The true collapsed case (`30 year mortgage rates`) is unaffected
- Split bare-`tF2Cxc` organic bundles in `general` components into one result per organic (excluding People-Also-Ask sources), instead of collapsing them to a single result — recovers the organics on finance-style panels (e.g. an `aapl stock price` panel went from 1 to 9 organics)
- Added `election_dates`, `election_results`, and `election_resources` component types for the election widgets embedded in whole-page election panels
- Optimized `get_text` (the most-called parse helper) with a native-`text()` fast path: when a subtree contains no `script`/`style`/`template` and the separator semantics permit, it delegates to selectolax's C `text()` instead of the pure-Python fragment walker. Byte-identical output (verified over 315k nodes in the fixture corpus; all 87 snapshots unchanged) for ~7% faster `parse_serp` (corpus 3872 -> 3590 ms, median 39.9 -> 36.7 ms/SERP back-to-back, Python 3.13)

## [0.8.6] - 2026-05-26

Expand Down
19 changes: 19 additions & 0 deletions WebSearcher/_slx.py
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,25 @@ def get_text(node: Node | None, separator: str = "", strip: bool = False) -> str
+ ``get_text``."""
if node is None:
return None
# Fast path: selectolax's native C ``text()`` is byte-identical to the Python
# fragment walker below EXCEPT for two differences -- native includes
# script/style/template text and native keeps empty fragments (the
# ``strip=True`` walker drops them). On the script/style/template side, the
# walker skips those tags only as *descendants* (``_iter_text_fragments``
# does not special-case a root that is itself one), so two guards are needed:
# ``node.tag not in _SKIP_TEXT_TAGS`` (root) and the ``css_first`` probe
# (descendants). The empty-fragment difference vanishes when ``separator`` is
# empty (empties add nothing to a ""-join, so kept-vs-dropped is invisible)
# or ``strip`` is False (both keep empties identically). Together these
# cover ~95% of calls on this corpus; the one remaining case (``strip=True``
# with a non-empty separator) keeps the Python walker. Verified
# byte-identical over the full fixture corpus (315k nodes).
Comment thread
gitronald marked this conversation as resolved.
if (
(separator == "" or not strip)
and node.tag not in _SKIP_TEXT_TAGS
and node.css_first("script,style,template") is None
):
return node.text(deep=True, separator=separator, strip=strip)
frags = _iter_text_fragments(node)
if strip:
parts = [s for s in (f.strip() for f in frags) if s]
Expand Down
95 changes: 95 additions & 0 deletions docs/plans/035-get-text-native-fastpath.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
---
status: done
branch: claude/new-benchmark-PMegd
created: 2026-06-01
---

# `get_text` native-`text()` fast path (post-selectolax benchmark)

A fresh benchmark of the current parse pipeline (the first since the selectolax
native rewrite in plan 026 and the parser additions in plans 027-034) found that
the pure-Python `get_text` fragment walker had become the single largest
*optimizable* cost. This plan records that benchmark and a byte-identical fast
path that recovers ~7% of `parse_serp` latency.

Methodology follows plan 023: per-SERP median + MAD, gate on the run-to-run noise
floor, and only trust **back-to-back same-session A/B** numbers (never chain
deltas across sessions or machines).

## Environment

`scripts/bench_parse.py` now records the interpreter at the top of every run
(`platform.python_version()` / implementation / platform + `WebSearcher`
version) -- parse timings are only comparable within one Python build, and the
repo's pinned `.python-version` (`3.14.0rc2`) currently can't import the package
(`pydantic 2.13.4` vs the 3.14 RC `typing._eval_type` signature). All numbers
below are **Python 3.13.12, CPython, linux**, fixture corpus
`tests/fixtures/serps.json.bz2` (87 SERPs).

## Baseline benchmark (current HEAD)

`bench_parse.py --iterations 10 --runs 3`:

- median **39.5 ms/SERP**, MAD 13.6 ms; min 17.3 / p90 75.7 / max 115.7 ms
- corpus 3760.6 ms/pass; inter-run MAD 5.7 ms -> **noise floor ~0.3%** (idle box).

cProfile (`--profile`, 870 parses, 52.6 s, tottime) top buckets:

| Frame | self | nature |
|---|---|---|
| `make_soup` (lexbor parse) | 10.5 s (20%) | structural -- one parse/SERP, unavoidable |
| `_iter_text_fragments` (get_text walker) | 5.8 s; 9.6 s cum (**18%**) | pure-Python, hot -- the target |
| `_ComponentSignals.__init__` | 5.3 s; 6.9 s cum (13%) | pure-Python (one `css('*')` walk/component) |
| `_extract_from_html` (serp-features regex) | 2.0 s | |
| `_get_dom_positions` | 1.9 s | |

`get_text` is called ~176x/parse (153,580 over 870), each walking a subtree via a
Python stack of `.iter()` generators -- 824k fragment visits, 5.4M `next()` calls.

## The fast path

Plan 026 flagged native selectolax `.text()` as the next lever but "unsafe"
because (a) native includes `script`/`style`/`template` text (the walker skips
those subtrees) and (b) native `strip=True` keeps empty fragments (the walker
drops them). Both differences are **observable only under specific conditions**,
and outside them native C `text()` is byte-identical:

- (a) vanishes when the subtree has no `script`/`style`/`template`.
- (b) vanishes when `separator == ""` (an empty fragment adds nothing to a
`""`-join, so kept-vs-dropped is invisible) **or** `strip is False` (both keep
empties identically).

So `get_text` delegates to `node.text(deep=True, separator=sep, strip=strip)`
when `(separator == "" or not strip)` and the subtree holds no
script/style/template (one `css_first("script,style,template")` C probe); every
other call keeps the Python walker. The only call signature that always stays on
the walker is `strip=True` with a non-empty separator (`get_text(x, " ",
strip=True)`, 38 sites) -- exactly the drop-empties-with-visible-separator case.

**Correctness verification.** Over the full corpus (315,095 element nodes, 95.2%
fast-path-eligible) every fast-pathable signature -- `("", False)`, `("", True)`,
`(" ", False)`, `("<|>", False)` -- produced **0 mismatches** against the walker.
The snapshot suite stays green without updates (`uv run pytest`: 336 passed, 4
skipped, **87 snapshots unchanged**).

## Result (back-to-back A/B, same machine state)

Stash/pop A/B with `--iterations 10 --runs 3`:

| Metric | Baseline | Fast path | Delta |
|---|---|---|---|
| Corpus total | 3872.0 ms | 3590.3 ms | **-7.3%** |
| Per-SERP median | 39.9 ms | 36.7 ms | **-8.0%** |

Far above the ~0.4-0.5% noise floor. Post-change profile: `_iter_text_fragments`
self 5.8 -> 2.2 s, cum 9.6 -> 3.5 s, fragment visits 824k -> 276k (the remainder
is the `strip=True`/non-empty-sep walker, script-bearing subtrees, and
`has_text`/`knowledge_box`); the displaced work moved into lexbor's C `text()`.

## Left for follow-up

`_ComponentSignals.__init__` is now the #2 pure-Python cost (~5 s self, ~13%): one
`css('*')` walk per component building class/id/tag presence sets. A shared scan
feeding both it and `_get_dom_positions`/`reorder_by_dom_position` (each of which
also `css('*')`-walks) is the next structural lever -- deferred to keep this
change small and byte-identical.
96 changes: 96 additions & 0 deletions docs/plans/036-component-signals-and-extractor-hotpath.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
---
status: proposed
branch: TBD
created: 2026-06-01
---

# `_ComponentSignals` consolidation + extractor hot-path review

Follow-up to [plan 035](035-get-text-native-fastpath.md). With the `get_text`
native fast path banked (~7% off `parse_serp`), the profile's #2 *optimizable*
cost is now `classifiers/main.py:_ComponentSignals.__init__`, and the extractor
phase is the largest unprofiled-in-detail bucket worth a pass. This plan scopes
both. Same methodology as plans 023/035: per-SERP median + MAD, gate on the
run-to-run noise floor (~0.3-0.5% on the current idle box), and trust only
**back-to-back same-session A/B** numbers.

## Current baseline (Python 3.13.12, 87-SERP corpus)

From plan 035's profile (`bench_parse.py --profile`, 870 parses, post-fast-path):

| Frame | self | nature |
|---|---|---|
| `make_soup` (lexbor parse) | 10.1 s (~20%) | structural -- one parse/SERP |
| `_ComponentSignals.__init__` | 5.1 s; 6.7 s cum (**~13%**) | pure-Python, **this plan** |
| `_iter_text_fragments` (residual walker) | 2.2 s; 3.5 s cum | already optimized in 035 |
| `_get_dom_positions` | 1.5 s | full-document `css('*')` walk |
| `_ai_overview_payloads._iter_payload_blobs` | 1.6 s; 2.1 s cum | recent addition |
| `is_valid` (`extractor_main.py:568`) | 1.0 s; 1.8 s cum | extraction |
| `extract_from_standard` (`extractor_main.py:333`) | 0.8 s; 2.4 s cum | extraction |

## Lever 1: `_ComponentSignals` (primary)

`ClassifyMain.classify` builds a `_ComponentSignals` per main component -- one
`cmpt.css('*')` descendant walk that fills three sets (class tokens, ids, tag
names), feeding the necessary-signal preconditions on the classifier chain
(plan 023 item 3a). It is called ~13x/parse (11,390 over 870 parses) and is now
~13% of parse time. The cost is the per-element Python loop:
`set.update(cls.split())` (1.67M updates), `set.add(tag)` (2.47M adds),
`str.split` (1.72M), `el.attrs.get`/`el.id`/`el.tag` per element.

Candidate directions (each must stay byte-identical -- preconditions are
*necessary* conditions, so any change must not drop a real signal; pin with the
87-snapshot suite, no updates):

1. **Build only the signals the chain actually consults.** The `names` set is
queried for ~8 custom-element tags (`g-scrolling-carousel`, `g-tray-header`,
`block-component`, `h2`, `promo-throttler`, `product-viewer-group`,
`g-more-link`), yet every element's tag is added (2.47M adds). The `ids` set
is queried for a similarly small fixed set. Restricting `names`/`ids` to a
precomputed interest set (membership-test on add) trades 2.47M unconditional
adds for 2.47M cheap `in` checks against a small frozenset -- measure whether
that nets out, since the `in` check isn't free either. Classes are broadly
consulted and likely must stay full.
2. **Lexbor-side signal extraction.** Investigate whether a small number of
targeted `css_first(...)` C probes for the gated signals beat one Python
`css('*')` walk that materializes three sets -- i.e. revisit whether the 3a
"presence set" is still the right shape now that the walker (not `find`
misses) is the cost. This is the inverse of the 023 decision and must be
re-measured, not assumed.
3. **Share one document walk.** `_get_dom_positions` (1.5 s) already walks the
whole document with `css('*')`, and `reorder_by_dom_position._range` walks
each main-component subtree with `css('*')` again. A single document walk that
yields both the position map and per-component signal sets would remove
redundant traversals -- but it couples currently-independent phases and risks
the byte-identical contract; scope carefully and gate hard.

Recommended first step: option 1 (lowest risk, local to `_ComponentSignals`),
A/B it, then decide whether option 3's shared walk is worth the coupling.

## Lever 2: extractor hot-path review (investigate)

`ExtractorMain` (`extract_from_standard`, `get_layout`, `is_valid`,
`_ads_bottom`) is the largest phase after `make_soup` once classify is addressed,
and it grew ~648 lines since plan 023 (new layouts, kp-wholepage sub-columns,
complementary panels). It has not had a dedicated profiling pass on the
selectolax backend. Worth investigating:

- `is_valid` (1.8 s cum, 25,460 calls) runs per candidate component -- re-check
the bad-label text scan and survey-throttler probe on lexbor nodes (the 023
bounds were tuned for bs4).
- `extract_from_standard` / the `_StandardLayout` dispatch -- look for repeated
`css`/`css_first` over the same subtrees across layout detection and block
collection that could be hoisted or shared.
- `_iter_payload_blobs` (ai_overview, 2.1 s cum) -- a recent addition; confirm it
isn't re-walking payload subtrees.

No commitment yet -- this lever is a profiling/scoping task that may or may not
surface a gateable win. Capture a `--profile-sort cumulative` split by phase
(as plan 023 did) before touching extractor code.

## Verification gate (per change)

- `uv run pytest` -- 87 snapshots green **without updates**, full suite passing.
- Back-to-back A/B of `scripts/bench_parse.py` over the fixture corpus, same
session, clearing the noise floor; record numbers in this plan's Log.
- `ruff check` / `ruff format --check` clean.
8 changes: 8 additions & 0 deletions scripts/bench_parse.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,8 +13,10 @@
import cProfile
import gc
import logging
import platform
import pstats
import statistics
import sys
import time
from pathlib import Path

Expand Down Expand Up @@ -72,6 +74,12 @@ def main(
"""Benchmark or profile parse_serp over the fixture corpus."""
logging.getLogger("WebSearcher").setLevel(logging.WARNING)

# Record the interpreter: timings are only comparable within one Python build.
typer.echo(
f"Python {platform.python_version()} "
f"({platform.python_implementation()}, {sys.platform}) | WebSearcher {ws.__version__}"
)

paths = fixtures or sorted(FIXTURES_DIR.glob("serps-v*.json.bz2"))
if not paths:
typer.echo(f"No fixtures found in {FIXTURES_DIR}")
Expand Down