Optimize get_text with a native selectolax text() fast path (~7% faster parse) by gitronald · Pull Request #145 · gitronald/WebSearcher

gitronald · 2026-06-01T13:36:10Z

Summary

A fresh benchmark of the parse pipeline (the first since the selectolax native rewrite in plan 026 and the parser additions in plans 027–034) found that the pure-Python get_text fragment walker had become the single largest optimizable cost (make_soup's lexbor parse is bigger but structural). This PR adds a byte-identical native-text() fast path that recovers ~7% of parse_serp latency.

The change

get_text delegates to lexbor's C text() when it is provably equivalent to the Python fragment walker:

the subtree has no script/style/template (native includes their text; the walker skips it), and
either separator == "" (an empty fragment adds nothing to a ""-join, so native's kept-empties are invisible) or strip is False (both keep empties identically).
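The two conditions above reduce to a small predicate. This is a minimal sketch, not the code in WebSearcher/_slx.py: the function name `fast_path_eligible` and the `has_skipped_tag` flag are hypothetical, and the caller is assumed to have already determined whether the subtree contains script, style, or template.

```python
from itertools import product


def fast_path_eligible(separator: str, strip: bool, has_skipped_tag: bool) -> bool:
    """Return True when native text() may stand in for the Python walker.

    has_skipped_tag is an assumed input: True when the subtree contains
    script, style, or template.
    """
    if has_skipped_tag:
        # Native text() would include text that the walker skips.
        return False
    # An empty separator hides kept-empty fragments; strip=False keeps
    # empties on both sides.
    return separator == "" or not strip


# Show which route each (separator, strip) combination would take.
for separator, strip in product(["", " "], [False, True]):
    route = "native" if fast_path_eligible(separator, strip, False) else "walker"
    print(f"separator={separator!r:4} strip={strip!s:5} -> {route}")
```

Only the visible-separator, strip=True combination falls through to the walker when no skipped tag is present.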

Every other call keeps the walker — notably the 38 get_text(x, " ", strip=True) sites (drop-empties with a visible separator), which is the one case where native and the walker diverge.
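The divergence can be illustrated with a toy model over a list of text fragments. Both functions below are hypothetical stand-ins written from the description in this PR (the walker drops empty fragments when strip is set; native keeps them), not the library code itself.

```python
def walker_join(fragments, separator="", strip=False):
    """Model of the Python walker: with strip, strip each fragment and drop empties."""
    if strip:
        stripped = [f.strip() for f in fragments]
        return separator.join(f for f in stripped if f)
    return separator.join(fragments)


def native_join(fragments, separator="", strip=False):
    """Model of native text(): with strip, strip each fragment but keep empties."""
    if strip:
        fragments = [f.strip() for f in fragments]
    return separator.join(fragments)


# A whitespace-only fragment between two text fragments (illustrative data).
fragments = ["Top stories", "\n  ", "More news"]

# Empty separator: the kept-empty fragment contributes nothing to the join.
print(walker_join(fragments, "", True) == native_join(fragments, "", True))

# Visible separator with strip=True: native emits a second separator around
# the empty fragment, so the outputs differ.
print(repr(walker_join(fragments, " ", True)))
print(repr(native_join(fragments, " ", True)))
```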

Also: scripts/bench_parse.py now records the interpreter version/platform at the top of every run, since parse timings are only comparable within one Python build.
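A header of this kind can be built from the standard library alone. The sketch below is an assumption about the shape of the output, not the actual code in scripts/bench_parse.py; `environment_header` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def environment_header(package: str = "WebSearcher") -> str:
    """Build a one-line description of the interpreter, platform, and package version."""
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package may not be installed in the environment running this sketch.
        pkg_version = "not installed"
    return (
        f"python {platform.python_version()} "
        f"({platform.python_implementation()}) on {platform.platform()} | "
        f"{package} {pkg_version}"
    )


# Print once at the top of a benchmark or profile run.
print(environment_header())
```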

Correctness (byte-identical)

0 mismatches across 315,095 element nodes of the fixture corpus, for every fast-pathable signature (("", False), ("", True), (" ", False), ("<|>", False)); 95.2% of nodes are fast-path-eligible.
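A sweep of this shape is straightforward to express generically. The following is a sketch under stated assumptions: nodes are modelled as plain lists of text fragments, and `reference` and `candidate` are hypothetical stand-ins for the walker and the native path rather than the real functions or the real fixture corpus.

```python
from typing import Callable, Iterable, Sequence

TextFn = Callable[[Sequence[str], str, bool], str]

# The fast-pathable signatures listed in this PR.
FAST_PATH_SIGNATURES = [("", False), ("", True), (" ", False), ("<|>", False)]


def count_mismatches(
    reference: TextFn,
    candidate: TextFn,
    nodes: Iterable[Sequence[str]],
    signatures=FAST_PATH_SIGNATURES,
) -> int:
    """Count (node, signature) pairs where the two implementations disagree."""
    mismatches = 0
    for fragments in nodes:
        for separator, strip in signatures:
            if reference(fragments, separator, strip) != candidate(fragments, separator, strip):
                mismatches += 1
    return mismatches


def reference(fragments, separator, strip):
    """Stand-in for the walker: drops empty fragments when strip is set."""
    parts = [f.strip() for f in fragments] if strip else list(fragments)
    return separator.join(p for p in parts if p or not strip)


def candidate(fragments, separator, strip):
    """Stand-in for native text(): keeps empty fragments."""
    parts = [f.strip() for f in fragments] if strip else list(fragments)
    return separator.join(parts)


# Tiny illustrative corpus, not the fixture corpus.
corpus = [["a", " ", "b"], ["", "x"], ["  lead", "trail  "]]
print(count_mismatches(reference, candidate, corpus))
```

Passing `[(" ", True)]` as the signature list makes the same harness report the disagreements that keep that signature on the walker.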
uv run pytest: 336 passed, 4 skipped, 87 snapshots unchanged (no snapshot updates).
ruff check / ruff format --check clean.

Result (back-to-back A/B, same machine, Python 3.13.12)

Metric	Baseline	Fast path	Delta
Corpus total	3872.0 ms	3590.3 ms	−7.3%
Per-SERP median	39.9 ms	36.7 ms	−8.0%
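
The Delta column follows from the Baseline and Fast path columns as a relative change. A short check of the arithmetic, using only the numbers in the table (`percent_delta` is a hypothetical helper name):

```python
def percent_delta(baseline: float, candidate: float) -> float:
    """Relative change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100.0


corpus_delta = round(percent_delta(3872.0, 3590.3), 1)  # corpus total, ms
median_delta = round(percent_delta(39.9, 36.7), 1)      # per-SERP median, ms

print(corpus_delta, median_delta)  # -7.3 -8.0
```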

Both deltas are far above the ~0.5% noise floor. Post-change profile: _iter_text_fragments self-time fell from 5.8 s to 2.2 s (cumulative 9.6 s to 3.5 s) and fragment visits fell from 824k to 276k; the displaced work moved into lexbor's C text().

Docs / plans

docs/plans/035-get-text-native-fastpath.md — records the benchmark, the fast-path correctness argument, and the A/B result.
docs/plans/036-component-signals-and-extractor-hotpath.md — scopes the next lever (_ComponentSignals, now ~13% of parse time) plus an extractor hot-path review.
CHANGELOG entry under [Unreleased].

Notes

The repo's pinned .python-version (3.14.0rc2) currently can't import the package (pydantic 2.13.4 vs the 3.14 RC typing._eval_type signature); all numbers were captured on Python 3.13.12. Flagged for a separate env/deps fix.

https://claude.ai/code/session_01XH4Tpn5aVFaEq814NoBTrC

Generated by Claude Code

The pure-Python get_text fragment walker was the largest optimizable cost in a fresh benchmark of the post-selectolax parse pipeline (~18% cumulative, 824k fragment visits/870 parses).

Delegate to lexbor's C text() when it is provably byte-identical: the subtree has no script/style/template (native includes their text; the walker skips it) AND either separator=='' or strip is False (so native's kept empty fragments are invisible). Every other call keeps the walker.

Verified byte-identical over the full fixture corpus (315k nodes, 0 mismatches) and the snapshot suite stays green without updates (336 passed, 87 snapshots). Back-to-back A/B on Python 3.13: corpus 3872 -> 3590 ms (-7.3%), median 39.9 -> 36.7 ms/SERP, well above the ~0.5% noise floor.

Also record the interpreter version/platform at the top of every bench_parse run, since parse timings are only comparable within one Python build.

Copilot

Pull request overview

This PR optimizes the Selectolax-based parsing pipeline by adding a correctness-preserving fast path in get_text that delegates to Selectolax’s native Node.text() when it is provably byte-identical to the existing Python fragment walker, improving parse_serp latency (~7–8% in the provided benchmark).

Changes:

Add a fast-path in WebSearcher._slx.get_text() to use native node.text(...) when subtree/tag conditions guarantee equivalence to the Python walker.
Enhance scripts/bench_parse.py to print interpreter/platform and WebSearcher version for benchmark comparability.
Add plan documentation (035 done, 036 proposed) and record the optimization in CHANGELOG.md.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File	Description
WebSearcher/_slx.py	Adds native `text()` delegation fast path in `get_text` under equivalence conditions.
scripts/bench_parse.py	Prints Python/platform + package version at benchmark start to contextualize timings.
docs/plans/035-get-text-native-fastpath.md	Documents the benchmark, equivalence argument, and measured performance win.
docs/plans/036-component-signals-and-extractor-hotpath.md	Proposes follow-up performance work focusing on `_ComponentSignals` and extractor profiling.
CHANGELOG.md	Notes the `get_text` fast-path optimization and benchmark result under Unreleased.

…e only as descendants Addresses a PR review note -- the Python walker skips those tags only when they are descendants, not when the root node is itself one, which is why the fast path needs both the node.tag guard and the css_first descendant probe.

claude added 3 commits June 1, 2026 07:08

bench_parse: record Python version and platform in output

86870c0

Parse timings are only comparable within one interpreter build, so print the Python version/implementation/platform (and WebSearcher version) at the top of every benchmark and profile run.

add plan 036: _ComponentSignals lever + extractor hot-path review

ea8a3be

gitronald requested a review from Copilot June 1, 2026 13:40

Copilot started reviewing on behalf of gitronald June 1, 2026 13:40

Copilot AI reviewed Jun 1, 2026

Comment thread WebSearcher/_slx.py

gitronald merged commit 4c1c8bd into feature/v0.9.0 Jun 1, 2026
3 checks passed

gitronald merged 4 commits into feature/v0.9.0 from claude/new-benchmark-PMegd
