
Optimize get_text with a native selectolax text() fast path (~7% faster parse)#145

Merged
gitronald merged 4 commits into feature/v0.9.0 from claude/new-benchmark-PMegd on Jun 1, 2026

Conversation

@gitronald
Owner

Summary

A fresh benchmark of the parse pipeline (the first since the selectolax native rewrite in plan 026 and the parser additions in plans 027–034) found that the pure-Python get_text fragment walker had become the single largest optimizable cost. (make_soup's lexbor parse costs more, but that cost is structural.) This PR adds a byte-identical fast path through native text() that recovers ~7% of parse_serp latency.

The change

get_text delegates to lexbor's C text() when it is provably equivalent to the Python fragment walker:

  • the subtree has no script/style/template (native includes their text; the walker skips it), and
  • either separator == "" (an empty fragment adds nothing to a ""-join, so native's kept-empties are invisible) or strip is False (both keep empties identically).

Every other call keeps the walker, notably the 38 get_text(x, " ", strip=True) sites. That combination (drop empties, visible separator) is the one separator/strip case where native text() and the walker diverge.
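The eligibility rule can be modeled in a few lines. This is a simplified sketch, not the code in WebSearcher/_slx.py: `native_join`, `walker_join`, and `can_use_native` are hypothetical names, text fragments are passed as a plain list rather than read from a parsed tree, and the script/style/template check is reduced to a boolean argument. The sketch assumes that native text() strips each fragment but keeps empty ones, as the description above states.

```python
def native_join(fragments, separator="", strip=False):
    """Model of native text(): optionally strip each fragment, keep empties."""
    parts = [f.strip() if strip else f for f in fragments]
    return separator.join(parts)


def walker_join(fragments, separator="", strip=False):
    """Model of the Python walker: with strip=True, empty fragments are dropped."""
    if strip:
        parts = [s for s in (f.strip() for f in fragments) if s]
    else:
        parts = list(fragments)
    return separator.join(parts)


def can_use_native(has_skipped_tag, separator, strip):
    """Hypothetical eligibility predicate following the two listed conditions."""
    return (not has_skipped_tag) and (separator == "" or not strip)


fragments = ["Title", "   ", " body "]

# Every fast-pathable signature gives identical output in both models.
for separator, strip in [("", False), ("", True), (" ", False), ("<|>", False)]:
    assert can_use_native(False, separator, strip)
    assert native_join(fragments, separator, strip) == walker_join(fragments, separator, strip)

# The excluded case: a visible separator with strip=True exposes the kept empty.
assert not can_use_native(False, " ", True)
print(repr(native_join(fragments, " ", True)))  # 'Title  body' (two spaces)
print(repr(walker_join(fragments, " ", True)))  # 'Title body'
```

The whitespace-only fragment is what separates the two models: it becomes an empty string after stripping, and only a non-empty separator makes that empty string visible in the joined result.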

Also: scripts/bench_parse.py now records the interpreter version/platform at the top of every run, since parse timings are only comparable within one Python build.
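A run header of this kind might look like the sketch below. It is an illustration under stated assumptions, not the code in scripts/bench_parse.py: `environment_header` is a hypothetical name, and the distribution name "WebSearcher" is assumed for the version lookup.

```python
import platform
from importlib import metadata


def environment_header(package="WebSearcher"):
    """Build a one-line description of the interpreter and package under test."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package may not be installed in the environment running this sketch.
        package_version = "not installed"
    return (
        f"python {platform.python_version()} "
        f"({platform.python_implementation()}) on {platform.platform()} | "
        f"{package} {package_version}"
    )


# Print once at the top of a benchmark or profile run.
print(environment_header())
```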

Correctness (byte-identical)

  • 0 mismatches across 315,095 element nodes of the fixture corpus, for every fast-pathable signature (("", False), ("", True), (" ", False), ("<|>", False)); 95.2% of nodes are fast-path-eligible.
  • uv run pytest: 336 passed, 4 skipped, 87 snapshots unchanged (no snapshot updates).
  • ruff check / ruff format --check clean.
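A byte-identity check of this shape can be written as a small differential harness. The sketch below is illustrative only: `differential_check`, `reference`, and `candidate` are hypothetical names, and each toy "node" is just a list of text fragments rather than an element from the fixture corpus.

```python
from itertools import product

FAST_SIGNATURES = [("", False), ("", True), (" ", False), ("<|>", False)]


def differential_check(nodes, signatures, reference, candidate):
    """Return (comparisons, mismatches) over every node x signature pair."""
    comparisons = mismatches = 0
    for node, (separator, strip) in product(nodes, signatures):
        comparisons += 1
        if reference(node, separator, strip) != candidate(node, separator, strip):
            mismatches += 1
    return comparisons, mismatches


def reference(fragments, separator, strip):
    """Toy walker: with strip=True, strip each fragment and drop empties."""
    parts = (f.strip() if strip else f for f in fragments)
    return separator.join(p for p in parts if p or not strip)


def candidate(fragments, separator, strip):
    """Toy native path: with strip=True, strip each fragment but keep empties."""
    return separator.join(f.strip() if strip else f for f in fragments)


nodes = [["a", " ", "b"], ["", "x"], ["only"]]
print(differential_check(nodes, FAST_SIGNATURES, reference, candidate))  # (12, 0)
print(differential_check(nodes, [(" ", True)], reference, candidate))    # (3, 2)
```

The second call runs the signature that the fast path excludes, which shows that the harness does detect divergence where it exists.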

Result (back-to-back A/B, same machine, Python 3.13.12)

| Metric | Baseline | Fast path | Delta |
|---|---|---|---|
| Corpus total | 3872.0 ms | 3590.3 ms | −7.3% |
| Per-SERP median | 39.9 ms | 36.7 ms | −8.0% |

Both deltas are far above the ~0.5% noise floor. Post-change profile: _iter_text_fragments self-time fell from 5.8 s to 2.2 s (cumulative 9.6 s to 3.5 s) and fragment visits from 824k to 276k; the displaced work moved into lexbor's C text().
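The reported deltas follow directly from the baseline and fast-path timings. The sketch below recomputes them; `pct_change` and `NOISE_FLOOR_PCT` are names chosen here for illustration, and the input values are copied from the results above.

```python
NOISE_FLOOR_PCT = 0.5  # approximate noise floor quoted in the results


def pct_change(baseline, after):
    """Relative change from baseline to after, in percent."""
    return (after - baseline) / baseline * 100.0


measurements = [
    ("Corpus total (ms)", 3872.0, 3590.3),
    ("Per-SERP median (ms)", 39.9, 36.7),
]

for label, baseline, after in measurements:
    delta = pct_change(baseline, after)
    above_noise = abs(delta) > NOISE_FLOOR_PCT
    print(f"{label}: {delta:+.1f}% (above noise floor: {above_noise})")
```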

Docs / plans

  • docs/plans/035-get-text-native-fastpath.md — records the benchmark, the fast-path correctness argument, and the A/B result.
  • docs/plans/036-component-signals-and-extractor-hotpath.md — scopes the next lever (_ComponentSignals, now ~13% of parse time) plus an extractor hot-path review.
  • CHANGELOG entry under [Unreleased].

Notes

The repo's pinned .python-version (3.14.0rc2) currently can't import the package (pydantic 2.13.4 vs the 3.14 RC typing._eval_type signature); all numbers were captured on Python 3.13.12. Flagged for a separate env/deps fix.

https://claude.ai/code/session_01XH4Tpn5aVFaEq814NoBTrC


Generated by Claude Code

claude added 3 commits June 1, 2026 07:08
Parse timings are only comparable within one interpreter build, so print
the Python version/implementation/platform (and WebSearcher version) at the
top of every benchmark and profile run.
The pure-Python get_text fragment walker was the largest optimizable cost in a
fresh benchmark of the post-selectolax parse pipeline (~18% cumulative, 824k
fragment visits/870 parses). Delegate to lexbor's C text() when it is provably
byte-identical: the subtree has no script/style/template (native includes their
text; the walker skips it) AND either separator=='' or strip is False (so
native's kept empty fragments are invisible). Every other call keeps the walker.

Verified byte-identical over the full fixture corpus (315k nodes, 0 mismatches)
and the snapshot suite stays green without updates (336 passed, 87 snapshots).
Back-to-back A/B on Python 3.13: corpus 3872 -> 3590 ms (-7.3%), median 39.9 ->
36.7 ms/SERP, well above the ~0.5% noise floor.

Also record the interpreter version/platform at the top of every bench_parse
run, since parse timings are only comparable within one Python build.
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes the Selectolax-based parsing pipeline by adding a correctness-preserving fast path in get_text that delegates to Selectolax’s native Node.text() when it is provably byte-identical to the existing Python fragment walker, improving parse_serp latency (~7–8% in the provided benchmark).

Changes:

  • Add a fast-path in WebSearcher._slx.get_text() to use native node.text(...) when subtree/tag conditions guarantee equivalence to the Python walker.
  • Enhance scripts/bench_parse.py to print interpreter/platform and WebSearcher version for benchmark comparability.
  • Add plan documentation (035 done, 036 proposed) and record the optimization in CHANGELOG.md.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| WebSearcher/_slx.py | Adds native text() delegation fast path in get_text under equivalence conditions. |
| scripts/bench_parse.py | Prints Python/platform + package version at benchmark start to contextualize timings. |
| docs/plans/035-get-text-native-fastpath.md | Documents the benchmark, equivalence argument, and measured performance win. |
| docs/plans/036-component-signals-and-extractor-hotpath.md | Proposes follow-up performance work focusing on _ComponentSignals and extractor profiling. |
| CHANGELOG.md | Notes the get_text fast-path optimization and benchmark result under Unreleased. |


Comment thread WebSearcher/_slx.py
…e only as descendants

Addresses a PR review note -- the Python walker skips those tags only when they
are descendants, not when the root node is itself one, which is why the fast
path needs both the node.tag guard and the css_first descendant probe.
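The root-versus-descendant distinction can be illustrated with a toy tree. This is a sketch under stated assumptions, not the code in WebSearcher/_slx.py: `ToyNode`, `walker_fragments`, `has_skipped_descendant`, and `subtree_allows_fast_path` are hypothetical names, and a recursive scan stands in for the css_first descendant probe mentioned in the commit message.

```python
from dataclasses import dataclass, field

SKIPPED_TAGS = frozenset({"script", "style", "template"})


@dataclass
class ToyNode:
    tag: str
    text: str = ""  # simplified: one text fragment per node
    children: list = field(default_factory=list)


def walker_fragments(node, is_root=True):
    """Toy walker: skips script/style/template only below the root."""
    if not is_root and node.tag in SKIPPED_TAGS:
        return
    if node.text:
        yield node.text
    for child in node.children:
        yield from walker_fragments(child, is_root=False)


def has_skipped_descendant(node):
    """Stand-in for the descendant probe: any skipped tag below this node?"""
    return any(
        child.tag in SKIPPED_TAGS or has_skipped_descendant(child)
        for child in node.children
    )


def subtree_allows_fast_path(node):
    """Both guards: the root's own tag and a probe of its descendants."""
    return node.tag not in SKIPPED_TAGS and not has_skipped_descendant(node)


page = ToyNode("div", children=[ToyNode("p", "hello"), ToyNode("script", "x = 1")])
script_root = ToyNode("script", "x = 1")

print(list(walker_fragments(page)))         # ['hello']
print(list(walker_fragments(script_root)))  # ['x = 1']
print(subtree_allows_fast_path(page), subtree_allows_fast_path(script_root))  # False False
```

In this model, a script node contributes text when it is the root but not when it sits below the root, so the sketch treats both situations as ineligible and leaves them to the walker.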
@gitronald gitronald merged commit 4c1c8bd into feature/v0.9.0 Jun 1, 2026
3 checks passed

3 participants