perf: streaming ReplaceAll + DFA-first FindSubmatchAt (#135) #136

Merged: kolkov merged 7 commits into main from feature/dfa-first-replaceall on Mar 10, 2026

kolkov (Contributor) commented Mar 10, 2026

Summary

Rust-style two-phase search architecture + streaming ReplaceAll for Issue #135.

  • Streaming ReplaceAll: ReplaceAllStringFunc, ReplaceAllFunc, ReplaceAllLiteral, and ReplaceAllLiteralString are converted from two-pass (collect all indices → iterate) to single-pass streaming. This eliminates the [][]int allocation and returns the original string when there are no matches (Cow-like optimization).

  • DFA-first FindSubmatchAt: in Phase 1, the DFA/strategy finds match boundaries [start, end]. In Phase 2, PikeVM runs anchored within [start..end] for captures. This reduces PikeVM work from O(remaining_haystack) to O(match_len) per match. For 50K matches on 10MB: ~400x less PikeVM work. Also adds the is_capture_search_needed optimization: when only group 0 is needed, PikeVM is skipped entirely.

  • FindAllSubmatch context fix: now uses FindSubmatchAt with full haystack preservation instead of slicing (haystack[pos:]), fixing lookbehind context loss for \b at match boundaries.

Based on deep research of Rust regex architecture (docs/dev/research/dfa-first-replaceall-research.md).

Closes #135.

Test plan

  • go test ./... — all 11 packages pass
  • gofmt -l . — clean
  • golangci-lint run — 0 issues
  • CI: tests + benchmark comparison
  • regex-bench validation (post-merge)

kolkov added 3 commits March 10, 2026 20:28
…135)

Convert ReplaceAllStringFunc, ReplaceAllFunc, ReplaceAllLiteral,
ReplaceAllLiteralString from two-pass (collect indices then replace)
to single-pass streaming. Eliminates [][]int allocation for
high-match-count inputs. Returns original string when no matches
(Cow-like optimization).
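
A minimal sketch of the single-pass idea in this commit message, under stated assumptions: the finder type and literalFinder are hypothetical stand-ins for an engine call that searches from a position without slicing, and replaceAllStreaming is not the coregex implementation. Empty-match handling is simplified to a one-byte step.

```go
package main

import (
	"fmt"
	"strings"
)

// finder is a hypothetical stand-in for an engine call that searches s
// starting at pos and reports absolute match offsets.
type finder func(s string, pos int) (start, end int, ok bool)

// literalFinder returns a finder for a fixed substring, used here only so
// the sketch is runnable without a regex engine.
func literalFinder(lit string) finder {
	return func(s string, pos int) (int, int, bool) {
		i := strings.Index(s[pos:], lit)
		if i < 0 {
			return 0, 0, false
		}
		return pos + i, pos + i + len(lit), true
	}
}

// replaceAllStreaming writes output as each match is found instead of first
// collecting every match index into a [][]int.
func replaceAllStreaming(s string, find finder, repl func(string) string) string {
	var b strings.Builder
	matched := false
	last, pos := 0, 0 // last: end of the previous match already copied
	for pos <= len(s) {
		start, end, ok := find(s, pos)
		if !ok {
			break
		}
		matched = true
		b.WriteString(s[last:start]) // text between matches
		b.WriteString(repl(s[start:end]))
		last, pos = end, end
		if end == start {
			pos++ // step past an empty match (byte-wise; ignores UTF-8)
		}
	}
	if !matched {
		return s // no matches: hand back the original string unchanged
	}
	b.WriteString(s[last:])
	return b.String()
}

func main() {
	plus := func(string) string { return "+" }
	fmt.Println(replaceAllStreaming("a-b-c", literalFinder("-"), plus))
	fmt.Println(replaceAllStreaming("abc", literalFinder("-"), plus))
}
```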

Implement Rust-style two-phase search for capture extraction:
Phase 1: DFA/strategy finds match boundaries [start, end]
Phase 2: PikeVM runs anchored within [start..end] for captures

Add SearchWithCapturesInSpan to PikeVM.
Reduces PikeVM work from O(remaining_haystack) to O(match_len)
per match. For 50K matches on 10MB: ~400x less PikeVM work.

Also optimize: skip PikeVM entirely when CaptureCount <= 1
(only group 0 needed — DFA result already provides boundaries).
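
The two phases can be sketched with the standard library regexp package. Everything here is an assumption for illustration: boundsRe stands in for the fast boundary search (the standard library does not expose a DFA), capturesRe stands in for the anchored capture pass, and findSubmatchTwoPhase is a hypothetical name. The sketch slices the span, which is only safe because this pattern has no look-around; an engine that needs context would pass span bounds instead.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Phase 1 stand-in: locates the overall match, no capture groups.
	boundsRe = regexp.MustCompile(`\w+@\w+\.com`)
	// Phase 2 stand-in: same pattern with groups, anchored to the span.
	capturesRe = regexp.MustCompile(`^(\w+)@(\w+)\.com$`)
)

// findSubmatchTwoPhase returns submatch offsets relative to haystack,
// or nil when there is no match.
func findSubmatchTwoPhase(haystack string) []int {
	// Phase 1: find match boundaries [start, end].
	span := boundsRe.FindStringIndex(haystack)
	if span == nil {
		return nil
	}
	start, end := span[0], span[1]

	// Only group 0 needed: the boundaries are already the answer.
	if capturesRe.NumSubexp() == 0 {
		return []int{start, end}
	}

	// Phase 2: extract captures, looking only at match_len bytes.
	idx := capturesRe.FindStringSubmatchIndex(haystack[start:end])
	if idx == nil {
		return nil
	}
	for i := range idx {
		if idx[i] >= 0 {
			idx[i] += start // shift span-relative offsets back
		}
	}
	return idx
}

func main() {
	fmt.Println(findSubmatchTwoPhase("mail bob@example.com now"))
}
```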

Rewrite FindAllSubmatch to use FindSubmatchAt internally,
benefiting from the same two-phase optimization.

github-actions bot commented Mar 10, 2026

Benchmark Comparison

Comparing main → PR #136

Summary: geomean 88.49n → 85.46n (-3.42%)

⚠️ Potential regressions detected:

benchmark                                               main            PR #136         delta
AhoCorasickVsStdlib/coregex_IsMatch-4                   310.2n ± ∞ ¹    312.2n ± ∞ ¹    +0.64% (p=0.032 n=5)
AhoCorasickVsStdlib/stdlib_Find-4                       170.8µ ± ∞ ¹    179.5µ ± ∞ ¹    +5.10% (p=0.016 n=5)
AhoCorasickLargeInput/coregex_IsMatch_64KB-4            103.0µ ± ∞ ¹    103.3µ ± ∞ ¹    +0.35% (p=0.008 n=5)
AhoCorasickLargeInput/coregex_Find_64KB-4               103.4µ ± ∞ ¹    104.1µ ± ∞ ¹    +0.67% (p=0.032 n=5)
AhoCorasickManyPatterns/stdlib_10_patterns-4            175.6n ± ∞ ¹    183.5n ± ∞ ¹    +4.50% (p=0.008 n=5)
MatchAnchoredLiteral/short_match-4                      7.019n ± ∞ ¹    7.067n ± ∞ ¹    +0.68% (p=0.008 n=5)

Full results available in workflow artifacts. CI runners have ~10-20% variance.
For accurate benchmarks, run locally: ./scripts/bench.sh --compare

kolkov added 3 commits March 10, 2026 21:19
…pturesAt

Two fixes for PR #136 CI failure:

1. FindSubmatchAt: skip two-phase search for UseBoundedBacktracker and UseNFA
   strategies — BT's recursive backtrackFindWithState overflows 250MB stack
   on 386 with deep UTF-8 NFA chains. These strategies don't benefit from
   two-phase anyway (Phase 1 uses the same engine as Phase 2).

2. SearchWithCapturesAt: use matchesEmptyAt(haystack, at) instead of
   matchesEmpty() for at==len(haystack) fast path. matchesEmpty() loses
   lookbehind context (evaluates with nil,0), causing \B false positives
   at end-of-haystack. SearchWithSlotTableAt already used the correct
   matchesEmptyAt — this aligns SearchWithCapturesAt to match.
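
The second fix comes down to evaluating an empty-width assertion with or without the bytes around the position. A small sketch of that difference for \B, assuming ASCII word characters; notWordBoundaryAt and isWordByte are hypothetical helpers and not the matchesEmptyAt implementation.

```go
package main

import "fmt"

// isWordByte reports whether b is an ASCII word character ([0-9A-Za-z_]).
func isWordByte(b byte) bool {
	return b == '_' ||
		(b >= '0' && b <= '9') ||
		(b >= 'a' && b <= 'z') ||
		(b >= 'A' && b <= 'Z')
}

// notWordBoundaryAt reports whether \B holds at offset at, looking at the
// byte before and the byte after that offset in haystack.
func notWordBoundaryAt(haystack []byte, at int) bool {
	before := at > 0 && isWordByte(haystack[at-1])
	after := at < len(haystack) && isWordByte(haystack[at])
	return before == after // same class on both sides: not a boundary
}

func main() {
	haystack := []byte("ab")

	// Context-free check (nil, 0): nothing on either side, so \B holds.
	fmt.Println(notWordBoundaryAt(nil, 0))

	// Context-aware check at end-of-haystack: 'b' precedes the position,
	// so it is a word boundary and \B must not hold.
	fmt.Println(notWordBoundaryAt(haystack, len(haystack)))
}
```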

…rhead

FindAllSubmatch now acquires SearchState once for entire iteration loop,
matching the pattern in findAllIndicesLoop. Extracted findSubmatchAtWithState
internal method shared by both FindSubmatchAt (public) and FindAllSubmatch.

Prevents race detector test timeouts (>10 min) caused by thousands of
sync.Pool get/put operations per FindAllSubmatch call.
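
A sketch of the acquire-once pattern this commit describes, using sync.Pool from the standard library. The searchState type, the acquisitions counter, and the literal search are assumptions made so the example runs on its own; they do not reflect the actual coregex SearchState.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// searchState is a hypothetical per-search scratch area.
type searchState struct {
	scratch []int
}

var statePool = sync.Pool{New: func() any { return new(searchState) }}

// acquisitions counts pool round-trips, for illustration only
// (it is not goroutine-safe).
var acquisitions int

func acquireState() *searchState {
	acquisitions++
	return statePool.Get().(*searchState)
}

// findAtWithState searches from pos, reusing st for scratch space.
// A non-empty needle is assumed.
func findAtWithState(st *searchState, haystack, needle string, pos int) (int, int, bool) {
	i := strings.Index(haystack[pos:], needle)
	if i < 0 {
		return 0, 0, false
	}
	st.scratch = append(st.scratch[:0], pos+i, pos+i+len(needle))
	return st.scratch[0], st.scratch[1], true
}

// findAll takes one state from the pool for the whole iteration loop
// instead of one Get/Put pair per match.
func findAll(haystack, needle string) [][2]int {
	st := acquireState()
	defer statePool.Put(st)

	var out [][2]int
	for pos := 0; pos <= len(haystack); {
		start, end, ok := findAtWithState(st, haystack, needle, pos)
		if !ok {
			break
		}
		out = append(out, [2]int{start, end})
		pos = end
	}
	return out
}

func main() {
	matches := findAll("abcabcabc", "abc")
	fmt.Println(len(matches), "matches,", acquisitions, "pool acquisition")
}
```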

Strategies UseDFA, UseBoth, UseDigitPrefilter access shared mutable
state (e.dfa lazy DFA, e.pikevm) in their findIndicesAt dispatch paths.
When findSubmatchAtWithState routes Phase 1 through these strategies,
concurrent FindSubmatch calls race on the shared state.

Fix: extend the two-phase bypass guard to include all strategies that
use shared mutable state. These strategies now go directly to the
pooled PikeVM (state.pikevm) for capture extraction, which is
thread-safe by design.

Strategies that remain eligible for two-phase search all use their
own immutable instances (ReverseSuffix, ReverseInner, CharClassSearcher,
CompositeSearcher, etc.).
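
The combined bypass guard from these commits can be pictured as a single switch. This is a hypothetical sketch: the Strategy type, its numeric values, the twoPhaseEligible name, and the UseReverseSuffix/UseReverseInner constant names are assumptions; only the grouping of strategies follows the commit messages above.

```go
package main

import "fmt"

// Strategy is a hypothetical enum; the real type and values may differ.
type Strategy int

const (
	UseNFA Strategy = iota
	UseBoundedBacktracker
	UseDFA
	UseBoth
	UseDigitPrefilter
	UseReverseSuffix // assumed name for the ReverseSuffix strategy
	UseReverseInner  // assumed name for the ReverseInner strategy
)

// twoPhaseEligible reports whether capture extraction may run Phase 1
// through the strategy before the anchored PikeVM pass.
func twoPhaseEligible(s Strategy) bool {
	switch s {
	case UseBoundedBacktracker, UseNFA:
		// Phase 1 would use the same engine as Phase 2: no benefit.
		return false
	case UseDFA, UseBoth, UseDigitPrefilter:
		// Dispatch path touches shared mutable state: go straight to
		// the pooled PikeVM instead.
		return false
	default:
		// Strategies backed by their own immutable instances.
		return true
	}
}

func main() {
	fmt.Println(twoPhaseEligible(UseDFA), twoPhaseEligible(UseReverseSuffix))
}
```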

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 63.18182% with 81 lines in your changes missing coverage. Please review.

Files with missing lines    Patch %    Lines
nfa/pikevm.go               0.00%      42 Missing and 1 partial ⚠️
regex.go                    80.74%     22 Missing and 4 partials ⚠️
meta/findall.go             71.42%     11 Missing and 1 partial ⚠️

kolkov merged commit f5da9d3 into main on Mar 10, 2026
8 checks passed

Development

Successfully merging this pull request may close these issues.

perf: DFA-first ReplaceAll for Rust-level performance on capture-heavy workloads
