-
Notifications
You must be signed in to change notification settings - Fork 5
Description
Problem
ReplaceAllStringFunc currently runs the full PikeVM path for every match: forward scan to find match end, then capture extraction. For patterns like \{\{(.*?)\}\} on large inputs (~10MB, 50K replacements), this is significantly slower than Rust's approach.
Rust regex uses lazy DFA for the forward scan to find match boundaries, then falls back to PikeVM only for capture group extraction within the already-found match span. This minimizes the expensive NFA simulation to small bounded regions.
Current State (v0.12.7)
Phase 1 (PR #134) fixed the PikeVM split-chain overhead for . with sparse-dispatch — 2.8-4.8x speedup. But the architectural gap remains: PikeVM still does full forward scanning where a DFA would be faster.
LangArena Template::Regex (\{\{(.*?)\}\}, ReplaceAllStringFunc, ~10MB text):
- Rust: ~3.8s (lazy DFA forward + PikeVM captures)
- Go stdlib: ~6.5s (rune-native NFA)
- coregex v0.12.7: TBD (was ~20s pre-fix, expect ~5-7s with sparse-dispatch)
Proposed Approach
- For
ReplaceAllStringFunc/ReplaceAllString, use lazy DFA for forward match boundary detection - Once match boundaries are found, run PikeVM only on the match span for capture extraction
- This matches Rust's two-phase architecture: DFA for speed, NFA for captures
Impact
- Template::Regex: target Rust parity (~3-4s)
- All
ReplaceAll*andFindAllSubmatch*workloads on large inputs - Follows from perf: PikeVM UTF-8 dot alternation overhead (3x slower than stdlib on Template::Regex) #132 (Phase 1: sparse-dispatch)