Open
Labels: area: prefilter, priority: high, strategy: teddy, type: performance
Description
Problem
The coregex Teddy SIMD multi-pattern prefilter is 8.6x slower than the Rust regex crate's Teddy on identical workloads, despite implementing the same algorithm (PSHUFB nibble-based matching).
Benchmark (6 MB DNA input, AMD EPYC, regex-bench CI)
| Engine | dna_4 time | Throughput |
|---|---|---|
| Rust regex (Teddy AVX2, inlined) | 3.7 ms | 1.6 GB/s |
| coregex (Teddy SSSE3, Go asm) | 32 ms | 187 MB/s |
| Go stdlib regexp | 349 ms | 17 MB/s |
All 9 regexdna patterns affected (v0.12.3, UseTeddy strategy).
Root Cause
Go/assembly function call boundary overhead:
- Each `findSIMD()` call is a Go→asm→Go round-trip (~50-65 cycles: function call + VZEROUPPER + register save/restore)
- 375K calls per 6 MB scan (16 bytes/call with SSSE3)
- AVX2 is disabled: it was 4x slower than SSSE3 because the VZEROUPPER cost dominated
- Rust avoids this entirely: `#[inline(always)]` keeps the whole find+verify loop in one native code block
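
The call count and per-call cost above can be turned into a total cycle budget for the boundary crossings. This is a minimal sketch of that arithmetic only; `callsPerScan` is a hypothetical helper, and it assumes "6 MB" means 6,000,000 bytes, which is what the 375K figure implies.

```go
package main

import "fmt"

// callsPerScan returns how many fixed-size chunks (and therefore Go->asm
// calls) one scan needs when each call handles chunkBytes bytes.
// Hypothetical helper for illustration; not part of coregex.
func callsPerScan(haystackBytes, chunkBytes int) int {
	return (haystackBytes + chunkBytes - 1) / chunkBytes
}

func main() {
	const haystack = 6_000_000 // assumed size of the 6 MB DNA input
	const chunk = 16           // bytes per call with SSSE3

	calls := callsPerScan(haystack, chunk)
	fmt.Println("calls per scan:", calls)

	// Round-trip cost range quoted in the list above (cycles per call).
	fmt.Println("boundary cycles, low estimate: ", calls*50)
	fmt.Println("boundary cycles, high estimate:", calls*65)
}
```

Converting the cycle totals to milliseconds depends on the clock frequency of the benchmark machine, which is not stated here.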
Profiling (local, 6MB DNA)
FindAllIndicesStreaming: avg 25.1 ms
Raw Teddy.FindMatch: avg 24.0 ms (96%)
FindAll loop overhead: avg 1.1 ms ( 4%)
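Teddy.FindMatch accounts for 96% of the run time, so essentially the whole gap sits in the SIMD search path and its per-call overhead, not in the FindAll loop.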
Breakdown
| Factor | Estimated Impact |
|---|---|
| SSSE3 vs AVX2 (16 vs 32 bytes/iter) | ~2x |
| Go/asm boundary per findSIMD call | ~3-4x |
| Go method dispatch + bounds checks | ~1.2x |
| Combined (product of the factors above) | ~7-10x (measured: 8.6x) |
Potential Solutions
A. simd/archsimd Intrinsics (Go 1.26, GOEXPERIMENT=simd)
Rewrite Teddy core using compiler intrinsics — eliminates Go/asm boundary entirely.
Needs a POC to verify that Permute() maps to PSHUFB and ToMask() to PMOVMSKB, and to measure the resulting performance.
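
A POC for option A needs something to compare the intrinsics against. The sketch below is a scalar model of the two instructions and of one nibble-matching step, written in plain Go so it can serve as a test oracle. It deliberately does not call the simd/archsimd API, whose exact semantics are the open question. The function names, the bucket assignment, and the sample chunk are all assumptions made for illustration.

```go
package main

import "fmt"

// pshufb models 128-bit PSHUFB: lane i yields table[idx[i]&0x0F], or 0
// when idx[i] has its high bit set.
func pshufb(table, idx [16]byte) [16]byte {
	var out [16]byte
	for i, x := range idx {
		if x&0x80 == 0 {
			out[i] = table[x&0x0F]
		}
	}
	return out
}

// pmovmskb models PMOVMSKB: bit i of the result is the high bit of lane i.
func pmovmskb(v [16]byte) uint16 {
	var m uint16
	for i, b := range v {
		m |= uint16(b>>7) << i
	}
	return m
}

// candidates models one nibble-matching step: each lane receives the set of
// buckets (one bit per bucket) whose low-nibble and high-nibble tables both
// accept that haystack byte.
func candidates(loMask, hiMask, chunk [16]byte) [16]byte {
	var lo, hi [16]byte
	for i, c := range chunk {
		lo[i] = c & 0x0F
		hi[i] = c >> 4
	}
	l, h := pshufb(loMask, lo), pshufb(hiMask, hi)
	var out [16]byte
	for i := range out {
		out[i] = l[i] & h[i]
	}
	return out
}

func main() {
	// Hypothetical buckets: bit 0 = patterns starting with 'a' (0x61),
	// bit 1 = patterns starting with 'g' (0x67).
	var loMask, hiMask [16]byte
	loMask[0x1] |= 1
	hiMask[0x6] |= 1
	loMask[0x7] |= 2
	hiMask[0x6] |= 2

	var chunk [16]byte
	copy(chunk[:], "agctagctagctagct")
	cand := candidates(loMask, hiMask, chunk)
	fmt.Println("bucket bits per lane:", cand)

	// Turn non-zero lanes into 0xFF so pmovmskb reports candidate positions.
	var hit [16]byte
	for i, c := range cand {
		if c != 0 {
			hit[i] = 0xFF
		}
	}
	fmt.Printf("candidate mask: %016b\n", pmovmskb(hit))
}
```

A POC could feed the same tables and chunks to the intrinsic versions and require lane-for-lane equality with these functions before any benchmarking.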
B. Batch FindAll in Assembly
Single Go→asm call for entire haystack. Find+verify loop stays in asm.
Results written to pre-allocated buffer.
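
The shape of such a batch call might look like the sketch below. `findAllBatch` is a pure-Go stand-in for the assembly routine, not an existing coregex function; the `Match` type, the resume offset, and the naive prefix check that replaces the SIMD find+verify loop are all assumptions chosen to show the calling convention only.

```go
package main

import (
	"bytes"
	"fmt"
)

// Match is a hypothetical result record: a half-open byte range.
type Match struct{ Start, End int }

// findAllBatch stands in for a single Go->asm call that scans the haystack
// from offset from, writes non-overlapping matches into dst, and returns
// the number written plus the offset to resume from. It returns early when
// dst is full, so the caller never allocates inside the scan.
func findAllBatch(haystack []byte, patterns [][]byte, from int, dst []Match) (n, next int) {
	i := from
	for i < len(haystack) {
		matched := false
		for _, p := range patterns {
			if len(p) > 0 && bytes.HasPrefix(haystack[i:], p) {
				if n == len(dst) {
					return n, i // buffer full: resume at this offset
				}
				dst[n] = Match{i, i + len(p)}
				n++
				i += len(p)
				matched = true
				break
			}
		}
		if !matched {
			i++
		}
	}
	return n, len(haystack)
}

func main() {
	haystack := []byte("aagggtaaacc")
	patterns := [][]byte{[]byte("ggg"), []byte("taaa")}
	buf := make([]Match, 1) // deliberately tiny to exercise the resume path

	for from := 0; from < len(haystack); {
		n, next := findAllBatch(haystack, patterns, from, buf)
		for _, m := range buf[:n] {
			fmt.Printf("match [%d,%d) %q\n", m.Start, m.End, haystack[m.Start:m.End])
		}
		from = next
	}
}
```

With a realistically sized buffer the number of boundary crossings depends on the match count rather than on the haystack length.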
C. Monolithic ASM Teddy
Entire FindMatch loop in assembly. Maximum performance, highest maintenance.
Research
Full analysis: docs/dev/research/teddy-10x-gap-analysis.md
Related
- DNA benchmark results: https://github.com/kolkov/regex-bench/actions/runs/22075682957
- v0.12.3 cross-product fix: #119, "fix: cross-product literal expansion for char classes (110x speedup on regexdna)"
- AVX2 regression: `prefilter/teddy_ssse3_amd64.go:76-82`