perf: Teddy SIMD 8.6x slower than Rust due to Go/assembly boundary overhead #120

@kolkov

Problem

Teddy SIMD multi-pattern prefilter is 8.6x slower than Rust on identical workloads despite implementing the same algorithm (PSHUFB nibble-based matching).
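For readers unfamiliar with the nibble trick, here is a minimal scalar sketch of what one PSHUFB-based lookup computes per byte. It assumes a one-byte fingerprint and a simplistic bucket assignment (pattern i goes to bucket i%8). The names `nibbleMasks`, `buildMasks` and `candidates` are hypothetical and not taken from coregex.

```go
package main

import "fmt"

// nibbleMasks holds the two 16-entry tables a PSHUFB-based matcher would keep
// in SIMD registers. Bit b of an entry is set when some pattern in bucket b
// starts with a byte that has that low (or high) nibble.
type nibbleMasks struct {
	lo [16]uint8
	hi [16]uint8
}

// buildMasks assigns pattern i to bucket i%8 (an illustrative assumption)
// and records both nibbles of each pattern's first byte.
func buildMasks(patterns []string) nibbleMasks {
	var m nibbleMasks
	for i, p := range patterns {
		if p == "" {
			continue
		}
		bit := uint8(1) << (uint(i) % 8)
		m.lo[p[0]&0x0F] |= bit
		m.hi[p[0]>>4] |= bit
	}
	return m
}

// candidates returns the buckets whose fingerprint matches byte c. It is the
// scalar equivalent of two table lookups followed by an AND, which the SIMD
// version performs for 16 or 32 bytes at once.
func (m nibbleMasks) candidates(c byte) uint8 {
	return m.lo[c&0x0F] & m.hi[c>>4]
}

func main() {
	m := buildMasks([]string{"agggtaaa", "tttaccct"})
	for _, c := range []byte("qat") {
		fmt.Printf("%q -> %08b\n", c, m.candidates(c))
	}
}
```

A non-zero result only marks a candidate position. The full pattern still has to be verified afterwards, which is the find+verify loop discussed in this issue.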

Benchmark (6 MB DNA input, AMD EPYC, regex-bench CI)

| Engine | dna_4 time | Throughput |
|---|---|---|
| Rust regex (Teddy AVX2, inlined) | 3.7 ms | 1.6 GB/s |
| coregex (Teddy SSSE3, Go asm) | 32 ms | 187 MB/s |
| Go stdlib regexp | 349 ms | 17 MB/s |

All 9 regexdna patterns affected (v0.12.3, UseTeddy strategy).

Root Cause

Go/assembly function call boundary overhead:

  1. Each findSIMD() = Go→asm→Go round-trip (~50-65 cycles: function call + VZEROUPPER + register save/restore)
  2. 375K calls per 6MB scan (16 bytes/call with SSSE3)
  3. AVX2 disabled — was 4x slower than SSSE3 due to VZEROUPPER cost dominating
  4. Rust avoids this entirely: #[inline(always)] keeps the whole find+verify loop in one native code block
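The call count and total boundary cost in the list above can be reproduced with a few lines of arithmetic. This sketch assumes "6 MB" means 6,000,000 bytes and that every call consumes exactly one 16-byte chunk. The helper names are hypothetical.

```go
package main

import "fmt"

// callsPerScan returns how many kernel calls are needed when each call
// consumes bytesPerCall bytes of the haystack (rounded up).
func callsPerScan(haystackBytes, bytesPerCall int) int {
	return (haystackBytes + bytesPerCall - 1) / bytesPerCall
}

// boundaryCycles returns the cycles spent on call overhead alone.
func boundaryCycles(calls, cyclesPerCall int) int {
	return calls * cyclesPerCall
}

func main() {
	const haystack = 6_000_000 // assumption: decimal megabytes
	calls := callsPerScan(haystack, 16)
	fmt.Println("calls:", calls)
	fmt.Println("cycles (50/call):", boundaryCycles(calls, 50))
	fmt.Println("cycles (65/call):", boundaryCycles(calls, 65))
}
```

Under these assumptions the boundary alone costs roughly 19 to 24 million cycles per scan, before any matching work is done.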

Profiling (local, 6MB DNA)

FindAllIndicesStreaming: avg 25.1 ms
Raw Teddy.FindMatch:    avg 24.0 ms  (96%)
FindAll loop overhead:  avg  1.1 ms  ( 4%)

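About 96% of the time is spent inside Teddy.FindMatch, so essentially all of the gap is in the SIMD search path and its per-call overhead, not in the FindAll loop.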

Breakdown

| Factor | Estimated impact |
|---|---|
| SSSE3 vs AVX2 (16 vs 32 bytes/iter) | ~2x |
| Go/asm boundary per findSIMD call | ~3-4x |
| Go method dispatch + bounds checks | ~1.2x |
| Combined | ~8-10x (measured: 8.6x) |
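As a sanity check on the combined figure, the sketch below multiplies the individual estimates (treating them as independent, which is an assumption) and compares the result with the ratio of the two benchmark timings. The function name `combined` is hypothetical.

```go
package main

import "fmt"

// combined multiplies slowdown factors that are assumed to be independent.
func combined(factors ...float64) float64 {
	total := 1.0
	for _, f := range factors {
		total *= f
	}
	return total
}

func main() {
	low := combined(2, 3, 1.2)  // lower bound of the boundary estimate
	high := combined(2, 4, 1.2) // upper bound of the boundary estimate
	measured := 32.0 / 3.7      // coregex time / Rust time, in ms
	fmt.Printf("estimated: %.1fx to %.1fx\n", low, high)
	fmt.Printf("measured:  %.1fx\n", measured)
}
```

The product comes out at about 7.2x to 9.6x, so the measured ratio of about 8.6x falls inside the estimated range.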

Potential Solutions

A. simd/archsimd Intrinsics (Go 1.26, GOEXPERIMENT=simd)

Rewrite Teddy core using compiler intrinsics — eliminates Go/asm boundary entirely.
Needs POC to verify: Permute() = PSHUFB? ToMask() = PMOVMSKB? Performance?

B. Batch FindAll in Assembly

Single Go→asm call for entire haystack. Find+verify loop stays in asm.
Results written to pre-allocated buffer.
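One possible shape for the Go side of option B is sketched below. `findAllBatch` is a pure-Go stand-in for the assembly kernel, which does not exist yet. Its signature, the resume-offset convention, and the naive matching inside it are assumptions for illustration only.

```go
package main

import (
	"bytes"
	"fmt"
)

// findAllBatch stands in for a hypothetical assembly kernel. It scans
// haystack[from:] for any needle, writes match start offsets into out, and
// returns the number written plus the offset to resume from
// (len(haystack) once the scan is complete).
func findAllBatch(haystack []byte, needles [][]byte, from int, out []int) (n, resume int) {
	for pos := from; pos < len(haystack); pos++ {
		for _, nd := range needles {
			if bytes.HasPrefix(haystack[pos:], nd) {
				if n == len(out) {
					return n, pos // buffer full: caller drains it and resumes here
				}
				out[n] = pos
				n++
				break
			}
		}
	}
	return n, len(haystack)
}

// findAll drives the kernel with one call per buffer fill instead of one
// call per 16-byte chunk. bufSize must be at least 1.
func findAll(haystack []byte, needles [][]byte, bufSize int) []int {
	buf := make([]int, bufSize) // pre-allocated and reused across calls
	var all []int
	for from := 0; from < len(haystack); {
		n, resume := findAllBatch(haystack, needles, from, buf)
		all = append(all, buf[:n]...)
		from = resume
	}
	return all
}

func main() {
	needles := [][]byte{[]byte("gat"), []byte("aca")}
	fmt.Println(findAll([]byte("gattacagatta"), needles, 2))
}
```

With this convention the number of Go→asm transitions depends on the match count and buffer size rather than on the haystack length, so a sparse-match scan would cross the boundary only a handful of times.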

C. Monolithic ASM Teddy

Entire FindMatch loop in assembly. Maximum performance, highest maintenance.

Research

Full analysis: docs/dev/research/teddy-10x-gap-analysis.md

Labels: area: prefilter, priority: high, strategy: teddy, type: performance
