
perf(scan): memchr2 cross-chunk jump in NEON and AVX2 fast probe #33

Merged: membphis merged 2 commits into main from perf/neon-memchr2-cross-chunk on May 16, 2026

@membphis membphis commented May 16, 2026

Summary

After an in-string fast-probe miss (no " or \ in the current 64B chunk), both NEON and AVX2 scanners now call memchr::memchr2 to skip directly to the 64B-aligned chunk containing the next interesting byte, instead of advancing one chunk at a time.

A 4 KB remaining-buffer threshold gates the call, so payloads ≤4 KB never pay the memchr2 function-call overhead. Above the threshold the jump amortizes immediately. On large payloads only the final 4 KB forgoes the jump, which is invisible against MB-scale gains.
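As a rough illustration of the jump and its gate, the sketch below is a scalar model, not the PR's SIMD code. The function names, the `remaining > threshold` comparison, the choice to measure `remaining` from the current chunk start, and chunk alignment relative to the buffer start are all assumptions. `find_quote_or_backslash` is a std-only stand-in for `memchr::memchr2(b'"', b'\\', haystack)`.

```rust
/// Chunk width used by the scanner described above.
const CHUNK: usize = 64;
/// Remaining-buffer threshold from this PR (the exact comparison is assumed).
const JUMP_THRESHOLD: usize = 4096;

/// Std-only stand-in for `memchr::memchr2(b'"', b'\\', hay)`.
fn find_quote_or_backslash(hay: &[u8]) -> Option<usize> {
    hay.iter().position(|&b| b == b'"' || b == b'\\')
}

/// Hypothetical helper: called after the chunk at `pos` (a multiple of 64)
/// was probed and found to contain neither `"` nor `\`.
/// Returns the offset of the next chunk the main loop should process.
fn next_chunk_after_miss(buf: &[u8], pos: usize) -> usize {
    let next = pos + CHUNK;
    let remaining = buf.len().saturating_sub(pos);
    if remaining <= JUMP_THRESHOLD {
        // Short tail: keep the plain one-chunk advance, no search call.
        return next;
    }
    match find_quote_or_backslash(&buf[next..]) {
        // Round the hit down to the start of the chunk that contains it.
        Some(i) => (next + i) & !(CHUNK - 1),
        // Nothing interesting left: land on the last chunk boundary.
        None => buf.len() & !(CHUNK - 1),
    }
}
```

With an 8 KB buffer whose only quote sits at offset 5000, a miss at offset 0 jumps straight to offset 4992, the chunk containing the quote. A 2 KB buffer always advances by one chunk.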

Closes #26.

Measured impact (Apple M4, NEON, quickdecode.parse + access 3 fields)

3-run median on each branch (make bench):

| Size | main (ops/s) | PR (ops/s) | Δ |
|---|---|---|---|
| 2 KB | 722,126 | 705,119 | -2.4% (within noise; see below) |
| 60 KB | 209,381 | 333,556 | +59% |
| 100 KB | 137,363 | 232,019 | +69% |
| 200 KB | 79,872 | 156,250 | +96% |
| 500 KB | 32,468 | 70,175 | +116% |
| 1 MB | 16,322 | 33,708 | +107% |
| 2 MB | 8,078 | 17,021 | +111% |
| 5 MB | 3,234 | 6,882 | +113% |
| 10 MB | 1,614 | 3,376 | +109% |

End-to-end scanner throughput on large string-heavy payloads rises from ~17 GB/s to ~36 GB/s — the fast probe was ALU-bound rather than memory-bound, and memchr2 exposes a much tighter inner loop.
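The GB/s figures can be cross-checked against the benchmark rates with simple arithmetic. This sketch assumes the measured values are operations per second, that "1 MB" means 1,048,576 bytes, and that each operation scans the whole payload once. The function name is illustrative only.

```rust
/// Convert a benchmark rate into throughput in GB/s (10^9 bytes per second).
fn throughput_gb_per_s(payload_bytes: u64, ops_per_s: u64) -> f64 {
    (payload_bytes as f64) * (ops_per_s as f64) / 1e9
}
```

Under those assumptions the 1 MB row gives about 17.1 GB/s on main (16,322 ops/s) and about 35.3 GB/s on this PR (33,708 ops/s), in line with the rounded figures quoted above.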

Note on small payloads

Small JSON (≤4 KB) is flat, not improved. The 4 KB threshold guarantees memchr2 is never called on those payloads, so they match baseline performance to within bench noise. There's no meaningful win to be had at that size: scanner work is a small fraction of total parse time for 2 KB inputs (Lua FFI dispatch dominates), so any scanner-targeted optimization is invisible there. The threshold's job is to ensure no regression, which it achieves.

Why this works

In an in-string chunk with no " and no \, the in_string polarity cannot flip and no escape sequence can begin. Both in_string and bs_carry therefore hold the same value at the end of the skipped span as at its start, so jumping multiple chunks at once is sound. The jump lands on a 64B boundary, so the main SIMD loop's invariants are preserved.
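The soundness argument can be checked on a byte-at-a-time model of the scanner state. This is a simplified sketch, not the PR's SIMD implementation: `ScanState`, `step`, and `scan` are hypothetical names, and `bs_carry` is modelled as "the previous byte was an unescaped backslash". It assumes `bs_carry` is clear on entry to the skipped span, which holds when the probed chunk itself contained no backslash.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct ScanState {
    in_string: bool,
    /// True when the previous byte was an unescaped backslash.
    bs_carry: bool,
}

/// Advance the model by one byte.
fn step(mut s: ScanState, b: u8) -> ScanState {
    if s.bs_carry {
        // This byte is escaped, so it cannot open or close a string.
        s.bs_carry = false;
    } else if b == b'\\' {
        s.bs_carry = true;
    } else if b == b'"' {
        s.in_string = !s.in_string;
    }
    s
}

/// Advance the model across a span of bytes.
fn scan(s: ScanState, bytes: &[u8]) -> ScanState {
    bytes.iter().fold(s, |acc, &b| step(acc, b))
}
```

Starting from `in_string = true, bs_carry = false`, scanning any span without `"` or `\` returns the same state, so resuming at the first interesting byte gives the same result as walking every byte before it.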

Test plan

  • cargo test --release
  • cargo test --release --no-default-features (scalar control)
  • make lint (clippy -D warnings)
  • make bench 3-run medians match README
  • CI x86_64 AVX2 parity (tests/scanner_crosscheck.rs on x86 runner)

Commits

  1. 7844484 — initial NEON + AVX2 cross-chunk jump (256B threshold)
  2. 2fa76cb — bump threshold to 4 KB to eliminate small-payload regression

membphis added 2 commits May 16, 2026 23:03
After a fast-probe miss (no quote/backslash in current 64B chunk), both
NEON and AVX2 scanners now call memchr::memchr2 to skip ahead to the
64B-aligned chunk containing the next interesting byte rather than
advancing one chunk at a time. A 256-byte remaining-buffer threshold
gates the call so short payloads never pay the memchr2 function-call
overhead; above that threshold the jump amortizes immediately.

Measured on Apple M4 (NEON), "parse + access 3 fields" workload:
- 2 KB  (small_api.json): 648,761 ops/s  — regression eliminated, flat vs. pre-jump baseline
- 100 KB: 245,700 ops/s — 17.2x over cjson (+125% vs. pre-jump 108,932)
- 1 MB:   34,884 ops/s  — 23.7x over cjson (+193% vs. pre-jump 11,905)
- 10 MB:   3,406 ops/s  — 22.7x over cjson (+180% vs. pre-jump 1,218)

AVX2 receives the identical change; compile-verified on aarch64;
x86_64 parity is covered by CI.
… regression

The 256-byte threshold still fired memchr2 across most of a 2 KB document
(only the last few chunks were exempt), and the memchr2 call overhead per
fast-probe miss outweighed the scanner work it replaced — net result was
a ~10% regression on small_api.json under 'make bench' methodology where
cjson runs first and leaves a polluted heap.

Bumping the threshold to 4 KB means memchr2 is never called on payloads
≤4 KB total, restoring baseline parity. On larger payloads only the final
4 KB forgoes the jump, which is invisible against MB-scale gains.
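To make the threshold effect concrete, the sketch below counts how many 64B chunks of a document could trigger the jump. It assumes the gate is `remaining > threshold` with `remaining` measured from the chunk start; the real comparison may differ, and `eligible_chunks` is an illustrative name.

```rust
const CHUNK: usize = 64;

/// Count the chunks from which a jump could fire under the assumed gate.
fn eligible_chunks(len: usize, threshold: usize) -> usize {
    (0..len)
        .step_by(CHUNK)
        .filter(|&pos| len - pos > threshold)
        .count()
}
```

For a 2,048-byte document, a 256-byte threshold leaves 28 of 32 chunks eligible, while a 4,096-byte threshold leaves none.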

3-run median 'qd.parse' on Apple M4 vs main:

  2 KB    -2%  (flat, within noise)
  60 KB   +60%
  100 KB  +69%
  1 MB    +107%
  10 MB   +109%

README ARM64 numbers updated to reflect the post-threshold reality.
@membphis membphis merged commit 7c3b3fd into main May 16, 2026
1 check passed
@membphis membphis deleted the perf/neon-memchr2-cross-chunk branch May 16, 2026 15:25
Development

Successfully merging this pull request may close these issues.

perf(scan): memchr2 cross-chunk jump for very long string interiors
