perf(scan): memchr2 cross-chunk jump in NEON and AVX2 fast probe by membphis · Pull Request #33 · api7/lua-qjson

membphis · 2026-05-16T15:05:04Z

Summary

After an in-string fast-probe miss (no " or \ in the current 64B chunk), both NEON and AVX2 scanners now call memchr::memchr2 to skip directly to the 64B-aligned chunk containing the next interesting byte, instead of advancing one chunk at a time.

A 4 KB remaining-buffer threshold gates the call so payloads ≤4 KB never pay the libc function-call overhead. Above the threshold the jump amortizes immediately. On large payloads only the final 4 KB forgoes the jump, which is invisible against MB-scale gains.
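The gate and the landing position can be sketched as plain offset arithmetic. This is a minimal illustration, not the code in this PR: the names `next_chunk_after_miss`, `find_quote_or_backslash`, `CHUNK`, and `JUMP_THRESHOLD` are hypothetical, "remaining" is assumed to be measured from the start of the chunk that missed, and a scalar search stands in for `memchr::memchr2` so the block builds with the standard library alone.

```rust
/// Chunk width used by the SIMD scanners, per the PR description.
const CHUNK: usize = 64;
/// Remaining-buffer threshold; at or below it the jump is skipped.
const JUMP_THRESHOLD: usize = 4 * 1024;

/// Scalar stand-in for `memchr::memchr2(b'"', b'\\', haystack)`.
fn find_quote_or_backslash(haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == b'"' || b == b'\\')
}

/// `pos` is the offset of a 64B-aligned in-string chunk that contained
/// neither `"` nor `\`. Returns the offset of the next chunk to examine.
fn next_chunk_after_miss(buf: &[u8], pos: usize) -> usize {
    debug_assert!(pos % CHUNK == 0);
    let next = pos + CHUNK;
    // Assumption: "remaining" counts from the chunk that just missed.
    let remaining = buf.len() - pos;
    if remaining <= JUMP_THRESHOLD {
        // Short tail: advance one chunk, no search call.
        return next;
    }
    match find_quote_or_backslash(&buf[next..]) {
        // Round down so the landing offset stays 64B-aligned.
        Some(rel) => (next + rel) / CHUNK * CHUNK,
        // Nothing interesting left: land on the last aligned boundary.
        None => buf.len() / CHUNK * CHUNK,
    }
}
```

Under these assumptions, an 8 KB buffer whose first interesting byte sits at offset 5000 jumps from chunk 0 straight to offset 4992, while a 4 KB buffer always advances by one chunk.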

Closes #26.

Measured impact (Apple M4, NEON, `quickdecode.parse + access 3 fields`)

3-run median on each branch (make bench):

| Size | main (ops/s) | PR (ops/s) | Δ |
|---|---|---|---|
| 2 KB | 722,126 | 705,119 | -2.4% (within noise, see below) |
| 60 KB | 209,381 | 333,556 | +59% |
| 100 KB | 137,363 | 232,019 | +69% |
| 200 KB | 79,872 | 156,250 | +96% |
| 500 KB | 32,468 | 70,175 | +116% |
| 1 MB | 16,322 | 33,708 | +107% |
| 2 MB | 8,078 | 17,021 | +111% |
| 5 MB | 3,234 | 6,882 | +113% |
| 10 MB | 1,614 | 3,376 | +109% |

End-to-end scanner throughput on large string-heavy payloads rises from ~17 GB/s to ~36 GB/s — the fast probe was ALU-bound rather than memory-bound, and memchr2 exposes a much tighter inner loop.
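As a rough consistency check, the GB/s figures can be related to the ops/s figures by multiplying by payload size. The helper below is a hypothetical sketch; it assumes each operation scans the whole payload once and that the size labels are binary (10 MB taken as 10 × 1024 × 1024 bytes), neither of which the PR states.

```rust
/// Bytes in one MiB. Assumption: the benchmark size labels are binary.
const MIB: f64 = 1024.0 * 1024.0;

/// Convert an ops/s figure at a given payload size into decimal GB/s,
/// assuming every operation touches the full payload exactly once.
fn throughput_gb_per_s(ops_per_s: f64, payload_bytes: f64) -> f64 {
    ops_per_s * payload_bytes / 1e9
}
```

With the 10 MB row, `throughput_gb_per_s(1614.0, 10.0 * MIB)` gives about 16.9 and `throughput_gb_per_s(3376.0, 10.0 * MIB)` gives about 35.4, in line with the quoted ~17 GB/s and ~36 GB/s.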

Note on small payloads

Small JSON (≤4 KB) is flat, not improved. The 4 KB threshold guarantees memchr2 is never called on those payloads, so they match baseline performance to within bench noise. There's no meaningful win to be had at that size: scanner work is a small fraction of total parse time for 2 KB inputs (Lua FFI dispatch dominates), so any scanner-targeted optimization is invisible there. The threshold's job is to ensure no regression, which it achieves.

Why this works

In an in-string chunk with no " and no \, the in_string polarity cannot flip and no escape sequence can begin. The skipped span is therefore invariant for both in_string and bs_carry, so jumping multiple chunks at once is sound. The jump lands on a 64B boundary so the main SIMD loop invariants are preserved.
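The invariance argument can be checked with a scalar model: a byte-at-a-time reference scanner and a chunked scanner that jumps must report the same unescaped quote positions and end in the same state. This is a sketch only. `State`, `scan_bytes`, `scan_reference`, and `scan_with_jump` are hypothetical names, `bs_carry` is modeled as "the previous byte was an unescaped backslash", and a scalar search again stands in for `memchr::memchr2`.

```rust
const CHUNK: usize = 64;

#[derive(Clone, Copy, Debug, Default, PartialEq)]
struct State {
    in_string: bool,
    bs_carry: bool,
}

/// Byte-at-a-time update; records offsets of unescaped quotes.
fn scan_bytes(bytes: &[u8], base: usize, st: &mut State, quotes: &mut Vec<usize>) {
    for (i, &b) in bytes.iter().enumerate() {
        if st.bs_carry {
            st.bs_carry = false; // this byte is escaped
        } else if b == b'\\' {
            st.bs_carry = true;
        } else if b == b'"' {
            st.in_string = !st.in_string;
            quotes.push(base + i);
        }
    }
}

/// Reference: no chunking, no skipping.
fn scan_reference(buf: &[u8]) -> (State, Vec<usize>) {
    let (mut st, mut quotes) = (State::default(), Vec::new());
    scan_bytes(buf, 0, &mut st, &mut quotes);
    (st, quotes)
}

/// Chunked scan that jumps after an in-string miss.
fn scan_with_jump(buf: &[u8]) -> (State, Vec<usize>) {
    let (mut st, mut quotes) = (State::default(), Vec::new());
    let interesting = |b: &u8| *b == b'"' || *b == b'\\';
    let mut pos = 0;
    while pos < buf.len() {
        let end = (pos + CHUNK).min(buf.len());
        let chunk = &buf[pos..end];
        if st.in_string && !st.bs_carry && !chunk.iter().any(interesting) {
            // Neither flag can change until the next `"` or `\`,
            // so skip to the aligned chunk that contains it.
            pos = match buf[end..].iter().position(interesting) {
                Some(rel) => (end + rel) / CHUNK * CHUNK,
                None => buf.len(),
            };
            continue;
        }
        scan_bytes(chunk, pos, &mut st, &mut quotes);
        pos = end;
    }
    (st, quotes)
}
```

For any input, `scan_with_jump(buf)` should equal `scan_reference(buf)`; an input with long quote-free runs inside a string and one escaped quote exercises both the jump and the carry path.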

Test plan

cargo test --release
cargo test --release --no-default-features (scalar control)
make lint (clippy -D warnings)
make bench 3-run medians match README
CI x86_64 AVX2 parity (tests/scanner_crosscheck.rs on x86 runner)

Commits

7844484 — initial NEON + AVX2 cross-chunk jump (256B threshold)
2fa76cb — bump threshold to 4 KB to eliminate small-payload regression

After a fast-probe miss (no quote/backslash in current 64B chunk), both NEON and AVX2 scanners now call memchr::memchr2 to skip ahead to the 64B-aligned chunk containing the next interesting byte rather than advancing one chunk at a time. A 256-byte remaining-buffer threshold gates the call so short payloads never pay the libc function-call overhead; above that threshold the jump amortizes immediately.

Measured on Apple M4 (NEON), "parse + access 3 fields" workload:

- 2 KB (small_api.json): 648,761 ops/s, regression eliminated, flat vs. pre-jump baseline
- 100 KB: 245,700 ops/s, 17.2x over cjson (+125% vs. pre-jump 108,932)
- 1 MB: 34,884 ops/s, 23.7x over cjson (+193% vs. pre-jump 11,905)
- 10 MB: 3,406 ops/s, 22.7x over cjson (+180% vs. pre-jump 1,218)

AVX2 receives the identical change; compile-verified on aarch64; x86_64 parity is covered by CI.

… regression

The 256-byte threshold still fired memchr2 across most of a 2 KB document (only the last few chunks were exempt), and the libc call overhead per fast-probe miss outweighed the scanner work it replaced. The net result was a ~10% regression on small_api.json under 'make bench' methodology, where cjson runs first and leaves a polluted heap.

Bumping the threshold to 4 KB means memchr2 is never called on payloads ≤4 KB total, restoring baseline parity. On larger payloads only the final 4 KB forgoes the jump, which is invisible against MB-scale gains.

3-run median 'qd.parse' on Apple M4 vs main:

- 2 KB: -2% (flat, within noise)
- 60 KB: +60%
- 100 KB: +69%
- 1 MB: +107%
- 10 MB: +109%

README ARM64 numbers updated to reflect the post-threshold reality.
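The effect of the two thresholds on a small document can be worked out by counting which 64B chunk positions are allowed to jump. The function below is a hypothetical helper for that arithmetic; it assumes "remaining" is the distance from the chunk start to the end of the buffer and that the jump fires only when remaining exceeds the threshold.

```rust
const CHUNK: usize = 64;

/// Count the 64B chunk starts in a `len`-byte buffer at which the
/// remaining bytes exceed `threshold`, i.e. where a jump may fire.
fn jump_eligible_chunks(len: usize, threshold: usize) -> usize {
    (0..len)
        .step_by(CHUNK)
        .filter(|&pos| len - pos > threshold)
        .count()
}
```

For a 2048-byte input, `jump_eligible_chunks(2048, 256)` is 28 of 32 chunks, so only the last four are exempt, while `jump_eligible_chunks(2048, 4096)` is 0. For a 1 MiB input with the 4 KB threshold, 16,320 of 16,384 chunks remain eligible.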

membphis added 2 commits May 16, 2026 23:03

membphis merged commit 7c3b3fd into main May 16, 2026
1 check passed

membphis deleted the perf/neon-memchr2-cross-chunk branch May 16, 2026 15:25
