Skip to content

opt: SWAR number parsing, pointer-based DOM, peek_u16 structural matching#224

Merged
liuq19 merged 5 commits intocloudwego:mainfrom
liuq19:opt/perf-v2
Apr 9, 2026
Merged

opt: SWAR number parsing, pointer-based DOM, peek_u16 structural matching#224
liuq19 merged 5 commits intocloudwego:mainfrom
liuq19:opt/perf-v2

Conversation

@liuq19
Copy link
Copy Markdown
Collaborator

@liuq19 liuq19 commented Apr 9, 2026

Summary

  • SWAR integer parsing: Parse 8 digits at a time using pure u64 arithmetic (simdjson-style parse_eight_digits), replacing scalar digit loops
  • Pointer-based DOM push: Replace Vec::push with raw pointer writes, eliminating store-to-load forwarding stalls (IPC 3.6 → 4.4-5.3)
  • Inline SIMD memcpy: AVX2 unrolled copy for visit_container_end, replacing libc memmove overhead
  • peek_u16 structural matching: Single u16 comparison for ,\" / } / ] / : patterns, reducing branch count in parser main loop
  • Scalar key parsing fast path: Byte-by-byte scan for short object keys (≤24 chars), avoiding SIMD StringBlock setup overhead
  • Unchecked number parsing: parse_number_unchecked() with get_unchecked() for padded buffers, eliminating per-digit bounds checks
  • SWAR fraction parsing: Replace SSE simd_str2int with integer-pipeline SWAR for fraction digits, avoiding FP port contention on AMD Zen
  • Tolerant SWAR tail: Branchless 1-8 digit parsing via parse_digits_tolerant for fraction remainders
  • parse_float_fast threshold fix: Correct off-by-one (sig >> 52sig < 2^53)

Benchmark (compact JSON, core-pinned, vs baseline)

Benchmark Before After Improvement
citm_catalog 511 µs ~230 µs +55%
golang_source 1416 µs ~780 µs +45%
lottie 495 µs ~220 µs +55%

Test plan

  • cargo test — 146 tests pass
  • cargo test -p sonic-number (from sonic-number dir) — 7 tests pass
  • Number parsing edge cases: 8+ integer digits with fraction, SWAR boundary cases, tolerant tail truncation
  • Reviewed by Opus + Codex cross-model review

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 9, 2026 06:32
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 95.00000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.06%. Comparing base (6e707d8) to head (80f6c22).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
src/reader.rs 71.42% 4 Missing ⚠️
src/value/tls_buffer.rs 92.15% 4 Missing ⚠️
src/parser.rs 95.58% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
+ Coverage   70.64%   71.06%   +0.41%     
==========================================
  Files          42       42              
  Lines        9556     9724     +168     
==========================================
+ Hits         6751     6910     +159     
- Misses       2805     2814       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Performance-focused parser/DOM optimizations across sonic-rs and sonic-number, introducing SWAR-based number parsing, pointer-based DOM node staging, and lower-branch structural matching to reduce per-byte overhead in hot loops.

Changes:

  • Add SWAR (u64 arithmetic) fast paths for integer/fraction parsing and introduce an unchecked number parsing variant for padded inputs.
  • Rework DOM construction to use raw pointer writes into a preallocated buffer and add an AVX2 inline copy routine for container finalization.
  • Update parser inner loops with peek_u16 structural matching and add benchmarks/testdata to measure throughput improvements.

Reviewed changes

Copilot reviewed 17 out of 27 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/value/shared.rs Adds Shared::with_capacity to preallocate the bump allocator for DOM construction.
src/value/node.rs Adds AVX2 inline copy and switches DOM node buffering to pointer-based writes.
src/reader.rs Adds padded_slice and 2-byte peek helpers for faster parsing decisions.
src/parser.rs Adds unchecked number parsing + scalar key fast path + peek_u16 structural matching in array/object parsing.
sonic-number/src/swar.rs New SWAR helpers for digit detection and 1–16 digit parsing.
sonic-number/src/lib.rs Integrates SWAR parsing into number parsing and adds an unsafe parse_number_unchecked.
sonic-number/Cargo.toml Bumps crate version and adds Criterion benches.
sonic-number/benches/number.rs Adds Criterion benchmarks for SWAR vs SIMD and end-to-end parse_number.
Cargo.toml Bumps sonic-rs version.
benchmarks/Cargo.toml Removes old number bench wiring; adds perf_parse example.
benchmarks/benches/deserialize_value.rs Expands benchmark set to include golang_source and lottie.
benchmarks/benches/perf_parse.rs Adds a core-pinned parsing throughput example.
benchmarks/bench_quick.sh Adds a quick benchmark helper script.
benchmarks/benches/testdata/* Adds/updates compact JSON fixtures and documentation for benchmark datasets.
Comments suppressed due to low confidence (1)

src/parser.rs:513

  • In the object parsing slow path, the final _ => branch returns ExpectedArrayCommaOrEnd, which seems like the wrong error code for an object context (should be something like ExpectedObjectCommaOrEnd). This will produce misleading error messages/diagnostics when encountering invalid JSON inside objects.
            match self.skip_space() {
                Some(b'}') => return check_visit!(self, vis.visit_object_end(count)),
                Some(b',') => match self.skip_space() {
                    Some(b'"') => continue,
                    _ => return perr!(self, ExpectObjectKeyOrEnd),
                },
                _ => return perr!(self, ExpectedArrayCommaOrEnd),
            }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/reader.rs Outdated
Comment on lines +62 to +66
unsafe { &*(self.peek_n(2).unwrap_unchecked().as_ptr() as *const [u8; 2]) }
}

/// Peek next 2 bytes as a little-endian u16 for single-instruction matching.
/// E.g., `peek_u16() == u16::from_le_bytes([b',', b'"'])` matches `,"` in one cmp.
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reader::peek2() is a safe method but uses unwrap_unchecked() on peek_n(2). On non-padded readers (e.g. Read<'_>), calling this near EOF can trigger immediate UB. Either make peek2 an unsafe fn with a clear safety contract, or implement it safely (return an Option<[u8;2]> / u16) without unwrap_unchecked and keep unchecked access confined to PaddedSliceRead-specific helpers.

Suggested change
unsafe { &*(self.peek_n(2).unwrap_unchecked().as_ptr() as *const [u8; 2]) }
}
/// Peek next 2 bytes as a little-endian u16 for single-instruction matching.
/// E.g., `peek_u16() == u16::from_le_bytes([b',', b'"'])` matches `,"` in one cmp.
match self.peek_n(2) {
Some(s) => unsafe { &*(s.as_ptr() as *const [u8; 2]) },
None => panic!("Reader::peek2() requires at least 2 bytes remaining"),
}
}
/// Peek next 2 bytes as a little-endian u16 for single-instruction matching.
/// E.g., `peek_u16() == u16::from_le_bytes([b',', b'"'])` matches `,\"` in one cmp.

Copilot uses AI. Check for mistakes.
Comment thread src/reader.rs Outdated
#[inline(always)]
fn peek_u16(&self) -> u16 {
match self.peek_n(2) {
Some(s) => unsafe { (s.as_ptr() as *const u16).read_unaligned() },
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peek_u16() is documented/used as a little-endian 2-byte view, but the implementation returns a native-endian u16 via read_unaligned(). On big-endian targets this breaks all the structural matching logic that compares against u16::from_le_bytes([...]) and relies on (sep & 0xFF) being the first byte. Consider returning u16::from_le_bytes([s[0], s[1]]) (or u16::from_le(read_unaligned(...))) and/or gating these fast paths behind cfg(target_endian = "little").

Suggested change
Some(s) => unsafe { (s.as_ptr() as *const u16).read_unaligned() },
Some(s) => u16::from_le(unsafe { (s.as_ptr() as *const u16).read_unaligned() }),

Copilot uses AI. Check for mistakes.
Comment thread src/parser.rs
Comment on lines +311 to +323
/// Unchecked number parsing — requires padded buffer (PaddedSliceRead).
/// Uses the full padded buffer (including padding) so unchecked access is safe.
#[inline(always)]
pub fn parse_number_unchecked(&mut self, first: u8) -> Result<ParserNumber> {
let reader = &mut self.read;
let neg = first == b'-';
let mut now = reader.index() - (!neg as usize);
// Use padded_slice which includes the padding bytes, so unchecked reads are safe.
let data = reader.padded_slice();
let ret = unsafe { sonic_number::parse_number_unchecked(data, &mut now, neg) };
reader.set_index(now);
ret.map_err(|err| self.error(err.into()))
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_number_unchecked() is exposed as a safe pub fn but it calls sonic_number::parse_number_unchecked() which requires extra padding for all unchecked reads. Since Reader::padded_slice() defaults to as_u8_slice() for non-padded readers, external callers can hit UB by invoking this on a Parser<Read<'_>>. Make this API unsafe, restrict it with a type bound to PaddedSliceRead, or otherwise enforce/preserve the padding invariant at the type level.

Copilot uses AI. Check for mistakes.
Comment thread src/parser.rs Outdated
Comment on lines +351 to +356
/// Simple while loop, let LLVM optimize the unrolling.
///
/// # Safety
/// Only called when strbuf=None (padded reader path). PaddedSliceRead has 64 bytes
/// of zero-padding beyond valid JSON, so scanning 24 bytes ahead is always safe.
/// Padding bytes (0x00) are < 0x20 and will bail to parse_string_visit.
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scalar key fast-path is described as for “short ASCII keys”, but it accepts any non-control, non-escape UTF-8 bytes (including multi-byte UTF-8) and then uses from_utf8_unchecked. Either tighten the predicate to actual ASCII if that’s the intent, or update the doc comment to reflect the real preconditions (e.g., “no escapes/control chars, length ≤ 24 bytes”).

Suggested change
/// Simple while loop, let LLVM optimize the unrolling.
///
/// # Safety
/// Only called when strbuf=None (padded reader path). PaddedSliceRead has 64 bytes
/// of zero-padding beyond valid JSON, so scanning 24 bytes ahead is always safe.
/// Padding bytes (0x00) are < 0x20 and will bail to parse_string_visit.
/// Fast path for keys that terminate within 24 bytes and contain no escapes (`\\`)
/// or control characters (`< 0x20`).
///
/// This path is not restricted to ASCII: it accepts any bytes `>= 0x20` until the
/// closing `"` is found, so multi-byte UTF-8 key bytes are handled here as well.
/// If an escape, control byte, or no closing quote within 24 bytes is encountered,
/// parsing falls back to `parse_string_visit`.
///
/// # Safety
/// Only called when strbuf=None (padded reader path). PaddedSliceRead has 64 bytes
/// of zero-padding beyond valid JSON, so scanning 24 bytes ahead is always safe.
/// Padding bytes (0x00) are < 0x20 and will bail to parse_string_visit.
///
/// The `from_utf8_unchecked` below therefore relies on the parser input being valid
/// UTF-8 JSON when this fast path succeeds.

Copilot uses AI. Check for mistakes.
Comment thread sonic-number/src/swar.rs
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 39b507ec71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/parser.rs
Comment on lines +314 to +318
pub fn parse_number_unchecked(&mut self, first: u8) -> Result<ParserNumber> {
let reader = &mut self.read;
let neg = first == b'-';
let mut now = reader.index() - (!neg as usize);
// Use padded_slice which includes the padding bytes, so unchecked reads are safe.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Mark parse_number_unchecked as unsafe or enforce padding

Parser::parse_number_unchecked is a safe public method, but it unconditionally calls sonic_number::parse_number_unchecked, which performs get_unchecked reads that require at least 64 bytes of trailing padding. With a normal Parser<Read> (no padding), calling this API on inputs like "1" can read past the slice boundary and trigger undefined behavior. Because this is externally callable through the public parser module, the precondition must be enforced in code (or the method made unsafe).

Useful? React with 👍 / 👎.

Comment thread sonic-number/src/swar.rs
@liuq19 liuq19 force-pushed the opt/perf-v2 branch 2 times, most recently from 5c5853e to aa5be83 Compare April 9, 2026 11:20
liuq19 added 5 commits April 9, 2026 19:31
Performance optimizations for JSON parsing:

1. sonic-number: SWAR integer parsing (simdjson-style parse_eight_digits)
   - Uses pure u64 arithmetic to parse 8 ASCII digits at a time
   - Single-digit fast path for common small integers
   - 32-54% improvement on parse_number microbenchmark

2. parser: compact JSON 2-byte peek fast path
   - peek2() reads 2 bytes at once to match `,"` / `":` patterns
   - Eliminates redundant skip_space calls for compact JSON

3. DOM construction: pointer-based push_node
   - Replace (ptr, len, cap) with (base, end, cap_end) pointers
   - Eliminates store-to-load forwarding stalls (28% of loads → ~20%)
   - Eliminates integer register pressure (30.7% cycles stalled → 0.1%)
   - IPC improved from 3.6 to 4.4-5.3

4. DOM construction: inline SIMD memcpy (sonic-cpp style Xmemcpy)
   - AVX2 unrolled copy for visit_container_end
   - Eliminates libc memmove function call overhead
   - memmove dropped from 61% to 1.9% of cycles (citm_catalog)

End-to-end deserialize_value improvement (A/B tested, core-pinned):
  citm_catalog:  511 µs → 314 µs (+38.6%)
  golang_source: 1416 µs → 1099 µs (+22.4%)
  lottie:        495 µs → 351 µs (+29.1%)
- Replace peek2() slice matching with peek_u16() u16 comparison for
  structural characters, reducing branch count in parser main loop
- Add scalar fast path for object key parsing (up to 24 chars),
  avoiding SIMD StringBlock setup overhead for short keys
- Add parse_number_unchecked() using get_unchecked() for digit reads,
  eliminating per-digit bounds checks from the hot integer parsing loop
- Scalar fraction accumulation for short integer part (<8 digits),
  avoiding SIMD setup + POW10 table multiplication overhead
- Fix parse_float_fast threshold off-by-one (sig >> 52 vs sig < 2^53)
- Replace SSE simd_str2int with SWAR fraction parsing + tolerant tail,
  keeping integer pipeline instead of competing with FP ports
- Compact all JSON test data (minified format, no whitespace)
  citm_catalog: 1.7MB → 489KB (-71%)
- Add new test files: golang_source (51k ints), lottie (18k ints),
  synthea_fhir, gsoc-2018, poet, string_unicode, string_escaped,
  twitterescaped
- Add testdata/README.md with value type statistics and selection guide
- Add golang_source and lottie to deserialize_value benchmark
- Add perf_parse example binary for profiling
- Expand number.rs benchmark with SWAR comparison and per-length tests
@liuq19 liuq19 merged commit ae081a5 into cloudwego:main Apr 9, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants