perf: Lua encode + cache + scanner + validation stack optimizations#54
perf: Lua encode + cache + scanner + validation stack optimizations#54membphis wants to merge 5 commits into
Conversation
- Replace per-byte encode_string with bulk-copy segments and fast escape-free path (zero table allocations for clean strings) - Replace recursive is_dirty with O(1) _dirty flag propagated via parent chain on __newindex - Eliminate is_array pre-scan by tagging materialized tables with __qjson_type marker during materialization - Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1) cache-hit access instead of O(n) vector clone - Replace 7-cmpeq loop in AVX2 structural_mask_chunk with PSHUFB-based nibble LUT classification - Replace heap-allocated validation state stack with fixed-size [CtxKind; 64] array for typical JSON depths - Replace byte-by-byte float-detection in parse_i64 with memchr::memchr3 - Add modified-encode benchmark scenarios (modify top / add field / modify nested) to bench harness - Add correctness tests for modified-encode round-trip
📝 WalkthroughWalkthroughRefactors lazy JSON mutation tracking to explicit ChangesIntegrated Performance Optimization Suite
Sequence Diagram(s)No sequence diagram generated for this PR. Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Performance Comparison (aarch64, Apple M-series, NEON)modify top + encode
add field + encode
modify nested + encode
unmodified encode (memcpy fast path)
decode + access content
|
There was a problem hiding this comment.
Pull request overview
Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).
Changes:
- Lua: faster string encoding, O(1) mutation tracking via
_dirtypropagation, and table type tagging to avoid repeated array/object scans during encode. - Rust:
SkipCacheavoids per-hitVeccloning viaRc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, andparse_i64usesmemchr3.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/lua/lazy_table_spec.lua | Adds encode/mutation test coverage for new fast paths and dirty propagation. |
| lua/qjson/table.lua | Implements Lua-layer encode + mutation tracking optimizations (string escaper, _dirty, type tagging). |
| src/skip_cache.rs | Switches skip-cache child index storage to Rc<[u32]> for O(1) cache-hit cloning. |
| src/cursor.rs | Uses Rc::clone on cache hits and stores populated vectors into Rc<[u32]>. |
| src/scan/avx2.rs | Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals. |
| src/validate/mod.rs | Reworks eager validation stack to avoid heap allocation for typical depths. |
| src/decode/number.rs | Uses memchr::memchr3 for faster integer/float discrimination. |
| benches/lua_bench.lua | Adds modified-encode benchmark scenarios (modify top/add field/modify nested). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/skip_cache.rs (1)
25-27: 💤 Low valueConsider a single shared empty slice constant.
Both
new()andget_or_insert()create new emptyRc<[u32]>instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.♻️ Optional refactor using lazy_static or thread_local
+use std::cell::RefCell; + +thread_local! { + static EMPTY_SLICE: Rc<[u32]> = Rc::from([]); +} + impl SkipCache { pub(crate) fn new() -> Self { - let empty: Rc<[u32]> = Rc::from([]); Self { - slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }], + slots: vec![SkipSlot { + child_starts: EMPTY_SLICE.with(Rc::clone), + child_ends: EMPTY_SLICE.with(Rc::clone) + }], by_opener: FxHashMap::default(), } } pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) { if let Some(&slot) = self.by_opener.get(&opener_idx) { return (slot, true); } let new = self.slots.len() as u32; - let empty: Rc<[u32]> = Rc::from([]); - self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }); + self.slots.push(SkipSlot { + child_starts: EMPTY_SLICE.with(Rc::clone), + child_ends: EMPTY_SLICE.with(Rc::clone) + }); self.by_opener.insert(opener_idx, new); (new, false) }Also applies to: 39-40
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty Rc<[u32]> and use it from both new() and get_or_insert() instead of calling Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot construction and any place in get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts: Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or equivalent) to eliminate duplicate empty allocations while keeping types Rc<[u32]> consistent.tests/lua/lazy_table_spec.lua (1)
394-442: ⚡ Quick winAdd a regression test for
__qjson_typepayload-key collision.Please add a case where decoded data includes a user key named
__qjson_typeand verify encode still preserves object semantics after mutation.Example test to add
+ it("does not let user __qjson_type key alter table dispatch", function() + local cjson = require("cjson") + local t = qjson.decode('{"__qjson_type":"array","a":1}') + t.a = 2 + local out = qjson.encode(t) + local parsed = cjson.decode(out) + assert.are.equal("array", parsed.__qjson_type) + assert.are.equal(2, parsed.a) + end)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied "__qjson_type" key via qjson.decode, mutates that field (and optionally other fields or nested entries), re-encodes with qjson.encode, then cjson.decode the output and assert the resulting table preserves object semantics and the mutated "__qjson_type" value; locate the test near the existing it("...") cases and follow their pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the "__qjson_type" field and other fields remain correct after encode/decode.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.
In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.
---
Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.
In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0
📒 Files selected for processing (8)
benches/lua_bench.lualua/qjson/table.luasrc/cursor.rssrc/decode/number.rssrc/scan/avx2.rssrc/skip_cache.rssrc/validate/mod.rstests/lua/lazy_table_spec.lua
- Replace __qjson_type string key with local weak side-table (TABLE_TYPE_HINT) to prevent collision with user payload keys - Fix github_table_modify_add to mutate first element instead of array root so the added field appears in encoded output
| // After the walk, the stack must hold exactly one frame: the root | ||
| // context, which must be `TopDone` (root value consumed). | ||
| if stack.len() != 1 || stack[0] != CtxKind::TopDone { | ||
| if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone { |
There was a problem hiding this comment.
Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.
| macro_rules! push { | ||
| ($kind:expr) => { | ||
| if sp < STACK_CAP { | ||
| stack_buf[sp] = $kind; | ||
| sp += 1; | ||
| } else { | ||
| let fb = fallback.get_or_insert_with(|| { | ||
| let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16); | ||
| v.extend_from_slice(&stack_buf[..sp]); | ||
| v | ||
| }); | ||
| sp = STACK_CAP.wrapping_add(fb.len() + 1); | ||
| fb.push($kind); | ||
| } | ||
| }; | ||
| } | ||
| macro_rules! pop { | ||
| () => {{ | ||
| if sp <= STACK_CAP { | ||
| if sp == 0 { None } | ||
| else { sp -= 1; Some(stack_buf[sp]) } | ||
| } else { | ||
| let fb = fallback.as_mut().unwrap(); | ||
| let val = fb.pop(); | ||
| if fb.is_empty() { sp = STACK_CAP; } | ||
| val | ||
| } |
There was a problem hiding this comment.
The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.
| -- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON | ||
| -- keys with these names would collide (minor, deferred). Centralized here so | ||
| -- the dirty check and __newindex can share the list. | ||
| local INTERNAL_KEYS = { | ||
| _doc = true, _cur_box = true, _cur = true, _bs = true, _be = true, | ||
| _parent = true, _dirty = true, | ||
| } |
There was a problem hiding this comment.
Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.
| for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do | ||
| rawset(t, f, nil) | ||
| end | ||
| setmetatable(t, nil) | ||
| TABLE_TYPE_HINT[t] = "object" |
There was a problem hiding this comment.
Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.
| for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do | ||
| rawset(t, f, nil) | ||
| end |
There was a problem hiding this comment.
Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.
Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.
Performance Comparison: main vs PR (aarch64, Apple M-series)Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code. Key metrics (median ops/s)
Existing paths (parse, access, encode unmod)Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches. Modified-encode scenarios (small/medium multimodal payloads)
|
Scanner (src/scan/avx2.rs):
- Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier
back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant
measured -45% parse on small payloads (-7% on 1m where the
in-string fast-probe dominates anyway): VPSHUFB ymm is split into
two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq->
xor->movemask chain has a longer critical path, and VPMOVMSKB
(lat 4) plus the LUT-load constants pressure the FP ports. The
parallel cmpeq design lets independent compare chains dispatch
across multiple ports. NEON path is unaffected (separate file).
Lua layer (lua/qjson/table.lua) -- review feedback:
- Update INTERNAL_KEYS comment to reference its actual consumers
(__newindex cache snapshotting and encode_lazy_object_walking),
not the removed recursive is_dirty walk.
- Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex
and LazyArray.__newindex with seven inline rawset(t, ..., nil)
calls. Eliminates the per-call 7-elem table-literal allocation
on the materialization path. Uses rawset (not multi-assignment
to nil) because root views from _M.decode lack _parent, so
't._parent = nil' could fire __newindex recursively.
Bench (benches/lua_bench.lua):
- Force LuaJIT to evaluate qjson.encode results: replace
'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t);
if #_enc < 2 then error(...) end' in all 8 mutation/encode cases.
Without this, LuaJIT could partially DCE the encode call on some
trace shapes and produce misleading speedups.
- Raise the warmup floor from max(3, iters/5) to max(50, iters/5).
LuaJIT's default hotloop is 56; with the original floor of 3, the
1m payload (iters=15) measured pre-JIT interpreter mode for most
of the run.
- Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall
time exceeds os.clock granularity (~1ms) by a comfortable margin.
| @@ -34,7 +36,8 @@ impl SkipCache { | |||
| return (slot, true); | |||
| } | |||
| let new = self.slots.len() as u32; | |||
| self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() }); | |||
| let empty: Rc<[u32]> = Rc::from([]); | |||
| self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }); | |||
| self.by_opener.insert(opener_idx, new); | |||
| -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave | ||
| -- the bench measuring interpreter mode for the large-payload scenarios | ||
| -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise). | ||
| local warmup = math.max(50, math.floor(iters / 5)) |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582
📒 Files selected for processing (3)
benches/lua_bench.lualua/qjson/table.luasrc/scan/avx2.rs
✅ Files skipped from review due to trivial changes (1)
- src/scan/avx2.rs
| -- before measurement starts. Excluded from timing and memory delta. | ||
| local warmup = math.max(3, math.floor(iters / 5)) | ||
| -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave | ||
| -- the bench measuring interpreter mode for the large-payload scenarios |
There was a problem hiding this comment.
Update the example to match current iteration count.
The comment states "1m has iters=15" but Line 285 shows iters = 60 for the 1m scenario. This mismatch makes the explanation confusing.
📝 Proposed fix
- -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
+ -- (1m has iters=60, iters/5=12 → without floor, trace might not compile).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benches/lua_bench.lua` at line 149, Update the explanatory comment that
currently reads "1m has iters=15" to reflect the actual iteration count used in
the benchmark (the `iters = 60` setting for the 1m scenario); locate the comment
near the bench description for large-payload/interpreter mode and change the
text to reference `iters = 60` (or reword to avoid hardcoding the number) so it
matches the `iters` variable defined for the 1m scenario.
Summary
Performance optimization across the full decode → access → modify → encode pipeline.
Lua layer (Phase 2)
' .. s .. 'with zero table allocations. Escape lookup via table instead of if-elseif chain.is_dirtytree walk with a_dirtyboolean propagated up the parent chain on__newindex.__qjson_typemarker during materialization, eliminating duplicate key scan in encode.Rust layer (Phase 3)
Vec::cloneon every cache hit replaced with O(1)Rc::clone, eliminating O(n) heap allocation per repeated field access.structural_mask_chunk7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).validate_eager_valuesheap allocation replaced with[CtxKind; 64]fixed-size stack (falls back to Vec for depth > 64).parse_i64float-detection changed from byte-by-byte.any()to SIMD-acceleratedmemchr::memchr3.Benchmarks
Performance (aarch64, small/medium multimodal payloads)
Summary by CodeRabbit