Skip to content

perf: Lua encode + cache + scanner + validation stack optimizations#54

Open
membphis wants to merge 5 commits into
mainfrom
001-json-perf-optimize
Open

perf: Lua encode + cache + scanner + validation stack optimizations#54
membphis wants to merge 5 commits into
mainfrom
001-json-perf-optimize

Conversation

@membphis
Copy link
Copy Markdown
Collaborator

@membphis membphis commented May 23, 2026

Summary

Performance optimization across the full decode → access → modify → encode pipeline.

Lua layer (Phase 2)

  • encode_string bulk-copy: Replaces per-byte loop with segment-based copying. Escape-free strings return via ' .. s .. ' with zero table allocations. Escape lookup via table instead of if-elseif chain.
  • _dirty flag (O(1)): Replaces recursive is_dirty tree walk with a _dirty boolean propagated up the parent chain on __newindex.
  • is_array pre-scan elimination: Tags materialized tables with __qjson_type marker during materialization, eliminating duplicate key scan in encode.

Rust layer (Phase 3)

  • SkipCache Rc<[u32]>: Vec::clone on every cache hit replaced with O(1) Rc::clone, eliminating O(n) heap allocation per repeated field access.
  • AVX2 PSHUFB-LUT: structural_mask_chunk 7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).
  • Validation stack array: validate_eager_values heap allocation replaced with [CtxKind; 64] fixed-size stack (falls back to Vec for depth > 64).
  • memchr3: parse_i64 float-detection changed from byte-by-byte .any() to SIMD-accelerated memchr::memchr3.

Benchmarks

  • Added 3 modified-encode scenarios (modify top / add field / modify nested) across all payload sizes.

Performance (aarch64, small/medium multimodal payloads)

  • modify top + encode: +35–50%
  • add field + encode: +30–59%
  • modify nested + encode: +36–99%

Summary by CodeRabbit

  • Refactor
    • Improved performance and memory behavior via caching, shared buffers, and a fixed-stack state machine for validation.
  • New Features
    • Benchmarks extended with new mutation-and-encode scenarios to measure varied payload and mutation patterns.
  • Bug Fixes
    • More reliable JSON encode behavior and deterministic handling of arrays vs objects, with correct propagation of lazy mutations.
  • Tests
    • Expanded tests covering nested mutations, top-level additions, and special-character string round-trips.

Review Change Stack

- Replace per-byte encode_string with bulk-copy segments and fast
  escape-free path (zero table allocations for clean strings)
- Replace recursive is_dirty with O(1) _dirty flag propagated via
  parent chain on __newindex
- Eliminate is_array pre-scan by tagging materialized tables with
  __qjson_type marker during materialization
- Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1)
  cache-hit access instead of O(n) vector clone
- Replace 7-cmpeq loop in AVX2 structural_mask_chunk with
  PSHUFB-based nibble LUT classification
- Replace heap-allocated validation state stack with fixed-size
  [CtxKind; 64] array for typical JSON depths
- Replace byte-by-byte float-detection in parse_i64 with
  memchr::memchr3
- Add modified-encode benchmark scenarios (modify top / add field /
  modify nested) to bench harness
- Add correctness tests for modified-encode round-trip
Copilot AI review requested due to automatic review settings May 23, 2026 12:44
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 23, 2026

📝 Walkthrough

Walkthrough

Refactors lazy JSON mutation tracking to explicit _dirty flags, rewrites string encoding and encoder dispatch, replaces several vectors with shared Rc slices, adds a fixed-stack validator, optimizes number scanning and updates an AVX2 comment, and extends tests and benchmarks for modify→encode flows.

Changes

Integrated Performance Optimization Suite

Layer / File(s) Summary
Lazy mutation tracking via dirty flags
lua/qjson/table.lua
Adds _parent links and _dirty flags on lazy child views; __newindex marks ancestors dirty, materialization clears lazy internals and tags realized container kind; root decode initializes _dirty = false.
String encoding with ESCAPES lookup and segmented build
lua/qjson/table.lua
Replaces encode_string with ESCAPES-table-based escaper using a bulk/segmented build strategy and \u00xx control-byte escapes.
Encoder fast path and plain-table dispatch
lua/qjson/table.lua
Removes recursive is_dirty; encode_proxy checks t._dirty; adds encode_plain_table to deterministically choose array vs object via empty_array_mt and TABLE_TYPE_HINT; encode() delegates plain tables.
Skip cache refactor to Rc-backed shared slices
src/skip_cache.rs
Converts SkipSlot.child_starts/child_ends to Rc<[u32]> shared slices; SkipCache::new and get_or_insert reuse a shared empty Rc<[u32]>.
Cursor walk_children cache hit optimization with Rc cloning
src/cursor.rs
Cache-hit path now uses Rc::clone instead of cloning vectors; cache storage updated to use into() conversions for consistency.
Validation eager checker fixed-size stack and macro helpers
src/validate/mod.rs
Replaces heap Vec<CtxKind> with fixed-capacity stack_buf and macros (push!, pop!, last_mut!, stack_len!, stack_is_empty!); moves nesting-depth checks to container-entry and updates state access.
Number parsing optimization with memchr3
src/decode/number.rs
parse_i64 uses memchr::memchr3 to detect ./e/E instead of a byte-iteration scan, preserving error behavior.
AVX2 structural_mask_chunk comment update
src/scan/avx2.rs
Updates the doc comment to reflect the 7-parallel-byte-equality-compare implementation and related microarchitectural notes; implementation unchanged.
Test coverage for lazy table mutations
tests/lua/lazy_table_spec.lua
Adds test cases validating encode reflects top-level and nested mutations, new field insertion, unmodified fast-path encoding, and proper escaping round-trip of special characters.
Benchmarking modify → encode workflows
benches/lua_bench.lua
Adds default and scenario-specific modify callbacks (top/add/nested), wires github-100k helpers, selects modifiers per-scenario, and adds decode→modify→encode benchmark cases (per-scenario and interleaved) with output-length validation.

Sequence Diagram(s)

No sequence diagram generated for this PR.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly Related PRs

  • api7/lua-qjson#51: Modifies src/validate/mod.rs eager validation logic; both PRs refactor container-depth enforcement to check nesting depth immediately during stack operations.
🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main performance optimization work across Lua and Rust layers—encode string optimization, cache improvements, scanner updates, and validation stack changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
E2e Test Quality Review ✅ Passed E2E tests cover decode→access→mutate→encode pipeline with 55 cases. Validates Rust FFI, lazy metamethods, materialization, dirty tracking, and encode optimizations plus proper error handling.
Security Check ✅ Passed JSON library performance optimization PR with no auth, database, TLS, or credential code. All security categories are not applicable. FFI operations properly bounded.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 001-json-perf-optimize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@membphis
Copy link
Copy Markdown
Collaborator Author

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 98,965 148,364 +50%
medium (60KB) 52,854 71,480 +35%
100k 42,105 54,705 +30%
500k 10,070 13,541 +34%
1m 5,111 6,699 +31%

add field + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 99,548 158,013 +59%
medium (60KB) 60,234 78,040 +30%
100k 46,253 61,502 +33%
500k 12,158 14,018 +15%
1m 6,209 7,349 +18%

modify nested + encode

Payload Before (ops/s) After (ops/s) Change
small (2KB) 59,056 117,409 +99%
medium (60KB) 121,212 165,344 +36%
100k 93,371 107,759 +15%
500k 34,423 35,881 +4%
1m 18,773 19,025 +1%

unmodified encode (memcpy fast path)

Payload Before (ops/s) After (ops/s) Change
100k 159,744 220,000 +38%
1m 19,841 21,941 +11%

decode + access content

Payload Before (ops/s) After (ops/s) Change
100k 155,039 158,028 +2%
1m 20,718 21,556 +4%

Before = baseline prior to any optimization. After = Phase 2 (Lua layer) + Phase 3 (Rust layer). Benchmarks run via make bench on same hardware throughout.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).

Changes:

  • Lua: faster string encoding, O(1) mutation tracking via _dirty propagation, and table type tagging to avoid repeated array/object scans during encode.
  • Rust: SkipCache avoids per-hit Vec cloning via Rc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, and parse_i64 uses memchr3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/lua/lazy_table_spec.lua Adds encode/mutation test coverage for new fast paths and dirty propagation.
lua/qjson/table.lua Implements Lua-layer encode + mutation tracking optimizations (string escaper, _dirty, type tagging).
src/skip_cache.rs Switches skip-cache child index storage to Rc<[u32]> for O(1) cache-hit cloning.
src/cursor.rs Uses Rc::clone on cache hits and stores populated vectors into Rc<[u32]>.
src/scan/avx2.rs Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals.
src/validate/mod.rs Reworks eager validation stack to avoid heap allocation for typical depths.
src/decode/number.rs Uses memchr::memchr3 for faster integer/float discrimination.
benches/lua_bench.lua Adds modified-encode benchmark scenarios (modify top/add field/modify nested).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lua/qjson/table.lua Outdated
Comment thread lua/qjson/table.lua
Comment thread lua/qjson/table.lua
- Skip __qjson_type internal marker in encode_object() pairs iteration
- Guard dirty propagation to stop at non-lazy ancestors,
  preventing _dirty rawset on already-materialized tables
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/skip_cache.rs (1)

25-27: 💤 Low value

Consider a single shared empty slice constant.

Both new() and get_or_insert() create new empty Rc<[u32]> instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.

♻️ Optional refactor using lazy_static or thread_local
+use std::cell::RefCell;
+
+thread_local! {
+    static EMPTY_SLICE: Rc<[u32]> = Rc::from([]);
+}
+
 impl SkipCache {
     pub(crate) fn new() -> Self {
-        let empty: Rc<[u32]> = Rc::from([]);
         Self {
-            slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }],
+            slots: vec![SkipSlot { 
+                child_starts: EMPTY_SLICE.with(Rc::clone), 
+                child_ends: EMPTY_SLICE.with(Rc::clone) 
+            }],
             by_opener: FxHashMap::default(),
         }
     }
 
     pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) {
         if let Some(&slot) = self.by_opener.get(&opener_idx) {
             return (slot, true);
         }
         let new = self.slots.len() as u32;
-        let empty: Rc<[u32]> = Rc::from([]);
-        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
+        self.slots.push(SkipSlot { 
+            child_starts: EMPTY_SLICE.with(Rc::clone), 
+            child_ends: EMPTY_SLICE.with(Rc::clone) 
+        });
         self.by_opener.insert(opener_idx, new);
         (new, false)
     }

Also applies to: 39-40

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty
Rc<[u32]> and use it from both new() and get_or_insert() instead of calling
Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or
lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot
construction and any place in get_or_insert() with
Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.
tests/lua/lazy_table_spec.lua (1)

394-442: ⚡ Quick win

Add a regression test for __qjson_type payload-key collision.

Please add a case where decoded data includes a user key named __qjson_type and verify encode still preserves object semantics after mutation.

Example test to add
+    it("does not let user __qjson_type key alter table dispatch", function()
+        local cjson = require("cjson")
+        local t = qjson.decode('{"__qjson_type":"array","a":1}')
+        t.a = 2
+        local out = qjson.encode(t)
+        local parsed = cjson.decode(out)
+        assert.are.equal("array", parsed.__qjson_type)
+        assert.are.equal(2, parsed.a)
+    end)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test
in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied
"__qjson_type" key via qjson.decode, mutates that field (and optionally other
fields or nested entries), re-encodes with qjson.encode, then cjson.decode the
output and assert the resulting table preserves object semantics and the mutated
"__qjson_type" value; locate the test near the existing it("...") cases and
follow their pattern using qjson.decode, qjson.encode, and cjson.decode to
validate both the "__qjson_type" field and other fields remain correct after
encode/decode.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.

In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.

---

Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0

📥 Commits

Reviewing files that changed from the base of the PR and between 8e49d18 and c7574e3.

📒 Files selected for processing (8)
  • benches/lua_bench.lua
  • lua/qjson/table.lua
  • src/cursor.rs
  • src/decode/number.rs
  • src/scan/avx2.rs
  • src/skip_cache.rs
  • src/validate/mod.rs
  • tests/lua/lazy_table_spec.lua

Comment thread benches/lua_bench.lua
Comment thread lua/qjson/table.lua Outdated
- Replace __qjson_type string key with local weak side-table
  (TABLE_TYPE_HINT) to prevent collision with user payload keys
- Fix github_table_modify_add to mutate first element instead of
  array root so the added field appears in encoded output
Copilot AI review requested due to automatic review settings May 23, 2026 12:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread src/validate/mod.rs
// After the walk, the stack must hold exactly one frame: the root
// context, which must be `TopDone` (root value consumed).
if stack.len() != 1 || stack[0] != CtxKind::TopDone {
if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.

Comment thread src/validate/mod.rs
Comment on lines +154 to +180
macro_rules! push {
($kind:expr) => {
if sp < STACK_CAP {
stack_buf[sp] = $kind;
sp += 1;
} else {
let fb = fallback.get_or_insert_with(|| {
let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16);
v.extend_from_slice(&stack_buf[..sp]);
v
});
sp = STACK_CAP.wrapping_add(fb.len() + 1);
fb.push($kind);
}
};
}
macro_rules! pop {
() => {{
if sp <= STACK_CAP {
if sp == 0 { None }
else { sp -= 1; Some(stack_buf[sp]) }
} else {
let fb = fallback.as_mut().unwrap();
let val = fb.pop();
if fb.is_empty() { sp = STACK_CAP; }
val
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.

Comment thread lua/qjson/table.lua
Comment on lines 256 to 262
-- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON
-- keys with these names would collide (minor, deferred). Centralized here so
-- the dirty check and __newindex can share the list.
local INTERNAL_KEYS = {
_doc = true, _cur_box = true, _cur = true, _bs = true, _be = true,
_parent = true, _dirty = true,
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.

Comment thread lua/qjson/table.lua Outdated
Comment on lines +288 to +292
for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
rawset(t, f, nil)
end
setmetatable(t, nil)
TABLE_TYPE_HINT[t] = "object"
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.

Comment thread lua/qjson/table.lua Outdated
Comment on lines +321 to +323
for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
rawset(t, f, nil)
end
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a
separate function so the hot encode path (lazy proxy → memcpy)
stays minimal for LuaJIT trace compilation.
@membphis
Copy link
Copy Markdown
Collaborator Author

Performance Comparison: main vs PR (aarch64, Apple M-series)

Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code.

Key metrics (median ops/s)

Scenario Op main PR Change
100k modify top + encode 51,203 48,473 -5%
100k add field + encode 48,591 48,193 -1%
100k modify nested + encode 95,969 107,991 +13%
200k modify nested + encode 66,401 74,405 +12%
1m modify top + encode 6,098 5,682 -7%
1m modify nested + encode 19,036 19,455 +2%
5m modify top + encode 585 1,043 +78%
5m add field + encode 579 1,039 +79%
10m modify top + encode 275 457 +66%

Existing paths (parse, access, encode unmod)

Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches.

Modified-encode scenarios (small/medium multimodal payloads)

Scenario Op ops/s
small (2KB) modify top + encode 145,641
small (2KB) add field + encode 148,920
small (2KB) modify nested + encode 81,171
medium (60KB) modify top + encode 69,185
medium (60KB) add field + encode 80,632
medium (60KB) modify nested + encode 174,095

Both branches tested with identical bench harness (modified-encode scenarios present in both). Main = original table.lua + original Rust. PR = optimized table.lua + optimized Rust (Rc cache, stack validation, memchr3, AVX2 LUT).

Scanner (src/scan/avx2.rs):
- Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier
  back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant
  measured -45% parse on small payloads (-7% on 1m where the
  in-string fast-probe dominates anyway): VPSHUFB ymm is split into
  two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq->
  xor->movemask chain has a longer critical path, and VPMOVMSKB
  (lat 4) plus the LUT-load constants pressure the FP ports. The
  parallel cmpeq design lets independent compare chains dispatch
  across multiple ports. NEON path is unaffected (separate file).

Lua layer (lua/qjson/table.lua) -- review feedback:
- Update INTERNAL_KEYS comment to reference its actual consumers
  (__newindex cache snapshotting and encode_lazy_object_walking),
  not the removed recursive is_dirty walk.
- Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex
  and LazyArray.__newindex with seven inline rawset(t, ..., nil)
  calls. Eliminates the per-call 7-elem table-literal allocation
  on the materialization path. Uses rawset (not multi-assignment
  to nil) because root views from _M.decode lack _parent, so
  't._parent = nil' could fire __newindex recursively.

Bench (benches/lua_bench.lua):
- Force LuaJIT to evaluate qjson.encode results: replace
  'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t);
  if #_enc < 2 then error(...) end' in all 8 mutation/encode cases.
  Without this, LuaJIT could partially DCE the encode call on some
  trace shapes and produce misleading speedups.
- Raise the warmup floor from max(3, iters/5) to max(50, iters/5).
  LuaJIT's default hotloop is 56; with the original floor of 3, the
  1m payload (iters=15) measured pre-JIT interpreter mode for most
  of the run.
- Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall
  time exceeds os.clock granularity (~1ms) by a comfortable margin.
Copilot AI review requested due to automatic review settings May 23, 2026 15:38
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comment thread src/skip_cache.rs
Comment on lines 24 to 41
@@ -34,7 +36,8 @@ impl SkipCache {
return (slot, true);
}
let new = self.slots.len() as u32;
self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() });
let empty: Rc<[u32]> = Rc::from([]);
self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
self.by_opener.insert(opener_idx, new);
Comment thread benches/lua_bench.lua
Comment on lines +148 to +151
-- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
-- the bench measuring interpreter mode for the large-payload scenarios
-- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
local warmup = math.max(50, math.floor(iters / 5))
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582

📥 Commits

Reviewing files that changed from the base of the PR and between c7574e3 and 65da500.

📒 Files selected for processing (3)
  • benches/lua_bench.lua
  • lua/qjson/table.lua
  • src/scan/avx2.rs
✅ Files skipped from review due to trivial changes (1)
  • src/scan/avx2.rs

Comment thread benches/lua_bench.lua
-- before measurement starts. Excluded from timing and memory delta.
local warmup = math.max(3, math.floor(iters / 5))
-- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
-- the bench measuring interpreter mode for the large-payload scenarios
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the example to match current iteration count.

The comment states "1m has iters=15" but Line 285 shows iters = 60 for the 1m scenario. This mismatch makes the explanation confusing.

📝 Proposed fix
-    -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
+    -- (1m has iters=60, iters/5=12 → without floor, trace might not compile).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/lua_bench.lua` at line 149, Update the explanatory comment that
currently reads "1m has iters=15" to reflect the actual iteration count used in
the benchmark (the `iters = 60` setting for the 1m scenario); locate the comment
near the bench description for large-payload/interpreter mode and change the
text to reference `iters = 60` (or reword to avoid hardcoding the number) so it
matches the `iters` variable defined for the 1m scenario.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants