perf: Lua encode + cache + scanner + validation stack optimizations by membphis · Pull Request #54 · api7/lua-qjson

membphis · 2026-05-23T12:44:45Z

Summary

Performance optimization across the full decode → access → modify → encode pipeline.

Lua layer (Phase 2)

encode_string bulk-copy: Replaces per-byte loop with segment-based copying. Escape-free strings return via ' .. s .. ' with zero table allocations. Escape lookup via table instead of if-elseif chain.
_dirty flag (O(1)): Replaces recursive is_dirty tree walk with a _dirty boolean propagated up the parent chain on __newindex.
is_array pre-scan elimination: Tags materialized tables with __qjson_type marker during materialization, eliminating duplicate key scan in encode.

Rust layer (Phase 3)

SkipCache Rc<[u32]>: Vec::clone on every cache hit replaced with O(1) Rc::clone, eliminating O(n) heap allocation per repeated field access.
AVX2 PSHUFB-LUT: structural_mask_chunk 7-cmpeq loop replaced with nibble-based PSHUFB classification (x86 only, NEON tag-bit strategy).
Validation stack array: validate_eager_values heap allocation replaced with [CtxKind; 64] fixed-size stack (falls back to Vec for depth > 64).
memchr3: parse_i64 float-detection changed from byte-by-byte .any() to SIMD-accelerated memchr::memchr3.

Benchmarks

Added 3 modified-encode scenarios (modify top / add field / modify nested) across all payload sizes.

Performance (aarch64, small/medium multimodal payloads)

modify top + encode: +35–50%
add field + encode: +30–59%
modify nested + encode: +36–99%

Summary by CodeRabbit

Refactor
- Improved performance and memory behavior via caching, shared buffers, and a fixed-stack state machine for validation.
New Features
- Benchmarks extended with new mutation-and-encode scenarios to measure varied payload and mutation patterns.
Bug Fixes
- More reliable JSON encode behavior and deterministic handling of arrays vs objects, with correct propagation of lazy mutations.
Tests
- Expanded tests covering nested mutations, top-level additions, and special-character string round-trips.

- Replace per-byte encode_string with bulk-copy segments and fast escape-free path (zero table allocations for clean strings) - Replace recursive is_dirty with O(1) _dirty flag propagated via parent chain on __newindex - Eliminate is_array pre-scan by tagging materialized tables with __qjson_type marker during materialization - Replace Vec<u32>::clone in SkipCache with Rc<[u32]> for O(1) cache-hit access instead of O(n) vector clone - Replace 7-cmpeq loop in AVX2 structural_mask_chunk with PSHUFB-based nibble LUT classification - Replace heap-allocated validation state stack with fixed-size [CtxKind; 64] array for typical JSON depths - Replace byte-by-byte float-detection in parse_i64 with memchr::memchr3 - Add modified-encode benchmark scenarios (modify top / add field / modify nested) to bench harness - Add correctness tests for modified-encode round-trip

coderabbitai · 2026-05-23T12:44:56Z

📝 Walkthrough

Walkthrough

Refactors lazy JSON mutation tracking to explicit _dirty flags, rewrites string encoding and encoder dispatch, replaces several vectors with shared Rc slices, adds a fixed-stack validator, optimizes number scanning and updates an AVX2 comment, and extends tests and benchmarks for modify→encode flows.

Changes

Integrated Performance Optimization Suite

Layer / File(s)	Summary
Lazy mutation tracking via dirty flags `lua/qjson/table.lua`	Adds `_parent` links and `_dirty` flags on lazy child views; `__newindex` marks ancestors dirty, materialization clears lazy internals and tags realized container kind; root decode initializes `_dirty = false`.
String encoding with ESCAPES lookup and segmented build `lua/qjson/table.lua`	Replaces `encode_string` with ESCAPES-table-based escaper using a bulk/segmented build strategy and `\u00xx` control-byte escapes.
Encoder fast path and plain-table dispatch `lua/qjson/table.lua`	Removes recursive `is_dirty`; `encode_proxy` checks `t._dirty`; adds `encode_plain_table` to deterministically choose array vs object via `empty_array_mt` and `TABLE_TYPE_HINT`; `encode()` delegates plain tables.
Skip cache refactor to Rc-backed shared slices `src/skip_cache.rs`	Converts `SkipSlot.child_starts`/`child_ends` to `Rc<[u32]>` shared slices; `SkipCache::new` and `get_or_insert` reuse a shared empty `Rc<[u32]>`.
Cursor walk_children cache hit optimization with Rc cloning `src/cursor.rs`	Cache-hit path now uses `Rc::clone` instead of cloning vectors; cache storage updated to use `into()` conversions for consistency.
Validation eager checker fixed-size stack and macro helpers `src/validate/mod.rs`	Replaces heap `Vec<CtxKind>` with fixed-capacity `stack_buf` and macros (`push!`, `pop!`, `last_mut!`, `stack_len!`, `stack_is_empty!`); moves nesting-depth checks to container-entry and updates state access.
Number parsing optimization with memchr3 `src/decode/number.rs`	`parse_i64` uses `memchr::memchr3` to detect `.`/`e`/`E` instead of a byte-iteration scan, preserving error behavior.
AVX2 structural_mask_chunk comment update `src/scan/avx2.rs`	Updates the doc comment to reflect the 7-parallel-byte-equality-compare implementation and related microarchitectural notes; implementation unchanged.
Test coverage for lazy table mutations `tests/lua/lazy_table_spec.lua`	Adds test cases validating encode reflects top-level and nested mutations, new field insertion, unmodified fast-path encoding, and proper escaping round-trip of special characters.
Benchmarking modify → encode workflows `benches/lua_bench.lua`	Adds default and scenario-specific modify callbacks (top/add/nested), wires `github-100k` helpers, selects modifiers per-scenario, and adds decode→modify→encode benchmark cases (per-scenario and interleaved) with output-length validation.

Sequence Diagram(s)

No sequence diagram generated for this PR.

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly Related PRs

api7/lua-qjson#51: Modifies src/validate/mod.rs eager validation logic; both PRs refactor container-depth enforcement to check nesting depth immediately during stack operations.

🚥 Pre-merge checks | ✅ 6

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main performance optimization work across Lua and Rust layers—encode string optimization, cache improvements, scanner updates, and validation stack changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
E2e Test Quality Review	✅ Passed	E2E tests cover decode→access→mutate→encode pipeline with 55 cases. Validates Rust FFI, lazy metamethods, materialization, dirty tracking, and encode optimizations plus proper error handling.
Security Check	✅ Passed	JSON library performance optimization PR with no auth, database, TLS, or credential code. All security categories are not applicable. FFI operations properly bounded.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 001-json-perf-optimize

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

membphis · 2026-05-23T12:48:51Z

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	98,965	148,364	+50%
medium (60KB)	52,854	71,480	+35%
100k	42,105	54,705	+30%
500k	10,070	13,541	+34%
1m	5,111	6,699	+31%

add field + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	99,548	158,013	+59%
medium (60KB)	60,234	78,040	+30%
100k	46,253	61,502	+33%
500k	12,158	14,018	+15%
1m	6,209	7,349	+18%

modify nested + encode

Payload	Before (ops/s)	After (ops/s)	Change
small (2KB)	59,056	117,409	+99%
medium (60KB)	121,212	165,344	+36%
100k	93,371	107,759	+15%
500k	34,423	35,881	+4%
1m	18,773	19,025	+1%

unmodified encode (memcpy fast path)

Payload	Before (ops/s)	After (ops/s)	Change
100k	159,744	220,000	+38%
1m	19,841	21,941	+11%

decode + access content

Payload	Before (ops/s)	After (ops/s)	Change
100k	155,039	158,028	+2%
1m	20,718	21,556	+4%

Before = baseline prior to any optimization. After = Phase 2 (Lua layer) + Phase 3 (Rust layer). Benchmarks run via make bench on same hardware throughout.

Copilot

Pull request overview

Performance-focused PR optimizing qjson’s full pipeline (decode → access/modify → encode), with targeted improvements in both the Lua proxy layer and Rust core (scanner/cache/validation/number parsing).

Changes:

Lua: faster string encoding, O(1) mutation tracking via _dirty propagation, and table type tagging to avoid repeated array/object scans during encode.
Rust: SkipCache avoids per-hit Vec cloning via Rc<[u32]>, AVX2 structural classification uses PSHUFB LUTs, eager validation uses a fixed-size stack fast path, and parse_i64 uses memchr3.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
tests/lua/lazy_table_spec.lua	Adds encode/mutation test coverage for new fast paths and dirty propagation.
lua/qjson/table.lua	Implements Lua-layer encode + mutation tracking optimizations (string escaper, `_dirty`, type tagging).
src/skip_cache.rs	Switches skip-cache child index storage to `Rc<[u32]>` for O(1) cache-hit cloning.
src/cursor.rs	Uses `Rc::clone` on cache hits and stores populated vectors into `Rc<[u32]>`.
src/scan/avx2.rs	Replaces repeated cmpeq loop with nibble-based PSHUFB LUT classification for structurals.
src/validate/mod.rs	Reworks eager validation stack to avoid heap allocation for typical depths.
src/decode/number.rs	Uses `memchr::memchr3` for faster integer/float discrimination.
benches/lua_bench.lua	Adds modified-encode benchmark scenarios (modify top/add field/modify nested).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

src/skip_cache.rs (1)

25-27: 💤 Low value

Consider a single shared empty slice constant.

Both new() and get_or_insert() create new empty Rc<[u32]> instances. While empty slices are small, a single shared constant would eliminate these allocations entirely.

♻️ Optional refactor using lazy_static or thread_local

+use std::cell::RefCell;
+
+thread_local! {
+    static EMPTY_SLICE: Rc<[u32]> = Rc::from([]);
+}
+
 impl SkipCache {
     pub(crate) fn new() -> Self {
-        let empty: Rc<[u32]> = Rc::from([]);
         Self {
-            slots: vec![SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty }],
+            slots: vec![SkipSlot { 
+                child_starts: EMPTY_SLICE.with(Rc::clone), 
+                child_ends: EMPTY_SLICE.with(Rc::clone) 
+            }],
             by_opener: FxHashMap::default(),
         }
     }
 
     pub(crate) fn get_or_insert(&mut self, opener_idx: u32) -> (u32, bool) {
         if let Some(&slot) = self.by_opener.get(&opener_idx) {
             return (slot, true);
         }
         let new = self.slots.len() as u32;
-        let empty: Rc<[u32]> = Rc::from([]);
-        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
+        self.slots.push(SkipSlot { 
+            child_starts: EMPTY_SLICE.with(Rc::clone), 
+            child_ends: EMPTY_SLICE.with(Rc::clone) 
+        });
         self.by_opener.insert(opener_idx, new);
         (new, false)
     }

Also applies to: 39-40

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/skip_cache.rs` around lines 25 - 27, Create a single shared empty
Rc<[u32]> and use it from both new() and get_or_insert() instead of calling
Rc::from([]) twice; define a static (e.g. using once_cell::sync::Lazy or
lazy_static) EMPTY_EMPTY_U32_RC and replace the Rc::from([]) usages in SkipSlot
construction and any place in get_or_insert() with
Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

tests/lua/lazy_table_spec.lua (1)

394-442: ⚡ Quick win

Add a regression test for __qjson_type payload-key collision.

Please add a case where decoded data includes a user key named __qjson_type and verify encode still preserves object semantics after mutation.

Example test to add

+    it("does not let user __qjson_type key alter table dispatch", function()
+        local cjson = require("cjson")
+        local t = qjson.decode('{"__qjson_type":"array","a":1}')
+        t.a = 2
+        local out = qjson.encode(t)
+        local parsed = cjson.decode(out)
+        assert.are.equal("array", parsed.__qjson_type)
+        assert.are.equal(2, parsed.a)
+    end)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/lua/lazy_table_spec.lua` around lines 394 - 442, Add a regression test
in tests/lua/lazy_table_spec.lua that decodes JSON containing a user-supplied
"__qjson_type" key via qjson.decode, mutates that field (and optionally other
fields or nested entries), re-encodes with qjson.encode, then cjson.decode the
output and assert the resulting table preserves object semantics and the mutated
"__qjson_type" value; locate the test near the existing it("...") cases and
follow their pattern using qjson.decode, qjson.encode, and cjson.decode to
validate both the "__qjson_type" field and other fields remain correct after
encode/decode.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Around line 261-263: The benchmark helper github_table_modify_add currently
does t.extra_field = true which has no effect when the encoded JSON root is an
array; change github_table_modify_add to detect array-root tables (e.g., check
if `#t` > 0 or use ipairs) and in that case iterate over elements and set
element.extra_field = true for each item, otherwise set t.extra_field = true on
the object root so the added field appears in encoded output for both object and
array roots.

In `@lua/qjson/table.lua`:
- Around line 286-287: The code currently embeds a user-visible string key
"__qjson_type" via rawset(t, "__qjson_type", "object"), which lets user payloads
collide with encoder dispatch; replace this by storing the marker in an internal
weak side-table (e.g., a local table with metatable __mode="k") instead of
mutating the user's table, set side_table[t] = "object" wherever rawset(t,
"__qjson_type", ...) is used (references: rawset and the "__qjson_type" marker
in table.lua around the shown block and the similar block at lines ~552-557),
and update any lookup logic that currently reads t["__qjson_type"] to consult
the side-table to determine the container type.

---

Nitpick comments:
In `@src/skip_cache.rs`:
- Around line 25-27: Create a single shared empty Rc<[u32]> and use it from both
new() and get_or_insert() instead of calling Rc::from([]) twice; define a static
(e.g. using once_cell::sync::Lazy or lazy_static) EMPTY_EMPTY_U32_RC and replace
the Rc::from([]) usages in SkipSlot construction and any place in
get_or_insert() with Rc::clone(&EMPTY_EMPTY_U32_RC) so SkipSlot { child_starts:
Rc::clone(&EMPTY_EMPTY_U32_RC), child_ends: Rc::clone(&EMPTY_EMPTY_U32_RC) } (or
equivalent) to eliminate duplicate empty allocations while keeping types
Rc<[u32]> consistent.

In `@tests/lua/lazy_table_spec.lua`:
- Around line 394-442: Add a regression test in tests/lua/lazy_table_spec.lua
that decodes JSON containing a user-supplied "__qjson_type" key via
qjson.decode, mutates that field (and optionally other fields or nested
entries), re-encodes with qjson.encode, then cjson.decode the output and assert
the resulting table preserves object semantics and the mutated "__qjson_type"
value; locate the test near the existing it("...") cases and follow their
pattern using qjson.decode, qjson.encode, and cjson.decode to validate both the
"__qjson_type" field and other fields remain correct after encode/decode.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 05f6541a-c41b-4fe1-8fe6-6819cb53bdd0

📥 Commits

Reviewing files that changed from the base of the PR and between 8e49d18 and c7574e3.

📒 Files selected for processing (8)

benches/lua_bench.lua
lua/qjson/table.lua
src/cursor.rs
src/decode/number.rs
src/scan/avx2.rs
src/skip_cache.rs
src/validate/mod.rs
tests/lua/lazy_table_spec.lua

- Replace __qjson_type string key with local weak side-table (TABLE_TYPE_HINT) to prevent collision with user payload keys - Fix github_table_modify_add to mutate first element instead of array root so the added field appears in encoded output

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

membphis · 2026-05-23T15:34:49Z

    // After the walk, the stack must hold exactly one frame: the root
    // context, which must be `TopDone` (root value consumed).
-    if stack.len() != 1 || stack[0] != CtxKind::TopDone {
+    if stack_len!() != 1 || stack_buf[0] != CtxKind::TopDone {


Verified against the existing 256-deep test (tests/rfc8259_compliance.rs:119 accepts_nested_at_configured_limit) — it parses successfully on this branch. The premise doesn't hold: Top → TopDone mutation happens at line 226 (*cur = parent_after_value(*cur)) on the first [/{ of input, before any push that could overflow into the fallback Vec. So stack_buf[0] is set to TopDone on entry to any deep section, gets copied into fb[0] when fallback is created, and the final-state check at line 317 reads the right slot. Leaving as-is.

membphis · 2026-05-23T15:34:50Z

+    macro_rules! push {
+        ($kind:expr) => {
+            if sp < STACK_CAP {
+                stack_buf[sp] = $kind;
+                sp += 1;
+            } else {
+                let fb = fallback.get_or_insert_with(|| {
+                    let mut v: Vec<CtxKind> = Vec::with_capacity(STACK_CAP + 16);
+                    v.extend_from_slice(&stack_buf[..sp]);
+                    v
+                });
+                sp = STACK_CAP.wrapping_add(fb.len() + 1);
+                fb.push($kind);
+            }
+        };
+    }
+    macro_rules! pop {
+        () => {{
+            if sp <= STACK_CAP {
+                if sp == 0 { None }
+                else { sp -= 1; Some(stack_buf[sp]) }
+            } else {
+                let fb = fallback.as_mut().unwrap();
+                let val = fb.pop();
+                if fb.is_empty() { sp = STACK_CAP; }
+                val
+            }


The bookkeeping is admittedly hard to read, but it's correct given an invariant the validator maintains: the Top marker is never popped from the stack, only mutated to TopDone (line 226). Bracket imbalance errors out earlier via the pop!().ok_or(...) check. Hence the branch if fb.is_empty() { sp = STACK_CAP; } is unreachable in practice — fb always retains the Top marker at index 0. Refactoring to a single source-of-truth backing store (SmallVec or always-Vec) is a clarity improvement worth doing as a follow-up, but out of scope for this perf-focused PR.

membphis · 2026-05-23T15:34:51Z

 -- The set of keys reserved by the lazy view bookkeeping; user-supplied JSON
 -- keys with these names would collide (minor, deferred). Centralized here so
 -- the dirty check and __newindex can share the list.
 local INTERNAL_KEYS = {
    _doc = true, _cur_box = true, _cur = true, _bs = true, _be = true,
+    _parent = true, _dirty = true,
 }


Good catch — comment updated to reference the actual current consumers (__newindex cache snapshotting and encode_lazy_object_walking for skipping internals on dirty proxies). Pushing in the next commit.

membphis · 2026-05-23T15:34:52Z

+    for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
+        rawset(t, f, nil)
+    end
    setmetatable(t, nil)
+    TABLE_TYPE_HINT[t] = "object"


Accepted — replaced the ipairs({...}) allocation with seven inline rawset(t, ..., nil) calls. Using rawset rather than multi-assignment because the root view (created in _M.decode) doesn't carry _parent and a direct t._parent = nil would trip __newindex recursion on it. Zero per-call allocations on the materialization path now.

membphis · 2026-05-23T15:34:54Z

+    for _, f in ipairs({"_parent", "_dirty", "_doc", "_cur_box", "_cur", "_bs", "_be"}) do
+        rawset(t, f, nil)
+    end


Same fix applied in LazyArray.__newindex — seven inline rawset(t, ..., nil) calls, no per-call allocation.

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.

membphis · 2026-05-23T13:20:32Z

Performance Comparison: main vs PR (aarch64, Apple M-series)

Both branches run the same modified-encode benchmark suite for apples-to-apples comparison. Main branch uses original lua/qjson/table.lua + original Rust code.

Key metrics (median ops/s)

Scenario	Op	main	PR	Change
100k	modify top + encode	51,203	48,473	-5%
100k	add field + encode	48,591	48,193	-1%
100k	modify nested + encode	95,969	107,991	+13%
200k	modify nested + encode	66,401	74,405	+12%
1m	modify top + encode	6,098	5,682	-7%
1m	modify nested + encode	19,036	19,455	+2%
5m	modify top + encode	585	1,043	+78%
5m	add field + encode	579	1,039	+79%
10m	modify top + encode	275	457	+66%

Existing paths (parse, access, encode unmod)

Flat within bench variance (±5% for parse+access and decode+access). The encode(unmod) path shows ~25% lower numbers in the full bench due to JIT trace slot competition from the 3 additional modify-encode scenarios; isolated testing with only encode(unmod) shows identical performance between branches.

Modified-encode scenarios (small/medium multimodal payloads)

Scenario	Op	ops/s
small (2KB)	modify top + encode	145,641
small (2KB)	add field + encode	148,920
small (2KB)	modify nested + encode	81,171
medium (60KB)	modify top + encode	69,185
medium (60KB)	add field + encode	80,632
medium (60KB)	modify nested + encode	174,095

Both branches tested with identical bench harness (modified-encode scenarios present in both). Main = original table.lua + original Rust. PR = optimized table.lua + optimized Rust (Rc cache, stack validation, memchr3, AVX2 LUT).

Scanner (src/scan/avx2.rs): - Revert structural_mask_chunk from the PSHUFB-LUT nibble classifier back to the parallel 7-cmpeq form. On AMD Zen2 the PSHUFB variant measured -45% parse on small payloads (-7% on 1m where the in-string fast-probe dominates anyway): VPSHUFB ymm is split into two micro-ops per 128-bit lane, the srli->and->pshufb->and->cmpeq-> xor->movemask chain has a longer critical path, and VPMOVMSKB (lat 4) plus the LUT-load constants pressure the FP ports. The parallel cmpeq design lets independent compare chains dispatch across multiple ports. NEON path is unaffected (separate file). Lua layer (lua/qjson/table.lua) -- review feedback: - Update INTERNAL_KEYS comment to reference its actual consumers (__newindex cache snapshotting and encode_lazy_object_walking), not the removed recursive is_dirty walk. - Replace 'for _, f in ipairs({...})' in both LazyObject.__newindex and LazyArray.__newindex with seven inline rawset(t, ..., nil) calls. Eliminates the per-call 7-elem table-literal allocation on the materialization path. Uses rawset (not multi-assignment to nil) because root views from _M.decode lack _parent, so 't._parent = nil' could fire __newindex recursively. Bench (benches/lua_bench.lua): - Force LuaJIT to evaluate qjson.encode results: replace 'local _ = qjson.encode(t)' with 'local _enc = qjson.encode(t); if #_enc < 2 then error(...) end' in all 8 mutation/encode cases. Without this, LuaJIT could partially DCE the encode call on some trace shapes and produce misleading speedups. - Raise the warmup floor from max(3, iters/5) to max(50, iters/5). LuaJIT's default hotloop is 56; with the original floor of 3, the 1m payload (iters=15) measured pre-JIT interpreter mode for most of the run. - Raise 500k iters 20->100 and 1m iters 15->60 so per-round wall time exceeds os.clock granularity (~1ms) by a comfortable margin.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

@@ -34,7 +36,8 @@ impl SkipCache {
            return (slot, true);
        }
        let new = self.slots.len() as u32;
-        self.slots.push(SkipSlot { child_starts: Vec::new(), child_ends: Vec::new() });
+        let empty: Rc<[u32]> = Rc::from([]);
+        self.slots.push(SkipSlot { child_starts: Rc::clone(&empty), child_ends: empty });
        self.by_opener.insert(opener_idx, new);


+    -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
+    -- the bench measuring interpreter mode for the large-payload scenarios
+    -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise).
+    local warmup = math.max(50, math.floor(iters / 5))


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/lua_bench.lua`:
- Line 149: Update the explanatory comment that currently reads "1m has
iters=15" to reflect the actual iteration count used in the benchmark (the
`iters = 60` setting for the 1m scenario); locate the comment near the bench
description for large-payload/interpreter mode and change the text to reference
`iters = 60` (or reword to avoid hardcoding the number) so it matches the
`iters` variable defined for the 1m scenario.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 87af55bb-8a10-4f0b-8acb-70395b800582

📥 Commits

Reviewing files that changed from the base of the PR and between c7574e3 and 65da500.

📒 Files selected for processing (3)

benches/lua_bench.lua
lua/qjson/table.lua
src/scan/avx2.rs

✅ Files skipped from review due to trivial changes (1)

src/scan/avx2.rs

coderabbitai · 2026-05-23T15:44:04Z

    -- before measurement starts. Excluded from timing and memory delta.
-    local warmup = math.max(3, math.floor(iters / 5))
+    -- Floor at 50: LuaJIT hotloop default is 56, so fewer iterations leave
+    -- the bench measuring interpreter mode for the large-payload scenarios


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the example to match current iteration count.

The comment states "1m has iters=15" but Line 285 shows iters = 60 for the 1m scenario. This mismatch makes the explanation confusing.

📝 Proposed fix

- -- (1m has iters=15, iters/5=3 → trace never compiles → ~30% noise). + -- (1m has iters=60, iters/5=12 → without floor, trace might not compile).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benches/lua_bench.lua` at line 149, Update the explanatory comment that currently reads "1m has iters=15" to reflect the actual iteration count used in the benchmark (the `iters = 60` setting for the 1m scenario); locate the comment near the bench description for large-payload/interpreter mode and change the text to reference `iters = 60` (or reword to avoid hardcoding the number) so it matches the `iters` variable defined for the 1m scenario.

Copilot AI review requested due to automatic review settings May 23, 2026 12:44

Copilot started reviewing on behalf of membphis May 23, 2026 12:44 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

Comment thread lua/qjson/table.lua Outdated

Comment thread lua/qjson/table.lua

Comment thread lua/qjson/table.lua

fix: prevent __qjson_type and _dirty leakage into encode output

913f6a8

- Skip __qjson_type internal marker in encode_object() pairs iteration - Guard dirty propagation to stop at non-lazy ancestors, preventing _dirty rawset on already-materialized tables

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Comment thread benches/lua_bench.lua

Comment thread lua/qjson/table.lua Outdated

Copilot AI review requested due to automatic review settings May 23, 2026 12:55

Copilot started reviewing on behalf of membphis May 23, 2026 12:55 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

perf: extract encode_plain_table to keep lazy-proxy fast path narrow

c8630b7

Move TABLE_TYPE_HINT/is_array/empty_array_mt dispatch into a separate function so the hot encode path (lazy proxy → memcpy) stays minimal for LuaJIT trace compilation.

Copilot AI review requested due to automatic review settings May 23, 2026 15:38

Copilot started reviewing on behalf of membphis May 23, 2026 15:38 View session

Copilot AI reviewed May 23, 2026

View reviewed changes

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

Conversation

membphis commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Lua layer (Phase 2)

Rust layer (Phase 3)

Benchmarks

Performance (aarch64, small/medium multimodal payloads)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated Code Review Effort

Possibly Related PRs

Uh oh!

membphis commented May 23, 2026

Performance Comparison (aarch64, Apple M-series, NEON)

modify top + encode

add field + encode

modify nested + encode

unmodified encode (memcpy fast path)

decode + access content

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis May 23, 2026

Choose a reason for hiding this comment

Uh oh!

membphis commented May 23, 2026

Performance Comparison: main vs PR (aarch64, Apple M-series)

Key metrics (median ops/s)

Existing paths (parse, access, encode unmod)

Modified-encode scenarios (small/medium multimodal payloads)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

membphis commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading