3 changes: 0 additions & 3 deletions .github/workflows/ci.yml
@@ -43,9 +43,6 @@ jobs:
- name: Test with test-panic feature
run: cargo test --features test-panic --release

- name: Test allocation accounting (count-allocs feature)
run: cargo test --release --features count-allocs --test alloc_count

lua:
name: Lua integration tests
runs-on: ubuntu-latest
7 changes: 3 additions & 4 deletions Cargo.toml
@@ -9,10 +9,9 @@ name = "quickdecode"
crate-type = ["cdylib", "rlib"]

[features]
-default = ["avx2"]
-avx2 = []
-test-panic = []
-count-allocs = []
+default = ["avx2"]
+avx2 = []
+test-panic = []

[dependencies]
memchr = "2"
23 changes: 0 additions & 23 deletions README.md
@@ -38,26 +38,6 @@ local model = body:get_str("model")
local temp = body:get_f64("temperature")
```

### Reusable decoder (pooled API)

For hot paths that parse many payloads (typical in OpenResty workers), use a
reusable decoder to amortize the per-parse indices / scratch / skip-cache
allocations:

```lua
local decoder = qd.new_decoder() -- one per worker is enough
for _, payload in ipairs(payloads) do
  local doc = decoder:parse(payload)
  -- ...access doc / open cursors...
end
decoder:reset() -- optional: shrink internal buffers
decoder:destroy() -- optional: free buffers eagerly
```

A `doc` returned by `decoder:parse()` becomes stale as soon as the same
decoder parses another payload (or is reset / destroyed). Accessor calls on a
stale doc return `nil`, the same convention as a missing path.
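
A minimal sketch of how a staleness check like this could work on the Rust side, assuming a per-decoder generation counter. The `Decoder`, `DocHandle`, `is_live`, and `index_count` names below are hypothetical and are not quickdecode's actual types, and the "scan" is a stand-in that only records quote positions:

```rust
/// Hypothetical pooled decoder: owns a reusable buffer and a generation counter.
pub struct Decoder {
    generation: u32,
    indices: Vec<u32>,
}

/// Hypothetical doc handle: remembers the generation it was created under.
#[derive(Clone, Copy)]
pub struct DocHandle {
    generation: u32,
}

impl Decoder {
    pub fn new() -> Self {
        Decoder { generation: 0, indices: Vec::new() }
    }

    /// Each parse bumps the generation, which invalidates every earlier handle.
    pub fn parse(&mut self, payload: &[u8]) -> DocHandle {
        // Wraps after 2^32 parses; a very old handle could then match again.
        self.generation = self.generation.wrapping_add(1);
        // `clear` keeps the allocated capacity, so the buffer is reused.
        self.indices.clear();
        // Stand-in for the real scan: record the position of every '"' byte.
        for (i, &b) in payload.iter().enumerate() {
            if b == b'"' {
                self.indices.push(i as u32);
            }
        }
        DocHandle { generation: self.generation }
    }

    /// A handle is live only if no later parse has happened on this decoder.
    pub fn is_live(&self, doc: DocHandle) -> bool {
        doc.generation == self.generation
    }

    /// Accessors return `None` for stale handles, like a missing path.
    pub fn index_count(&self, doc: DocHandle) -> Option<usize> {
        if self.is_live(doc) {
            Some(self.indices.len())
        } else {
            None
        }
    }
}
```

In this sketch a handle from the first `parse` starts returning `None` as soon as the decoder parses a second payload, which is the behavior the paragraph above describes for `nil` on the Lua side.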

## Testing — Lua

Requires LuaJIT + busted + lua-cjson installed system-wide.
@@ -96,6 +76,3 @@ Items intentionally pushed out of the first implementation. Each will be picked
- **`cargo fmt --check` not enforced** — `make lint` runs clippy only. The codebase uses intentional manual column alignment in struct definitions and compact single-line literals that default rustfmt would reflow. Skip rather than reformat until a project-wide style decision is made.
- **`validate_brackets` fusion into scan emit loop** — surfaced by profiling: on structurally-dense workloads `validate_brackets` is 65% of parse time (second linear pass over emitted indices). Folding bracket pairing into the scan emit loop via an inline depth stack eliminates that pass. No effect on the current string-heavy bench (0.3% there); a win for config / JSONL / table-shape JSON.
- **`memchr2` cross-chunk jump for very long string interiors** — the AVX2 in-string fast probe (issue #5) drops per-chunk cost from ~25 to ~10 ops but still pays ALU work for every 64-byte chunk in a string. A `memchr2(b'"', b'\\')` jump can approach memory bandwidth on multi-MB single-string payloads. Deferred until a workload that benefits clearly emerges; needs careful `bs_carry` reasoning across the jump.
- **Eliminate `validate_brackets` per-scan stack alloc on the pooled path** — the bracket-balance check builds a fresh `Vec::with_capacity(32)` every scan. On the pooled decoder API this and the per-parse `Box<qjd_doc>` are the only allocations the count-allocs test still sees (2 / parse). A pre-allocated stack on the `Decoder` would drop the count further; deferred because the absolute cost is tiny and the cleanest fix overlaps with the `validate_brackets` fusion item above.
- **Decoder pool / shared-decoder shortcut for `qd.parse`** — `qd.parse(payload)` still constructs a private decoder per call (1 indices Vec + 1 scratch + 1 skip-cache alloc each). A module-level shared decoder could make the legacy API allocation-free too, but adds a global-state footgun (no concurrent parses from coroutines); decoder pooling is exposed via the explicit `qd.new_decoder()` API instead. Reconsider if profiling shows `qd.parse` callers refusing to migrate.
- **Decoder generation counter wrap** — after `2^32` parses on the same decoder the generation counter wraps around to a value that an old doc (one the Lua GC has not yet collected) might match, masking staleness. At 1 ms per parse that is ~50 days of continuous reuse; in practice the doc is reclaimed long before. Could widen to `u64` or trip a hard error near the wrap point if a real-world workload comes close.
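
The pre-allocated stack idea from the list above could look roughly like the following. This is a sketch under assumptions, not the crate's code: the signature is made up for illustration, and the input is taken to be a byte slice that contains only structural characters, whereas the real `validate_brackets` runs over the emitted indices.

```rust
/// Checks bracket balance using a caller-provided stack so that repeated
/// calls reuse one allocation instead of building a fresh Vec per scan.
/// Assumes `structural` holds only structural bytes (no string contents).
pub fn validate_brackets(structural: &[u8], stack: &mut Vec<u8>) -> bool {
    // Reset length but keep capacity from previous calls.
    stack.clear();
    for &b in structural {
        match b {
            b'{' | b'[' => stack.push(b),
            b'}' => {
                if stack.pop() != Some(b'{') {
                    return false;
                }
            }
            b']' => {
                if stack.pop() != Some(b'[') {
                    return false;
                }
            }
            // Commas, colons, and quotes do not affect nesting.
            _ => {}
        }
    }
    // Balanced only if every opener was closed.
    stack.is_empty()
}
```

A decoder would own the `Vec` (for example, created once with `Vec::with_capacity(32)`) and pass it in on every scan, so the steady-state path performs no allocation unless nesting grows past the current capacity.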
118 changes: 111 additions & 7 deletions benches/lua_bench.lua
@@ -62,17 +62,36 @@ local function make_payload(target_bytes)
.. '[{"role":"user","content":[' .. table.concat(parts, ",") .. ']}]}'
end

+local ROUNDS = 5
+
 local function bench(name, iters, fn)
+-- Warmup pass: lets JIT compile hot traces and any one-time pools fill
+-- before measurement starts. Excluded from timing and memory delta.
+local warmup = math.max(3, math.floor(iters / 5))
+for _ = 1, warmup do fn() end
+
 collectgarbage("collect")
 local mem_before = collectgarbage("count")
-local t0 = os.clock()
-for _ = 1, iters do fn() end
-local t1 = os.clock()
+
+local ops = {}
+for r = 1, ROUNDS do
+local t0 = os.clock()
+for _ = 1, iters do fn() end
+local t1 = os.clock()
+ops[r] = iters / (t1 - t0)
+end
 local mem_after = collectgarbage("count")
-local elapsed = t1 - t0
-print(string.format("%-44s %7.2fms total %10.0f ops/s %+8.1fKB",
-name, elapsed * 1000, iters / elapsed,
-mem_after - mem_before))
+
+table.sort(ops)
+local median = ops[math.ceil(ROUNDS / 2)]
+local lo, hi = ops[1], ops[ROUNDS]
+local sum = 0
+for i = 1, ROUNDS do sum = sum + ops[i] end
+local mean = sum / ROUNDS
+
+print(string.format(
+"%-44s median %9.0f ops/s mean %9.0f range %7.0f..%-9.0f %+8.1fKB",
+name, median, mean, lo, hi, mem_after - mem_before))
end

local scenarios = {
@@ -87,6 +106,11 @@ local scenarios = {
{name = "10m", iters = 20, payload = make_payload(10 * 1024 * 1024)},
}

-- The pooled API (qd.new_decoder + :parse) only exists on commits that
-- landed the Decoder refactor. Probe so the bench still runs on older builds.
local has_pooled_api = type(qd.new_decoder) == "function"
local pooled_decoder = has_pooled_api and qd.new_decoder() or nil

for _, s in ipairs(scenarios) do
print(string.format("=== %s (%d bytes) ===", s.name, #s.payload))

@@ -103,4 +127,84 @@ for _, s in ipairs(scenarios) do
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end)

if has_pooled_api then
bench("quickdecode pooled :parse + access 3 fields", s.iters, function()
local d = pooled_decoder:parse(s.payload)
local _ = d:get_str("model")
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end)

-- One-shot-per-request pattern: each iter creates a fresh decoder,
-- parses once, and lets both decoder and doc fall to GC. No reuse.
-- This is the typical "user does not cache the decoder" path.
bench("quickdecode new_decoder()+parse (one-shot)", s.iters, function()
local dec = qd.new_decoder()
local d = dec:parse(s.payload)
local _ = d:get_str("model")
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end)
end
end

-- Interleaved scenario: cycle through several payloads of different sizes
-- back-to-back, mirroring a server processing variable-size requests. The
-- single-payload loops above hand the allocator the same block over and over
-- and have no allocation to amortize away — they cannot exercise the doc
-- pool. This scenario can.
local function scenario_by_name(n)
for _, s in ipairs(scenarios) do
if s.name == n then return s end
end
error("no scenario " .. n)
end

local interleaved_names = {"100k", "200k", "500k", "1m"}
local interleaved = {}
for _, n in ipairs(interleaved_names) do
interleaved[#interleaved + 1] = scenario_by_name(n).payload
end

local function make_cycler(items)
local i = 0
local n = #items
return function()
i = i + 1
return items[((i - 1) % n) + 1]
end
end

print(string.format("=== interleaved %s ===", table.concat(interleaved_names, ",")))

do
local next_p = make_cycler(interleaved)
bench("cjson.decode + access 3 fields", 400, function()
local p = next_p()
local obj = cjson.decode(p)
local _ = obj.model
local _ = obj.temperature
local _ = obj.messages and obj.messages[1] and obj.messages[1].role
end)

next_p = make_cycler(interleaved)
bench("quickdecode.parse + access 3 fields", 400, function()
local p = next_p()
local d = qd.parse(p)
local _ = d:get_str("model")
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end)

if has_pooled_api then
next_p = make_cycler(interleaved)
bench("quickdecode pooled :parse + access 3 fields", 400, function()
local p = next_p()
local d = pooled_decoder:parse(p)
local _ = d:get_str("model")
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end)
end
end
60 changes: 60 additions & 0 deletions benches/perf_probe.lua
@@ -0,0 +1,60 @@
-- Minimal probe for perf: hammers qd.parse on a fixed 100K payload so perf
-- samples concentrate on the FFI entry + parse hot path. Not a benchmark —
-- there is no timing or memory accounting here, just sustained work.

package.path = package.path .. ";./lua/?.lua"
package.cpath = package.cpath .. ";./target/release/lib?.so"

local qd = require("quickdecode")

-- Same payload generator as lua_bench.lua so probe output corresponds to
-- the same shape the bench measures. Park-Miller LCG keeps it deterministic.
local function make_payload(target_bytes)
local rng_state = 42
local function rng_range(lo, hi)
rng_state = (rng_state * 48271) % 2147483647
return lo + (rng_state % (hi - lo + 1))
end

local text = string.rep("Q", 1500)
local text_part = '{"type":"text","text":"' .. text .. '"}'
local parts = { text_part }
local current = 200 + #text_part

while current < target_bytes do
local remaining = target_bytes - current
local img_size
if remaining < 50 * 1024 then
img_size = math.max(1024, remaining)
else
local upper = math.min(500 * 1024, remaining)
img_size = rng_range(50 * 1024, upper)
end
local b64 = string.rep("A", img_size)
local img_part = '{"type":"image_url","image_url":{"url":"data:image/jpeg;base64,'
.. b64 .. '"}}'
parts[#parts + 1] = img_part
current = current + #img_part + 1
end

return '{"model":"gpt-4-vision","temperature":0.7,"messages":'
.. '[{"role":"user","content":[' .. table.concat(parts, ",") .. ']}]}'
end

local payload = make_payload(100 * 1024)
local iters = tonumber(arg[1]) or 500000

-- Warmup so JIT traces compile before perf starts sampling steady state.
for _ = 1, 1000 do
local d = qd.parse(payload)
local _ = d:get_str("model")
end

io.stderr:write(string.format("probe: %d bytes payload, %d iters\n", #payload, iters))

for _ = 1, iters do
local d = qd.parse(payload)
local _ = d:get_str("model")
local _ = d:get_f64("temperature")
local _ = d:get_str("messages[0].role")
end