Revert "perf: decoder/document instance pooling (#11)" by membphis · Pull Request #14 · api7/lua-qjson

membphis · 2026-05-15T23:59:32Z

Summary

Revert commit 0721d7d (PR perf: decoder/document instance pooling (#6) #11) which refactored the one-shot qjd_parse path through a separately-allocated qjd_decoder + state machine + generation counter.
Keep the bench-harness improvements from the abandoned fold attempt (perf(ffi): fold legacy qjd_parse to single allocation; preserve pooled API #13): warmup + 5-round median, interleaved-payload scenario, and a probe for the pooled API that turns inert when the API is absent.

Why

The pool API in #11 was designed to amortize per-parse allocation cost across many parses on a reused decoder. In our actual deployment — an API gateway that allocates a fresh decoder per request and lets GC reclaim everything — the pool advantage is zero:

The decoder lifetime equals one request, so there is no Vec capacity reuse, no scratch buffer reuse, no skip-cache reuse.
The doc indirection (qjd_doc handle pointing at a separately-Box'd qjd_decoder) and per-get_* check_doc_alive (state + generation check) become pure overhead.

Bench results across the design space made this concrete:

path on this branch	role	perf vs scan-only
`qd.parse()` (legacy)	what callers use today	restored
(pool API removed)	reused decoder	n/a
(pool API removed)	fresh decoder per request	n/a

#13 tried to fix the legacy regression by folding doc + decoder into one allocation and fast-pathing check_doc_alive. Three-run median-of-medians measurements (in #13 comments) showed the fold still trailed scan-only by 14–27% on 100 KB–5 MB payloads — the bulk of API gateway traffic. The residual cost was not worth the optionality of a pool API that production cannot use anyway.

Bench (3-run median-of-medians)

payload	scan-only `7a895e5`	this branch `qd.parse`	Δ
small 2 KB	127,877	136,370	+6.6%
medium 60 KB	123,122	113,122	−8.1%
100 KB	90,253	104,167	+15.4%
200 KB	66,050	67,024	+1.5%
500 KB	29,718	41,929	+41.1% *
1 MB	19,430	19,659	+1.2%
2 MB	8,489	8,407	−1.0%
5 MB	2,814	2,543	−9.6%
10 MB	491	517	+5.3%
interleaved 100k–1m	27,172	25,760	−5.2%

All deltas fall within the run-to-run noise band we saw on this machine (most rows ±10%; the 500K row's apparent +41% sits inside the underlying range overlap and is system noise, not a real win). Net read: the legacy path is fully restored to its scan-only baseline.

(*) Bench harness preserved from the fold-attempt branch — warmup + 5-round median plus an interleaved-size scenario — so future runs are easier to read.

What's reverted

src/decoder.rs (deleted) → src/doc.rs (restored)
qjd_decoder_new / _free / _reset / _destroy / _parse exports gone
QJD_STALE_DOC error code gone
Generation counter, state machine, and check_doc_alive gone
count-allocs feature + tests/alloc_count.rs + tests/decoder_ffi.rs + tests/lua/decoder_spec.lua gone
CLAUDE.md / README pool-API section gone
Spec doc docs/superpowers/specs/2026-05-15-decoder-pooling-design.md gone

What's kept (cherry-picked from #13)

benches/lua_bench.lua: warmup pass + 5-round median + interleaved 100k–1m scenario. The qd.new_decoder probes degrade gracefully when the API is absent — they print nothing on this branch.
benches/perf_probe.lua: minimal hammer for perf record investigations.

Test plan

cargo test --release (108 tests, was 132 — the 24-test delta is exactly the pool-API surface that was reverted)
cargo test --release --no-default-features
cargo test --features test-panic --release
3-round bench vs 7a895e5, results above
make test (Lua busted) — busted not on this dev machine; CI will run

Context

Supersedes #13. See the closing comment on #13 and the long perf-investigation transcript that motivated this for the full reasoning.

This reverts commit 0721d7d.

The single-run-with-mean output the bench used to print swung 30-40% between invocations on noisy machines, making it hard to tell signal from noise when comparing perf commits. - bench() now runs a warmup pass (JIT trace compile, pool fill), then five timed rounds. Reports median and mean ops/s plus the round-by- round min..max range so reviewers can see whether a delta is real. - Add an `interleaved 100k,200k,500k,1m` scenario that rotates through four payload sizes, matching a server that handles varying request sizes back to back. The single-payload loops cannot exercise the doc pool the way real traffic does. - For each scenario, probe `qd.new_decoder` and run two extra qd variants when present: quickdecode pooled :parse — reused decoder across iters quickdecode new_decoder()+parse — one-shot per iter (no reuse) So a reader can directly compare the legacy qd.parse path, the pool-API-with-reuse path, and the realistic "user creates a fresh decoder per request" pattern in one bench run. Also ship benches/perf_probe.lua: a minimal hammer over qd.parse on a fixed payload for use under `perf record` when investigating FFI hot paths. Not invoked by Makefile targets.

membphis added 2 commits May 15, 2026 23:54

Revert "perf: decoder/document instance pooling (#6) (#11)"

75c65f6

This reverts commit 0721d7d.

membphis merged commit 67773c2 into main May 16, 2026
2 checks passed

membphis mentioned this pull request May 16, 2026

perf(scan): scanner micro-optimizations — shuffle structural detection + adaptive reserve + SmallVec #8

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert "perf: decoder/document instance pooling (#11)"#14

Revert "perf: decoder/document instance pooling (#11)"#14
membphis merged 2 commits into
mainfrom
revert/decoder-instance-pooling

membphis commented May 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

membphis commented May 15, 2026

Summary

Why

Bench (3-run median-of-medians)

What's reverted

What's kept (cherry-picked from #13)

Test plan

Context

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant