Skip to content

Revert "perf: decoder/document instance pooling (#11)"#14

Merged
membphis merged 2 commits into
mainfrom
revert/decoder-instance-pooling
May 16, 2026
Merged

Revert "perf: decoder/document instance pooling (#11)"#14
membphis merged 2 commits into
mainfrom
revert/decoder-instance-pooling

Conversation

@membphis
Copy link
Copy Markdown
Collaborator

Summary

Why

The pool API in #11 was designed to amortize per-parse allocation cost across many parses on a reused decoder. In our actual deployment — an API gateway that allocates a fresh decoder per request and lets GC reclaim everything — the pool advantage is zero:

  • The decoder lifetime equals one request, so there is no Vec capacity reuse, no scratch buffer reuse, no skip-cache reuse.
  • The doc indirection (qjd_doc handle pointing at a separately-Box'd qjd_decoder) and per-get_* check_doc_alive (state + generation check) become pure overhead.

Bench results across the design space made this concrete:

path on this branch role perf vs scan-only
qd.parse() (legacy) what callers use today restored
(pool API removed) reused decoder n/a
(pool API removed) fresh decoder per request n/a

#13 tried to fix the legacy regression by folding doc + decoder into one allocation and fast-pathing check_doc_alive. Three-run median-of-medians measurements (in #13 comments) showed the fold still trailed scan-only by 14–27% on 100 KB–5 MB payloads — the bulk of API gateway traffic. The residual cost was not worth the optionality of a pool API that production cannot use anyway.

Bench (3-run median-of-medians)

payload scan-only 7a895e5 this branch qd.parse Δ
small 2 KB 127,877 136,370 +6.6%
medium 60 KB 123,122 113,122 −8.1%
100 KB 90,253 104,167 +15.4%
200 KB 66,050 67,024 +1.5%
500 KB 29,718 41,929 +41.1% *
1 MB 19,430 19,659 +1.2%
2 MB 8,489 8,407 −1.0%
5 MB 2,814 2,543 −9.6%
10 MB 491 517 +5.3%
interleaved 100k–1m 27,172 25,760 −5.2%

All deltas fall within the run-to-run noise band we saw on this machine (most rows ±10%; the 500K row's apparent +41% sits inside the underlying range overlap and is system noise, not a real win). Net read: the legacy path is fully restored to its scan-only baseline.

(*) Bench harness preserved from the fold-attempt branch — warmup + 5-round median plus an interleaved-size scenario — so future runs are easier to read.

What's reverted

  • src/decoder.rs (deleted) → src/doc.rs (restored)
  • qjd_decoder_new / _free / _reset / _destroy / _parse exports gone
  • QJD_STALE_DOC error code gone
  • Generation counter, state machine, and check_doc_alive gone
  • count-allocs feature + tests/alloc_count.rs + tests/decoder_ffi.rs + tests/lua/decoder_spec.lua gone
  • CLAUDE.md / README pool-API section gone
  • Spec doc docs/superpowers/specs/2026-05-15-decoder-pooling-design.md gone

What's kept (cherry-picked from #13)

  • benches/lua_bench.lua: warmup pass + 5-round median + interleaved 100k–1m scenario. The qd.new_decoder probes degrade gracefully when the API is absent — they print nothing on this branch.
  • benches/perf_probe.lua: minimal hammer for perf record investigations.

Test plan

  • cargo test --release (108 tests, was 132 — the 24-test delta is exactly the pool-API surface that was reverted)
  • cargo test --release --no-default-features
  • cargo test --features test-panic --release
  • 3-round bench vs 7a895e5, results above
  • make test (Lua busted) — busted not on this dev machine; CI will run

Context

Supersedes #13. See the closing comment on #13 and the long perf-investigation transcript that motivated this for the full reasoning.

membphis added 2 commits May 15, 2026 23:54
The single-run-with-mean output the bench used to print swung 30-40%
between invocations on noisy machines, making it hard to tell signal
from noise when comparing perf commits.

- bench() now runs a warmup pass (JIT trace compile, pool fill), then
  five timed rounds. Reports median and mean ops/s plus the round-by-
  round min..max range so reviewers can see whether a delta is real.
- Add an `interleaved 100k,200k,500k,1m` scenario that rotates through
  four payload sizes, matching a server that handles varying request
  sizes back to back. The single-payload loops cannot exercise the
  doc pool the way real traffic does.
- For each scenario, probe `qd.new_decoder` and run two extra qd
  variants when present:
    quickdecode pooled :parse           — reused decoder across iters
    quickdecode new_decoder()+parse     — one-shot per iter (no reuse)
  So a reader can directly compare the legacy qd.parse path, the
  pool-API-with-reuse path, and the realistic "user creates a fresh
  decoder per request" pattern in one bench run.

Also ship benches/perf_probe.lua: a minimal hammer over qd.parse on a
fixed payload for use under `perf record` when investigating FFI hot
paths. Not invoked by Makefile targets.
@membphis membphis merged commit 67773c2 into main May 16, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant