diff --git a/README.md b/README.md
index b836e85..c02f9cd 100644
--- a/README.md
+++ b/README.md
@@ -132,7 +132,7 @@ Items intentionally pushed out of the first implementation. Each will be picked
 - **Adaptive `out.reserve` in scanners** — `out.reserve(buf.len() / 6)` is calibrated for object-heavy JSON. On string-heavy multimodal payloads (one big content array, mostly base64) the actual emit rate is <1 structural per 1 KB, so we over-reserve by 100x+. Mainly a memory hygiene concern (mmap'd pages stay lazily faulted), <5% throughput effect.
 - **AVX-512 scanner backend** — 64-byte → 128-byte chunks. On the 1 MB string-heavy bench, profile shows scan throughput is L3-bandwidth-bound, so realistic win is ~1.5–1.8×, not a clean 2×; larger wins need fixtures that fit in L1/L2. Needs `avx512bw` + `vpclmulqdq` (Sapphire Rapids, Zen 4+).
 - **`cargo fmt --check` not enforced** — `make lint` runs clippy only. The codebase uses intentional manual column alignment in struct definitions and compact single-line literals that default rustfmt would reflow. Skip rather than reformat until a project-wide style decision is made.
-- **`validate_brackets` fusion in SIMD scanners** — fused into `ScalarScanner` via `scan_and_validate`; AVX2 and NEON scanners still run the two-pass emit + `validate_brackets` design. Folding bracket pairing into the SIMD emit loops would require carrying a depth stack across chunks (the inline `emit_bits` loop currently has no such state). <1% effect on string-heavy workloads; worth revisiting only if profiling on structurally-dense input flags it.
+- **`validate_brackets` fusion in SIMD scanners** — fused into `ScalarScanner` via `scan_and_validate`; AVX2 and NEON scanners still run the two-pass emit + `validate_brackets` design. A working implementation was prototyped in [#18](https://github.com/membphis/lua-quick-decode/pull/18) (closed): `emit_bits_validate` carries a depth stack inline and dispatches on `buf[pos]` per emitted bit, eliminating the second pass over `indices`. Measured ±2% (within noise) on the multimodal bench because the per-emit `buf[pos]` lookup adds back roughly what the eliminated pass saved, and the structural-char density is too low for the savings to dominate. Revisit only when a structurally-dense fixture (config / JSONL / object-shape JSON with hundreds of keys per chunk) is added to the bench harness and profiles flag `validate_brackets` as the bottleneck.
 - **`memchr2` cross-chunk jump for very long string interiors** — the AVX2 in-string fast probe (issue #5) drops per-chunk cost from ~25 to ~10 ops but still pays ALU work for every 64-byte chunk in a string. A `memchr2(b'"', b'\\')` jump can approach memory bandwidth on multi-MB single-string payloads. Deferred until a workload that benefits clearly emerges; needs careful `bs_carry` reasoning across the jump.
 - **Stateful O(N) iterator FFI** — current `qd.pairs` and the `__newindex` materialization path walk the object cursor from the start on every step,