fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324) by ohdearquant · Pull Request #538 · ohdearquant/lattice

ohdearquant · 2026-07-02T04:33:56Z

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Summary

IncrementalDetokenizer previously retained every decoded byte for the entire lifetime of a generation (a growing bytes: Vec<u8> plus a flushed cursor) so that text() could reconstruct the whole output on demand. For long generations this grows memory linearly with token count, even though the only state a streaming detokenizer needs is the small tail around the current UTF-8 decode boundary.

This PR bounds that retention to a tiny tail window:

The struct now holds a single pending: Vec<u8> containing only the undecided UTF-8 boundary bytes (the incomplete trailing scalar, at most 3 bytes between pushes).
After every push/finish, consumed bytes are drained and the buffer capacity is compacted back down to DETOK_RETAINED_BYTE_CAPACITY (8 bytes). A Vec does not shrink capacity on drain/clear alone, so the explicit compaction is what actually caps the allocation.
Whole-generation text now accumulates in the generation call sites that already own GenerateOutput.text (the CPU no-stop streaming path and both Metal streaming paths). String-stop paths already accumulated their own full text and are unchanged.
text() is removed (not stubbed); it has no remaining callers.

Fail-closed boundary rule (preserved exactly)

Byte-level BPE can split a single codepoint across tokens, so flushing only complete UTF-8 per token is a hard invariant. The boundary logic is unchanged in behavior:

Ok(s) → emit all, consume all.
Err with valid_up_to() > 0 → emit only the proven-valid prefix, advance by exactly that many bytes.
error_len() == None (incomplete trailing codepoint) → retain every remaining byte, emit nothing — a later token may complete it. This is the fail-closed branch: when in doubt at a boundary, retain.
error_len() == Some(len) → emit one U+FFFD and advance by len (only Rust-classified unrepairable bytes are ever dropped — never bytes that could still complete a codepoint).
finish() is the sole lossy flush, at generation end.

Tests

Bounded-memory regression (incremental_detokenizer_retention_bounded_for_long_ascii_stream): asserts retained capacity stays <= 8 across an 8,192-token stream — bound is independent of generation length.
UTF-8-boundary parity (incremental_detokenizer_utf8_parity_matches_full_decode_for_split_stream): a CJK / emoji / combining-mark / ZWJ token stream streams byte-identical to the accumulate-everything decode_tokens reference.
Both tests are mutation-verified as load-bearing: removing the retention bound makes the memory test fail; an over-eager drop of the incomplete tail makes the parity test fail.
Full workspace suite green; clippy --workspace -D warnings and fmt --check clean.

Bench comparison (CPU-only A/B)

make bench-compare (origin/main → HEAD, quick mode). The diff touches only decode bookkeeping (Vec<u8>/String) in the lattice-inference crate, so elementwise_cpu_bench is the nearest CPU group with production relevance (there is no detokenizer-specific Criterion bench in the repo). Base = origin/main, Head = this branch.

`elementwise_cpu_bench` (lattice-inference) — before/after

Benchmark	Base (median)	Head (median)	Change (CI)	p-value	Verdict
rms_norm/896	130.83 ns	135.06 ns	+2.81% .. +4.95%	0.08	noise
rms_norm/4096	558.69 ns	565.97 ns	+1.10% .. +2.11%	0.08	noise
layer_norm/896	206.74 ns	204.02 ns	−1.67% .. −0.02%	0.15	noise
layer_norm/4096	899.04 ns	892.94 ns	−1.15% .. +0.08%	0.27	noise
silu_inplace/896	451.18 ns	455.61 ns	−0.76% .. +1.42%	0.77	noise
silu_inplace/4096	2.0544 µs	2.0516 µs	−0.55% .. −0.03%	0.10	noise
gelu/896	369.22 ns	366.72 ns	−1.13% .. −0.56%	0.05	noise (tiny improvement)
gelu/4096	1.6716 µs	1.6799 µs	−0.70% .. +0.80%	0.89	noise
add_bias_gelu/896	386.91 ns	390.68 ns	+0.24% .. +2.16%	0.15	noise
add_bias_gelu/4096	1.7661 µs	1.7784 µs	−0.45% .. +2.22%	0.53	noise
softmax_attention/128	3.5945 µs	3.6582 µs	+1.48% .. +2.94%	0.10	noise
softmax_attention/512	62.496 µs	63.684 µs	+1.43% .. +2.67%	0.08	noise
elementwise_mul/4096	279.18 ns	280.67 ns	+0.21% .. +0.70%	0.12	noise

All 13 benchmarks are statistically indistinguishable from base (every p-value ≥ 0.05). Expected: none of these kernels are on the detokenizer code path; the change is token-to-text bookkeeping only. No measurable regression. No Metal/GPU benches were run (this change is CPU-only by nature).

Reviewer note (behavior change on caller-interrupt only)

On a caller interrupt (which returns/breaks without calling finish()), the aggregate GenerateOutput.text no longer appends a trailing U+FFFD for a half-received codepoint present at the interrupt boundary — that partial byte is simply omitted. The streamed on_token deltas are byte-identical in both versions, and all non-interrupt paths (stop / kv-full / max-tokens / natural-end) call finish() and remain byte-identical to the previous behavior. This is an intentional consequence of moving accumulation to the callers, and is arguably more correct (a partial output shouldn't carry a bogus replacement char for a codepoint the caller chose to interrupt mid-way). Flagged here purely for transparency.

Closes #324

…boundary tail window (#324) IncrementalDetokenizer previously kept every decoded byte for the lifetime of a generation (bytes + flushed cursor) so text() could reconstruct the whole output, growing memory linearly with token count. Replace that with a tail-only `pending: Vec<u8>` holding just the undecided UTF-8 boundary bytes, capped and compacted back to DETOK_RETAINED_BYTE_CAPACITY (8 bytes) after every push/finish. Full-generation text now accumulates in the generation call sites that already own GenerateOutput.text (CPU no-stop streaming path, both Metal streaming paths); the string-stop paths already accumulated their own full text via StopStreamer/`full` and are unchanged. Fail-closed boundary rule preserved exactly: an incomplete trailing UTF-8 sequence (error_len() == None) is always retained, never dropped or replaced, until a later token completes it or finish() lossily flushes it at generation end. Adds a regression test proving retained capacity stays bounded across an 8,192-token stream, and a UTF-8-boundary property test proving a CJK/emoji/combining-mark/ZWJ token stream streams byte-identical to decode_tokens, the accumulate-everything reference. Closes #324 Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

github-actions · 2026-07-02T04:42:51Z

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

Prompt	Window	Agreement	First Diff	HF tok/s	Lattice tok/s	Verdict
`The capital of France is`	3	4/4	none	0.2	7.3	PASS
`In the year 2024, artificial intelligence`	3	4/4	none	0.1	2.4	PASS
`def fibonacci(n):

if n <= 1:
    return n
return` | 3 | 4/4 | none | 0.1 | 2.6 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 0/4 | pos 0 | 0.1 | 0.6 | FAIL |

The capital of France is

HF: Paris.
The
Lattice: Paris.
The

In the year 2024, artificial intelligence

HF: (AI) has
Lattice: (AI) has

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1
Lattice: fibonacci(n-1

def merge_sort(arr): """ Merge sort implementation.

HF: merge_sort(arr):
Lattice: main():

#324) Two review findings on the detokenizer-retention PR: 1. (major) The non-prefix Metal streaming decode loop in forward/metal_qwen35.rs called on_token(&delta, next_id) without first appending delta to the caller-owned `text` accumulator, so GenerateOutput.text silently dropped every decode-loop token on that path (only the prefill token and the trailing finish() tail were ever appended). Fixed by mirroring the prefix-cache streaming path in the same file: append the delta to `text` unconditionally, then check the on_token interrupt return. This preserves the documented interrupt semantics (the interrupting delta itself is retained; the trailing unfinished tail is intentionally omitted via the stopped_by_caller guard around finish()). Added a regression test that streams several tokens through the non-prefix path and asserts GenerateOutput.text equals the concatenation of every on_token delta. Grepped the file for other detok.push/detok.finish call sites; the prefill single-token block and the prefix-cache path already append correctly, and the two finish() sites already append their tail. 2. (minor) The existing retention-bound test in model/qwen35/detokenize.rs only pushed 1-byte ASCII tokens, so the pending buffer's Vec allocation never grew past its initial capacity and the test would still pass with compact_pending_capacity() deleted. Added a companion test that pushes a single token whose decoded bytes (12) exceed the retained capacity bound (8) in one push, forcing an allocation past the bound, then asserts capacity compacts back down. Mutation-verified: commenting out the compact_pending_capacity() call makes the new test fail with capacity=16; restoring the call makes it pass again.

ohdearquant · 2026-07-02T04:46:42Z

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Review round: two findings fixed in f419f90.

(major) The non-prefix Metal streaming decode loop passed each delta to on_token without appending it to the caller-owned text accumulator, so GenerateOutput.text on that path dropped every decode-loop token (only the prefill token and the finish() tail were appended). Fixed to match the prefix-cache path: append unconditionally, then check the interrupt return. The documented interrupt semantics are unchanged (interrupting delta retained, trailing unfinished tail omitted). A Metal-gated regression test now asserts out.text equals the concatenation of every on_token delta; it compiles under clippy --features metal-gpu and will be executed before merge (GPU currently reserved by a long-running measurement on this machine).
(minor) The retention-bound test used only 1-byte ASCII tokens, so it could not detect removal of the capacity compaction. Added a companion test pushing a 12-byte single-token delta (exceeding the 8-byte retained-capacity bound), mutation-verified: with the compaction call commented out it fails with capacity 16; restored, all 8 detok tests pass.

Full CPU suite green (1358 passed), fmt/clippy clean including --features metal-gpu --all-targets.

github-actions · 2026-07-02T04:49:22Z

E2E Parity Report

PASS: all 4 prompts match within their respective match windows

Prompt	Window	Agreement	First Diff	HF tok/s	Lattice tok/s	Verdict
`The capital of France is`	3	15/15	none	0.2	1.0	PASS
`In the year 2024, artificial intelligence`	3	15/15	none	0.2	1.1	PASS
`def fibonacci(n):

if n <= 1:
    return n
return` | 3 | 15/15 | none | 0.2 | 0.9 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 15/15 | none | 0.2 | 0.1 | PASS |

The capital of France is

HF: Paris.
The capital of France is Paris.
The capital of France
Lattice: Paris.
The capital of France is Paris.
The capital of France

In the year 2024, artificial intelligence

HF: (AI) has become a significant part of the global economy. It is
Lattice: (AI) has become a significant part of the global economy. It is

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

def merge_sort(arr): """ Merge sort implementation.

HF: merge_sort(arr):
"""
Merge sort implementation.
Lattice: merge_sort(arr):
"""
Merge sort implementation.

ohdearquant · 2026-07-02T09:22:47Z

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

make bench-compare (quick), base origin/main @ 9997e54bb vs head f419f90c5:

Inference-side groups (elementwise_cpu_bench: add_bias_gelu, gelu, elementwise_mul — the crate this PR touches): all within noise, no change detected.
Embed SIMD groups initially showed a uniform positive drift across every group (worst point estimate int4_cosine_distance/float32_simd/384 +18.6%). This PR touches only crates/inference (3 files, detokenizer/streaming), no lattice-embed source, and a uniform same-sign shift across 200+ unrelated microbenches is the machine-state signature of the two-worktree A/B, not a regression.
Discriminating rerun of the worst group solo on an idle machine against the same saved baseline: every bench reports "No change in performance detected" — the +18.6% outlier collapsed to +1.13% (p = 0.10 > 0.05) and the next-worst (+6.4%) flipped sign to −6.5% (p = 0.06 > 0.05).

Conclusion: bench-compare showed no change (p > 0.05 on all groups under controlled rerun).

ohdearquant merged commit d37c685 into main Jul 2, 2026
17 of 18 checks passed

ohdearquant deleted the fix/324-detok-retention branch July 2, 2026 09:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324)#538

fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324)#538
ohdearquant merged 2 commits into
mainfrom
fix/324-detok-retention

ohdearquant commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ohdearquant commented Jul 2, 2026

Summary

Fail-closed boundary rule (preserved exactly)

Tests

Bench comparison (CPU-only A/B)

elementwise_cpu_bench (lattice-inference) — before/after

Reviewer note (behavior change on caller-interrupt only)

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Parity Report

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

E2E Parity Report

Uh oh!

ohdearquant commented Jul 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

`elementwise_cpu_bench` (lattice-inference) — before/after

github-actions Bot commented Jul 2, 2026 •

edited

Loading

github-actions Bot commented Jul 2, 2026 •

edited

Loading