fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324)#538
Conversation
…boundary tail window (#324) IncrementalDetokenizer previously kept every decoded byte for the lifetime of a generation (bytes + flushed cursor) so text() could reconstruct the whole output, growing memory linearly with token count. Replace that with a tail-only `pending: Vec<u8>` holding just the undecided UTF-8 boundary bytes, capped and compacted back to DETOK_RETAINED_BYTE_CAPACITY (8 bytes) after every push/finish. Full-generation text now accumulates in the generation call sites that already own GenerateOutput.text (CPU no-stop streaming path, both Metal streaming paths); the string-stop paths already accumulated their own full text via StopStreamer/`full` and are unchanged. Fail-closed boundary rule preserved exactly: an incomplete trailing UTF-8 sequence (error_len() == None) is always retained, never dropped or replaced, until a later token completes it or finish() lossily flushes it at generation end. Adds a regression test proving retained capacity stays bounded across an 8,192-token stream, and a UTF-8-boundary property test proving a CJK/emoji/combining-mark/ZWJ token stream streams byte-identical to decode_tokens, the accumulate-everything reference. Closes #324 Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
E2E Parity ReportFAIL: 1/4 prompts diverged within their match windows
|
|
#324) Two review findings on the detokenizer-retention PR: 1. (major) The non-prefix Metal streaming decode loop in forward/metal_qwen35.rs called on_token(&delta, next_id) without first appending delta to the caller-owned `text` accumulator, so GenerateOutput.text silently dropped every decode-loop token on that path (only the prefill token and the trailing finish() tail were ever appended). Fixed by mirroring the prefix-cache streaming path in the same file: append the delta to `text` unconditionally, then check the on_token interrupt return. This preserves the documented interrupt semantics (the interrupting delta itself is retained; the trailing unfinished tail is intentionally omitted via the stopped_by_caller guard around finish()). Added a regression test that streams several tokens through the non-prefix path and asserts GenerateOutput.text equals the concatenation of every on_token delta. Grepped the file for other detok.push/detok.finish call sites; the prefill single-token block and the prefix-cache path already append correctly, and the two finish() sites already append their tail. 2. (minor) The existing retention-bound test in model/qwen35/detokenize.rs only pushed 1-byte ASCII tokens, so the pending buffer's Vec allocation never grew past its initial capacity and the test would still pass with compact_pending_capacity() deleted. Added a companion test that pushes a single token whose decoded bytes (12) exceed the retained capacity bound (8) in one push, forcing an allocation past the bound, then asserts capacity compacts back down. Mutation-verified: commenting out the compact_pending_capacity() call makes the new test fail with capacity=16; restoring the call makes it pass again.
Review round: two findings fixed in f419f90.
Full CPU suite green (1358 passed), fmt/clippy clean including |
E2E Parity ReportPASS: all 4 prompts match within their respective match windows
|
print(fib
print(fib
|
Conclusion: bench-compare showed no change (p > 0.05 on all groups under controlled rerun). |
Summary
IncrementalDetokenizerpreviously retained every decoded byte for the entire lifetime of a generation (a growingbytes: Vec<u8>plus aflushedcursor) so thattext()could reconstruct the whole output on demand. For long generations this grows memory linearly with token count, even though the only state a streaming detokenizer needs is the small tail around the current UTF-8 decode boundary.This PR bounds that retention to a tiny tail window:
pending: Vec<u8>containing only the undecided UTF-8 boundary bytes (the incomplete trailing scalar, at most 3 bytes between pushes).push/finish, consumed bytes aredrained and the buffer capacity is compacted back down toDETOK_RETAINED_BYTE_CAPACITY(8 bytes). AVecdoes not shrink capacity ondrain/clearalone, so the explicit compaction is what actually caps the allocation.GenerateOutput.text(the CPU no-stop streaming path and both Metal streaming paths). String-stop paths already accumulated their own full text and are unchanged.text()is removed (not stubbed); it has no remaining callers.Fail-closed boundary rule (preserved exactly)
Byte-level BPE can split a single codepoint across tokens, so flushing only complete UTF-8 per token is a hard invariant. The boundary logic is unchanged in behavior:
Ok(s)→ emit all, consume all.Errwithvalid_up_to() > 0→ emit only the proven-valid prefix, advance by exactly that many bytes.error_len() == None(incomplete trailing codepoint) → retain every remaining byte, emit nothing — a later token may complete it. This is the fail-closed branch: when in doubt at a boundary, retain.error_len() == Some(len)→ emit oneU+FFFDand advance bylen(only Rust-classified unrepairable bytes are ever dropped — never bytes that could still complete a codepoint).finish()is the sole lossy flush, at generation end.Tests
incremental_detokenizer_retention_bounded_for_long_ascii_stream): asserts retained capacity stays<= 8across an 8,192-token stream — bound is independent of generation length.incremental_detokenizer_utf8_parity_matches_full_decode_for_split_stream): a CJK / emoji / combining-mark / ZWJ token stream streams byte-identical to the accumulate-everythingdecode_tokensreference.clippy --workspace -D warningsandfmt --checkclean.Bench comparison (CPU-only A/B)
make bench-compare(origin/main→HEAD, quick mode). The diff touches only decode bookkeeping (Vec<u8>/String) in thelattice-inferencecrate, soelementwise_cpu_benchis the nearest CPU group with production relevance (there is no detokenizer-specific Criterion bench in the repo). Base =origin/main, Head = this branch.elementwise_cpu_bench(lattice-inference) — before/afterAll 13 benchmarks are statistically indistinguishable from base (every p-value ≥ 0.05). Expected: none of these kernels are on the detokenizer code path; the change is token-to-text bookkeeping only. No measurable regression. No Metal/GPU benches were run (this change is CPU-only by nature).
Reviewer note (behavior change on caller-interrupt only)
On a caller interrupt (which returns/breaks without calling
finish()), the aggregateGenerateOutput.textno longer appends a trailingU+FFFDfor a half-received codepoint present at the interrupt boundary — that partial byte is simply omitted. The streamedon_tokendeltas are byte-identical in both versions, and all non-interrupt paths (stop / kv-full / max-tokens / natural-end) callfinish()and remain byte-identical to the previous behavior. This is an intentional consequence of moving accumulation to the callers, and is arguably more correct (a partial output shouldn't carry a bogus replacement char for a codepoint the caller chose to interrupt mid-way). Flagged here purely for transparency.Closes #324