Skip to content

fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324)#538

Merged
ohdearquant merged 2 commits into
mainfrom
fix/324-detok-retention
Jul 2, 2026
Merged

fix(inference): bound IncrementalDetokenizer retention to the decode-boundary tail window (#324)#538
ohdearquant merged 2 commits into
mainfrom
fix/324-detok-retention

Conversation

@ohdearquant

Copy link
Copy Markdown
Owner

PR description authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Summary

IncrementalDetokenizer previously retained every decoded byte for the entire lifetime of a generation (a growing bytes: Vec<u8> plus a flushed cursor) so that text() could reconstruct the whole output on demand. For long generations this grows memory linearly with token count, even though the only state a streaming detokenizer needs is the small tail around the current UTF-8 decode boundary.

This PR bounds that retention to a tiny tail window:

  • The struct now holds a single pending: Vec<u8> containing only the undecided UTF-8 boundary bytes (the incomplete trailing scalar, at most 3 bytes between pushes).
  • After every push/finish, consumed bytes are drained and the buffer capacity is compacted back down to DETOK_RETAINED_BYTE_CAPACITY (8 bytes). A Vec does not shrink capacity on drain/clear alone, so the explicit compaction is what actually caps the allocation.
  • Whole-generation text now accumulates in the generation call sites that already own GenerateOutput.text (the CPU no-stop streaming path and both Metal streaming paths). String-stop paths already accumulated their own full text and are unchanged.
  • text() is removed (not stubbed); it has no remaining callers.

Fail-closed boundary rule (preserved exactly)

Byte-level BPE can split a single codepoint across tokens, so flushing only complete UTF-8 per token is a hard invariant. The boundary logic is unchanged in behavior:

  • Ok(s) → emit all, consume all.
  • Err with valid_up_to() > 0 → emit only the proven-valid prefix, advance by exactly that many bytes.
  • error_len() == None (incomplete trailing codepoint) → retain every remaining byte, emit nothing — a later token may complete it. This is the fail-closed branch: when in doubt at a boundary, retain.
  • error_len() == Some(len) → emit one U+FFFD and advance by len (only Rust-classified unrepairable bytes are ever dropped — never bytes that could still complete a codepoint).
  • finish() is the sole lossy flush, at generation end.

Tests

  • Bounded-memory regression (incremental_detokenizer_retention_bounded_for_long_ascii_stream): asserts retained capacity stays <= 8 across an 8,192-token stream — bound is independent of generation length.
  • UTF-8-boundary parity (incremental_detokenizer_utf8_parity_matches_full_decode_for_split_stream): a CJK / emoji / combining-mark / ZWJ token stream streams byte-identical to the accumulate-everything decode_tokens reference.
  • Both tests are mutation-verified as load-bearing: removing the retention bound makes the memory test fail; an over-eager drop of the incomplete tail makes the parity test fail.
  • Full workspace suite green; clippy --workspace -D warnings and fmt --check clean.

Bench comparison (CPU-only A/B)

make bench-compare (origin/mainHEAD, quick mode). The diff touches only decode bookkeeping (Vec<u8>/String) in the lattice-inference crate, so elementwise_cpu_bench is the nearest CPU group with production relevance (there is no detokenizer-specific Criterion bench in the repo). Base = origin/main, Head = this branch.

elementwise_cpu_bench (lattice-inference) — before/after

Benchmark Base (median) Head (median) Change (CI) p-value Verdict
rms_norm/896 130.83 ns 135.06 ns +2.81% .. +4.95% 0.08 noise
rms_norm/4096 558.69 ns 565.97 ns +1.10% .. +2.11% 0.08 noise
layer_norm/896 206.74 ns 204.02 ns −1.67% .. −0.02% 0.15 noise
layer_norm/4096 899.04 ns 892.94 ns −1.15% .. +0.08% 0.27 noise
silu_inplace/896 451.18 ns 455.61 ns −0.76% .. +1.42% 0.77 noise
silu_inplace/4096 2.0544 µs 2.0516 µs −0.55% .. −0.03% 0.10 noise
gelu/896 369.22 ns 366.72 ns −1.13% .. −0.56% 0.05 noise (tiny improvement)
gelu/4096 1.6716 µs 1.6799 µs −0.70% .. +0.80% 0.89 noise
add_bias_gelu/896 386.91 ns 390.68 ns +0.24% .. +2.16% 0.15 noise
add_bias_gelu/4096 1.7661 µs 1.7784 µs −0.45% .. +2.22% 0.53 noise
softmax_attention/128 3.5945 µs 3.6582 µs +1.48% .. +2.94% 0.10 noise
softmax_attention/512 62.496 µs 63.684 µs +1.43% .. +2.67% 0.08 noise
elementwise_mul/4096 279.18 ns 280.67 ns +0.21% .. +0.70% 0.12 noise

All 13 benchmarks are statistically indistinguishable from base (every p-value ≥ 0.05). Expected: none of these kernels are on the detokenizer code path; the change is token-to-text bookkeeping only. No measurable regression. No Metal/GPU benches were run (this change is CPU-only by nature).

Reviewer note (behavior change on caller-interrupt only)

On a caller interrupt (which returns/breaks without calling finish()), the aggregate GenerateOutput.text no longer appends a trailing U+FFFD for a half-received codepoint present at the interrupt boundary — that partial byte is simply omitted. The streamed on_token deltas are byte-identical in both versions, and all non-interrupt paths (stop / kv-full / max-tokens / natural-end) call finish() and remain byte-identical to the previous behavior. This is an intentional consequence of moving accumulation to the callers, and is arguably more correct (a partial output shouldn't carry a bogus replacement char for a codepoint the caller chose to interrupt mid-way). Flagged here purely for transparency.

Closes #324

…boundary tail window (#324)

IncrementalDetokenizer previously kept every decoded byte for the
lifetime of a generation (bytes + flushed cursor) so text() could
reconstruct the whole output, growing memory linearly with token
count. Replace that with a tail-only `pending: Vec<u8>` holding just
the undecided UTF-8 boundary bytes, capped and compacted back to
DETOK_RETAINED_BYTE_CAPACITY (8 bytes) after every push/finish.

Full-generation text now accumulates in the generation call sites
that already own GenerateOutput.text (CPU no-stop streaming path,
both Metal streaming paths); the string-stop paths already
accumulated their own full text via StopStreamer/`full` and are
unchanged.

Fail-closed boundary rule preserved exactly: an incomplete trailing
UTF-8 sequence (error_len() == None) is always retained, never
dropped or replaced, until a later token completes it or finish()
lossily flushes it at generation end.

Adds a regression test proving retained capacity stays bounded across
an 8,192-token stream, and a UTF-8-boundary property test proving a
CJK/emoji/combining-mark/ZWJ token stream streams byte-identical to
decode_tokens, the accumulate-everything reference.

Closes #324

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

E2E Parity Report

FAIL: 1/4 prompts diverged within their match windows

Prompt Window Agreement First Diff HF tok/s Lattice tok/s Verdict
The capital of France is 3 4/4 none 0.2 7.3 PASS
In the year 2024, artificial intelligence 3 4/4 none 0.1 2.4 PASS
`def fibonacci(n):
if n <= 1:
    return n
return` | 3 | 4/4 | none | 0.1 | 2.6 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 0/4 | pos 0 | 0.1 | 0.6 | FAIL |

The capital of France is

  • HF: Paris.
    The
  • Lattice: Paris.
    The

In the year 2024, artificial intelligence

  • HF: (AI) has
  • Lattice: (AI) has

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1
  • Lattice: fibonacci(n-1

def merge_sort(arr): """ Merge sort implementation.

  • HF: merge_sort(arr):
  • Lattice: main():

#324)

Two review findings on the detokenizer-retention PR:

1. (major) The non-prefix Metal streaming decode loop in
   forward/metal_qwen35.rs called on_token(&delta, next_id) without first
   appending delta to the caller-owned `text` accumulator, so
   GenerateOutput.text silently dropped every decode-loop token on that
   path (only the prefill token and the trailing finish() tail were ever
   appended). Fixed by mirroring the prefix-cache streaming path in the
   same file: append the delta to `text` unconditionally, then check the
   on_token interrupt return. This preserves the documented interrupt
   semantics (the interrupting delta itself is retained; the trailing
   unfinished tail is intentionally omitted via the stopped_by_caller
   guard around finish()). Added a regression test that streams several
   tokens through the non-prefix path and asserts GenerateOutput.text
   equals the concatenation of every on_token delta. Grepped the file for
   other detok.push/detok.finish call sites; the prefill single-token
   block and the prefix-cache path already append correctly, and the two
   finish() sites already append their tail.

2. (minor) The existing retention-bound test in
   model/qwen35/detokenize.rs only pushed 1-byte ASCII tokens, so the
   pending buffer's Vec allocation never grew past its initial capacity
   and the test would still pass with compact_pending_capacity() deleted.
   Added a companion test that pushes a single token whose decoded bytes
   (12) exceed the retained capacity bound (8) in one push, forcing an
   allocation past the bound, then asserts capacity compacts back down.
   Mutation-verified: commenting out the compact_pending_capacity() call
   makes the new test fail with capacity=16; restoring the call makes it
   pass again.
@ohdearquant

Copy link
Copy Markdown
Owner Author

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

Review round: two findings fixed in f419f90.

  1. (major) The non-prefix Metal streaming decode loop passed each delta to on_token without appending it to the caller-owned text accumulator, so GenerateOutput.text on that path dropped every decode-loop token (only the prefill token and the finish() tail were appended). Fixed to match the prefix-cache path: append unconditionally, then check the interrupt return. The documented interrupt semantics are unchanged (interrupting delta retained, trailing unfinished tail omitted). A Metal-gated regression test now asserts out.text equals the concatenation of every on_token delta; it compiles under clippy --features metal-gpu and will be executed before merge (GPU currently reserved by a long-running measurement on this machine).

  2. (minor) The retention-bound test used only 1-byte ASCII tokens, so it could not detect removal of the capacity compaction. Added a companion test pushing a 12-byte single-token delta (exceeding the 8-byte retained-capacity bound), mutation-verified: with the compaction call commented out it fails with capacity 16; restored, all 8 detok tests pass.

Full CPU suite green (1358 passed), fmt/clippy clean including --features metal-gpu --all-targets.

@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

E2E Parity Report

PASS: all 4 prompts match within their respective match windows

Prompt Window Agreement First Diff HF tok/s Lattice tok/s Verdict
The capital of France is 3 15/15 none 0.2 1.0 PASS
In the year 2024, artificial intelligence 3 15/15 none 0.2 1.1 PASS
`def fibonacci(n):
if n <= 1:
    return n
return` | 3 | 15/15 | none | 0.2 | 0.9 | PASS |

| def merge_sort(arr): """ Merge sort implementation. | 2 | 15/15 | none | 0.2 | 0.1 | PASS |

The capital of France is

  • HF: Paris.
    The capital of France is Paris.
    The capital of France
  • Lattice: Paris.
    The capital of France is Paris.
    The capital of France

In the year 2024, artificial intelligence

  • HF: (AI) has become a significant part of the global economy. It is
  • Lattice: (AI) has become a significant part of the global economy. It is

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

  • Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

def merge_sort(arr): """ Merge sort implementation.

  • HF: merge_sort(arr):
    """
    Merge sort implementation.

  • Lattice: merge_sort(arr):
    """
    Merge sort implementation.

@ohdearquant

Copy link
Copy Markdown
Owner Author

Comment authored by Claude (Anthropic agent) on behalf of @ohdearquant.

make bench-compare (quick), base origin/main @ 9997e54bb vs head f419f90c5:

  • Inference-side groups (elementwise_cpu_bench: add_bias_gelu, gelu, elementwise_mul — the crate this PR touches): all within noise, no change detected.
  • Embed SIMD groups initially showed a uniform positive drift across every group (worst point estimate int4_cosine_distance/float32_simd/384 +18.6%). This PR touches only crates/inference (3 files, detokenizer/streaming), no lattice-embed source, and a uniform same-sign shift across 200+ unrelated microbenches is the machine-state signature of the two-worktree A/B, not a regression.
  • Discriminating rerun of the worst group solo on an idle machine against the same saved baseline: every bench reports "No change in performance detected" — the +18.6% outlier collapsed to +1.13% (p = 0.10 > 0.05) and the next-worst (+6.4%) flipped sign to −6.5% (p = 0.06 > 0.05).

Conclusion: bench-compare showed no change (p > 0.05 on all groups under controlled rerun).

@ohdearquant ohdearquant merged commit d37c685 into main Jul 2, 2026
17 of 18 checks passed
@ohdearquant ohdearquant deleted the fix/324-detok-retention branch July 2, 2026 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant