feat(inference): UTF-8-safe streaming detokenization + generate_streaming #196
Merged
…ming

Add generate_streaming() for token-by-token output via a delta callback, and make incremental detokenization UTF-8-boundary-safe: byte-level BPE can split a multi-byte codepoint across tokens, so only complete-UTF-8 deltas are flushed while partial bytes buffer until the next token completes them. Without this, CJK and emoji streams surface U+FFFD replacement characters. Consumed by the Lattice Studio Chat surface (chat_metal --json streaming).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
E2E Parity Report
PASS: all 3 prompts match within first 3 tokens
print(fib
print(fib |
- flush_complete: branch on Utf8Error::error_len() so a malformed byte emits U+FFFD and advances immediately instead of stalling the stream until finish(); add two invalid-byte regression tests
- generate_streaming: guard max_new_tokens == 0 (return before sampling, so no unrequested token is emitted)
- generate_streaming: replace .expect() on all_ids.last() with a let-else returning InferenceError (no expect/unwrap in library code)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
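The `flush_complete` behaviour described in this commit can be sketched roughly as below. This is a minimal illustration under assumptions, not the code in `detokenize.rs`: the type `Utf8StreamBuffer` and its `push`/`finish` methods are hypothetical names chosen for the example. Only `std::str::from_utf8`, `Utf8Error::valid_up_to()`, `Utf8Error::error_len()` and `String::from_utf8_lossy` are real standard-library calls.

```rust
use std::str;

/// Hypothetical incremental decoder (illustrative name, not the PR's API).
#[derive(Default)]
pub struct Utf8StreamBuffer {
    /// Bytes received so far that have not yet been emitted as text.
    pending: Vec<u8>,
}

impl Utf8StreamBuffer {
    /// Append one token's raw bytes and return the text that is safe to emit now.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        self.flush_complete()
    }

    /// Emit every complete codepoint; keep only an incomplete trailing sequence.
    fn flush_complete(&mut self) -> String {
        let mut out = String::new();
        let mut start = 0;
        loop {
            match str::from_utf8(&self.pending[start..]) {
                Ok(valid) => {
                    // Everything left is well-formed: emit it all.
                    out.push_str(valid);
                    start = self.pending.len();
                    break;
                }
                Err(err) => {
                    // The prefix up to valid_up_to() is guaranteed valid UTF-8,
                    // so the lossy conversion performs no replacement here.
                    let valid_end = start + err.valid_up_to();
                    out.push_str(&String::from_utf8_lossy(&self.pending[start..valid_end]));
                    match err.error_len() {
                        // Malformed bytes: emit U+FFFD and skip them immediately,
                        // so the stream does not stall until finish().
                        Some(bad_len) => {
                            out.push('\u{FFFD}');
                            start = valid_end + bad_len;
                        }
                        // Incomplete sequence at the end: wait for the next token.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
        out
    }

    /// End of stream: whatever is still buffered can no longer be completed.
    pub fn finish(&mut self) -> String {
        let out = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();
        out
    }
}
```

The key distinction is `error_len()`: `None` means "more bytes may complete this codepoint", while `Some(n)` means the next `n` bytes can never become valid, so buffering them would only delay output.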
codex_review_pr196.md is a local review note that leaked into the branch via `git add -A`; it is not product code. The PR diff should contain only the streaming-detokenization source. Codex round-2 confirmed the three code findings are resolved; this removes the only remaining REQUEST CHANGES item.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
UTF-8-safe incremental detokenization for byte-level BPE, plus `Qwen35Model::generate_streaming`, a token-at-a-time generation entry point with a per-token callback.

Why
Byte-level BPE splits multi-byte codepoints (CJK, emoji) across tokens. Naive per-token UTF-8 decode emits U+FFFD replacement characters mid-stream, and English-only smoke tests hide it.
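The failure mode is easy to reproduce in isolation. The sketch below uses two hypothetical helper functions (`decode_per_token` and `decode_joined`, names invented for this example) and models each token as a raw byte slice, which is an assumption about how a byte-level BPE vocabulary surfaces its pieces.

```rust
/// Naive approach: decode each token's bytes on its own.
/// A codepoint split across two tokens is invalid in both halves,
/// so each half is replaced with U+FFFD.
pub fn decode_per_token(tokens: &[&[u8]]) -> String {
    tokens
        .iter()
        .map(|t| String::from_utf8_lossy(t))
        .collect()
}

/// Reference: decode the concatenated bytes once, after all tokens arrived.
pub fn decode_joined(tokens: &[&[u8]]) -> String {
    String::from_utf8_lossy(&tokens.concat()).into_owned()
}
```

Splitting the four UTF-8 bytes of an emoji after the second byte and passing both halves to `decode_per_token` yields replacement characters, while `decode_joined` recovers the emoji. An ASCII-only prompt never triggers the difference, which is why English-only smoke tests miss it.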
`detokenize.rs` now buffers incomplete codepoints and flushes only complete UTF-8. `generate_streaming` gives the app a streaming surface, consumed by `chat_metal` (PR-2) and `generate_lora` (PR-2b).

Files
- `crates/inference/src/model/qwen35/detokenize.rs` (+143/-10)
- `crates/inference/src/model/qwen35/generation.rs` (+127/-1)

Verification
`cargo build -p lattice-inference` clean. The integrated tree (all 8 slices merged) builds all 9 app-shipped binaries green via `scripts/build-app-bins.sh`.

Bench
`make bench-compare` was run (origin/main vs integrated tree). Its two-worktree comparator errored while assembling the delta table, a known fragility. The base benches themselves ran clean: rms_norm, layer_norm, silu, gelu, softmax and simd_normalize all reported normal numbers. The change is bench-neutral by construction: no Criterion harness compiles detok/generation into a measured path (grep: zero bench references to either symbol), so the micro-benches are structurally blind to it. The real decode-throughput surface is covered by `e2e-parity.yml`'s informational speed report.

Series
Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.