feat(inference): UTF-8-safe streaming detokenization + generate_streaming by ohdearquant · Pull Request #196 · ohdearquant/lattice

ohdearquant · 2026-06-22T18:23:43Z

What

UTF-8-safe incremental detokenization for byte-level BPE, plus Qwen35Model::generate_streaming — a token-at-a-time generation entry point with a per-token callback.

Why

Byte-level BPE splits multi-byte codepoints (CJK, emoji) across tokens. Naive per-token UTF-8 decode emits U+FFFD replacement characters mid-stream, and English-only smoke tests hide it. detokenize.rs now buffers incomplete codepoints and flushes only complete UTF-8. generate_streaming gives the app a streaming surface, consumed by chat_metal (PR-2) and generate_lora (PR-2b).
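The failure mode described above can be reproduced with the standard library alone. The following is a minimal sketch, not the code in detokenize.rs: the function names are hypothetical, and the byte split of "你" (E4 BD A0) is an assumed example of how a tokenizer might cut a codepoint.

```rust
/// Decode every chunk independently, as a naive per-token streaming loop would.
/// (Hypothetical helper for illustration only.)
pub fn naive_decode(chunks: &[&[u8]]) -> String {
    chunks
        .iter()
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

/// True when `bytes` ends inside a codepoint that later bytes could still complete.
pub fn ends_mid_codepoint(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(_) => false,
        // error_len() == None means "unexpected end of input", not "invalid byte".
        Err(e) => e.error_len().is_none(),
    }
}

/// Decode the concatenation of all chunks, which is the result that
/// buffering incomplete codepoints converges to over the whole stream.
pub fn joined_decode(chunks: &[&[u8]]) -> String {
    String::from_utf8_lossy(&chunks.concat()).into_owned()
}
```

With the chunks `[0xE4, 0xBD]` and `[0xA0]`, `naive_decode` yields replacement characters while `joined_decode` yields the intended character, and `ends_mid_codepoint` is the signal a buffering decoder can use to hold bytes back.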

Files

crates/inference/src/model/qwen35/detokenize.rs (+143/-10)
crates/inference/src/model/qwen35/generation.rs (+127/-1)

Verification

`cargo build -p lattice-inference` is clean. The integrated tree (all 8 slices merged) builds all 9 app-shipped binaries green via scripts/build-app-bins.sh.

Bench

make bench-compare was run (origin/main vs integrated tree). Its two-worktree comparator errored while assembling the delta table (a known fragility; the base benches themselves ran clean — rms_norm/layer_norm/silu/gelu/softmax/simd_normalize all reported normal numbers). The change is bench-neutral by construction: no Criterion harness compiles detok/generation into a measured path (grep: zero bench references to either symbol), so the micro-benches are structurally blind to it. The real decode-throughput surface is covered by e2e-parity.yml's informational speed report.

Series

Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.

…ming

Add generate_streaming() for token-by-token output via a delta callback, and make incremental detokenization UTF-8-boundary-safe: byte-level BPE can split a multi-byte codepoint across tokens, so only complete-UTF-8 deltas are flushed while partial bytes buffer until the next token completes them. Without this, CJK and emoji streams surface U+FFFD replacement characters.

Consumed by the Lattice Studio Chat surface (chat_metal --json streaming).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-22T18:32:16Z

E2E Parity Report

PASS: all 3 prompts match within first 3 tokens

| Prompt | Agreement | First Diff | HF tok/s | Lattice tok/s | Verdict |
|---|---|---|---|---|---|
| `The capital of France is` | 3/15 | pos 3 | 0.3 | 2.2 | PASS |
| `In the year 2024, artificial intelligence` | 10/15 | pos 9 | 0.3 | 1.9 | PASS |
| `def fibonacci(n): if n <= 1: return n return` | 15/15 | none | 0.2 | 1.4 | PASS |

The capital of France is

HF: Paris.
The capital of France is Paris.
The capital of France
Lattice: Paris.
A: Yes, the capital of France is Paris.

In the year 2024, artificial intelligence

HF: (AI) has become a significant part of the global economy. It is
Lattice: (AI) has become a significant part of our daily lives. From personal

def fibonacci(n): if n <= 1: return n return

HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

- flush_complete: branch on Utf8Error::error_len() so a malformed byte emits U+FFFD and advances immediately instead of stalling the stream until finish(); add two invalid-byte regression tests
- generate_streaming: guard max_new_tokens == 0 (return before sampling, so no unrequested token is emitted)
- generate_streaming: replace .expect() on all_ids.last() with a let-else returning InferenceError (no expect/unwrap in library code)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
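The three fixes listed in this commit can be sketched together as follows. This is an illustration under stated assumptions, not the merged code: `StreamDecoder`, the `EmptyPrompt` variant, and the closure parameters `next_token`, `token_bytes`, and `on_delta` are hypothetical stand-ins, and only `flush_complete`, `finish`, `generate_streaming`, `all_ids`, `max_new_tokens`, and `InferenceError` are names taken from the commit message.

```rust
/// Hypothetical streaming decoder: holds bytes that do not yet form complete UTF-8.
#[derive(Default)]
pub struct StreamDecoder {
    pending: Vec<u8>,
}

impl StreamDecoder {
    /// Append one token's raw bytes and return the text that is now safe to emit.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        self.flush_complete()
    }

    /// Emit every complete codepoint; keep only a truncated tail buffered.
    fn flush_complete(&mut self) -> String {
        let mut out = String::new();
        let mut start = 0;
        loop {
            match std::str::from_utf8(&self.pending[start..]) {
                Ok(s) => {
                    out.push_str(s);
                    start = self.pending.len();
                    break;
                }
                Err(e) => {
                    let valid_end = start + e.valid_up_to();
                    // The prefix before valid_end is valid UTF-8, so this is lossless.
                    out.push_str(&String::from_utf8_lossy(&self.pending[start..valid_end]));
                    match e.error_len() {
                        // Malformed byte(s): emit U+FFFD and skip them right away.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            start = valid_end + bad;
                        }
                        // Truncated sequence: keep it and wait for the next token.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
        out
    }

    /// End of stream: a tail that is still buffered can never complete.
    pub fn finish(&mut self) -> String {
        let rest = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();
        rest
    }
}

/// Assumed error type; the real InferenceError variants are not shown in the PR text.
#[derive(Debug, PartialEq)]
pub enum InferenceError {
    EmptyPrompt,
}

/// Hypothetical loop: `next_token` stands in for sampling, `token_bytes` for the vocab lookup.
pub fn generate_streaming(
    prompt_ids: &[u32],
    max_new_tokens: usize,
    mut next_token: impl FnMut(u32) -> u32,
    token_bytes: impl Fn(u32) -> Vec<u8>,
    mut on_delta: impl FnMut(&str),
) -> Result<Vec<u32>, InferenceError> {
    let mut all_ids = prompt_ids.to_vec();
    // Guard: nothing was requested, so return before any sampling happens.
    if max_new_tokens == 0 {
        return Ok(all_ids);
    }
    let mut decoder = StreamDecoder::default();
    for _ in 0..max_new_tokens {
        // let-else instead of .expect(): an empty history becomes an error value.
        let Some(&last) = all_ids.last() else {
            return Err(InferenceError::EmptyPrompt);
        };
        let id = next_token(last);
        all_ids.push(id);
        let delta = decoder.push(&token_bytes(id));
        if !delta.is_empty() {
            on_delta(&delta);
        }
    }
    let tail = decoder.finish();
    if !tail.is_empty() {
        on_delta(&tail);
    }
    Ok(all_ids)
}
```

Branching on `error_len()` is what separates "wait for more bytes" from "this byte will never be valid". Without that distinction, a single malformed byte would sit at the front of the buffer and block every later delta until `finish()`.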

codex_review_pr196.md is a local review note that leaked into the branch via `git add -A`; it is not product code. The PR diff should contain only the streaming-detokenization source. Codex round-2 confirmed the three code findings are resolved; this removes the only remaining REQUEST CHANGES item.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ohdearquant and others added 3 commits June 22, 2026 14:49

Merge branch 'main' into pr/eng-1-streaming-detok

9d20a2c

ohdearquant merged commit 178cea5 into main Jun 22, 2026
10 checks passed

ohdearquant deleted the pr/eng-1-streaming-detok branch June 22, 2026 23:20

ohdearquant merged 4 commits into main from pr/eng-1-streaming-detok
