feat(inference): UTF-8-safe streaming detokenization + generate_streaming #196

Merged
ohdearquant merged 4 commits into main from pr/eng-1-streaming-detok
Jun 22, 2026

ohdearquant (Owner) commented Jun 22, 2026
What

UTF-8-safe incremental detokenization for byte-level BPE, plus Qwen35Model::generate_streaming — a token-at-a-time generation entry point with a per-token callback.
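As a rough illustration of what a token-at-a-time entry point with a per-token callback can look like, here is a minimal sketch. It is not the signature of `Qwen35Model::generate_streaming`: the free function, the `next_token` closure (standing in for the forward pass plus sampling), the `eos_id` parameter, and the `EmptyPrompt` variant are all assumptions made for the example, and the callback receives a token ID rather than a text delta to keep it short.

```rust
/// Stand-in for the crate's error type; the variant name is hypothetical.
#[derive(Debug, PartialEq)]
enum InferenceError {
    EmptyPrompt,
}

/// Sketch of a streaming generation loop.
/// `next_token` plays the role of "run the model and sample one token";
/// `on_token` is invoked once per generated token, in order.
fn generate_streaming(
    prompt_ids: &[u32],
    max_new_tokens: usize,
    eos_id: u32,
    mut next_token: impl FnMut(&[u32]) -> u32,
    mut on_token: impl FnMut(u32),
) -> Result<Vec<u32>, InferenceError> {
    let mut all_ids = prompt_ids.to_vec();

    // Return an error instead of panicking when there is no context to extend.
    let Some(&_last) = all_ids.last() else {
        return Err(InferenceError::EmptyPrompt);
    };

    // Nothing requested: return before sampling so no token is emitted.
    if max_new_tokens == 0 {
        return Ok(Vec::new());
    }

    let mut generated = Vec::with_capacity(max_new_tokens);
    for _ in 0..max_new_tokens {
        let id = next_token(&all_ids);
        if id == eos_id {
            break;
        }
        all_ids.push(id);
        generated.push(id);
        on_token(id); // the caller sees each token as soon as it exists
    }
    Ok(generated)
}
```

A caller would pass a closure that forwards each token (or, in the real API, each decoded text delta) to the UI as it arrives instead of waiting for the full completion.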

Why

Byte-level BPE splits multi-byte codepoints (CJK, emoji) across tokens. Naive per-token UTF-8 decode emits U+FFFD replacement characters mid-stream, and English-only smoke tests hide it. detokenize.rs now buffers incomplete codepoints and flushes only complete UTF-8. generate_streaming gives the app a streaming surface, consumed by chat_metal (PR-2) and generate_lora (PR-2b).
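The failure mode can be reproduced with the standard library alone. The sketch below is not the code in detokenize.rs; the function names and the way the bytes are split across "tokens" are made up for illustration. It contrasts decoding each token's bytes in isolation with buffering until a complete UTF-8 prefix is available.

```rust
/// Naive streaming: decode each token's bytes on its own.
/// A codepoint split across two tokens is invalid on both sides of the
/// split, so the lossy decoder substitutes U+FFFD.
fn decode_per_token(tokens: &[&[u8]]) -> String {
    tokens
        .iter()
        .map(|bytes| String::from_utf8_lossy(bytes).into_owned())
        .collect()
}

/// Buffered streaming: append each token's bytes to `pending` and emit
/// only the prefix that is already valid UTF-8. Returns one delta per token.
/// Simplification: any decode error is treated as "incomplete", so a truly
/// malformed byte would sit in `pending` indefinitely in this version.
fn decode_buffered(tokens: &[&[u8]]) -> Vec<String> {
    let mut pending: Vec<u8> = Vec::new();
    let mut deltas = Vec::new();
    for bytes in tokens {
        pending.extend_from_slice(bytes);
        // Length of the longest valid UTF-8 prefix of the buffer.
        let valid = match std::str::from_utf8(&pending) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        let delta = String::from_utf8_lossy(&pending[..valid]).into_owned();
        pending.drain(..valid);
        deltas.push(delta);
    }
    deltas
}
```

Feeding the six bytes of "你好" as three two-byte tokens makes the naive path emit replacement characters, while the buffered path yields an empty delta, then "你", then "好".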

Files

  • crates/inference/src/model/qwen35/detokenize.rs (+143/-10)
  • crates/inference/src/model/qwen35/generation.rs (+127/-1)

Verification

`cargo build -p lattice-inference` builds clean. The integrated tree (all 8 slices merged) builds all 9 app-shipped binaries green via scripts/build-app-bins.sh.

Bench

make bench-compare was run (origin/main vs integrated tree). Its two-worktree comparator errored while assembling the delta table (a known fragility; the base benches themselves ran clean — rms_norm/layer_norm/silu/gelu/softmax/simd_normalize all reported normal numbers). The change is bench-neutral by construction: no Criterion harness compiles detok/generation into a measured path (grep: zero bench references to either symbol), so the micro-benches are structurally blind to it. The real decode-throughput surface is covered by e2e-parity.yml's informational speed report.

Series

Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.

…ming

Add generate_streaming() for token-by-token output via a delta callback,
and make incremental detokenization UTF-8-boundary-safe: byte-level BPE can
split a multi-byte codepoint across tokens, so only complete-UTF-8 deltas are
flushed while partial bytes buffer until the next token completes them.
Without this, CJK and emoji streams surface U+FFFD replacement characters.

Consumed by the Lattice Studio Chat surface (chat_metal --json streaming).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
github-actions (Bot) commented Jun 22, 2026

E2E Parity Report

PASS: all 3 prompts match within first 3 tokens

| Prompt | Agreement | First Diff | HF tok/s | Lattice tok/s | Verdict |
|---|---|---|---|---|---|
| The capital of France is | 3/15 | pos 3 | 0.3 | 2.2 | PASS |
| In the year 2024, artificial intelligence | 10/15 | pos 9 | 0.3 | 1.9 | PASS |
| `def fibonacci(n): if n <= 1: return n return` | 15/15 | none | 0.2 | 1.4 | PASS |

The capital of France is

  • HF: Paris.
    The capital of France is Paris.
    The capital of France
  • Lattice: Paris.
    A: Yes, the capital of France is Paris.

In the year 2024, artificial intelligence

  • HF: (AI) has become a significant part of the global economy. It is
  • Lattice: (AI) has become a significant part of our daily lives. From personal

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1) + fibonacci(n-2)

print(fib

  • Lattice: fibonacci(n-1) + fibonacci(n-2)

print(fib

ohdearquant and others added 3 commits June 22, 2026 14:49
- flush_complete: branch on Utf8Error::error_len() so a malformed byte
  emits U+FFFD and advances immediately instead of stalling the stream
  until finish(); add two invalid-byte regression tests
- generate_streaming: guard max_new_tokens == 0 (return before sampling,
  so no unrequested token is emitted)
- generate_streaming: replace .expect() on all_ids.last() with a let-else
  returning InferenceError (no expect/unwrap in library code)
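The `error_len()` branch described in the first bullet can be sketched as follows. This is an illustration of the technique, not the contents of detokenize.rs: the `StreamDecoder` type, its `pending` field, and the `push` method are hypothetical names, and the behavior of `finish` (lossy-decoding whatever is left) is an assumption.

```rust
/// Hypothetical streaming decoder; type, field, and `push` are illustrative.
#[derive(Default)]
struct StreamDecoder {
    pending: Vec<u8>,
}

impl StreamDecoder {
    /// Append one token's raw bytes and return the text that is safe to emit.
    fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        self.flush_complete()
    }

    fn flush_complete(&mut self) -> String {
        let mut out = String::new();
        let mut start = 0;
        loop {
            match std::str::from_utf8(&self.pending[start..]) {
                Ok(s) => {
                    out.push_str(s);
                    start = self.pending.len();
                    break;
                }
                Err(e) => {
                    // Bytes before valid_up_to() are valid UTF-8: emit them.
                    let valid_end = start + e.valid_up_to();
                    out.push_str(&String::from_utf8_lossy(&self.pending[start..valid_end]));
                    match e.error_len() {
                        // Malformed bytes: emit U+FFFD and advance past them now.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            start = valid_end + bad;
                        }
                        // Incomplete trailing sequence: keep it for the next token.
                        None => {
                            start = valid_end;
                            break;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
        out
    }

    /// End of stream: leftover bytes can no longer be completed.
    fn finish(&mut self) -> String {
        let rest = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();
        rest
    }
}
```

The distinction matters because `Utf8Error::error_len()` returns `None` only when the input ended in the middle of a sequence that could still become valid. Treating every error as "incomplete" would leave a bad byte at the head of the buffer and block all later output until `finish()`.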

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
codex_review_pr196.md is a local review note that leaked into the branch
via `git add -A`; it is not product code. The PR diff should contain only
the streaming-detokenization source. Codex round-2 confirmed the three code
findings are resolved; this removes the only remaining REQUEST CHANGES item.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ohdearquant merged commit 178cea5 into main Jun 22, 2026
10 checks passed
ohdearquant deleted the pr/eng-1-streaming-detok branch June 22, 2026 23:20