feat(inference): chat_metal streaming CLI for Lattice Studio Chat #197
Merged
…er loading

Make chat_metal the full Chat-surface CLI:
- Token-by-token streaming via generate_streaming / chat_completion_streaming
- Sampler controls: --temperature, --top-k, --top-p, --repetition-penalty, --seed
- --lora adapter loading (PEFT and MLX key formats; alpha from metadata)
- --json mode emitting @@lattice events for the Lattice Studio Chat surface

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
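The sampler controls listed above can be illustrated with a minimal sketch. It shows one common way to combine repetition penalty, temperature, top-k, and top-p before a seeded draw. Everything here is an assumption made for illustration: `SamplerConfig`, `candidate_distribution`, `sample`, and the SplitMix64 generator are hypothetical names, and the order of operations is a conventional choice, not something the commit message confirms about chat_metal.

```rust
/// Hypothetical settings mirroring the CLI flag names; not the crate's types.
pub struct SamplerConfig {
    pub temperature: f32,
    pub top_k: usize, // 0 means "no top-k limit" in this sketch
    pub top_p: f32,
    pub repetition_penalty: f32,
}

/// Returns (token_id, probability) pairs, sorted by descending probability.
pub fn candidate_distribution(
    logits: &[f32],
    history: &[usize],
    cfg: &SamplerConfig,
) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();
    if scored.is_empty() {
        return scored;
    }
    // Repetition penalty: shrink positive logits and push negative ones lower
    // for tokens that were already generated.
    for (id, logit) in scored.iter_mut() {
        if history.contains(id) {
            *logit = if *logit > 0.0 {
                *logit / cfg.repetition_penalty
            } else {
                *logit * cfg.repetition_penalty
            };
        }
    }
    // Temperature: divide logits, guarding against division by zero.
    let t = cfg.temperature.max(1e-6);
    for (_, logit) in scored.iter_mut() {
        *logit /= t;
    }
    // Top-k: keep the k highest-scoring tokens.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    if cfg.top_k > 0 {
        scored.truncate(cfg.top_k);
    }
    // Softmax over the survivors (subtract the max for numerical stability).
    let max = scored[0].1;
    let mut probs: Vec<(usize, f32)> =
        scored.iter().map(|&(id, l)| (id, (l - max).exp())).collect();
    let sum: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= sum;
    }
    // Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0f32;
    let mut keep = probs.len();
    for (n, p) in probs.iter().enumerate() {
        cum += p.1;
        if cum >= cfg.top_p {
            keep = n + 1;
            break;
        }
    }
    probs.truncate(keep);
    let kept: f32 = probs.iter().map(|p| p.1).sum();
    for p in probs.iter_mut() {
        p.1 /= kept;
    }
    probs
}

/// SplitMix64 step, used here only to make the draw reproducible from a seed.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Draws one token id from the distribution; None if it is empty.
pub fn sample(probs: &[(usize, f32)], rng_state: &mut u64) -> Option<usize> {
    let u = (splitmix64(rng_state) >> 11) as f64 / (1u64 << 53) as f64;
    let mut cum = 0.0f64;
    for &(id, p) in probs {
        cum += p as f64;
        if u < cum {
            return Some(id);
        }
    }
    // Rounding can leave the cumulative sum slightly below 1.0.
    probs.last().map(|&(id, _)| id)
}
```

With the same seed and the same logits, `sample` returns the same token id, which is the property a `--seed` flag is usually expected to provide.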
chat_metal --json (Lattice Studio's streaming path) emitted mojibake for CJK and emoji. MetalQwen35State::generate_streaming decoded each token with the per-token-lossy local decode_tokens, so a codepoint split across BPE tokens streamed U+FFFD before the CLI ever JSON-escaped it.

Switch generate_streaming to the shared IncrementalDetokenizer, which is already used by the non-Metal model::qwen35::generation path and unit-tested in detokenize.rs. It buffers raw bytes and emits only the longest complete-UTF-8 prefix per token, flushing any truncated tail once at the end. Streamed deltas now concatenate to exactly the returned text.

Also stop discarding stdout write/flush errors in the JSON streaming loop. A broken pipe previously let generation run to completion and still report success; now the first failure stops the stream and the CLI exits nonzero.

Widens model::qwen35::detokenize to pub(crate) so the Metal forward path can share the detokenizer. The local decode_tokens stays for the non-streaming final-text paths (separate latent CJK issue, tracked apart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
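The buffering behavior described in this commit can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the crate's actual IncrementalDetokenizer: `Utf8Streamer` and `stream_tokens` are hypothetical names, and tokens are modeled as raw byte slices rather than vocabulary ids. The sketch also propagates write and flush errors instead of discarding them, in the spirit of the second fix.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an incremental detokenizer.
pub struct Utf8Streamer {
    pending: Vec<u8>, // bytes not yet emitted (at most one partial codepoint)
}

impl Utf8Streamer {
    pub fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Appends one token's bytes and returns the longest complete-UTF-8 prefix.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        loop {
            // Copy out what we need so the borrow of `pending` ends here.
            let (valid, bad_len) = match std::str::from_utf8(&self.pending) {
                Ok(_) => (self.pending.len(), None),
                Err(e) => (e.valid_up_to(), Some(e.error_len())),
            };
            out.push_str(
                std::str::from_utf8(&self.pending[..valid]).expect("prefix validated above"),
            );
            self.pending.drain(..valid);
            match bad_len {
                // Whole buffer was valid, or the tail is an incomplete sequence
                // that the next token may complete: stop and keep the tail.
                None | Some(None) => return out,
                // Bytes that can never become valid: replace them and continue.
                Some(Some(n)) => {
                    out.push('\u{FFFD}');
                    self.pending.drain(..n);
                }
            }
        }
    }

    /// Flushes a truncated tail once, lossily, at end of generation.
    pub fn finish(&mut self) -> String {
        let tail = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();
        tail
    }
}

/// Streams deltas to `out`, stopping at the first write or flush failure.
/// Returns the full text, which equals the concatenation of the deltas.
pub fn stream_tokens<W: Write>(out: &mut W, tokens: &[&[u8]]) -> io::Result<String> {
    let mut detok = Utf8Streamer::new();
    let mut full = String::new();
    for token in tokens {
        let delta = detok.push(token);
        if !delta.is_empty() {
            out.write_all(delta.as_bytes())?;
            out.flush()?;
            full.push_str(&delta);
        }
    }
    let tail = detok.finish();
    out.write_all(tail.as_bytes())?;
    out.flush()?;
    full.push_str(&tail);
    Ok(full)
}
```

A caller that maps an `Err` from `stream_tokens` to a nonzero exit code gets the broken-pipe behavior the commit describes: generation stops at the first failed write instead of running to completion.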
E2E Parity Report
PASS: all 3 prompts match within first 3 tokens
What
chat_metal: the full Chat-surface CLI for Lattice Studio, with streaming token output, sampler flags, and LoRA adapter loading.

Why
The macOS Studio Chat screen drives generation through this binary. It streams via generate_streaming / chat_completion_streaming (PR-1), exposes --temperature, --top-k, --top-p, --repetition-penalty, --seed, --json, --prompt, --max-tokens, --model, --model-dir, and --lora, and loads LoRA adapters in both PEFT and MLX key formats (alpha read from adapter metadata).

Files
crates/inference/src/bin/chat_metal.rs (+555/-67)

Verification
Clean build with:
cargo build --release -p lattice-inference --bin chat_metal --features f16,metal-gpu
Built green in the integrated-tree gate (scripts/build-app-bins.sh).

Bench
Binary-only change (no library hot path touched). The comparator in make bench-compare errored while assembling the delta (known two-worktree fragility; base benches ran clean). The change is bench-neutral by construction, since the changed code is off every Criterion-measured path.

Base
Stacked on pr/eng-1-streaming-detok (depends on generate_streaming). Review/merge PR-1 first.

Series
Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.
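
For reviewers unfamiliar with how a LoRA alpha value is typically used, the sketch below applies the standard LoRA scaling, scale = alpha / rank, while folding an adapter into a base weight. It is an assumption-laden illustration: `merge_lora` is a hypothetical helper, matrices are flat row-major slices, and the description above does not say whether chat_metal merges adapters into the weights or applies them at runtime.

```rust
/// Folds a LoRA update into a base weight in place: W += (alpha / rank) * B * A.
/// Shapes (row-major): w is out_dim x in_dim, a is rank x in_dim, b is out_dim x rank.
/// Hypothetical helper for illustration; not part of the crate.
pub fn merge_lora(
    w: &mut [f32],
    a: &[f32],
    b: &[f32],
    out_dim: usize,
    in_dim: usize,
    rank: usize,
    alpha: f32,
) {
    assert!(rank > 0, "rank must be nonzero");
    assert_eq!(w.len(), out_dim * in_dim);
    assert_eq!(a.len(), rank * in_dim);
    assert_eq!(b.len(), out_dim * rank);

    // Standard LoRA scaling factor.
    let scale = alpha / rank as f32;

    for o in 0..out_dim {
        for i in 0..in_dim {
            // (B * A)[o][i] = sum over r of B[o][r] * A[r][i]
            let mut acc = 0.0f32;
            for r in 0..rank {
                acc += b[o * rank + r] * a[r * in_dim + i];
            }
            w[o * in_dim + i] += scale * acc;
        }
    }
}
```

Reading alpha from adapter metadata matters because the same A and B matrices produce a different effective update under a different alpha; a loader that assumed a fixed value would silently rescale the adapter.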