feat(inference): chat_metal streaming CLI for Lattice Studio Chat #197

Merged
ohdearquant merged 3 commits into main from pr/eng-2-chat-metal on Jun 22, 2026
@ohdearquant ohdearquant commented Jun 22, 2026

Owner

What

chat_metal — the full Chat-surface CLI for Lattice Studio. Streaming token output, sampler flags, and LoRA adapter loading.

Why

The macOS Studio Chat screen drives generation through this binary. It streams via generate_streaming / chat_completion_streaming (PR-1), exposes the flags --temperature, --top-k, --top-p, --repetition-penalty, --seed, --json, --prompt, --max-tokens, --model, --model-dir, and --lora, and loads LoRA adapters in both PEFT and MLX key formats (alpha is read from the adapter metadata).
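
To make the sampler flags concrete, here is a minimal sketch of how --temperature, --top-k, --top-p, and --repetition-penalty could shape the next-token distribution. It is not the code in chat_metal.rs: the names SamplerConfig and candidate_distribution, the filter order (penalty, temperature, top-k, top-p), the CTRL-style penalty rule, and the convention that top_k = 0 disables the filter are all assumptions made for illustration.

```rust
/// Hypothetical mirror of the sampler flags; not the CLI's real struct.
#[derive(Clone, Copy, Debug)]
pub struct SamplerConfig {
    pub temperature: f32,
    pub top_k: usize, // assumption: 0 disables top-k
    pub top_p: f32,
    pub repetition_penalty: f32,
}

/// Returns (token_id, probability) pairs that survive all filters, sorted by
/// descending probability and renormalized to sum to 1.
pub fn candidate_distribution(
    logits: &[f32],
    history: &[usize],
    cfg: SamplerConfig,
) -> Vec<(usize, f32)> {
    if logits.is_empty() {
        return Vec::new();
    }
    let mut scored: Vec<(usize, f32)> = logits.iter().copied().enumerate().collect();

    // Repetition penalty, applied once per distinct token already generated:
    // positive logits shrink, negative logits move further from zero.
    let mut seen = vec![false; logits.len()];
    for &tok in history {
        if tok < scored.len() && !seen[tok] {
            seen[tok] = true;
            let l = &mut scored[tok].1;
            *l = if *l > 0.0 { *l / cfg.repetition_penalty } else { *l * cfg.repetition_penalty };
        }
    }

    // Temperature scaling (clamped to avoid dividing by zero).
    let t = cfg.temperature.max(1e-6);
    for entry in &mut scored {
        entry.1 /= t;
    }

    // Top-k: keep the k highest-scoring tokens.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    if cfg.top_k > 0 {
        scored.truncate(cfg.top_k);
    }

    // Softmax weights over the survivors (max subtracted for stability).
    let max_logit = scored[0].1;
    for entry in &mut scored {
        entry.1 = (entry.1 - max_logit).exp();
    }
    let total: f32 = scored.iter().map(|e| e.1).sum();

    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0f32;
    let mut keep = scored.len();
    for (n, entry) in scored.iter().enumerate() {
        cumulative += entry.1 / total;
        if cumulative >= cfg.top_p {
            keep = n + 1;
            break;
        }
    }
    scored.truncate(keep);

    // Renormalize what is left.
    let kept: f32 = scored.iter().map(|e| e.1).sum();
    for entry in &mut scored {
        entry.1 /= kept;
    }
    scored
}
```

A generator seeded from --seed would then draw one token from the returned pairs; the draw is left out here so the sketch stays deterministic.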

Files

  • crates/inference/src/bin/chat_metal.rs (+555/-67)

Verification

cargo build --release -p lattice-inference --bin chat_metal --features f16,metal-gpu builds clean. Built green in the integrated-tree gate (scripts/build-app-bins.sh).

Bench

Binary-only change (no library hot path touched). The make bench-compare comparator errored while assembling the delta (known two-worktree fragility; base benches ran clean). The change is bench-neutral by construction, since the changed code is off every Criterion-measured path.

Base

Stacked on pr/eng-1-streaming-detok (depends on generate_streaming). Review/merge PR-1 first.

Series

Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.

ohdearquant and others added 2 commits June 22, 2026 14:01
…er loading

Make chat_metal the full Chat-surface CLI:
- Token-by-token streaming via generate_streaming / chat_completion_streaming
- Sampler controls: --temperature, --top-k, --top-p, --repetition-penalty, --seed
- --lora adapter loading (PEFT and MLX key formats; alpha from metadata)
- --json mode emitting @@lattice events for the Lattice Studio Chat surface

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
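
As a rough illustration of what accepting both PEFT and MLX key formats can involve, the sketch below maps tensor names from either convention onto one internal form. The suffixes follow the naming these ecosystems commonly use (lora_A.weight / lora_B.weight with a base_model.model. prefix for PEFT, lora_a / lora_b for MLX). The matching rules actually used in chat_metal.rs are not shown in this PR, so parse_lora_key, LoraKey, and lora_scale are hypothetical names, and the alpha / rank scale is the usual LoRA convention rather than something the PR states.

```rust
/// Which low-rank factor a tensor holds.
#[derive(Debug, PartialEq, Eq)]
pub enum LoraMatrix {
    A,
    B,
}

/// Normalized view of one adapter tensor name (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct LoraKey {
    pub module: String,
    pub matrix: LoraMatrix,
}

/// Recognize a LoRA tensor name in either naming convention.
/// Returns None for tensors that are not LoRA factors.
pub fn parse_lora_key(key: &str) -> Option<LoraKey> {
    // Assumption: PEFT checkpoints wrap module paths in "base_model.model.".
    let key = key.strip_prefix("base_model.model.").unwrap_or(key);
    let (module, matrix) = if let Some(m) = key.strip_suffix(".lora_A.weight") {
        (m, LoraMatrix::A) // PEFT-style
    } else if let Some(m) = key.strip_suffix(".lora_B.weight") {
        (m, LoraMatrix::B) // PEFT-style
    } else if let Some(m) = key.strip_suffix(".lora_a") {
        (m, LoraMatrix::A) // MLX-style
    } else if let Some(m) = key.strip_suffix(".lora_b") {
        (m, LoraMatrix::B) // MLX-style
    } else {
        return None;
    };
    Some(LoraKey { module: module.to_string(), matrix })
}

/// Conventional LoRA scale: alpha (read from adapter metadata) over rank.
pub fn lora_scale(alpha: f32, rank: usize) -> f32 {
    alpha / rank as f32
}
```

With this normalization, both spellings of the same projection resolve to the same module path, so the loader can pair each A factor with its B factor regardless of which tool exported the adapter.
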
chat_metal --json (Lattice Studio's streaming path) emitted mojibake for
CJK and emoji. MetalQwen35State::generate_streaming decoded each token
with the per-token-lossy local decode_tokens, so a codepoint split across
BPE tokens streamed U+FFFD before the CLI ever JSON-escaped it. Switch
generate_streaming to the shared IncrementalDetokenizer — already used by
the non-Metal model::qwen35::generation path and unit-tested in
detokenize.rs — which buffers raw bytes and emits only the longest
complete-UTF-8 prefix per token, flushing any truncated tail once at the
end. Streamed deltas now concatenate to exactly the returned text.
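
The buffering idea described above can be sketched with the standard library alone. This is not the IncrementalDetokenizer from detokenize.rs, whose API is not shown in this PR; Utf8PrefixBuffer, push, and finish are hypothetical names, and the sketch assumes each token is already available as raw bytes.

```rust
/// Illustrative byte buffer that only ever emits complete UTF-8.
#[derive(Default)]
pub struct Utf8PrefixBuffer {
    pending: Vec<u8>, // bytes received but not yet emitted
}

impl Utf8PrefixBuffer {
    /// Append one token's raw bytes and return the longest complete UTF-8
    /// prefix. A codepoint truncated at the end stays buffered.
    pub fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        let mut out = String::new();
        let mut start = 0; // everything before `start` has been emitted
        while start < self.pending.len() {
            match std::str::from_utf8(&self.pending[start..]) {
                Ok(s) => {
                    out.push_str(s);
                    start = self.pending.len();
                }
                Err(e) => {
                    let valid_end = start + e.valid_up_to();
                    let valid = std::str::from_utf8(&self.pending[start..valid_end])
                        .expect("prefix up to valid_up_to() is valid UTF-8");
                    out.push_str(valid);
                    match e.error_len() {
                        // Input ended mid-codepoint: wait for the next token.
                        None => {
                            start = valid_end;
                            break;
                        }
                        // Bytes that can never become valid: replace and skip.
                        Some(bad) => {
                            out.push('\u{FFFD}');
                            start = valid_end + bad;
                        }
                    }
                }
            }
        }
        self.pending.drain(..start);
        out
    }

    /// Flush whatever is left at end of generation, lossily if truncated.
    pub fn finish(&mut self) -> String {
        let tail = String::from_utf8_lossy(&self.pending).into_owned();
        self.pending.clear();
        tail
    }
}
```

Feeding the three bytes of a CJK character across two push calls yields an empty string and then the whole character, which is the property that keeps U+FFFD out of the streamed deltas.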

Also stop discarding stdout write/flush errors in the JSON streaming loop.
A broken pipe previously let generation run to completion and still report
success; now the first failure stops the stream and the CLI exits nonzero.
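
A minimal sketch of that behavior, assuming the deltas arrive as an iterator of strings: every write and flush is checked, and the first error ends the loop so no further tokens are pulled. The names stream_deltas and ClosedAfter are hypothetical; ClosedAfter only exists to simulate a reader that went away.

```rust
use std::io::{self, Write};

/// Write each delta and flush immediately. Returns how many deltas were
/// delivered, or the first I/O error, after which no more are consumed.
pub fn stream_deltas<W, I>(out: &mut W, deltas: I) -> io::Result<usize>
where
    W: Write,
    I: IntoIterator<Item = String>,
{
    let mut delivered = 0;
    for delta in deltas {
        out.write_all(delta.as_bytes())?; // propagate instead of discarding
        out.flush()?;
        delivered += 1;
    }
    Ok(delivered)
}

/// Test double: accepts `remaining` writes, then reports a broken pipe.
pub struct ClosedAfter {
    pub remaining: usize,
}

impl Write for ClosedAfter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.remaining == 0 {
            return Err(io::Error::from(io::ErrorKind::BrokenPipe));
        }
        self.remaining -= 1;
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

A caller can map the Err case to std::process::ExitCode::FAILURE so the process exits nonzero, matching the behavior the commit describes.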

Widens model::qwen35::detokenize to pub(crate) so the Metal forward path
can share the detokenizer. The local decode_tokens stays for the
non-streaming final-text paths (separate latent CJK issue, tracked apart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ohdearquant ohdearquant changed the base branch from pr/eng-1-streaming-detok to main June 22, 2026 19:50
@ohdearquant ohdearquant reopened this Jun 22, 2026
github-actions Bot commented Jun 22, 2026


E2E Parity Report

PASS: all 3 prompts match within first 3 tokens

| Prompt | Agreement | First Diff | HF tok/s | Lattice tok/s | Verdict |
|---|---|---|---|---|---|
| The capital of France is | 3/15 | pos 3 | 0.5 | 2.2 | PASS |
| In the year 2024, artificial intelligence | 10/15 | pos 9 | 0.4 | 1.9 | PASS |
| `def fibonacci(n): if n <= 1: return n return` | 15/15 | none | 0.4 | 1.8 | PASS |

The capital of France is

  • HF: Paris.
    The capital of France is Paris.
    The capital of France
  • Lattice: Paris.
    A: Yes, the capital of France is Paris.

In the year 2024, artificial intelligence

  • HF: (AI) has become a significant part of the global economy. It is
  • Lattice: (AI) has become a significant part of our daily lives. From personal

def fibonacci(n): if n <= 1: return n return

  • HF: fibonacci(n-1) + fibonacci(n-2)

    print(fib
  • Lattice: fibonacci(n-1) + fibonacci(n-2)

    print(fib

@ohdearquant ohdearquant enabled auto-merge June 22, 2026 23:21
@ohdearquant ohdearquant merged commit 3b54b88 into main Jun 22, 2026
10 checks passed
@ohdearquant ohdearquant deleted the pr/eng-2-chat-metal branch June 22, 2026 23:30