feat(tune): stream generate_lora output via generate_streaming #198
Merged
Add token-by-token streaming via generate_streaming, plus sampler controls, to the LoRA generation CLI, mirroring chat_metal. Used by the Lattice Studio A/B compare surface.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
E2E Parity Report
PASS: all 3 prompts match within the first 3 tokens
print(fib
print(fib |
What
Stream tokens from generate_lora via generate_streaming, with sampler controls.
Why
generate_lora (the LoRA-adapter generation binary) previously buffered the full output. It now uses the streaming entry point from PR-1, so adapter-driven generation streams token-by-token like chat_metal.
Files
crates/tune/src/bin/generate_lora.rs (+119/-24)
Verification
cargo build --release -p lattice-tune --bin generate_lora --features safetensors,inference-hook builds clean. This binary's compile break (E0599: generate_streaming not in scope) was one of the two regressions that motivated the PR-G gate. Built green in the integrated-tree gate.
Base
Stacked on pr/eng-1-streaming-detok (depends on generate_streaming). Review/merge PR-1 first.
Series
Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.
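
Illustration
For reviewers unfamiliar with the buffered-versus-streaming distinction described under Why, here is a minimal sketch of the pattern. It is not the code in generate_lora.rs: the real generate_streaming signature is not shown in this PR description, so generate_streaming_stub, stream_to, and the bool-returning callback are hypothetical stand-ins. The stub yields pre-split strings instead of decoding from a model.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the real `generate_streaming` entry point.
/// It walks a fixed slice of pieces instead of sampling from a model, and
/// hands each piece to `on_token` as soon as it is "produced".
/// Returns the number of pieces handed to the callback.
fn generate_streaming_stub<F>(pieces: &[&str], max_tokens: usize, mut on_token: F) -> usize
where
    F: FnMut(&str) -> bool,
{
    let mut emitted = 0;
    for &piece in pieces.iter().take(max_tokens) {
        emitted += 1;
        // Assumed convention: the callback returns false to request an early stop.
        if !on_token(piece) {
            break;
        }
    }
    emitted
}

/// Writes each piece to `out` and flushes immediately, so the text appears
/// incrementally rather than after the whole output has been buffered.
fn stream_to<W: Write>(out: &mut W, pieces: &[&str], max_tokens: usize) -> io::Result<usize> {
    let mut err = None;
    let n = generate_streaming_stub(pieces, max_tokens, |piece| {
        match out.write_all(piece.as_bytes()).and_then(|_| out.flush()) {
            Ok(()) => true,
            Err(e) => {
                // Remember the I/O error and stop generating.
                err = Some(e);
                false
            }
        }
    });
    match err {
        Some(e) => Err(e),
        None => Ok(n),
    }
}

fn main() -> io::Result<()> {
    let pieces = ["fn", " fib", "(n", ")"];
    let stdout = io::stdout();
    let mut handle = stdout.lock();
    // Cap generation at 3 pieces; each one is flushed as it arrives.
    let n = stream_to(&mut handle, &pieces, 3)?;
    writeln!(handle, "\n[{} tokens]", n)?;
    Ok(())
}
```

The per-piece flush is what makes output visible as it is generated. Without it, a buffered stdout could hold the text until the buffer fills or the process exits, which would look the same as the old buffered behaviour.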