feat(inference): QuaRot on-disk byte accounting + dry-run quant event #201
Merged
…event Account for actual on-disk SafetensorsFile byte sizes during QuaRot conversion and report compression in GB, and emit a quant_done event (honest-nil ratio when unknown) for the Lattice Studio quantize surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
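As a rough illustration of measuring real on-disk sizes instead of estimating them, the sketch below sums the lengths of files with given extensions in a directory and converts the total to GB. The names `on_disk_bytes` and `to_gb`, and the choice of decimal GB (1e9 bytes), are assumptions made for this example; they are not taken from the crate.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Sum the on-disk size of every regular file in `dir` whose extension is
/// listed in `extensions`. Hypothetical helper, not the crate's API.
pub fn on_disk_bytes(dir: &Path, extensions: &[&str]) -> io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let wanted = path
            .extension()
            .and_then(|e| e.to_str())
            .is_some_and(|e| extensions.contains(&e));
        if wanted && path.is_file() {
            // Use the file system's reported length, not a computed estimate.
            total += fs::metadata(&path)?.len();
        }
    }
    Ok(total)
}

/// Convert a byte count to GB. Assumption: decimal GB (1e9 bytes).
pub fn to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

Calling `on_disk_bytes(dir, &["safetensors"])` on the input directory and again with the output extensions would give the two totals a compression report needs.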
E2E Parity Report
PASS: all 3 prompts match within first 3 tokens
print(fib
print(fib |
Dry-run returned total_bytes_out = 0 (and zero tensor counts), so the Studio could not preview the compression ratio before committing a write.

Remove the early-return and run the full tensor loop in both modes, computing each tensor's footprint from shape/numel with the same formula the writer applies (Q4: header + data.len().div_ceil(32)*20; f16: header + numel*2) and gating only the disk writes, dir creation, and index/config emission on !dry_run. Dry-run now reports the identical planned_quantized, kept_f16, and total_bytes_out a real write produces.

Adds parity tests (tied + untied) asserting dry == real, plus a non-circular guard that sums the actual on-disk .q4/.f16 file sizes and asserts they equal the reported total, catching future drift between the byte formula and write_f16_file / save_q4_file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
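The accounting described in this commit can be sketched as follows. This is a minimal illustration, not the code in convert.rs: `TensorPlan`, `Report`, `convert`, the `header_bytes` parameter, and the `write` callback are hypothetical stand-ins, and the sketch assumes `data.len()` equals the tensor's element count. Only the two byte formulas and the three report field names come from the commit message.

```rust
/// Planned size of a Q4 tensor file: header plus 20 bytes per 32-element block.
/// `header_bytes` is a hypothetical parameter; the real header size is set by the writer.
pub fn q4_bytes(header_bytes: u64, numel: u64) -> u64 {
    header_bytes + numel.div_ceil(32) * 20
}

/// Planned size of an f16 tensor file: header plus 2 bytes per element.
pub fn f16_bytes(header_bytes: u64, numel: u64) -> u64 {
    header_bytes + numel * 2
}

/// Hypothetical per-tensor plan: element count and whether it gets quantized.
pub struct TensorPlan {
    pub numel: u64,
    pub quantize: bool,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct Report {
    pub planned_quantized: usize,
    pub kept_f16: usize,
    pub total_bytes_out: u64,
}

/// Run the same loop in both modes; only the side effect is gated on `!dry_run`.
/// `write` stands in for the real disk writers and receives (tensor index, bytes).
pub fn convert(
    tensors: &[TensorPlan],
    header_bytes: u64,
    dry_run: bool,
    mut write: impl FnMut(usize, u64),
) -> Report {
    let mut report = Report::default();
    for (i, t) in tensors.iter().enumerate() {
        let bytes = if t.quantize {
            report.planned_quantized += 1;
            q4_bytes(header_bytes, t.numel)
        } else {
            report.kept_f16 += 1;
            f16_bytes(header_bytes, t.numel)
        };
        report.total_bytes_out += bytes;
        if !dry_run {
            write(i, bytes);
        }
    }
    report
}
```

Because the byte count is computed before the gate, a dry run and a real run over the same plan return equal `Report` values, which is the property the parity tests assert.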
What
QuaRot quantization: on-disk SafetensorsFile byte accounting, a GB size report, and a dry-run quant_done event with an honest-nil ratio.
Why
The Quantize surface reports how large the quantized artifact is and whether a dry run would shrink it. Byte accounting reads the on-disk SafetensorsFile rather than estimating. When a real ratio is not computable (dry run), the event omits it rather than fabricating a number (honest-nil).
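A minimal sketch of the honest-nil idea: the ratio is an `Option`, and the event leaves the field out when it cannot be computed. The JSON field names (`size_gb`, `ratio`), the function names, and the decimal-GB conversion are assumptions for illustration; the actual event schema is not shown in this PR description.

```rust
/// Assumption: decimal GB (1e9 bytes).
const BYTES_PER_GB: f64 = 1e9;

/// Compression ratio (input bytes / output bytes), or `None` when either
/// size is unknown or zero. No placeholder value is ever substituted.
pub fn compression_ratio(bytes_in: Option<u64>, bytes_out: Option<u64>) -> Option<f64> {
    match (bytes_in, bytes_out) {
        (Some(i), Some(o)) if i > 0 && o > 0 => Some(i as f64 / o as f64),
        _ => None,
    }
}

/// Build a `quant_done` event as a JSON string. Unknown values are omitted
/// rather than reported as 0 or 1. Field names are hypothetical.
pub fn quant_done_event(bytes_in: Option<u64>, bytes_out: Option<u64>) -> String {
    let mut fields = vec![r#""event":"quant_done""#.to_string()];
    if let Some(o) = bytes_out {
        fields.push(format!(r#""size_gb":{:.3}"#, o as f64 / BYTES_PER_GB));
    }
    if let Some(r) = compression_ratio(bytes_in, bytes_out) {
        fields.push(format!(r#""ratio":{:.2}"#, r));
    }
    format!("{{{}}}", fields.join(","))
}
```

A consumer can then tell "ratio unknown" (field absent) apart from a measured ratio, which a sentinel number would not allow.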
Files
crates/inference/src/quant/quarot/convert.rs
crates/inference/src/quant/quarot/io.rs
crates/inference/src/bin/quantize_quarot.rs
(+95/-15 across the three)
Verification
cargo build --release -p lattice-inference --bin quantize_quarot clean. Built green in the integrated-tree gate.
Bench
QuaRot conversion is not on the decode hot path and no Criterion harness covers it.
make bench-compare's comparator errored while assembling the delta (known two-worktree fragility; base benches ran clean). The change is bench-neutral by construction.
Series
Part of the PR #193 engine-slice (finest split). All engine code lands on main; the macOS app surfaces a subset (Models + Chat) for v0.0.1.