Fast Rust-native quantization for neural-network weights, with Python bindings via pyo3.
Built primarily for compression research where the post-quantization byte stream is fed through a second-stage entropy coder (zstd, brotli, lzma). The crate's design assumes that byte-level redundancy is what the second stage exploits — so quantization that produces fewer distinct byte values is preferable to denser packing schemes that produce near-uniform byte distributions.
| Module | What it does |
|---|---|
quantize.rs |
Symmetric int6/int8 scalar quantization with per-row scales |
lloyd_max.rs |
Lloyd-Max optimal scalar codebook (iterative MSE-minimal) |
residual_quant.rs |
Multi-stage residual codebooks |
product_quant.rs |
Product quantization (independent codebooks per sub-vector) |
gptq.rs |
GPTQ Hessian-based rounding |
ngram_cache.rs |
Score-first n-gram cache for eval-time mixing |
compress.rs |
zstd wrapper with bit-packing utilities |
lib.rs |
Python bindings (pyo3) |
# Rust-only
cargo build --release
# Python wheel via maturin
pip install maturin
maturin build --release --interpreter python3
pip install --force-reinstall target/wheels/model_compress_py-*.whlimport model_compress_py as mcp
# Symmetric int6 with per-row scales
packed, scales = mcp.quantize_int6_symmetric(weights_2d)
# Score-first n-gram cache
cache = mcp.NgramCachePy(order=7, vocab_size=8192)
cache.update(context, next_token)
log_p = cache.log_prob(context, next_token)Two results from research using this crate worth flagging:
-
MSE is not BPB. Codebook methods (Lloyd-Max, Residual, Product Quant) can produce better MSE than scalar int8 yet substantially worse end-to-end perplexity in language models. Correlated quantization errors break in-context computation in a way uncorrelated scalar noise does not.
-
Compression beats packing. Bit-packing int6 weights densely (6 bits/value) gives a 25% raw size reduction, but the resulting near-uniform byte distribution compresses ~32% worse with zstd/brotli. Net artifact size is 11–20% larger than int8-stored int6, and the gap grows with model size. For any dense entropy-coded payload, byte-level redundancy beats packing density.
These findings shaped the design of related research in OpenAI Parameter Golf (compression-aware QAT). See the writeup linked from the parent organization.
Active research code, not a stable API. Versioning is best-effort.
MIT — see LICENSE.