model-compress

Fast Rust-native quantization for neural-network weights, with Python bindings via pyo3.

Built primarily for compression research where the post-quantization byte stream is fed through a second-stage entropy coder (zstd, brotli, lzma). The crate's design assumes that byte-level redundancy is what the second stage exploits — so quantization that produces fewer distinct byte values is preferable to denser packing schemes that produce near-uniform byte distributions.

Modules

| Module | What it does |
| --- | --- |
| `quantize.rs` | Symmetric int6/int8 scalar quantization with per-row scales |
| `lloyd_max.rs` | Lloyd-Max optimal scalar codebook (iterative MSE-minimal) |
| `residual_quant.rs` | Multi-stage residual codebooks |
| `product_quant.rs` | Product quantization (independent codebooks per sub-vector) |
| `gptq.rs` | GPTQ Hessian-based rounding |
| `ngram_cache.rs` | Score-first n-gram cache for eval-time mixing |
| `compress.rs` | zstd wrapper with bit-packing utilities |
| `lib.rs` | Python bindings (pyo3) |
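To make the "iterative MSE-minimal" description of the Lloyd-Max codebook concrete, here is a minimal pure-Python sketch of the textbook 1-D iteration. It is not the code in `lloyd_max.rs`; the function names `lloyd_max` and `mse`, the quantile initialisation, and the synthetic Gaussian weights are all assumptions made for illustration.

```python
import random

def lloyd_max(samples, levels, iters=50):
    """Textbook 1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    xs = sorted(samples)
    n = len(xs)
    # Assumed initialisation: evenly spaced quantiles of the data.
    codebook = [xs[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        for x in xs:
            # Assign each sample to its nearest codebook level.
            k = min(range(levels), key=lambda j: (x - codebook[j]) ** 2)
            sums[k] += x
            counts[k] += 1
        # Move each level to the mean of its cell; leave empty cells unchanged.
        new = [sums[k] / counts[k] if counts[k] else codebook[k] for k in range(levels)]
        if new == codebook:
            break
        codebook = new
    return codebook

def mse(samples, codebook):
    """Mean squared error when every sample snaps to its nearest level."""
    return sum(min((x - c) ** 2 for c in codebook) for x in samples) / len(samples)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in for real weights
codebook = lloyd_max(weights, levels=8)
print([round(c, 3) for c in codebook], mse(weights, codebook))
```

Each iteration can only lower (or keep) the MSE on the training samples, which is why the method is described as MSE-minimal. It converges to a local optimum of that objective, not necessarily to the codebook that is best for downstream model quality.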

Build

# Rust-only
cargo build --release

# Python wheel via maturin
pip install maturin
maturin build --release --interpreter python3
pip install --force-reinstall target/wheels/model_compress_py-*.whl

Use

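import numpy as np
import model_compress_py as mcp

# Placeholder input; a 2-D float32 NumPy array is an assumption about the accepted type
weights_2d = np.random.randn(4, 64).astype(np.float32)

# Symmetric int6 with per-row scales
packed, scales = mcp.quantize_int6_symmetric(weights_2d)

# Score-first n-gram cache (the token ids below are placeholders)
context, next_token = [1, 2, 3, 4, 5, 6], 7
cache = mcp.NgramCachePy(order=7, vocab_size=8192)
cache.update(context, next_token)
log_p = cache.log_prob(context, next_token)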
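For readers who want to see what "symmetric int6 with per-row scales" means numerically, the following pure-Python sketch shows one common formulation. It is not the crate's implementation. The helper names `quantize_symmetric` and `dequantize`, the code range of -31..31, and the max-abs scale rule are assumptions, and the sketch returns unpacked integer codes, not whatever layout `packed` uses.

```python
def quantize_symmetric(rows, bits=6):
    """Quantize each row to signed integers in [-qmax, qmax] with its own scale."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6 (assumed symmetric range)
    codes, scales = [], []
    for row in rows:
        amax = max(abs(w) for w in row)
        # Assumed rule: map the largest magnitude in the row to qmax.
        scale = amax / qmax if amax > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return codes, scales

def dequantize(codes, scales):
    """Reconstruct approximate weights: code times that row's scale."""
    return [[q * s for q in row] for row, s in zip(codes, scales)]

weights = [[0.5, -1.0, 0.25, 0.0], [0.02, 0.01, -0.04, 0.03]]
codes, scales = quantize_symmetric(weights)
restored = dequantize(codes, scales)
print(codes)
print(scales)
```

Because each row has its own scale, a row of small weights still uses the full integer range, and the rounding error per weight is bounded by half of that row's scale.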

Empirical findings

Two results from research with this crate are worth flagging:

1. MSE is not BPB (bits per byte). Codebook methods (Lloyd-Max, residual, and product quantization) can achieve lower MSE than scalar int8 yet yield substantially worse end-to-end perplexity in language models. Correlated quantization errors break in-context computation in a way that uncorrelated scalar noise does not.
2. Compression beats packing. Bit-packing int6 weights densely (6 bits/value instead of 8) cuts raw size by 25%, but the resulting near-uniform byte distribution compresses ~32% worse with zstd/brotli. The net artifact is 11–20% larger than int6 stored one value per int8 byte, and the gap grows with model size. For any dense payload headed into an entropy coder, byte-level redundancy beats packing density.

These findings shaped the design of related research in OpenAI Parameter Golf (compression-aware QAT). See the writeup linked from the parent organization.
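The packing trade-off can be explored with a small stdlib-only experiment. This is a sketch under stated assumptions, not the benchmark behind the numbers above: `zlib` stands in for zstd/brotli, the codes are synthetic (rounded Gaussian with a standard deviation of 8 levels), and `pack_int6` / `unpack_int6` are hypothetical helpers, not the utilities in `compress.rs`.

```python
import random
import zlib

def pack_int6(codes):
    """Pack signed 6-bit codes (-32..31) densely: 4 values per 3 bytes."""
    out = bytearray()
    acc, nbits = 0, 0
    for q in codes:
        acc = (acc << 6) | (q & 0x3F)  # two's-complement low 6 bits
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # zero-pad the final byte
    return bytes(out)

def unpack_int6(data, count):
    """Inverse of pack_int6; count is needed to ignore padding bits."""
    codes = []
    acc, nbits = 0, 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 6 and len(codes) < count:
            nbits -= 6
            u = (acc >> nbits) & 0x3F
            acc &= (1 << nbits) - 1
            codes.append(u - 64 if u >= 32 else u)  # sign-extend from 6 bits
    return codes

random.seed(0)
# Synthetic int6 codes; real quantized weights may be distributed differently.
codes = [max(-31, min(31, round(random.gauss(0.0, 8.0)))) for _ in range(60000)]
stored = bytes(q & 0xFF for q in codes)  # one int6 code per byte (int8 storage)
packed = pack_int6(codes)                # dense 6 bits/value

for name, blob in (("int8-stored", stored), ("bit-packed", packed)):
    print(name, "raw:", len(blob), "compressed:", len(zlib.compress(blob, 9)))
```

The raw sizes always differ by exactly 25%. How the compressed sizes compare depends on the code distribution and the compressor, so treat the printed numbers as a way to probe the effect on your own data, not as a reproduction of the figures reported above.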

Status

Active research code, not a stable API. Versioning is best-effort.

License

MIT — see LICENSE.
