3DCF-Labs/model-compress

model-compress

Fast Rust-native quantization for neural-network weights, with Python bindings via pyo3.

Built primarily for compression research where the post-quantization byte stream is fed through a second-stage entropy coder (zstd, brotli, lzma). The crate's design assumes that byte-level redundancy is what the second stage exploits — so quantization that produces fewer distinct byte values is preferable to denser packing schemes that produce near-uniform byte distributions.

Modules

| Module | What it does |
| --- | --- |
| quantize.rs | Symmetric int6/int8 scalar quantization with per-row scales |
| lloyd_max.rs | Lloyd-Max optimal scalar codebook (iterative MSE-minimal) |
| residual_quant.rs | Multi-stage residual codebooks |
| product_quant.rs | Product quantization (independent codebooks per sub-vector) |
| gptq.rs | GPTQ Hessian-based rounding |
| ngram_cache.rs | Score-first n-gram cache for eval-time mixing |
| compress.rs | zstd wrapper with bit-packing utilities |
| lib.rs | Python bindings (pyo3) |
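To make "symmetric scalar quantization with per-row scales" concrete, here is a minimal pure-Python sketch of the idea. It is not the crate's implementation: the function names are hypothetical, and the symmetric code range of [-31, 31] for int6 is an assumption (the crate may use the full [-32, 31] range or a different rounding rule).

```python
def quantize_symmetric_rows(rows, bits=6):
    """Quantize each row to signed codes in [-qmax, qmax] with one scale per row."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 127 for int8 (assumed symmetric range)
    codes, scales = [], []
    for row in rows:
        max_abs = max((abs(x) for x in row), default=0.0)
        # All-zero rows get scale 1.0 so the division below is always defined.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes.append([max(-qmax, min(qmax, round(x / scale))) for x in row])
        scales.append(scale)
    return codes, scales


def dequantize_rows(codes, scales):
    """Map integer codes back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(codes, scales)]


weights = [[0.10, -0.31, 0.05, 0.00],
           [2.00, -1.00, 0.50, 0.25]]
codes, scales = quantize_symmetric_rows(weights, bits=6)
restored = dequantize_rows(codes, scales)
```

Because each row has its own scale, the largest-magnitude entry of every row maps to the extreme code, and the reconstruction error of any entry is at most half that row's scale.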

Build

# Rust-only
cargo build --release

# Python wheel via maturin
pip install maturin
maturin build --release --interpreter python3
pip install --force-reinstall target/wheels/model_compress_py-*.whl

Use

import model_compress_py as mcp

# Symmetric int6 with per-row scales
packed, scales = mcp.quantize_int6_symmetric(weights_2d)

# Score-first n-gram cache
cache = mcp.NgramCachePy(order=7, vocab_size=8192)
cache.update(context, next_token)
log_p = cache.log_prob(context, next_token)
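For readers unfamiliar with n-gram caches, the sketch below shows one plausible reading of "score-first": each token is scored against the cache before the cache is updated with it, so the cache never sees the token it is asked to predict. This is a hypothetical pure-Python illustration, not the crate's `NgramCachePy`; the class name, the add-one smoothing, and the use of only the last `order - 1` context tokens are all assumptions.

```python
import math
from collections import defaultdict


class NgramCacheSketch:
    """Hypothetical count-based n-gram cache (not the crate's NgramCachePy)."""

    def __init__(self, order, vocab_size):
        self.order = order
        self.vocab_size = vocab_size
        # counts[context_key][token] = times `token` followed `context_key`
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def _key(self, context):
        # An order-n model conditions on the previous n-1 tokens.
        n = self.order - 1
        return tuple(context[-n:]) if n > 0 else ()

    def log_prob(self, context, token):
        key = self._key(context)
        seen = self.counts.get(key, {}).get(token, 0)
        total = self.totals.get(key, 0)
        # Add-one smoothing (an assumption) keeps unseen tokens at finite log-prob.
        return math.log((seen + 1) / (total + self.vocab_size))

    def update(self, context, token):
        key = self._key(context)
        self.counts[key][token] += 1
        self.totals[key] += 1


def score_stream(tokens, cache):
    """Sum log-probs over a token stream, scoring each token before learning it."""
    total = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[:i]
        total += cache.log_prob(ctx, tok)  # score first
        cache.update(ctx, tok)             # then update
    return total


cache = NgramCacheSketch(order=2, vocab_size=4)
tokens = [0, 1, 0, 1, 0, 1]
total_log_prob = score_stream(tokens, cache)
```

In an eval-time mixing setup, a log-probability like this would typically be interpolated with the model's own prediction; how the crate does that mixing is not specified here.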

Empirical findings

Two results from research using this crate are worth flagging:

  1. MSE is not BPB (bits per byte). Codebook methods (Lloyd-Max, residual, product quantization) can achieve lower MSE than scalar int8 yet give substantially worse end-to-end perplexity in language models. Correlated quantization errors break in-context computation in a way that uncorrelated scalar noise does not.

  2. Compression beats packing. Bit-packing int6 weights densely (6 bits/value) gives a 25% raw size reduction, but the resulting near-uniform byte distribution compresses ~32% worse with zstd/brotli. The net artifact is 11–20% larger than the same int6 values stored one per byte (int8-stored int6), and the gap grows with model size. For any dense entropy-coded payload, byte-level redundancy beats packing density.
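The packing comparison in the second finding can be tried on a small scale with the sketch below. It is a toy under stated assumptions: the values are synthetic (a clamped Gaussian, not real model weights), `zlib` from the Python standard library stands in for zstd/brotli, and the helper names are hypothetical rather than taken from `compress.rs`. The figures quoted above come from real weights, so the sizes printed here will differ.

```python
import random
import zlib


def pack_int6(values):
    """Pack signed int6 values (-32..31) densely: 4 values into 3 bytes."""
    assert len(values) % 4 == 0
    out = bytearray()
    for i in range(0, len(values), 4):
        # Keep the low 6 bits of each value (two's complement).
        a, b, c, d = (v & 0x3F for v in values[i:i + 4])
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_int6(data):
    """Inverse of pack_int6: recover the signed int6 values."""
    values = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            u = (word >> shift) & 0x3F
            values.append(u - 64 if u >= 32 else u)  # restore the sign
    return values


def store_int8(values):
    """Store each int6 value in its own byte (two's complement)."""
    return bytes(v & 0xFF for v in values)


rng = random.Random(0)
# Synthetic stand-in for quantized weights: bell-shaped, clamped to the int6 range.
values = [max(-32, min(31, round(rng.gauss(0.0, 4.0)))) for _ in range(40_000)]

packed = pack_int6(values)
bytewise = store_int8(values)
sizes = {
    "packed raw": len(packed),
    "int8-stored raw": len(bytewise),
    "packed + zlib": len(zlib.compress(packed, 9)),
    "int8-stored + zlib": len(zlib.compress(bytewise, 9)),
}
for name, size in sizes.items():
    print(f"{name:>20}: {size} bytes")
```

The quantity to compare is the two compressed sizes, since that is what ships. Swapping in real quantized weights and the actual second-stage coder is the only way to check the percentages reported above.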

These findings shaped the design of related research in OpenAI Parameter Golf (compression-aware QAT). See the writeup linked from the parent organization.

Status

Active research code, not a stable API. Versioning is best-effort.

License

MIT — see LICENSE.
