PyTorch SDK for TurboQuant — near-optimal vector quantization with zero metadata overhead.
Based on TurboQuant from Google Research (ICLR 2026; full citation below).
Reference implementation: mlx-vlm PR #858
TurboQuant compresses vectors (KV cache embeddings, search indices) to 2-4 bits with:
- Zero calibration — no training data needed (data-oblivious)
- Zero metadata overhead — a 4-byte seed replaces all per-block scales/zero-points
- Near-optimal compression — within 2.7x of the information-theoretic lower bound
- Unbiased inner products — critical for attention score accuracy
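The "unbiased inner products" property comes from the QJL sign trick in stage [4] of the pipeline below. Here is a minimal self-contained sketch of the estimator idea in plain PyTorch — not the library's code — using the fact that for standard normal z, E[sign(z)·z] = sqrt(2/π):

```python
import math

import torch

torch.manual_seed(0)
d, m = 16, 200_000

r = torch.randn(d)     # vector to compress
q = torch.randn(d)     # full-precision query
P = torch.randn(m, d)  # random Gaussian projections (seeded, so reproducible)

# Keep only 1 bit per projection: the sign of P @ r
signs = torch.sign(P @ r)

# Unbiased estimate of <r, q>: E[sign(p.r) * (p.q)] = sqrt(2/pi) * <r_hat, q>
est = math.sqrt(math.pi / 2) * (signs * (P @ q)).mean() * r.norm()
true = r @ q
print(f"true={true.item():.3f}  estimate={est.item():.3f}")
```

With enough projections the estimate concentrates around the true inner product; the real codec spends only one sign bit per coordinate and combines it with the quantized codes.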
```
Input vector x (e.g., a KV cache key, dim=128)
      │
      ▼
[1] Normalize:        x̂ = x / ‖x‖          (store ‖x‖ as a scalar)
      │
      ▼
[2] Rotate:           x' = R · x̂           (R = random orthogonal from seed)
      │  After rotation, coordinates follow a Beta distribution
      │  (known analytically → no calibration needed)
      ▼
[3] Scalar quantize:  qᵢ = LloydMax(x'ᵢ)   (precomputed codebook, b-1 bits)
      │
      ▼
[4] QJL correction:   residual = x̂ - dequant(q)
      │               signs    = sign(P · residual)   (P = random Gaussian from seed)
      │               (1 extra bit for unbiased inner products)
      ▼
Stored: ‖x‖ (32-bit) + q (b-1 bits × d) + ‖residual‖ (32-bit) + signs (d bits)
```

Total overhead: 0.5 bits/element (vs 1.5-2 bits for GPTQ/KIVI)
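The four stages can be sketched in plain PyTorch. This is a toy sketch, not the library's implementation: it substitutes a uniform mid-rise quantizer for the precomputed Lloyd-Max codebook, keeps codes as float tensors rather than packed bits, and handles a single vector:

```python
import torch

def turboquant_sketch(x, bits=4, seed=42):
    """Toy version of the four pipeline stages for one vector."""
    d = x.shape[-1]
    g = torch.Generator().manual_seed(seed)

    # [1] Normalize (store the norm as a 32-bit scalar)
    norm = x.norm()
    x_hat = x / norm

    # [2] Rotate with a seeded random orthogonal matrix (QR of a Gaussian);
    #     the seed alone reproduces R at dequantize time — zero metadata.
    R, _ = torch.linalg.qr(torch.randn(d, d, generator=g))
    x_rot = R @ x_hat

    # [3] Scalar-quantize each coordinate with b-1 bits. Stand-in: a uniform
    #     mid-rise quantizer over ±3/sqrt(d), where rotated unit-vector
    #     coordinates concentrate (the real codec uses a Lloyd-Max codebook).
    levels = 2 ** (bits - 1)
    lim = 3.0 / d ** 0.5
    step = 2 * lim / levels
    q = ((x_rot + lim) / step).floor().clamp(0, levels - 1)  # (b-1)-bit codes
    dequant = R.T @ ((q + 0.5) * step - lim)                 # back to input basis

    # [4] QJL correction: one sign bit per coordinate of the residual,
    #     from a seeded random Gaussian projection P.
    residual = x_hat - dequant
    P = torch.randn(d, d, generator=g)
    signs = torch.sign(P @ residual)

    # Stored state: norm (32-bit), q, residual norm (32-bit), signs
    return norm, q, residual.norm(), signs, dequant

x = torch.randn(128)
norm, q, res_norm, signs, dequant = turboquant_sketch(x)
cos = torch.nn.functional.cosine_similarity(x, norm * dequant, dim=0).item()
print(f"cosine similarity after 4-bit sketch: {cos:.3f}")
```

Because the uniform codebook is a crude stand-in, the sketch's reconstruction quality is below the 99.5% the real MSECodec reports, but the stage structure and stored state are the same.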
```
pip install torch numpy
```

No other dependencies. Clone and use directly:

```
git clone https://github.com/GenauraApp/TurboQuant.git
cd TurboQuant
python tests/test_real.py  # Verify everything works
```

```python
from turboquant.core import MSECodec

codec = MSECodec(dim=128, bits=4, seed=42)

# Quantize
state = codec.quantize(vectors)          # vectors: [n, 128] float tensor

# Reconstruct
reconstructed = codec.dequantize(state)  # [n, 128] — 99.5% cosine similarity
```

```python
from turboquant.core import ProdCodec

codec = ProdCodec(dim=128, bits=4, seed=42)

# Quantize keys
state = codec.quantize(keys)          # keys: [n, 128]

# Compute attention scores directly from quantized keys
scores = codec.score(queries, state)  # [n_queries, n_keys] — unbiased
```

```python
import math

import torch

from turboquant.core import ProdCodec, MSECodec

dim, bits = 128, 4
key_codec = ProdCodec(dim, bits, seed=0)
val_codec = MSECodec(dim, bits, seed=1)

# Quantize KV cache
key_state = key_codec.quantize(keys)    # keys:   [seq_len, dim]
val_state = val_codec.quantize(values)  # values: [seq_len, dim]

# Attention (no full dequantize needed for keys)
scale = 1.0 / math.sqrt(dim)
scores = key_codec.score(queries, key_state) * scale
weights = torch.softmax(scores, dim=-1)
output = weights @ val_codec.dequantize(val_state)
```

```python
from transformers import AutoModelForCausalLM

from turboquant.core import ProdCodec, MSECodec

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Patch the KV cache after prefill
for layer in model.model.layers:
    attn = layer.self_attn
    # After the model generates keys/values, quantize them:
    #   key_state = ProdCodec(dim=128, bits=4).quantize(keys)
    #   val_state = MSECodec(dim=128, bits=4).quantize(values)
    # This replaces the FP16 cache with a 3.56x smaller quantized state
```

For a complete KV cache integration, see the mlx-vlm reference, which implements TurboQuantKVCache as a drop-in replacement.
```python
import torch

from turboquant.core import ProdCodec

codec = ProdCodec(dim=200, bits=4, seed=42)

# Index building: O(n) — no k-means training needed
states = codec.quantize(database_vectors)

# Search: asymmetric scoring (query is full precision)
scores = codec.score(query.unsqueeze(0), states)
top_k = torch.topk(scores.squeeze(0), k=10)
```

| Use case | Codec | Why |
|---|---|---|
| KV cache keys | ProdCodec | Unbiased inner products for attention scores |
| KV cache values | MSECodec | Best reconstruction (weighted sum after softmax) |
| Vector search index | ProdCodec | Unbiased scoring for retrieval |
| Embedding compression | MSECodec | Minimum reconstruction error |
All results verified via Qwen CLI, independent scripts, and Claude python-reviewer.
| Metric | Result |
|---|---|
| MSE distortion (4-bit) | 0.009 relative (within 2.3x of theoretical limit) |
| Inner product bias | 0.012% (effectively zero) |
| 4-bit cosine similarity | 99.5% reconstruction, 96-98% attention fidelity |
| 3-bit cosine similarity | 98.3% reconstruction, 91-95% attention fidelity |
| Compression vs FP16 | 3.56x (4-bit) to 6.40x (2-bit) |
| Encode throughput (CPU) | 700K-1M vectors/sec (d=128) |
| Calibration required | None |
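As a sanity check, the compression ratios in the table follow directly from the storage layout shown in the pipeline: (b-1) codebook bits plus one QJL sign bit per element, plus two 32-bit scalars amortized over d=128 elements:

```python
# Derive the table's compression ratios from the storage layout:
# (b-1) codebook bits + 1 sign bit per element, plus two 32-bit
# scalars (vector norm, residual norm) amortized over d elements.
d = 128
for b in (4, 3, 2):
    bits_per_elem = (b - 1) + 1 + 64 / d  # = b + 0.5 when d = 128
    ratio = 16 / bits_per_elem            # vs FP16
    print(f"{b}-bit: {bits_per_elem:.1f} bits/elem, {ratio:.2f}x vs FP16")
# 4-bit: 4.5 bits/elem gives 3.56x; 2-bit: 2.5 bits/elem gives 6.40x
```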
```bibtex
@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2026}
}

@inproceedings{zandieh2024qjl,
  title={QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead},
  author={Zandieh, Amir and Daliri, Majid and Han, Insu},
  booktitle={AAAI},
  year={2024}
}

@inproceedings{han2026polarquant,
  title={PolarQuant: Quantizing KV Caches with Polar Transformation},
  author={Han, Insu and Kacham, Praneeth and Karbasi, Amin and Mirrokni, Vahab and Zandieh, Amir},
  booktitle={AISTATS},
  year={2026}
}
```

Apache 2.0