# NexusQuant
Compress your LLM's KV cache 10-33x. Training-free. One line of code.
Early-stage research project. Results validated on Mistral-7B and Phi-3-mini only. NIAH testing shows factual recall degrades under compression (40% recall at 35% eviction). Not production-ready. Contributions and feedback welcome.
Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.
```bash
pip install nexusquant-kv
pip install "nexusquant-kv[hf]"  # with HuggingFace transformers
```

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```

| Without NexusQuant | With NexusQuant |
|---|---|
| 128K context on 70B = ~42 GB KV cache (GQA) | Same context = ~2.5 GB KV cache (17x) |
| KV cache competes with model weights for VRAM | KV cache fits comfortably alongside weights |
| Long context needs multi-GPU or offloading | Single GPU, single machine |
| Deploy a fine-tuned retrieval model | One `with` block, no code changes |
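The ~42 GB figure above can be reproduced from the cache geometry. A back-of-the-envelope sketch, assuming a Llama-70B-like layout (80 layers, 8 KV heads under GQA, head_dim 128, fp16):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elt=2):
    # 2 tensors (K and V) per layer; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt * n_tokens

# assumed Llama-70B-class geometry: 80 layers, 8 KV heads (GQA), head_dim 128
gb = kv_cache_bytes(80, 8, 128, 128 * 1024) / 1e9
print(f"{gb:.1f} GB")  # ~43 GB, in line with the ~42 GB figure above
```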
Measured on Mistral-7B, Phi-3-mini, Qwen2.5-7B. Compression ratios include all overhead.
| Preset | Compression | PPL degradation | Config |
|---|---|---|---|
| `high` | ~9x | <0.5% | K3V2 + real scorer + 35% evict |
| `asym` | ~14x | ~1% | K3V2 + 60% evict |
| `balanced` | ~17x | ~1.3% | K2V2 + 60% evict |
| `max` | ~33x | +0.66% | K2V2 + real scorer + 80% evict |
Important: PPL alone does not tell the full quality story. Our NIAH (needle-in-a-haystack) tests show 40% factual recall at K3V2-35% despite only +0.82% PPL. The key-key scorer can evict early-context tokens containing critical facts. Multi-window scoring (first-16 + last-16 queries) improves NIAH to 80% and PPL to +0.57% at 35% eviction. See Limitations below.
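A minimal sketch of the multi-window idea, illustrative only (not the library's scorer); the raw scaled q·k logit stands in for attention weight:

```python
import math

def importance_scores(queries, keys, first=16, last=16):
    # score each key by its best scaled dot-product logit against queries
    # drawn from the first and last windows of the sequence, so that
    # early-context facts are not evicted just because recent queries ignore them
    d = len(keys[0])
    windows = queries[:first] + queries[-last:]
    return [
        max(sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for q in windows)
        for k in keys
    ]

# a key aligned with the query direction outranks an orthogonal one
qs = [[1.0, 0.0]] * 32
print(importance_scores(qs, [[1.0, 0.0], [0.0, 1.0]]))
```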
Attention sharpening (scaling keys by sqrt(1.05) after quantization) gives a +0.05pp quality improvement at no compression cost. Enabled by default.
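Why scaling keys sharpens attention: multiplying every key by sqrt(1.05) multiplies every q·k logit by the same factor, which acts like a softmax temperature slightly below 1 and concentrates weight on already-dominant tokens. A toy illustration:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

logits = [2.0, 1.0, 0.5]                       # q·k scores before sharpening
sharp = [x * math.sqrt(1.05) for x in logits]  # keys scaled by sqrt(1.05)

# the top token's attention weight increases slightly
print(max(softmax(logits)), max(softmax(sharp)))
```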
| Model | K2V2 35% | K3V2 35% | K2V2 60% | K3V2 60% |
|---|---|---|---|---|
| Mistral-7B (GQA 8:1) | +0.91% | +0.82% | +1.64% | +1.22% |
| Phi-3-mini (d=96) | +0.82% | +0.59% | +2.81% | +1.10% |
| Qwen2.5-7B | catastrophic | catastrophic | catastrophic | catastrophic |
| Qwen2.5-7B + boundary(2) | +7.9% | +8.7% | +23.8% | +23.3% |
Note: Qwen-family models require `protect_boundary=2` (first/last 2 layers at FP16). Mistral and Phi-3 work without it.
- Importance scoring - rank tokens by attention weight. Two options: a key-key proxy (fast, no extra pass) or a real attention scorer (uses `attn_implementation='eager'`; zero quality loss at 35% eviction)
- Token eviction - drop the lowest-scoring tokens; always keep BOS and a recent sliding window
- RoPE removal - undo rotary embeddings on keys so they share a common subspace
- Hadamard rotation - spread energy uniformly across dimensions (handles non-power-of-2 head dims via zero-padding)
- E8 lattice quantization - quantize 8-float groups onto the E8 root lattice. Asymmetric: 3-bit keys + 2-bit values (keys need more precision due to softmax amplification)
- Boundary protection - optionally keep first/last N layers at FP16 (mandatory for Qwen-family)
- Delta coding + zstd - consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x
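For intuition on the quantization step, the textbook nearest-point decoder for E8 (an illustrative sketch, not the library's implementation) decodes to the two cosets D8 and D8 + 1/2 and keeps the closer point:

```python
def nearest_Dn(x):
    # Nearest point of D_n (integer vectors with even coordinate sum):
    # round each coordinate; if the sum is odd, re-round the coordinate
    # with the largest rounding error in the other direction.
    r = [round(v) for v in x]
    if sum(r) % 2 != 0:
        errs = [v - rv for v, rv in zip(x, r)]
        i = max(range(len(x)), key=lambda j: abs(errs[j]))
        r[i] += 1 if errs[i] > 0 else -1
    return r

def nearest_E8(x):
    # E8 = D8 ∪ (D8 + 1/2): decode to both cosets, keep the closer point
    a = nearest_Dn(x)
    b = [v + 0.5 for v in nearest_Dn([v - 0.5 for v in x])]
    da = sum((v - w) ** 2 for v, w in zip(x, a))
    db = sum((v - w) ** 2 for v, w in zip(x, b))
    return a if da <= db else b

print(nearest_E8([0.9] * 8))  # → [1, 1, 1, 1, 1, 1, 1, 1]
```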
Token eviction reduces token count (2.5x at 60% eviction: only 40% of tokens remain). E8 quantization reduces bits per remaining token (~7x after entropy coding). Combined: 2.5 × 7 ≈ 17x.
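The delta + entropy-coding step can be illustrated in a few lines, using stdlib `zlib` as a stand-in for zstd and a synthetic slowly-drifting index stream:

```python
import random
import zlib  # stand-in for zstd, which is not in the stdlib

def delta_encode(indices):
    # neighboring tokens yield similar lattice indices, so deltas
    # concentrate near zero and entropy-code far better than raw values
    out, prev = [], 0
    for v in indices:
        out.append((v - prev) % 256)  # wrap each delta into one byte
        prev = v
    return bytes(out)

# toy index stream: a slow random walk, mimicking smoothly drifting indices
random.seed(0)
indices, v = [], 128
for _ in range(4096):
    v = (v + random.choice((-1, 0, 1))) % 256
    indices.append(v)

raw = zlib.compress(bytes(indices), 9)
delta = zlib.compress(delta_encode(indices), 9)
print(len(delta) < len(raw))  # deltas compress noticeably better
```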
| Method | Compression | PPL degradation | Training required | Notes |
|---|---|---|---|---|
| NexusQuant (K3V2+scorer) | 9-33x | +0.0-0.66% | No | Includes eviction |
| NexusQuant (K2V2) | 10-33x | +0.4-2.6% | No | Includes eviction |
| TurboQuant+ | 3.8-6.4x | ~0-1% | No | Quant-only, no eviction |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) | |
| CommVQ (Apple) | ~8x | ~0% | Yes (retraining) | |
| Palu | 11x | ~25% rel | Yes (calibration) | |
NexusQuant ratios include token eviction (10-80% of tokens removed). TurboQuant+ ratios are pure quantization without eviction - not directly comparable. Competitor numbers from their papers.
Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):
- Llama family (Llama-2, Llama-3, Llama-3.1)
- Mistral / Mixtral
- Qwen
- Phi
- Gemma
Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
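Split-half RoPE pairs dimension i with dimension i + d/2, so removal is simply a rotation by the negative angle. A self-contained sketch of the round-trip (illustrative, not the library's code):

```python
import math

def rope_angles(positions, head_dim, theta=10000.0):
    # one angle per (position, dimension-pair), split-half convention
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    return [[p * f for f in inv_freq] for p in positions]

def apply_rope(x, angles, sign=1.0):
    # split-half layout: dimension i rotates together with dimension i + d/2
    h = len(x) // 2
    out = [0.0] * len(x)
    for i in range(h):
        c, s = math.cos(angles[i]), sign * math.sin(angles[i])
        out[i] = x[i] * c - x[i + h] * s
        out[i + h] = x[i + h] * c + x[i] * s
    return out

def remove_rope(x, angles):
    # undoing the rotation = rotating by the negative angle
    return apply_rope(x, angles, sign=-1.0)

key = [0.3, -1.2, 0.7, 2.0, -0.5, 0.1, 1.5, -0.9]
ang = rope_angles([7], 8)[0]  # position 7, head_dim 8
restored = remove_rope(apply_rope(key, ang), ang)
print(all(abs(a - b) < 1e-9 for a, b in zip(key, restored)))  # True
```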
Graduated layer bit profile - gives boundary layers (first/last 15%) higher precision (3-bit K+V) while middle layers use standard asymmetric (K3V2). Small but consistent quality win (~0.02pp on Mistral-7B). GPU-validated.
```python
with nexusquant_evict(model, quality="high", layer_bit_profile="graduated"):
    output = model.generate(input_ids, max_new_tokens=200)
```

Hybrid model compression - for models like Gemma4 with sliding-window + global attention layers, only compress the global layers (which scale with context). SWA layers have a fixed memory cost.
```python
with nexusquant_evict(model, compress_layers="global_only"):
    output = model.generate(input_ids, max_new_tokens=200)
```

Soft eviction (experimental, not recommended) - quantizes evicted tokens at 1-bit instead of removing them. In testing, this performed worse than hard eviction (+2.24% vs +0.82% PPL at 35% eviction on Mistral-7B): the 1-bit tokens corrupt attention patterns more than masking them out. Kept for research purposes.
```python
with nexusquant_evict(model, soft_eviction=True):  # not recommended
    output = model.generate(input_ids, max_new_tokens=200)
```

- Quality is text-dependent. Creative/narrative text degrades more than structured/technical text. Test on your actual workload.
- Short prefixes hurt. Prefixes under 500 tokens see more degradation. The scorer needs enough tokens to distinguish signal from noise.
- Architecture-dependent boundary protection. Qwen-family models fail catastrophically without `protect_boundary=2`. Mistral and Phi-3 work without it. Always test your specific model.
- E8 quantization is CPU-bound. A Triton GPU kernel is written (`nexusquant/kernels/e8_triton.py`) but not yet benchmarked for latency. Physical KV truncation (`truncate=True`) is implemented for actual VRAM savings.
- Eviction hurts factual recall. NIAH benchmark: baseline 100%; K3V2 + 35% eviction, 40% recall; K3V2 + 60% eviction, 53% recall (Mistral-7B-Instruct, ctx=1024-3072). PPL (+0.82%) hides this damage. Multi-window scoring improves recall to 80% at 35% eviction. If your task requires precise fact retrieval, test with NIAH before deploying.
- PPL is not a sufficient quality metric. Always validate with NIAH or downstream accuracy benchmarks. PPL averages over all positions and masks the loss of specific tokens.
- Results on 7B-class models only. 70B validation pending. Mistral-7B quantizes "exceptionally well" (ikawrakow) and is not representative of harder models.
- Batch size > 1 is partially broken. `NexusQuantSimple` only compresses batch index 0; other batch elements are silently dropped to the first element's compressed result. `NexusQuantEvictTruncate` computes one keep-mask from batch element 0 and applies it to all sequences - incorrect when sequences differ in importance. Validate batch inference results carefully.
- Multi-turn chat (persistent KV cache) is not supported. The hook compresses on every incoming prefill (seq > 1), so if the same cache is reused across conversation turns, the second turn's user message triggers re-compression of an already-quantized cache. Use a fresh context manager per turn, or call `model.generate` with `past_key_values=None` to reset the cache between turns.
- Speculative decoding is not supported. Speculative decoding writes multiple draft tokens to the KV cache during the decode phase. Because the hook triggers on any batch of >1 new tokens, it will incorrectly fire on draft verification steps, compressing decode-phase tokens.
- KV cache offloading is not supported. `OffloadedCache` (used by HuggingFace's `accelerate` `max_memory` offloading) does not inherit from `DynamicLayer`, so the NexusQuant hooks do not intercept it. Compression silently does nothing when offloading is active.
- Encoder-decoder models (T5, BART, Whisper) are not supported. These models use cross-attention whose KV cache stores encoder representations rather than decoder tokens. RoPE removal in the pipeline assumes decoder self-attention with split-half rotary embeddings, which does not apply to T5-style relative position biases. Applying NexusQuant to encoder-decoder models will produce incorrect results.
- Vision-language models (LLaVA, Qwen-VL, LLaVA-Next) are untested. Model config detection handles nested `text_config`, but image tokens are scored for importance and evicted by the same heuristic as text tokens. High-information image tokens may be evicted. Results on VLMs have not been measured.
- GGUF models are not supported. GGUF is typically run via llama.cpp or ctransformers, which do not use HuggingFace's `DynamicCache`, so the integration hooks have no effect. Only GPTQ/AWQ models loaded through `AutoModelForCausalLM` with HuggingFace are compatible.
- `rope_scaling` (extended context) is not accounted for. Models using linear or NTK rope scaling (e.g., Llama-3.1 at >8K context) read `rope_theta` but ignore the `rope_scaling` config. At contexts beyond the original training length, RoPE removal introduces a small frequency mismatch. Impact is unmeasured.
```bibtex
@software{nexusquant2026,
  author  = {Marques, Jo\~{a}o Andr\'{e} Gomes},
  title   = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization and Attention-Aware Token Eviction},
  year    = {2026},
  url     = {https://github.com/jagmarques/nexusquant},
  license = {Apache-2.0},
}
```

Apache 2.0. See LICENSE.