NexusQuant

Compress your LLM's KV cache 10-33x. Training-free. One line of code.

Early-stage research project. Results are validated on Mistral-7B and Phi-3-mini only; Qwen2.5-7B runs only with boundary protection and degrades far more. NIAH (needle-in-a-haystack) testing shows that factual recall degrades under compression (40% recall at 35% eviction). Not production-ready. Contributions and feedback welcome.

Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.

Install

pip install nexusquant-kv
pip install "nexusquant-kv[hf]"  # with HuggingFace transformers

Quickstart

from nexusquant import nexusquant_evict

# `model` is an already-loaded HuggingFace causal LM; `input_ids` is the tokenized prompt.
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)

Why

| Without NexusQuant | With NexusQuant |
|---|---|
| 128K context on 70B = ~42 GB KV cache (GQA) | Same context = ~2.5 GB KV cache (17x) |
| KV cache competes with model weights for VRAM | KV cache fits comfortably alongside weights |
| Long context needs multi-GPU or offloading | Single GPU, single machine |
| Deploy a fine-tuned retrieval model | One `with` block, no code changes |
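The ~42 GB figure can be reproduced from the model shape. The sketch below assumes a Llama-style 70B configuration (80 layers, 8 KV heads under GQA, head dimension 128), FP16 storage and 128K = 131,072 tokens; those shape values are assumptions for illustration and are not stated elsewhere in this README.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Uncompressed KV cache size: one key and one value vector per layer, head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Assumed Llama-style 70B shape (illustrative, not taken from this README).
full = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=128 * 1024)
compressed = full / 17  # the ~17x ratio of the "balanced" preset

print(f"FP16 KV cache: {full / 1e9:.1f} GB")        # 42.9 GB
print(f"At 17x:        {compressed / 1e9:.1f} GB")  # 2.5 GB
```

Plugging in your own model's `num_hidden_layers`, `num_key_value_heads` and head dimension gives the baseline to compare against.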

Quality presets

Measured on Mistral-7B, Phi-3-mini, Qwen2.5-7B. Compression ratios include all overhead.

| Preset | Compression | PPL degradation | Config |
|---|---|---|---|
| high | ~9x | <0.5% | K3V2 + real scorer + 35% evict |
| asym | ~14x | ~1% | K3V2 + 60% evict |
| balanced | ~17x | ~1.3% | K2V2 + 60% evict |
| max | ~33x | +0.66% | K2V2 + real scorer + 80% evict |

Important: PPL alone does not tell the full quality story. Our NIAH (needle-in-a-haystack) tests show only 40% factual recall for K3V2 at 35% eviction, despite a PPL increase of just +0.82%. The key-key scorer can evict early-context tokens that contain critical facts. Multi-window scoring (first-16 + last-16 queries) raises NIAH recall to 80% and reduces PPL degradation to +0.57% at 35% eviction. See Limitations below.

Attention sharpening (scaling keys by sqrt(1.05) after quantization) gives a free +0.05pp quality improvement at zero compression cost. Enabled by default.
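Why scaling keys sharpens attention: multiplying every key by a constant multiplies every attention logit by the same constant, which is equivalent to lowering the softmax temperature. The toy example below illustrates only that effect; the vectors are made up, and it does not show how NexusQuant applies the scaling internally.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, key_scale=1.0):
    """Softmax over scaled dot products q . (key_scale * k) / sqrt(d)."""
    d = len(query)
    logits = [
        key_scale * sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(logits)


# Illustrative values only.
query = [1.0, 0.5, -0.5, 2.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, -0.5, 0.5]]

plain = attention_weights(query, keys)
sharp = attention_weights(query, keys, key_scale=math.sqrt(1.05))

# The best-matching key gains weight; the worst-matching key loses weight.
print([round(w, 4) for w in plain])
print([round(w, 4) for w in sharp])
```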

Cross-architecture results (Cerebrium A10)

| Model | K2V2 35% | K3V2 35% | K2V2 60% | K3V2 60% |
|---|---|---|---|---|
| Mistral-7B (GQA 8:1) | +0.91% | +0.82% | +1.64% | +1.22% |
| Phi-3-mini (d=96) | +0.82% | +0.59% | +2.81% | +1.10% |
| Qwen2.5-7B | catastrophic | catastrophic | catastrophic | catastrophic |
| Qwen2.5-7B + boundary(2) | +7.9% | +8.7% | +23.8% | +23.3% |

Note: Qwen-family models require protect_boundary=2 (first/last 2 layers at FP16). Mistral and Phi-3 work without it.

How it works

  1. Importance scoring - rank tokens by attention weight. Two options: key-key proxy (fast, no extra pass) or real attention scorer (uses attn_implementation='eager', zero quality loss at 35% eviction)
  2. Token eviction - drop lowest-scoring tokens; always keep BOS and a recent sliding window
  3. RoPE removal - undo rotary embeddings on keys so they share a common subspace
  4. Hadamard rotation - spread energy uniformly across dimensions (handles non-power-of-2 head dims via zero-padding)
  5. E8 lattice quantization - quantize 8-float groups onto the E8 root lattice. Asymmetric: 3-bit keys + 2-bit values (keys need more precision due to softmax amplification)
  6. Boundary protection - optionally keep first/last N layers at FP16 (mandatory for Qwen-family)
  7. Delta coding + zstd - consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x
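Steps 4 and 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's implementation: it uses the textbook fast Walsh-Hadamard transform and the classic nearest-point rule for E8 (E8 is D8 together with D8 shifted by 1/2), and it omits the scaling, bit allocation and index encoding that a real quantizer needs. The function names are hypothetical.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    rounded = [math.floor(v + 0.5) for v in x]
    if sum(rounded) % 2 != 0:
        # Wrong parity: re-round the coordinate with the largest rounding error the other way.
        i = max(range(len(x)), key=lambda j: abs(x[j] - rounded[j]))
        rounded[i] += 1 if x[i] > rounded[i] else -1
    return rounded


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): try both cosets, keep the closer one."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([v - 0.5 for v in x])]

    def dist(p):
        return sum((xi - pi) ** 2 for xi, pi in zip(x, p))

    return a if dist(a) <= dist(b) else b


# One 8-float group (illustrative values): rotate, then snap to the lattice.
group = [0.9, -0.2, 0.4, 1.6, -1.1, 0.1, 0.7, -0.3]
rotated = fwht(group)
point = nearest_e8(rotated)
print(point)
```

Because the normalized Hadamard matrix is its own inverse, applying `fwht` again after dequantization returns to the original basis.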

Token eviction reduces count (2.5x at 60% eviction). E8 quantization reduces precision (~7x after entropy coding). Combined: 17x.
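The arithmetic behind the combined figure, using only the numbers quoted above (the ~7x quantization ratio is the README's approximate figure, taken as given):

```python
def eviction_ratio(evict_fraction):
    """Compression from dropping a fraction of tokens: keeping 40% of tokens is 2.5x."""
    return 1.0 / (1.0 - evict_fraction)


token_ratio = eviction_ratio(0.60)    # 2.5x
quant_ratio = 7.0                     # approximate, after entropy coding
combined = token_ratio * quant_ratio  # 17.5, reported as ~17x

print(f"{token_ratio:.1f}x * {quant_ratio:.0f}x = {combined:.1f}x")
```

The two factors multiply because eviction shrinks the number of cached tokens while quantization shrinks the bytes per remaining token.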

Compared to

| Method | Compression | PPL degradation | Training required | Notes |
|---|---|---|---|---|
| NexusQuant (K3V2+scorer) | 9-33x | +0.0-0.66% | No | Includes eviction |
| NexusQuant (K2V2) | 10-33x | +0.4-2.6% | No | Includes eviction |
| TurboQuant+ | 3.8-6.4x | ~0-1% | No | Quant-only, no eviction |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) | |
| CommVQ (Apple) | ~8x | ~0% | Yes (retraining) | |
| Palu | 11x | ~25% rel | Yes (calibration) | |

NexusQuant ratios include token eviction (10-80% of tokens removed). TurboQuant+ ratios are pure quantization without eviction - not directly comparable. Competitor numbers from their papers.

Supported models

Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):

  • Llama family (Llama-2, Llama-3, Llama-3.1)
  • Mistral / Mixtral
  • Qwen
  • Phi
  • Gemma

Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
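The difference between the two layouts is only which coordinates are paired for rotation, but it decides whether RoPE removal inverts the right rotation. The sketch below is a plain-Python illustration with hypothetical function names and the common default base of 10000; it is not the library's code.

```python
import math


def rope_angles(position, head_dim, theta=10000.0):
    """One rotation angle per coordinate pair, as in standard RoPE."""
    half = head_dim // 2
    return [position * theta ** (-2.0 * i / head_dim) for i in range(half)]


def rope_split_half(vec, position, inverse=False):
    """Rotate pairs (i, i + d/2). With inverse=True the rotation is undone."""
    d = len(vec)
    half = d // 2
    out = list(vec)
    for i, angle in enumerate(rope_angles(position, d)):
        if inverse:
            angle = -angle
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def rope_interleaved(vec, position):
    """Rotate adjacent pairs (2i, 2i + 1): the layout that is not yet supported."""
    d = len(vec)
    out = list(vec)
    for i, angle in enumerate(rope_angles(position, d)):
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


key = [0.3, -1.2, 0.8, 2.0, -0.5, 0.1, 1.4, -0.7]  # illustrative values
rotated = rope_split_half(key, position=17)
restored = rope_split_half(rotated, position=17, inverse=True)  # RoPE removal
mismatched = rope_split_half(rope_interleaved(key, 17), 17, inverse=True)
```

`restored` matches `key` up to floating-point error, while `mismatched` does not: undoing a split-half rotation on interleaved keys leaves them scrambled.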

Advanced options

Graduated layer bit profile - gives boundary layers (first/last 15%) higher precision (3-bit K+V) while middle layers use standard asymmetric (K3V2). Small but consistent quality win (~0.02pp on Mistral-7B). GPU-validated.

with nexusquant_evict(model, quality="high", layer_bit_profile="graduated"):
    output = model.generate(input_ids, max_new_tokens=200)

Hybrid model compression - for models like Gemma4 with sliding-window + global attention layers, only compress the global layers (which scale with context). SWA layers have fixed memory cost.

with nexusquant_evict(model, compress_layers="global_only"):
    output = model.generate(input_ids, max_new_tokens=200)

Soft eviction (experimental, not recommended) - quantizes evicted tokens at 1-bit instead of removing them. In testing, this performed worse than hard eviction (+2.24% vs +0.82% PPL at 35% eviction on Mistral-7B). The 1-bit tokens corrupt attention patterns more than masking them out. Kept for research purposes.

with nexusquant_evict(model, soft_eviction=True):  # not recommended
    output = model.generate(input_ids, max_new_tokens=200)

Limitations

  • Quality is text-dependent. Creative/narrative text degrades more than structured/technical text. Test on your actual workload.
  • Short prefixes hurt. Prefixes under 500 tokens see more degradation. The scorer needs enough tokens to distinguish signal from noise.
  • Architecture-dependent boundary protection. Qwen-family models catastrophically fail without protect_boundary=2. Mistral and Phi-3 work without it. Always test your specific model.
  • E8 quantization is CPU-bound. Triton GPU kernel is written (nexusquant/kernels/e8_triton.py) but not yet benchmarked for latency. Physical KV truncation (truncate=True) is implemented for actual VRAM savings.
  • Eviction hurts factual recall. NIAH benchmark: baseline 100%, K3V2-35% eviction 40% recall, K3V2-60% eviction 53% recall (Mistral-7B-Instruct, ctx=1024-3072). PPL (+0.82%) hides this damage. Multi-window scoring improves recall to 80% at 35% eviction. If your task requires precise fact retrieval, test with NIAH before deploying.
  • PPL is not a sufficient quality metric. Always validate with NIAH or downstream accuracy benchmarks. PPL averages over all positions and masks the loss of specific tokens.
  • Results on small models only (up to 7B parameters). 70B validation pending. Mistral-7B quantizes "exceptionally well" (ikawrakow) and is not representative of harder models.
  • Batch size > 1 is partially broken. NexusQuantSimple compresses only batch index 0; all other batch elements are silently overwritten with the first element's compressed result. NexusQuantEvictTruncate computes one keep-mask from batch element 0 and applies it to all sequences, which is incorrect when token importance differs between sequences. Validate batch inference results carefully.
  • Multi-turn chat (persistent KV cache) is not supported. The hook compresses on every incoming prefill (seq > 1). If the same cache is reused across conversation turns, the second turn's user message triggers re-compression of an already-quantized cache. Use a fresh context manager per turn, or call model.generate with past_key_values=None to reset the cache between turns.
  • Speculative decoding is not supported. Speculative decoding writes multiple draft tokens to the KV cache during the decode phase. Because the hook triggers on any batch of >1 new tokens, it will incorrectly fire on draft verification steps, compressing decode-phase tokens.
  • KV cache offloading is not supported. OffloadedCache (used by HuggingFace's accelerate max_memory offloading) does not inherit from DynamicLayer, so the NexusQuant hooks do not intercept it. Compression silently does nothing when offloading is active.
  • Encoder-decoder models (T5, BART, Whisper) are not supported. These models use cross-attention whose KV cache stores encoder representations rather than decoder tokens. RoPE removal in the pipeline assumes decoder self-attention with split-half rotary embeddings, which does not apply to T5-style relative position biases. Applying NexusQuant to encoder-decoder models will produce incorrect results.
  • Vision-language models (LLaVA, Qwen-VL, LLaVA-Next) are untested. Model config detection handles nested text_config, but image tokens are scored for importance and evicted by the same heuristic as text tokens. High-information image tokens may be evicted. Results on VLMs have not been measured.
  • GGUF models are not supported. GGUF format is typically run via llama.cpp or ctransformers, which do not use HuggingFace DynamicCache, so the integration hooks have no effect. Among pre-quantized weight formats, only GPTQ/AWQ models loaded through HuggingFace's AutoModelForCausalLM are compatible.
  • rope_scaling (extended context) is not accounted for. For models that use linear or NTK rope scaling (e.g., Llama-3.1 at >8K context), NexusQuant reads rope_theta but ignores the rope_scaling config. At contexts beyond the original training length, RoPE removal therefore introduces a small frequency mismatch. Impact is unmeasured.
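For the NIAH check recommended above, a small harness is enough to compare a baseline run against a compressed run. The sketch below is model-agnostic and hypothetical: `generate` is any callable you supply that maps a prompt string to an output string (for example, a wrapper around model.generate inside the compression context), and `forgetful_generate` is only a stand-in so the example runs without a model.

```python
def build_niah_prompt(needle, filler_sentence, n_filler, depth):
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    sentences = [filler_sentence] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)


def niah_recall(generate, cases):
    """Fraction of (prompt, expected) cases whose generated text contains `expected`."""
    hits = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return hits / len(cases)


question = " What is the secret code? Answer:"
cases = [
    (
        build_niah_prompt(
            "The secret code is 7421.", "The sky was grey that day.", 200, depth
        ) + question,
        "7421",
    )
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
]


def forgetful_generate(prompt):
    """Stand-in for a model call: only 'sees' the last 2000 characters of the prompt."""
    return "7421" if "7421" in prompt[-2000:] else "unknown"


print(f"recall: {niah_recall(forgetful_generate, cases):.0%}")
```

Run the same `cases` once without compression and once with it; a gap between the two recall values is the damage that PPL does not show.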

Citation

@software{nexusquant2026,
  author  = {Marques, Jo\~{a}o Andr\'{e} Gomes},
  title   = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization and Attention-Aware Token Eviction},
  year    = {2026},
  url     = {https://github.com/jagmarques/nexusquant},
  license = {Apache-2.0},
}

License

Apache 2.0. See LICENSE.
