Drop-in KV cache quantization that bypasses the butterfly network using block-diagonal rotations. Beats Google's TurboQuant on every axis: better PPL, 28% faster decode, 5x faster prefill, 128x fewer parameters (128 vs 16,384).
"Replace the d×d random orthogonal matrix with Clifford rotors... exploiting algebraic sparsity" — RotorQuant paper, March 2026
| Config (K/V) | Decode tok/s | Prefill tok/s | PPL (wiki-2) | vs FP16 | Compression |
|---|---|---|---|---|---|
| f16 / f16 | 140 | 6,156 | 6.63 | baseline | 1x |
| iso3 / iso3 | 118 | 3,397 | 6.91 | +4.2% | 10.3x |
| planar3 / planar3 | 119 | 3,822 | 7.05 | +6.3% | 10.3x |
| turbo3 / turbo3 | 93 | 722 | 7.07 | +6.6% | 10.3x |
| planar3 / turbo3 | 127 | — | 6.68 | +0.8% | 10.3x |
| planar3 / f16 | 134 | — | ~6.63 | ~0% | 5.1x |
vs TurboQuant (same 10.3x compression):
- PPL: iso3 6.91 vs turbo3 7.07 (lower is better)
- Decode: planar3 119 tok/s vs turbo3 93 tok/s, 28% faster
- Prefill: planar3 3,822 tok/s vs turbo3 722 tok/s, 5.3x faster
- Parameters: 128 vs 16,384, 128x fewer (the 44x figure from the paper's Table 1 matches RotorQuant's 372 parameters)
The butterfly bypass from the RotorQuant paper: TurboQuant applies a d×d Walsh-Hadamard Transform (butterfly network with log₂(d) stages across all 128 dimensions). PlanarQuant/IsoQuant apply independent 2D/4D rotations per pair/quartet — O(d) instead of O(d log d), fully parallelizable, no inter-element dependencies. The deferred K-cache (F16 during prefill) eliminates rotation overhead entirely during prompt processing.
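The difference in data dependencies can be sketched in a few lines of NumPy. This is an illustrative sketch, not the CUDA kernels from the fork: `wht_butterfly` and `givens_blocks` are hypothetical helper names, and the random angles stand in for whatever rotation parameters the real implementation uses.

```python
import numpy as np

def wht_butterfly(x):
    """Normalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = np.asarray(x, dtype=np.float64)
    d = y.shape[0]
    h, stages = 1, 0
    while h < d:
        # Each stage combines element i with element i + h, so after
        # log2(d) stages every output depends on every input.
        blocks = y.reshape(-1, 2, h)
        a, b = blocks[:, 0, :], blocks[:, 1, :]
        y = np.stack([a + b, a - b], axis=1).reshape(d)
        h *= 2
        stages += 1
    return y / np.sqrt(d), stages

def givens_blocks(x, theta):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by its own angle theta[i]."""
    pairs = np.asarray(x, dtype=np.float64).reshape(-1, 2)
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(pairs)
    out[:, 0] = c * pairs[:, 0] - s * pairs[:, 1]
    out[:, 1] = s * pairs[:, 0] + c * pairs[:, 1]
    return out.reshape(-1)

d = 128
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
theta = rng.uniform(0.0, 2.0 * np.pi, d // 2)  # assumed random angles

y_wht, stages = wht_butterfly(x)
y_givens = givens_blocks(x, theta)
print("butterfly stages:", stages)  # log2(128) = 7

# Perturb one input and see which outputs move.
x2 = x.copy()
x2[0] += 1.0
changed_givens = np.flatnonzero(~np.isclose(givens_blocks(x2, theta), y_givens))
changed_wht = np.flatnonzero(~np.isclose(wht_butterfly(x2)[0], y_wht))
print("outputs touched (Givens):", len(changed_givens))
print("outputs touched (WHT):", len(changed_wht))
```

Both transforms are orthogonal, so they preserve the vector norm. The difference is locality: a change to one input reaches only its own pair under the block rotation, while the butterfly spreads it to all 128 outputs.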
The original RotorQuant paper proposed Clifford algebra Cl(3,0) rotors — the rotor sandwich product RxR̃ with only 4 non-zero multivector components. The insight: you don't need a full-rank d×d transform to decorrelate KV cache vectors; small orthogonal blocks suffice because real attention vectors live on low-rank manifolds.
This led to three progressively simpler implementations. PlanarQuant (2D Givens) and IsoQuant (4D quaternion) were developed by @ParaMind2025, building on the block-diagonal rotation idea:
| Method | Rotation | Group Size | FMAs (d=128) | Params | Status |
|---|---|---|---|---|---|
| RotorQuant | Cl(3,0) rotor sandwich | 3 | ~2,400 | 372 | Research (Triton) |
| IsoQuant | Quaternion 4D | 4 | 512 | 128 | Production (llama.cpp) |
| PlanarQuant | Givens 2D | 2 | 256 | 128 | Production (llama.cpp) |
| TurboQuant | WHT butterfly | 128 | 16,384 | 16,384 | Production (llama.cpp) |
Each step traded algebraic richness for speed. The PPL results show the simpler rotations work better — confirming the paper's claim that block-diagonal rotation preserves the directional structure of KV cache vectors more effectively than global WHT scrambling.
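The FMA and parameter counts for the Givens, quaternion, and TurboQuant rows can be reproduced with simple counting. The counting rules here are assumptions, not taken from the paper: each k-dimensional block is applied as a k×k matrix-vector product (k² FMAs), a Givens block stores (cos, sin), and a quaternion block stores 4 components. The RotorQuant row is left out because its counts depend on details of the rotor sandwich that are not spelled out here.

```python
def block_rotation_cost(d, k, params_per_block):
    """Return (FMAs, params) for d dimensions split into blocks of size k."""
    blocks = d // k
    fmas = blocks * k * k          # one k x k matrix-vector product per block
    params = blocks * params_per_block
    return fmas, params

d = 128
givens = block_rotation_cost(d, 2, 2)      # 64 pairs, (cos, sin) each
quat = block_rotation_cost(d, 4, 4)        # 32 quartets, one quaternion each
dense = block_rotation_cost(d, d, d * d)   # a single dense d x d matrix

# Add/sub count of a fast butterfly evaluation, for comparison only.
butterfly_ops = d * (d.bit_length() - 1)   # d * log2(d) for a power of two

print("Givens     FMAs, params:", givens)
print("Quaternion FMAs, params:", quat)
print("Dense d*d  FMAs, params:", dense)
print("Param ratio dense / Givens:", dense[1] // givens[1])
print("Butterfly add/sub ops:", butterfly_ops)
```

Under these assumptions the 16,384 figure in the TurboQuant row equals d², the cost of a dense d×d product. A fast butterfly evaluation of the same transform would take d·log₂(d) additions and subtractions, which is the O(d log d) cost mentioned earlier.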
llama.cpp fork (feature/planarquant-kv-cache)
20efe75 2026-04-01 19:50 Add symmetric planar4/iso4: V dequant, template instances, FA dispatch
326f7fb 2026-04-01 14:41 Add inverse rotation V dequant for planar4/iso4
6e5a4aa 2026-04-01 14:24 Fix symmetric V=planar3/iso3: add inverse rotation to V dequant
a730624 2026-04-01 11:53 planar3/turbo3: 5x total compression, PPL 10.19 (vs Tom's 3.5x at 10.14)
b83a09f 2026-04-01 10:46 All 8 K/V configs working: real Givens/quaternion rotation for planar4/iso4
985fd96 2026-04-01 10:24 Fix planar3/q8_0 asymmetric: add F16+Q8_0 VEC template for deferred prefill
b719b2e 2026-04-01 10:07 Fix FA dispatch: static constants, V=f16 check, asymmetric support
79da661 2026-04-01 09:30 Add asymmetric FA kernels: q8_0 K + iso3/planar3 V (and reverse)
e7bde1f 2026-04-01 09:15 Guard deferred conversion behind GGML_USE_CUDA
9d4ece5 2026-04-01 08:32 COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA
a75b16f 2026-04-01 07:51 Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4
1ed0453 2026-04-01 06:53 Add CUDA set_rows kernels for planar3/iso3/planar4/iso4
0971ed5 2026-03-31 22:44 Fix ggml context size for double-buffer
25f896f 2026-03-31 22:37 Double-buffer deferred quantization with CUDA conversion kernels
rotorquant repo (main)
61154ae 2026-04-01 14:41 Update README: symmetric 3-bit PPL results beat TurboQuant
6ce8c03 2026-03-31 22:39 Add Llama 3.1 8B benchmarks: 239 tok/s decode, PPL 8.44
6637e30 2026-03-31 22:07 Update README with RTX 5090 llama.cpp CUDA benchmarks
ec98f4b 2026-03-31 21:12 Add post-prefill PPL benchmarks: IsoQuant 4-bit 9.03, PlanarQuant 3-bit 10.12
0c98c28 2026-03-31 21:04 Restore RotorQuant trivector centroids, add CUDA PPL to README
b9d3f1a 2026-03-31 20:16 Add IsoQuant + PlanarQuant backends to PPL benchmark
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache
# CUDA
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Symmetric 3-bit (best quality per bit)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k iso3 --cache-type-v iso3 --host 0.0.0.0 --port 8080
# K-only (zero PPL loss, 5x compression)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k planar3 --cache-type-v f16 --host 0.0.0.0 --port 8080
# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -ctk planar3 -ctv planar3 -p 512 -n 128
# Perplexity
pip install datasets
python3 -c "from datasets import load_dataset; open('/tmp/wiki.txt','w').write('\n'.join(load_dataset('wikitext','wikitext-2-raw-v1',split='test')['text']))"
./build/bin/llama-perplexity -m model.gguf -f /tmp/wiki.txt -ngl 99 -c 2048 \
--cache-type-k iso3 --cache-type-v iso3

Cache types: planar3, iso3, planar4, iso4 (ours) + turbo3, turbo4 (TheTom's WHT)
pip install -e . && pip install triton

import torch
from turboquant import IsoQuantMSE, PlanarQuantMSE

# Example input (shape assumed for illustration): vectors with last dimension d=128
x = torch.randn(1024, 128, device='cuda')

# IsoQuant: best 4-bit quality (PPL 9.03)
iq = IsoQuantMSE(d=128, bits=4, mode='fast', device='cuda')
x_hat, indices = iq(x)
# PlanarQuant: best 3-bit quality (PPL 10.12)
pq = PlanarQuantMSE(d=128, bits=3, device='cuda')
x_hat, indices = pq(x)

Rotation decorrelates KV cache vectors before scalar quantization:
- Normalize → store norms separately
- Rotate via block transform (breaks coordinate correlations)
- Quantize each coordinate to Lloyd-Max centroids
- Inverse rotate to reconstruct
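The four steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the `turboquant` implementation: the function names are hypothetical, the rotation is the 2D Givens variant with random angles, and the centroids are fitted with plain Lloyd iterations on Gaussian samples scaled to the typical coordinate size of a unit vector.

```python
import numpy as np

def lloyd_max_centroids(samples, levels, iters=50):
    """Fit scalar Lloyd-Max centroids to 1-D samples with plain Lloyd iterations."""
    # Start from evenly spaced quantiles so that no cell starts empty.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(levels):
            members = samples[idx == k]
            if members.size:
                centroids[k] = members.mean()
    return np.sort(centroids)

def rotate_pairs(x, theta, inverse=False):
    """Apply a 2D Givens rotation to each consecutive coordinate pair of x (..., d)."""
    c = np.cos(theta)
    s = -np.sin(theta) if inverse else np.sin(theta)
    p = x.reshape(*x.shape[:-1], -1, 2)
    out = np.stack([c * p[..., 0] - s * p[..., 1],
                    s * p[..., 0] + c * p[..., 1]], axis=-1)
    return out.reshape(x.shape)

def quantize_roundtrip(x, theta, centroids):
    # 1. Normalize and keep the norms separately.
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    unit = x / norms
    # 2. Rotate with the block transform.
    rotated = rotate_pairs(unit, theta)
    # 3. Snap each coordinate to its nearest centroid.
    indices = np.abs(rotated[..., None] - centroids).argmin(axis=-1)
    # 4. Look up centroids, inverse rotate, and restore the norms.
    x_hat = rotate_pairs(centroids[indices], theta, inverse=True) * norms
    return x_hat, indices

rng = np.random.default_rng(0)
d, bits = 128, 3
x = rng.standard_normal((256, d))              # stand-in for KV vectors
theta = rng.uniform(0.0, 2.0 * np.pi, d // 2)  # assumed random angles
# Coordinates of a unit vector in d dimensions have std of about 1/sqrt(d).
centroids = lloyd_max_centroids(rng.standard_normal(20000) / np.sqrt(d), 2 ** bits)

x_hat, indices = quantize_roundtrip(x, theta, centroids)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative reconstruction error at {bits} bits: {rel_err:.3f}")
```

Only `indices` and the per-vector norms would need to be stored. Random Gaussian input is already decorrelated, so this sketch shows the mechanics of the pipeline rather than the quality gain on real KV vectors.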
| Method | Block | FMAs (d=128) | Params | Quality |
|---|---|---|---|---|
| TurboQuant | Dense d×d WHT | 16,384 | 16,384 | baseline |
| IsoQuant | 4D quaternion | 512 | 128 | better |
| PlanarQuant | 2D Givens | 256 | 128 | better |
Deferred quantization: the K-cache is allocated as FP16 during prefill, so no quantization error compounds while the prompt is processed. Decode tokens are quantized on insertion. This gives 3x better PPL than roundtrip quantization, and in llama.cpp the F16 prefill makes decode faster than the FP16 baseline (no dequant overhead in flash attention).
Why inverse rotation matters for V cache: the V dequant must apply the inverse of the forward rotation (inverse Givens or inverse quaternion). TurboQuant's WHT doesn't need an explicit inverse because of the self-canceling properties of Hadamard transforms in attention weighted sums. Our fix (6e5a4aa) added the inverse rotation, and PPL went from 15,369 to 7.05.
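A small NumPy sketch shows why a missing inverse is so damaging. The attention output is a weighted sum of V rows, so if the rows are stored rotated and read back without the inverse, the result is a rotated version of the correct output. Quantization is left out to isolate the rotation effect, and the helper name, random angles, and sizes are assumptions for illustration.

```python
import numpy as np

def rotate_pairs(v, theta, inverse=False):
    """Apply a 2D Givens rotation to each consecutive coordinate pair of v (..., d)."""
    c = np.cos(theta)
    s = -np.sin(theta) if inverse else np.sin(theta)
    p = v.reshape(*v.shape[:-1], -1, 2)
    out = np.stack([c * p[..., 0] - s * p[..., 1],
                    s * p[..., 0] + c * p[..., 1]], axis=-1)
    return out.reshape(v.shape)

rng = np.random.default_rng(0)
n_tokens, d = 16, 128
V = rng.standard_normal((n_tokens, d))
theta = rng.uniform(0.0, 2.0 * np.pi, d // 2)  # assumed random angles
w = rng.random(n_tokens)
w /= w.sum()                                   # stand-in attention weights

reference = w @ V                              # output with an unrotated V cache
V_stored = rotate_pairs(V, theta)              # what the cache holds

no_inverse = w @ V_stored                      # dequant without inverse rotation
with_inverse = w @ rotate_pairs(V_stored, theta, inverse=True)

ref_norm = np.linalg.norm(reference)
err_missing = np.linalg.norm(no_inverse - reference) / ref_norm
err_fixed = np.linalg.norm(with_inverse - reference) / ref_norm
print(f"relative error without inverse: {err_missing:.3f}")
print(f"relative error with inverse:    {err_fixed:.2e}")
```

The K side is different: attention scores are dot products, and those are unchanged when the query and the keys go through the same orthogonal rotation. The V side has no such cancellation, which is consistent with the PPL blow-up described above.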
| Context | FP16 KV | Compressed | Saved |
|---|---|---|---|
| 8K | 288 MB | 28 MB | 260 MB |
| 32K | 1,152 MB | 112 MB | 1.04 GB |
| 128K | 4,608 MB | 447 MB | 4.16 GB |
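The table rows follow from two numbers: the FP16 size at 8K context (288 MB) and the 10.3x compression ratio from the benchmark table. The sketch below reproduces them, assuming the cache grows linearly with context length; the function name is hypothetical and the underlying model is not specified here.

```python
FP16_MB_AT_8K = 288.0   # taken from the 8K row of the table
COMPRESSION = 10.3      # symmetric 3-bit ratio from the benchmark table

def kv_cache_mb(context_tokens):
    """Return (fp16_mb, compressed_mb, saved_mb), assuming linear growth with context."""
    fp16 = FP16_MB_AT_8K * context_tokens / 8192
    compressed = fp16 / COMPRESSION
    return fp16, compressed, fp16 - compressed

for ctx in (8192, 32768, 131072):
    fp16, compressed, saved = kv_cache_mb(ctx)
    print(f"{ctx // 1024}K: FP16 {fp16:.0f} MB, "
          f"compressed {compressed:.0f} MB, saved {saved:.0f} MB")
```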
Needle-in-Haystack passes at 8K, 32K, and 65K context.
| Hardware | Cache K | Decode tok/s | Prefill tok/s | PPL |
|---|---|---|---|---|
| RTX 5090 | planar3 | 367 | 23,600 | 9.98 |
| RTX 5090 | FP16 | 356 | 20,800 | 10.03 |
| M4 Mac Mini | planar3 | 48.3 | 554 | 9.98 |
| M4 Mac Mini | FP16 | 47.4 | 518 | 9.98 |
| Method | 3-bit PPL | 4-bit PPL | vs FP16 (7.59), 3-bit / 4-bit |
|---|---|---|---|
| IsoQuant | 12.35 | 9.03 | +63% / +19% |
| PlanarQuant | 10.12 | 9.56 | +33% / +26% |
| RotorQuant | 12.22 | 10.03 | +61% / +32% |
python -m turboquant.benchmark_google_parity # PPL (post-prefill)
python -m turboquant.benchmark_perplexity --bits 3 4 # PPL (roundtrip)
python -m turboquant.benchmark_triton # Triton kernel speed
python -m turboquant.poc_high_context --backend planar # High-context generation

ParaMind2025: PlanarQuant (2D Givens rotation) and IsoQuant (4D quaternion rotation) were designed by ParaMind2025. Their insight that simple block-diagonal rotations could match full-rank transforms for KV cache decorrelation made the llama.cpp integration practical.
- RotorQuant paper — Clifford algebra vector quantization for KV cache compression
- TurboQuant (ICLR 2026) — Google's KV cache compression
- IsoQuant / PlanarQuant — ParaMind2025's rotation-based quantizers
- TheTom/llama-cpp-turboquant — llama.cpp fork with TurboQuant
- QJL — 1-bit quantized JL transform
@article{pope2026rotorquant,
title={RotorQuant: Clifford Algebra Vector Quantization for LLM KV Cache Compression},
author={Pope, John D.},
year={2026},
url={https://github.com/scrya-com/rotorquant}
}

MIT License