Languages: English | 한국어
Comprehensive evaluation of TurboQuant, a near-optimal vector quantization algorithm for compressing LLM key-value caches.
This repository extends the original implementation with:
- ✅ Interactive CLI tool for real-time model comparison with actual KV cache compression analysis
- ✅ Multi-model evaluation (Qwen, LLaMA, Phi, Mistral)
- ✅ Comprehensive performance benchmarking (generation speed, memory, attention accuracy)
- ✅ Detailed experimental results and analysis
- ✅ Reproducible evaluation framework with actual KV compression analysis
```
# Install all dependencies
pip install -r requirements.txt
```

Start an interactive session where you can enter prompts and see real KV cache compression:

```
cd experiments/2_multi_model_evaluation
python interactive_with_real_kv.py --model "Qwen/Qwen2.5-3B-Instruct" --bits 3
```

Then enter prompts:

```
[PROMPT] Enter text: What is artificial intelligence?
[PROMPT] Enter text: Explain machine learning
[PROMPT] Enter text: quit
```
You'll see:
- Generated text
- Real KV cache compression analysis (memory savings, speed)
- Attention accuracy metrics (cosine similarity, top-1/top-5 match)
Interactive web interface for side-by-side original vs TurboQuant comparison:

```
cd streamlit_app
streamlit run app.py
```

Then open a browser at http://localhost:8501.
Features:
- 🔬 Side-by-side text generation: Original KV vs TurboQuant output
- ⚡ Real-time metrics: KV cache size, compression ratio, generation time
- 🎯 Attention quality: Cosine similarity, top-1/top-5 match percentages
- 📊 Generation impact analysis: How many attention heads change due to compression
- 🔧 Model selection: Choose from Qwen, Phi, Mistral
- ⚙️ Quantization control: Test 2-bit, 3-bit, 4-bit compression
Requirements:
- Python 3.10+
- PyTorch 2.0+ with CUDA
- 12GB+ GPU VRAM
Example:

```
# Run synthetic tests (no GPU needed)
cd experiments/1_paper_reproduction
python ../../original_implementation/test_turboquant.py

# Evaluate on a model (GPU required)
cd experiments/2_multi_model_evaluation
./run_all_models_complete.sh
```

On Windows:

```
# Navigate to experiments
cd experiments/2_multi_model_evaluation

# Run all models at once
.\run_all_models_complete.bat
```

| Model | Compression | Cosine Sim | Top-1 % | Top-5 % |
|---|---|---|---|---|
| Qwen2.5-3B | 5.0x | 0.9945 | 86.1% | 94.4% |
| Mistral-7B | 5.0x | 0.9887 | 97.7% | 100.0% |
| Phi-2 | 4.8x | 0.9924 | 28.2% | 55.7% |
- Best overall: Mistral-7B (highest top-1 match across all contexts)
- Highest cosine similarity: Qwen2.5-3B (most similar attention distributions)
Interpretation:
- 5.0x compression: KV cache shrinks from 290 MB to 58 MB (8K context)
- 0.9945 cosine sim: Attention distributions 99.45% similar
- 86% top-1 match: the compressed attention picks the same most-attended token as FP16 in 86% of cases
- 94% top-5 match: the FP16 top token falls within the compressed top-5 94% of the time
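As a sanity check, the cache sizes above can be reproduced with back-of-envelope arithmetic. The dimensions below (36 layers, 2 KV heads, head dim 128) are assumptions chosen to roughly match a 3B-class GQA model, not values taken from this repository, and 3.2 effective bits per value stands in for 3-bit codes plus per-vector overhead:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    # K and V each store num_layers * num_kv_heads * head_dim values per token
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return num_values * bits_per_value / 8

# Assumed dimensions for a 3B-class GQA model (illustrative, not exact)
layers, kv_heads, head_dim, ctx = 36, 2, 128, 8192

fp16 = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 16)
quant = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 3.2)  # 3-bit codes + overhead
print(f"FP16: {fp16 / 2**20:.0f} MiB -> quantized: {quant / 2**20:.0f} MiB "
      f"({fp16 / quant:.1f}x compression)")
```

Under these assumptions the script prints roughly 288 MiB shrinking to 58 MiB at 5.0x, in line with the table above.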
TurboQuant is a data-oblivious online vector quantization algorithm that:
- Rotates vectors with a random orthogonal transform (so coordinates become near-independent and approximately Gaussian)
- Quantizes each coordinate with optimal Lloyd-Max codebooks (2-4 bits)
- Corrects inner-product bias using QJL (1 extra bit)
Result: High compression with minimal attention accuracy loss.
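The rotate-then-scalar-quantize idea can be illustrated with a minimal sketch. This is not the repository's implementation: it hard-codes the classic 2-bit Lloyd-Max reproduction levels for a unit Gaussian and omits the QJL bias-correction bit entirely.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

# Classic 2-bit Lloyd-Max reproduction levels for a unit Gaussian source
LEVELS = np.array([-1.510, -0.4528, 0.4528, 1.510])

def quantize(x, rot):
    z = rot @ x                                   # rotated coords look ~Gaussian
    scale = np.linalg.norm(z) / np.sqrt(len(z))   # normalize to unit variance
    codes = np.abs(z[:, None] / scale - LEVELS).argmin(axis=1)
    return codes, scale

def dequantize(codes, scale, rot):
    return rot.T @ (LEVELS[codes] * scale)        # undo the rotation

d = 64
x = rng.standard_normal(d)
rot = random_rotation(d)
codes, scale = quantize(x, rot)
x_hat = dequantize(codes, scale, rot)
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative L2 error at 2 bits/coordinate: {rel_err:.3f}")
```

For a unit Gaussian source the optimal 2-bit per-coordinate distortion is about 0.117, so a relative L2 error around 0.34 is expected; the full algorithm additionally spends the QJL bit to make estimated inner products unbiased.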
Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026)
```
turboquant-experiments/
├── original_implementation/       # Reference code (with attribution)
│   ├── lloyd_max.py               # Lloyd-Max solver
│   ├── turboquant.py              # Core algorithm
│   ├── compressors.py             # Asymmetric attention
│   ├── test_turboquant.py         # Synthetic tests
│   ├── validate.py                # Original Qwen validation
│   └── ATTRIBUTION.md             # Source attribution
│
├── experiments/
│   ├── 1_paper_reproduction/      # Reproduce paper results
│   │   └── Verify MSE bounds, inner product unbiasedness
│   │
│   ├── 2_multi_model_evaluation/  # Evaluate different models
│   │   ├── interactive_with_real_kv.py  # ✅ Interactive CLI tool (RECOMMENDED)
│   │   ├── benchmark_generation.py      # Generation performance benchmark
│   │   ├── benchmark_turboquant.py      # Attention accuracy benchmark
│   │   ├── simple_prompt_test.py        # Basic comparison test
│   │   ├── evaluate_model.py            # Generic evaluation framework
│   │   ├── analyze_results.py           # Result analysis & plots
│   │   ├── run_all_models_complete.sh   # Batch evaluation script
│   │   ├── run_all_models_complete.bat  # Windows batch script
│   │   └── results/                     # Benchmark results (JSON)
│   │
│   └── 3_performance_analysis/    # Speed & memory benchmarks
│       ├── benchmark_speed.py
│       └── benchmark_memory.py
│
├── docs/
│   ├── HOW_TO_RUN.md              # Detailed execution guide
│   ├── METHODOLOGY.md             # Experimental methodology
│   ├── RESULTS.md                 # Comprehensive results
│   └── README.md
│
└── README.md (this file)
```
- Qwen2.5-3B-Instruct (3.5GB) - ✅ Primary baseline
- Microsoft Phi-2 (2.7GB) - ✅ Small, GPU-efficient
- Mistral-7B-Instruct-v0.1 (13GB) - ✅ Best overall performance
- Meta LLaMA-2-7B (13GB) - ⚠️ Requires HuggingFace authentication (gated repo)
- Compression Ratio: KV cache size reduction (higher is better)
- Cosine Similarity: Attention distribution similarity (closer to 1.0 is better)
- Top-1 Match %: Same most-attended token (higher is better)
- Top-5 Match %: Top token in top-5 predictions (higher is better)
Tested on: 2K, 4K, 8K tokens (covering short → long contexts)
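The accuracy metrics above can be computed from two attention-weight matrices as sketched below; the exact computation in this repository's benchmark scripts may differ, and the demo data is synthetic:

```python
import numpy as np

def attention_metrics(attn_ref, attn_q, k=5):
    """attn_ref, attn_q: (num_queries, num_keys) attention weights (rows sum to 1)."""
    # Cosine similarity between each query's two attention distributions
    dots = (attn_ref * attn_q).sum(-1)
    norms = np.linalg.norm(attn_ref, axis=-1) * np.linalg.norm(attn_q, axis=-1)
    cosine = (dots / norms).mean()
    # Top-1: do both versions attend most strongly to the same key?
    ref_top = attn_ref.argmax(-1)
    top1 = (ref_top == attn_q.argmax(-1)).mean()
    # Top-k: is the reference top key among the quantized version's top k?
    topk_q = np.argsort(attn_q, axis=-1)[:, -k:]
    topk = np.mean([t in row for t, row in zip(ref_top, topk_q)])
    return cosine, top1, topk

# Synthetic demo: perturb softmax logits slightly to mimic quantization error
rng = np.random.default_rng(1)
logits = 3.0 * rng.standard_normal((100, 256))
ref = np.exp(logits); ref /= ref.sum(-1, keepdims=True)
noisy = np.exp(logits + 0.1 * rng.standard_normal(logits.shape))
noisy /= noisy.sum(-1, keepdims=True)
cos, top1, top5 = attention_metrics(ref, noisy)
print(f"cosine={cos:.4f}  top-1={top1:.0%}  top-5={top5:.0%}")
```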
Verify core algorithm correctness:

```
cd experiments/1_paper_reproduction
python ../../original_implementation/test_turboquant.py
```

Run the batch evaluation (Linux/macOS):

```
cd experiments/2_multi_model_evaluation
./run_all_models_complete.sh
```

On Windows:

```
cd experiments/2_multi_model_evaluation
.\run_all_models_complete.bat
```

Results are automatically saved to experiments/2_multi_model_evaluation/results/ as JSON files, one per model.
Using validate.py (proven stable):

```
cd original_implementation
python validate.py --model Qwen/Qwen2.5-3B-Instruct
python validate.py --model microsoft/phi-2
python validate.py --model mistralai/Mistral-7B-Instruct-v0.1
```

Or directly:

```
cd experiments/2_multi_model_evaluation
python evaluate_model.py --model Qwen/Qwen2.5-3B-Instruct --bits 3
```

Create charts from experimental results:

```
cd experiments/2_multi_model_evaluation
python generate_charts.py
```

Charts are saved to docs/charts/.
Measure speed and memory:

```
cd experiments/3_performance_analysis
python benchmark_speed.py
```

All experimental results are visualized for easy interpretation. For detailed analysis and additional charts, see docs/RESULTS.md and docs/charts/README.md.
| Metric | Qwen2.5-3B | Phi-2 | Mistral-7B |
|---|---|---|---|
| Compression Ratio | 5.0x | 4.8x | 5.0x |
| Cosine Sim (2K) | 0.9961 | 0.9918 | 0.9930 |
| Top-1 Match (8K) | 86.1% | 28.2% | 97.7% |
| Top-5 Match (8K) | 94.4% | 55.7% | 100.0% |
At 3-bit quantization:
- 5.0x compression (Qwen, Mistral)
- 4.8x compression (Phi-2)
- Stable across context lengths (2K-8K tokens)
| Quantization | Qwen | Phi-2 | Mistral | Range |
|---|---|---|---|---|
| 3-bit @ 8K | 0.9945 | 0.9924 | 0.9887 | 98.9% - 99.5% |
Interpretation: Even at 3-bit, attention distributions are 98.9% - 99.5% similar to FP16 (original model).
| Model | 2K Tokens | 4K Tokens | 8K Tokens |
|---|---|---|---|
| Mistral | 97.3% top-1 | 96.5% top-1 | 97.7% top-1 |
| Qwen | 84.7% top-1 | 72.2% top-1 | 86.1% top-1 |
| Phi-2 | 59.7% top-1 | 39.8% top-1 | 28.2% top-1 |
Finding: Mistral maintains consistent performance across all context lengths. Phi-2 degrades significantly with longer contexts.
On a 12GB GPU with 3-bit TurboQuant:
- FP16 baseline: ~8K tokens max context
- TurboQuant 3-bit: ~40K tokens possible (5x improvement)
- Mistral-7B: Best for long-context applications
- Qwen-3B: Best cosine similarity, ideal for similarity-critical tasks
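The context-length gain follows from simple arithmetic over a VRAM budget. The figures below (about 3 GiB left for the KV cache after weights, ~0.37 MiB of FP16 KV per token) are illustrative assumptions for a 7B-class model on a 12 GB card, not measured values:

```python
def max_context_tokens(kv_budget_bytes, fp16_bytes_per_token, bits_per_value):
    # Each cached value shrinks from 16 bits to bits_per_value
    bytes_per_token = fp16_bytes_per_token * bits_per_value / 16
    return int(kv_budget_bytes / bytes_per_token)

budget = 3.0 * 2**30       # ~3 GiB assumed free for the KV cache (assumption)
per_token = 0.37 * 2**20   # ~0.37 MiB of FP16 KV per token (assumption)

fp16_ctx = max_context_tokens(budget, per_token, 16)
tq_ctx = max_context_tokens(budget, per_token, 3.2)  # 3-bit codes + overhead
print(f"FP16:       ~{fp16_ctx:,} tokens")
print(f"TurboQuant: ~{tq_ctx:,} tokens ({tq_ctx / fp16_ctx:.1f}x)")
```

Under these assumptions roughly 8K tokens at FP16 grows to roughly 40K tokens at 3-bit, matching the 5x figure above.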
This repository includes improvements over the original:
| Aspect | Original | Enhanced |
|---|---|---|
| Model Support | Qwen only | 4+ models |
| Evaluation Scripts | Single validate.py | Generic framework |
| Documentation | README + code | Comprehensive docs |
| Analysis | Manual | Automated plotting |
| Reproducibility | Good | Excellent (tested) |
All dependencies are listed in requirements.txt:

```
torch>=2.0          # PyTorch with CUDA support
transformers>=4.40  # Hugging Face transformers
accelerate>=0.25    # Distributed training utilities
bitsandbytes>=0.43  # Quantization library
scipy>=1.10         # Scientific computing
matplotlib>=3.7     # Plotting
pandas>=2.0         # Data analysis
numpy>=1.24         # Numerical computing
streamlit>=1.28     # Web UI framework
```
Install all dependencies:

```
pip install -r requirements.txt
```

If you use TurboQuant or this evaluation framework:
```
@inproceedings{turboquant2026,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  booktitle={ICLR},
  year={2026},
  url={https://arxiv.org/abs/2504.19874}
}
```

- Original implementation: tonbistudio/turboquant-pytorch
- Paper: TurboQuant (ICLR 2026)
- Evaluation framework extensions: This repository
MIT License - See LICENSE file for details
Original implementation attribution in original_implementation/ATTRIBUTION.md
- TurboQuant Paper: https://arxiv.org/abs/2504.19874
- Original Implementation: https://github.com/tonbistudio/turboquant-pytorch
- QJL (1-bit quantized Johnson-Lindenstrauss transform, used for bias correction): https://arxiv.org/abs/2406.03482
- PolarQuant (related work): https://arxiv.org/abs/2502.02617
For detailed information:
- 📖 See HOW_TO_RUN.md for the execution guide
- 🔬 See METHODOLOGY.md for experimental details
- 📊 See RESULTS.md for comprehensive results






