
TurboQuant: KV Cache Compression Experiments

Languages: English | 한국어 (Korean)

Comprehensive evaluation of TurboQuant, a near-optimal vector quantization algorithm for compressing LLM key-value caches.

This repository extends the original implementation with:

  • ✅ Interactive CLI tool for real-time model comparison with actual KV cache compression analysis
  • ✅ Multi-model evaluation (Qwen, LLaMA, Phi, Mistral)
  • ✅ Comprehensive performance benchmarking (generation speed, memory, attention accuracy)
  • ✅ Detailed experimental results and analysis
  • ✅ Reproducible evaluation framework

Quick Start

Installation

# Install all dependencies
pip install -r requirements.txt

Try Interactive Comparison (Recommended for Quick Testing)

Start an interactive session where you can enter prompts and see real KV cache compression:

cd experiments/2_multi_model_evaluation
python interactive_with_real_kv.py --model "Qwen/Qwen2.5-3B-Instruct" --bits 3

Then enter prompts:

[PROMPT] Enter text: What is artificial intelligence?
[PROMPT] Enter text: Explain machine learning
[PROMPT] Enter text: quit

You'll see:

  • Generated text
  • Real KV cache compression analysis (memory savings, speed)
  • Attention accuracy metrics (cosine similarity, top-1/top-5 match)

Try Streamlit Web UI (Easy-to-use Interface)

Interactive web interface for side-by-side original vs TurboQuant comparison:

cd streamlit_app
streamlit run app.py

Open browser at http://localhost:8501

Features:

  • 💬 Side-by-side text generation: Original KV vs TurboQuant output
  • ⚡ Real-time metrics: KV cache size, compression ratio, generation time
  • 🎯 Attention quality: Cosine similarity, top-1/top-5 match percentages
  • 📊 Generation impact analysis: How many attention heads change due to compression
  • 🔧 Model selection: Choose from Qwen, Phi, Mistral
  • ⚙️ Quantization control: Test 2-bit, 3-bit, 4-bit compression

Requirements:

  • Python 3.10+
  • PyTorch 2.0+ with CUDA
  • 12GB+ GPU VRAM

Example:

(Screenshot: Streamlit side-by-side comparison UI)

Run Full Benchmarks

Linux/Mac

# Run synthetic tests (no GPU needed)
cd experiments/1_paper_reproduction
python ../../original_implementation/test_turboquant.py

# Evaluate on a model (GPU required)
cd experiments/2_multi_model_evaluation
./run_all_models_complete.sh

Windows (PowerShell/CMD)

# Navigate to experiments
cd experiments/2_multi_model_evaluation

# Run all models at once
.\run_all_models_complete.bat

Key Results (3-bit Quantization @ 8K Context)

Model        Compression  Cosine Sim  Top-1 %  Top-5 %
Qwen2.5-3B   5.0x         0.9945     86.1%    94.4%
Mistral-7B   5.0x         0.9887     97.7%    100.0%
Phi-2        4.8x         0.9924     28.2%    55.7%

  • Best overall: Mistral-7B (highest top-1 match across all contexts)
  • Highest cosine similarity: Qwen2.5-3B (most similar attention distributions)

Interpretation:

  • 5.0x compression: KV cache shrinks from 290 MB to 58 MB at 8K context
  • 0.9945 cosine sim: attention distributions are 99.45% similar to FP16
  • 86% top-1 match: 86% of attention heads attend most strongly to the same token
  • 94% top-5 match: the original top token appears in the quantized top-5 94% of the time
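The size figures in the first bullet follow directly from the measured ratio; a trivial check (numbers taken from the results above):

```python
# Compressed KV cache size implied by the measured compression ratio.
# 290 MB is the FP16 cache at 8K context, 5.0x the measured 3-bit ratio.
fp16_cache_mb = 290.0
compression_ratio = 5.0

compressed_mb = fp16_cache_mb / compression_ratio
print(f"{fp16_cache_mb:.0f} MB -> {compressed_mb:.0f} MB")
```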

What is TurboQuant?

TurboQuant is a data-oblivious online vector quantization algorithm that:

  1. Rotates vectors randomly (makes coordinates independent)
  2. Quantizes each coordinate with optimal Lloyd-Max codebooks (2-4 bits)
  3. Corrects inner product bias using QJL (1 bit)

Result: High compression with minimal attention accuracy loss.
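A minimal NumPy sketch of steps 1-2 above, for intuition only: a uniform grid stands in for the optimal Lloyd-Max codebook, and the 1-bit QJL bias correction is omitted. This is an illustration, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)

# Step 1: random orthogonal rotation (QR decomposition of a Gaussian matrix).
rot, _ = np.linalg.qr(rng.standard_normal((d, d)))
x_rot = rot @ x

# Step 2: per-coordinate scalar quantization to 3 bits. A uniform grid
# stands in for the Lloyd-Max codebook used by the real algorithm.
bits = 3
levels = 2 ** bits
lo, hi = x_rot.min(), x_rot.max()
step = (hi - lo) / (levels - 1)
codes = np.round((x_rot - lo) / step).astype(np.uint8)  # stored: 3 bits/coord
x_hat = lo + codes * step                               # dequantized values

# Undo the rotation to recover an approximation of x.
x_rec = rot.T @ x_hat
mse = float(np.mean((x - x_rec) ** 2))
print(f"3-bit reconstruction MSE: {mse:.4f}")
```

Only `codes` (plus the rotation seed and the grid endpoints) needs to be stored, which is where the memory savings come from.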

Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026)

Repository Structure

turboquant-experiments/
├── original_implementation/              # Reference code (with attribution)
│   ├── lloyd_max.py                      # Lloyd-Max solver
│   ├── turboquant.py                     # Core algorithm
│   ├── compressors.py                    # Asymmetric attention
│   ├── test_turboquant.py                # Synthetic tests
│   ├── validate.py                       # Original Qwen validation
│   └── ATTRIBUTION.md                    # Source attribution
│
├── experiments/
│   ├── 1_paper_reproduction/             # Reproduce paper results
│   │   └── (verifies MSE bounds, inner product unbiasedness)
│   │
│   ├── 2_multi_model_evaluation/         # Evaluate different models
│   │   ├── interactive_with_real_kv.py   # ⭐ Interactive CLI tool (RECOMMENDED)
│   │   ├── benchmark_generation.py       # Generation performance benchmark
│   │   ├── benchmark_turboquant.py       # Attention accuracy benchmark
│   │   ├── simple_prompt_test.py         # Basic comparison test
│   │   ├── evaluate_model.py             # Generic evaluation framework
│   │   ├── analyze_results.py            # Result analysis & plots
│   │   ├── run_all_models_complete.sh    # Batch evaluation script
│   │   ├── run_all_models_complete.bat   # Windows batch script
│   │   └── results/                      # Benchmark results (JSON)
│   │
│   └── 3_performance_analysis/           # Speed & memory benchmarks
│       ├── benchmark_speed.py
│       └── benchmark_memory.py
│
├── docs/
│   ├── HOW_TO_RUN.md                     # Detailed execution guide
│   ├── METHODOLOGY.md                    # Experimental methodology
│   ├── RESULTS.md                        # Comprehensive results
│   └── README.md
│
└── README.md (this file)

Evaluation Framework

Models Tested

  • Qwen2.5-3B-Instruct (3.5GB) - ✅ Primary baseline
  • Microsoft Phi-2 (2.7GB) - ✅ Small, GPU-efficient
  • Mistral-7B-Instruct-v0.1 (13GB) - ✅ Best overall performance
  • Meta LLaMA-2-7B (13GB) - ❌ Requires HuggingFace authentication (gated repo)

Metrics

  1. Compression Ratio: KV cache size reduction (higher is better)
  2. Cosine Similarity: similarity between original and quantized attention distributions (closer to 1.0 is better)
  3. Top-1 Match %: fraction of heads whose most-attended token is unchanged (higher is better)
  4. Top-5 Match %: fraction of heads whose original top token stays in the quantized top-5 (higher is better)
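A plausible reading of metrics 2-4 in NumPy. The function name and array layout here are illustrative; the repository's benchmark scripts may differ in detail.

```python
import numpy as np

def attention_metrics(orig, quant):
    """Compare original vs quantized attention (illustrative definitions).
    orig, quant: arrays of shape (num_heads, seq_len), one attention
    distribution per head."""
    # Cosine similarity between distributions, averaged over heads.
    num = (orig * quant).sum(axis=-1)
    den = np.linalg.norm(orig, axis=-1) * np.linalg.norm(quant, axis=-1)
    cosine = float((num / den).mean())

    # Top-1 match: fraction of heads whose most-attended token is unchanged.
    top1 = float((orig.argmax(-1) == quant.argmax(-1)).mean())

    # Top-5 match: original top token appears in the quantized top-5.
    top5_sets = np.argsort(quant, axis=-1)[:, -5:]
    hits = [orig[h].argmax() in top5_sets[h] for h in range(orig.shape[0])]
    top5 = float(np.mean(hits))

    return cosine, top1, top5
```

Comparing a model against itself should yield (1.0, 1.0, 1.0); the gap below that quantifies the damage quantization does to attention.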

Context Lengths

Tested on: 2K, 4K, 8K tokens (covering short → long contexts)

Running Experiments

1. Synthetic Validation (No GPU)

Verify core algorithm correctness:

cd experiments/1_paper_reproduction
python ../../original_implementation/test_turboquant.py

2. Model Evaluation (GPU Required)

Linux/Mac

cd experiments/2_multi_model_evaluation
./run_all_models_complete.sh

Windows (PowerShell/CMD)

cd experiments/2_multi_model_evaluation
.\run_all_models_complete.bat

Results are automatically saved to: experiments/2_multi_model_evaluation/results/ with JSON files per model.
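To post-process those files without hard-coding their schema, a small loader sketch (`load_results` is a hypothetical helper, not part of the repository; the JSON layout is whatever evaluate_model.py writes):

```python
import json
from pathlib import Path

def load_results(results_dir):
    """Collect every per-model JSON result file into one dict keyed by
    filename. No field names are assumed; inspect the values to see the
    schema evaluate_model.py actually produces."""
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        results[path.name] = json.loads(path.read_text())
    return results

# e.g. load_results("experiments/2_multi_model_evaluation/results")
```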

Individual Model Evaluation

Using validate.py (proven stable):

cd original_implementation
python validate.py --model Qwen/Qwen2.5-3B-Instruct
python validate.py --model microsoft/phi-2
python validate.py --model mistralai/Mistral-7B-Instruct-v0.1

Or directly:

cd experiments/2_multi_model_evaluation
python evaluate_model.py --model Qwen/Qwen2.5-3B-Instruct --bits 3

3. Generate Visualization Charts

Create charts from experimental results:

cd experiments/2_multi_model_evaluation
python generate_charts.py

Charts saved to: docs/charts/

4. Performance Benchmarking

Measure speed and memory:

cd experiments/3_performance_analysis
python benchmark_speed.py

Results Summary (Comprehensive Evaluation)

Visualization Charts

All experimental results are visualized for easy interpretation:

  • Compression Comparison (8K Context)
  • Cosine Similarity Across Context Lengths
  • Top-1 Match Accuracy (3-bit @ 8K)
  • Context Sensitivity Heatmap
  • Model Comparison Radar Chart (3-bit @ 8K)
  • Compression-Accuracy Tradeoff

For detailed analysis and the chart images, see docs/RESULTS.md and docs/charts/README.md.

Overall Performance by Model @ 3-bit

Metric             Qwen2.5-3B  Phi-2   Mistral-7B
Compression Ratio  5.0x        4.8x    5.0x
Cosine Sim (2K)    0.9961      0.9918  0.9930
Top-1 Match (8K)   86.1%       28.2%   97.7%
Top-5 Match (8K)   94.4%       55.7%   100.0%

Compression Effectiveness

At 3-bit quantization:

  • 5.0x compression (Qwen, Mistral)
  • 4.8x compression (Phi-2)
  • Stable across context lengths (2K-8K tokens)

Attention Accuracy (Cosine Similarity)

Quantization  Qwen    Phi-2   Mistral  Range
3-bit @ 8K    0.9945  0.9924  0.9887   98.9% - 99.5%

Interpretation: Even at 3-bit, attention distributions are 98.9% - 99.5% similar to FP16 (original model).

Context Length Stability

Model    Top-1 @ 2K  Top-1 @ 4K  Top-1 @ 8K
Mistral  97.3%       96.5%       97.7%
Qwen     84.7%       72.2%       86.1%
Phi-2    59.7%       39.8%       28.2%

Finding: Mistral maintains consistent performance across all context lengths. Phi-2 degrades significantly with longer contexts.

Practical Implications

On a 12GB GPU with 3-bit TurboQuant:

  • FP16 baseline: ~8K tokens max context
  • TurboQuant 3-bit: ~40K tokens possible (5x improvement)
  • Mistral-7B: Best for long-context applications
  • Qwen-3B: Best cosine similarity, ideal for similarity-critical tasks
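The 8K → 40K estimate follows from the fact that a ~5x smaller per-token footprint buys ~5x the context in the same memory budget. A back-of-envelope sketch, using illustrative architecture numbers (assumed, not measured from this repository):

```python
# Rough KV-cache sizing; layer/head counts below are typical of a
# 7B-class model with grouped-query attention, chosen for illustration.
layers, kv_heads, head_dim = 32, 8, 128

def kv_cache_gb(tokens, bits_per_value=16.0):
    # Keys and values: 2 tensors per layer, bits/8 bytes per element.
    per_token_bytes = 2 * layers * kv_heads * head_dim * bits_per_value / 8
    return tokens * per_token_bytes / 1e9

fp16 = kv_cache_gb(8_192)                       # FP16 baseline at 8K context
tq3 = kv_cache_gb(40_960, bits_per_value=3.2)   # 16/5 bits at 5x the tokens
print(f"FP16 @ 8K: {fp16:.2f} GB | 3-bit TurboQuant @ 40K: {tq3:.2f} GB")
```

Both calls return the same number of gigabytes, which is exactly the point: a 5.0x compression ratio converts a fixed memory budget into a 5x longer context.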

Code Quality & Improvements

This repository includes improvements over the original:

Aspect              Original            Enhanced
Model Support       Qwen only           4+ models
Evaluation Scripts  Single validate.py  Generic framework
Documentation       README + code       Comprehensive docs
Analysis            Manual              Automated plotting
Reproducibility     Good                Excellent (tested)

Dependencies

All dependencies are listed in requirements.txt:

torch>=2.0          # PyTorch with CUDA support
transformers>=4.40  # Hugging Face transformers
accelerate>=0.25    # Distributed training utilities
bitsandbytes>=0.43  # Quantization library
scipy>=1.10         # Scientific computing
matplotlib>=3.7     # Plotting
pandas>=2.0         # Data analysis
numpy>=1.24         # Numerical computing
streamlit>=1.28     # Web UI framework

Install all dependencies:

pip install -r requirements.txt

Citation

If you use TurboQuant or this evaluation framework:

@inproceedings{turboquant2026,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026},
  url={https://arxiv.org/abs/2504.19874}
}

Acknowledgments

License

MIT License - See LICENSE file for details

Original implementation attribution in original_implementation/ATTRIBUTION.md
