Which quantization should I use? One command tells you.
```
pip install quantsim-bench
quant-sim qwen2.5:7b
```

Benchmarks every quantization level of a model on YOUR GPU. Measures speed, quality, and VRAM. Tells you the best tradeoff.
Every model on Ollama has 5+ quantization levels. You ask Reddit "should I use Q4_K_M or Q5_K_M?" and get 10 different answers. The right answer depends on YOUR GPU, YOUR tasks, YOUR quality threshold.
No existing tool benchmarks speed AND quality across quant levels automatically:
- ollamabench, llm-benchmark, LocalScore: speed only
- lm-evaluation-harness: quality only, manual setup
- ollama-grid-search: prompt tuning, not quant comparison
quant-sim does both in one command.
```
$ quant-sim qwen2.5:7b --quick --speed-only

Quant         Size    VRAM     Speed      Quality   Note
------------  ------  -------  ---------  --------  ---------------
Q3_K_S        3.3G    15004M   128.8/s    --
Q4_K_M        4.4G    7885M    134.2/s    --        * BEST *
Q5_K_M        5.1G    8532M    105.8/s    --
Q6_K          5.8G    9160M    89.0/s     --
Q8_0          7.5G    10813M   69.7/s     --

Recommendation: Use Q4_K_M (qwen2.5:7b-instruct-q4_k_m).
134 tok/s, 4.4 GB.
```
```
$ quant-sim --local --quick --speed-only

Quant         Size    VRAM     Speed      Note
------------  ------  -------  ---------  ---------------
Q4_K_M        7.5G    7857M    117.9/s    * BEST *
Q4_K_M        4.4G    7888M    112.0/s
Q5_K_M        5.1G    8532M    101.2/s
Q4_K_M        4.9G    8619M    98.6/s
Q6_K          5.8G    9220M    89.1/s
Q4_K_M        6.1G    10717M   80.4/s
Q4_K_M        6.1G    10723M   75.9/s
Q8_0          7.5G    11096M   72.7/s
Q4_K_M        8.6G    12375M   50.9/s
Q3_K_M        13.4G   15843M   2.1/s      (CPU offload)
```
Real output from an RTX 4080 (16 GB) with 11 models installed.
```
pip install quantsim-bench   # coming soon — for now: pip install -e . from source
```

Requires: Ollama running locally, NVIDIA GPU.
```
# Benchmark a model (auto-discovers quant variants, pulls if needed)
quant-sim qwen2.5:7b

# Benchmark ALL locally installed models (no downloads)
quant-sim --local

# Quick mode (~2 min instead of ~10 min)
quant-sim qwen2.5:7b --quick

# Speed only (skip the quality test)
quant-sim --local --quick --speed-only

# Don't download anything (only test what's already installed)
quant-sim qwen2.5:7b --no-pull

# Compare specific tags
quant-sim test --tags "qwen3:8b,qwen3:14b,qwen3.5:9b"

# Save results as JSON
quant-sim qwen2.5:7b --json results.json

# Show GPU info
quant-sim --gpu

# List local models
quant-sim --list
```

| Metric | How |
|---|---|
| Speed | Tokens/sec via Ollama chat API (prompt + generation) |
| Quality | 20 built-in questions: facts, math, coding, reasoning |
| VRAM | Peak GPU memory via nvidia-smi during inference |
| Size | Model file size from Ollama |
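As a rough sketch of how the first and third metrics can be collected — the Ollama `/api/chat` response fields and the `nvidia-smi` query flags are from those tools' public interfaces, but the helper names and sampling strategy here are illustrative, not quant-sim's actual code:

```python
import subprocess

def tokens_per_second(chat_response: dict) -> float:
    # Ollama's non-streaming /api/chat response reports eval_count
    # (generated tokens) and eval_duration (nanoseconds spent generating);
    # speed is their ratio, converted to tokens per second.
    return chat_response["eval_count"] / (chat_response["eval_duration"] / 1e9)

def gpu_memory_used_mib() -> int:
    # Sample current GPU memory via nvidia-smi; peak VRAM is the maximum
    # over samples taken while the model is generating.
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.splitlines()[0])
```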
Built-in 20-question test covering:
- Facts (5): capitals, science, literature
- Math (5): arithmetic, word problems
- Coding (5): Python functions, one-liners
- Reasoning (5): logic puzzles, trick questions
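For illustration, entries in such a question bank might look like this — the field names, prompts, and expected keywords are hypothetical, not the tool's shipped set:

```python
# Hypothetical question-bank format; the real 20 questions ship with quant-sim.
QUESTIONS = [
    {"category": "facts", "prompt": "What is the capital of France?",
     "keywords": ["paris"]},
    {"category": "math", "prompt": "What is 17 * 24?",
     "keywords": ["408"]},
    {"category": "coding",
     "prompt": "Write a Python one-liner that reverses a string s.",
     "check": "syntax"},
]
```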
- Discovers available quantization variants of your model
- For each variant: loads model, measures VRAM, runs speed prompts, runs quality questions
- Grades quality answers automatically (keyword matching, code syntax checking)
- Recommends the best tradeoff: highest quality above 80%, then fastest
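The grading and recommendation steps above can be sketched as follows — the exact keyword sets, the choice of `ast.parse` for the syntax check, and the tie-breaking rule are my assumptions, not quant-sim's actual heuristics:

```python
import ast

def grade_answer(answer: str, keywords: list[str]) -> bool:
    # Keyword matching: an answer passes if every expected keyword appears.
    text = answer.lower()
    return all(k.lower() in text for k in keywords)

def grade_code(answer: str) -> bool:
    # Code syntax checking: the returned snippet must at least parse as Python.
    try:
        ast.parse(answer)
        return True
    except SyntaxError:
        return False

def recommend(results: list[dict]) -> dict:
    # One reading of "highest quality above 80%, then fastest": among
    # variants clearing the 80% quality bar, pick the fastest; if none
    # clear it, fall back to the fastest overall.
    passing = [r for r in results if r["quality"] >= 0.80]
    return max(passing or results, key=lambda r: r["speed"])
```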
Share your benchmarks. Compare your GPU against others.
```
# Submit results after benchmarking
quant-sim --local --quick --submit

# View community results
quant-sim --leaderboard
```

Results are stored as GitHub issues — no backend server needed. Set the GITHUB_TOKEN env var to submit.
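Under the hood this pattern is just the GitHub Issues REST API. A minimal sketch — the target repo, the `format_submission` helper, and the issue layout are placeholders, not the tool's actual values:

```python
import json
import os
import urllib.request

def format_submission(gpu: str, results: list[dict]) -> dict:
    # Issue title/body layout here is illustrative only.
    lines = [f"{r['quant']}: {r['speed']:.1f} tok/s, {r['vram']} MiB"
             for r in results]
    return {"title": f"Benchmark: {gpu}", "body": "\n".join(lines)}

def submit(payload: dict, repo: str = "OWNER/quantsim-bench") -> None:
    # OWNER/quantsim-bench is a placeholder; the tool targets its own repo.
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)
```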
- Python 3.10+
- Ollama installed and running
- NVIDIA GPU with nvidia-smi
Apache 2.0