back2matching/quant-sim
quant-sim

Which quantization should I use? One command tells you.

pip install quantsim-bench
quant-sim qwen2.5:7b

Benchmarks every quantization level of a model on YOUR GPU. Measures speed, quality, and VRAM. Tells you the best tradeoff.

Why

Every model on Ollama has 5+ quantization levels. You ask Reddit "should I use Q4_K_M or Q5_K_M?" and get 10 different answers. The right answer depends on YOUR GPU, YOUR tasks, YOUR quality threshold.

No existing tool benchmarks speed AND quality across quant levels automatically:

  • ollamabench, llm-benchmark, LocalScore: speed only
  • lm-evaluation-harness: quality only, manual setup
  • ollama-grid-search: prompt tuning, not quant comparison

quant-sim does both in one command.

Example: Compare All Quant Levels

$ quant-sim qwen2.5:7b --quick --speed-only

Quant          Size    VRAM      Speed  Quality Note
------------ ------ ------- ---------- -------- ---------------
Q3_K_S         3.3G  15004M    128.8/s       --
Q4_K_M         4.4G   7885M    134.2/s       -- * BEST *
Q5_K_M         5.1G   8532M    105.8/s       --
Q6_K           5.8G   9160M     89.0/s       --
Q8_0           7.5G  10813M     69.7/s       --

Recommendation: Use Q4_K_M (qwen2.5:7b-instruct-q4_k_m).
  134 tok/s, 4.4 GB.

Example: Benchmark All Local Models

$ quant-sim --local --quick --speed-only

Quant          Size    VRAM      Speed  Note
------------ ------ ------- ---------- ---------------
Q4_K_M         7.5G   7857M    117.9/s * BEST *
Q4_K_M         4.4G   7888M    112.0/s
Q5_K_M         5.1G   8532M    101.2/s
Q4_K_M         4.9G   8619M     98.6/s
Q6_K           5.8G   9220M     89.1/s
Q4_K_M         6.1G  10717M     80.4/s
Q4_K_M         6.1G  10723M     75.9/s
Q8_0           7.5G  11096M     72.7/s
Q4_K_M         8.6G  12375M     50.9/s
Q3_K_M        13.4G  15843M      2.1/s  (CPU offload)

Real output from RTX 4080 16GB with 11 models installed.

Install

pip install quantsim-bench  # coming soon; for now, run `pip install -e .` from a source checkout

Requires: Ollama running locally, NVIDIA GPU.

Usage

# Benchmark a model (auto-discovers quant variants, pulls if needed)
quant-sim qwen2.5:7b

# Benchmark ALL locally installed models (no downloads)
quant-sim --local

# Quick mode (~2 min instead of ~10 min)
quant-sim qwen2.5:7b --quick

# Speed only (skip quality test)
quant-sim --local --quick --speed-only

# Don't download anything (only test what's already installed)
quant-sim qwen2.5:7b --no-pull

# Compare specific tags
quant-sim test --tags "qwen3:8b,qwen3:14b,qwen3.5:9b"

# Save results as JSON
quant-sim qwen2.5:7b --json results.json

# Show GPU info
quant-sim --gpu

# List local models
quant-sim --list

What It Measures

Metric    How
--------- ------------------------------------------------------
Speed     Tokens/sec via Ollama chat API (prompt + generation)
Quality   20 built-in questions: facts, math, coding, reasoning
VRAM      Peak GPU memory via nvidia-smi during inference
Size      Model file size from Ollama
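Both measurements can be taken with nothing but the standard library. The sketch below shows one way to do it, assuming Ollama's default local endpoint and its non-streaming chat response, which reports eval_count (generated tokens) and eval_duration (nanoseconds); the function names are illustrative, not quant-sim's actual internals:

```python
import json
import subprocess
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint


def speed_from_response(data: dict) -> float:
    """Tokens/sec: eval_count generated tokens over eval_duration nanoseconds."""
    return data["eval_count"] / data["eval_duration"] * 1e9


def tokens_per_second(model: str, prompt: str) -> float:
    """Run one non-streaming chat request and return generation speed."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return speed_from_response(json.load(resp))


def vram_used_mib() -> int:
    """Current GPU memory use via nvidia-smi; poll this during inference
    and keep the maximum to approximate the peak."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"], text=True)
    return max(int(line) for line in out.splitlines() if line.strip())
```

Polling nvidia-smi is coarse (it samples, not traces), but it needs no CUDA bindings, which keeps the dependency list at zero.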

Quality Test

Built-in 20-question test covering:

  • Facts (5): capitals, science, literature
  • Math (5): arithmetic, word problems
  • Coding (5): Python functions, one-liners
  • Reasoning (5): logic puzzles, trick questions
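Answers are graded automatically: keyword matching for facts, math, and reasoning, plus a syntax check for code (see How It Works below). A minimal sketch of such a grader, with illustrative names and keyword lists rather than quant-sim's actual ones:

```python
import ast


def grade_keywords(answer: str, expected: list[str]) -> bool:
    """Pass if any expected keyword appears in the answer (case-insensitive)."""
    low = answer.lower()
    return any(k.lower() in low for k in expected)


def grade_code(answer: str) -> bool:
    """Pass if the answer parses as valid Python."""
    try:
        ast.parse(answer)
        return True
    except SyntaxError:
        return False
```

ast.parse only checks syntax, not correctness, but it cheaply catches the most common quantization failure mode in code answers: truncated or malformed output.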

How It Works

  1. Discovers available quantization variants of your model
  2. For each variant: loads model, measures VRAM, runs speed prompts, runs quality questions
  3. Grades quality answers automatically (keyword matching, code syntax checking)
  4. Recommends the best tradeoff: highest quality above 80%, then fastest
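Step 4 can be sketched as a small selection function. The 0.80 threshold comes from the rule above; the exact tie-breaking and fallback behavior here are assumptions, not the tool's verified logic:

```python
from dataclasses import dataclass


@dataclass
class Result:
    tag: str          # e.g. "qwen2.5:7b-instruct-q4_k_m"
    tok_per_s: float  # measured generation speed
    quality: float    # fraction of the 20 questions passed, 0.0-1.0


def recommend(results: list[Result], threshold: float = 0.80) -> Result:
    """Among variants at or above the quality threshold, pick the fastest;
    if none qualify, fall back to the highest-quality variant."""
    passing = [r for r in results if r.quality >= threshold]
    if passing:
        return max(passing, key=lambda r: r.tok_per_s)
    return max(results, key=lambda r: r.quality)
```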

Community Leaderboard

Share your benchmarks. Compare your GPU against others.

# Submit results after benchmarking
quant-sim --local --quick --submit

# View community results
quant-sim --leaderboard

Results stored as GitHub issues — no backend server needed. Set GITHUB_TOKEN env var to submit.
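Since each result is just a GitHub issue, submission needs nothing beyond the GitHub REST API and the token. A sketch of the mechanism; the results repo path and issue format here are placeholders, not the tool's actual ones:

```python
import json
import os
import urllib.request

RESULTS_REPO = "OWNER/RESULTS-REPO"  # placeholder; not the tool's actual repo path


def issue_payload(title: str, body: str) -> bytes:
    """JSON body for GitHub's 'create an issue' endpoint."""
    return json.dumps({"title": title, "body": body}).encode()


def submit_result(title: str, body: str) -> int:
    """Open one issue per benchmark run; returns the new issue number.
    Requires GITHUB_TOKEN in the environment, as described above."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{RESULTS_REPO}/issues",
        data=issue_payload(title, body),
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["number"]
```

Viewing the leaderboard is then an unauthenticated GET of the same /issues endpoint, which is what makes the no-backend design work.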

Requirements

  • Python 3.10+
  • Ollama installed and running
  • NVIDIA GPU with nvidia-smi

License

Apache 2.0
