kvcache-bench

Benchmark every KV cache compression method on your GPU. One command, real numbers.

kvcache-bench --model qwen3.5:9b

| KV Type | Context | Prompt    | Gen tok/s | Prefill tok/s | VRAM +MB | Quality |
|---------|---------|-----------|-----------|---------------|----------|---------|
| f16     | 4096    | short     | 86.9      | 509.5         | +80      | PASS    |
| f16     | 16384   | short     | 78.8      | 784.6         | +316     | PASS    |
| q8_0    | 4096    | short     | 86.7      | 793.1         | +48      | PASS    |
| q8_0    | 16384   | short     | 87.2      | 741.9         | +219     | PASS    |
| q4_0    | 4096    | short     | 86.7      | 798.0         | +59      | PASS    |
| q4_0    | 16384   | short     | 86.7      | 522.7         | +156     | PASS    |

──────────────────────────────────────────────────
RECOMMENDATION
──────────────────────────────────────────────────

  Use q8_0 (8-bit KV cache)
  Speed: 87 tok/s (-0.1% vs f16)
  VRAM: saves 97 MB vs f16 (2x compression)
  Quality: negligible loss (+0.004 perplexity)

  Set: OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1

Real output from Qwen3.5-9B on RTX 4080 16GB.

Why

When you run a local LLM, the KV cache eats your VRAM. Ollama and llama.cpp support different KV cache quantization types (f16, q8_0, q4_0), but nobody tells you what the actual tradeoff is on YOUR hardware.

Current state of the world:

- You Google "ollama kv cache quantization" and find forum posts with conflicting advice
- You manually test each config, eyeball nvidia-smi, and guess
- No tool compares them systematically

kvcache-bench fixes this. It tests every KV cache type on your GPU and gives you a comparison table with speed, VRAM, and quality.

Install

pip install kvcache-bench

Usage

# Auto-detect your first model, test all KV types
kvcache-bench

# Specific model
kvcache-bench --model qwen3.5:9b

# Test at multiple context lengths (where KV savings matter most)
kvcache-bench --model llama3.1:8b --context 4096,8192,16384

# Include tool calling test
kvcache-bench --model qwen3.5:9b --prompts short,code,reasoning,tool_call

# Save results as JSON
kvcache-bench --model qwen3.5:9b --json results.json

# Just show GPU info
kvcache-bench --gpu

# List available models
kvcache-bench --list-models

What It Tests

For each KV cache type (f16, q8_0, q4_0), it measures:

| Metric           | How                                                                      |
|------------------|--------------------------------------------------------------------------|
| Generation speed | Tokens per second during generation                                      |
| Prefill speed    | Tokens per second processing the prompt                                  |
| VRAM delta       | Extra VRAM used beyond model weights (measured via nvidia-smi)           |
| Quality          | Auto-checked against expected answers (Paris, code structure, reasoning) |
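
As a rough illustration of the VRAM and quality rows, the sketch below reads used memory from nvidia-smi and applies a keyword check to a response. The function names (`parse_memory_used`, `query_vram_mb`, `passes_keyword_check`) are hypothetical and not taken from kvcache-bench's source. The substring grader is an assumption about what a "simple" check might look like.

```python
import subprocess


def parse_memory_used(csv_text: str) -> list[int]:
    """Parse the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`,
    which prints one integer (MiB) per GPU, one GPU per line."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def query_vram_mb(gpu_index: int = 0) -> int:
    """Return the memory currently in use on one GPU, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used(out)[gpu_index]


def passes_keyword_check(response: str, expected: str) -> bool:
    """Hypothetical grader: case-insensitive substring match."""
    return expected.lower() in response.lower()
```

A VRAM delta would then be the difference between two `query_vram_mb()` readings, one taken after the model has loaded and one taken during inference. Other processes using the GPU will skew the reading, so an idle GPU gives cleaner numbers.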

How It Works

1. Detects your GPU and Ollama installation
2. For each KV cache type: restarts Ollama with OLLAMA_KV_CACHE_TYPE=<type>, warms up the model, runs benchmark prompts
3. Measures VRAM before and during inference via nvidia-smi
4. Extracts timing from Ollama's API response (prompt_eval_duration, eval_duration)
5. Checks response quality with simple auto-graders
6. Produces a markdown table (and optional JSON)
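
The environment and timing steps can be sketched roughly as follows. This is a minimal illustration, not kvcache-bench's actual code: the helper names are hypothetical, and it assumes the response dictionary comes from Ollama's generate or chat endpoint, where `prompt_eval_duration` and `eval_duration` are reported in nanoseconds alongside `prompt_eval_count` and `eval_count`.

```python
import os
from typing import Optional


def ollama_env(kv_type: str, base_env: Optional[dict] = None) -> dict:
    """Build the environment for an Ollama server process using one KV type."""
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_type
    # Set together with the KV type, as in the recommendation output.
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env


def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration to tokens per second."""
    return count / (duration_ns / 1e9) if duration_ns > 0 else 0.0


def speeds_from_response(resp: dict) -> dict:
    """Derive prefill and generation speed from an Ollama response dict.
    Missing fields are treated as zero rather than raising."""
    return {
        "prefill_tok_s": tokens_per_second(
            resp.get("prompt_eval_count", 0), resp.get("prompt_eval_duration", 0)
        ),
        "gen_tok_s": tokens_per_second(
            resp.get("eval_count", 0), resp.get("eval_duration", 0)
        ),
    }
```

The dictionary returned by `ollama_env("q8_0")` could be passed as `env=` to `subprocess.Popen(["ollama", "serve"], env=...)` when restarting the server. How the real tool stops and restarts Ollama is not shown here.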

What the Research Says

Based on llama.cpp community benchmarks and our testing:

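| KV Type | VRAM Savings | Perplexity Impact   | Best For                                                 |
|---------|--------------|---------------------|----------------------------------------------------------|
| f16     | Baseline     | None                | When you have VRAM to spare                              |
| q8_0    | 2x           | +0.004 (negligible) | Default recommendation. Free VRAM, zero quality cost.    |
| q4_0    | 4x           | +0.2 (noticeable)   | When you need max context length or are VRAM-constrained |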

The sweet spot for most users: q8_0. Halves your KV cache VRAM with essentially zero quality loss.
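
To see where the savings come from, the theoretical KV cache size can be estimated from the model shape. The sketch below is a back-of-the-envelope calculation, not something kvcache-bench computes. The model dimensions (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions in the range of an 8B model with grouped-query attention. The bytes-per-element values follow llama.cpp's block layouts, where each block of 32 values carries one fp16 scale.

```python
# Bytes per cached element. q8_0 packs 32 int8 values plus an fp16 scale
# (34 bytes per block); q4_0 packs 32 4-bit values plus an fp16 scale
# (18 bytes per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(kv_type: str, n_layers: int, n_kv_heads: int,
                 head_dim: int, context: int) -> float:
    """Estimate KV cache size in MiB for a full context window."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context
    return elements * BYTES_PER_ELEMENT[kv_type] / (1024 ** 2)


if __name__ == "__main__":
    # Assumed, illustrative model shape at a 16384-token context.
    for kv_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_mib(kv_type, n_layers=32, n_kv_heads=8,
                            head_dim=128, context=16384)
        print(f"{kv_type}: {size:.0f} MiB")
    # f16: 2048 MiB
    # q8_0: 1088 MiB
    # q4_0: 576 MiB
```

Under these assumptions the ratios work out to about 1.9x for q8_0 and 3.6x for q4_0, so the "2x" and "4x" figures are roundings. Measured VRAM deltas will not match this estimate exactly, since the runtime allocates other buffers and may not fill the whole context.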
