Systematically sweep every meaningful llama-bench parameter combination for any GGUF model, record every result as JSONL, and surface the optimal configuration for your hardware.
llamaseye runs llama-bench across every meaningful parameter combination for any GGUF model. It sweeps each axis independently — GPU layer offload (ngl), flash attention, KV cache quantisation type, thread count, KV offload ratio, batch size, and context size — then runs a full combination matrix (Phase 7) to confirm which configs work together and find the true performance ceiling.
Every result is recorded as JSONL in a per-model output directory, alongside a human-readable Markdown summary, a raw log, a hardware snapshot, and a resume-state file. Runs that trigger an OOM or timeout are caught and logged — the sweep never hangs. OOM and timeout are distinguished: OOM means a context size is impossible at that memory budget; timeout means it is achievable but slow. Timeout runs write a "status": "timeout" record with wall_time_sec to sweep.jsonl and appear in a dedicated section of sweep.md.
llamaseye is fully portable: it detects CPU core count, available RAM, GPU VRAM, the active compute backend (cuda / metal / cpu), and the correct thermal-sensor commands at runtime. There are no hardcoded machine values. Optionally, pass a TurboQuant build of llama-bench via --turbo-bench to unlock the turbo2/turbo3/turbo4 KV cache types from the llama-cpp-turboquant fork, which compress the KV cache 3–6× and enable much longer contexts on the same hardware.
Build the binary (Go 1.22+):

```bash
go build -o llamaseye .
```

Single model:

```bash
./llamaseye --model ~/Models/Qwen3-14B-Q4_K_M.gguf --output-dir ./results
```

All models in a directory:

```bash
./llamaseye --models-dir ~/Models --output-dir ./results
```

From a model list file:

```bash
./llamaseye --models-dir ~/Models --model-list my_models.txt --output-dir ./results
```

With TurboQuant KV types:

```bash
./llamaseye --model ~/Models/model.gguf --turbo-bench ~/llama-cpp-turboquant/build/bin/llama-bench
```

Unattended overnight run:

```bash
nohup ./llamaseye --models-dir ~/Models --output-dir ./results > /dev/null 2>&1 &
```

All environment variables can be set in a .env file instead of passing flags every time. The binary auto-loads .env from the working directory if it exists, so no source step is needed:
```bash
cp example.env .env
# Edit .env to match your paths and preferences
./llamaseye --models-dir ~/Models   # .env is loaded automatically
```

To load a file at a different path, use --env-file:

```bash
./llamaseye --env-file ~/my-config.env --model ~/Models/model.gguf
```

Load order: .env file → process environment → CLI flags. Process env vars always override file values; CLI flags override everything.
Every CLI flag has a corresponding environment variable — env vars set the default value, and CLI flags override them when both are provided. example.env in the repo root documents every available variable with its default value and a description. The most important ones to set are LLAMA_BENCH_BIN (path to your llama-bench binary) and SWEEP_OUTPUT_DIR (where results are written).
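The load order can be pictured as three layered lookups. The sketch below is illustrative Python, not llamaseye's actual Go code; `resolve_setting` and its three mapping arguments are hypothetical names, and CLI flags are keyed by their environment-variable name purely for simplicity.

```python
def resolve_setting(name, env_file, process_env, cli_flags, default=None):
    """Return the effective value for one setting.

    Precedence, lowest to highest: .env file, process environment, CLI flag.
    All three mappings are hypothetical stand-ins for the real sources.
    """
    value = default
    for source in (env_file, process_env, cli_flags):
        if name in source:
            value = source[name]  # a later source overrides an earlier one
    return value


env_file = {"SWEEP_OUTPUT_DIR": "./from-file", "SWEEP_NGL_STEP": "2"}
process_env = {"SWEEP_OUTPUT_DIR": "./from-env"}
cli_flags = {}

print(resolve_setting("SWEEP_OUTPUT_DIR", env_file, process_env, cli_flags))  # ./from-env
print(resolve_setting("SWEEP_NGL_STEP", env_file, process_env, cli_flags))    # 2
```

A value that appears only in the file survives untouched, while anything also present in the process environment or on the command line is replaced.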
$VAR and ${VAR} references in unquoted and double-quoted .env values are expanded against the process environment, so paths like SWEEP_OUTPUT_DIR=${HOME}/Models/bench/sweep work as expected. Single-quoted values are kept literal.
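As a rough illustration of the quoting rules above, here is a small Python sketch. It is not the binary's real parser: `expand_env_value` is a hypothetical helper, and expanding unknown variables to an empty string is an assumption of this example.

```python
import re

# Matches ${NAME} or $NAME.
_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")


def expand_env_value(raw, environ):
    """Expand $VAR / ${VAR} in an unquoted or double-quoted .env value.

    Single-quoted values are returned literally. Unknown variables become
    an empty string, which is an assumption of this sketch.
    """
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == raw[-1] == "'":
        return raw[1:-1]  # single quotes: keep literal
    if len(raw) >= 2 and raw[0] == raw[-1] == '"':
        raw = raw[1:-1]   # double quotes: strip them, then expand
    return _VAR.sub(lambda m: environ.get(m.group(1) or m.group(2), ""), raw)


env = {"HOME": "/home/user"}
print(expand_env_value("${HOME}/Models/bench/sweep", env))  # /home/user/Models/bench/sweep
print(expand_env_value("'$HOME/literal'", env))             # $HOME/literal
```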
.env is gitignored — your local paths and configuration will not be committed.
llamaseye is a single Go binary with no runtime dependencies beyond the OS. Requires Go 1.22 or later.
```bash
# Build
go build -o llamaseye .
# Optionally install into your PATH
go install github.com/WagnerJust/llamaseye@latest
```

The binary statically links all dependencies. No external tools are required at runtime except llama-bench itself.
llamaseye does not install or build llama-bench for you. You must build it yourself from llama.cpp with whatever backend flags suit your hardware before running llamaseye.
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build for CUDA (NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)

# Build for Metal (macOS — Apple Silicon or Intel Mac)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(sysctl -n hw.logicalcpu)

# Build CPU-only (any platform, no GPU)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)
```

The binary will be at build/bin/llama-bench. Pass its path to llamaseye via `--llama-bench <path>` or set the LLAMA_BENCH_BIN environment variable. There is no default: llamaseye exits with an error if the binary is not specified.
The build flags you choose determine which backends and features are available during the sweep. llamaseye works with any valid llama-bench binary — it does not require any specific build flags itself.
To enable turbo2/turbo3/turbo4 KV cache types, build a second binary from the llama-cpp-turboquant fork (branch feature/turboquant-kv-cache) and pass it via --turbo-bench <path>. The fork is otherwise identical to llama.cpp — same build flags apply.
```bash
git clone https://github.com/TheTom/llama-cpp-turboquant \
  --branch feature/turboquant-kv-cache --depth=1
cd llama-cpp-turboquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)
```

llamaseye verifies the binary at startup by running `<binary> --help` and checking for the turbo3 marker in the output. If the marker is present, turbo types are enabled. If the path is missing, not executable, or the marker is absent, turbo types are silently omitted and the sweep continues with the standard KV type set. It is safe to always pass --turbo-bench: llamaseye handles an invalid path gracefully.
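The startup check described above amounts to a guarded `--help` call. The following Python sketch shows the idea only; `turbo_types_available` is a hypothetical name, and the 10-second timeout is an assumption rather than a documented value.

```python
import os
import subprocess


def turbo_types_available(bench_path, marker="turbo3", timeout_sec=10):
    """Return True if `<bench_path> --help` mentions the marker string.

    Any problem (missing path, not executable, crash, timeout) is treated
    as "turbo types unavailable" instead of raising.
    """
    if not bench_path or not os.path.isfile(bench_path):
        return False
    if not os.access(bench_path, os.X_OK):
        return False
    try:
        proc = subprocess.run(
            [bench_path, "--help"],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=timeout_sec,  # assumed value for this sketch
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Help text may be printed on either stream, so check both.
    return marker in (proc.stdout + proc.stderr)
```

Because every failure path returns `False`, a caller can always pass the path and fall back to the standard KV type set when the check does not pass.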
Warning: RotorQuant support is currently non-functional. Do not use --rotor-bench until this is resolved.
To enable planar3/planar4/iso3/iso4 KV cache types (from the RotorQuant project), build from the johndpope/llama-cpp-turboquant fork (branch feature/planarquant-kv-cache) and pass it via --rotor-bench <path>.
```bash
git clone https://github.com/johndpope/llama-cpp-turboquant \
  --branch feature/planarquant-kv-cache --depth=1 llama-cpp-rotorquant
cd llama-cpp-rotorquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)
```

RotorQuant types slot between q4_0 and TurboQuant types in the quality ordering: they offer better PPL at equivalent compression (e.g. iso3 PPL 6.91 vs turbo3 PPL 7.07 on Llama 3.1 8B). They are included in the Phase 6 OOM fallback chain when --rotor-bench is provided.
The Go binary uses native OS APIs for hardware detection and thermal monitoring. A few optional tools extend these capabilities:
| Tool | Purpose | Install |
|---|---|---|
| `nvidia-smi` | NVIDIA GPU VRAM/temp detection | Included with NVIDIA drivers |
| `sensors` | Linux CPU temperature reading | `apt install lm-sensors` |
| `osx-cpu-temp` | macOS CPU temperature reading (optional) | `brew install osx-cpu-temp` |
If these are absent, llamaseye disables the corresponding thermal guard and logs a warning — the sweep still runs.
| Flag | Description |
|---|---|
| `--model <path>` | Single GGUF model to benchmark |
| `--models-dir <dir>` | Directory to scan for GGUF models |
| `--model-list <file>` | Text file listing model filenames (one per line) |
| `--output-dir <dir>` | Root directory for all results (default: `./results`) |
| `--llama-bench <path>` | Path to standard llama-bench binary |
| `--turbo-bench <path>` | Path to TurboQuant llama-bench binary (enables turbo2/3/4 KV types) |
| `--rotor-bench <path>` | [EXPERIMENTAL - broken] Path to RotorQuant llama-bench binary (enables planar3/4, iso3/4 KV types) |
| `--asymmetric-kv` / `--no-asymmetric-kv` | Include asymmetric K/V quant combos in Phase 2 when `--turbo-bench` is set (default: enabled) |
| `--ngl-step <n>` | Step size for NGL axis sweep (default: 4) |
| `--repetitions <n>` | Repetitions per benchmark run (default: 3) |
| `--timeout <s>` | Per-run timeout in seconds (default: 600) |
| `--goal <spec>` | Goal-directed Phase 7: stop after 3 validated configs meeting the spec. Format: `ctx=N,tg=N,pp=N` (all optional). Example: `--goal "ctx=32768,tg=5"` |
| `--resume` | Resume a previous sweep, skipping completed phases |
| `--overwrite` | Delete existing output dir and re-run everything |
| `--only-phases <list>` | Comma-separated list of phase numbers to run (e.g. `0,1,7`) |
| `--focused` | Only run combos not already in sweep.jsonl (requires `--only-phases`). Skipped combos still populate working sets for downstream phases. |
| `--skip-phases <list>` | Comma-separated list of phase numbers to skip |
| `--report` | Read-only: regenerate sweep.md from existing sweep.jsonl files without running any benchmarks. Also generates summary.md when multiple models are found. Combine with `--model`/`--models-dir` to target a subset; omit both to scan all subdirs of `--output-dir`. |
| `--dry-run` | Print what would run without executing |
| `--no-confirm` | Skip the pre-run confirmation prompt |
| `--debug` | Enable verbose [DEBUG] lines in the log: full command lines, raw stdout/stderr, OOM matches, thermal polls, GGUF metadata |
| `--goal-hits N` | Stop goal mode after N distinct (ngl, ctk, nkvo, ctx) configs are found (default: 3) |
| `--goal-sort tg\|ctx\|ngl\|pp` | Sort Goal Results table by this axis, descending (default: tg) |
| `--cpu-temp-limit <°C>` | Pause if CPU exceeds this temperature (default: 88) |
| `--gpu-temp-limit <°C>` | Pause if GPU exceeds this temperature (default: 81) |
| `--no-thermal-guard` | Disable thermal polling entirely |
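The `--goal` format listed in the table (`ctx=N,tg=N,pp=N`, all optional) is easy to reason about as a small key/value parser. This is a Python sketch of that idea, not the tool's implementation: `parse_goal`, `meets_goal`, the result keys, and the "at or above the target" comparison are all assumptions made for illustration.

```python
def parse_goal(spec):
    """Parse a goal spec such as "ctx=32768,tg=5" into a dict of floats."""
    allowed = {"ctx", "tg", "pp"}
    goal = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        key = key.strip()
        if not sep or key not in allowed:
            raise ValueError(f"bad goal term: {part!r}")
        goal[key] = float(value)
    return goal


def meets_goal(result, goal):
    """True when every requested metric is at or above its target.

    `result` is a hypothetical dict with "ctx", "tg" and "pp" keys.
    """
    return all(result.get(key, 0) >= target for key, target in goal.items())


goal = parse_goal("ctx=32768,tg=5")
print(goal)                                                  # {'ctx': 32768.0, 'tg': 5.0}
print(meets_goal({"ctx": 32768, "tg": 6.1, "pp": 90}, goal))  # True
```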
These control where each phase begins its sweep and which direction it moves. Direction flags accept up or down.
| Flag | Description |
|---|---|
| `--start-ngl <n>` | Begin NGL sweep at this value (default: MAX_NGL − 2×step) |
| `--ngl-dir up\|down` | NGL sweep direction (default: up = toward MAX_NGL) |
| `--start-threads <n>` | Begin thread count sweep at this value |
| `--threads-dir up\|down` | Thread sweep direction (default: up) |
| `--start-ctx <n>` | Begin context sweep at this prompt size; also sets Phase 7 min-ctx |
| `--ctx-dir up\|down` | Context sweep direction (default: up = toward 131072) |
| `--fine-ctx` | Enable midpoint bisection in Phase 6 (see Fine-grained context sweep) |
| `--ctx-step-min <n>` | Minimum bisection step for `--fine-ctx` (default: 8192) |
| `--start-ctk <type>` | Begin KV quant sweep at this type |
| `--ctk-dir up\|down` | KV type sweep direction (default: up = toward more compression) |
| `--start-ctv <type>` | Begin V-cache quant sweep at this type (ignored when `--ctv` is set) |
| `--ctv-dir up\|down` | V-cache type sweep direction (default: up = toward more compression) |
| `--ctv <list>` | Restrict Phase 2 to these CTV values (comma-separated, e.g. `turbo3,turbo2`); takes precedence over `--start-ctv` / `--ctv-dir` |
| `--start-b <n>` | Begin batch size sweep at this value |
| `--b-dir up\|down` | Batch sweep direction (default: up) |
| `--start-ub <n>` | Begin ubatch size sweep at this value |
| `--ub-dir up\|down` | Ubatch sweep direction (default: up) |
| `--start-fa 0\|1` | Begin FA sweep at this value (default: 0) |
| `--fa-dir up\|down` | FA sweep direction (default: up = 0→1) |
These filter the Phase 7 combination matrix without affecting phases 1–6. When not set, smart defaults are derived automatically (see Smart defaults).
| Flag | Description |
|---|---|
| `--min-ngl <n>` | Exclude NGL values below N from Phase 7 |
| `--min-threads <n>` | Exclude thread counts below N from Phase 7 |
| `--min-ctx <n>` | Exclude context sizes below N from Phase 7 |
| `--min-ctk <type>` | Exclude KV types below TYPE (by quality) from Phase 7 |
| `--min-b <n>` | Exclude batch sizes below N from Phase 7 |
| `--min-ub <n>` | Exclude ubatch sizes below N from Phase 7 |
Every CLI flag can also be set via environment variable — useful for .env files so you don't repeat flags on every invocation. CLI flags always override env vars when both are set. See example.env for the full list with defaults and descriptions.
| Variable | Equivalent flag | Example |
|---|---|---|
| `SWEEP_RESUME` | `--resume` | `SWEEP_RESUME=true` |
| `SWEEP_OVERWRITE` | `--overwrite` | `SWEEP_OVERWRITE=true` |
| `SWEEP_SKIP_PHASES` | `--skip-phases` | `SWEEP_SKIP_PHASES=7` |
| `SWEEP_ONLY_PHASES` | `--only-phases` | `SWEEP_ONLY_PHASES=0,1,6` |
| `SWEEP_FOCUSED` | `--focused` | `SWEEP_FOCUSED=true` |
| `SWEEP_NGL_STEP` | `--ngl-step` | `SWEEP_NGL_STEP=2` |
| `SWEEP_START_NGL` | `--start-ngl` | `SWEEP_START_NGL=40` |
| `SWEEP_NGL_DIR` | `--ngl-dir` | `SWEEP_NGL_DIR=down` |
| `SWEEP_START_CTX` | `--start-ctx` | `SWEEP_START_CTX=32768` |
| `SWEEP_MIN_CTX` | `--min-ctx` | `SWEEP_MIN_CTX=32768` |
| `SWEEP_MIN_NGL` | `--min-ngl` | `SWEEP_MIN_NGL=16` |
| `SWEEP_MIN_CTK` | `--min-ctk` | `SWEEP_MIN_CTK=q8_0` |
| `SWEEP_MIN_THREADS` | `--min-threads` | `SWEEP_MIN_THREADS=8` |
| `SWEEP_MIN_B` | `--min-b` | `SWEEP_MIN_B=1024` |
| `SWEEP_MODEL_LIST` | `--model-list` | `SWEEP_MODEL_LIST=~/list.txt` |
| `SWEEP_NO_CONFIRM` | `--no-confirm` | `SWEEP_NO_CONFIRM=true` |
| `SWEEP_DRY_RUN` | `--dry-run` | `SWEEP_DRY_RUN=true` |
| `SWEEP_DEBUG` | `--debug` | `SWEEP_DEBUG=true` |
| `SWEEP_GOAL_HITS` | `--goal-hits` | `SWEEP_GOAL_HITS=5` |
| `SWEEP_GOAL_SORT` | `--goal-sort` | `SWEEP_GOAL_SORT=ctx` |
| Phase | Name | What varies | Everything else |
|---|---|---|---|
| 0 | NGL Probe | Binary search for max stable GPU layers — starts at model's layer count (from GGUF metadata), falls back to 99 | Defaults — establishes MAX_NGL |
| 1 | NGL Axis | NGL values up to model's layer count (capped there since higher values are identical); near MAX_NGL by default (use `--start-ngl 0` for full sweep) | Defaults |
| 2 | FA + KV Quant Axis | Flash attention on/off × KV cache type | Best NGL from Phase 1 |
| 3 | Thread Count | CPU thread count variants | Best NGL, best FA/KV |
| 4 | KV Offload | KV cache in VRAM (nkvo=0) vs RAM (nkvo=1) | Best NGL, best FA/KV, best threads |
| 5 | Batch Size | ubatch and batch size variants | Best values so far |
| 6 | Context Ceiling | Prompt size scaled up to OOM/timeout, with fallback configs on OOM; timeout runs are recorded with wall time | Best values so far |
| 7 | Full Combination Matrix | Cartesian product of all best-per-axis working sets; with `--goal`, runs ranked and exits early once the goal is satisfied | n/a |
Common use cases work without any extra flags. The key smart behaviors:
At sweep start, llamaseye reads the model's layer count from its GGUF metadata. NGL values above that count are functionally identical (llama.cpp silently clamps NGL to the layer count), so:
- Phase 0 starts its probe at `NumLayers` instead of 99, eliminating up to 15 wasted probe runs for small models
- Phase 1 caps its sweep list at `NumLayers`, shrinking the NGL working set and keeping Phase 7's cartesian product manageable
If GGUF parsing fails (non-standard file), both phases fall back to the 99 ceiling.
Phase 1 starts at MAX_NGL − 2×step by default, testing only the top ~3 NGL values near the VRAM ceiling. Low-NGL configs rarely matter for performance. Use --start-ngl 0 for a full 0→MAX_NGL sweep, or --start-ngl 40 --ngl-dir down to sweep downward from a specific cap.
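To make the default start point concrete, here is a small Python sketch of how such an NGL list could be built. It is an assumption-laden illustration: `ngl_sweep_values` is a hypothetical helper, and both the "always include MAX_NGL" rule and the "sweep down to 0" rule are guesses at reasonable behaviour, not documented guarantees.

```python
def ngl_sweep_values(max_ngl, step=4, start=None, direction="up"):
    """Return the NGL values a Phase-1-style sweep would visit.

    With no explicit start, begin at MAX_NGL - 2*step (floored at 0).
    """
    if start is None:
        start = max(0, max_ngl - 2 * step)
    if direction == "up":
        values = list(range(start, max_ngl + 1, step))
        if not values or values[-1] != max_ngl:
            values.append(max_ngl)  # assumption: the ceiling is always tested
        return values
    # direction == "down": walk from the start (capped at MAX_NGL) toward 0
    return list(range(min(start, max_ngl), -1, -step))


print(ngl_sweep_values(40))                                 # [32, 36, 40]
print(len(ngl_sweep_values(40, start=0)))                   # 11
print(ngl_sweep_values(48, start=40, direction="down")[:3])  # [40, 36, 32]
```

With MAX_NGL = 40 and the default step of 4, the default sweep covers three values, whereas a full sweep from 0 covers eleven.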
When the primary config OOMs at a given context size, Phase 6 automatically tries progressively more memory-friendly alternatives before giving up:
- Flip nkvo (move KV cache from VRAM → RAM)
- V-first: keep CTK fixed, try more-compressed CTV types — V compression is effectively free quality-wise, so this is exhausted before touching K
- K+V: try more-compressed CTK types (with their paired CTV from Phase 2) × both nkvo values
Only ctk/ctv/nkvo values already validated by Phases 2 and 4 are tried. The KV quality order for fallback selection (most to least quality): f16 > q8_0 > q4_0 > iso4 > planar4 > turbo4 > iso3 > planar3 > turbo3 > turbo2.
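The ordering of the fallback steps above can be sketched as a generator of candidate configs. This Python example is a loose model under stated assumptions: `fallback_chain` and its parameters are hypothetical, the V-first step is assumed to keep the original nkvo value, and `paired_ctv` stands in for whatever CTK-to-CTV pairing Phase 2 recorded.

```python
# Quality order from the text, most to least quality.
KV_QUALITY = ["f16", "q8_0", "q4_0", "iso4", "planar4",
              "turbo4", "iso3", "planar3", "turbo3", "turbo2"]
RANK = {name: i for i, name in enumerate(KV_QUALITY)}


def fallback_chain(ctk, ctv, nkvo, validated_ctv, paired_ctv, validated_nkvo):
    """List (ctk, ctv, nkvo) fallbacks in the order they would be tried."""
    chain = []
    # 1. Flip nkvo (move the KV cache between VRAM and RAM).
    flipped = 1 - nkvo
    if flipped in validated_nkvo:
        chain.append((ctk, ctv, flipped))
    # 2. V-first: keep CTK, try more-compressed validated CTV types.
    for v in sorted(validated_ctv, key=RANK.get):
        if RANK[v] > RANK[ctv]:
            chain.append((ctk, v, nkvo))
    # 3. K+V: more-compressed CTK with its paired CTV, for each nkvo value.
    for k in sorted(paired_ctv, key=RANK.get):
        if RANK[k] > RANK[ctk]:
            for n in (0, 1):
                if n in validated_nkvo:
                    chain.append((k, paired_ctv[k], n))
    return chain


for candidate in fallback_chain(
    "q8_0", "q8_0", 0,
    validated_ctv=["q8_0", "turbo3"],
    paired_ctv={"q8_0": "q8_0", "q4_0": "q4_0"},
    validated_nkvo={0, 1},
):
    print(candidate)
```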
By default Phase 6 sweeps context as powers of two (512 → 1024 → … → 65536 → 131072). This means a more-compressed KV type might unlock an intermediate context size that the sweep never discovers.
Use --fine-ctx to enable midpoint bisection: when all fallbacks fail at a ctx size, the sweep bisects between the last successful ctx and the failed ctx, probing the midpoint and narrowing until the gap is ≤ --ctx-step-min (default 8192).
```bash
# example: turbo3 unlocks 98304 on a card where q4_0 maxes at 65536
./llamaseye --model my.gguf --fine-ctx
```
Each bisection probe runs the full primary + fallback sequence, so --fine-ctx is off by default — it adds real runtime cost at large context sizes.
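The bisection loop itself is short. In the Python sketch below, `bisect_context` and the `probe` callback are hypothetical names; `probe(ctx)` stands in for "run the primary config plus all fallbacks at this context size and report whether any succeeded".

```python
def bisect_context(last_ok, failed, probe, step_min=8192):
    """Narrow the gap between the last working and first failing context.

    Returns the largest context size observed to work.
    """
    lo, hi = last_ok, failed
    while hi - lo > step_min:
        mid = (lo + hi) // 2
        if probe(mid):
            lo = mid   # midpoint works: search higher
        else:
            hi = mid   # midpoint fails: search lower
    return lo


# Toy probe: pretend anything up to 98304 tokens fits in memory.
print(bisect_context(65536, 131072, lambda ctx: ctx <= 98304))  # 98304
```

In this toy run the sweep probes 98304, 114688 and 106496 before the gap shrinks to 8192, which is why each extra probe matters for runtime at large context sizes.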
Phase 7 now generates the cartesian product of independent CTK × CTV axes rather than replaying only the exact pairs tested in Phase 2. This expands the search space to include combos like (ctk=f16, ctv=turbo3) even if that exact pair wasn't in Phase 2.
Precision filter: combos where V is more precise than K are automatically skipped as wasteful (e.g. ctk=turbo3, ctv=f16 is dropped). The filter rule: CTK quality ≥ CTV quality.
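The product-plus-filter step can be illustrated in a few lines of Python. `kv_pairs` is a hypothetical helper; the quality list is the ordering given elsewhere in this document.

```python
from itertools import product

# Most to least quality.
KV_QUALITY = ["f16", "q8_0", "q4_0", "iso4", "planar4",
              "turbo4", "iso3", "planar3", "turbo3", "turbo2"]
RANK = {name: i for i, name in enumerate(KV_QUALITY)}


def kv_pairs(ctk_values, ctv_values):
    """Cartesian product of CTK x CTV, keeping only CTK quality >= CTV quality."""
    return [
        (k, v)
        for k, v in product(ctk_values, ctv_values)
        if RANK[k] <= RANK[v]  # lower rank index means higher quality
    ]


print(kv_pairs(["f16", "turbo3"], ["f16", "turbo3"]))
# [('f16', 'f16'), ('f16', 'turbo3'), ('turbo3', 'turbo3')]
```

The dropped pair is `(turbo3, f16)`, where V would be stored more precisely than K.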
When --min-* flags are not set, Phase 7 auto-applies minimum filters so the combination matrix stays focused on high-value configs:
| Axis | Auto default | Override to disable |
|---|---|---|
| NGL | MAX_NGL − 1 step (top 2 values) | `--min-ngl 0` |
| Threads | HW_CPU_PHYSICAL (physical core count) | `--min-threads 1` |
| Context | `--start-ctx` value, else 8192 | `--min-ctx 0` |
| KV type | q8_0 normally; auto-lowered to most-compressed Phase-2-validated type when Phase 6 hit OOM | `--min-ctk q4_0` |
| Batch | BEST_B / 2 | `--min-b 512` |
If --start-ctx is set and no context at or above that size succeeds in Phase 6, Phase 7 is skipped with a clear warning rather than silently running a useless matrix at a tiny fallback context.
KV quality order (low → high): turbo2 turbo3 planar3 iso3 turbo4 planar4 iso4 q4_0 q8_0 f16
- `--min-ctk turbo3` keeps turbo3 and everything higher quality (excludes only turbo2)
- `--min-ctk q8_0` keeps q8_0 and f16 only (the default)
- `--min-ctk q4_0` includes all types at or above q4_0 quality (effectively disables the ctk filter)
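A quality floor like this reduces to an index comparison against the ordered list. The Python sketch below uses the low-to-high order quoted above; `apply_min_ctk` is a hypothetical name, not a function from the codebase.

```python
KV_QUALITY_LOW_TO_HIGH = ["turbo2", "turbo3", "planar3", "iso3", "turbo4",
                          "planar4", "iso4", "q4_0", "q8_0", "f16"]


def apply_min_ctk(available, min_ctk):
    """Keep only KV types whose quality is at or above `min_ctk`."""
    floor = KV_QUALITY_LOW_TO_HIGH.index(min_ctk)
    return [t for t in available if KV_QUALITY_LOW_TO_HIGH.index(t) >= floor]


print(apply_min_ctk(KV_QUALITY_LOW_TO_HIGH, "q8_0"))          # ['q8_0', 'f16']
print(apply_min_ctk(["q4_0", "q8_0", "f16"], "q4_0"))         # ['q4_0', 'q8_0', 'f16']
print("turbo2" in apply_min_ctk(KV_QUALITY_LOW_TO_HIGH, "turbo3"))  # False
```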
Passing --turbo-bench <path> enables three additional KV cache quantisation types: turbo2, turbo3, and turbo4, sourced from the llama-cpp-turboquant fork. These compress the KV cache 3–6× compared to f16, freeing VRAM for more layers or larger contexts without a significant quality penalty.
| Type | Compression vs f16 | Flash attn required |
|---|---|---|
| `turbo2` | ~6.4× | No |
| `turbo3` | ~4.3× | Yes (auto-enabled) |
| `turbo4` | ~3.2× | Yes (auto-enabled) |
The TurboQuant binary is verified at startup by running <binary> --help and checking for the turbo3 marker in the output. If the marker is present, turbo types are enabled. If the path is missing, not executable, or the marker is absent, turbo types are silently omitted and the sweep continues with the standard KV type set. It is safe to always pass --turbo-bench — llamaseye handles an invalid path gracefully.
When --turbo-bench is available, Phase 2 also tests asymmetric K/V combinations (e.g. ctk=q8_0, ctv=turbo3) by default. TurboQuant research shows that V cache compression is effectively free — compressing V has near-zero effect on attention quality — while all quality degradation comes from K compression. Asymmetric combos capture the best of both: high K precision with aggressive V compression. Use --no-asymmetric-kv to restrict Phase 2 to symmetric pairs only.
Passing --rotor-bench <path> enables four additional KV cache types: planar3, planar4, iso3, iso4, sourced from the johndpope/llama-cpp-turboquant fork (branch feature/planarquant-kv-cache).
RotorQuant uses block-diagonal rotations (O(d), 128 params) rather than full-rank WHT (O(d log d), 16,384 params), which preserves directional structure better at equivalent compression. Benchmarks on Llama 3.1 8B (RTX 5090):
| Type | Compression | PPL |
|---|---|---|
| `iso3` | ~10.3× | 6.91 |
| `planar3` | ~10.3× | 7.05 |
| `turbo3` | ~10.3× | 7.07 |
RotorQuant types slot between q4_0 and TurboQuant in the quality ordering: f16 > q8_0 > q4_0 > iso4 > planar4 > turbo4 > iso3 > planar3 > turbo3 > turbo2.
Results are written to <output-dir>/<model-stem>/:
```
results/
├── summary.md                  # Cross-model winner table (multi-model runs only)
└── Qwen3-14B-Q4_K_M/
    ├── sweep.jsonl             # One JSON object per completed run (source of truth)
    ├── sweep.md                # Human-readable Markdown summary (regenerable with --report)
    ├── sweep.log               # Full execution log
    ├── hardware.json           # Hardware snapshot captured at start
    ├── state.json              # Resume state (completed phases + best values + working sets)
    └── raw/
        └── <run-id>.txt        # Raw llama-bench stdout for each run
```
sweep.jsonl is append-only and is the source of truth. state.json tracks which phases are complete and the best parameter values discovered so far, enabling --resume to pick up exactly where it left off.
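Because sweep.jsonl holds one JSON object per line, it can be post-processed with a few lines of code. The Python sketch below separates timeout records from everything else. Only the `"status": "timeout"` value and the `wall_time_sec` field come from this document; `split_by_status`, the `"ok"` status value, and the `n_prompt` field in the sample data are assumptions for illustration.

```python
import json


def split_by_status(lines):
    """Split JSONL lines into (timeouts, others) using the "status" field."""
    timeouts, others = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("status") == "timeout":
            timeouts.append(record)
        else:
            others.append(record)
    return timeouts, others


# Sample data; in practice, pass an open file handle for sweep.jsonl instead.
sample = [
    '{"status": "ok", "n_prompt": 8192}',
    '{"status": "timeout", "n_prompt": 131072, "wall_time_sec": 600.0}',
]
timeouts, others = split_by_status(sample)
print(len(timeouts), len(others))    # 1 1
print(timeouts[0]["wall_time_sec"])  # 600.0
```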
Graceful shutdown: Pressing Ctrl-C (or sending SIGTERM) cancels the current run, saves state.json, and exits cleanly. Use --resume to continue from where it stopped.
sweep.md sections:
- Best Configurations: top 10 results across all phases ranked by TG t/s
- Per-phase tables: all runs for each phase, sorted by TG t/s, with a `> **Winner:**` callout line showing the best config for that axis
- Goal Results: when `--goal` was used, Phase 7 rows that met the target
- Context Frontier: max stable context per (ngl, ctk, nkvo) combo from Phase 7
- Slow context: Phase 6 sizes that timed out (achievable but impractical for interactive use)
sweep.md can be regenerated at any time from sweep.jsonl without re-running benchmarks:
```bash
./llamaseye --report --output-dir ./results
```

--model-list accepts a plain text file with one model filename per line. Lines beginning with # and blank lines are ignored.

```text
# my_models.txt: models to sweep
Qwen3-14B-Q4_K_M.gguf
Qwen3-14B-Q6_K.gguf
Llama-3.1-8B-Instruct-Q8_0.gguf

# WIP: not ready yet
# Mixtral-8x7B-Q4_K_M.gguf
```
Paths are resolved relative to --models-dir. If --models-dir is not set, filenames are treated as absolute or relative to the working directory.
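The model-list rules can be summarised in a short Python sketch. `read_model_list` is a hypothetical helper, and stripping surrounding whitespace before checking for `#` is an assumption of this example.

```python
import os


def read_model_list(lines, models_dir=None):
    """Return model paths from list-file lines, skipping comments and blanks."""
    paths = []
    for line in lines:
        name = line.strip()
        if not name or name.startswith("#"):
            continue
        # With a models dir, resolve relative to it; otherwise leave as given.
        paths.append(os.path.join(models_dir, name) if models_dir else name)
    return paths


lines = [
    "# my_models.txt",
    "Qwen3-14B-Q4_K_M.gguf",
    "",
    "# Mixtral-8x7B-Q4_K_M.gguf",
]
print(read_model_list(lines))  # ['Qwen3-14B-Q4_K_M.gguf']
```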
At startup the script probes:
- CPU cores: via `nproc` (Linux) or `sysctl -n hw.logicalcpu` (macOS)
- System RAM: to compute safe context-size upper bounds
- GPU VRAM: via `nvidia-smi` or `system_profiler` (Apple Silicon)
- Compute backend: cuda / metal / cpu, inferred from the llama-bench binary
- Thermal sensors: `nvidia-smi` for GPU, `sensors` or `/sys/class/thermal` for CPU
All sweep parameters are derived from these detected values. The script contains no hardcoded machine-specific constants.
Each phase sweeps exactly one axis while holding everything else at sane defaults. The only cross-phase dependency is MAX_NGL, which is established by Phase 0 and used as the ceiling for all subsequent phases. Phases 1–6 each produce a working set of values for their axis. Phase 7 takes the Cartesian product of all working sets and runs the full combination matrix, confirming which configs compose well and revealing the true peak configuration. This one-variable-at-a-time discipline keeps results interpretable and makes it straightforward to re-run individual phases in isolation with --only-phases.
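To show how per-axis working sets turn into a combination matrix, here is a minimal Python sketch. `combination_matrix` and the axis names in the sample are illustrative assumptions; the real Phase 7 also applies the filters described earlier.

```python
from itertools import product


def combination_matrix(working_sets):
    """Expand {axis: [values]} into a list of one-dict-per-config combos."""
    axes = list(working_sets)
    return [
        dict(zip(axes, combo))
        for combo in product(*(working_sets[axis] for axis in axes))
    ]


matrix = combination_matrix({
    "ngl": [36, 40],
    "ctk": ["q8_0", "f16"],
    "nkvo": [0, 1],
})
print(len(matrix))  # 8
print(matrix[0])    # {'ngl': 36, 'ctk': 'q8_0', 'nkvo': 0}
```

The matrix size is the product of the working-set sizes, which is why trimming each axis before Phase 7 keeps the run time manageable.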
See docs/spec.md for the full engineering specification.