llamaseye — Exhaustive llama-bench parameter sweep harness

Systematically sweep every meaningful llama-bench parameter combination for any GGUF model, record every result as JSONL, and surface the optimal configuration for your hardware.

What it does

llamaseye runs llama-bench across every meaningful parameter combination for any GGUF model. It sweeps each axis independently — GPU layer offload (ngl), flash attention, KV cache quantisation type, thread count, KV offload ratio, batch size, and context size — then runs a full combination matrix (Phase 7) to confirm which configs work together and find the true performance ceiling.

Every result is recorded as JSONL in a per-model output directory, alongside a human-readable Markdown summary, a raw log, a hardware snapshot, and a resume-state file. Runs that trigger an OOM or timeout are caught and logged — the sweep never hangs. OOM and timeout are distinguished: OOM means a context size is impossible at that memory budget; timeout means it is achievable but slow. Timeout runs write a "status": "timeout" record with wall_time_sec to sweep.jsonl and appear in a dedicated section of sweep.md.

The script is fully portable: it detects CPU core count, available RAM, GPU VRAM, the active compute backend (cuda / metal / cpu), and the correct thermal-sensor commands at runtime. There are no hardcoded machine values. Optionally, pass a TurboQuant build of llama-bench via --turbo-bench to unlock turbo2/turbo3/turbo4 KV cache types from the llama-cpp-turboquant fork, which compress the KV cache 3–6× and enable much longer contexts on the same hardware.

Quick start

Build the binary (Go 1.22+):

go build -o llamaseye .

Single model:

./llamaseye --model ~/Models/Qwen3-14B-Q4_K_M.gguf --output-dir ./results

All models in a directory:

./llamaseye --models-dir ~/Models --output-dir ./results

From a model list file:

./llamaseye --models-dir ~/Models --model-list my_models.txt --output-dir ./results

With TurboQuant KV types:

./llamaseye --model ~/Models/model.gguf --turbo-bench ~/llama-cpp-turboquant/build/bin/llama-bench

Unattended overnight run:

nohup ./llamaseye --models-dir ~/Models --output-dir ./results > /dev/null 2>&1 &

Using a .env file

All environment variables can be set in a .env file instead of passing flags every time. The binary auto-loads .env from the working directory if it exists — no source step needed:

cp example.env .env
# Edit .env to match your paths and preferences
./llamaseye --models-dir ~/Models   # .env is loaded automatically

To load a file at a different path use --env-file:

./llamaseye --env-file ~/my-config.env --model ~/Models/model.gguf

Load order: .env file → process environment → CLI flags. Process env vars always override file values; CLI flags override everything.

Every CLI flag has a corresponding environment variable — env vars set the default value, and CLI flags override them when both are provided. example.env in the repo root documents every available variable with its default value and a description. The most important ones to set are LLAMA_BENCH_BIN (path to your llama-bench binary) and SWEEP_OUTPUT_DIR (where results are written).

$VAR and ${VAR} references in unquoted and double-quoted .env values are expanded against the process environment, so paths like SWEEP_OUTPUT_DIR=${HOME}/Models/bench/sweep work as expected. Single-quoted values are kept literal.

.env is gitignored — your local paths and configuration will not be committed.

Building

llamaseye is a single Go binary with no runtime dependencies beyond the OS. Requires Go 1.22 or later.

# Build
go build -o llamaseye .

# Optionally install into your PATH
go install github.com/WagnerJust/llamaseye@latest

The binary statically links all dependencies. No external tools are required at runtime except llama-bench itself.

Dependencies

llama-bench (required)

llamaseye does not install or build llama-bench for you. You must build it yourself from llama.cpp with whatever backend flags suit your hardware before running llamaseye.

# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build for CUDA (NVIDIA GPU)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)

# Build for Metal (macOS — Apple Silicon or Intel Mac)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(sysctl -n hw.logicalcpu)

# Build CPU-only (any platform, no GPU)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)

The binary will be at build/bin/llama-bench. Pass its path to llamaseye via --llama-bench <path> or set the LLAMA_BENCH_BIN environment variable. There is no default — llamaseye will exit with an error if the binary is not specified.

The build flags you choose determine which backends and features are available during the sweep. llamaseye works with any valid llama-bench binary — it does not require any specific build flags itself.

Optional: TurboQuant llama-bench

To enable turbo2/turbo3/turbo4 KV cache types, build a second binary from the llama-cpp-turboquant fork (branch feature/turboquant-kv-cache) and pass it via --turbo-bench <path>. The fork is otherwise identical to llama.cpp — same build flags apply.

git clone https://github.com/TheTom/llama-cpp-turboquant \
  --branch feature/turboquant-kv-cache --depth=1
cd llama-cpp-turboquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)

llamaseye verifies the binary at startup by running <binary> --help and checking for the turbo3 marker in the output. If the marker is present, turbo types are enabled. If the path is missing, not executable, or the marker is absent, turbo types are silently omitted and the sweep continues with the standard KV type set. It is safe to always pass --turbo-bench — llamaseye handles an invalid path gracefully.

Optional: RotorQuant llama-bench (experimental — currently broken, do not use)

Warning: RotorQuant support is currently non-functional. Do not use --rotor-bench until this is resolved.

To enable planar3/planar4/iso3/iso4 KV cache types (from the RotorQuant project), build from the johndpope/llama-cpp-turboquant fork (branch feature/planarquant-kv-cache) and pass it via --rotor-bench <path>.

git clone https://github.com/johndpope/llama-cpp-turboquant \
  --branch feature/planarquant-kv-cache --depth=1 llama-cpp-rotorquant
cd llama-cpp-rotorquant
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --target llama-bench -j$(nproc)

RotorQuant types slot between q4_0 and TurboQuant types in the quality ordering — they offer better PPL at equivalent compression (e.g. iso3 PPL 6.91 vs turbo3 PPL 7.07 on Llama 3.1 8B). They are included in the Phase 6 OOM fallback chain when --rotor-bench is provided.

Optional system tools

The Go binary uses native OS APIs for hardware detection and thermal monitoring. A few optional tools extend these capabilities:

Tool	Purpose	Install
`nvidia-smi`	NVIDIA GPU VRAM/temp detection	Included with NVIDIA drivers
`sensors`	Linux CPU temperature reading	`apt install lm-sensors`
`osx-cpu-temp`	macOS CPU temperature reading (optional)	`brew install osx-cpu-temp`

If these are absent, llamaseye disables the corresponding thermal guard and logs a warning — the sweep still runs.

Key flags

Core

Flag	Description
`--model <path>`	Single GGUF model to benchmark
`--models-dir <dir>`	Directory to scan for GGUF models
`--model-list <file>`	Text file listing model filenames (one per line)
`--output-dir <dir>`	Root directory for all results (default: `./results`)
`--llama-bench <path>`	Path to standard llama-bench binary
`--turbo-bench <path>`	Path to TurboQuant llama-bench binary (enables turbo2/3/4 KV types)
`--rotor-bench <path>`	[EXPERIMENTAL - broken] Path to RotorQuant llama-bench binary (enables planar3/4, iso3/4 KV types)
`--asymmetric-kv` / `--no-asymmetric-kv`	Include asymmetric K/V quant combos in Phase 2 when `--turbo-bench` is set (default: enabled)
`--ngl-step <n>`	Step size for NGL axis sweep (default: 4)
`--repetitions <n>`	Repetitions per benchmark run (default: 3)
`--timeout <s>`	Per-run timeout in seconds (default: 600)
`--goal <spec>`	Goal-directed Phase 7: stop after 3 validated configs meeting the spec. Format: `ctx=N,tg=N,pp=N` (all optional). Example: `--goal "ctx=32768,tg=5"`
`--resume`	Resume a previous sweep, skipping completed phases
`--overwrite`	Delete existing output dir and re-run everything
`--only-phases <list>`	Comma-separated list of phase numbers to run (e.g. `0,1,7`)
`--focused`	Only run combos not already in `sweep.jsonl` (requires `--only-phases`). Skipped combos still populate working sets for downstream phases.
`--skip-phases <list>`	Comma-separated list of phase numbers to skip
`--report`	Read-only: regenerate `sweep.md` from existing `sweep.jsonl` files without running any benchmarks. Also generates `summary.md` when multiple models are found. Combine with `--model`/`--models-dir` to target a subset; omit both to scan all subdirs of `--output-dir`.
`--dry-run`	Print what would run without executing
`--no-confirm`	Skip the pre-run confirmation prompt
`--debug`	Enable verbose `[DEBUG]` lines in the log: full command lines, raw stdout/stderr, OOM matches, thermal polls, GGUF metadata
`--goal-hits N`	Stop goal mode after N distinct (ngl, ctk, nkvo, ctx) configs are found (default: 3)
`--goal-sort tg\|ctx\|ngl\|pp`	Sort Goal Results table by this axis, descending (default: `tg`)
`--cpu-temp-limit <°C>`	Pause if CPU exceeds this temperature (default: 88)
`--gpu-temp-limit <°C>`	Pause if GPU exceeds this temperature (default: 81)
`--no-thermal-guard`	Disable thermal polling entirely

Axis start and direction

These control where each phase begins its sweep and which direction it moves. Direction flags accept up or down.

Flag	Description
`--start-ngl <n>`	Begin NGL sweep at this value (default: `MAX_NGL − 2×step`)
`--ngl-dir up\|down`	NGL sweep direction (default: `up` = toward MAX_NGL)
`--start-threads <n>`	Begin thread count sweep at this value
`--threads-dir up\|down`	Thread sweep direction (default: `up`)
`--start-ctx <n>`	Begin context sweep at this prompt size; also sets Phase 7 min-ctx
`--ctx-dir up\|down`	Context sweep direction (default: `up` = toward 131072)
`--fine-ctx`	Enable midpoint bisection in Phase 6 (see Fine-grained context sweep)
`--ctx-step-min <n>`	Minimum bisection step for `--fine-ctx` (default: `8192`)
`--start-ctk <type>`	Begin KV quant sweep at this type
`--ctk-dir up\|down`	KV type sweep direction (default: `up` = toward more compression)
`--start-ctv <type>`	Begin V-cache quant sweep at this type (ignored when `--ctv` is set)
`--ctv-dir up\|down`	V-cache type sweep direction (default: `up` = toward more compression)
`--ctv <list>`	Restrict Phase 2 to these CTV values (comma-separated, e.g. `turbo3,turbo2`); takes precedence over `--start-ctv` / `--ctv-dir`
`--start-b <n>`	Begin batch size sweep at this value
`--b-dir up\|down`	Batch sweep direction (default: `up`)
`--start-ub <n>`	Begin ubatch size sweep at this value
`--ub-dir up\|down`	Ubatch sweep direction (default: `up`)
`--start-fa 0\|1`	Begin FA sweep at this value (default: `0`)
`--fa-dir up\|down`	FA sweep direction (default: `up` = 0→1)

Phase 7 minimum filters

These filter the Phase 7 combination matrix without affecting phases 1–6. When not set, smart defaults are derived automatically (see Smart defaults).

Flag	Description
`--min-ngl <n>`	Exclude NGL values below N from Phase 7
`--min-threads <n>`	Exclude thread counts below N from Phase 7
`--min-ctx <n>`	Exclude context sizes below N from Phase 7
`--min-ctk <type>`	Exclude KV types below TYPE (by quality) from Phase 7
`--min-b <n>`	Exclude batch sizes below N from Phase 7
`--min-ub <n>`	Exclude ubatch sizes below N from Phase 7

Environment variables

Every CLI flag can also be set via environment variable — useful for .env files so you don't repeat flags on every invocation. CLI flags always override env vars when both are set. See example.env for the full list with defaults and descriptions.

Variable	Equivalent flag	Example
`SWEEP_RESUME`	`--resume`	`SWEEP_RESUME=true`
`SWEEP_OVERWRITE`	`--overwrite`	`SWEEP_OVERWRITE=true`
`SWEEP_SKIP_PHASES`	`--skip-phases`	`SWEEP_SKIP_PHASES=7`
`SWEEP_ONLY_PHASES`	`--only-phases`	`SWEEP_ONLY_PHASES=0,1,6`
`SWEEP_FOCUSED`	`--focused`	`SWEEP_FOCUSED=true`
`SWEEP_NGL_STEP`	`--ngl-step`	`SWEEP_NGL_STEP=2`
`SWEEP_START_NGL`	`--start-ngl`	`SWEEP_START_NGL=40`
`SWEEP_NGL_DIR`	`--ngl-dir`	`SWEEP_NGL_DIR=down`
`SWEEP_START_CTX`	`--start-ctx`	`SWEEP_START_CTX=32768`
`SWEEP_MIN_CTX`	`--min-ctx`	`SWEEP_MIN_CTX=32768`
`SWEEP_MIN_NGL`	`--min-ngl`	`SWEEP_MIN_NGL=16`
`SWEEP_MIN_CTK`	`--min-ctk`	`SWEEP_MIN_CTK=q8_0`
`SWEEP_MIN_THREADS`	`--min-threads`	`SWEEP_MIN_THREADS=8`
`SWEEP_MIN_B`	`--min-b`	`SWEEP_MIN_B=1024`
`SWEEP_MODEL_LIST`	`--model-list`	`SWEEP_MODEL_LIST=~/list.txt`
`SWEEP_NO_CONFIRM`	`--no-confirm`	`SWEEP_NO_CONFIRM=true`
`SWEEP_DRY_RUN`	`--dry-run`	`SWEEP_DRY_RUN=true`
`SWEEP_DEBUG`	`--debug`	`SWEEP_DEBUG=true`
`SWEEP_GOAL_HITS`	`--goal-hits`	`SWEEP_GOAL_HITS=5`
`SWEEP_GOAL_SORT`	`--goal-sort`	`SWEEP_GOAL_SORT=ctx`

Sweep phases

Phase	Name	What varies	Everything else
0	NGL Probe	Binary search for max stable GPU layers — starts at model's layer count (from GGUF metadata), falls back to 99	Defaults — establishes MAX_NGL
1	NGL Axis	NGL values up to model's layer count (capped there since higher values are identical); near MAX_NGL by default (use `--start-ngl 0` for full sweep)	Defaults
2	FA + KV Quant Axis	Flash attention on/off × KV cache type	Best NGL from Phase 1
3	Thread Count	CPU thread count variants	Best NGL, best FA/KV
4	KV Offload	KV cache in VRAM (nkvo=0) vs RAM (nkvo=1)	Best NGL, best FA/KV, best threads
5	Batch Size	ubatch and batch size variants	Best values so far
6	Context Ceiling	Prompt size scaled up to OOM/timeout, with fallback configs on OOM; timeout runs are recorded with wall time	Best values so far
7	Full Combination Matrix	Cartesian product of all best-per-axis working sets; with `--goal`, runs ranked and exits early once the goal is satisfied	—

Smart defaults

Common use cases work without any extra flags. The key smart behaviors:

Phase 0/1 — NGL ceiling capped at model layer count

At sweep start, llamaseye reads the model's layer count from its GGUF metadata. NGL values above that count are functionally identical (llama.cpp silently clamps NGL to the layer count), so:

Phase 0 starts its probe at NumLayers instead of 99 — eliminating up to 15 wasted probe runs for small models
Phase 1 caps its sweep list at NumLayers — shrinking the NGL working set and keeping Phase 7's cartesian product manageable

If GGUF parsing fails (non-standard file), both phases fall back to the 99 ceiling.

Phase 1 — NGL start

Phase 1 starts at MAX_NGL − 2×step by default, testing only the top ~3 NGL values near the VRAM ceiling. Low-NGL configs rarely matter for performance. Use --start-ngl 0 for a full 0→MAX_NGL sweep, or --start-ngl 40 --ngl-dir down to sweep downward from a specific cap.

Phase 6 — Context fallbacks

When the primary config OOMs at a given context size, Phase 6 automatically tries progressively more memory-friendly alternatives before giving up:

Flip nkvo (move KV cache from VRAM → RAM)
V-first: keep CTK fixed, try more-compressed CTV types — V compression is effectively free quality-wise, so this is exhausted before touching K
K+V: try more-compressed CTK types (with their paired CTV from Phase 2) × both nkvo values

Only ctk/ctv/nkvo values already validated by Phases 2 and 4 are tried. The KV quality order for fallback selection (most to least quality): f16 > q8_0 > q4_0 > iso4 > planar4 > turbo4 > iso3 > planar3 > turbo3 > turbo2.

Fine-grained context sweep

By default Phase 6 sweeps context as powers of two (512 → 1024 → … → 65536 → 131072). This means a more-compressed KV type might unlock an intermediate context size that the sweep never discovers.

Use --fine-ctx to enable midpoint bisection: when all fallbacks fail at a ctx size, the sweep bisects between the last successful ctx and the failed ctx, probing the midpoint and narrowing until the gap is ≤ --ctx-step-min (default 8192).

# example: turbo3 unlocks 98304 on a card where q4_0 maxes at 65536
bash llamaseye.sh --model my.gguf --fine-ctx

Each bisection probe runs the full primary + fallback sequence, so --fine-ctx is off by default — it adds real runtime cost at large context sizes.

Phase 7 — Independent K/V cartesian product

Phase 7 now generates the cartesian product of independent CTK × CTV axes rather than replaying only the exact pairs tested in Phase 2. This expands the search space to include combos like (ctk=f16, ctv=turbo3) even if that exact pair wasn't in Phase 2.

Precision filter: combos where V is more precise than K are automatically skipped as wasteful (e.g. ctk=turbo3, ctv=f16 is dropped). The filter rule: CTK quality ≥ CTV quality.

Phase 7 — Auto-derived minimum filters

When --min-* flags are not set, Phase 7 auto-applies minimum filters so the combination matrix stays focused on high-value configs:

Axis	Auto default	Override to disable
NGL	`MAX_NGL − 1 step` (top 2 values)	`--min-ngl 0`
Threads	`HW_CPU_PHYSICAL` (physical core count)	`--min-threads 1`
Context	`--start-ctx` value, else `8192`	`--min-ctx 0`
KV type	`q8_0` normally; auto-lowered to most-compressed Phase-2-validated type when Phase 6 hit OOM	`--min-ctk q4_0`
Batch	`BEST_B / 2`	`--min-b 512`

If --start-ctx is set and no context at or above that size succeeds in Phase 6, Phase 7 is skipped with a clear warning rather than silently running a useless matrix at a tiny fallback context.

KV quality order (low → high): turbo2 turbo3 planar3 iso3 turbo4 planar4 iso4 q4_0 q8_0 f16

--min-ctk turbo3 keeps turbo3 and everything higher quality (excludes only turbo2)
--min-ctk q8_0 keeps q8_0 and f16 only (the default)
--min-ctk q4_0 includes all types at or above q4_0 quality (effectively disables the ctk filter)

Specialised KV cache types

TurboQuant

Passing --turbo-bench <path> enables three additional KV cache quantisation types: turbo2, turbo3, and turbo4, sourced from the llama-cpp-turboquant fork. These compress the KV cache 3–6× compared to f16, freeing VRAM for more layers or larger contexts without a significant quality penalty.

Type	Compression vs f16	Flash attn required
`turbo2`	~6.4×	No
`turbo3`	~4.3×	Yes (auto-enabled)
`turbo4`	~3.2×	Yes (auto-enabled)

The TurboQuant binary is verified at startup by running <binary> --help and checking for the turbo3 marker in the output. If the marker is present, turbo types are enabled. If the path is missing, not executable, or the marker is absent, turbo types are silently omitted and the sweep continues with the standard KV type set. It is safe to always pass --turbo-bench — llamaseye handles an invalid path gracefully.

When --turbo-bench is available, Phase 2 also tests asymmetric K/V combinations (e.g. ctk=q8_0, ctv=turbo3) by default. TurboQuant research shows that V cache compression is effectively free — compressing V has near-zero effect on attention quality — while all quality degradation comes from K compression. Asymmetric combos capture the best of both: high K precision with aggressive V compression. Use --no-asymmetric-kv to restrict Phase 2 to symmetric pairs only.

RotorQuant

Passing --rotor-bench <path> enables four additional KV cache types: planar3, planar4, iso3, iso4, sourced from the johndpope/llama-cpp-turboquant fork (branch feature/planarquant-kv-cache).

RotorQuant uses block-diagonal rotations (O(d), 128 params) rather than full-rank WHT (O(d log d), 16,384 params), which preserves directional structure better at equivalent compression. Benchmarks on Llama 3.1 8B (RTX 5090):

Type	Compression	PPL
`iso3`	~10.3×	6.91
`planar3`	~10.3×	7.05
`turbo3`	~10.3×	7.07

RotorQuant types slot between q4_0 and TurboQuant in the quality ordering: f16 > q8_0 > q4_0 > iso4 > planar4 > turbo4 > iso3 > planar3 > turbo3 > turbo2.

Output structure

Results are written to <output-dir>/<model-stem>/:

results/
├── summary.md                        # Cross-model winner table (multi-model runs only)
└── Qwen3-14B-Q4_K_M/
    ├── sweep.jsonl       # One JSON object per completed run (source of truth)
    ├── sweep.md          # Human-readable Markdown summary (regenerable with --report)
    ├── sweep.log         # Full execution log
    ├── hardware.json     # Hardware snapshot captured at start
    ├── state.json        # Resume state (completed phases + best values + working sets)
    └── raw/
        └── <run-id>.txt  # Raw llama-bench stdout for each run

sweep.jsonl is append-only and is the source of truth. state.json tracks which phases are complete and the best parameter values discovered so far, enabling --resume to pick up exactly where it left off.

Graceful shutdown: Pressing Ctrl-C (or sending SIGTERM) cancels the current run, saves state.json, and exits cleanly. Use --resume to continue from where it stopped.

sweep.md sections:

Best Configurations — top 10 results across all phases ranked by TG t/s
Per-phase tables — all runs for each phase, sorted by TG t/s, with a > **Winner:** callout line showing the best config for that axis
Goal Results — when --goal was used, Phase 7 rows that met the target
Context Frontier — max stable context per (ngl, ctk, nkvo) combo from Phase 7
Slow context — Phase 6 sizes that timed out (achievable but impractical for interactive use)

sweep.md can be regenerated at any time from sweep.jsonl without re-running benchmarks:

./llamaseye --report --output-dir ./results

Model list file format

--model-list accepts a plain text file with one model filename per line. Lines beginning with # and blank lines are ignored.

# my_models.txt — models to sweep

Qwen3-14B-Q4_K_M.gguf
Qwen3-14B-Q6_K.gguf
Llama-3.1-8B-Instruct-Q8_0.gguf

# WIP — not ready yet
# Mixtral-8x7B-Q4_K_M.gguf

Paths are resolved relative to --models-dir. If --models-dir is not set, filenames are treated as absolute or relative to the working directory.

Hardware portability

At startup the script probes:

CPU cores — via nproc (Linux) or sysctl -n hw.logicalcpu (macOS)
System RAM — to compute safe context-size upper bounds
GPU VRAM — via nvidia-smi or system_profiler (Apple Silicon)
Compute backend — cuda / metal / cpu, inferred from the llama-bench binary
Thermal sensors — nvidia-smi for GPU, sensors or /sys/class/thermal for CPU

All sweep parameters are derived from these detected values. The script contains no hardcoded machine-specific constants.

Design philosophy

Each phase sweeps exactly one axis while holding everything else at sane defaults. The only cross-phase dependency is MAX_NGL, which is established by Phase 0 and used as the ceiling for all subsequent phases. Phases 1–6 each produce a working set of values for their axis. Phase 7 takes the Cartesian product of all working sets and runs the full combination matrix, confirming which configs compose well and revealing the true peak configuration. This one-variable-at-a-time discipline keeps results interpretable and makes it straightforward to re-run individual phases in isolation with --only-phases.

Docs

See docs/spec.md for the full engineering specification.

Name		Name	Last commit message	Last commit date
Latest commit History 85 Commits
.claude/skills/llamaseye		.claude/skills/llamaseye
.github		.github
bench		bench
cmd		cmd
config		config
docs		docs
envfile		envfile
gguf		gguf
hardware		hardware
output		output
phase		phase
state		state
sweep		sweep
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
example.env		example.env
example_models.txt		example_models.txt
go.mod		go.mod
go.sum		go.sum
main.go		main.go

Folders and files

Latest commit

History

Repository files navigation

llamaseye — Exhaustive llama-bench parameter sweep harness

What it does

Quick start

Using a .env file

Building

Dependencies

llama-bench (required)

Optional: TurboQuant llama-bench

Optional: RotorQuant llama-bench (experimental — currently broken, do not use)

Optional system tools

Key flags

Core

Axis start and direction

Phase 7 minimum filters

Environment variables

Sweep phases

Smart defaults

Phase 0/1 — NGL ceiling capped at model layer count

Phase 1 — NGL start

Phase 6 — Context fallbacks

Fine-grained context sweep

Phase 7 — Independent K/V cartesian product

Phase 7 — Auto-derived minimum filters

Specialised KV cache types

TurboQuant

RotorQuant

Output structure

Model list file format

Hardware portability

Design philosophy

Docs

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages