LLM Testbench

Local multi-benchmark workbench for LLM evaluation

Endpoint configuration · Model discovery · Benchmark selection

Compare local LLMs on speed, SQL accuracy, and tool calling — no cloud, no containers, no dataset downloads. Point it at any OpenAI-compatible server and get results in seconds.

Features


Feature	What it covers
⚡ Speed	TTFT, total time, prompt tokens, completion tokens, prefill TPS, decode TPS
🗄️ SQL Accuracy	DuckDB execution against local AdventureWorks fixtures; tool-call and grammar modes
🔧 Tool Calling	Local BFCL v2 single-turn adapter: single, parallel, multiple, and no-call categories
📦 Coding & Schema	Tiny Python coding tasks and JSON schema instruction-following fixtures
📼 Prompt Replay	Fixed prompts for fast local regressions, with required / optional / forbidden checks
💾 Exports	JSONL, CSV, TSV, summary JSON, manifest JSON, dashboard aggregate
🔌 Endpoints	LM Studio, llama.cpp, Ollama, and any OpenAI-compatible server

Benchmarks

Module	Status	What it measures
Speed	✅ Startable	TTFT, total time, prompt/completion tokens, prefill TPS, decode TPS
SQL Accuracy	✅ Startable	SQL generation correctness against local DuckDB fixtures
BFCL v2	✅ Startable	Single-turn function/tool calling — argument AST comparison, per-category pass counts in the UI
Coding Micro	🔶 Fixture-ready	Python coding tasks with syntax and static checks
JSON Schema	🔶 Fixture-ready	Instruction following scored by JSON parsing and schema-lite checks
Prompt Replay	🔶 Fixture-ready	Fixed prompts for fast local regression comparisons

Fixture-ready modules are wired to the metadata and validation endpoints and can be connected to live generation without changing their fixture format.
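To make the fixture-ready idea concrete, here is a minimal sketch of how the required / optional / forbidden checks listed for Prompt Replay could be scored against a model reply. The `ReplayCheck` shape, the `score_reply` helper, and the case-insensitive substring matching are all assumptions made for illustration; the actual fixture format and scorer live in the repository and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayCheck:
    """Hypothetical fixture shape: substrings a reply must, may, or must not contain."""
    required: list[str] = field(default_factory=list)
    optional: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def score_reply(reply: str, check: ReplayCheck) -> dict:
    """Score one reply with case-insensitive substring checks (an assumed rule)."""
    text = reply.lower()
    missing = [s for s in check.required if s.lower() not in text]
    violations = [s for s in check.forbidden if s.lower() in text]
    bonus = [s for s in check.optional if s.lower() in text]
    return {
        # A reply passes only if nothing required is missing and nothing forbidden appears.
        "passed": not missing and not violations,
        "missing_required": missing,
        "forbidden_hits": violations,
        # Optional hits are reported but never affect pass/fail in this sketch.
        "optional_hits": bonus,
    }
```

Keeping optional hits out of the pass/fail decision means a regression run only flips when a hard requirement changes, which is one plausible reading of the three check types.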

Quick Start

Windows:

run.bat

Linux / macOS:

./run.sh

The launcher creates a virtual environment, installs the dependencies listed in python/requirements.txt, starts the backend, and opens:

http://127.0.0.1:8765/

Optional flags:

./run.sh --host 127.0.0.1 --port 8765 --log-level INFO

run.bat --host 127.0.0.1 --port 8765 --log-level INFO

Screenshots

SQL Accuracy — Live Results

Per-question pass/fail grid across TRIVIAL → EASY → MEDIUM → HARD.

History & Exports

Saved runs with one-click JSONL, CSV, TSV, Manifest, and Summary JSON exports.

Workflow

1. Start a local inference server: LM Studio, llama.cpp, Ollama, or any OpenAI-compatible endpoint.
2. Open LLM Testbench at http://127.0.0.1:8765/.
3. Click Scan Local to auto-discover endpoints, or + Manual to enter a base URL.
4. Click Discover Models and select the models to benchmark.
5. Choose the benchmark modules and the execution mode (sequential or parallel).
6. Click Start and watch live results stream in.
7. Export saved runs from the history panel as JSONL, CSV, TSV, or Summary.
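For readers who want to check an endpoint by hand before using the UI, the sketch below lists models from an OpenAI-compatible server through the standard `GET /v1/models` route. Whether the Discover Models button uses exactly this route is an assumption; the helper names (`models_url`, `parse_model_ids`, `discover_models`) are hypothetical, and the address in the usage note is only an example value.

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    """Build the model-listing URL for an OpenAI-compatible base URL."""
    base = base_url.rstrip("/")
    # Accept both "http://host:port" and "http://host:port/v1".
    if not base.endswith("/v1"):
        base += "/v1"
    return base + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def discover_models(base_url: str, timeout: float = 5.0) -> list[str]:
    """Fetch and parse the model list; raises URLError if the server is unreachable."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

With a server running locally, `discover_models("http://127.0.0.1:1234")` would return the ids that server advertises.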

Speed Metrics

The live speed path runs through BenchmarkServer._run_single_benchmark. SpeedAdapter is metadata-only and raises an error instead of returning a placeholder result, so it cannot silently report phantom passes.

OpenAI-compatible — decode TPS is calculated from streamed completion tokens over post-first-token stream time.

Ollama — decode TPS uses eval_count / eval_duration, excluding model load and prompt evaluation time.

Use warmup_runs > 0 to keep cold model-loading time out of measured runs.
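The two decode TPS formulas above can be sketched as follows. The function and parameter names are hypothetical and this is not the code in `server.py`. Ollama reports `eval_duration` in nanoseconds, so it is converted to seconds here. Whether the project subtracts the first token from the streamed count is not stated in this README, so the sketch divides the full completion token count by the post-first-token window.

```python
def decode_tps_streamed(completion_tokens: int,
                        first_token_at: float,
                        last_token_at: float) -> float:
    """Tokens per second over the post-first-token window (timestamps in seconds)."""
    window = last_token_at - first_token_at
    if completion_tokens <= 0 or window <= 0:
        return 0.0  # avoid division by zero on empty or single-chunk streams
    return completion_tokens / window


def decode_tps_ollama(eval_count: int, eval_duration_ns: int) -> float:
    """Tokens per second from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_count <= 0 or eval_duration_ns <= 0:
        return 0.0
    # eval_duration covers generation only, not model load or prompt evaluation.
    return eval_count / (eval_duration_ns / 1e9)
```

For example, 100 streamed tokens with the first token at 0.5 s and the last at 2.5 s gives 50 tokens per second, and an Ollama response with `eval_count` 120 and `eval_duration` 2,000,000,000 gives 60.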

SQL Accuracy

Supports tool-calling and grammar-style SQL generation modes. Additional controls:

Thinking mode — off, on, or both
Reasoning effort — provider default, none, minimal, low, medium, high, xhigh
Per-question timeout and stop/reload recovery
Mismatch details — row count, columns, first row, generated SQL

The "provider default" setting omits the reasoning field from the request entirely. "none" sends an explicit request to disable reasoning. If a server rejects the reasoning field as unknown, the request is retried automatically without it.
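The omit / explicit / retry behaviour described above could look roughly like this. It is a sketch under stated assumptions: the field name `reasoning_effort`, the `UnknownFieldError` exception, and the `send` callable are placeholders, since this README does not say which field the backend sends or how a rejection is detected.

```python
from typing import Callable, Optional

REASONING_FIELD = "reasoning_effort"  # assumed field name, not confirmed by the README


class UnknownFieldError(Exception):
    """Stand-in for a server response that rejects an unrecognized request field."""


def build_payload(base: dict, effort: Optional[str]) -> dict:
    """Return a copy of base; effort=None models 'provider default' (field omitted)."""
    payload = dict(base)
    if effort is not None:
        payload[REASONING_FIELD] = effort  # "none" is sent explicitly, not omitted
    return payload


def send_with_fallback(send: Callable[[dict], dict], payload: dict) -> dict:
    """Send once; if the reasoning field is rejected, retry once without it."""
    try:
        return send(payload)
    except UnknownFieldError:
        if REASONING_FIELD not in payload:
            raise  # nothing to strip, so the error is unrelated to reasoning
        stripped = {k: v for k, v in payload.items() if k != REASONING_FIELD}
        return send(stripped)
```

Passing the transport in as `send` keeps the fallback logic testable without a live server.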

API

GET /api/benchmark/contract
GET /api/benchmark/modules
GET /api/benchmark/modules/{module_id}
GET /api/benchmark/modules/{module_id}/adapter
GET /api/benchmark/presets
GET /api/benchmark/presets/{preset_id}
GET /api/benchmark/dashboard
GET /api/fixtures
GET /api/fixtures/validate

Saved run exports:

GET /api/benchmark/{job_id}/results.jsonl
GET /api/benchmark/{job_id}/results.csv
GET /api/benchmark/{job_id}/results.tsv
GET /api/benchmark/{job_id}/summary.json
GET /api/benchmark/{job_id}/manifest.json
GET /api/benchmark/summaries
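As a usage sketch, the export routes above can be called from a short script. The base URL is the launcher default from Quick Start; the helper names (`export_url`, `parse_jsonl`, `fetch_results`) are hypothetical, and nothing is assumed about the fields inside each JSONL record.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"  # launcher default from Quick Start

EXPORT_NAMES = {"results.jsonl", "results.csv", "results.tsv",
                "summary.json", "manifest.json"}


def export_url(job_id: str, name: str, base_url: str = BASE_URL) -> str:
    """Build the URL for one saved-run export listed in this section."""
    if name not in EXPORT_NAMES:
        raise ValueError(f"unknown export: {name}")
    return f"{base_url.rstrip('/')}/api/benchmark/{job_id}/{name}"


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def fetch_results(job_id: str, base_url: str = BASE_URL,
                  timeout: float = 10.0) -> list[dict]:
    """Download and parse results.jsonl for a saved run (needs a running backend)."""
    url = export_url(job_id, "results.jsonl", base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_jsonl(resp.read().decode("utf-8"))
```

The job id is whatever the history panel shows for a saved run; `fetch_results` raises a `URLError` if the backend is not running.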

Repository Layout

llm-testbench/
├── index.html                  # Single-page browser UI
├── run.bat / run.sh            # Cross-platform launchers
├── python/
│   ├── server.py               # Backend API and benchmark orchestration
│   ├── adapter.py              # Benchmark adapter ABC + Speed / SQL / BFCL adapters
│   ├── sql_benchmark.py        # SQL benchmark runner
│   ├── bfcl.py                 # BFCL loader, scorer, and argument comparator
│   └── local_benchmarks.py     # Local fixture loaders, validators, and scorers
├── sql_benchmark_data/         # SQL questions and AdventureWorks tables
├── bfcl_data/                  # Local BFCL-style questions and answers
├── coding_data/                # Tiny Python coding tasks
├── json_schema_data/           # JSON instruction-following tasks
├── prompt_replay_data/         # Fixed regression prompts
├── docs/screenshots/           # README screenshots
└── tests/                      # Backend, adapter, fixture, dashboard, and frontend tests

Development

Install dependencies:

python -m pip install -r python/requirements.txt pytest

Run the test suite (166 tests):

python -m pytest tests -q

Run the backend directly:

python -m python.server

Scope

In scope — local model endpoints, small repository-owned fixtures, deterministic local tests, simple adapter and API contracts, fast smoke and comparison runs.

Out of scope — Terminal-Bench; SWE-bench / SWE-rebench / Multi-SWE-bench / SWE-agent / OpenHands / SWE-ReX; WebArena / OSWorld / CodeClash / GAIA / tau-bench; LiveCodeBench and BigCodeBench as external integrations; Docker orchestration, browser farms, desktop VMs, remote services, large downloaded benchmark datasets.
