Local multi-benchmark workbench for LLM evaluation
Endpoint configuration · Model discovery · Benchmark selection
Compare local LLMs on speed, SQL accuracy, and tool calling — no cloud, no containers, no dataset downloads. Point it at any OpenAI-compatible server and get results in seconds.
- Features
- Benchmarks
- Quick Start
- Screenshots
- Workflow
- Speed Metrics
- SQL Accuracy
- API
- Repository Layout
- Development
| Feature | Details |
|---|---|
| ⚡ Speed | TTFT, total time, prompt tokens, completion tokens, decode TPS |
| 🗄️ SQL Accuracy | DuckDB execution against local AdventureWorks fixtures, tool-call and grammar modes |
| 🔧 Tool Calling | Local BFCL v2 single-turn adapter — single, parallel, multiple, and no-call categories |
| 📦 Coding & Schema | Tiny Python coding tasks and JSON schema instruction-following fixtures |
| 📼 Prompt Replay | Fixed prompts for fast local regressions — required / optional / forbidden checks |
| 💾 Exports | JSONL, CSV, TSV, summary JSON, manifest JSON, dashboard aggregate |
| 🔌 Endpoints | LM Studio, llama.cpp, Ollama, and any OpenAI-compatible server |
| Module | Status | What it measures |
|---|---|---|
| Speed | ✅ Startable | TTFT, total time, prompt/completion tokens, prefill TPS, decode TPS |
| SQL Accuracy | ✅ Startable | SQL generation correctness against local DuckDB fixtures |
| BFCL v2 | ✅ Startable | Single-turn function/tool calling — argument AST comparison, per-category pass counts in the UI |
| Coding Micro | 🔶 Fixture-ready | Python coding tasks with syntax and static checks |
| JSON Schema | 🔶 Fixture-ready | Instruction following scored by JSON parsing and schema-lite checks |
| Prompt Replay | 🔶 Fixture-ready | Fixed prompts for fast local regression comparisons |
Fixture-ready modules are wired to the metadata and validation endpoints and can be connected to live generation without changing their fixture format.
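As an illustration of the required / optional / forbidden checks that Prompt Replay is described as using, here is a minimal sketch. The `ReplayCheck` shape, the `score_replay` name, and the case-insensitive substring matching are all assumptions for illustration, not the repository's actual fixture format or scorer. The sketch also assumes that optional matches are reported but do not affect pass/fail.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayCheck:
    """Hypothetical fixture shape; the real fixture format may differ."""
    required: list[str] = field(default_factory=list)
    optional: list[str] = field(default_factory=list)
    forbidden: list[str] = field(default_factory=list)


def score_replay(response: str, check: ReplayCheck) -> dict:
    """Score one response with case-insensitive substring checks (assumed rule)."""
    text = response.lower()
    missing = [s for s in check.required if s.lower() not in text]
    violations = [s for s in check.forbidden if s.lower() in text]
    optional_hits = [s for s in check.optional if s.lower() in text]
    return {
        # A response passes only if nothing required is missing
        # and nothing forbidden appears.
        "passed": not missing and not violations,
        "missing_required": missing,
        "forbidden_found": violations,
        "optional_hits": optional_hits,
    }
```

Because the result keeps the missing and forbidden strings alongside the verdict, a regression between two runs can be traced to the specific check that changed.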
Windows:
run.bat
Linux / macOS:
./run.sh
The launcher creates a virtual environment, installs python/requirements.txt, starts the backend, and opens:
http://127.0.0.1:8765/
Optional flags:
./run.sh --host 127.0.0.1 --port 8765 --log-level INFO
run.bat --host 127.0.0.1 --port 8765 --log-level INFO
Per-question pass/fail grid across TRIVIAL → EASY → MEDIUM → HARD.
Saved runs with one-click JSONL, CSV, TSV, Manifest, and Summary JSON exports.
- Start a local inference server — LM Studio, llama.cpp, Ollama, or any OpenAI-compatible endpoint.
- Open LLM Testbench at http://127.0.0.1:8765/.
- Scan Local to auto-discover endpoints, or + Manual to enter a base URL.
- Click Discover Models and select the models to benchmark.
- Choose benchmark modules and execution mode (sequential / parallel).
- Hit Start and watch live results stream in.
- Export saved runs from the history panel — JSONL, CSV, TSV, or Summary.
The live speed path runs through BenchmarkServer._run_single_benchmark. SpeedAdapter is metadata-only and raises instead of returning a placeholder, so it cannot silently report phantom passes.
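The "raise instead of returning a placeholder" idea can be sketched as follows. The class names, method names, and metadata keys here are hypothetical and are not taken from python/adapter.py; the sketch only shows the pattern of a metadata-only adapter that refuses to produce results.

```python
from abc import ABC, abstractmethod


class BenchmarkAdapter(ABC):
    """Hypothetical adapter interface; the real ABC may expose other methods."""

    @abstractmethod
    def metadata(self) -> dict:
        """Describe the module for listing and validation purposes."""

    @abstractmethod
    def run(self, model: str) -> dict:
        """Run the benchmark and return a result record."""


class MetadataOnlySpeedAdapter(BenchmarkAdapter):
    """Describes the speed module but never produces results itself."""

    def metadata(self) -> dict:
        return {"id": "speed", "startable": True}

    def run(self, model: str) -> dict:
        # Raising here means a miswired caller fails loudly instead of
        # recording a fake "pass" from a placeholder result.
        raise NotImplementedError(
            "speed runs are executed by the server, not by this adapter"
        )
```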
OpenAI-compatible — decode TPS is calculated from streamed completion tokens over post-first-token stream time.
Ollama — decode TPS uses eval_count / eval_duration, excluding model load and prompt evaluation time.
Use warmup_runs > 0 to keep cold model-loading time out of measured runs.
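The two decode TPS rules above can be sketched as below. The function names and the `None` return for empty or zero-length windows are assumptions for illustration. Whether the first streamed token is itself counted in the numerator is also an assumption here (it is counted). Ollama reports `eval_duration` in nanoseconds, so the sketch converts it to seconds.

```python
from typing import Optional


def decode_tps_streamed(
    completion_tokens: int, first_token_at: float, stream_end_at: float
) -> Optional[float]:
    """Tokens per second over the window that starts at the first streamed token.

    Timestamps are in seconds, for example from time.perf_counter().
    """
    window = stream_end_at - first_token_at
    if completion_tokens <= 0 or window <= 0:
        return None  # not enough data to report a rate
    return completion_tokens / window


def decode_tps_ollama(eval_count: int, eval_duration_ns: int) -> Optional[float]:
    """Decode rate from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_count <= 0 or eval_duration_ns <= 0:
        return None
    return eval_count / (eval_duration_ns / 1e9)
```

For example, 100 tokens streamed over a 2-second post-first-token window gives 50 tokens/s, and an Ollama response with `eval_count=120` and `eval_duration=2_000_000_000` gives 60 tokens/s.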
Supports tool-calling and grammar-style SQL generation modes. Additional controls:
- Thinking mode — off, on, or both
- Reasoning effort — provider default, none, minimal, low, medium, high, xhigh
- Per-question timeout and stop/reload recovery
- Mismatch details — row count, columns, first row, generated SQL
Provider default (omit) does not send a reasoning field. none sends an explicit request to disable reasoning. Servers that reject unknown reasoning fields get an automatic retry without it.
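A rough sketch of that omit / send / retry behaviour follows. The field name `reasoning_effort`, the `UnknownFieldError` exception, and the `send` callable are assumptions made for this example; the backend may use a different field name and detect rejections differently.

```python
from typing import Any, Callable, Optional


class UnknownFieldError(Exception):
    """Hypothetical error raised when a server rejects an unknown field."""


def build_payload(base: dict[str, Any], reasoning_effort: Optional[str]) -> dict[str, Any]:
    """Copy the base request; add the reasoning field only when one is chosen.

    None stands for "provider default (omit)"; the string "none" is sent
    explicitly to ask the server to disable reasoning.
    """
    payload = dict(base)
    if reasoning_effort is not None:
        payload["reasoning_effort"] = reasoning_effort  # assumed field name
    return payload


def send_with_fallback(
    send: Callable[[dict[str, Any]], Any],
    base: dict[str, Any],
    reasoning_effort: Optional[str],
) -> Any:
    """Send once; if the reasoning field is rejected, retry once without it."""
    payload = build_payload(base, reasoning_effort)
    try:
        return send(payload)
    except UnknownFieldError:
        if "reasoning_effort" not in payload:
            raise  # nothing to strip, so the error is genuine
        return send(build_payload(base, None))
```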
GET /api/benchmark/contract
GET /api/benchmark/modules
GET /api/benchmark/modules/{module_id}
GET /api/benchmark/modules/{module_id}/adapter
GET /api/benchmark/presets
GET /api/benchmark/presets/{preset_id}
GET /api/benchmark/dashboard
GET /api/fixtures
GET /api/fixtures/validate
Saved run exports:
GET /api/benchmark/{job_id}/results.jsonl
GET /api/benchmark/{job_id}/results.csv
GET /api/benchmark/{job_id}/results.tsv
GET /api/benchmark/{job_id}/summary.json
GET /api/benchmark/{job_id}/manifest.json
GET /api/benchmark/summaries
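These endpoints can be called with the standard library alone. The sketch below assumes the default host and port from the launcher and assumes the metadata endpoints return JSON; the helper names `export_url` and `fetch_json` are made up for this example, and no response shape is assumed.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8765"  # default launcher address

# Export file names listed under "Saved run exports".
EXPORT_FILES = (
    "results.jsonl",
    "results.csv",
    "results.tsv",
    "summary.json",
    "manifest.json",
)


def export_url(job_id: str, filename: str, base_url: str = BASE_URL) -> str:
    """Build the export URL for one saved run."""
    if filename not in EXPORT_FILES:
        raise ValueError(f"unknown export file: {filename}")
    return f"{base_url}/api/benchmark/{quote(job_id, safe='')}/{filename}"


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """GET a path such as /api/benchmark/modules and decode the body as JSON."""
    with urlopen(f"{base_url}{path}", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the backend running, `fetch_json("/api/benchmark/modules")` should return the module listing, and `export_url(job_id, "results.csv")` gives a URL that can be passed to any download tool.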
llm-testbench/
├── index.html # Single-page browser UI
├── run.bat / run.sh # Cross-platform launchers
├── python/
│ ├── server.py # Backend API and benchmark orchestration
│ ├── adapter.py # Benchmark adapter ABC + Speed / SQL / BFCL adapters
│ ├── sql_benchmark.py # SQL benchmark runner
│ ├── bfcl.py # BFCL loader, scorer, and argument comparator
│ └── local_benchmarks.py # Local fixture loaders, validators, and scorers
├── sql_benchmark_data/ # SQL questions and AdventureWorks tables
├── bfcl_data/ # Local BFCL-style questions and answers
├── coding_data/ # Tiny Python coding tasks
├── json_schema_data/ # JSON instruction-following tasks
├── prompt_replay_data/ # Fixed regression prompts
├── docs/screenshots/ # README screenshots
└── tests/ # Backend, adapter, fixture, dashboard, and frontend tests
Install dependencies:
python -m pip install -r python/requirements.txt pytest
Run the test suite (166 tests):
python -m pytest tests -q
Run the backend directly:
python -m python.server
In scope: local model endpoints, small repository-owned fixtures, deterministic local tests, simple adapter and API contracts, fast smoke and comparison runs.
Out of scope: Terminal-Bench; SWE-bench / SWE-rebench / Multi-SWE-bench / SWE-agent / OpenHands / SWE-ReX; WebArena / OSWorld / CodeClash / GAIA / tau-bench; LiveCodeBench and BigCodeBench as external integrations; Docker orchestration, browser farms, desktop VMs, remote services, large downloaded benchmark datasets.

