This repository provides a framework for simulation coding benchmarks targeting LLM agents.
- Python 3.10+
- Docker
- uv for host-side tooling (recommended)
Install host dev tools:
uv sync --extra dev
Notes:
- The runner supports Python 3.10+.
- For Python <3.11, tomli is installed from the dev extras and used as a fallback TOML parser.
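The TOML parser fallback noted above is commonly written as a guarded import. The following is a minimal sketch of that pattern, not necessarily how the runner implements it; `load_toml_text` is a hypothetical helper name, and the sample keys are made up for illustration.

```python
try:
    import tomllib  # part of the standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API, needed on Python 3.10


def load_toml_text(text: str) -> dict:
    """Parse a TOML document from a string (hypothetical helper)."""
    return tomllib.loads(text)


# Illustrative input only; these keys are not taken from the repository.
config = load_toml_text('name = "demo"\n[agent]\ntimeout = 60\n')
print(config["agent"]["timeout"])
```

Aliasing tomli to tomllib keeps the rest of the code identical on both Python versions.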
Configured in agents_default.toml (with per-agent overrides under sample/*.toml):
- OpenCode (opencode)
- Claude Code (claude)
- Codex (codex)
- GitHub Copilot (copilot)
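One plausible way to combine defaults with a per-agent override is a recursive key-by-key merge. The sketch below assumes that behavior; the actual merge rules of the runner are not described here, and `merge_config` as well as every key in the sample dictionaries are hypothetical.

```python
from copy import deepcopy


def merge_config(defaults: dict, override: dict) -> dict:
    """Return a new dict in which override values win over defaults.

    Nested tables are merged recursively; all other values are replaced.
    This is an assumed policy, not the runner's documented behavior.
    """
    merged = deepcopy(defaults)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for agents_default.toml and sample/*.toml.
defaults = {"agents": {"opencode": {"model": "default-model", "timeout": 600}}}
override = {"agents": {"opencode": {"model": "other-model"}}}

effective = merge_config(defaults, override)
print(effective)
```

Returning a fresh dictionary leaves the defaults untouched, so the same defaults can be reused for several agents.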
- Demo benchmark for Runge-Kutta 2 (RK2) midpoint method.
- 3D wave equation solver with finite difference method.
- Magnetohydrodynamics (MHD) solver.
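For orientation, the RK2 midpoint method named in the first item advances y' = f(t, y) by evaluating the slope at the midpoint of each step. The sketch below is the textbook scheme only; it is not the benchmark's reference solution, and the function names and the test problem are chosen here for illustration.

```python
import math
from typing import Callable


def rk2_midpoint_step(f: Callable[[float, float], float],
                      t: float, y: float, h: float) -> float:
    """Advance y by one step of size h using the explicit midpoint rule."""
    k1 = f(t, y)                              # slope at the start of the step
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)     # slope at the estimated midpoint
    return y + h * k2


def integrate(f: Callable[[float, float], float],
              t0: float, y0: float, t_end: float, n_steps: int) -> float:
    """Integrate from t0 to t_end with n_steps equal midpoint steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = rk2_midpoint_step(f, t, y, h)
        t += h
    return y


# Illustrative test problem: y' = -y, y(0) = 1, exact solution exp(-t).
approx = integrate(lambda t, y: -y, 0.0, 1.0, 1.0, 100)
error = abs(approx - math.exp(-1.0))
print(approx, error)
```

The method is second-order accurate, so halving the step size should reduce the error by roughly a factor of four.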
By default, pull the published GHCR toolchain image for local use. This repo
publishes ghcr.io/amanotk/simbench:develop for the shared develop
toolchain. Build the image locally only if you need a custom toolchain or you
have changed docker/Dockerfile or scripts/build_image.py.
Published toolchain image:
docker pull ghcr.io/amanotk/simbench:develop
docker tag ghcr.io/amanotk/simbench:develop simbench:0.1
If the package is not publicly accessible to you, authenticate first:
docker login ghcr.io
Build locally only if needed:
python3 scripts/build_image.py
Direct Docker build (fallback):
docker build -t simbench:0.1 -f docker/Dockerfile .
List and validate tasks:
python3 runner/bench.py list
python3 runner/bench.py check
Run a task:
python3 runner/bench.py run sample/opencode.toml demo/py --image simbench:0.1
Run the tiny OpenCode smoke task (kept under tests/test-tasks/, not benchmarks/):
python3 runner/bench.py run tests/fixtures/agent_configs/opencode-smoke.toml test:smoke/py --image simbench:0.1
Run the tiny Copilot smoke task:
python3 runner/bench.py run tests/fixtures/agent_configs/copilot-smoke.toml test:smoke/py --image simbench:0.1
Runner smoke tests:
python3 -m unittest -q tests.test_runner_smoke.TestOpenCodeSmoke
python3 -m unittest -q tests.test_runner_smoke.TestCopilotSmoke
- Set SIMBENCH_SKIP_OPENCODE_SMOKE=1 to skip the live OpenCode smoke run.
- Set SIMBENCH_SKIP_COPILOT_SMOKE=1 to skip the live Copilot smoke run.
- Set COPILOT_GITHUB_TOKEN for token-only Copilot CLI auth.
- The Copilot smoke config uses gpt-4.1 for a faster, more stable smoke run.
- OpenCode, Copilot, Codex, and Claude parser coverage replays real CLI logs from tests/fixtures/agent_streams/; see tests/fixtures/agent_streams/README.md to refresh those golden files after CLI output changes.
- CI skips the live OpenCode smoke run by default.
- CI runs the live Copilot smoke on py3.11 when COPILOT_GITHUB_TOKEN is available.
- CI pulls ghcr.io/amanotk/simbench:<head-branch> for PRs when available, falls back to ghcr.io/amanotk/simbench:develop, and otherwise builds locally.
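The image-selection order in the last CI note can be sketched as a short fallback chain. This is an illustration of the described order only, not the actual CI workflow; `select_image`, `docker_pull`, and the `try_pull` parameter are hypothetical names.

```python
import subprocess
from typing import Callable, Optional

REGISTRY_IMAGE = "ghcr.io/amanotk/simbench"


def docker_pull(ref: str) -> bool:
    """Return True if `docker pull <ref>` succeeds (requires Docker on PATH)."""
    return subprocess.run(["docker", "pull", ref]).returncode == 0


def select_image(head_branch: Optional[str],
                 try_pull: Callable[[str], bool]) -> Optional[str]:
    """Try the PR head-branch tag first, then the develop tag.

    Returns the first reference that could be pulled, or None to signal
    that the caller should build the image locally.
    """
    candidates = []
    if head_branch:
        candidates.append(f"{REGISTRY_IMAGE}:{head_branch}")
    candidates.append(f"{REGISTRY_IMAGE}:develop")
    for ref in candidates:
        if try_pull(ref):
            return ref
    return None


# Dry run with a fake puller that only "finds" the develop tag.
chosen = select_image("feature-x", lambda ref: ref.endswith(":develop"))
print(chosen or "build locally")
```

Passing the pull function as a parameter makes the selection logic testable without a Docker daemon; in real use, `docker_pull` would be passed instead of the lambda.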
Eval only:
python3 runner/bench.py eval demo/py --workdir /path/to/workdir --image simbench:0.1
Publish a completed run:
python3 runner/bench.py publish runs/<run_id>/<suite>/<task_id>
The publish command validates the run record and renders a deterministic issue
payload for benchmark result publication. See docs/run-flow.md for details on
run artifacts and the publication workflow.
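To illustrate what "validate, then render deterministically" can mean in practice, the sketch below checks for required fields and serializes with sorted keys so that the same record always yields the same text. The field names and the JSON format are assumptions made for this example; the real run record schema and payload format are defined by the runner and docs/run-flow.md.

```python
import json

# Hypothetical schema: these field names are assumptions, not the real record.
REQUIRED_FIELDS = ("run_id", "suite", "task_id", "status")


def validate_run_record(record: dict) -> None:
    """Raise ValueError if any assumed required field is missing."""
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"run record is missing fields: {', '.join(missing)}")


def render_payload(record: dict) -> str:
    """Render the record as text that does not depend on key insertion order."""
    validate_run_record(record)
    return json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# The same data in a different key order renders to identical text.
first = render_payload(
    {"run_id": "r1", "suite": "demo", "task_id": "py", "status": "passed"}
)
second = render_payload(
    {"status": "passed", "task_id": "py", "suite": "demo", "run_id": "r1"}
)
print(first == second)
```

Sorting keys and fixing the indentation removes incidental differences, which keeps published payloads easy to diff and compare.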
- benchmarks/<suite>: benchmark suites
- tests/test-tasks/<suite>: smoke and E2E support tasks
- docs/: documentation
- runner/bench.py: runner CLI
- agents_default.toml: default agent config
- sample/*.toml: sample per-agent overrides
- docs/development.md: developer workflow, branching, and CI policy
- docs/toolchain.md: default Docker toolchain and preinstalled libraries
- docs/task-development.md: task-author quickstart
- docs/task-reference.md: task format and contracts
- docs/run-flow.md: runtime and artifact flow