Simulation Coding Benchmark for LLM Agents

This repository provides a framework for simulation coding benchmarks targeting LLM agents.

Requirements

Python 3.10+
Docker
uv for host-side tooling (recommended)

Install host dev tools:

uv sync --extra dev

Notes:

The runner supports Python 3.10+.
On Python versions below 3.11, tomli is installed from the dev extras and used as the fallback TOML parser.
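
The fallback described in the notes above usually follows a common import pattern. The sketch below is an illustration only, not the runner's actual code: the helper name load_toml_text and the TOML fragment are hypothetical and do not reflect the real schema of agents_default.toml.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API, used on Python 3.10


def load_toml_text(text: str) -> dict:
    """Parse a TOML document from a string with whichever parser was imported."""
    return tomllib.loads(text)


# Hypothetical config fragment, for illustration only.
example = load_toml_text('[agent]\nname = "opencode"\n')
print(example["agent"]["name"])
```

Because tomli and tomllib expose the same loads/load functions, the rest of the code does not need to know which one was imported.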

Supported Agents

Configured in agents_default.toml (with per-agent overrides under sample/*.toml):

OpenCode (opencode)
Claude Code (claude)
Codex (codex)
GitHub Copilot (copilot)
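
One way to picture how per-agent overrides could layer on top of the defaults is a recursive dictionary merge. The runner's real merge rules are not described here, so this is a sketch under the assumption that override values win and nested tables merge key by key. The function merge_config and the keys model and timeout_s are hypothetical.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where overrides take precedence over defaults.

    Nested dicts are merged key by key; any other value is replaced.
    Neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents standing in for agents_default.toml and a sample override.
defaults = {"opencode": {"model": "default-model", "timeout_s": 600}}
overrides = {"opencode": {"model": "other-model"}}

merged = merge_config(defaults, overrides)
print(merged)
```

Keys that the override does not mention (timeout_s in this example) keep their default values.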

Available Benchmarks

Demo benchmark for the second-order Runge-Kutta (RK2) midpoint method.
3D wave equation solver using the finite difference method.
Magnetohydrodynamics (MHD) solver.
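
For readers unfamiliar with the method behind the demo benchmark, the following is a minimal sketch of the textbook RK2 midpoint scheme. It is not the benchmark's reference solution, and the function names and the test problem dy/dt = -y are chosen here only for illustration.

```python
import math
from typing import Callable


def rk2_midpoint_step(f: Callable[[float, float], float], t: float, y: float, h: float) -> float:
    """Advance y' = f(t, y) by one step of size h using the midpoint rule."""
    k1 = f(t, y)                           # slope at the start of the step
    k2 = f(t + 0.5 * h, y + 0.5 * h * k1)  # slope at the estimated midpoint
    return y + h * k2


def integrate(f: Callable[[float, float], float], t0: float, y0: float,
              t_end: float, n_steps: int) -> float:
    """Integrate from t0 to t_end with n_steps equal midpoint steps."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = rk2_midpoint_step(f, t, y, h)
        t += h
    return y


# Test problem: dy/dt = -y with y(0) = 1, whose exact solution is exp(-t).
approx = integrate(lambda t, y: -y, 0.0, 1.0, 1.0, 100)
print(abs(approx - math.exp(-1.0)))
```

The method is second-order accurate, so halving the step size should reduce the global error by roughly a factor of four.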

Quick Start

By default, pull the published GHCR toolchain image for local use. This repo publishes ghcr.io/amanotk/simbench:develop for the shared develop toolchain. Build the image locally only if you need a custom toolchain or you have changed docker/Dockerfile or scripts/build_image.py.

Published toolchain image:

docker pull ghcr.io/amanotk/simbench:develop
docker tag ghcr.io/amanotk/simbench:develop simbench:0.1

If the package is not publicly accessible to you, authenticate first:

docker login ghcr.io

Build locally only if needed:

python3 scripts/build_image.py

Direct Docker build (fallback):

docker build -t simbench:0.1 -f docker/Dockerfile .

List and validate tasks:

python3 runner/bench.py list
python3 runner/bench.py check

Run a task:

python3 runner/bench.py run sample/opencode.toml demo/py --image simbench:0.1
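
When scripting several runs, it can help to assemble the command as an argument list. The sketch below uses only the subcommand and the --image flag shown above; the helper build_run_command is hypothetical and not part of the runner.

```python
import shlex


def build_run_command(agent_config: str, task: str, image: str = "simbench:0.1") -> list:
    """Build the argv list for `runner/bench.py run` as shown in the Quick Start."""
    return ["python3", "runner/bench.py", "run", agent_config, task, "--image", image]


cmd = build_run_command("sample/opencode.toml", "demo/py")
print(shlex.join(cmd))

# To execute from the repository root (assumes Docker and the image are available):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues if a config path contains spaces.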

Run the tiny OpenCode smoke task (kept under tests/test-tasks/, not benchmarks/):

python3 runner/bench.py run tests/fixtures/agent_configs/opencode-smoke.toml test:smoke/py --image simbench:0.1

Run the tiny Copilot smoke task:

python3 runner/bench.py run tests/fixtures/agent_configs/copilot-smoke.toml test:smoke/py --image simbench:0.1

Runner smoke tests:

python3 -m unittest -q tests.test_runner_smoke.TestOpenCodeSmoke
python3 -m unittest -q tests.test_runner_smoke.TestCopilotSmoke

Notes:

Set SIMBENCH_SKIP_OPENCODE_SMOKE=1 to skip the live OpenCode smoke run.
Set SIMBENCH_SKIP_COPILOT_SMOKE=1 to skip the live Copilot smoke run.
Set COPILOT_GITHUB_TOKEN for token-only Copilot CLI auth.
The Copilot smoke config uses gpt-4.1 for a faster, more stable smoke run.
Parser coverage for OpenCode, Copilot, Codex, and Claude replays real CLI logs from tests/fixtures/agent_streams/. See tests/fixtures/agent_streams/README.md for how to refresh those golden files after CLI output changes.
CI skips the live OpenCode smoke run by default.
CI runs the live Copilot smoke run on Python 3.11 when COPILOT_GITHUB_TOKEN is available.
CI pulls ghcr.io/amanotk/simbench:<head-branch> for PRs when available, falls back to ghcr.io/amanotk/simbench:develop, and otherwise builds locally.
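
The CI image selection order can be sketched as a simple fallback chain. This illustrates the described behavior and is not the actual CI workflow: resolve_image and the can_pull callback are hypothetical, and a real implementation would attempt docker pull where this sketch checks a set.

```python
from typing import Callable, Optional

REGISTRY_IMAGE = "ghcr.io/amanotk/simbench"


def resolve_image(head_branch: Optional[str], can_pull: Callable[[str], bool]) -> Optional[str]:
    """Return the first pullable image, or None if a local build is needed.

    Order: the PR head-branch tag (if any), then the shared develop tag.
    """
    candidates = []
    if head_branch:
        candidates.append(f"{REGISTRY_IMAGE}:{head_branch}")
    candidates.append(f"{REGISTRY_IMAGE}:develop")
    for image in candidates:
        if can_pull(image):
            return image
    return None  # nothing could be pulled: the caller should build locally


# Simulated registry in which only the develop tag exists.
available = {"ghcr.io/amanotk/simbench:develop"}
choice = resolve_image("feature-x", lambda image: image in available)
print(choice)
```

Injecting can_pull keeps the selection logic testable without network access.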

Eval only:

python3 runner/bench.py eval demo/py --workdir /path/to/workdir --image simbench:0.1

Publish a completed run:

python3 runner/bench.py publish runs/<run_id>/<suite>/<task_id>

The publish command validates the run record and renders a deterministic issue payload for benchmark result publication. See docs/run-flow.md for details on run artifacts and the publication workflow.
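
"Deterministic" here means that the same run record always renders to the same bytes. The actual payload format is documented in docs/run-flow.md; the sketch below only illustrates the general idea with JSON serialization. The function render_payload and the status field are hypothetical.

```python
import json


def render_payload(record: dict) -> str:
    """Serialize a record so that key insertion order does not affect the output."""
    return json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


# Two records with the same content but different key order.
a = {"task_id": "demo/py", "status": "passed"}
b = {"status": "passed", "task_id": "demo/py"}

print(render_payload(a) == render_payload(b))
```

Sorting keys and fixing the indentation removes the usual sources of run-to-run variation in rendered output.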

Repository Layout

benchmarks/<suite>: benchmark suites
tests/test-tasks/<suite>: smoke and E2E support tasks
docs/: documentation
runner/bench.py: runner CLI
agents_default.toml: default agent config
sample/*.toml: sample per-agent overrides

Documentation

docs/development.md: developer workflow, branching, and CI policy
docs/toolchain.md: default Docker toolchain and preinstalled libraries
docs/task-development.md: task-author quickstart
docs/task-reference.md: task format and contracts
docs/run-flow.md: runtime and artifact flow
