Release: MHD benchmark suite + runner refactoring + test infrastructure by amanotk · Pull Request #10 · amanotk/simbench

amanotk · 2026-04-02T06:58:48Z

Summary

Merges the MHD benchmark suite and extensive runner refactoring into main.

Major Changes

New Features

Magnetohydrodynamics (MHD) benchmark suite (feat: Add Magnetohydrodynamics (MHD) benchmark suite #9)
- 4 C++ tasks: cpp-hlld-00, cpp-hlld-01, cpp-full1d-00, cpp-full1d-01
- HLLD Riemann solver implementation with Brio-Wu reference solutions
- Shared eval infrastructure (mhd1d_shared.py)
- Public/hidden test suites with numeric tolerance (1.0e-12)
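
The 1.0e-12 tolerance could be applied as in the following sketch. This is not the code in mhd1d_shared.py: the helper name `within_tolerance` and the choice of an absolute (rather than relative) tolerance are assumptions made for illustration.

```python
import math

TOLERANCE = 1.0e-12  # value quoted in the PR summary


def within_tolerance(actual, expected, tol=TOLERANCE):
    """Return True if both sequences have equal length and agree element-wise.

    Hypothetical helper: uses an absolute tolerance only (rel_tol is disabled).
    """
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=0.0, abs_tol=tol)
        for a, e in zip(actual, expected)
    )
```

A tolerance this tight only absorbs round-off differences, so a check of this kind would flag any change in the order of floating-point operations that shifts a result by more than about 1e-12.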

Runner Refactoring

Split monolithic test_runner_bench.py (2465 lines) into focused modules:
- config_helpers.py, docker_runner_helpers.py
- execution_agent.py, execution_helpers.py
- metrics_helpers.py, results_helpers.py
- run_record_helpers.py, task_loading_helpers.py
- publish_helpers.py
Added adversarial test coverage for all new modules
Converted parser tests to golden file approach (tests/fixtures/agent_streams/)
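
A golden file test of this kind might look like the sketch below. The parser here is a stand-in: `parse_stream`, `check_against_golden`, the JSON-lines input and the fixture file names are all hypothetical, and the real fixtures under tests/fixtures/agent_streams/ are not reproduced.

```python
import json
import tempfile
import unittest
from pathlib import Path


def parse_stream(text):
    """Stand-in parser (hypothetical): count events and sum their "tokens" field."""
    events = [json.loads(line) for line in text.splitlines() if line.strip()]
    return {"events": len(events), "tokens": sum(e.get("tokens", 0) for e in events)}


def check_against_golden(stream_path, golden_path):
    """Parse a recorded stream and compare it to the stored golden result."""
    actual = parse_stream(Path(stream_path).read_text(encoding="utf-8"))
    expected = json.loads(Path(golden_path).read_text(encoding="utf-8"))
    return actual == expected


class GoldenParserTest(unittest.TestCase):
    def test_recorded_stream_matches_golden(self):
        # A real suite would read checked-in fixtures; temporary files keep
        # this sketch self-contained.
        with tempfile.TemporaryDirectory() as tmp:
            stream = Path(tmp) / "sample.jsonl"
            golden = Path(tmp) / "sample.golden.json"
            stream.write_text('{"tokens": 3}\n{"tokens": 4}\n', encoding="utf-8")
            golden.write_text('{"events": 2, "tokens": 7}', encoding="utf-8")
            self.assertTrue(check_against_golden(stream, golden))
```

The appeal of the approach is that the expected output lives in a reviewable file, so a parser change shows up as a diff in the golden file rather than as edits scattered across assertions.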

Test Infrastructure

Moved test-tasks/ → tests/test-tasks/ for better organization
Added unittest-based import compatibility tests
Fixed unittest discovery in CI
Added smoke task validation to CI

CI/CD

Added benchmark result publishing workflow
CI pulls ghcr.io/amanotk/simbench:<branch> for PRs when available
Copilot smoke runs on PRs with token auth

Documentation

Added docs/run-flow.md: runtime and artifact flow
Updated docs/development.md: developer workflow
Updated docs/toolchain.md: C++ headers and libraries

Bug Fixes

Fixed HLLD flux loop buffer overflow (bound ubx + 1 → ubx)
Fixed smoke config path resolution
Fixed test discovery patterns in CI
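
The HLLD loop-bound fix can be illustrated with plain index arithmetic. The sketch below is Python rather than the C++ used by the tasks, and the layout is assumed rather than taken from the PR: a buffer of Nx + 2*N_margin cells, N_margin = 1, a last interior cell at ubx = N_margin + Nx - 1, and an inclusive loop bound. Only the ubx + 1 → ubx change and the read at ix + 1 come from the PR itself.

```python
def highest_index_read(loop_upper_inclusive, offset=1):
    """Largest index touched when each iteration ix reads element ix + offset."""
    return loop_upper_inclusive + offset


def stays_in_bounds(loop_upper_inclusive, buffer_size, offset=1):
    """True if the last iteration still reads inside a buffer of buffer_size cells."""
    return highest_index_read(loop_upper_inclusive, offset) <= buffer_size - 1


# Assumed layout (not taken from the PR): one margin cell on each side.
nx, n_margin = 100, 1
buffer_size = nx + 2 * n_margin  # 102 cells, valid indices 0..101
ubx = n_margin + nx - 1          # last interior cell, index 100

old_ok = stays_in_bounds(ubx + 1, buffer_size)  # last read lands on index 102
new_ok = stays_in_bounds(ubx, buffer_size)      # last read lands on index 101
print(old_ok, new_ok)
```

Under these assumptions the script prints `False True`: the old bound reads one cell past the end of the buffer, and the new bound reads exactly the last allocated cell.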

Testing

All runner unit tests pass (adversarial + seam coverage)
Golden parser tests for OpenCode, Copilot, Codex, Claude
E2E smoke tests for OpenCode and Copilot

Stats

134 files changed
+19,061 insertions, -4,128 deletions

Split runner/bench.py responsibilities into dedicated helper modules while preserving the stable CLI entrypoint and bench-level execution seams. Reorganize runner tests around the extracted surfaces and restore direct script execution compatibility.

* Add magnetohydrodynamics cpp-hlld benchmark scaffold
* Fix GitHub math rendering in HLLD doc
* Strengthen cpp-hlld benchmark guidance and tests
* Update formatting of starred-state update equations
  Reformatted math equations for better readability.
* Fix hidden test merge artifact
* Strengthen HLLD coverage and fix Docker OpenCode setup
* Add full 1D MHD benchmark with Brio-Wu reference
* Add golden Brio-Wu CLI regression test
* Refactor MHD state storage to mdspan views
* checkpoint: spiral-unknown-1774769904852
* checkpoint: spiral-unknown-1774773137200
* Migrate cpp-full-solver1d to mdspan views
* Refactor Brio-Wu CLI around SolverWorkspace
* WIP: normalize solver workspace shapes
* WIP: add primitive reconstruction helper
* WIP: add workspace-patterned RK3 path
* WIP: run full solver via workspace-patterned RK3
* Migrate fully to workspace-patterned RK3 solver
* checkpoint: spiral-unknown-1775016836330
* checkpoint: spiral-unknown-1775016843861
* checkpoint: spiral-unknown-1775016849781
* refactor(cpp-full-solver1d): simplify API and remove dead code
  - Rename ArrayView/XView to ArrayView2D/ArrayView1D, drop ConstArrayView
  - Remove all _inplace wrappers and ConstArrayView overloads
  - Drop default constructor parameters (domain fixed to [0,1])
  - Use constexpr coefficient table for SSP-RK3 substeps
  - Extract init_brio_wu_primitive() in main.cpp
  - Add cell center view (x) to SolverWorkspace, remove cell_centers()
  - Update tests to match new type names and required constructor args
* refactor(cpp-full-solver1d): drop dt/t_final from SolverWorkspace
  Remove dt and t_final from SolverWorkspace constructor and members. Pass them directly to evolve_ssp_rk3(workspace, dt, t_final) instead. Constructor now takes (nx, gamma, bx) only.
* refactor(cpp-full-solver1d): clean up initialize and cell center indexing
  - Simplify initialize() signature: remove discontinuity_x parameter, hardcode 0.5 inside; remove unused constants from main.cpp
  - Fix cell center initialization to only fill interior cells (Lbx..Ubx) instead of all padded cells, matching the physical domain convention
  - Rename local dt/t_final to delt/tmax in main()
* refactor(cpp-full-solver1d): simplify storage and boundary handling
* refactor(cpp-full-solver1d): tighten workspace layout and boundaries
* refactor(magnetohydrodynamics): reduce grid resolution from 400 to 100 cells
  Update default Nx constant and regenerate golden/reference CSV files. Improve plot styling with math mode labels and remove subplot titles.
* refactor(magnetohydrodynamics): move reference code into shared assets
  Split the Brio-Wu reference solver and its test fixtures into shared assets, remove the JSON-based helper path, and make the solver grid size configurable
* chore(magnetohydrodynamics): remove obsolete fixture json
* refactor(magnetohydrodynamics): streamline shared solver interfaces
* refactor(magnetohydrodynamics): split cpp-hlld into 00/01 variants
* chore: add vscode workspace excludes
* chore: add uv lockfile
* refactor(magnetohydrodynamics): replace full-solver task with cpp-full1d-00
* feat(magnetohydrodynamics): add cpp-full1d-01 variant with minimal API
  - cpp-full1d-01 exposes only evolve_ssp_rk3(...) as public solver entrypoint
  - HLLD and CLI remain provided, same as cpp-full1d-00
  - Public C++ tests adapted to reduced scaffolding
  - Fixed hidden eval path in both 00 and 01 to use /eval_shared mount
  Co-authored-by: OpenCode Assistant <assistant@opencode.ai>
* fix(magnetohydrodynamics): use absolute paths in hidden eval tests
  The eval test was using relative paths based on __file__ position which breaks when tests run from /eval/tests/. Fixed to use absolute paths:
  - WORKSPACE_ROOT = Path("/work/workspace")
  - REFERENCE_CSV_PATH uses /eval_shared mount
  This affects both cpp-full1d-00 and cpp-full1d-01 variants.
* fix(magnetohydrodynamics): fix WORKSPACE_ROOT path to /work
  The workdir is mounted at /work, not /work/workspace. The original workspace subdirectory doesn't exist in the run workdir.
* Simplify MHD test build dir and code style
* fix(mhd): correct HLLD flux loop bound to prevent buffer overflow
  The flux loop was iterating to ubx + 1 and accessing up_r(ix + 1), which read past the allocated Nx + 2*N_margin buffer on the last iteration. Changed loop bound from ubx + 1 to ubx to stay within allocated memory.

---------

Co-authored-by: OpenCode Assistant <assistant@opencode.ai>
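
The commit log mentions a coefficient table for the SSP-RK3 substeps. A minimal Python sketch of that idea, using the standard Shu-Osher form of the scheme, is given here. The names `SSP_RK3_COEFFS`, `ssp_rk3_step`, `rhs` and `decay` are hypothetical, and the C++ entrypoint evolve_ssp_rk3(workspace, dt, t_final) is not reproduced.

```python
# Each stage computes: u_stage = a * u_n + b * (u_prev + dt * rhs(u_prev)).
SSP_RK3_COEFFS = (
    (0.0, 1.0),
    (3.0 / 4.0, 1.0 / 4.0),
    (1.0 / 3.0, 2.0 / 3.0),
)


def ssp_rk3_step(u, dt, rhs):
    """Advance the list of values u by one SSP-RK3 step for du/dt = rhs(u)."""
    u_n = list(u)      # solution at the start of the step
    u_prev = list(u)   # result of the previous stage
    for a, b in SSP_RK3_COEFFS:
        du = rhs(u_prev)
        u_prev = [a * un + b * (up + dt * d) for un, up, d in zip(u_n, u_prev, du)]
    return u_prev


def decay(u):
    """Right-hand side of the linear test problem du/dt = -u."""
    return [-x for x in u]
```

For the linear test problem a single step reproduces the third-order Taylor polynomial 1 - dt + dt²/2 - dt³/6 up to round-off, which gives a quick sanity check on the table.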

Consolidates test-related directories by moving test-tasks/ into tests/test-tasks/ for better organization. Updates all references in:
- runner/bench.py: TEST_TASK_ROOT path
- runner/task_loading_helpers.py: task root map key
- Test files: path references and expected keys
- Documentation: README.md, AGENTS.md, docs/task-development.md

amanotk · 2026-04-02T07:19:28Z

Review @codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfd99c37e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-02T07:31:20Z

+
+def _collect_opencode_usage_metrics(*, state_dir: Path) -> dict[str, Any]:
+    env = dict(os.environ)
+    env["HOME"] = str(state_dir)


Use XDG_DATA_HOME for OpenCode stats lookup

Host-mode OpenCode runs are configured with XDG_DATA_HOME (runner/execution_agent.py), but post-run stats collection now rewrites HOME instead. That points opencode stats at a different state location, so host-mode runs can silently lose fallback token/cost metrics (especially when streamed usage is missing, e.g., failures/timeouts). This is a regression from the previous behavior where stats used the same env key as execution.
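
One way to act on this suggestion is to build the stats environment with the same key the execution path uses. The sketch below is an assumption about how that could look, not the fix that was merged: `build_opencode_env` is a hypothetical helper, and only the `XDG_DATA_HOME` key and the `state_dir` parameter come from the review comment.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def build_opencode_env(
    state_dir: Path, base_env: Optional[Mapping[str, str]] = None
) -> dict:
    """Return a copy of the environment with XDG_DATA_HOME set to state_dir.

    Hypothetical helper: calling it from both the execution path and the
    stats lookup keeps the two from drifting apart. HOME is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XDG_DATA_HOME"] = str(state_dir)
    return env
```

Sharing one helper between execution and stats collection means a future change to the state location only has to be made in one place.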

chatgpt-codex-connector · 2026-04-02T07:31:20Z

+    completed_at = _require_str_field(record, "completed_at")
+    _validate_completed_at(completed_at)
+
+    repo_commit_sha = _require_str_field(record, "repo_commit_sha").lower()


Handle nullable repo provenance in publish validation

run_record_helpers explicitly allows unresolved git metadata to degrade to None, but publish normalization hard-requires string/bool types for repo_commit_sha, repo_branch, and repo_dirty. As a result, bench publish rejects runner-produced run.json files from environments where git metadata is unavailable, even though those runs are otherwise valid; this blocks publication instead of surfacing a warning/signal.
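
A tolerant variant of the field check might look like the sketch below. It is an illustration under assumptions, not the merged fix: `optional_str_field` and `normalize_commit_sha` are hypothetical names, and the behaviour of the real `_require_str_field` is inferred only from the call shown in the diff.

```python
from typing import Any, Mapping, Optional


def optional_str_field(record: Mapping[str, Any], key: str) -> Optional[str]:
    """Return record[key] if it is a string, None if it is missing or None.

    Hypothetical counterpart to _require_str_field: other types still raise.
    """
    value = record.get(key)
    if value is None:
        return None
    if not isinstance(value, str):
        raise ValueError(f"{key} must be a string or null, got {type(value).__name__}")
    return value


def normalize_commit_sha(record: Mapping[str, Any]) -> Optional[str]:
    """Lower-case repo_commit_sha when present; pass None through unchanged."""
    sha = optional_str_field(record, "repo_commit_sha")
    return sha.lower() if sha is not None else None
```

With this shape, a run.json produced without git metadata passes validation, and the caller can decide whether a missing commit SHA deserves a warning in the published payload.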

amanotk and others added 26 commits March 26, 2026 15:00

Clarify published image usage in docs

50ed3db

Move smoke tasks out of benchmarks

e68c9da

Add Copilot smoke E2E coverage

88b13b9

Stabilize Copilot smoke runtime

05ceec8

Add golden parser coverage for Codex and Claude

d439f16

Add OpenCode and Copilot golden parser tests

cc0f4f7

Validate smoke task scaffolding in CI

5ea8243

checkpoint: before_task_1_2_test_reorg

090b95b

checkpoint: before_task_2_1_result_helpers

8caab60

checkpoint: before_task_2_1_metrics_helpers

2cf31b5

checkpoint: before_task_2_1_task_loading_helpers

0059bbb

checkpoint: before_task_2_1_config_helpers

3a4a8e1

checkpoint: before_task_2_2_execution_helpers

704dbc4

checkpoint: before_task_2_2_agent_execution

3169a50

Move smoke configs into test fixtures

04a6bb6

Keep smoke-only agent configs with other test assets so sample/ stays focused on user-facing examples. Update smoke tests and contributor docs to use the new fixture paths.

Fix CI test discovery and lint failures

1a4c5c9

Update the workflow to run the reorganized runner test suite and clean up lint issues in the extracted adversarial tests. This restores the failing GitHub Actions jobs without changing runner behavior.

Fix unittest discovery invocation in CI

ca73fa7

Use the correct unittest discover syntax in GitHub Actions and include the full test module pattern so the reorganized suite runs in CI as it does locally.

Convert import compatibility tests to unittest

ff8d61d

Remove the implicit pytest dependency from the metrics import compatibility tests so GitHub Actions can run the full discovered suite with stdlib unittest only.

checkpoint: pre-task-1-1-run-metadata

a100d08

Add benchmark result publishing workflow

54edc7e

Record canonical run metadata and render stable publish payloads so completed benchmark runs can be reviewed and documented without breaking existing result.json consumers.

Fix smoke config path docs

492991f

Reference the relocated smoke-task fixture configs in the development guide so the documentation matches the path relocation regression tests and current repository layout.

Ignore swarm workspace state

ffceb68

Keep local swarm planning and evidence artifacts out of version control alongside other generated run data.

test: update task prefix expectation after test-tasks move

bfd99c3

chatgpt-codex-connector Bot reviewed Apr 2, 2026

fix: address PR review regressions

e3a54be

amanotk merged commit f7eae30 into main Apr 2, 2026
6 checks passed

amanotk merged 27 commits into main from develop
