
Release: MHD benchmark suite + runner refactoring + test infrastructure #10

Merged
amanotk merged 27 commits into main from develop on Apr 2, 2026

@amanotk commented Apr 2, 2026

Summary

Merges the MHD benchmark suite and extensive runner refactoring into main.

Major Changes

New Features

  • Magnetohydrodynamics (MHD) benchmark suite (feat: Add Magnetohydrodynamics (MHD) benchmark suite #9)
    • 4 C++ tasks: cpp-hlld-00, cpp-hlld-01, cpp-full1d-00, cpp-full1d-01
    • HLLD Riemann solver implementation with Brio-Wu reference solutions
    • Shared eval infrastructure (mhd1d_shared.py)
    • Public/hidden test suites with numeric tolerance (1.0e-12)
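
To illustrate what a numeric tolerance of 1.0e-12 can mean in practice, here is a minimal sketch of an elementwise comparison against reference values. This PR does not state whether the suite applies the tolerance as an absolute bound, a relative bound, or both; the function name `within_tolerance` and the combined absolute/relative check are assumptions made for illustration only.

```python
import math

TOL = 1.0e-12  # tolerance quoted in the PR summary


def within_tolerance(actual, expected, tol=TOL):
    """Return True if every value matches its reference within tol.

    Assumption: tol serves as both the relative and the absolute bound.
    The real eval helpers may use a different rule.
    """
    if len(actual) != len(expected):
        return False
    return all(
        math.isclose(a, e, rel_tol=tol, abs_tol=tol)
        for a, e in zip(actual, expected)
    )
```

A tolerance this tight only tolerates round-off-level differences, so a candidate solver would have to reproduce the reference arithmetic almost exactly to pass.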

Runner Refactoring

  • Split monolithic test_runner_bench.py (2465 lines) into focused modules:
    • config_helpers.py, docker_runner_helpers.py
    • execution_agent.py, execution_helpers.py
    • metrics_helpers.py, results_helpers.py
    • run_record_helpers.py, task_loading_helpers.py
    • publish_helpers.py
  • Added adversarial test coverage for all new modules
  • Converted parser tests to golden file approach (tests/fixtures/agent_streams/)

Test Infrastructure

  • Moved test-tasks/ → tests/test-tasks/ for better organization
  • Added unittest-based import compatibility tests
  • Fixed unittest discovery in CI
  • Added smoke task validation to CI

CI/CD

  • Added benchmark result publishing workflow
  • CI pulls ghcr.io/amanotk/simbench:<branch> for PRs when available
  • Copilot smoke runs on PRs with token auth

Documentation

  • Added docs/run-flow.md: runtime and artifact flow
  • Updated docs/development.md: developer workflow
  • Updated docs/toolchain.md: C++ headers and libraries

Bug Fixes

  • Fixed HLLD flux loop buffer overflow (bound ubx + 1 → ubx)
  • Fixed smoke config path resolution
  • Fixed test discovery patterns in CI

Testing

  • All runner unit tests pass (adversarial + seam coverage)
  • Golden parser tests for OpenCode, Copilot, Codex, Claude
  • E2E smoke tests for OpenCode and Copilot

Stats

  • 134 files changed
  • +19,061 insertions, -4,128 deletions

amanotk and others added 26 commits March 26, 2026 15:00
Split runner/bench.py responsibilities into dedicated helper modules while preserving the stable CLI entrypoint and bench-level execution seams. Reorganize runner tests around the extracted surfaces and restore direct script execution compatibility.
Keep smoke-only agent configs with other test assets so sample/ stays focused on user-facing examples. Update smoke tests and contributor docs to use the new fixture paths.
Update the workflow to run the reorganized runner test suite and clean up lint issues in the extracted adversarial tests. This restores the failing GitHub Actions jobs without changing runner behavior.
Use the correct unittest discover syntax in GitHub Actions and include the full test module pattern so the reorganized suite runs in CI as it does locally.
Remove the implicit pytest dependency from the metrics import compatibility tests so GitHub Actions can run the full discovered suite with stdlib unittest only.
Record canonical run metadata and render stable publish payloads so completed benchmark runs can be reviewed and documented without breaking existing result.json consumers.
Reference the relocated smoke-task fixture configs in the development guide so the documentation matches the path relocation regression tests and current repository layout.
Keep local swarm planning and evidence artifacts out of version control alongside other generated run data.
* Add magnetohydrodynamics cpp-hlld benchmark scaffold

* Fix GitHub math rendering in HLLD doc

* Strengthen cpp-hlld benchmark guidance and tests

* Update formatting of starred-state update equations

Reformatted math equations for better readability.

* Fix hidden test merge artifact

* Strengthen HLLD coverage and fix Docker OpenCode setup

* Add full 1D MHD benchmark with Brio-Wu reference

* Add golden Brio-Wu CLI regression test

* Refactor MHD state storage to mdspan views

* checkpoint: spiral-unknown-1774769904852

* checkpoint: spiral-unknown-1774773137200

* Migrate cpp-full-solver1d to mdspan views

* Refactor Brio-Wu CLI around SolverWorkspace

* WIP: normalize solver workspace shapes

* WIP: add primitive reconstruction helper

* WIP: add workspace-patterned RK3 path

* WIP: run full solver via workspace-patterned RK3

* Migrate fully to workspace-patterned RK3 solver

* checkpoint: spiral-unknown-1775016836330

* checkpoint: spiral-unknown-1775016843861

* checkpoint: spiral-unknown-1775016849781

* refactor(cpp-full-solver1d): simplify API and remove dead code

- Rename ArrayView/XView to ArrayView2D/ArrayView1D, drop ConstArrayView
- Remove all _inplace wrappers and ConstArrayView overloads
- Drop default constructor parameters (domain fixed to [0,1])
- Use constexpr coefficient table for SSP-RK3 substeps
- Extract init_brio_wu_primitive() in main.cpp
- Add cell center view (x) to SolverWorkspace, remove cell_centers()
- Update tests to match new type names and required constructor args

* refactor(cpp-full-solver1d): drop dt/t_final from SolverWorkspace

Remove dt and t_final from SolverWorkspace constructor and members.
Pass them directly to evolve_ssp_rk3(workspace, dt, t_final) instead.
Constructor now takes (nx, gamma, bx) only.

* refactor(cpp-full-solver1d): clean up initialize and cell center indexing

- Simplify initialize() signature: remove discontinuity_x parameter,
  hardcode 0.5 inside; remove unused constants from main.cpp
- Fix cell center initialization to only fill interior cells (Lbx..Ubx)
  instead of all padded cells, matching the physical domain convention
- Rename local dt/t_final to delt/tmax in main()

* refactor(cpp-full-solver1d): simplify storage and boundary handling

* refactor(cpp-full-solver1d): tighten workspace layout and boundaries

* refactor(magnetohydrodynamics): reduce grid resolution from 400 to 100 cells

Update default Nx constant and regenerate golden/reference CSV files.
Improve plot styling with math mode labels and remove subplot titles.

* refactor(magnetohydrodynamics): move reference code into shared assets

Split the Brio-Wu reference solver and its test fixtures into shared assets,
remove the JSON-based helper path, and make the solver grid size configurable.

* chore(magnetohydrodynamics): remove obsolete fixture json

* refactor(magnetohydrodynamics): streamline shared solver interfaces

* refactor(magnetohydrodynamics): split cpp-hlld into 00/01 variants

* chore: add vscode workspace excludes

* chore: add uv lockfile

* refactor(magnetohydrodynamics): replace full-solver task with cpp-full1d-00

* feat(magnetohydrodynamics): add cpp-full1d-01 variant with minimal API

- cpp-full1d-01 exposes only evolve_ssp_rk3(...) as public solver entrypoint
- HLLD and CLI remain provided, same as cpp-full1d-00
- Public C++ tests adapted to reduced scaffolding
- Fixed hidden eval path in both 00 and 01 to use /eval_shared mount

Co-authored-by: OpenCode Assistant <assistant@opencode.ai>

* fix(magnetohydrodynamics): use absolute paths in hidden eval tests

The eval test was using relative paths based on __file__ position which
breaks when tests run from /eval/tests/. Fixed to use absolute paths:
- WORKSPACE_ROOT = Path("/work/workspace")
- REFERENCE_CSV_PATH uses /eval_shared mount

This affects both cpp-full1d-00 and cpp-full1d-01 variants.

* fix(magnetohydrodynamics): fix WORKSPACE_ROOT path to /work

The workdir is mounted at /work, not /work/workspace. The original
workspace subdirectory doesn't exist in the run workdir.

* Simplify MHD test build dir and code style

* fix(mhd): correct HLLD flux loop bound to prevent buffer overflow

The flux loop was iterating to ubx + 1 and accessing up_r(ix + 1),
which read past the allocated Nx + 2*N_margin buffer on the last
iteration. Changed loop bound from ubx + 1 to ubx to stay within
allocated memory.
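
The index arithmetic behind this off-by-one can be sketched as follows. The sketch is in Python rather than the task's C++, and it rests on assumptions that this PR does not confirm: a margin of one ghost cell per side (`n_margin = 1`), interior cells spanning `lbx..ubx` inclusive, and an inclusive loop upper bound. Only the `up_r(ix + 1)` access and the `Nx + 2*N_margin` buffer size come from the commit message.

```python
def max_right_state_index(loop_upper):
    """Largest index touched by up_r(ix + 1) when ix runs up to loop_upper (inclusive)."""
    return loop_upper + 1


def loop_stays_in_bounds(nx, n_margin, loop_upper):
    """Check the largest access against a buffer of nx + 2 * n_margin elements."""
    buffer_size = nx + 2 * n_margin
    return max_right_state_index(loop_upper) <= buffer_size - 1


nx, n_margin = 100, 1       # n_margin = 1 is an assumption
ubx = n_margin + nx - 1     # last interior cell (assumed inclusive convention)

old_ok = loop_stays_in_bounds(nx, n_margin, ubx + 1)  # False: reads index 102 of 102 elements
new_ok = loop_stays_in_bounds(nx, n_margin, ubx)      # True: last read is index 101
```

Under these assumptions the old bound reads exactly one element past the end of the buffer, which matches the behavior described in the commit message.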

---------

Co-authored-by: OpenCode Assistant <assistant@opencode.ai>
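
The refactoring commits above pass `dt` and `t_final` directly to `evolve_ssp_rk3(workspace, dt, t_final)` and mention a constexpr coefficient table for the SSP-RK3 substeps. For readers unfamiliar with the scheme, the following is a minimal Python sketch of the standard three-stage Shu-Osher SSP-RK3 update driven by such a table. It is not the C++ code from this PR: the signature taking a state `u` and an `rhs` callback, and the shortened final step, are assumptions made for illustration.

```python
# Each row (a, b) applies: stage = a * u_n + b * (stage + dt * rhs(stage))
SSP_RK3_COEFFS = (
    (0.0, 1.0),
    (3.0 / 4.0, 1.0 / 4.0),
    (1.0 / 3.0, 2.0 / 3.0),
)


def ssp_rk3_step(u, dt, rhs):
    """Advance u by one SSP-RK3 step of size dt."""
    stage = u
    for a, b in SSP_RK3_COEFFS:
        stage = a * u + b * (stage + dt * rhs(stage))
    return stage


def evolve_ssp_rk3(u, dt, t_final, rhs):
    """Integrate from t = 0 to t_final, shortening the last step if needed."""
    t = 0.0
    while t < t_final - 1e-15:
        step = min(dt, t_final - t)
        u = ssp_rk3_step(u, step, rhs)
        t += step
    return u
```

Each stage is a convex combination of the step's starting state and a forward-Euler update of the previous stage, which is what gives the scheme its strong-stability-preserving property.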
Consolidates test-related directories by moving test-tasks/ into
tests/test-tasks/ for better organization.

Updates all references in:
- runner/bench.py: TEST_TASK_ROOT path
- runner/task_loading_helpers.py: task root map key
- Test files: path references and expected keys
- Documentation: README.md, AGENTS.md, docs/task-development.md
amanotk commented Apr 2, 2026

Review @codex

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfd99c37e3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread runner/metrics_helpers.py Outdated

def _collect_opencode_usage_metrics(*, state_dir: Path) -> dict[str, Any]:
    env = dict(os.environ)
    env["HOME"] = str(state_dir)

[P2] Use XDG_DATA_HOME for OpenCode stats lookup

Host-mode OpenCode runs are configured with XDG_DATA_HOME (runner/execution_agent.py), but post-run stats collection now rewrites HOME instead. That points opencode stats at a different state location, so host-mode runs can silently lose fallback token/cost metrics (especially when streamed usage is missing, e.g., failures/timeouts). This is a regression from the previous behavior where stats used the same env key as execution.
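
One way to act on this suggestion is to build the stats environment with the same key that execution uses. The sketch below is an illustration under stated assumptions, not the fix that was merged: `build_opencode_stats_env` is a hypothetical helper, and it assumes the OpenCode state directory is conveyed to the process only through `XDG_DATA_HOME`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def build_opencode_stats_env(
    state_dir: Path, base_env: Optional[Mapping[str, str]] = None
) -> dict:
    """Copy the environment and point XDG_DATA_HOME at the run's state dir.

    Hypothetical helper: HOME is deliberately left untouched so the stats
    lookup resolves the same location as the host-mode run.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XDG_DATA_HOME"] = str(state_dir)
    return env
```

Sharing one such helper between execution and stats collection would keep the two code paths from drifting apart again.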


Comment thread runner/publish_helpers.py Outdated
completed_at = _require_str_field(record, "completed_at")
_validate_completed_at(completed_at)

repo_commit_sha = _require_str_field(record, "repo_commit_sha").lower()

[P2] Handle nullable repo provenance in publish validation

run_record_helpers explicitly allows unresolved git metadata to degrade to None, but publish normalization hard-requires string/bool types for repo_commit_sha, repo_branch, and repo_dirty. As a result, bench publish rejects runner-produced run.json files from environments where git metadata is unavailable, even though those runs are otherwise valid; this blocks publication instead of surfacing a warning/signal.
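
A minimal sketch of the behavior this comment asks for: accept `None` for the three provenance fields and surface a warning instead of rejecting the record. The function `normalize_repo_provenance` and its return shape are hypothetical and not part of `runner/publish_helpers.py`; treating a missing key the same as `None` is also an assumption.

```python
def normalize_repo_provenance(record):
    """Return (normalized_fields, warnings) for the repo provenance keys.

    None is allowed for every field and recorded as a warning; any other
    value of the wrong type still raises ValueError.
    """
    expected_types = (
        ("repo_commit_sha", str),
        ("repo_branch", str),
        ("repo_dirty", bool),
    )
    normalized = {}
    warnings = []
    for key, expected_type in expected_types:
        value = record.get(key)  # assumption: a missing key counts as None
        if value is None:
            warnings.append(f"{key} is unavailable")
        elif not isinstance(value, expected_type):
            raise ValueError(f"{key} must be {expected_type.__name__} or None")
        normalized[key] = value
    if normalized["repo_commit_sha"] is not None:
        normalized["repo_commit_sha"] = normalized["repo_commit_sha"].lower()
    return normalized, warnings
```

Returning the warnings alongside the normalized fields lets the caller decide whether to log them or embed them in the published payload.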


@amanotk amanotk merged commit f7eae30 into main Apr 2, 2026
6 checks passed