Add objdump parser benchmark with aggregated JSON output #51
Merged
Add a Catch2 micro-benchmark for the ObjDumpParser pipeline (fromStream + outputJson) over the pre-captured objdump resource files, plus tooling to run it repeatedly and aggregate run-to-run statistics.

- src/test/objdump_benchmark.cpp: BENCHMARK over four inputs of increasing size, tagged [!benchmark] so it stays out of the default test run.
- src/test/CMakeLists.txt: compile the benchmark and enable Catch2's CATCH_CONFIG_ENABLE_BENCHMARKING.
- scripts/run-benchmarks.sh: run N times, emit one Catch2 XML per run.
- scripts/aggregate_bench.py: combine the XML runs into a summary JSON (mean of per-run means, run-to-run stddev, CV, min/max, per-run means).
- .github/workflows/benchmark.yml: manual workflow that builds Release, runs the benchmarks, and uploads the summary + raw XML as an artifact.
- .gitignore: ignore build-release/ and bench-results/.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
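The aggregation step described in this commit could look roughly like the sketch below. It is not the actual scripts/aggregate_bench.py. It assumes the Catch2 XML reporter emits one `BenchmarkResults` element per benchmark with a `mean` child whose `value` attribute is in nanoseconds. The sample XML, the timing values, and every output field name except `cv_percent` are hypothetical.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed per-run reports; real Catch2 XML carries more
# elements and attributes. The timing values are made up for illustration.
RUN_XML = [
    """<Catch><TestCase name="objdump">
         <BenchmarkResults name="example.asm"><mean value="100.0"/></BenchmarkResults>
         <BenchmarkResults name="gcc12_bin_fmt_O2_flto.asm"><mean value="2000.0"/></BenchmarkResults>
       </TestCase></Catch>""",
    """<Catch><TestCase name="objdump">
         <BenchmarkResults name="example.asm"><mean value="110.0"/></BenchmarkResults>
         <BenchmarkResults name="gcc12_bin_fmt_O2_flto.asm"><mean value="2100.0"/></BenchmarkResults>
       </TestCase></Catch>""",
]


def per_run_means(xml_text):
    """Map benchmark name -> mean time for a single run's XML report."""
    root = ET.fromstring(xml_text)
    means = {}
    for result in root.iter("BenchmarkResults"):
        mean = result.find("mean")
        if mean is not None:
            means[result.get("name")] = float(mean.get("value"))
    return means


def aggregate(runs):
    """Combine per-run means into run-to-run statistics per benchmark."""
    by_name = {}
    for run in runs:
        for name, mean in run.items():
            by_name.setdefault(name, []).append(mean)

    summary = {}
    for name, means in by_name.items():
        avg = statistics.mean(means)
        # Sample standard deviation needs at least two runs.
        sd = statistics.stdev(means) if len(means) > 1 else 0.0
        summary[name] = {
            "mean_ns": avg,          # mean of per-run means
            "stddev_ns": sd,         # run-to-run standard deviation
            "cv_percent": 100.0 * sd / avg if avg else 0.0,
            "min_ns": min(means),
            "max_ns": max(means),
            "runs": means,           # full per-run list
        }
    return summary


summary = aggregate([per_run_means(text) for text in RUN_XML])
```

Passing `summary` to `json.dumps(summary, indent=2)` would give a JSON document of the general shape the commit describes, minus the commit/ref metadata.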
Trigger the benchmark workflow on push to main and on pull requests (matching build.yml), instead of only on manual dispatch. Keep workflow_dispatch for ad-hoc reruns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the branch filters so the benchmark runs on push to any branch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bump all actions to their latest releases to clear the Node.js 20 deprecation warnings (runners now default to Node 24):

- actions/checkout@v4 -> @v7 (Node 24)
- actions/upload-artifact@v4 -> @v7 (Node 24)
- turtlebrowser/get-conan@main -> @v1.2 (pin to release)
- fnkr/github-action-ghr@v1 -> @v1.3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Capture a baseline benchmark summary (5 runs, GitHub runner) and compare each new run against it in the workflow.

- bench-results/baseline.json: committed baseline; .gitignore keeps it tracked while ignoring other generated benchmark output.
- scripts/compare_bench.py: compares a summary against the baseline, prints a per-benchmark delta table, writes a Markdown table to the GitHub step summary, and exits non-zero on a regression beyond the threshold (default 10%).
- benchmark.yml: run the comparison after generating the summary; upload the artifact even on failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
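A baseline comparison along these lines might be sketched as follows. This is an illustration under assumptions, not the contents of scripts/compare_bench.py: the function names, the table columns, and the sample timings are hypothetical. The 10% default threshold comes from the commit message, and `GITHUB_STEP_SUMMARY` is the environment variable GitHub Actions provides for job summaries.

```python
import os

# Hypothetical mean timings per benchmark; values are made up.
BASELINE = {"example.asm": 150.0, "gcc12_bin_fmt_O2_flto.asm": 1000.0}
CURRENT = {"example.asm": 153.0, "gcc12_bin_fmt_O2_flto.asm": 1150.0}


def compare(baseline, current, threshold_percent=10.0):
    """Return (rows, regressed) for benchmarks present in both summaries."""
    rows = []
    regressed = False
    for name in sorted(baseline):
        base = baseline[name]
        if name not in current or base <= 0:
            continue  # nothing meaningful to compare against
        new = current[name]
        delta = 100.0 * (new - base) / base
        is_regression = delta > threshold_percent  # slower than allowed
        regressed = regressed or is_regression
        rows.append((name, base, new, delta, is_regression))
    return rows, regressed


def markdown_table(rows):
    """Render the per-benchmark deltas as a Markdown table."""
    lines = [
        "| benchmark | baseline | current | delta % | status |",
        "|---|---|---|---|---|",
    ]
    for name, base, new, delta, is_regression in rows:
        status = "REGRESSION" if is_regression else "ok"
        lines.append(f"| {name} | {base:.1f} | {new:.1f} | {delta:+.1f} | {status} |")
    return "\n".join(lines)


def main(baseline, current, threshold_percent=10.0):
    """Print the table, append it to the step summary, return an exit code."""
    rows, regressed = compare(baseline, current, threshold_percent)
    table = markdown_table(rows)
    print(table)
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if summary_path:  # only set when running inside GitHub Actions
        with open(summary_path, "a", encoding="utf-8") as fh:
            fh.write(table + "\n")
    return 1 if regressed else 0


# A real script would pass this value to sys.exit() to fail the workflow step.
exit_code = main(BASELINE, CURRENT)
```

Only slowdowns count as regressions here; a run that is faster than the baseline by more than the threshold passes.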
What
Adds a benchmark for the ObjDumpParser pipeline (fromStream + outputJson) plus tooling to run it repeatedly and aggregate run-to-run statistics into a JSON artifact, intended as the basis for a performance baseline.

How it works
- src/test/objdump_benchmark.cpp: a Catch2 BENCHMARK over four pre-captured objdump resource files of increasing size (example.asm, example_intel.asm, gcc12_sort_object_reloc.asm, gcc12_bin_fmt_O2_flto.asm). Each input is pre-loaded into memory so file I/O isn't measured, and is re-parsed with a fresh parser per iteration. Tagged [!benchmark] so it stays out of the default ./asm-parser-test run, leaving existing CI unaffected.
- src/test/CMakeLists.txt: compiles the benchmark and enables CATCH_CONFIG_ENABLE_BENCHMARKING (Catch2 v2.13's built-in micro-benchmarking; no new dependencies).
- scripts/run-benchmarks.sh: runs the benchmark N times (default 5), writing one Catch2 XML per run.
- scripts/aggregate_bench.py: combines the per-run XMLs into a summary JSON: mean of per-run means, run-to-run stddev, CV%, min/max, and the full per-run list, with commit/ref/label/generated_utc metadata.
- .github/workflows/benchmark.yml: manual (workflow_dispatch) workflow mirroring build.yml's Conan/gcc-10/Release setup. Builds, runs the benchmarks, prints the summary, and uploads bench-summary.json + raw run XMLs as an artifact.
- .gitignore: ignores build-release/ and bench-results/.

Usage
Raw per-run data is also available directly as Catch2 XML:

    ./asm-parser-test "[objdump]" -r xml -o run.xml

Stability notes
Run-to-run variance on the large inputs is small (~2-3%), while example.asm (~150 µs) is noise-dominated (CV ~15%). The summary's cv_percent field surfaces this per benchmark; the large files are the trustworthy signal.

Next steps (not in this PR)
Once this runs on GitHub, the bench-summary artifact can be promoted to a committed baseline (e.g. bench-results/baseline.json). A follow-up can add a compare_bench.py + PR gate that fails on regressions beyond a threshold. Note that GitHub-hosted runners are noisier than a dedicated box, so any gate should use a generous threshold and likely only the large inputs.

🤖 Generated with Claude Code