# Add benchmark-pr skill for PR performance comparison #328
Merged: ChaoWao merged 1 commit into hw-native-sys:main from chenshengxin2026:skills/benchmark-pr on Mar 19, 2026 (+243 −0).
Benchmark the hardware performance of a single example at $ARGUMENTS under `tests/device_tests/a2a3/tensormap_and_ringbuffer/`.

Reference `tools/benchmark_rounds.sh` for the full implementation pattern (device log resolution, timing parsing, reporting format). This skill runs the same logic, but for a single example only.

1. Verify that `$ARGUMENTS` exists and contains `kernels/kernel_config.py` and `golden.py`
2. Check `command -v npu-smi`; if it is not found, tell the user this requires hardware and stop
3. Run `npu-smi info` and find the lowest-ID idle device (HBM-Usage = 0); if none is idle, stop
4. Run the example following the same pattern as `run_bench()` in `tools/benchmark_rounds.sh`:
   - Snapshot logs, run `run_example.py` with `-n 10`, find the new log, parse the timing, report the results
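The snapshot-run-diff pattern in step 4 can be sketched generically. This is only an illustration of the pattern; the helper name is made up, and `tools/benchmark_rounds.sh` remains the authoritative implementation:

```shell
# Snapshot a directory's entries, run a command, then print only the entries
# the command created -- the "snapshot logs, run, find new log" step.
run_and_find_new() {
  dir="$1"; shift
  before=$(mktemp); after=$(mktemp)
  ls -1 "$dir" 2>/dev/null | sort > "$before"
  "$@"                          # the benchmark run itself
  ls -1 "$dir" 2>/dev/null | sort > "$after"
  comm -13 "$before" "$after"   # lines only in the post-run snapshot
  rm -f "$before" "$after"
}
```

Called with the device log directory and the `run_example.py` invocation, this prints only the log files produced by that run, ready for timing parsing.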
Benchmark the hardware performance of all examples under `tests/device_tests/a2a3/$ARGUMENTS/`.

Reference `tools/benchmark_rounds.sh` for the full implementation pattern (device log resolution, timing parsing, reporting format).

1. Validate that `$ARGUMENTS` is one of `host_build_graph`, `aicpu_build_graph`, or `tensormap_and_ringbuffer`; if not, list the valid runtimes and stop
2. Check `command -v npu-smi`; if it is not found, tell the user this requires hardware and stop
3. Run `npu-smi info` and find the lowest-ID idle device (HBM-Usage = 0); if none is idle, stop
4. Enumerate all subdirectories under `tests/device_tests/a2a3/$ARGUMENTS/` that contain both `kernels/kernel_config.py` and `golden.py`
5. For each example, run the same `run_bench()` pattern from `tools/benchmark_rounds.sh`:
   - Snapshot logs, run `run_example.py` with `-n 10`, find the new log, parse the timing, report the results
6. Print a final summary table with the example name, average latency, trimmed average, and pass/fail
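The enumeration in step 4 can be sketched as a small shell helper (a sketch with a hypothetical function name, not the skill's actual implementation):

```shell
# Print each subdirectory of $1 that contains both required files.
find_examples() {
  root="$1"
  for d in "$root"/*/; do
    [ -d "$d" ] || continue
    if [ -f "${d}kernels/kernel_config.py" ] && [ -f "${d}golden.py" ]; then
      printf '%s\n' "${d%/}"
    fi
  done
}
```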
---
name: benchmark-pr
description: Benchmark a GitHub PR's performance impact by comparing benchmark results between the PR head and its fork point (merge-base). Use when the user asks to benchmark a PR, measure performance changes, or compare latency before/after a PR.
---

# Benchmark PR Workflow

Compare benchmark performance between a PR's changes and its merge-base to quantify the performance impact.

## Task Tracking

Create tasks to track progress through this workflow:

1. Match input to PR and fetch metadata
2. Determine merge-base (fork point)
3. Pin PTO-ISA environment
4. Run baseline benchmark (merge-base)
5. Run PR benchmark (HEAD)
6. Compare and report results

## Input

Accept a PR number (`123`, `#123`) with optional benchmark arguments:

```
/benchmark-pr #123
/benchmark-pr 123 -d 4 -n 20
```

Extra arguments (`-d`, `-n`, etc.) are forwarded to `tools/benchmark_rounds.sh`.

**Defaults** (when not specified): `-d 0 -n 10 -p a2a3`
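A minimal sketch of the input handling, assuming the first token is the PR number and everything after it is forwarded (`parse_benchmark_args` is a hypothetical helper, not part of the repo):

```shell
# Split the skill input into the PR number and pass-through benchmark args.
parse_benchmark_args() {
  local raw="$1"
  PR_NUMBER="${raw#\#}"        # strip an optional leading '#'
  shift
  BENCH_ARGS="$*"              # remaining tokens are forwarded verbatim
  if [ -z "$BENCH_ARGS" ]; then
    BENCH_ARGS="-d 0 -n 10 -p a2a3"   # documented defaults
  fi
  case "$PR_NUMBER" in
    ''|*[!0-9]*) echo "Not a PR number: $raw" >&2; return 1 ;;
  esac
}
```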
## Step 1: Setup and PR Resolution

1. [Setup](../../lib/github/setup.md) — authenticate and detect context (role, remotes, state)

2. Use [lookup-pr](../../lib/github/lookup-pr.md) to find the PR:

```bash
PR_DATA=$(gh pr view $PR_NUMBER --repo "$PR_REPO_OWNER/$PR_REPO_NAME" \
  --json number,title,headRefName,baseRefName,state,headRefOid)
```

Validate the PR state: OPEN or MERGED (continue), CLOSED (warn and confirm).
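The state gate might look like this (the return-code convention is illustrative, not prescribed by the skill):

```shell
# Gate on PR state: proceed for OPEN/MERGED, flag CLOSED for confirmation.
check_pr_state() {
  case "$1" in
    OPEN|MERGED) return 0 ;;
    CLOSED) echo "WARN: PR is closed; confirm before benchmarking" >&2; return 2 ;;
    *) echo "ERROR: unexpected PR state: $1" >&2; return 1 ;;
  esac
}
```

Invoke it with the state pulled from the metadata, e.g. `check_pr_state "$(echo "$PR_DATA" | jq -r '.state')"`.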
Extract fields:

```bash
HEAD_BRANCH=$(echo "$PR_DATA" | jq -r '.headRefName')
BASE_BRANCH=$(echo "$PR_DATA" | jq -r '.baseRefName')
HEAD_SHA=$(echo "$PR_DATA" | jq -r '.headRefOid')
PR_TITLE=$(echo "$PR_DATA" | jq -r '.title')
```

## Step 2: Determine Merge-Base (Fork Point)

The baseline is the merge-base between the PR and its target branch — the point where the PR diverged.

```bash
# Ensure the upstream base branch is fresh
git fetch upstream "$BASE_BRANCH" --quiet

# Fetch the PR head
git fetch upstream "pull/$PR_NUMBER/head:pr-$PR_NUMBER" --quiet

# Compute the fork point
MERGE_BASE=$(git merge-base "upstream/$BASE_BRANCH" "pr-$PR_NUMBER")
echo "Merge base: $MERGE_BASE"
echo "PR head: $(git rev-parse pr-$PR_NUMBER)"
```

**Validation:** Confirm the merge-base is not the PR head itself:

```bash
if [ "$MERGE_BASE" = "$(git rev-parse pr-$PR_NUMBER)" ]; then
  echo "ERROR: merge-base equals PR head — PR has no commits"
  exit 1
fi
```
## Step 3: Stash and Prepare

Save current work before switching branches:

```bash
ORIGINAL_BRANCH=$(git branch --show-current)
STASHED=false
if [ -n "$(git status --porcelain)" ]; then
  git stash push -m "benchmark-pr: auto-stash before benchmarking"
  STASHED=true
fi
```

## Step 4: Pin PTO-ISA Environment

`tools/benchmark_rounds.sh` calls `run_example.py`, which auto-clones PTO-ISA to `examples/scripts/_deps/pto-isa/`. To ensure reproducible builds, pin PTO-ISA to the same commit used by CI.

Extract the pinned commit from `.github/workflows/ci.yml`:

```bash
PTO_ISA_COMMIT=$(grep -oP '(?<=-c )\w+' .github/workflows/ci.yml | head -1)
if [ -z "$PTO_ISA_COMMIT" ]; then
  echo "Could not parse PTO-ISA commit from ci.yml, using latest"
  PTO_ISA_COMMIT=""  # empty means run_example.py will use the latest PTO-ISA commit
fi
echo "PTO-ISA commit: ${PTO_ISA_COMMIT:-latest}"
```
If PTO-ISA is already cloned, pre-checkout the pinned commit so that both the baseline and the PR use the same version. Otherwise, `run_example.py` will clone it automatically when invoked with the `-c` flag.

```bash
PTO_ISA_DIR="examples/scripts/_deps/pto-isa"
if [ -d "$PTO_ISA_DIR/.git" ]; then
  git -C "$PTO_ISA_DIR" fetch origin --quiet
  git -C "$PTO_ISA_DIR" checkout "$PTO_ISA_COMMIT" --quiet
fi
```

Append the commit flag to the benchmark args so `run_example.py` picks it up:

```bash
if [ -n "$PTO_ISA_COMMIT" ]; then
  BENCH_ARGS="$BENCH_ARGS -c $PTO_ISA_COMMIT"
fi
```

Create timestamped temp files under the project directory to avoid conflicts with concurrent runs:

```bash
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
mkdir -p tmp
BENCH_BASELINE="tmp/benchmark_baseline_${TIMESTAMP}.txt"
BENCH_PR="tmp/benchmark_pr_${TIMESTAMP}.txt"
```

**Why this matters:** Without pinning, each checkout (merge-base vs. PR HEAD) might resolve a different PTO-ISA version, introducing noise unrelated to the PR's actual changes.

## Step 5: Run Baseline Benchmark (Merge-Base)

```bash
git checkout "$MERGE_BASE" --quiet

echo "Running baseline benchmark at merge-base ($MERGE_BASE)..."
./tools/benchmark_rounds.sh $BENCH_ARGS 2>&1 | tee "$BENCH_BASELINE"
```

Capture the output and parse the per-example trimmed averages for comparison.
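The parsing could look like this, assuming report lines of the form `<example> trimmed_avg: <value> us`; this line format is an assumption for illustration, so check the real reporting format of `tools/benchmark_rounds.sh` first:

```shell
# Extract "<example> <trimmed_avg>" pairs from a benchmark log.
# Assumed report line: "example_name trimmed_avg: 1234.5 us" (hypothetical).
extract_trimmed_averages() {
  awk '/trimmed_avg:/ { printf "%s %s\n", $1, $3 }' "$1"
}
```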
## Step 6: Run PR Benchmark (HEAD)

```bash
git checkout "pr-$PR_NUMBER" --quiet

echo "Running PR benchmark at HEAD ($HEAD_SHA)..."
./tools/benchmark_rounds.sh $BENCH_ARGS 2>&1 | tee "$BENCH_PR"
```

## Step 7: Restore Working State

```bash
git checkout "$ORIGINAL_BRANCH" --quiet
if [ "$STASHED" = true ]; then
  git stash pop
fi

# Clean up the temporary PR branch
git branch -D "pr-$PR_NUMBER" 2>/dev/null || true
```

## Step 8: Compare and Report

Parse both output files to extract per-example trimmed averages, then compute the delta.
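A sketch of the delta computation over two parsed files, each assumed to hold `<example> <trimmed_avg_us>` lines (this format is an assumption, not the documented output of `benchmark_rounds.sh`):

```shell
# Join two result files on example name; print baseline, PR, delta, and %.
compare_results() {
  awk 'NR==FNR { base[$1] = $2; next }
       ($1 in base) {
         delta = $2 - base[$1]
         pct = (base[$1] != 0) ? 100 * delta / base[$1] : 0
         printf "%s %.1f %.1f %+.1f %+.2f%%\n", $1, base[$1], $2, delta, pct
       }' "$1" "$2"
}
```

Examples missing from either file are silently skipped, which matches the error-handling rule of comparing only examples that succeeded on both sides.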
Present the results as a table:

```
PR #123: <PR title>
Merge-base: <short SHA> → PR HEAD: <short SHA>
Benchmark args: -d 4 -n 20

Example                       Baseline (us)  PR (us)  Delta (us)  Change (%)
--------------------------    -------------  -------  ----------  ----------
alternating_matmul_add               1240.1   1235.5        -4.6      -0.37%
benchmark_bgemm                       890.3    892.1        +1.8      +0.20%
paged_attention_unroll/Case1         2100.0   2050.3       -49.7      -2.37%
...

Overall: X of Y examples improved, Z regressed
```

**Interpretation guidelines:**

| Change (%) | Assessment |
| ---------- | ---------- |
| < -2% | Notable improvement |
| -2% to +2% | Within noise margin |
| > +2% | Potential regression — flag for review |

If any example shows > 5% regression, highlight it explicitly and recommend investigation.
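The thresholds above can be encoded directly; the category labels below are made up for illustration:

```shell
# Classify a percentage change against the interpretation thresholds.
classify_change() {
  awk -v pct="$1" 'BEGIN {
    if (pct < -2)      print "improvement"
    else if (pct <= 2) print "noise"
    else if (pct <= 5) print "regression"
    else               print "severe-regression"
  }'
}
```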
## Error Handling

| Error | Action |
| ----- | ------ |
| PR not found | Run `gh pr list`; ask the user |
| Benchmark script fails on baseline | Report which examples failed; continue with the remaining ones |
| Benchmark script fails on PR | Same as above; compare only examples that succeeded on both |
| No timing data (profiling not enabled) | Warn the user: "No timing markers found — ensure `PTO2_PROFILING` is enabled" |
| Device not available | Suggest the simulation platform or a different device ID |
| All examples fail and no `-d` was specified | Default device 0 is often occupied or unstable. Prompt the user to specify a device ID explicitly (e.g. `/benchmark-pr #123 -d 4`). Suggest running `npu-smi info` to find idle devices (HBM-Usage = 0) |

## Checklist

- [ ] PR resolved and metadata fetched
- [ ] Merge-base computed correctly (fork point, not a stale local branch)
- [ ] PTO-ISA pinned to the CI commit
- [ ] Working state saved (stash if needed)
- [ ] Baseline benchmark completed at merge-base
- [ ] PR benchmark completed at HEAD
- [ ] Working state restored (branch + stash pop)
- [ ] Comparison table presented with per-example deltas
- [ ] Regressions > 2% flagged for attention