diff --git a/.claude/commands/perf-example-device.md b/.claude/commands/perf-example-device.md new file mode 100644 index 00000000..a359da5f --- /dev/null +++ b/.claude/commands/perf-example-device.md @@ -0,0 +1,9 @@ +Benchmark the hardware performance of a single example at $ARGUMENTS under `tests/device_tests/a2a3/tensormap_and_ringbuffer/`. + +Reference `tools/benchmark_rounds.sh` for the full implementation pattern (device log resolution, timing parsing, reporting format). This skill runs the same logic but for a single example only. + +1. Verify `$ARGUMENTS` exists and contains `kernels/kernel_config.py` and `golden.py` +2. Check `command -v npu-smi` — if not found, tell the user this requires hardware and stop +3. Run `npu-smi info`, find the lowest-ID idle device (HBM-Usage = 0). If none, stop +4. Run the example following the same pattern as `run_bench()` in `tools/benchmark_rounds.sh`: + - Snapshot logs, run `run_example.py` with `-n 10`, find new log, parse timing, report results diff --git a/.claude/commands/perf-runtime-device.md b/.claude/commands/perf-runtime-device.md new file mode 100644 index 00000000..70d849d8 --- /dev/null +++ b/.claude/commands/perf-runtime-device.md @@ -0,0 +1,11 @@ +Benchmark the hardware performance of all examples under `tests/device_tests/a2a3/$ARGUMENTS/`. + +Reference `tools/benchmark_rounds.sh` for the full implementation pattern (device log resolution, timing parsing, reporting format). + +1. Validate `$ARGUMENTS` is one of: `host_build_graph`, `aicpu_build_graph`, `tensormap_and_ringbuffer`. If not, list valid runtimes and stop +2. Check `command -v npu-smi` — if not found, tell the user this requires hardware and stop +3. Run `npu-smi info`, find the lowest-ID idle device (HBM-Usage = 0). If none, stop +4. Enumerate all subdirectories under `tests/device_tests/a2a3/$ARGUMENTS/` that contain both `kernels/kernel_config.py` and `golden.py` +5. For each example, run the same `run_bench()` pattern from `tools/benchmark_rounds.sh`: + - Snapshot logs, run `run_example.py` with `-n 10`, find new log, parse timing, report results +6. Print a final summary table with example name, average latency, trimmed average, and pass/fail diff --git a/.claude/skills/benchmark-pr/SKILL.md b/.claude/skills/benchmark-pr/SKILL.md new file mode 100644 index 00000000..9fd793fa --- /dev/null +++ b/.claude/skills/benchmark-pr/SKILL.md @@ -0,0 +1,223 @@ +--- +name: benchmark-pr +description: Benchmark a GitHub PR's performance impact by comparing benchmark results between the PR head and its fork point (merge-base). Use when the user asks to benchmark a PR, measure performance changes, or compare latency before/after a PR. +--- + +# Benchmark PR Workflow + +Compare benchmark performance between a PR's changes and its merge-base to quantify performance impact. + +## Task Tracking + +Create tasks to track progress through this workflow: + +1. Match input to PR and fetch metadata +2. Determine merge-base (fork point) +3. Pin PTO-ISA environment +4. Run baseline benchmark (merge-base) +5. Run PR benchmark (HEAD) +6. Compare and report results + +## Input + +Accept PR number (`123`, `#123`) with optional benchmark arguments: + +``` +/benchmark-pr #123 +/benchmark-pr 123 -d 4 -n 20 +``` + +Extra arguments (`-d`, `-n`, etc.) are forwarded to `tools/benchmark_rounds.sh`. + +**Defaults** (when not specified): `-d 0 -n 10 -p a2a3` + +## Step 1: Setup and PR Resolution + +1. [Setup](../../lib/github/setup.md) — authenticate and detect context (role, remotes, state) + +2. Use [lookup-pr](../../lib/github/lookup-pr.md) to find the PR: + +```bash +PR_DATA=$(gh pr view $PR_NUMBER --repo "$PR_REPO_OWNER/$PR_REPO_NAME" \ + --json number,title,headRefName,baseRefName,state,headRefOid) +``` + +Validate PR state: OPEN or MERGED (continue), CLOSED (warn and confirm). + +Extract fields: + +```bash +HEAD_BRANCH=$(echo "$PR_DATA" | jq -r '.headRefName') +BASE_BRANCH=$(echo "$PR_DATA" | jq -r '.baseRefName') +HEAD_SHA=$(echo "$PR_DATA" | jq -r '.headRefOid') +PR_TITLE=$(echo "$PR_DATA" | jq -r '.title') +``` + +## Step 2: Determine Merge-Base (Fork Point) + +The baseline is the merge-base between the PR and its target branch — the point where the PR diverged. + +```bash +# Ensure upstream base branch is fresh +git fetch upstream "$BASE_BRANCH" --quiet + +# Fetch the PR head +git fetch upstream "pull/$PR_NUMBER/head:pr-$PR_NUMBER" --quiet + +# Compute fork point +MERGE_BASE=$(git merge-base "upstream/$BASE_BRANCH" "pr-$PR_NUMBER") +echo "Merge base: $MERGE_BASE" +echo "PR head: $(git rev-parse pr-$PR_NUMBER)" +``` + +**Validation:** Confirm the merge-base is not the PR head itself: + +```bash +if [ "$MERGE_BASE" = "$(git rev-parse pr-$PR_NUMBER)" ]; then + echo "ERROR: merge-base equals PR head — PR has no commits" + exit 1 +fi +``` + +## Step 3: Stash and Prepare + +Save current work before switching branches: + +```bash +ORIGINAL_BRANCH=$(git branch --show-current) +STASHED=false +if [ -n "$(git status --porcelain)" ]; then + git stash push -m "benchmark-pr: auto-stash before benchmarking" + STASHED=true +fi +``` + +## Step 4: Pin PTO-ISA Environment + +`tools/benchmark_rounds.sh` calls `run_example.py`, which auto-clones PTO-ISA to `examples/scripts/_deps/pto-isa/`. To ensure reproducible builds, pin PTO-ISA to the same commit used by CI. + +Extract the pinned commit from `.github/workflows/ci.yml`: + +```bash +PTO_ISA_COMMIT=$(grep -oP '(?<=-c )\w+' .github/workflows/ci.yml | head -1) +if [ -z "$PTO_ISA_COMMIT" ]; then + echo "Could not parse PTO-ISA commit from ci.yml, using latest" + PTO_ISA_COMMIT="" # empty means run_example.py will use the latest PTO-ISA commit +fi +echo "PTO-ISA commit: ${PTO_ISA_COMMIT:-latest}" +``` + +If PTO-ISA is already cloned, pre-checkout the pinned commit so both baseline and PR use the same version. Otherwise, `run_example.py` will clone it automatically when invoked with the `-c` flag. + +```bash +PTO_ISA_DIR="examples/scripts/_deps/pto-isa" +if [ -d "$PTO_ISA_DIR/.git" ]; then + git -C "$PTO_ISA_DIR" fetch origin --quiet + git -C "$PTO_ISA_DIR" checkout "$PTO_ISA_COMMIT" --quiet +fi +``` + +Append the commit flag to benchmark args so `run_example.py` picks it up: + +```bash +if [ -n "$PTO_ISA_COMMIT" ]; then + BENCH_ARGS="$BENCH_ARGS -c $PTO_ISA_COMMIT" +fi +``` + +Create timestamped temp files under the project directory to avoid conflicts with concurrent runs: + +```bash +TIMESTAMP=$(date +%Y%m%d_%H%M%S) +mkdir -p tmp +BENCH_BASELINE="tmp/benchmark_baseline_${TIMESTAMP}.txt" +BENCH_PR="tmp/benchmark_pr_${TIMESTAMP}.txt" +``` + +**Why this matters:** Without pinning, each checkout (merge-base vs PR HEAD) might resolve a different PTO-ISA version, introducing noise unrelated to the PR's actual changes. + +## Step 5: Run Baseline Benchmark (Merge-Base) + +```bash +git checkout "$MERGE_BASE" --quiet + +echo "Running baseline benchmark at merge-base ($MERGE_BASE)..." +./tools/benchmark_rounds.sh $BENCH_ARGS 2>&1 | tee "$BENCH_BASELINE" +``` + +Capture the output. Parse trimmed averages per example from the output for comparison. + +## Step 6: Run PR Benchmark (HEAD) + +```bash +git checkout "pr-$PR_NUMBER" --quiet + +echo "Running PR benchmark at HEAD ($HEAD_SHA)..." +./tools/benchmark_rounds.sh $BENCH_ARGS 2>&1 | tee "$BENCH_PR" +``` + +## Step 7: Restore Working State + +```bash +git checkout "$ORIGINAL_BRANCH" --quiet +if [ "$STASHED" = true ]; then + git stash pop +fi + +# Clean up the temporary PR branch +git branch -D "pr-$PR_NUMBER" 2>/dev/null || true +``` + +## Step 8: Compare and Report + +Parse both output files to extract per-example trimmed averages, then compute the delta. + +Present results as a table: + +``` +PR #123: +Merge-base: → PR HEAD: +Benchmark args: -d 4 -n 20 + +Example Baseline (us) PR (us) Delta (us) Change (%) +-------------------------- ------------- ------- ---------- ---------- +alternating_matmul_add 1240.1 1235.5 -4.6 -0.37% +benchmark_bgemm 890.3 892.1 +1.8 +0.20% +paged_attention_unroll/Case1 2100.0 2050.3 -49.7 -2.37% +... + +Overall: X of Y examples improved, Z regressed +``` + +**Interpretation guidelines:** + +| Change (%) | Assessment | +| ---------- | ---------- | +| < -2% | Notable improvement | +| -2% to +2% | Within noise margin | +| > +2% | Potential regression — flag for review | + +If any example shows > 5% regression, highlight it explicitly and recommend investigation. + +## Error Handling + +| Error | Action | +| ----- | ------ | +| PR not found | `gh pr list`; ask user | +| Benchmark script fails on baseline | Report which examples failed; continue with remaining | +| Benchmark script fails on PR | Same as above; compare only examples that succeeded on both | +| No timing data (profiling not enabled) | Warn user: "No timing markers found — ensure `PTO2_PROFILING` is enabled" | +| Device not available | Suggest simulation platform or different device ID | +| All examples fail and no `-d` was specified | Default device 0 is often occupied or unstable. Prompt user to specify a device ID explicitly (e.g. `/benchmark-pr #123 -d 4`). Suggest running `npu-smi info` to find idle devices (HBM-Usage = 0) | + +## Checklist + +- [ ] PR resolved and metadata fetched +- [ ] Merge-base computed correctly (fork point, not stale local branch) +- [ ] PTO-ISA pinned to CI commit +- [ ] Working state saved (stash if needed) +- [ ] Baseline benchmark completed at merge-base +- [ ] PR benchmark completed at HEAD +- [ ] Working state restored (branch + stash pop) +- [ ] Comparison table presented with per-example deltas +- [ ] Regressions > 2% flagged for attention