diff --git a/tools/README.md b/tools/README.md index bdede4f5..9599aeab 100644 --- a/tools/README.md +++ b/tools/README.md @@ -1,223 +1,235 @@ -# Swimlane 性能分析工具 +# Swimlane Performance Analysis Tools -本目录包含 PTO Runtime 的性能分析工具。 +This directory contains performance analysis tools for PTO Runtime. -## 工具列表 +## Tool List -- **[swimlane_converter.py](#swimlane_converterpy)** - 转换为 Chrome Trace Event 可视化格式 -- **[sched_overhead_analysis.py](#sched_overhead_analysispy)** - Scheduler 开销分析(Tail OH 分解) -- **[perf_to_mermaid.py](#perf_to_mermaidpy)** - 转换为 Mermaid 依赖图 -- **[benchmark_rounds.sh](#benchmark_roundssh)** - 批量运行 examples 并报告每轮耗时 -- **[device_log_resolver.py](#device_log_resolverpy)** - Device log 路径解析库 +- **[swimlane_converter.py](#swimlane_converterpy)** - Convert to Chrome Trace Event visualization format +- **[sched_overhead_analysis.py](#sched_overhead_analysispy)** - Scheduler overhead analysis (Tail OH breakdown) +- **[perf_to_mermaid.py](#perf_to_mermaidpy)** - Convert to Mermaid dependency graph +- **[benchmark_rounds.sh](#benchmark_roundssh)** - Batch-run examples and report per-round timing +- **[benchmark_config.json](#benchmark_configjson)** - Configuration file for benchmark_rounds.sh +- **[device_log_resolver.py](#device_log_resolverpy)** - Device log path resolution library --- ## swimlane_converter.py -将性能分析数据 JSON 文件转换为 Chrome Trace Event 格式,以便在 Perfetto 中可视化。 +Converts performance data JSON files to Chrome Trace Event format for visualization in Perfetto. -### 功能概述 +### Overview -`swimlane_converter.py` 将 PTO Runtime 的性能分析数据(`perf_swimlane_*.json`)转换为可在 Perfetto 跟踪查看器(https://ui.perfetto.dev/)中可视化的格式。同时提供按函数分组的任务执行统计分析,并在解析到 device log 时输出 scheduler overhead deep-dive 报告。 +`swimlane_converter.py` converts PTO Runtime performance data (`perf_swimlane_*.json`, version 1 or 2) into a format viewable in the Perfetto trace viewer (https://ui.perfetto.dev/). 
It also provides per-function task execution statistics with Exec/Latency analysis, and automatically runs the scheduler overhead deep-dive report when a device log is resolved. -### 基本用法 +For version 2 data, additional tracks are generated: +- **AICPU Scheduler**: scheduler phase bars per thread +- **AICPU Orchestrator**: orchestrator phase bars or summary + +### Basic Usage ```bash -# 自动检测 outputs/ 目录中最新的性能分析文件 +# Auto-detect the latest performance data file in outputs/ python3 tools/swimlane_converter.py -# 指定输入文件 +# Specify an input file python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -# 指定输出文件 +# Specify an output file python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -o custom_output.json -# 从 kernel_config.py 加载函数名映射 +# Load function name mapping from kernel_config.py python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json \ -k examples/host_build_graph/paged_attention/kernels/kernel_config.py -# 使用指定 device id 自动选择 device log(device-) +# Use a specific device id for automatic device log selection (device-) python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -d 0 -# 详细模式(用于调试) +# Verbose mode (for debugging) python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -v ``` -### 命令行选项 +### Command-Line Options -| 选项 | 简写 | 说明 | -|------|------|------| -| `input` | | 输入 JSON 文件(perf_swimlane_*.json)。如果省略,使用 outputs/ 中最新的文件 | -| `--output` | `-o` | 输出 JSON 文件(默认:outputs/merged_swimlane_.json) | -| `--kernel-config` | `-k` | kernel_config.py 文件路径,用于函数名映射 | -| `--device-log` | | 设备日志文件/目录/glob 覆盖输入(优先级最高) | -| `--device-id` | `-d` | 指定 device id,从 `device-` 目录自动选择日志 | -| `--verbose` | `-v` | 启用详细输出 | +| Option | Short | Description | +|--------|-------|-------------| +| `input` | | Input JSON file (perf_swimlane_*.json). 
If omitted, uses the latest file in outputs/ | +| `--output` | `-o` | Output JSON file (default: outputs/merged_swimlane_.json) | +| `--kernel-config` | `-k` | Path to kernel_config.py for function name mapping | +| `--device-log` | | Device log file/path/glob override (highest priority) | +| `--device-id` | `-d` | Device id for automatic log selection from `device-` directory | +| `--verbose` | `-v` | Enable verbose output | -### device log 选择优先级 +### Device Log Selection Priority -`swimlane_converter.py` 和 `sched_overhead_analysis.py` 使用一致的解析规则(由 `device_log_resolver.py` 提供): +`swimlane_converter.py` and `sched_overhead_analysis.py` use consistent resolution rules (provided by `device_log_resolver.py`): -1. `--device-log`(文件/目录/glob)显式覆盖 -2. `-d/--device-id` 对应 `device-` 目录 -3. 自动扫描 `device-*`,选择最接近 perf 时间戳的 `.log` +1. `--device-log` (file/directory/glob) explicit override +2. `-d/--device-id` selects from the `device-` directory +3. Auto-scan all `device-*` directories, choosing the `.log` closest to the perf timestamp -log root 解析顺序: +Log root resolution order: - `$ASCEND_WORK_PATH/log/debug/` -- `~/ascend/log/debug/`(fallback) +- `~/ascend/log/debug/` (fallback) + +### Output -### 输出内容 +The tool generates three types of output: -工具生成三类输出: +#### 1. Perfetto JSON File -#### 1. 
Perfetto JSON 文件 +A Chrome Trace Event format JSON file viewable in Perfetto: +- File location: `outputs/merged_swimlane_.json` +- Open https://ui.perfetto.dev/ and drag the file in to visualize -可在 Perfetto 中可视化的 Chrome Trace Event 格式 JSON 文件: -- 文件位置:`outputs/merged_swimlane_.json` -- 打开 https://ui.perfetto.dev/ 并拖入文件即可可视化 +Trace processes (tracks): +- **AICore View** (pid=1): kernel execution (start_time_us to end_time_us) +- **AICPU View** (pid=2): end-to-end AICPU perspective (dispatch_time_us to finish_time_us) +- **AICPU Scheduler** (pid=3): scheduler phase bars per thread (version 2 only) +- **AICPU Orchestrator** (pid=4): orchestrator phase bars or summary (version 2 only) -#### 2. 任务统计信息 +#### 2. Task Statistics -按函数分组的统计摘要(打印到控制台),包含 Exec/Latency 对比和调度开销分析: +A per-function summary printed to the console, including Exec/Latency comparison and scheduling overhead analysis: -- **Exec**:AICore 上的 kernel 执行时间(end_time - start_time) -- **Latency**:AICPU 视角的端到端延迟(finish_time - dispatch_time,包含 head OH + Exec + tail OH) -- **Head/Tail OH**:调度头部/尾部开销 -- **Exec_%**:Exec / Latency 百分比(kernel 利用率) +- **Exec**: Kernel execution time on AICore (end_time - start_time) +- **Latency**: End-to-end latency from AICPU perspective (finish_time - dispatch_time, includes Head OH + Exec + Tail OH) +- **Head/Tail OH**: Scheduling head/tail overhead +- **Exec_%**: Exec / Latency percentage (kernel utilization) -解析到 device log 时,还会输出 Sched CPU(AICPU scheduler 线程实际 CPU 时间 per task)和 Exec/Sched_CPU 比率。 +When a device log is resolved, also outputs Sched CPU (AICPU scheduler thread actual CPU time per task) and Exec/Sched_CPU ratio. -#### 3. Scheduler overhead deep-dive(自动) +#### 3. 
Scheduler Overhead Deep Dive (automatic) -当 device log 成功解析后,`swimlane_converter.py` 会直接调用 `sched_overhead_analysis` 的分析逻辑,并在同一次运行中输出: +When a device log is successfully resolved, `swimlane_converter.py` directly invokes the `sched_overhead_analysis` analysis logic and outputs in the same run: - Part 1: Per-task time breakdown - Part 2: AICPU scheduler loop breakdown - Part 3: Tail OH distribution & cause analysis -### 与 run_example.py 集成 +### Integration with run_example.py -启用性能分析运行测试时,转换器会自动调用: +When running tests with profiling enabled, the converter is automatically invoked: ```bash -# 运行测试并启用性能分析 - 测试通过后自动生成 merged_swimlane.json +# Run a test with profiling enabled — merged_swimlane.json is auto-generated after the test passes python examples/scripts/run_example.py \ -k examples/host_build_graph/vector_example/kernels \ -g examples/host_build_graph/vector_example/golden.py \ --enable-profiling ``` -测试通过后,工具将: -1. 自动检测 outputs/ 中最新的 `perf_swimlane_*.json` -2. 从 `-k` 指定的 kernel_config.py 加载函数名 -3. 把运行时有效 device id(`-d`)透传给 `swimlane_converter.py` -4. 自动解析 device log 并输出选择策略 -5. 生成 `merged_swimlane_*.json` 用于可视化 -6. 将任务统计与 scheduler overhead deep-dive 报告打印到控制台 +After the test passes, the tool will: +1. Auto-detect the latest `perf_swimlane_*.json` in outputs/ +2. Load function names from the kernel_config.py specified by `-k` +3. Pass the runtime device id (`-d`) through to `swimlane_converter.py` +4. Auto-resolve the device log and print the selection strategy +5. Generate `merged_swimlane_*.json` for visualization +6. Print task statistics and the scheduler overhead deep-dive report to the console --- ## sched_overhead_analysis.py -分析 AICPU scheduler 的调度开销,定量分解 Tail OH(任务完成到 scheduler 确认之间的延迟)的来源。 +Analyzes AICPU scheduler overhead, quantitatively breaking down the sources of Tail OH (the delay between task completion and scheduler acknowledgment). -### 功能概述 +### Overview -`sched_overhead_analysis.py` 从两个数据源进行分析: -1. 
**Perf profiling 数据**(`perf_swimlane_*.json`):提取每个 task 的 Exec / Head OH / Tail OH 时间分解 -2. **设备日志**(device log):解析 AICPU scheduler 线程的循环分解(scan / complete / dispatch / idle)、锁竞争和 fanout 统计 +`sched_overhead_analysis.py` performs analysis from two data sources: +1. **Perf profiling data** (`perf_swimlane_*.json`): Extracts per-task Exec / Head OH / Tail OH time breakdown +2. **Scheduler loop breakdown** with two sources (in priority order): + - **Perf JSON phase data** (version >= 2, preferred): Reads `aicpu_scheduler_phases` records directly from the perf JSON + - **Device log** (fallback for older data or `PTO2_SCHED_PROFILING=1` details) -支持三种 device log 格式: -1. **New two-level tree**(`PTO2_SCHED_PROFILING=1`):`=== Scheduler Phase Breakdown: total=Xus, Y tasks ===`,后跟各 phase 行 -2. **Legacy detailed**(`PTO2_SCHED_PROFILING=1`):`completed=X tasks in Yus (Z loops, W tasks/loop)`,后跟 `--- Phase Breakdown ---` 及带 fanout/fanin/pop 统计的 phase 行 -3. **Summary**(`PTO2_SCHED_PROFILING=0`):`Scheduler summary: total_time=Xus, loops=Y, tasks_scheduled=Z` +Device log supports two formats: +1. **New two-level tree** (`PTO2_SCHED_PROFILING=1`): `=== Scheduler Phase Breakdown: total=Xus, Y tasks ===`, followed by per-phase lines with fanout/fanin/pop statistics +2. 
**Summary** (`PTO2_SCHED_PROFILING=0`): `Scheduler summary: total_time=Xus, loops=Y, tasks_scheduled=Z` -### 基本用法 +### Basic Usage ```bash -# 自动选取最新的 perf 数据和设备日志 +# Auto-select the latest perf data and device log python3 tools/sched_overhead_analysis.py -# 指定 device id 自动选取 device- 日志 +# Specify device id for automatic log selection from device- python3 tools/sched_overhead_analysis.py --perf-json outputs/perf_swimlane_20260210_143526.json -d 0 -# 指定文件 +# Specify files explicitly python3 tools/sched_overhead_analysis.py \ --perf-json outputs/perf_swimlane_20260210_143526.json \ --device-log ~/ascend/log/debug/device-0/device-*.log ``` -### 命令行选项 +### Command-Line Options -| 选项 | 说明 | -|------|------| -| `--perf-json` | perf_swimlane_*.json 文件路径。省略时自动选取 outputs/ 中最新的文件 | -| `--device-log` | 设备日志文件/目录/glob 覆盖输入(优先级最高) | -| `-d, --device-id` | 指定 device id,从 `device-` 自动选取日志 | +| Option | Description | +|--------|-------------| +| `--perf-json` | Path to perf_swimlane_*.json file. If omitted, auto-selects the latest in outputs/ | +| `--device-log` | Device log file/path/glob override (highest priority) | +| `-d, --device-id` | Device id for automatic log selection from `device-` | -### 输出内容 +### Output -分三部分输出: +Output is divided into three parts: -- **Part 1:Per-task time breakdown** — Exec / Head OH / Tail OH 各占 Latency 的百分比 -- **Part 2:AICPU scheduler loop breakdown** — 各 scheduler 线程的循环统计、各阶段(scan / complete / dispatch / idle)耗时占比、锁竞争和 fanout/fanin/pop 统计 -- **Part 3:Tail OH distribution & cause analysis** — Tail OH 分位数分布(P10~P99)、scheduler 循环迭代耗时与 Tail OH 的关联分析、主导 phase 的数据驱动洞察 +- **Part 1: Per-task time breakdown** — Exec / Head OH / Tail OH as percentages of Latency +- **Part 2: AICPU scheduler loop breakdown** — Per-thread loop statistics, phase breakdown (scan / complete / dispatch / idle) with time percentages, fanout/fanin/pop statistics. 
Data source is either perf JSON phase data (version >= 2) or device log (fallback) +- **Part 3: Tail OH distribution & cause analysis** — Tail OH percentile distribution (P10–P99), correlation analysis between scheduler loop iteration time and Tail OH, data-driven insights on the dominant phase --- ## perf_to_mermaid.py -将性能分析数据转换为 Mermaid 流程图格式,可视化任务依赖关系。 +Converts performance data into Mermaid flowchart format to visualize task dependencies. -### 功能概述 +### Overview -`perf_to_mermaid.py` 将 PTO Runtime 的性能分析数据(`perf_swimlane_*.json`)转换为 Mermaid 流程图格式。生成的 Markdown 文件可以: -- 在 GitHub/GitLab 中直接渲染 -- 在 https://mermaid.live/ 中查看 -- 在支持 Mermaid 的编辑器中查看(如 VS Code + Mermaid 插件) +`perf_to_mermaid.py` converts PTO Runtime performance data (`perf_swimlane_*.json`) into Mermaid flowchart format. The generated Markdown file can be: +- Rendered directly in GitHub/GitLab +- Viewed at https://mermaid.live/ +- Viewed in editors with Mermaid support (e.g., VS Code + Mermaid extension) -### 基本用法 +### Basic Usage ```bash -# 自动检测 outputs/ 目录中最新的性能分析文件 +# Auto-detect the latest performance data file in outputs/ python3 tools/perf_to_mermaid.py -# 指定输入文件 +# Specify an input file python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json -# 指定输出文件 +# Specify an output file python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json -o diagram.md -# 从 kernel_config.py 加载函数名映射 +# Load function name mapping from kernel_config.py python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json \ -k examples/host_build_graph/paged_attention/kernels/kernel_config.py -# 使用紧凑样式(仅显示任务ID和函数名) +# Use compact style (shows only task ID) python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json --style compact -# 指定流程图方向(从左到右) +# Specify flowchart direction (left to right) python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json --direction LR -# 详细模式 +# Verbose mode python3 tools/perf_to_mermaid.py 
outputs/perf_swimlane_20260210_143526.json -v ``` -### 命令行选项 +### Command-Line Options -| 选项 | 简写 | 说明 | -|------|------|------| -| `input` | | 输入 JSON 文件(perf_swimlane_*.json)。如果省略,使用 outputs/ 中最新的文件 | -| `--output` | `-o` | 输出 Markdown 文件(默认:outputs/mermaid_diagram_.md) | -| `--kernel-config` | `-k` | kernel_config.py 文件路径,用于函数名映射 | -| `--style` | | 节点样式:`detailed`(默认,包含函数名和任务ID)或 `compact`(仅任务ID)| -| `--direction` | | 流程图方向:`TD`(从上到下,默认)或 `LR`(从左到右)| -| `--verbose` | `-v` | 启用详细输出 | +| Option | Short | Description | +|--------|-------|-------------| +| `input` | | Input JSON file (perf_swimlane_*.json). If omitted, uses the latest file in outputs/ | +| `--output` | `-o` | Output Markdown file (default: outputs/mermaid_diagram_.md) | +| `--kernel-config` | `-k` | Path to kernel_config.py for function name mapping | +| `--style` | | Node style: `detailed` (default, shows function name and task ID) or `compact` (task ID only) | +| `--direction` | | Flowchart direction: `TD` (top-down, default) or `LR` (left-right) | +| `--verbose` | `-v` | Enable verbose output | -### 输出内容 +### Output -生成包含 Mermaid 流程图的 Markdown 文件: +Generates a Markdown file containing a Mermaid flowchart: -#### Detailed 样式(默认) +#### Detailed Style (default) ```mermaid flowchart TD @@ -266,81 +278,118 @@ flowchart TD ## benchmark_rounds.sh -批量运行预定义的 examples,解析 device log 中的 timing 行并报告每轮耗时。 +Batch-runs predefined examples on hardware, parses device log timing lines, and reports per-round latency with statistical analysis. 
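The per-round figures the script reports are derived from tick timestamps in the device log. As a rough sketch of that computation — assuming the 50-ticks-per-microsecond conversion described in README_benchmark.md's timing-analysis section; the function name and tick values below are illustrative, not the script's actual internals:

```python
# Illustrative: one round's elapsed time from device-log tick timestamps.
# Assumes 50 clock ticks per microsecond (the conversion the script uses).
TICKS_PER_US = 50.0

def round_elapsed_us(orch_starts, ends):
    """Earliest orch_start to latest end for a round, converted to microseconds."""
    return (max(ends) - min(orch_starts)) / TICKS_PER_US

# Hypothetical tick values for a single round:
print(round_elapsed_us([1_000_000, 1_000_040], [1_005_900, 1_005_950]))
```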
-### 功能概述
-`benchmark_rounds.sh` 遍历 `EXAMPLES` 数组中配置的测试用例(位于 `tests/device_tests/tensormap_and_ringbuffer/` 下),依次调用 `run_example.py` 运行每个 example,然后从生成的 device log 中提取 `orch_start` / `end` 时间戳计算每轮 elapsed 时间。
+### Overview
+`benchmark_rounds.sh` loads configuration from `benchmark_config.json`, iterates over the configured example list (under `tests/device_tests/<platform>/tensormap_and_ringbuffer/` by default), invokes `run_example.py` for each example, then extracts `orch_start` / `end` timestamps from the generated device log to compute per-round elapsed time. It supports warm-up rounds (discarded before statistics), and reports mean, median, trimmed mean, range, MAD, standard deviation, and fluctuation rate (CV). It can optionally generate scatter-plot PNGs and save statistics logs.
-当前预配置的 examples:
+Currently configured examples (in `benchmark_config.json`):
 - `alternating_matmul_add`
 - `benchmark_bgemm`
+- `paged_attention_unroll`
 - `batch_paged_attention`
 - `paged_attention`
-### 基本用法
+### Basic Usage
 ```bash
-# 使用默认参数(device 0, 10 rounds)
+# Use defaults from benchmark_config.json (device 0, 10 rounds, 2 warmup, platform a2a3)
 ./tools/benchmark_rounds.sh
-# 指定 device 和 rounds
-./tools/benchmark_rounds.sh -d 4 -n 20
+# Specify a custom config file
+./tools/benchmark_rounds.sh -c path/to/config.json
+
+# Specify platform, device, rounds, and warmup
+./tools/benchmark_rounds.sh -p a2a3 -d 4 -n 20 -w 3
-# 额外参数透传给 run_example.py
+# Enable verbose output and scatter plots
+./tools/benchmark_rounds.sh -v --plot
+
+# Extra arguments are passed through to run_example.py
 ./tools/benchmark_rounds.sh -d 0 -n 5 --case 1
 ```
-### 命令行选项
+### Command-Line Options
-| 选项 | 简写 | 说明 |
-|------|------|------|
-| `--device` | `-d` | Device ID(默认:0) |
-| `--rounds` | `-n` | 每个 example 的运行轮数(默认:10) |
-| `--help` | `-h` | 显示帮助信息 |
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--config` | `-c` | Path to JSON config file (default: benchmark_config.json next to the script)
| +| `--platform` | `-p` | Platform to run on (config default: `a2a3`) | +| `--device` | `-d` | Device ID (config default: `0`) | +| `--rounds` | `-n` | Number of measured rounds per example (config default: `10`) | +| `--warmup` | `-w` | Number of warm-up rounds to discard (config default: `2`) | +| `--verbose` | `-v` | Print detailed run_example.py output (config default: false) | +| `--plot` | | Generate scatter plot PNG for each example (config default: false) | +| `--log` | | Save statistics to `benchmark_logs/` for each example (config default: true) | +| `--help` | `-h` | Show help message | -所有未识别的参数会透传给 `run_example.py`。 +All unrecognized arguments are passed through to `run_example.py`. CLI arguments override values from the config file. -### 输出内容 +### Output -对每个 example 输出: -- 每轮的 Elapsed 时间(微秒) -- 平均耗时和总轮数 +For each example: +- Per-round elapsed time in microseconds, with warm-up rounds clearly marked +- Mean, median, and trimmed mean (excluding min & max) +- Range (max - min), mean absolute deviation (MAD), standard deviation, and fluctuation rate (CV%) +- Optional scatter plot PNG (with `--plot`) +- Optional statistics log file in `benchmark_logs/` (with `--log`) -最终输出汇总:passed / failed 数量。 +Final summary: passed / failed count. -### Device log 解析 +### Device Log Resolution -脚本通过以下方式定位 device log: -- 优先使用 `$ASCEND_WORK_PATH/log/debug/device-/` -- Fallback 到 `~/ascend/log/debug/device-/` -- 在运行前快照已有 log 文件,运行后等待新 log 文件出现(最多 15 秒) +The script locates device logs as follows: +- Uses `$ASCEND_WORK_PATH/log/debug/device-/` first +- Falls back to `~/ascend/log/debug/device-/` +- Snapshots existing log files before running, then waits for a new log file to appear (up to 15 seconds) + +--- + +## benchmark_config.json + +Configuration file for `benchmark_rounds.sh`. Defines default settings and the list of examples to benchmark. 
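For reference, a complete configuration using the defaults above (the same example configuration shown in README_benchmark.md):

```json
{
  "project_root": "..",
  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
  "examples": [
    "alternating_matmul_add",
    "benchmark_bgemm",
    "paged_attention_unroll",
    "batch_paged_attention",
    "paged_attention"
  ],
  "device_id": 0,
  "rounds": 10,
  "warmup_rounds": 2,
  "platform": "a2a3",
  "verbose": false,
  "log": true,
  "plot": false
}
```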
+ +### Fields + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `project_root` | string | `".."` | Relative path from the script to the project root | +| `examples_subdir` | string | `"tests/device_tests/${platform}/tensormap_and_ringbuffer"` | Subdirectory under project root containing examples. `${platform}` is substituted at runtime | +| `examples` | list | | List of example directory names to benchmark | +| `device_id` | int | `0` | Default device ID | +| `rounds` | int | `10` | Default number of measured rounds | +| `warmup_rounds` | int | `2` | Default number of warm-up rounds | +| `platform` | string | `"a2a3"` | Default platform | +| `verbose` | bool | `false` | Whether to print detailed output | +| `log` | bool | `true` | Whether to save statistics logs | +| `plot` | bool | `false` | Whether to generate scatter plots | --- ## device_log_resolver.py -Device log 路径解析库,被 `swimlane_converter.py` 和 `sched_overhead_analysis.py` 共同使用。 +Device log path resolution library, used by both `swimlane_converter.py` and `sched_overhead_analysis.py`. -### 功能概述 +### Overview -`device_log_resolver.py` 提供确定性的 device log 路径解析逻辑,支持三种选择优先级: +`device_log_resolver.py` provides deterministic device log path resolution logic, supporting three selection priorities: -1. **显式路径**(`--device-log`):支持文件、目录、glob 模式 -2. **Device ID**(`--device-id`):从 `/device-/` 选择最新 `.log` -3. **自动扫描**:遍历所有 `device-*` 目录,选择与 perf 时间戳最接近的 `.log` +1. **Explicit path** (`--device-log`): Supports file, directory, or glob pattern +2. **Device ID** (`--device-id`): Selects the newest `.log` from `/device-/` +3. 
**Auto-scan**: Traverses all `device-*` directories, selecting the `.log` closest to the perf timestamp -### 主要函数 +### Main Functions -| 函数 | 说明 | -|------|------| -| `get_log_root()` | 返回 log root 路径(`$ASCEND_WORK_PATH/log/debug/` 或 `~/ascend/log/debug/`) | -| `infer_device_id_from_log_path(log_path)` | 从路径中推断 device id(如 `device-0`) | -| `resolve_device_log_path(device_id, device_log, perf_path)` | 按优先级解析 device log 路径,返回 `(Path, strategy_string)` | +| Function | Description | +|----------|-------------| +| `get_log_root()` | Returns the log root path (`$ASCEND_WORK_PATH/log/debug/` or `~/ascend/log/debug/`) | +| `infer_device_id_from_log_path(log_path)` | Infers device id from the path (e.g., `device-0`) | +| `resolve_device_log_path(device_id, device_log, perf_path)` | Resolves device log path by priority, returns `(Path, strategy_string)` | -### 使用方式 +### Usage -该模块不作为独立命令行工具使用,而是被其他工具导入: +This module is not used as a standalone command-line tool; it is imported by other tools: ```python from device_log_resolver import resolve_device_log_path @@ -354,11 +403,11 @@ log_path, strategy = resolve_device_log_path( --- -## 共同配置 +## Common Configuration -### 输入文件格式 +### Input File Format -分析工具共用相同的输入格式 - PTO Runtime 生成的 `perf_swimlane_*.json` 文件: +The analysis tools share the same input format — `perf_swimlane_*.json` files generated by PTO Runtime: ```json { @@ -380,119 +429,123 @@ log_path, strategy = resolve_device_log_path( } ``` -### Kernel Config 格式 +Version 2 data additionally includes `aicpu_scheduler_phases`, `aicpu_orchestrator`, `aicpu_orchestrator_phases`, and `core_to_thread` fields. + +### Kernel Config Format -要在输出中显示有意义的函数名,需要提供 `kernel_config.py` 文件: +To display meaningful function names in the output, provide a `kernel_config.py` file: ```python KERNELS = [ { "func_id": 0, "name": "QK", - # ... 其他字段 + # ... other fields }, { "func_id": 1, "name": "SF", - # ... 其他字段 + # ... 
other fields }, ] ``` -工具从 `KERNELS` 列表中提取 `func_id` 到 `name` 的映射。 +The tools extract a `func_id` to `name` mapping from the `KERNELS` list. --- -## 工具选择建议 +## Tool Selection Guide -### 使用 swimlane_converter.py 当你需要: -- 查看详细的时间线执行视图 -- 分析任务在不同核心上的调度情况 -- 查看精确的执行时间和时间间隔 -- 获取任务执行的统计信息 -- 专业的性能分析和优化 +### Use swimlane_converter.py when you need to: +- View a detailed timeline execution view +- Analyze task scheduling across different cores +- See precise execution times and intervals +- Get task execution statistics +- Perform professional performance analysis and optimization -### 使用 perf_to_mermaid.py 当你需要: -- 快速查看任务依赖关系 -- 在文档中嵌入依赖图 -- 在代码审查中分享依赖结构 -- 不需要时间线细节,只关注拓扑结构 -- 在 GitHub/GitLab 中直接查看 +### Use perf_to_mermaid.py when you need to: +- Quickly view task dependencies +- Embed dependency graphs in documentation +- Share dependency structure in code reviews +- Focus on topology rather than timeline details +- View directly in GitHub/GitLab -### 使用 benchmark_rounds.sh 当你需要: -- 批量运行多个 examples 并对比耗时 -- 获取每轮的 elapsed 时间统计 -- 在硬件上做端到端性能回归测试 +### Use benchmark_rounds.sh when you need to: +- Batch-run multiple examples and compare timing +- Get per-round elapsed time with statistical analysis +- Run end-to-end performance regression tests on hardware -### 推荐工作流 +### Recommended Workflow ```bash -# 1. 运行测试获取性能数据 +# 1. Run a test to collect performance data python examples/scripts/run_example.py -k ./kernels -g ./golden.py --enable-profiling -# 2. 生成 Perfetto 可视化(自动) +# 2. Generate Perfetto visualization (automatic) # → outputs/merged_swimlane_*.json -# 3. 生成 Mermaid 依赖图 +# 3. Generate Mermaid dependency graph python3 tools/perf_to_mermaid.py -k ./kernels/kernel_config.py -# 4. 批量 benchmark(硬件上) +# 4. Batch benchmark (on hardware) ./tools/benchmark_rounds.sh -d 0 -n 20 -# 5. 分析结果 -# - 详细性能分析:Perfetto (https://ui.perfetto.dev/) -# - 依赖关系概览:Mermaid 图(GitHub/编辑器) -# - 统计摘要:控制台输出 +# 5. 
Analyze results +# - Detailed performance analysis: Perfetto (https://ui.perfetto.dev/) +# - Dependency overview: Mermaid diagram (GitHub / editor) +# - Statistical summary: console output ``` --- -## 故障排查 +## Troubleshooting -### 错误:找不到 perf_swimlane_*.json 文件 -- 确保使用 `--enable-profiling` 标志运行了测试 -- 检查 outputs/ 目录是否存在并包含性能分析数据 +### Error: Cannot find perf_swimlane_*.json file +- Make sure you ran the test with the `--enable-profiling` flag +- Check that the outputs/ directory exists and contains performance data -### 警告:Kernel entry missing 'func_id' or 'name' -- 检查 kernel_config.py 文件格式 -- 确保所有 KERNELS 条目都有 'func_id' 和 'name' 字段 +### Warning: Kernel entry missing 'func_id' or 'name' +- Check the kernel_config.py file format +- Ensure all KERNELS entries have 'func_id' and 'name' fields -### 错误:Unsupported version -- 工具仅支持版本 1 的性能分析数据格式 -- 使用最新的 runtime 重新生成性能分析数据 +### Error: Unsupported version +- The tools support version 1 and version 2 performance data formats +- Regenerate performance data using the latest runtime if needed -### 错误:Perf JSON missing required fields for scheduler overhead analysis -- 该错误表示输入的 `perf_swimlane_*.json` 缺少 deep-dive 分析需要的字段(通常是 `dispatch_time_us` / `finish_time_us`) -- `swimlane_converter.py` 的基础转换可继续成功,但 deep-dive 会跳过或失败 -- 处理路径: - 1. 使用 `--enable-profiling` 重新跑一次,生成新的 `outputs/perf_swimlane_*.json` - 2. 重新执行 `swimlane_converter.py` 或 `sched_overhead_analysis.py` - 3. 检查 JSON 中每个 task 是否包含 `dispatch_time_us` 和 `finish_time_us` +### Error: Perf JSON missing required fields for scheduler overhead analysis +- This error indicates that the input `perf_swimlane_*.json` is missing fields needed for the deep-dive analysis (typically `dispatch_time_us` / `finish_time_us`) +- The basic conversion by `swimlane_converter.py` can still succeed, but the deep-dive will be skipped or fail +- Resolution: + 1. Re-run with `--enable-profiling` to generate a new `outputs/perf_swimlane_*.json` + 2. 
Re-run `swimlane_converter.py` or `sched_overhead_analysis.py` + 3. Verify that each task in the JSON includes `dispatch_time_us` and `finish_time_us` -### benchmark_rounds.sh 无 timing 数据 -- 确保运行时启用了 profiling(`PTO2_PROFILING` 环境变量) -- 检查 device log 目录是否可访问 -- 确认 log 中包含 `orch_start` / `end` 时间戳行 +### benchmark_rounds.sh has no timing data +- Ensure profiling was enabled at runtime (`PTO2_PROFILING` environment variable) +- Check that the device log directory is accessible +- Confirm the log contains `orch_start` / `end` timestamp lines -### Mermaid 图在 GitHub 上不显示 -- 确保文件是 `.md` 扩展名 -- 检查 Mermaid 语法是否正确 -- GitHub 有时需要刷新才能渲染 Mermaid 图 +### Mermaid diagram does not render on GitHub +- Ensure the file has a `.md` extension +- Check that the Mermaid syntax is correct +- GitHub may need a refresh to render Mermaid diagrams --- -## 输出文件说明 +## Output File Reference -| 文件 | 工具 | 用途 | 格式 | -|------|------|------|------| -| `perf_swimlane_*.json` | Runtime | 原始性能分析数据 | JSON | -| `merged_swimlane_*.json` | swimlane_converter.py | Perfetto 可视化 | Chrome Trace Event JSON | -| `mermaid_diagram_*.md` | perf_to_mermaid.py | 依赖关系图 | Markdown + Mermaid | +| File | Tool | Purpose | Format | +|------|------|---------|--------| +| `perf_swimlane_*.json` | Runtime | Raw performance data | JSON | +| `merged_swimlane_*.json` | swimlane_converter.py | Perfetto visualization | Chrome Trace Event JSON | +| `mermaid_diagram_*.md` | perf_to_mermaid.py | Dependency graph | Markdown + Mermaid | +| `benchmark_logs/*.log` | benchmark_rounds.sh | Benchmark statistics | Text | +| `benchmark_logs/*.png` | benchmark_rounds.sh | Scatter plots (optional) | PNG image | --- -## 相关资源 +## Related Resources - [Perfetto Trace Viewer](https://ui.perfetto.dev/) - [Mermaid Live Editor](https://mermaid.live/) -- [Mermaid 文档](https://mermaid.js.org/) +- [Mermaid Documentation](https://mermaid.js.org/) diff --git a/tools/README_benchmark.md b/tools/README_benchmark.md new file mode 100644 index 00000000..79b976cb 
--- /dev/null
+++ b/tools/README_benchmark.md
@@ -0,0 +1,169 @@
+# benchmark_rounds.sh
+
+A benchmark wrapper that runs device test examples on Ascend hardware, parses per-round timing data from device logs, and reports latency statistics.
+
+## Quick Start
+
+```bash
+# Run with defaults from benchmark_config.json
+./tools/benchmark_rounds.sh
+
+# Run with custom rounds and verbose output
+./tools/benchmark_rounds.sh -n 20 -w 3 -v
+
+# Use a different config file
+./tools/benchmark_rounds.sh -c /path/to/my_config.json
+
+# Generate plots and save logs
+./tools/benchmark_rounds.sh --plot --log
+```
+
+## Usage
+
+```
+./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v] [--plot] [--log]
+```
+
+## Command-Line Options
+
+All CLI arguments override their corresponding values from the config file.
+
+| Option | Long Form | Description | Config Key | Default |
+|--------|-----------|-------------|------------|---------|
+| `-c` | `--config` | Path to JSON config file | — | `tools/benchmark_config.json` |
+| `-p` | `--platform` | Platform to run on | `platform` | `a2a3` |
+| `-d` | `--device` | Device ID | `device_id` | `0` |
+| `-n` | `--rounds` | Number of measured rounds per example | `rounds` | `10` |
+| `-w` | `--warmup` | Number of warm-up rounds to discard | `warmup_rounds` | `2` |
+| `-v` | `--verbose` | Print detailed `run_example.py` output | `verbose` | `false` |
+| | `--plot` | Generate scatter plot PNG for each example | `plot` | `false` |
+| | `--log` | Save statistics to `benchmark_logs/` | `log` | `true` |
+| `-h` | `--help` | Show help message | — | — |
+
+Any unrecognized arguments are passed through to `run_example.py` (e.g., `--case`).
+
+## Configuration File
+
+The script loads settings from a JSON config file (`benchmark_config.json` by default, located next to the script). Example:
+
+```json
+{
+  "project_root": "..",
+  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
+  "examples": [
+    "alternating_matmul_add",
+    "benchmark_bgemm",
+    "paged_attention_unroll",
+    "batch_paged_attention",
+    "paged_attention"
+  ],
+  "device_id": 0,
+  "rounds": 10,
+  "warmup_rounds": 2,
+  "platform": "a2a3",
+  "verbose": false,
+  "log": true,
+  "plot": false
+}
+```
+
+### Config Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `project_root` | string | Relative path from the script directory to the project root. |
+| `examples_subdir` | string | Subdirectory under the project root containing examples. The `${platform}` placeholder is replaced with the active platform value. |
+| `examples` | array | List of example directory names to benchmark. |
+| `device_id` | int | Ascend device ID to run on. |
+| `rounds` | int | Number of measured (non-warmup) rounds. |
+| `warmup_rounds` | int | Number of initial rounds to discard before measuring. |
+| `platform` | string | Target platform (e.g., `a2a3` for hardware, `a2a3sim` for simulation). |
+| `verbose` | bool | Whether to show full `run_example.py` output. |
+| `log` | bool | Whether to save per-example statistics to `benchmark_logs/`. |
+| `plot` | bool | Whether to generate scatter plot PNGs (requires `matplotlib`). |
+
+## Execution Logic
+
+The script follows these steps:
+
+### 1. Load Configuration
+
+Configuration is loaded from the JSON file using `python3`. The config file path defaults to `benchmark_config.json` in the same directory as the script. If `-c` is provided on the command line, that config file is loaded instead. CLI arguments then override any config values.
+
+### 2. Resolve Paths
+
+- The `examples_subdir` path has its `${platform}` placeholder replaced with the actual platform value.
+- The device log directory is resolved from `$ASCEND_WORK_PATH/log/debug/device-<device_id>` (falling back to `$HOME/ascend/log/debug/device-<device_id>`).
+
+### 3. Iterate Over Examples
+
+For each example listed in the config:
+
+1. **Validate**: Check that both `golden.py` and `kernels/` exist in the example directory. Skip otherwise.
+2. **Snapshot logs**: Record the list of existing `.log` files in the device log directory (used later to detect the newly created log).
+3. **Run**: Execute `run_example.py` with the example's kernels directory and golden file. The total number of rounds passed is `rounds + warmup_rounds`. In non-verbose mode, stdout/stderr are suppressed.
+4. **Find new log**: Wait up to 15 seconds for a new `.log` file to appear in the device log directory. Falls back to the newest log file if no new file is detected.
+5. **Parse timing**: Extract `orch_start` and `end` timestamps from the device log and compute per-round elapsed time.
+
+### 4. Timing Analysis
+
+The `parse_timing` function processes device log lines containing `orch_start=` and `end=` markers:
+
+- For each round, it computes elapsed time as `(max_end - min_start) / 50.0` microseconds (clock ticks to microseconds conversion).
+- The first N rounds (controlled by `--warmup`) are labeled as warmup and excluded from statistics.
+- For the remaining measured rounds, it computes:
+  - **Arithmetic mean**: Average of all measured rounds.
+  - **Median**: Middle value of the sorted measurements.
+  - **Trimmed mean**: Mean after dropping the single lowest and highest values (requires at least 3 samples).
+  - **Range (Max-Min)**: Difference between the highest and lowest measurements.
+  - **Mean absolute deviation (MAD)**: Average absolute distance from the mean.
+  - **Standard deviation**: Root-mean-square deviation from the mean.
+  - **Fluctuation rate (CV)**: Coefficient of variation, i.e., `(stddev / mean) * 100%`.
+
+### 5. Optional: Save Logs
+
+When `--log` is enabled, per-example statistics are saved to `tools/benchmark_logs/<example>_<timestamp>.log`.
+
+### 6. Optional: Generate Plots
+
+When `--plot` is enabled, the script collects per-round data and generates a scatter plot PNG for each example using `matplotlib`. Each plot shows:
+
+- Individual per-round elapsed times as scatter points.
+- Horizontal reference lines for the arithmetic mean and the trimmed mean.
+
+Plots are saved to `tools/benchmark_logs/benchmark_<example>_<timestamp>.png`. If `matplotlib` is not installed, a warning is printed and plot generation is skipped.
+
+### 7. Summary
+
+After all examples complete, a summary line reports the number of passed and failed examples. The script exits with a non-zero status if any example failed.
+
+## Output Example
+
+```
+================================================================
+  alternating_matmul_add
+================================================================
+  Log: /path/to/device-0/xxx.log
+  Round        Elapsed (us)
+  -----        ------------
+  0                   123.4 (warmup)
+  1                   121.8 (warmup)
+  2                   118.2
+  3                   117.9
+  ...
+
+  Mean: 118.1 us | Median: 118.0 us | Trimmed-mean: 118.0 us (10 rounds, 2 warmup)
+  Range(Max-Min): 2.3 us | Avg variation (MAD): 0.8 us | Std deviation: 0.9 us | Fluctuation rate (CV): 0.76%
+  result: 118.1 us
+
+================================================================
+  Benchmark complete: 5 passed, 0 failed (5 total)
+================================================================
+```
+
+## Prerequisites
+
+- **python3**: Required for config parsing, running examples, and optional plot generation.
+- **Ascend device**: Required for hardware benchmarking (platform `a2a3`).
+- **matplotlib** (optional): Required only when `--plot` is enabled.
+- **PTO2_PROFILING**: Must be enabled in the device environment to produce the `orch_start`/`end` timing markers in device logs.
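Reviewer note: the step-4 statistics above can be sketched in Python for clarity. This is an illustrative sketch, not part of the scripts being added; `round_stats` is a hypothetical helper name. It mirrors the awk code's choices: population standard deviation (divide by n), trimmed mean dropping exactly one min and one max, and CV as a percentage of the mean.

```python
# Hypothetical helper mirroring the statistics computed by parse_timing's
# awk block (not part of benchmark_rounds.sh).
def round_stats(measured):
    m = sorted(measured)
    n = len(m)
    mean = sum(m) / n
    # Median: middle element, or average of the two middle elements
    median = m[n // 2] if n % 2 else (m[n // 2 - 1] + m[n // 2]) / 2
    # Trimmed mean: drop the single lowest and highest values (needs n >= 3)
    trimmed = sum(m[1:-1]) / (n - 2) if n >= 3 else mean
    mad = sum(abs(x - mean) for x in m) / n                 # mean absolute deviation
    stddev = (sum((x - mean) ** 2 for x in m) / n) ** 0.5   # population stddev
    cv = (stddev / mean) * 100 if mean > 0 else 0.0         # fluctuation rate, %
    return {"mean": mean, "median": median, "trimmed": trimmed,
            "range": m[-1] - m[0], "mad": mad, "stddev": stddev, "cv": cv}
```

For example, `round_stats([1.0, 2.0, 3.0, 10.0])` gives mean 4.0, median 2.5, trimmed mean 2.5, range 9.0, and MAD 3.0; the single outlier moves the mean well above the median, which is why the script reports all three.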
diff --git a/tools/benchmark_config.json b/tools/benchmark_config.json
new file mode 100644
index 00000000..1e9647ed
--- /dev/null
+++ b/tools/benchmark_config.json
@@ -0,0 +1,18 @@
+{
+  "project_root": "..",
+  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
+  "examples": [
+    "alternating_matmul_add",
+    "benchmark_bgemm",
+    "paged_attention_unroll",
+    "batch_paged_attention",
+    "paged_attention"
+  ],
+  "device_id": 0,
+  "rounds": 10,
+  "warmup_rounds": 2,
+  "platform": "a2a3",
+  "verbose": false,
+  "log": true,
+  "plot": false
+}
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh
index 0d630ae8..d7478d04 100755
--- a/tools/benchmark_rounds.sh
+++ b/tools/benchmark_rounds.sh
@@ -3,38 +3,76 @@
 # then parse device-log timing lines to report per-round latency.
 #
 # Usage:
-#   ./tools/benchmark_rounds.sh [-p <platform>] [-d <device>] [-n <rounds>]
+#   ./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v]
 #
-# Runs all examples listed in EXAMPLES array and prints timing for each.
+# Runs all examples listed in the config file and prints timing for each.
+# Configuration is loaded from benchmark_config.json (next to this script by default).
+# CLI arguments override config file values.
 
 set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
-RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py"
 
 # ---------------------------------------------------------------------------
-# Examples to benchmark (paths relative to tests/device_tests/<platform>/tensormap_and_ringbuffer/)
-# Each entry is just the directory name; kernels/ and golden.py are implied.
+# Load configuration from JSON file
 # ---------------------------------------------------------------------------
-EXAMPLES=(
-  alternating_matmul_add
-  benchmark_bgemm
-  paged_attention_unroll
-  batch_paged_attention
-  paged_attention
-)
+CONFIG_FILE="$SCRIPT_DIR/benchmark_config.json"
 
-# ---------------------------------------------------------------------------
-# Parse arguments
-# ---------------------------------------------------------------------------
-DEVICE_ID=0
-ROUNDS=10
-PLATFORM=a2a3
+load_config() {
+  local cfg="$1"
+  if [[ ! -f "$cfg" ]]; then
+    echo "ERROR: config file not found: $cfg" >&2
+    exit 1
+  fi
+  if ! command -v python3 &>/dev/null; then
+    echo "ERROR: python3 is required to parse the config file" >&2
+    exit 1
+  fi
+  # Parse JSON config via python3 (always available in this project)
+  eval "$(python3 -c "
+import json, sys, os
+with open('$cfg') as f:
+    c = json.load(f)
+print('CFG_PROJECT_ROOT=' + repr(str(c.get('project_root', '..'))))
+print('CFG_EXAMPLES_SUBDIR=' + repr(str(c.get('examples_subdir', 'tests/device_tests/\${platform}/tensormap_and_ringbuffer'))))
+print('CFG_DEVICE_ID=' + repr(str(c.get('device_id', 0))))
+print('CFG_ROUNDS=' + repr(str(c.get('rounds', 10))))
+print('CFG_WARMUP_ROUNDS=' + repr(str(c.get('warmup_rounds', 2))))
+print('CFG_PLATFORM=' + repr(str(c.get('platform', 'a2a3'))))
+print('CFG_VERBOSE=' + ('true' if c.get('verbose', False) else 'false'))
+print('CFG_PLOT=' + ('true' if c.get('plot', False) else 'false'))
+print('CFG_LOG=' + ('true' if c.get('log', False) else 'false'))
+examples = c.get('examples', [])
+print('CFG_EXAMPLES=(' + ' '.join(repr(str(e)) for e in examples) + ')')
+")"
+}
+
+apply_config() {
+  PROJECT_ROOT="$(cd "$SCRIPT_DIR/$CFG_PROJECT_ROOT" && pwd)"
+  RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py"
+  DEVICE_ID="$CFG_DEVICE_ID"
+  ROUNDS="$CFG_ROUNDS"
+  WARMUP_ROUNDS="$CFG_WARMUP_ROUNDS"
+  PLATFORM="$CFG_PLATFORM"
+  EXAMPLES=("${CFG_EXAMPLES[@]}")
+  EXAMPLES_SUBDIR="$CFG_EXAMPLES_SUBDIR"
+  VERBOSE="$CFG_VERBOSE"
+  PLOT="$CFG_PLOT"
+  LOG="$CFG_LOG"
+}
+
+load_config "$CONFIG_FILE"
+apply_config
 
 EXTRA_ARGS=()
 while [[ $# -gt 0 ]]; do
   case "$1" in
+    -c|--config)
+      CONFIG_FILE="$2"
+      load_config "$CONFIG_FILE"
+      apply_config
+      shift 2
+      ;;
     -p|--platform)
       PLATFORM="$2"
       shift 2
       ;;
@@ -47,23 +85,45 @@ while [[ $# -gt 0 ]]; do
       ROUNDS="$2"
       shift 2
       ;;
+    -w|--warmup)
+      WARMUP_ROUNDS="$2"
+      shift 2
+      ;;
+    -v|--verbose)
+      VERBOSE=true
+      shift
+      ;;
+    --plot)
+      PLOT=true
+      shift
+      ;;
+    --log)
+      LOG=true
+      shift
+      ;;
     --help|-h)
-      cat <<'USAGE'
+      cat <<USAGE
 Usage:
-  ./tools/benchmark_rounds.sh [-p <platform>] [-d <device>] [-n <rounds>]
+  ./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v]
 
 Options:
-  -p, --platform   Platform to run on (default: a2a3)
-  -d, --device     Device ID (default: 0)
-  -n, --rounds     Override number of rounds for each example (default: 10)
+  -c, --config     Path to JSON config file (default: $SCRIPT_DIR/benchmark_config.json)
+  -p, --platform   Platform to run on (config default: a2a3)
+  -d, --device     Device ID (config default: 0)
+  -n, --rounds     Number of measured rounds per example (config default: 10)
+  -w, --warmup     Number of warm-up rounds to discard (config default: 2)
+  -v, --verbose    Print detailed run_example.py output (config default: false)
+      --plot       Generate scatter plot PNG for each example (config default: false)
+      --log        Save statistics to benchmark_logs/ for each example (config default: true)
   -h, --help       Show this help
 
+CLI arguments override values from the config file.
 All other options are passed through to run_example.py (e.g. --case).
 
 Output:
-  Average elapsed time in microseconds for each example.
+  Mean and median elapsed time in microseconds for each example.
 USAGE
       exit 0
       ;;
@@ -77,7 +137,9 @@ done
 # ---------------------------------------------------------------------------
 # Derive arch from platform and set examples directory
 # ---------------------------------------------------------------------------
-EXAMPLES_DIR="$PROJECT_ROOT/tests/device_tests/${PLATFORM}/tensormap_and_ringbuffer"
+# Substitute ${platform} placeholder in examples_subdir
+RESOLVED_SUBDIR="${EXAMPLES_SUBDIR//\$\{platform\}/$PLATFORM}"
+EXAMPLES_DIR="$PROJECT_ROOT/$RESOLVED_SUBDIR"
 
 # ---------------------------------------------------------------------------
 # Resolve device log directory (mirrors run_example.py / device_log_resolver.py)
@@ -93,11 +155,15 @@
 fi
 DEVICE_LOG_DIR="$LOG_ROOT/device-${DEVICE_ID}"
 
 # ---------------------------------------------------------------------------
-# parse_timing <log_file>
+# parse_timing <log_file> <warmup_rounds> <example_name>
 # Grep for orch_start / end lines, compute per-round elapsed, print summary.
+# Discards the first <warmup_rounds> rounds, then reports median, trimmed
+# mean (excluding min & max), and arithmetic mean for the remaining rounds.
 # ---------------------------------------------------------------------------
 parse_timing() {
   local log_file="$1"
+  local warmup="${2:-0}"
+  local example_name="${3:-unknown}"
   local timing
   timing=$(grep -E 'Thread=[0-9]+ (orch_start|end)=' "$log_file" || true)
@@ -107,7 +173,7 @@ parse_timing() {
     return 1
   fi
 
-  echo "$timing" | awk '
+  echo "$timing" | awk -v warmup="$warmup" -v example="$example_name" '
   function flush_round() {
     if (round >= 0 && max_end > 0 && min_start > 0) {
       results[round] = (max_end - min_start) / 50.0
@@ -126,27 +192,100 @@
       delete seen
     }
     seen[tid] = 1
-    match($0, /orch_start=([0-9]+)/, m)
-    val = m[1] + 0
+    match($0, /orch_start=([0-9]+)/, sm)
+    val = sm[1] + 0
     if (min_start == 0 || val < min_start) min_start = val
   }
   /end=/ {
-    match($0, /end=([0-9]+)/, m)
-    val = m[1] + 0
+    match($0, /end=([0-9]+)/, em)
+    val = em[1] + 0
     if (val > max_end) max_end = val
   }
   END {
     flush_round()
     if (count == 0) { print "  (no rounds parsed)"; exit 1 }
-    printf "  %-8s %12s\n", "Round", "Elapsed (us)"
-    printf "  %-8s %12s\n", "-----", "------------"
-    sum_v = 0
+    # Print all rounds, marking warm-up rounds
+    printf "  %-8s %12s %s\n", "Round", "Elapsed (us)", ""
+    printf "  %-8s %12s %s\n", "-----", "------------", ""
     for (i = 0; i < count; i++) {
-      printf "  %-8d %12.1f\n", i, results[i]
-      sum_v += results[i]
+      if (i < warmup)
+        printf "  %-8d %12.1f (warmup)\n", i, results[i]
+      else
+        printf "  %-8d %12.1f\n", i, results[i]
+    }
+
+    # Collect measured (non-warmup) rounds
+    m = 0
+    for (i = warmup; i < count; i++) {
+      measured[m] = results[i]
+      m++
+    }
+
+    if (m == 0) {
+      printf "\n  (all %d rounds were warm-up — no measured data)\n", count
+      exit 1
+    }
+
+    # Sort measured[] (insertion sort — tiny array)
+    for (i = 1; i < m; i++) {
+      key = measured[i]
+      j = i - 1
+      while (j >= 0 && measured[j] > key) {
+        measured[j + 1] = measured[j]
+        j--
+      }
+      measured[j + 1] = key
+    }
+
+    # Arithmetic mean
+    sum_v = 0
+    for (i = 0; i < m; i++) sum_v += measured[i]
+    mean_v = sum_v / m
+
+    # Median
+    if (m % 2 == 1)
+      median_v = measured[int(m / 2)]
+    else
+      median_v = (measured[m / 2 - 1] + measured[m / 2]) / 2.0
+
+    # Trimmed mean (drop one min and one max if we have >= 3 samples)
+    if (m >= 3) {
+      trim_sum = 0
+      for (i = 1; i < m - 1; i++) trim_sum += measured[i]
+      trimmed_v = trim_sum / (m - 2)
+    } else {
+      trimmed_v = mean_v
     }
-    printf "\n  Avg: %.1f us (%d rounds)\n", sum_v / count, count
+
+    # Mean absolute deviation
+    mad_sum = 0
+    for (i = 0; i < m; i++) {
+      diff = measured[i] - mean_v
+      mad_sum += (diff < 0 ? -diff : diff)
+    }
+    mad_v = mad_sum / m
+
+    # Standard deviation
+    sq_sum = 0
+    for (i = 0; i < m; i++) {
+      diff = measured[i] - mean_v
+      sq_sum += diff * diff
+    }
+    stddev_v = sqrt(sq_sum / m)
+
+    printf "\n  Mean: %.1f us | Median: %.1f us | Trimmed-mean: %.1f us (%d rounds, %d warmup)\n", \
+      mean_v, median_v, trimmed_v, m, warmup
+    range_v = measured[m - 1] - measured[0]
+    fluct_v = 0
+    if (mean_v > 0) fluct_v = (stddev_v / mean_v) * 100
+    printf "  Range(Max-Min): %.1f us | Avg variation (MAD): %.1f us | Std deviation: %.1f us | Fluctuation rate (CV): %.2f%%\n", \
+      range_v, mad_v, stddev_v, fluct_v
+    printf "  result: %.1f us\n", mean_v
+
+    # Emit machine-readable plot data (one line per measured round, original round numbers)
+    for (i = warmup; i < count; i++)
+      printf "PLOT_DATA:%s,%d,%.1f,%.1f,%.1f\n", example, i, results[i], mean_v, trimmed_v
   }'
 }
 
@@ -186,6 +325,15 @@ wait_for_new_log() {
 # ---------------------------------------------------------------------------
 PASS=0
 FAIL=0
+PLOT_DATA_FILE=""
+if [[ "$PLOT" == "true" ]]; then
+  PLOT_DATA_FILE=$(mktemp)
+fi
+LOG_DIR="$SCRIPT_DIR/benchmark_logs"
+LOG_TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+if [[ "$LOG" == "true" || "$PLOT" == "true" ]]; then
+  mkdir -p "$LOG_DIR"
+fi
 
 for example in "${EXAMPLES[@]}"; do
   EXAMPLE_DIR="$EXAMPLES_DIR/$example"
@@ -207,12 +355,24 @@
   PRE_LOG_FILE=$(mktemp)
   ls -1 "$DEVICE_LOG_DIR"/*.log 2>/dev/null | sort > "$PRE_LOG_FILE" || true
 
-  # Run example
-  if ! python3 "$RUN_EXAMPLE" \
-    -k "$KERNELS_DIR" -g "$GOLDEN" \
-    -p "$PLATFORM" -d "$DEVICE_ID" \
-    -n "$ROUNDS" \
-    "${EXTRA_ARGS[@]}" > /dev/null 2>&1; then
+  # Run example (measured rounds + warm-up rounds)
+  TOTAL_ROUNDS=$(( ROUNDS + WARMUP_ROUNDS ))
+  run_exit=0
+  if [[ "$VERBOSE" == "true" ]]; then
+    python3 "$RUN_EXAMPLE" \
+      -k "$KERNELS_DIR" -g "$GOLDEN" \
+      -p "$PLATFORM" -d "$DEVICE_ID" \
+      -n "$TOTAL_ROUNDS" \
+      "${EXTRA_ARGS[@]}" || run_exit=$?
+  else
+    python3 "$RUN_EXAMPLE" \
+      -k "$KERNELS_DIR" -g "$GOLDEN" \
+      -p "$PLATFORM" -d "$DEVICE_ID" \
+      -n "$TOTAL_ROUNDS" \
+      "${EXTRA_ARGS[@]}" > /dev/null 2>&1 || run_exit=$?
+  fi
+
+  if [[ $run_exit -ne 0 ]]; then
     echo "  FAILED: run_example.py returned non-zero"
     rm -f "$PRE_LOG_FILE"
     ((FAIL++)) || true
@@ -230,13 +390,80 @@
   fi
   echo "  Log: $NEW_LOG"
 
-  if parse_timing "$NEW_LOG"; then
+  # Capture exit status via || so set -e does not abort on a failed parse
+  timing_exit=0
+  timing_output=$(parse_timing "$NEW_LOG" "$WARMUP_ROUNDS" "$example") || timing_exit=$?
+  # Print non-PLOT_DATA lines to stdout
+  echo "$timing_output" | grep -v '^PLOT_DATA:'
+  # Collect PLOT_DATA lines if plotting enabled
+  if [[ "$PLOT" == "true" && -n "$PLOT_DATA_FILE" ]]; then
+    echo "$timing_output" | grep '^PLOT_DATA:' >> "$PLOT_DATA_FILE" || true
+  fi
+  # Save statistics to log file if logging enabled
+  if [[ "$LOG" == "true" && $timing_exit -eq 0 ]]; then
+    LOG_FILE="$LOG_DIR/${example}_${LOG_TIMESTAMP}.log"
+    echo "$timing_output" | grep -v '^PLOT_DATA:' > "$LOG_FILE"
+    echo "  Log saved: $LOG_FILE"
+  fi
+  if [[ $timing_exit -eq 0 ]]; then
     ((PASS++)) || true
   else
     ((FAIL++)) || true
   fi
 done
 
+# ---------------------------------------------------------------------------
+# Generate scatter plots (if enabled)
+# ---------------------------------------------------------------------------
+if [[ "$PLOT" == "true" && -n "$PLOT_DATA_FILE" && -s "$PLOT_DATA_FILE" ]]; then
+  PLOT_TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+  python3 -c "
+import sys, os
+try:
+    import matplotlib
+    matplotlib.use('Agg')
+    import matplotlib.pyplot as plt
+except ImportError:
+    print('  WARNING: matplotlib not available, skipping plot generation', file=sys.stderr)
+    sys.exit(0)
+
+from collections import OrderedDict
+
+data = OrderedDict()  # example -> [(round, elapsed, mean, trimmed)]
+with open('$PLOT_DATA_FILE') as f:
+    for line in f:
+        line = line.strip()
+        if not line.startswith('PLOT_DATA:'):
+            continue
+        parts = line[len('PLOT_DATA:'):].split(',')
+        name, rnd, elapsed, mean, trimmed = parts[0], int(parts[1]), float(parts[2]), float(parts[3]), float(parts[4])
+        data.setdefault(name, []).append((rnd, elapsed, mean, trimmed))
+
+outdir = '$LOG_DIR'
+ts = '$PLOT_TIMESTAMP'
+for name, points in data.items():
+    rounds = [p[0] for p in points]
+    values = [p[1] for p in points]
+    mean_val = points[0][2]
+    trimmed_val = points[0][3]
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    ax.scatter(rounds, values, c='steelblue', s=40, zorder=3, label='Per-round elapsed')
+    ax.axhline(y=mean_val, color='tomato', linestyle='--', linewidth=1.5, label=f'Mean: {mean_val:.1f} us')
+    ax.axhline(y=trimmed_val, color='blue', linestyle='--', linewidth=1.5, label=f'Trimmed-mean: {trimmed_val:.1f} us')
+    ax.set_xlabel('Round')
+    ax.set_ylabel('Elapsed (us)')
+    ax.set_title(f'{name}')
+    ax.legend()
+    ax.grid(True, alpha=0.3)
+
+    outpath = os.path.join(outdir, f'benchmark_{name}_{ts}.png')
+    fig.savefig(outpath, dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f'  Plot saved: {outpath}')
+" || echo "  WARNING: plot generation failed"
+  rm -f "$PLOT_DATA_FILE"
+fi
+
 # ---------------------------------------------------------------------------
 # Summary
 # ---------------------------------------------------------------------------
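Reviewer note: the per-round elapsed computation inside `parse_timing` (earliest `orch_start` tick to latest `end` tick, divided by 50.0 to convert ticks to microseconds) can be sanity-checked with a small Python sketch. The log lines below are hypothetical, and `elapsed_us` is an illustrative helper, not part of the patch:

```python
import re

# Sketch of one round of parse_timing: earliest orch_start to latest end,
# converted from ticks to microseconds by the same /50.0 divisor the awk uses.
def elapsed_us(lines):
    starts = [int(m.group(1)) for l in lines
              if (m := re.search(r"orch_start=(\d+)", l))]
    ends = [int(m.group(1)) for l in lines
            if (m := re.search(r"\bend=(\d+)", l))]
    return (max(ends) - min(starts)) / 50.0

# Hypothetical device-log lines for a single round
sample = [
    "Thread=0 orch_start=1000",
    "Thread=1 orch_start=1010",
    "Thread=0 end=6900",
    "Thread=1 end=7000",
]
print(elapsed_us(sample))  # (7000 - 1000) / 50.0 = 120.0
```

Note the `\b` word boundary in the `end=` pattern, mirroring how the awk code distinguishes `end=` lines from `orch_start=` lines.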