diff --git a/tools/README.md b/tools/README.md index bdede4f5..9599aeab 100644 --- a/tools/README.md +++ b/tools/README.md @@ -1,223 +1,235 @@ -# Swimlane 性能分析工具 +# Swimlane Performance Analysis Tools -本目录包含 PTO Runtime 的性能分析工具。 +This directory contains performance analysis tools for PTO Runtime. -## 工具列表 +## Tool List -- **[swimlane_converter.py](#swimlane_converterpy)** - 转换为 Chrome Trace Event 可视化格式 -- **[sched_overhead_analysis.py](#sched_overhead_analysispy)** - Scheduler 开销分析(Tail OH 分解) -- **[perf_to_mermaid.py](#perf_to_mermaidpy)** - 转换为 Mermaid 依赖图 -- **[benchmark_rounds.sh](#benchmark_roundssh)** - 批量运行 examples 并报告每轮耗时 -- **[device_log_resolver.py](#device_log_resolverpy)** - Device log 路径解析库 +- **[swimlane_converter.py](#swimlane_converterpy)** - Convert to Chrome Trace Event visualization format +- **[sched_overhead_analysis.py](#sched_overhead_analysispy)** - Scheduler overhead analysis (Tail OH breakdown) +- **[perf_to_mermaid.py](#perf_to_mermaidpy)** - Convert to Mermaid dependency graph +- **[benchmark_rounds.sh](#benchmark_roundssh)** - Batch-run examples and report per-round timing +- **[benchmark_config.json](#benchmark_configjson)** - Configuration file for benchmark_rounds.sh +- **[device_log_resolver.py](#device_log_resolverpy)** - Device log path resolution library --- ## swimlane_converter.py -将性能分析数据 JSON 文件转换为 Chrome Trace Event 格式,以便在 Perfetto 中可视化。 +Converts performance data JSON files to Chrome Trace Event format for visualization in Perfetto. -### 功能概述 +### Overview -`swimlane_converter.py` 将 PTO Runtime 的性能分析数据(`perf_swimlane_*.json`)转换为可在 Perfetto 跟踪查看器(https://ui.perfetto.dev/)中可视化的格式。同时提供按函数分组的任务执行统计分析,并在解析到 device log 时输出 scheduler overhead deep-dive 报告。 +`swimlane_converter.py` converts PTO Runtime performance data (`perf_swimlane_*.json`, version 1 or 2) into a format viewable in the Perfetto trace viewer (https://ui.perfetto.dev/). 
It also provides per-function task execution statistics with Exec/Latency analysis, and automatically runs the scheduler overhead deep-dive report when a device log is resolved. -### 基本用法 +For version 2 data, additional tracks are generated: +- **AICPU Scheduler**: scheduler phase bars per thread +- **AICPU Orchestrator**: orchestrator phase bars or summary + +### Basic Usage ```bash -# 自动检测 outputs/ 目录中最新的性能分析文件 +# Auto-detect the latest performance data file in outputs/ python3 tools/swimlane_converter.py -# 指定输入文件 +# Specify an input file python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -# 指定输出文件 +# Specify an output file python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -o custom_output.json -# 从 kernel_config.py 加载函数名映射 +# Load function name mapping from kernel_config.py python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json \ -k examples/host_build_graph/paged_attention/kernels/kernel_config.py -# 使用指定 device id 自动选择 device log(device-) +# Use a specific device id for automatic device log selection (device-) python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -d 0 -# 详细模式(用于调试) +# Verbose mode (for debugging) python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -v ``` -### 命令行选项 +### Command-Line Options -| 选项 | 简写 | 说明 | -|------|------|------| -| `input` | | 输入 JSON 文件(perf_swimlane_*.json)。如果省略,使用 outputs/ 中最新的文件 | -| `--output` | `-o` | 输出 JSON 文件(默认:outputs/merged_swimlane_.json) | -| `--kernel-config` | `-k` | kernel_config.py 文件路径,用于函数名映射 | -| `--device-log` | | 设备日志文件/目录/glob 覆盖输入(优先级最高) | -| `--device-id` | `-d` | 指定 device id,从 `device-` 目录自动选择日志 | -| `--verbose` | `-v` | 启用详细输出 | +| Option | Short | Description | +|--------|-------|-------------| +| `input` | | Input JSON file (perf_swimlane_*.json). 
If omitted, uses the latest file in outputs/ | +| `--output` | `-o` | Output JSON file (default: outputs/merged_swimlane_.json) | +| `--kernel-config` | `-k` | Path to kernel_config.py for function name mapping | +| `--device-log` | | Device log file/path/glob override (highest priority) | +| `--device-id` | `-d` | Device id for automatic log selection from `device-` directory | +| `--verbose` | `-v` | Enable verbose output | -### device log 选择优先级 +### Device Log Selection Priority -`swimlane_converter.py` 和 `sched_overhead_analysis.py` 使用一致的解析规则(由 `device_log_resolver.py` 提供): +`swimlane_converter.py` and `sched_overhead_analysis.py` use consistent resolution rules (provided by `device_log_resolver.py`): -1. `--device-log`(文件/目录/glob)显式覆盖 -2. `-d/--device-id` 对应 `device-` 目录 -3. 自动扫描 `device-*`,选择最接近 perf 时间戳的 `.log` +1. `--device-log` (file/directory/glob) explicit override +2. `-d/--device-id` selects from the `device-` directory +3. Auto-scan all `device-*` directories, choosing the `.log` closest to the perf timestamp -log root 解析顺序: +Log root resolution order: - `$ASCEND_WORK_PATH/log/debug/` -- `~/ascend/log/debug/`(fallback) +- `~/ascend/log/debug/` (fallback) + +### Output -### 输出内容 +The tool generates three types of output: -工具生成三类输出: +#### 1. Perfetto JSON File -#### 1. 
Perfetto JSON 文件 +A Chrome Trace Event format JSON file viewable in Perfetto: +- File location: `outputs/merged_swimlane_.json` +- Open https://ui.perfetto.dev/ and drag the file in to visualize -可在 Perfetto 中可视化的 Chrome Trace Event 格式 JSON 文件: -- 文件位置:`outputs/merged_swimlane_.json` -- 打开 https://ui.perfetto.dev/ 并拖入文件即可可视化 +Trace processes (tracks): +- **AICore View** (pid=1): kernel execution (start_time_us to end_time_us) +- **AICPU View** (pid=2): end-to-end AICPU perspective (dispatch_time_us to finish_time_us) +- **AICPU Scheduler** (pid=3): scheduler phase bars per thread (version 2 only) +- **AICPU Orchestrator** (pid=4): orchestrator phase bars or summary (version 2 only) -#### 2. 任务统计信息 +#### 2. Task Statistics -按函数分组的统计摘要(打印到控制台),包含 Exec/Latency 对比和调度开销分析: +A per-function summary printed to the console, including Exec/Latency comparison and scheduling overhead analysis: -- **Exec**:AICore 上的 kernel 执行时间(end_time - start_time) -- **Latency**:AICPU 视角的端到端延迟(finish_time - dispatch_time,包含 head OH + Exec + tail OH) -- **Head/Tail OH**:调度头部/尾部开销 -- **Exec_%**:Exec / Latency 百分比(kernel 利用率) +- **Exec**: Kernel execution time on AICore (end_time - start_time) +- **Latency**: End-to-end latency from AICPU perspective (finish_time - dispatch_time, includes Head OH + Exec + Tail OH) +- **Head/Tail OH**: Scheduling head/tail overhead +- **Exec_%**: Exec / Latency percentage (kernel utilization) -解析到 device log 时,还会输出 Sched CPU(AICPU scheduler 线程实际 CPU 时间 per task)和 Exec/Sched_CPU 比率。 +When a device log is resolved, also outputs Sched CPU (AICPU scheduler thread actual CPU time per task) and Exec/Sched_CPU ratio. -#### 3. Scheduler overhead deep-dive(自动) +#### 3. 
Scheduler Overhead Deep Dive (automatic) -当 device log 成功解析后,`swimlane_converter.py` 会直接调用 `sched_overhead_analysis` 的分析逻辑,并在同一次运行中输出: +When a device log is successfully resolved, `swimlane_converter.py` directly invokes the `sched_overhead_analysis` analysis logic and outputs in the same run: - Part 1: Per-task time breakdown - Part 2: AICPU scheduler loop breakdown - Part 3: Tail OH distribution & cause analysis -### 与 run_example.py 集成 +### Integration with run_example.py -启用性能分析运行测试时,转换器会自动调用: +When running tests with profiling enabled, the converter is automatically invoked: ```bash -# 运行测试并启用性能分析 - 测试通过后自动生成 merged_swimlane.json +# Run a test with profiling enabled — merged_swimlane.json is auto-generated after the test passes python examples/scripts/run_example.py \ -k examples/host_build_graph/vector_example/kernels \ -g examples/host_build_graph/vector_example/golden.py \ --enable-profiling ``` -测试通过后,工具将: -1. 自动检测 outputs/ 中最新的 `perf_swimlane_*.json` -2. 从 `-k` 指定的 kernel_config.py 加载函数名 -3. 把运行时有效 device id(`-d`)透传给 `swimlane_converter.py` -4. 自动解析 device log 并输出选择策略 -5. 生成 `merged_swimlane_*.json` 用于可视化 -6. 将任务统计与 scheduler overhead deep-dive 报告打印到控制台 +After the test passes, the tool will: +1. Auto-detect the latest `perf_swimlane_*.json` in outputs/ +2. Load function names from the kernel_config.py specified by `-k` +3. Pass the runtime device id (`-d`) through to `swimlane_converter.py` +4. Auto-resolve the device log and print the selection strategy +5. Generate `merged_swimlane_*.json` for visualization +6. Print task statistics and the scheduler overhead deep-dive report to the console --- ## sched_overhead_analysis.py -分析 AICPU scheduler 的调度开销,定量分解 Tail OH(任务完成到 scheduler 确认之间的延迟)的来源。 +Analyzes AICPU scheduler overhead, quantitatively breaking down the sources of Tail OH (the delay between task completion and scheduler acknowledgment). -### 功能概述 +### Overview -`sched_overhead_analysis.py` 从两个数据源进行分析: -1. 
**Perf profiling 数据**(`perf_swimlane_*.json`):提取每个 task 的 Exec / Head OH / Tail OH 时间分解 -2. **设备日志**(device log):解析 AICPU scheduler 线程的循环分解(scan / complete / dispatch / idle)、锁竞争和 fanout 统计 +`sched_overhead_analysis.py` performs analysis from two data sources: +1. **Perf profiling data** (`perf_swimlane_*.json`): Extracts per-task Exec / Head OH / Tail OH time breakdown +2. **Scheduler loop breakdown** with two sources (in priority order): + - **Perf JSON phase data** (version >= 2, preferred): Reads `aicpu_scheduler_phases` records directly from the perf JSON + - **Device log** (fallback for older data or `PTO2_SCHED_PROFILING=1` details) -支持三种 device log 格式: -1. **New two-level tree**(`PTO2_SCHED_PROFILING=1`):`=== Scheduler Phase Breakdown: total=Xus, Y tasks ===`,后跟各 phase 行 -2. **Legacy detailed**(`PTO2_SCHED_PROFILING=1`):`completed=X tasks in Yus (Z loops, W tasks/loop)`,后跟 `--- Phase Breakdown ---` 及带 fanout/fanin/pop 统计的 phase 行 -3. **Summary**(`PTO2_SCHED_PROFILING=0`):`Scheduler summary: total_time=Xus, loops=Y, tasks_scheduled=Z` +Device log supports two formats: +1. **New two-level tree** (`PTO2_SCHED_PROFILING=1`): `=== Scheduler Phase Breakdown: total=Xus, Y tasks ===`, followed by per-phase lines with fanout/fanin/pop statistics +2. 
**Summary** (`PTO2_SCHED_PROFILING=0`): `Scheduler summary: total_time=Xus, loops=Y, tasks_scheduled=Z` -### 基本用法 +### Basic Usage ```bash -# 自动选取最新的 perf 数据和设备日志 +# Auto-select the latest perf data and device log python3 tools/sched_overhead_analysis.py -# 指定 device id 自动选取 device- 日志 +# Specify device id for automatic log selection from device- python3 tools/sched_overhead_analysis.py --perf-json outputs/perf_swimlane_20260210_143526.json -d 0 -# 指定文件 +# Specify files explicitly python3 tools/sched_overhead_analysis.py \ --perf-json outputs/perf_swimlane_20260210_143526.json \ --device-log ~/ascend/log/debug/device-0/device-*.log ``` -### 命令行选项 +### Command-Line Options -| 选项 | 说明 | -|------|------| -| `--perf-json` | perf_swimlane_*.json 文件路径。省略时自动选取 outputs/ 中最新的文件 | -| `--device-log` | 设备日志文件/目录/glob 覆盖输入(优先级最高) | -| `-d, --device-id` | 指定 device id,从 `device-` 自动选取日志 | +| Option | Description | +|--------|-------------| +| `--perf-json` | Path to perf_swimlane_*.json file. If omitted, auto-selects the latest in outputs/ | +| `--device-log` | Device log file/path/glob override (highest priority) | +| `-d, --device-id` | Device id for automatic log selection from `device-` | -### 输出内容 +### Output -分三部分输出: +Output is divided into three parts: -- **Part 1:Per-task time breakdown** — Exec / Head OH / Tail OH 各占 Latency 的百分比 -- **Part 2:AICPU scheduler loop breakdown** — 各 scheduler 线程的循环统计、各阶段(scan / complete / dispatch / idle)耗时占比、锁竞争和 fanout/fanin/pop 统计 -- **Part 3:Tail OH distribution & cause analysis** — Tail OH 分位数分布(P10~P99)、scheduler 循环迭代耗时与 Tail OH 的关联分析、主导 phase 的数据驱动洞察 +- **Part 1: Per-task time breakdown** — Exec / Head OH / Tail OH as percentages of Latency +- **Part 2: AICPU scheduler loop breakdown** — Per-thread loop statistics, phase breakdown (scan / complete / dispatch / idle) with time percentages, fanout/fanin/pop statistics. 
Data source is either perf JSON phase data (version >= 2) or device log (fallback) +- **Part 3: Tail OH distribution & cause analysis** — Tail OH percentile distribution (P10–P99), correlation analysis between scheduler loop iteration time and Tail OH, data-driven insights on the dominant phase --- ## perf_to_mermaid.py -将性能分析数据转换为 Mermaid 流程图格式,可视化任务依赖关系。 +Converts performance data into Mermaid flowchart format to visualize task dependencies. -### 功能概述 +### Overview -`perf_to_mermaid.py` 将 PTO Runtime 的性能分析数据(`perf_swimlane_*.json`)转换为 Mermaid 流程图格式。生成的 Markdown 文件可以: -- 在 GitHub/GitLab 中直接渲染 -- 在 https://mermaid.live/ 中查看 -- 在支持 Mermaid 的编辑器中查看(如 VS Code + Mermaid 插件) +`perf_to_mermaid.py` converts PTO Runtime performance data (`perf_swimlane_*.json`) into Mermaid flowchart format. The generated Markdown file can be: +- Rendered directly in GitHub/GitLab +- Viewed at https://mermaid.live/ +- Viewed in editors with Mermaid support (e.g., VS Code + Mermaid extension) -### 基本用法 +### Basic Usage ```bash -# 自动检测 outputs/ 目录中最新的性能分析文件 +# Auto-detect the latest performance data file in outputs/ python3 tools/perf_to_mermaid.py -# 指定输入文件 +# Specify an input file python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json -# 指定输出文件 +# Specify an output file python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json -o diagram.md -# 从 kernel_config.py 加载函数名映射 +# Load function name mapping from kernel_config.py python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json \ -k examples/host_build_graph/paged_attention/kernels/kernel_config.py -# 使用紧凑样式(仅显示任务ID和函数名) +# Use compact style (shows only task ID) python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json --style compact -# 指定流程图方向(从左到右) +# Specify flowchart direction (left to right) python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json --direction LR -# 详细模式 +# Verbose mode python3 tools/perf_to_mermaid.py 
outputs/perf_swimlane_20260210_143526.json -v ``` -### 命令行选项 +### Command-Line Options -| 选项 | 简写 | 说明 | -|------|------|------| -| `input` | | 输入 JSON 文件(perf_swimlane_*.json)。如果省略,使用 outputs/ 中最新的文件 | -| `--output` | `-o` | 输出 Markdown 文件(默认:outputs/mermaid_diagram_.md) | -| `--kernel-config` | `-k` | kernel_config.py 文件路径,用于函数名映射 | -| `--style` | | 节点样式:`detailed`(默认,包含函数名和任务ID)或 `compact`(仅任务ID)| -| `--direction` | | 流程图方向:`TD`(从上到下,默认)或 `LR`(从左到右)| -| `--verbose` | `-v` | 启用详细输出 | +| Option | Short | Description | +|--------|-------|-------------| +| `input` | | Input JSON file (perf_swimlane_*.json). If omitted, uses the latest file in outputs/ | +| `--output` | `-o` | Output Markdown file (default: outputs/mermaid_diagram_.md) | +| `--kernel-config` | `-k` | Path to kernel_config.py for function name mapping | +| `--style` | | Node style: `detailed` (default, shows function name and task ID) or `compact` (task ID only) | +| `--direction` | | Flowchart direction: `TD` (top-down, default) or `LR` (left-right) | +| `--verbose` | `-v` | Enable verbose output | -### 输出内容 +### Output -生成包含 Mermaid 流程图的 Markdown 文件: +Generates a Markdown file containing a Mermaid flowchart: -#### Detailed 样式(默认) +#### Detailed Style (default) ```mermaid flowchart TD @@ -266,81 +278,118 @@ flowchart TD ## benchmark_rounds.sh -批量运行预定义的 examples,解析 device log 中的 timing 行并报告每轮耗时。 +Batch-runs predefined examples on hardware, parses device log timing lines, and reports per-round latency with statistical analysis. 
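The per-round figures the script reports are derived from tick timestamps in the device log. As a rough sketch of that computation — assuming the 50-ticks-per-microsecond conversion described in README_benchmark.md's timing-analysis section; the function name and tick values below are illustrative, not the script's actual internals:

```python
# Illustrative: one round's elapsed time from device-log tick timestamps.
# Assumes 50 clock ticks per microsecond (the conversion the script uses).
TICKS_PER_US = 50.0

def round_elapsed_us(orch_starts, ends):
    """Earliest orch_start to latest end for a round, converted to microseconds."""
    return (max(ends) - min(orch_starts)) / TICKS_PER_US

# Hypothetical tick values for a single round:
print(round_elapsed_us([1_000_000, 1_000_040], [1_005_900, 1_005_950]))
```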
-### 功能概述
-`benchmark_rounds.sh` 遍历 `EXAMPLES` 数组中配置的测试用例(位于 `tests/device_tests/tensormap_and_ringbuffer/` 下),依次调用 `run_example.py` 运行每个 example,然后从生成的 device log 中提取 `orch_start` / `end` 时间戳计算每轮 elapsed 时间。
+### Overview
+`benchmark_rounds.sh` loads configuration from `benchmark_config.json`, iterates over the configured example list (under `tests/device_tests/<platform>/tensormap_and_ringbuffer/` by default), invokes `run_example.py` for each example, then extracts `orch_start` / `end` timestamps from the generated device log to compute per-round elapsed time. It supports warm-up rounds (discarded before statistics), and reports mean, median, trimmed mean, range, MAD, standard deviation, and fluctuation rate (CV). It can optionally generate scatter-plot PNGs and save statistics logs.
-当前预配置的 examples:
+Currently configured examples (in `benchmark_config.json`):
 - `alternating_matmul_add`
 - `benchmark_bgemm`
+- `paged_attention_unroll`
 - `batch_paged_attention`
 - `paged_attention`
-### 基本用法
+### Basic Usage
 ```bash
-# 使用默认参数(device 0, 10 rounds)
+# Use defaults from benchmark_config.json (device 0, 10 rounds, 2 warmup, platform a2a3)
 ./tools/benchmark_rounds.sh
-# 指定 device 和 rounds
-./tools/benchmark_rounds.sh -d 4 -n 20
+# Specify a custom config file
+./tools/benchmark_rounds.sh -c path/to/config.json
+
+# Specify platform, device, rounds, and warmup
+./tools/benchmark_rounds.sh -p a2a3 -d 4 -n 20 -w 3
-# 额外参数透传给 run_example.py
+# Enable verbose output and scatter plots
+./tools/benchmark_rounds.sh -v --plot
+
+# Extra arguments are passed through to run_example.py
 ./tools/benchmark_rounds.sh -d 0 -n 5 --case 1
 ```
-### 命令行选项
+### Command-Line Options
-| 选项 | 简写 | 说明 |
-|------|------|------|
-| `--device` | `-d` | Device ID(默认:0) |
-| `--rounds` | `-n` | 每个 example 的运行轮数(默认:10) |
-| `--help` | `-h` | 显示帮助信息 |
+| Option | Short | Description |
+|--------|-------|-------------|
+| `--config` | `-c` | Path to JSON config file (default: benchmark_config.json next to the script)
| +| `--platform` | `-p` | Platform to run on (config default: `a2a3`) | +| `--device` | `-d` | Device ID (config default: `0`) | +| `--rounds` | `-n` | Number of measured rounds per example (config default: `10`) | +| `--warmup` | `-w` | Number of warm-up rounds to discard (config default: `2`) | +| `--verbose` | `-v` | Print detailed run_example.py output (config default: false) | +| `--plot` | | Generate scatter plot PNG for each example (config default: false) | +| `--log` | | Save statistics to `benchmark_logs/` for each example (config default: true) | +| `--help` | `-h` | Show help message | -所有未识别的参数会透传给 `run_example.py`。 +All unrecognized arguments are passed through to `run_example.py`. CLI arguments override values from the config file. -### 输出内容 +### Output -对每个 example 输出: -- 每轮的 Elapsed 时间(微秒) -- 平均耗时和总轮数 +For each example: +- Per-round elapsed time in microseconds, with warm-up rounds clearly marked +- Mean, median, and trimmed mean (excluding min & max) +- Range (max - min), mean absolute deviation (MAD), standard deviation, and fluctuation rate (CV%) +- Optional scatter plot PNG (with `--plot`) +- Optional statistics log file in `benchmark_logs/` (with `--log`) -最终输出汇总:passed / failed 数量。 +Final summary: passed / failed count. -### Device log 解析 +### Device Log Resolution -脚本通过以下方式定位 device log: -- 优先使用 `$ASCEND_WORK_PATH/log/debug/device-/` -- Fallback 到 `~/ascend/log/debug/device-/` -- 在运行前快照已有 log 文件,运行后等待新 log 文件出现(最多 15 秒) +The script locates device logs as follows: +- Uses `$ASCEND_WORK_PATH/log/debug/device-/` first +- Falls back to `~/ascend/log/debug/device-/` +- Snapshots existing log files before running, then waits for a new log file to appear (up to 15 seconds) + +--- + +## benchmark_config.json + +Configuration file for `benchmark_rounds.sh`. Defines default settings and the list of examples to benchmark. 
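For reference, a complete configuration using the defaults above (the same example configuration shown in README_benchmark.md):

```json
{
  "project_root": "..",
  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
  "examples": [
    "alternating_matmul_add",
    "benchmark_bgemm",
    "paged_attention_unroll",
    "batch_paged_attention",
    "paged_attention"
  ],
  "device_id": 0,
  "rounds": 10,
  "warmup_rounds": 2,
  "platform": "a2a3",
  "verbose": false,
  "log": true,
  "plot": false
}
```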
+ +### Fields + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `project_root` | string | `".."` | Relative path from the script to the project root | +| `examples_subdir` | string | `"tests/device_tests/${platform}/tensormap_and_ringbuffer"` | Subdirectory under project root containing examples. `${platform}` is substituted at runtime | +| `examples` | list | | List of example directory names to benchmark | +| `device_id` | int | `0` | Default device ID | +| `rounds` | int | `10` | Default number of measured rounds | +| `warmup_rounds` | int | `2` | Default number of warm-up rounds | +| `platform` | string | `"a2a3"` | Default platform | +| `verbose` | bool | `false` | Whether to print detailed output | +| `log` | bool | `true` | Whether to save statistics logs | +| `plot` | bool | `false` | Whether to generate scatter plots | --- ## device_log_resolver.py -Device log 路径解析库,被 `swimlane_converter.py` 和 `sched_overhead_analysis.py` 共同使用。 +Device log path resolution library, used by both `swimlane_converter.py` and `sched_overhead_analysis.py`. -### 功能概述 +### Overview -`device_log_resolver.py` 提供确定性的 device log 路径解析逻辑,支持三种选择优先级: +`device_log_resolver.py` provides deterministic device log path resolution logic, supporting three selection priorities: -1. **显式路径**(`--device-log`):支持文件、目录、glob 模式 -2. **Device ID**(`--device-id`):从 `/device-/` 选择最新 `.log` -3. **自动扫描**:遍历所有 `device-*` 目录,选择与 perf 时间戳最接近的 `.log` +1. **Explicit path** (`--device-log`): Supports file, directory, or glob pattern +2. **Device ID** (`--device-id`): Selects the newest `.log` from `/device-/` +3. 
**Auto-scan**: Traverses all `device-*` directories, selecting the `.log` closest to the perf timestamp -### 主要函数 +### Main Functions -| 函数 | 说明 | -|------|------| -| `get_log_root()` | 返回 log root 路径(`$ASCEND_WORK_PATH/log/debug/` 或 `~/ascend/log/debug/`) | -| `infer_device_id_from_log_path(log_path)` | 从路径中推断 device id(如 `device-0`) | -| `resolve_device_log_path(device_id, device_log, perf_path)` | 按优先级解析 device log 路径,返回 `(Path, strategy_string)` | +| Function | Description | +|----------|-------------| +| `get_log_root()` | Returns the log root path (`$ASCEND_WORK_PATH/log/debug/` or `~/ascend/log/debug/`) | +| `infer_device_id_from_log_path(log_path)` | Infers device id from the path (e.g., `device-0`) | +| `resolve_device_log_path(device_id, device_log, perf_path)` | Resolves device log path by priority, returns `(Path, strategy_string)` | -### 使用方式 +### Usage -该模块不作为独立命令行工具使用,而是被其他工具导入: +This module is not used as a standalone command-line tool; it is imported by other tools: ```python from device_log_resolver import resolve_device_log_path @@ -354,11 +403,11 @@ log_path, strategy = resolve_device_log_path( --- -## 共同配置 +## Common Configuration -### 输入文件格式 +### Input File Format -分析工具共用相同的输入格式 - PTO Runtime 生成的 `perf_swimlane_*.json` 文件: +The analysis tools share the same input format — `perf_swimlane_*.json` files generated by PTO Runtime: ```json { @@ -380,119 +429,123 @@ log_path, strategy = resolve_device_log_path( } ``` -### Kernel Config 格式 +Version 2 data additionally includes `aicpu_scheduler_phases`, `aicpu_orchestrator`, `aicpu_orchestrator_phases`, and `core_to_thread` fields. + +### Kernel Config Format -要在输出中显示有意义的函数名,需要提供 `kernel_config.py` 文件: +To display meaningful function names in the output, provide a `kernel_config.py` file: ```python KERNELS = [ { "func_id": 0, "name": "QK", - # ... 其他字段 + # ... other fields }, { "func_id": 1, "name": "SF", - # ... 其他字段 + # ... 
other fields }, ] ``` -工具从 `KERNELS` 列表中提取 `func_id` 到 `name` 的映射。 +The tools extract a `func_id` to `name` mapping from the `KERNELS` list. --- -## 工具选择建议 +## Tool Selection Guide -### 使用 swimlane_converter.py 当你需要: -- 查看详细的时间线执行视图 -- 分析任务在不同核心上的调度情况 -- 查看精确的执行时间和时间间隔 -- 获取任务执行的统计信息 -- 专业的性能分析和优化 +### Use swimlane_converter.py when you need to: +- View a detailed timeline execution view +- Analyze task scheduling across different cores +- See precise execution times and intervals +- Get task execution statistics +- Perform professional performance analysis and optimization -### 使用 perf_to_mermaid.py 当你需要: -- 快速查看任务依赖关系 -- 在文档中嵌入依赖图 -- 在代码审查中分享依赖结构 -- 不需要时间线细节,只关注拓扑结构 -- 在 GitHub/GitLab 中直接查看 +### Use perf_to_mermaid.py when you need to: +- Quickly view task dependencies +- Embed dependency graphs in documentation +- Share dependency structure in code reviews +- Focus on topology rather than timeline details +- View directly in GitHub/GitLab -### 使用 benchmark_rounds.sh 当你需要: -- 批量运行多个 examples 并对比耗时 -- 获取每轮的 elapsed 时间统计 -- 在硬件上做端到端性能回归测试 +### Use benchmark_rounds.sh when you need to: +- Batch-run multiple examples and compare timing +- Get per-round elapsed time with statistical analysis +- Run end-to-end performance regression tests on hardware -### 推荐工作流 +### Recommended Workflow ```bash -# 1. 运行测试获取性能数据 +# 1. Run a test to collect performance data python examples/scripts/run_example.py -k ./kernels -g ./golden.py --enable-profiling -# 2. 生成 Perfetto 可视化(自动) +# 2. Generate Perfetto visualization (automatic) # → outputs/merged_swimlane_*.json -# 3. 生成 Mermaid 依赖图 +# 3. Generate Mermaid dependency graph python3 tools/perf_to_mermaid.py -k ./kernels/kernel_config.py -# 4. 批量 benchmark(硬件上) +# 4. Batch benchmark (on hardware) ./tools/benchmark_rounds.sh -d 0 -n 20 -# 5. 分析结果 -# - 详细性能分析:Perfetto (https://ui.perfetto.dev/) -# - 依赖关系概览:Mermaid 图(GitHub/编辑器) -# - 统计摘要:控制台输出 +# 5. 
Analyze results +# - Detailed performance analysis: Perfetto (https://ui.perfetto.dev/) +# - Dependency overview: Mermaid diagram (GitHub / editor) +# - Statistical summary: console output ``` --- -## 故障排查 +## Troubleshooting -### 错误:找不到 perf_swimlane_*.json 文件 -- 确保使用 `--enable-profiling` 标志运行了测试 -- 检查 outputs/ 目录是否存在并包含性能分析数据 +### Error: Cannot find perf_swimlane_*.json file +- Make sure you ran the test with the `--enable-profiling` flag +- Check that the outputs/ directory exists and contains performance data -### 警告:Kernel entry missing 'func_id' or 'name' -- 检查 kernel_config.py 文件格式 -- 确保所有 KERNELS 条目都有 'func_id' 和 'name' 字段 +### Warning: Kernel entry missing 'func_id' or 'name' +- Check the kernel_config.py file format +- Ensure all KERNELS entries have 'func_id' and 'name' fields -### 错误:Unsupported version -- 工具仅支持版本 1 的性能分析数据格式 -- 使用最新的 runtime 重新生成性能分析数据 +### Error: Unsupported version +- The tools support version 1 and version 2 performance data formats +- Regenerate performance data using the latest runtime if needed -### 错误:Perf JSON missing required fields for scheduler overhead analysis -- 该错误表示输入的 `perf_swimlane_*.json` 缺少 deep-dive 分析需要的字段(通常是 `dispatch_time_us` / `finish_time_us`) -- `swimlane_converter.py` 的基础转换可继续成功,但 deep-dive 会跳过或失败 -- 处理路径: - 1. 使用 `--enable-profiling` 重新跑一次,生成新的 `outputs/perf_swimlane_*.json` - 2. 重新执行 `swimlane_converter.py` 或 `sched_overhead_analysis.py` - 3. 检查 JSON 中每个 task 是否包含 `dispatch_time_us` 和 `finish_time_us` +### Error: Perf JSON missing required fields for scheduler overhead analysis +- This error indicates that the input `perf_swimlane_*.json` is missing fields needed for the deep-dive analysis (typically `dispatch_time_us` / `finish_time_us`) +- The basic conversion by `swimlane_converter.py` can still succeed, but the deep-dive will be skipped or fail +- Resolution: + 1. Re-run with `--enable-profiling` to generate a new `outputs/perf_swimlane_*.json` + 2. 
Re-run `swimlane_converter.py` or `sched_overhead_analysis.py` + 3. Verify that each task in the JSON includes `dispatch_time_us` and `finish_time_us` -### benchmark_rounds.sh 无 timing 数据 -- 确保运行时启用了 profiling(`PTO2_PROFILING` 环境变量) -- 检查 device log 目录是否可访问 -- 确认 log 中包含 `orch_start` / `end` 时间戳行 +### benchmark_rounds.sh has no timing data +- Ensure profiling was enabled at runtime (`PTO2_PROFILING` environment variable) +- Check that the device log directory is accessible +- Confirm the log contains `orch_start` / `end` timestamp lines -### Mermaid 图在 GitHub 上不显示 -- 确保文件是 `.md` 扩展名 -- 检查 Mermaid 语法是否正确 -- GitHub 有时需要刷新才能渲染 Mermaid 图 +### Mermaid diagram does not render on GitHub +- Ensure the file has a `.md` extension +- Check that the Mermaid syntax is correct +- GitHub may need a refresh to render Mermaid diagrams --- -## 输出文件说明 +## Output File Reference -| 文件 | 工具 | 用途 | 格式 | -|------|------|------|------| -| `perf_swimlane_*.json` | Runtime | 原始性能分析数据 | JSON | -| `merged_swimlane_*.json` | swimlane_converter.py | Perfetto 可视化 | Chrome Trace Event JSON | -| `mermaid_diagram_*.md` | perf_to_mermaid.py | 依赖关系图 | Markdown + Mermaid | +| File | Tool | Purpose | Format | +|------|------|---------|--------| +| `perf_swimlane_*.json` | Runtime | Raw performance data | JSON | +| `merged_swimlane_*.json` | swimlane_converter.py | Perfetto visualization | Chrome Trace Event JSON | +| `mermaid_diagram_*.md` | perf_to_mermaid.py | Dependency graph | Markdown + Mermaid | +| `benchmark_logs/*.log` | benchmark_rounds.sh | Benchmark statistics | Text | +| `benchmark_logs/*.png` | benchmark_rounds.sh | Scatter plots (optional) | PNG image | --- -## 相关资源 +## Related Resources - [Perfetto Trace Viewer](https://ui.perfetto.dev/) - [Mermaid Live Editor](https://mermaid.live/) -- [Mermaid 文档](https://mermaid.js.org/) +- [Mermaid Documentation](https://mermaid.js.org/) diff --git a/tools/README_benchmark.md b/tools/README_benchmark.md new file mode 100644 index 00000000..79b976cb 
--- /dev/null
+++ b/tools/README_benchmark.md
@@ -0,0 +1,169 @@
+# benchmark_rounds.sh
+
+A benchmark wrapper that runs device test examples on Ascend hardware, parses per-round timing data from device logs, and reports latency statistics.
+
+## Quick Start
+
+```bash
+# Run with defaults from benchmark_config.json
+./tools/benchmark_rounds.sh
+
+# Run with custom rounds and verbose output
+./tools/benchmark_rounds.sh -n 20 -w 3 -v
+
+# Use a different config file
+./tools/benchmark_rounds.sh -c /path/to/my_config.json
+
+# Generate plots and save logs
+./tools/benchmark_rounds.sh --plot --log
+```
+
+## Usage
+
+```
+./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v] [--plot] [--log]
+```
+
+## Command-Line Options
+
+All CLI arguments override their corresponding values from the config file.
+
+| Option | Long Form | Description | Config Key | Default |
+|--------|-----------|-------------|------------|---------|
+| `-c` | `--config` | Path to JSON config file | — | `tools/benchmark_config.json` |
+| `-p` | `--platform` | Platform to run on | `platform` | `a2a3` |
+| `-d` | `--device` | Device ID | `device_id` | `0` |
+| `-n` | `--rounds` | Number of measured rounds per example | `rounds` | `10` |
+| `-w` | `--warmup` | Number of warm-up rounds to discard | `warmup_rounds` | `2` |
+| `-v` | `--verbose` | Print detailed `run_example.py` output | `verbose` | `false` |
+| | `--plot` | Generate scatter plot PNG for each example | `plot` | `false` |
+| | `--log` | Save statistics to `benchmark_logs/` | `log` | `true` |
+| `-h` | `--help` | Show help message | — | — |
+
+Any unrecognized arguments are passed through to `run_example.py` (e.g., `--case`).
+
+## Configuration File
+
+The script loads settings from a JSON config file (`benchmark_config.json` by default, located next to the script). Example:
+
+```json
+{
+  "project_root": "..",
+  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
+  "examples": [
+    "alternating_matmul_add",
+    "benchmark_bgemm",
+    "paged_attention_unroll",
+    "batch_paged_attention",
+    "paged_attention"
+  ],
+  "device_id": 0,
+  "rounds": 10,
+  "warmup_rounds": 2,
+  "platform": "a2a3",
+  "verbose": false,
+  "log": true,
+  "plot": false
+}
+```
+
+### Config Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `project_root` | string | Relative path from the script directory to the project root. |
+| `examples_subdir` | string | Subdirectory under the project root containing examples. The `${platform}` placeholder is replaced with the active platform value. |
+| `examples` | array | List of example directory names to benchmark. |
+| `device_id` | int | Ascend device ID to run on. |
+| `rounds` | int | Number of measured (non-warmup) rounds. |
+| `warmup_rounds` | int | Number of initial rounds to discard before measuring. |
+| `platform` | string | Target platform (e.g., `a2a3` for hardware, `a2a3sim` for simulation). |
+| `verbose` | bool | Whether to show full `run_example.py` output. |
+| `log` | bool | Whether to save per-example statistics to `benchmark_logs/`. |
+| `plot` | bool | Whether to generate scatter plot PNGs (requires `matplotlib`). |
+
+## Execution Logic
+
+The script follows these steps:
+
+### 1. Load Configuration
+
+Configuration is loaded from the JSON file using `python3`. The config file path defaults to `benchmark_config.json` in the same directory as the script. If `-c` is provided on the command line, that config file is loaded instead. CLI arguments then override any config values.
+
+### 2. Resolve Paths
+
+- The `examples_subdir` path has its `${platform}` placeholder replaced with the actual platform value.
+- The device log directory is resolved from `$ASCEND_WORK_PATH/log/debug/device-<device_id>` (falling back to `$HOME/ascend/log/debug/device-<device_id>`).
+
+### 3. Iterate Over Examples
+
+For each example listed in the config:
+
+1. **Validate**: Check that both `golden.py` and `kernels/` exist in the example directory. Skip otherwise.
+2. **Snapshot logs**: Record the list of existing `.log` files in the device log directory (used later to detect the newly created log).
+3. **Run**: Execute `run_example.py` with the example's kernels directory and golden file. The total number of rounds passed is `rounds + warmup_rounds`. In non-verbose mode, stdout/stderr are suppressed.
+4. **Find new log**: Wait up to 15 seconds for a new `.log` file to appear in the device log directory. Falls back to the newest log file if no new file is detected.
+5. **Parse timing**: Extract `orch_start` and `end` timestamps from the device log and compute per-round elapsed time.
+
+### 4. Timing Analysis
+
+The `parse_timing` function processes device log lines containing `orch_start=` and `end=` markers:
+
+- For each round, it computes elapsed time as `(max_end - min_start) / 50.0` microseconds (clock ticks to microseconds conversion).
+- The first N rounds (controlled by `--warmup`) are labeled as warmup and excluded from statistics.
+- For the remaining measured rounds, it computes:
+  - **Arithmetic mean**: Average of all measured rounds.
+  - **Median**: Middle value of the sorted measurements.
+  - **Trimmed mean**: Mean after dropping the single lowest and highest values (requires at least 3 samples).
+  - **Range (Max-Min)**: Difference between the highest and lowest measurements.
+  - **Mean absolute deviation (MAD)**: Average absolute distance from the mean.
+  - **Standard deviation**: Root-mean-square deviation from the mean.
+  - **Fluctuation rate (CV)**: Coefficient of variation, i.e., `(stddev / mean) * 100%`.
+
+### 5. Optional: Save Logs
+
+When `--log` is enabled, per-example statistics are saved to `tools/benchmark_logs/<example>_<timestamp>.log`.
+
+### 6. Optional: Generate Plots
+
+When `--plot` is enabled, the script collects per-round data and generates a scatter plot PNG for each example using `matplotlib`. Each plot shows:
+
+- Individual per-round elapsed times as scatter points.
+- Horizontal reference lines for the arithmetic mean and the trimmed mean.
+
+Plots are saved to `tools/benchmark_logs/benchmark_<example>_<timestamp>.png`. If `matplotlib` is not installed, a warning is printed and plot generation is skipped.
+
+### 7. Summary
+
+After all examples complete, a summary line reports the number of passed and failed examples. The script exits with a non-zero status if any example failed.
+
+## Output Example
+
+```
+================================================================
+  alternating_matmul_add
+================================================================
+  Log: /path/to/device-0/xxx.log
+  Round        Elapsed (us)
+  -----        ------------
+  0                   123.4 (warmup)
+  1                   121.8 (warmup)
+  2                   118.2
+  3                   117.9
+  ...
+
+  Mean: 118.1 us | Median: 118.0 us | Trimmed-mean: 118.0 us (10 rounds, 2 warmup)
+  Range(Max-Min): 2.3 us | Avg variation (MAD): 0.8 us | Std deviation: 0.9 us | Fluctuation rate (CV): 0.76%
+  result: 118.1 us
+
+================================================================
+  Benchmark complete: 5 passed, 0 failed (5 total)
+================================================================
+```
+
+## Prerequisites
+
+- **python3**: Required for config parsing, running examples, and optional plot generation.
+- **Ascend device**: Required for hardware benchmarking (platform `a2a3`).
+- **matplotlib** (optional): Required only when `--plot` is enabled.
+- **PTO2_PROFILING**: Must be enabled in the device environment to produce the `orch_start`/`end` timing markers in device logs.
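Reviewer note: the step-4 statistics above can be sketched in Python for clarity. This is an illustrative sketch, not part of the scripts being added; `round_stats` is a hypothetical helper name. It mirrors the awk code's choices: population standard deviation (divide by n), trimmed mean dropping exactly one min and one max, and CV as a percentage of the mean.

```python
# Hypothetical helper mirroring the statistics computed by parse_timing's
# awk block (not part of benchmark_rounds.sh).
def round_stats(measured):
    m = sorted(measured)
    n = len(m)
    mean = sum(m) / n
    # Median: middle element, or average of the two middle elements
    median = m[n // 2] if n % 2 else (m[n // 2 - 1] + m[n // 2]) / 2
    # Trimmed mean: drop the single lowest and highest values (needs n >= 3)
    trimmed = sum(m[1:-1]) / (n - 2) if n >= 3 else mean
    mad = sum(abs(x - mean) for x in m) / n                 # mean absolute deviation
    stddev = (sum((x - mean) ** 2 for x in m) / n) ** 0.5   # population stddev
    cv = (stddev / mean) * 100 if mean > 0 else 0.0         # fluctuation rate, %
    return {"mean": mean, "median": median, "trimmed": trimmed,
            "range": m[-1] - m[0], "mad": mad, "stddev": stddev, "cv": cv}
```

For example, `round_stats([1.0, 2.0, 3.0, 10.0])` gives mean 4.0, median 2.5, trimmed mean 2.5, range 9.0, and MAD 3.0; the single outlier moves the mean well above the median, which is why the script reports all three.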
diff --git a/tools/benchmark_config.json b/tools/benchmark_config.json
new file mode 100644
index 00000000..1e9647ed
--- /dev/null
+++ b/tools/benchmark_config.json
@@ -0,0 +1,18 @@
+{
+  "project_root": "..",
+  "examples_subdir": "tests/device_tests/${platform}/tensormap_and_ringbuffer",
+  "examples": [
+    "alternating_matmul_add",
+    "benchmark_bgemm",
+    "paged_attention_unroll",
+    "batch_paged_attention",
+    "paged_attention"
+  ],
+  "device_id": 0,
+  "rounds": 10,
+  "warmup_rounds": 2,
+  "platform": "a2a3",
+  "verbose": false,
+  "log": true,
+  "plot": false
+}
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh
index 0d630ae8..d7478d04 100755
--- a/tools/benchmark_rounds.sh
+++ b/tools/benchmark_rounds.sh
@@ -3,38 +3,76 @@
 # then parse device-log timing lines to report per-round latency.
 #
 # Usage:
-#   ./tools/benchmark_rounds.sh [-p <platform>] [-d <device>] [-n <rounds>]
+#   ./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v]
 #
-# Runs all examples listed in EXAMPLES array and prints timing for each.
+# Runs all examples listed in the config file and prints timing for each.
+# Configuration is loaded from benchmark_config.json (next to this script by default).
+# CLI arguments override config file values.
 
 set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
-RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py"
 
 # ---------------------------------------------------------------------------
-# Examples to benchmark (paths relative to tests/device_tests/<platform>/tensormap_and_ringbuffer/)
-# Each entry is just the directory name; kernels/ and golden.py are implied.
+# Load configuration from JSON file
 # ---------------------------------------------------------------------------
-EXAMPLES=(
-  alternating_matmul_add
-  benchmark_bgemm
-  paged_attention_unroll
-  batch_paged_attention
-  paged_attention
-)
+CONFIG_FILE="$SCRIPT_DIR/benchmark_config.json"
 
-# ---------------------------------------------------------------------------
-# Parse arguments
-# ---------------------------------------------------------------------------
-DEVICE_ID=0
-ROUNDS=10
-PLATFORM=a2a3
+load_config() {
+  local cfg="$1"
+  if [[ ! -f "$cfg" ]]; then
+    echo "ERROR: config file not found: $cfg" >&2
+    exit 1
+  fi
+  if ! command -v python3 &>/dev/null; then
+    echo "ERROR: python3 is required to parse the config file" >&2
+    exit 1
+  fi
+  # Parse JSON config via python3 (always available in this project)
+  eval "$(python3 -c "
+import json, sys, os
+with open('$cfg') as f:
+    c = json.load(f)
+print('CFG_PROJECT_ROOT=' + repr(str(c.get('project_root', '..'))))
+print('CFG_EXAMPLES_SUBDIR=' + repr(str(c.get('examples_subdir', 'tests/device_tests/\${platform}/tensormap_and_ringbuffer'))))
+print('CFG_DEVICE_ID=' + repr(str(c.get('device_id', 0))))
+print('CFG_ROUNDS=' + repr(str(c.get('rounds', 10))))
+print('CFG_WARMUP_ROUNDS=' + repr(str(c.get('warmup_rounds', 2))))
+print('CFG_PLATFORM=' + repr(str(c.get('platform', 'a2a3'))))
+print('CFG_VERBOSE=' + ('true' if c.get('verbose', False) else 'false'))
+print('CFG_PLOT=' + ('true' if c.get('plot', False) else 'false'))
+print('CFG_LOG=' + ('true' if c.get('log', False) else 'false'))
+examples = c.get('examples', [])
+print('CFG_EXAMPLES=(' + ' '.join(repr(str(e)) for e in examples) + ')')
+")"
+}
+
+apply_config() {
+  PROJECT_ROOT="$(cd "$SCRIPT_DIR/$CFG_PROJECT_ROOT" && pwd)"
+  RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py"
+  DEVICE_ID="$CFG_DEVICE_ID"
+  ROUNDS="$CFG_ROUNDS"
+  WARMUP_ROUNDS="$CFG_WARMUP_ROUNDS"
+  PLATFORM="$CFG_PLATFORM"
+  EXAMPLES=("${CFG_EXAMPLES[@]}")
+  EXAMPLES_SUBDIR="$CFG_EXAMPLES_SUBDIR"
+  VERBOSE="$CFG_VERBOSE"
+  PLOT="$CFG_PLOT"
+  LOG="$CFG_LOG"
+}
+
+load_config "$CONFIG_FILE"
+apply_config
 
 EXTRA_ARGS=()
 while [[ $# -gt 0 ]]; do
   case "$1" in
+    -c|--config)
+      CONFIG_FILE="$2"
+      load_config "$CONFIG_FILE"
+      apply_config
+      shift 2
+      ;;
     -p|--platform)
       PLATFORM="$2"
       shift 2
       ;;
@@ -47,23 +85,45 @@ while [[ $# -gt 0 ]]; do
       ROUNDS="$2"
       shift 2
       ;;
+    -w|--warmup)
+      WARMUP_ROUNDS="$2"
+      shift 2
+      ;;
+    -v|--verbose)
+      VERBOSE=true
+      shift
+      ;;
+    --plot)
+      PLOT=true
+      shift
+      ;;
+    --log)
+      LOG=true
+      shift
+      ;;
     --help|-h)
-      cat <<'USAGE'
+      cat <<USAGE
 Usage:
-  ./tools/benchmark_rounds.sh [-p <platform>] [-d <device>] [-n <rounds>]
+  ./tools/benchmark_rounds.sh [-c <config>] [-p <platform>] [-d <device>] [-n <rounds>] [-w <warmup>] [-v]
 
 Options:
-  -p, --platform   Platform to run on (default: a2a3)
-  -d, --device     Device ID (default: 0)
-  -n, --rounds     Override number of rounds for each example (default: 10)
+  -c, --config     Path to JSON config file (default: $SCRIPT_DIR/benchmark_config.json)
+  -p, --platform   Platform to run on (config default: a2a3)
+  -d, --device     Device ID (config default: 0)
+  -n, --rounds     Number of measured rounds per example (config default: 10)
+  -w, --warmup     Number of warm-up rounds to discard (config default: 2)
+  -v, --verbose    Print detailed run_example.py output (config default: false)
+      --plot       Generate scatter plot PNG for each example (config default: false)
+      --log        Save statistics to benchmark_logs/ for each example (config default: true)
   -h, --help       Show this help
 
+CLI arguments override values from the config file.
 All other options are passed through to run_example.py (e.g. --case).
 
 Output:
-  Average elapsed time in microseconds for each example.
+  Mean and median elapsed time in microseconds for each example.
 USAGE
       exit 0
       ;;
@@ -77,7 +137,9 @@ done
 # ---------------------------------------------------------------------------
 # Derive arch from platform and set examples directory
 # ---------------------------------------------------------------------------
-EXAMPLES_DIR="$PROJECT_ROOT/tests/device_tests/${PLATFORM}/tensormap_and_ringbuffer"
+# Substitute ${platform} placeholder in examples_subdir
+RESOLVED_SUBDIR="${EXAMPLES_SUBDIR//\$\{platform\}/$PLATFORM}"
+EXAMPLES_DIR="$PROJECT_ROOT/$RESOLVED_SUBDIR"
 
 # ---------------------------------------------------------------------------
 # Resolve device log directory (mirrors run_example.py / device_log_resolver.py)
@@ -93,11 +155,15 @@
 fi
 DEVICE_LOG_DIR="$LOG_ROOT/device-${DEVICE_ID}"
 
 # ---------------------------------------------------------------------------
-# parse_timing <log_file>
+# parse_timing <log_file> <warmup_rounds> <example_name>
 # Grep for orch_start / end lines, compute per-round elapsed, print summary.
+# Discards the first <warmup_rounds> rounds, then reports median, trimmed
+# mean (excluding min & max), and arithmetic mean for the remaining rounds.
 # ---------------------------------------------------------------------------
 parse_timing() {
   local log_file="$1"
+  local warmup="${2:-0}"
+  local example_name="${3:-unknown}"
   local timing
   timing=$(grep -E 'Thread=[0-9]+ (orch_start|end)=' "$log_file" || true)
@@ -107,7 +173,7 @@ parse_timing() {
     return 1
   fi
 
-  echo "$timing" | awk '
+  echo "$timing" | awk -v warmup="$warmup" -v example="$example_name" '
   function flush_round() {
     if (round >= 0 && max_end > 0 && min_start > 0) {
       results[round] = (max_end - min_start) / 50.0
@@ -126,27 +192,100 @@
       delete seen
     }
     seen[tid] = 1
-    match($0, /orch_start=([0-9]+)/, m)
-    val = m[1] + 0
+    match($0, /orch_start=([0-9]+)/, sm)
+    val = sm[1] + 0
     if (min_start == 0 || val < min_start) min_start = val
   }
   /end=/ {
-    match($0, /end=([0-9]+)/, m)
-    val = m[1] + 0
+    match($0, /end=([0-9]+)/, em)
+    val = em[1] + 0
     if (val > max_end) max_end = val
   }
   END {
     flush_round()
     if (count == 0) { print "  (no rounds parsed)"; exit 1 }
-    printf "  %-8s %12s\n", "Round", "Elapsed (us)"
-    printf "  %-8s %12s\n", "-----", "------------"
-    sum_v = 0
+    # Print all rounds, marking warm-up rounds
+    printf "  %-8s %12s %s\n", "Round", "Elapsed (us)", ""
+    printf "  %-8s %12s %s\n", "-----", "------------", ""
     for (i = 0; i < count; i++) {
-      printf "  %-8d %12.1f\n", i, results[i]
-      sum_v += results[i]
+      if (i < warmup)
+        printf "  %-8d %12.1f (warmup)\n", i, results[i]
+      else
+        printf "  %-8d %12.1f\n", i, results[i]
+    }
+
+    # Collect measured (non-warmup) rounds
+    m = 0
+    for (i = warmup; i < count; i++) {
+      measured[m] = results[i]
+      m++
+    }
+
+    if (m == 0) {
+      printf "\n  (all %d rounds were warm-up — no measured data)\n", count
+      exit 1
+    }
+
+    # Sort measured[] (insertion sort — tiny array)
+    for (i = 1; i < m; i++) {
+      key = measured[i]
+      j = i - 1
+      while (j >= 0 && measured[j] > key) {
+        measured[j + 1] = measured[j]
+        j--
+      }
+      measured[j + 1] = key
+    }
+
+    # Arithmetic mean
+    sum_v = 0
+    for (i = 0; i < m; i++) sum_v += measured[i]
+    mean_v = sum_v / m
+
+    # Median
+    if (m % 2 == 1)
+      median_v = measured[int(m / 2)]
+    else
+      median_v = (measured[m / 2 - 1] + measured[m / 2]) / 2.0
+
+    # Trimmed mean (drop one min and one max if we have >= 3 samples)
+    if (m >= 3) {
+      trim_sum = 0
+      for (i = 1; i < m - 1; i++) trim_sum += measured[i]
+      trimmed_v = trim_sum / (m - 2)
+    } else {
+      trimmed_v = mean_v
     }
-    printf "\n  Avg: %.1f us (%d rounds)\n", sum_v / count, count
+
+    # Mean absolute deviation
+    mad_sum = 0
+    for (i = 0; i < m; i++) {
+      diff = measured[i] - mean_v
+      mad_sum += (diff < 0 ? -diff : diff)
+    }
+    mad_v = mad_sum / m
+
+    # Standard deviation
+    sq_sum = 0
+    for (i = 0; i < m; i++) {
+      diff = measured[i] - mean_v
+      sq_sum += diff * diff
+    }
+    stddev_v = sqrt(sq_sum / m)
+
+    printf "\n  Mean: %.1f us | Median: %.1f us | Trimmed-mean: %.1f us (%d rounds, %d warmup)\n", \
+      mean_v, median_v, trimmed_v, m, warmup
+    range_v = measured[m - 1] - measured[0]
+    fluct_v = 0
+    if (mean_v > 0) fluct_v = (stddev_v / mean_v) * 100
+    printf "  Range(Max-Min): %.1f us | Avg variation (MAD): %.1f us | Std deviation: %.1f us | Fluctuation rate (CV): %.2f%%\n", \
+      range_v, mad_v, stddev_v, fluct_v
+    printf "  result: %.1f us\n", mean_v
+
+    # Emit machine-readable plot data (one line per measured round, original round numbers)
+    for (i = warmup; i < count; i++)
+      printf "PLOT_DATA:%s,%d,%.1f,%.1f,%.1f\n", example, i, results[i], mean_v, trimmed_v
   }'
 }
 
@@ -186,6 +325,15 @@ wait_for_new_log() {
 # ---------------------------------------------------------------------------
 PASS=0
 FAIL=0
+PLOT_DATA_FILE=""
+if [[ "$PLOT" == "true" ]]; then
+  PLOT_DATA_FILE=$(mktemp)
+fi
+LOG_DIR="$SCRIPT_DIR/benchmark_logs"
+LOG_TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+if [[ "$LOG" == "true" || "$PLOT" == "true" ]]; then
+  mkdir -p "$LOG_DIR"
+fi
 
 for example in "${EXAMPLES[@]}"; do
   EXAMPLE_DIR="$EXAMPLES_DIR/$example"
@@ -207,12 +355,24 @@
   PRE_LOG_FILE=$(mktemp)
   ls -1 "$DEVICE_LOG_DIR"/*.log 2>/dev/null | sort > "$PRE_LOG_FILE" || true
 
-  # Run example
-  if ! python3 "$RUN_EXAMPLE" \
-    -k "$KERNELS_DIR" -g "$GOLDEN" \
-    -p "$PLATFORM" -d "$DEVICE_ID" \
-    -n "$ROUNDS" \
-    "${EXTRA_ARGS[@]}" > /dev/null 2>&1; then
+  # Run example (measured rounds + warm-up rounds)
+  TOTAL_ROUNDS=$(( ROUNDS + WARMUP_ROUNDS ))
+  run_exit=0
+  if [[ "$VERBOSE" == "true" ]]; then
+    python3 "$RUN_EXAMPLE" \
+      -k "$KERNELS_DIR" -g "$GOLDEN" \
+      -p "$PLATFORM" -d "$DEVICE_ID" \
+      -n "$TOTAL_ROUNDS" \
+      "${EXTRA_ARGS[@]}" || run_exit=$?
+  else
+    python3 "$RUN_EXAMPLE" \
+      -k "$KERNELS_DIR" -g "$GOLDEN" \
+      -p "$PLATFORM" -d "$DEVICE_ID" \
+      -n "$TOTAL_ROUNDS" \
+      "${EXTRA_ARGS[@]}" > /dev/null 2>&1 || run_exit=$?
+  fi
+
+  if [[ $run_exit -ne 0 ]]; then
     echo "  FAILED: run_example.py returned non-zero"
     rm -f "$PRE_LOG_FILE"
     ((FAIL++)) || true
@@ -230,13 +390,80 @@
   fi
   echo "  Log: $NEW_LOG"
 
-  if parse_timing "$NEW_LOG"; then
+  # Capture exit status via || so set -e does not abort on a failed parse
+  timing_exit=0
+  timing_output=$(parse_timing "$NEW_LOG" "$WARMUP_ROUNDS" "$example") || timing_exit=$?
+  # Print non-PLOT_DATA lines to stdout
+  echo "$timing_output" | grep -v '^PLOT_DATA:'
+  # Collect PLOT_DATA lines if plotting enabled
+  if [[ "$PLOT" == "true" && -n "$PLOT_DATA_FILE" ]]; then
+    echo "$timing_output" | grep '^PLOT_DATA:' >> "$PLOT_DATA_FILE" || true
+  fi
+  # Save statistics to log file if logging enabled
+  if [[ "$LOG" == "true" && $timing_exit -eq 0 ]]; then
+    LOG_FILE="$LOG_DIR/${example}_${LOG_TIMESTAMP}.log"
+    echo "$timing_output" | grep -v '^PLOT_DATA:' > "$LOG_FILE"
+    echo "  Log saved: $LOG_FILE"
+  fi
+  if [[ $timing_exit -eq 0 ]]; then
     ((PASS++)) || true
   else
     ((FAIL++)) || true
   fi
 done
 
+# ---------------------------------------------------------------------------
+# Generate scatter plots (if enabled)
+# ---------------------------------------------------------------------------
+if [[ "$PLOT" == "true" && -n "$PLOT_DATA_FILE" && -s "$PLOT_DATA_FILE" ]]; then
+  PLOT_TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+  python3 -c "
+import sys, os
+try:
+    import matplotlib
+    matplotlib.use('Agg')
+    import matplotlib.pyplot as plt
+except ImportError:
+    print('  WARNING: matplotlib not available, skipping plot generation', file=sys.stderr)
+    sys.exit(0)
+
+from collections import OrderedDict
+
+data = OrderedDict()  # example -> [(round, elapsed, mean, trimmed)]
+with open('$PLOT_DATA_FILE') as f:
+    for line in f:
+        line = line.strip()
+        if not line.startswith('PLOT_DATA:'):
+            continue
+        parts = line[len('PLOT_DATA:'):].split(',')
+        name, rnd, elapsed, mean, trimmed = parts[0], int(parts[1]), float(parts[2]), float(parts[3]), float(parts[4])
+        data.setdefault(name, []).append((rnd, elapsed, mean, trimmed))
+
+outdir = '$LOG_DIR'
+ts = '$PLOT_TIMESTAMP'
+for name, points in data.items():
+    rounds = [p[0] for p in points]
+    values = [p[1] for p in points]
+    mean_val = points[0][2]
+    trimmed_val = points[0][3]
+
+    fig, ax = plt.subplots(figsize=(8, 5))
+    ax.scatter(rounds, values, c='steelblue', s=40, zorder=3, label='Per-round elapsed')
+    ax.axhline(y=mean_val, color='tomato', linestyle='--', linewidth=1.5, label=f'Mean: {mean_val:.1f} us')
+    ax.axhline(y=trimmed_val, color='blue', linestyle='--', linewidth=1.5, label=f'Trimmed-mean: {trimmed_val:.1f} us')
+    ax.set_xlabel('Round')
+    ax.set_ylabel('Elapsed (us)')
+    ax.set_title(f'{name}')
+    ax.legend()
+    ax.grid(True, alpha=0.3)
+
+    outpath = os.path.join(outdir, f'benchmark_{name}_{ts}.png')
+    fig.savefig(outpath, dpi=150, bbox_inches='tight')
+    plt.close(fig)
+    print(f'  Plot saved: {outpath}')
+" || echo "  WARNING: plot generation failed"
+  rm -f "$PLOT_DATA_FILE"
+fi
+
 # ---------------------------------------------------------------------------
 # Summary
 # ---------------------------------------------------------------------------
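Reviewer note: the per-round elapsed computation inside `parse_timing` (earliest `orch_start` tick to latest `end` tick, divided by 50.0 to convert ticks to microseconds) can be sanity-checked with a small Python sketch. The log lines below are hypothetical, and `elapsed_us` is an illustrative helper, not part of the patch:

```python
import re

# Sketch of one round of parse_timing: earliest orch_start to latest end,
# converted from ticks to microseconds by the same /50.0 divisor the awk uses.
def elapsed_us(lines):
    starts = [int(m.group(1)) for l in lines
              if (m := re.search(r"orch_start=(\d+)", l))]
    ends = [int(m.group(1)) for l in lines
            if (m := re.search(r"\bend=(\d+)", l))]
    return (max(ends) - min(starts)) / 50.0

# Hypothetical device-log lines for a single round
sample = [
    "Thread=0 orch_start=1000",
    "Thread=1 orch_start=1010",
    "Thread=0 end=6900",
    "Thread=1 end=7000",
]
print(elapsed_us(sample))  # (7000 - 1000) / 50.0 = 120.0
```

Note the `\b` word boundary in the `end=` pattern, mirroring how the awk code distinguishes `end=` lines from `orch_start=` lines.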