Support: report wall-clock orch and sched time in benchmark_rounds#324
Support: report wall-clock orch and sched time in benchmark_rounds#324hw-native-sys-bot wants to merge 1 commit intohw-native-sys:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly upgrades the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Parse additional device log markers (BENCHMARK end cycle, Scheduler summary total_time) to compute per-round wall-clock orchestration and scheduling time. For concurrent threads, reports max(end)-min(start) instead of summing per-thread durations, so overlapping work is not double-counted.
d51bcfc to
bc10aa1
Compare
There was a problem hiding this comment.
Code Review
This pull request extends the benchmark_rounds.sh script to report wall-clock orchestration and scheduling times by parsing additional profiling information from device logs. The changes correctly handle cases where this detailed information is not available. My review identified a significant issue in how statistics are calculated for the new metrics when data is sparse across rounds, which could lead to incorrect benchmark results. I have provided a detailed comment with a suggested fix to ensure the accuracy of the reported metrics.
| sum_e = 0; min_e = r_elapsed[0]; max_e = r_elapsed[0] | ||
| sum_o = 0; min_o = r_orch[0]; max_o = r_orch[0] | ||
| sum_s = 0; min_s = r_sched[0]; max_s = r_sched[0] | ||
| for (i = 0; i < count; i++) { | ||
| printf " %-8d %12.1f\n", i, results[i] | ||
| sum_v += results[i] | ||
| if (results[i] < min_v) min_v = results[i] | ||
| if (results[i] > max_v) max_v = results[i] | ||
| if (has_detail) | ||
| printf " %-8d %12.1f %12.1f %12.1f\n", i, r_elapsed[i], r_orch[i], r_sched[i] | ||
| else | ||
| printf " %-8d %12.1f\n", i, r_elapsed[i] | ||
| sum_e += r_elapsed[i] | ||
| if (r_elapsed[i] < min_e) min_e = r_elapsed[i] | ||
| if (r_elapsed[i] > max_e) max_e = r_elapsed[i] | ||
| sum_o += r_orch[i] | ||
| if (r_orch[i] < min_o) min_o = r_orch[i] | ||
| if (r_orch[i] > max_o) max_o = r_orch[i] | ||
| sum_s += r_sched[i] | ||
| if (r_sched[i] < min_s) min_s = r_sched[i] | ||
| if (r_sched[i] > max_s) max_s = r_sched[i] | ||
| } | ||
| printf "\n Avg: %.1f us (%d rounds)\n", sum_v / count, count | ||
| printf "\n Avg: %.1f us (%d rounds)\n", sum_e / count, count | ||
| if (count > 2) { | ||
| trimmed = (sum_v - min_v - max_v) / (count - 2) | ||
| printf " Trimmed Avg: %.1f us (excluding min=%.1f, max=%.1f)\n", trimmed, min_v, max_v | ||
| trimmed = (sum_e - min_e - max_e) / (count - 2) | ||
| printf " Trimmed Avg: %.1f us (excluding min=%.1f, max=%.1f)\n", trimmed, min_e, max_e | ||
| } | ||
| if (has_detail) { | ||
| printf "\n Orch Avg: %.1f us", sum_o / count | ||
| if (count > 2) { | ||
| trimmed_o = (sum_o - min_o - max_o) / (count - 2) | ||
| printf " Trimmed: %.1f us", trimmed_o | ||
| } | ||
| printf " Total: %.1f us\n", sum_o | ||
|
|
||
| printf " Sched Avg: %.1f us", sum_s / count | ||
| if (count > 2) { | ||
| trimmed_s = (sum_s - min_s - max_s) / (count - 2) | ||
| printf " Trimmed: %.1f us", trimmed_s | ||
| } | ||
| printf " Total: %.1f us\n", sum_s | ||
| } |
There was a problem hiding this comment.
The calculation of statistics (average, min, max, trimmed average) for orchestration and scheduling times is incorrect when some rounds are missing this data. The current implementation initializes min_o and min_s with r_orch[0] and r_sched[0], which can be 0 if data for the first round is missing. This leads to an incorrect minimum value of 0. Additionally, the average is calculated by dividing by the total number of rounds (count) instead of the number of rounds that actually have orchestration or scheduling data, skewing the average. The trimmed average is also affected by these issues.
The logic should be updated to only consider rounds with valid data (> 0) for these statistics. This involves:
- Initializing
min_oandmin_sto a value indicating they are not set yet. - Keeping separate counters (
count_o,count_s) for rounds with orchestration and scheduling data. - Updating sums, min, max, and counts only for rounds with data.
- Using these separate counters for calculating averages and trimmed averages.
sum_e = 0; min_e = r_elapsed[0]; max_e = r_elapsed[0]
sum_o = 0; min_o = -1; max_o = 0; count_o = 0
sum_s = 0; min_s = -1; max_s = 0; count_s = 0
for (i = 0; i < count; i++) {
if (has_detail)
printf " %-8d %12.1f %12.1f %12.1f\n", i, r_elapsed[i], r_orch[i], r_sched[i]
else
printf " %-8d %12.1f\n", i, r_elapsed[i]
sum_e += r_elapsed[i]
if (r_elapsed[i] < min_e) min_e = r_elapsed[i]
if (r_elapsed[i] > max_e) max_e = r_elapsed[i]
if (r_orch[i] > 0) {
sum_o += r_orch[i]
if (min_o < 0 || r_orch[i] < min_o) min_o = r_orch[i]
if (r_orch[i] > max_o) max_o = r_orch[i]
count_o++
}
if (r_sched[i] > 0) {
sum_s += r_sched[i]
if (min_s < 0 || r_sched[i] < min_s) min_s = r_sched[i]
if (r_sched[i] > max_s) max_s = r_sched[i]
count_s++
}
}
printf "\n Avg: %.1f us (%d rounds)\n", sum_e / count, count
if (count > 2) {
trimmed = (sum_e - min_e - max_e) / (count - 2)
printf " Trimmed Avg: %.1f us (excluding min=%.1f, max=%.1f)\n", trimmed, min_e, max_e
}
if (has_detail) {
if (count_o > 0) {
printf "\n Orch Avg: %.1f us", sum_o / count_o
if (count_o > 2) {
trimmed_o = (sum_o - min_o - max_o) / (count_o - 2)
printf " Trimmed: %.1f us", trimmed_o
}
printf " Total: %.1f us\n", sum_o
}
if (count_s > 0) {
printf " Sched Avg: %.1f us", sum_s / count_s
if (count_s > 2) {
trimmed_s = (sum_s - min_s - max_s) / (count_s - 2)
printf " Trimmed: %.1f us", trimmed_s
}
printf " Total: %.1f us\n", sum_s
}
}
Summary
BENCHMARK: thread=N end=<cycles>andScheduler summary: total_time=Xusfrom device logs to compute per-round wall-clock orchestration and scheduling timemax(end) - min(start)instead of summing per-thread durations, so overlapping work is not double-countedTest plan
./tools/benchmark_rounds.shon hardware with PTO2_PROFILING enabled, verify Orch/Sched columns appear