Update: promote task_id to mixed_task_id (u64) across profiling pipeline#316
Update: promote task_id to mixed_task_id (u64) across profiling pipeline#316indigo1973 wants to merge 1 commit intohw-native-sys:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical Perfetto visualization inaccuracies by unifying task ID spaces and correctly handling multi-ring fanout dependencies. It refactors core data structures and profiling logic to ensure consistent task identification and dependency tracking across different profiling components, leading to more accurate and complete performance traces. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request effectively addresses two correctness issues in performance visualization: missing orchestrator-to-scheduler arrows and incorrect fanout data for multi-ring setups. The core of the fix involves changing task_id to a 64-bit mixed_task_id to uniquely identify tasks across rings and updating fanout arrays to store these full IDs. The use of a union in AicpuPhaseRecord is a clever optimization to avoid increasing its size. The changes are consistent and well-implemented across the C++ runtime and the Python trace converter. I have one suggestion regarding potentially redundant code.
|
fix #311 |
8e37bba to
b03af07
Compare
b03af07 to
8e9c3f9
Compare
| local_packed_end = (char*)local_packed_base + total_output_size; | ||
| } | ||
| CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, local_id); | ||
| CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, pto2_task_id_raw(mixed_task_id)); |
There was a problem hiding this comment.
重载PTO2TaskId转uint64_t的operator
There was a problem hiding this comment.
这里是指在 struct PTO2TaskId 中添加 constexpr explicit operator uint64_t() const { return raw; },然后直接用static_cast<uint64_t>么。runtime.cpp和aicpu_executor.cpp中也有从PTO2TaskId中提取uint64 raw的,需要一并修改么
| for (uint32_t i = 0; i < count; i++) { | ||
| PerfRecord* record = &perf_buf->records[i]; | ||
| int32_t task_id = record->task_id; | ||
| int32_t task_id = static_cast<int32_t>(pto2_task_id_local(PTO2TaskId{record->mixed_task_id})); |
There was a problem hiding this comment.
multi-ring这个功能是不使能吗,会有什么后果?
There was a problem hiding this comment.
mult-ring是使能的吧。只是本次提交前,fanout 对 ring_id>0 直接 skip(旧链路里 fanout 用 32-bit local 表达),结果是多 ring 任务依赖边在 perf_swimlane/merged_swimlane 中缺边。当前依赖 aicpu_executor 这条写入链路;如果那边漏写,complete_perf_records 不会兜底,可能出现 fanout 信息缺失。这个我后续再看。
af951fd to
bc72047
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed refactoring to support multi-ring profiling by promoting the 32-bit task_id to a 64-bit mixed_task_id across the entire profiling pipeline. The changes are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, device-side recording logic in AICore and AICPU, and host-side data processing and visualization tools. The implementation correctly handles the new 64-bit IDs, including overwriting partial IDs with full ones where necessary and updating data processing logic. The Python script for visualization is also appropriately updated. The changes appear correct and consistent. I have one minor suggestion for code cleanup in src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp to remove a redundant check.
| int32_t ring_id = static_cast<int32_t>(task_id.ring()); | ||
| if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
The task_id.ring() method returns a uint8_t, which is an unsigned type. Casting it to int32_t and then checking if ring_id < 0 is redundant as this condition will always be false. It's cleaner to use the original uint8_t type and remove the unnecessary check for better code clarity and correctness.
| int32_t ring_id = static_cast<int32_t>(task_id.ring()); | |
| if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) { | |
| continue; | |
| } | |
| uint8_t ring_id = task_id.ring(); | |
| if (ring_id >= PTO2_MAX_RING_DEPTH) { | |
| continue; | |
| } |
bc72047 to
1b565ed
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed update to promote the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline. This change is crucial for enabling cross-view correlation in multi-ring profiling scenarios. The modifications are comprehensive, touching data structures, function signatures, and logic from device-side collection to host-side processing and visualization scripts. The changes appear correct and consistent with the goal of the pull request. I have one minor suggestion to refactor a code block for improved clarity and maintainability.
| if (task != nullptr) { | ||
| record->fanout_count = task->fanout_count; | ||
|
|
||
| for (int32_t j = 0; j < task->fanout_count; j++) { | ||
| record->fanout[j] = task->fanout[j]; | ||
| record->fanout[j] = static_cast<uint64_t>(task->fanout[j]); | ||
| } | ||
| record->fanout_filled = 1; | ||
| } else { | ||
| record->fanout_count = 0; | ||
| record->fanout_filled = 1; | ||
| } |
There was a problem hiding this comment.
This block can be simplified by setting fanout_count to a default value before the if statement and setting fanout_filled after. This refactoring removes code duplication and improves readability.
record->fanout_count = 0;
if (task != nullptr) {
record->fanout_count = task->fanout_count;
for (int32_t j = 0; j < task->fanout_count; j++) {
record->fanout[j] = static_cast<uint64_t>(task->fanout[j]);
}
}
record->fanout_filled = 1;Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with
the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view
correlation in multi-ring profiling scenarios.
Platform:
- PerfRecord: task_id (u32) → mixed_task_id (u64); fanout[] int32_t →
uint64_t; add fanout_filled flag to skip duplicate host-side fill
- AicpuPhaseRecord: replace tasks_processed (u32) + padding with a union
of mixed_task_id / tasks_processed (both u64)
- Update perf_aicpu_record_phase/orch_phase signatures to u64
tensormap_and_ringbuffer:
- Replace pto2_slot_states_ptr_ (single) with per-ring array
pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH]
- complete_perf_records: skip fanout_filled records, decode ring_id from
mixed_task_id, look up per-ring slot states, store u64 fanout entries
- aicpu_executor: overwrite mixed_task_id with full pto2_task_id_raw,
set fanout_filled=1, register per-ring slot states
host_build_graph:
- Match PerfRecord by mixed_task_id; store fanout as u64; set fanout_filled
swimlane_converter:
- Add format_task_display() to render mixed_task_id as r{ring}t{local}
- Replace orch_finalize→orch_fanin flow arrows (use fanin end time)
1b565ed to
72fb6e8
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request successfully promotes the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline, a necessary change to support multi-ring scenarios. The modifications are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, data collection on both AICore and AICPU, and host-side processing and visualization tools. The implementation is robust, with notable improvements such as the efficient use of a union in AicpuPhaseRecord and the addition of an explicit conversion operator to PTO2TaskId. The updates to swimlane_converter.py for the new ID format and enhanced flow visualization are also well-executed. I have one suggestion for a minor code refactoring to improve maintainability by removing duplicated code. Overall, this is a solid and well-implemented enhancement.
Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with
the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view
correlation in multi-ring profiling scenarios.
Platform:
uint64_t; add fanout_filled flag to skip duplicate host-side fill
of mixed_task_id / tasks_processed (both u64)
tensormap_and_ringbuffer:
pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH]
mixed_task_id, look up per-ring slot states, store u64 fanout entries
set fanout_filled=1, register per-ring slot states
host_build_graph:
swimlane_converter: