Update: promote task_id to mixed_task_id (u64) across profiling pipeline by indigo1973 · Pull Request #316 · hw-native-sys/simpler

indigo1973 · 2026-03-17T17:56:11Z

Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with
the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view
correlation in multi-ring profiling scenarios.

Platform:

PerfRecord: task_id (u32) → mixed_task_id (u64); fanout[] int32_t →
uint64_t; add fanout_filled flag to skip duplicate host-side fill
AicpuPhaseRecord: replace tasks_processed (u32) + padding with a union
of mixed_task_id / tasks_processed (both u64)
Update perf_aicpu_record_phase/orch_phase signatures to u64

tensormap_and_ringbuffer:

Replace pto2_slot_states_ptr_ (single) with per-ring array
pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH]
complete_perf_records: skip fanout_filled records, decode ring_id from
mixed_task_id, look up per-ring slot states, store u64 fanout entries
aicpu_executor: overwrite mixed_task_id with full pto2_task_id_raw,
set fanout_filled=1, register per-ring slot states

host_build_graph:

Match PerfRecord by mixed_task_id; store fanout as u64; set fanout_filled

swimlane_converter:

Add format_task_display() to render mixed_task_id as r{ring}t{local}
Replace orch_finalize→orch_fanin flow arrows (use fanin end time)

gemini-code-assist · 2026-03-17T17:56:36Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical Perfetto visualization inaccuracies by unifying task ID spaces and correctly handling multi-ring fanout dependencies. It refactors core data structures and profiling logic to ensure consistent task identification and dependency tracking across different profiling components, leading to more accurate and complete performance traces.

Highlights

Unified Task ID Space: The PerfRecord.task_id (uint32_t) field has been replaced with PerfRecord.mixed_task_id (uint64_t) to store the full ring_id<<32 | local_id value. This resolves ID-space mismatches that caused missing orchestrator-to-scheduler arrows in Perfetto visualizations.
Correct Multi-Ring Fanout: The PerfRecord.fanout[] array now stores uint64_t mixed_task_id values instead of int32_t local_id. This enables correct lookup of successor tasks across all rings, fixing issues where multi-ring fanout was incorrect or missing.
Orchestrator Phase Record Updates: The AicpuPhaseRecord struct has been refactored to use a union, allowing orchestrator phases to record mixed_task_id for cross-view correlation, while scheduler phases continue to record tasks_processed. The perf_aicpu_record_phase and perf_aicpu_record_orch_phase functions were updated to accept uint64_t for task IDs.
Perfetto Visualization Logic Improvement: The swimlane_converter.py script has been updated to build orchestrator-to-scheduler arrows from the ORCH_FANIN end time (instead of ORCH_FINALIZE start time), providing more precise correlation as ORCH_FANIN marks when dependencies are resolved.
Memory Footprint Adjustment: The PerfRecord structure size increased from 2176B to 4224B due to the change in fanout[] from int32_t to uint64_t. Host memory allocation adapts automatically.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

src/a2a3/platform/include/aicore/performance_collector_aicore.h
- Updated perf_aicore_record_task to assign task_id to mixed_task_id.
src/a2a3/platform/include/aicpu/performance_collector_aicpu.h
- Modified perf_aicpu_record_phase and perf_aicpu_record_orch_phase function signatures to accept uint64_t for task IDs.
- Updated documentation for tasks_processed parameter to reflect its dual use for scheduler and orchestrator phases.
src/a2a3/platform/include/common/perf_profiling.h
- Replaced PerfRecord.task_id (uint32_t) with PerfRecord.mixed_task_id (uint64_t).
- Changed PerfRecord.fanout array type from int32_t to uint64_t.
- Replaced AicpuPhaseRecord.tasks_processed and padding with an anonymous union containing mixed_task_id (uint64_t) and tasks_processed (uint64_t).
src/a2a3/platform/src/aicpu/performance_collector_aicpu.cpp
- Updated perf_aicpu_record_phase implementation to use uint64_t tasks_processed and assign it to record->mixed_task_id.
- Modified perf_aicpu_record_orch_phase to pass uint64_t mixed_task_id to perf_aicpu_record_phase.
src/a2a3/platform/src/host/performance_collector.cpp
- Updated task sorting in export_swimlane_json to use mixed_task_id.
- Modified LOG_WARN message to correctly print mixed_task_id using %llu format specifier.
- Changed JSON output for task_id to use record.mixed_task_id and cast to int64_t for orchestrator phases.
src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp
- Updated task ID comparison in resolve_and_dispatch to use record->mixed_task_id and cast completed_task_id to uint64_t.
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
- Updated task ID comparison to use record->mixed_task_id and cast expected_reg_task_id to uint64_t.
- Added logic to overwrite record->mixed_task_id with the full pto2_task_id_raw value.
- Modified fanout population to store pto2_task_id_raw directly into record->fanout.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
- Updated the weak symbol declaration for perf_aicpu_record_orch_phase to accept uint64_t.
- Modified CYCLE_COUNT_LAP_RECORD calls to pass pto2_task_id_raw(mixed_task_id) or uint64_t(0) for task identification.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp
- Updated task ID extraction to use pto2_task_id_local on record->mixed_task_id.
- Removed conditional logic and directly stored pto2_task_id_raw into record->fanout, eliminating the 32-bit fanout limitation.
tools/swimlane_converter.py
- Changed the logic for generating orchestrator-to-scheduler arrows from orch_finalize to orch_fanin phase.
- Updated arrow generation to use the end_time_us of the ORCH_FANIN record for more accurate timing.
- Ensured task_id (now mixed_task_id) is an integer and valid before processing.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request effectively addresses two correctness issues in performance visualization: missing orchestrator-to-scheduler arrows and incorrect fanout data for multi-ring setups. The core of the fix involves changing task_id to a 64-bit mixed_task_id to uniquely identify tasks across rings and updating fanout arrays to store these full IDs. The use of a union in AicpuPhaseRecord is a clever optimization to avoid increasing its size. The changes are consistent and well-implemented across the C++ runtime and the Python trace converter. I have one suggestion regarding potentially redundant code.

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp

ChaoZheng109 · 2026-03-18T01:24:43Z

fix #311

poursoul · 2026-03-18T03:40:45Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

        local_packed_end = (char*)local_packed_base + total_output_size;
    }
-    CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, local_id);
+    CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, pto2_task_id_raw(mixed_task_id));


重载PTO2TaskId转uint64_t的operator

这里是指在 struct PTO2TaskId 中添加 constexpr explicit operator uint64_t() const { return raw; }，然后直接用static_cast<uint64_t>么。runtime.cpp和aicpu_executor.cpp中也有从PTO2TaskId中提取uint64 raw的，需要一并修改么

poursoul · 2026-03-18T03:42:02Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp

    for (uint32_t i = 0; i < count; i++) {
        PerfRecord* record = &perf_buf->records[i];
-        int32_t task_id = record->task_id;
+        int32_t task_id = static_cast<int32_t>(pto2_task_id_local(PTO2TaskId{record->mixed_task_id}));


用task_id.local()

poursoul · 2026-03-18T03:45:03Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp

multi-ring这个功能是不使能吗，会有什么后果？

mult-ring是使能的吧。只是本次提交前，fanout 对 ring_id>0 直接 skip（旧链路里 fanout 用 32-bit local 表达），结果是多 ring 任务依赖边在 perf_swimlane/merged_swimlane 中缺边。当前依赖 aicpu_executor 这条写入链路；如果那边漏写，complete_perf_records 不会兜底，可能出现 fanout 信息缺失。这个我后续再看。

indigo1973 · 2026-03-18T09:10:00Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring to support multi-ring profiling by promoting the 32-bit task_id to a 64-bit mixed_task_id across the entire profiling pipeline. The changes are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, device-side recording logic in AICore and AICPU, and host-side data processing and visualization tools. The implementation correctly handles the new 64-bit IDs, including overwriting partial IDs with full ones where necessary and updating data processing logic. The Python script for visualization is also appropriately updated. The changes appear correct and consistent. I have one minor suggestion for code cleanup in src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp to remove a redundant check.

gemini-code-assist · 2026-03-18T09:11:51Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp

+        int32_t ring_id = static_cast<int32_t>(task_id.ring());
+        if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) {
+            continue;
+        }


The task_id.ring() method returns a uint8_t, which is an unsigned type. Casting it to int32_t and then checking if ring_id < 0 is redundant as this condition will always be false. It's cleaner to use the original uint8_t type and remove the unnecessary check for better code clarity and correctness.

Suggested change

int32_t ring_id = static_cast<int32_t>(task_id.ring());

if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) {

continue;

}

uint8_t ring_id = task_id.ring();

if (ring_id >= PTO2_MAX_RING_DEPTH) {

continue;

}

indigo1973 · 2026-03-18T09:28:45Z

/gemini review

gemini-code-assist

Code Review

This pull request is a significant and well-executed update to promote the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline. This change is crucial for enabling cross-view correlation in multi-ring profiling scenarios. The modifications are comprehensive, touching data structures, function signatures, and logic from device-side collection to host-side processing and visualization scripts. The changes appear correct and consistent with the goal of the pull request. I have one minor suggestion to refactor a code block for improved clarity and maintainability.

gemini-code-assist · 2026-03-18T09:32:22Z

src/a2a3/runtime/host_build_graph/runtime/runtime.cpp

        if (task != nullptr) {
            record->fanout_count = task->fanout_count;

            for (int32_t j = 0; j < task->fanout_count; j++) {
-                record->fanout[j] = task->fanout[j];
+                record->fanout[j] = static_cast<uint64_t>(task->fanout[j]);
            }
+            record->fanout_filled = 1;
        } else {
            record->fanout_count = 0;
+            record->fanout_filled = 1;
        }


This block can be simplified by setting fanout_count to a default value before the if statement and setting fanout_filled after. This refactoring removes code duplication and improves readability.

record->fanout_count = 0; if (task != nullptr) { record->fanout_count = task->fanout_count; for (int32_t j = 0; j < task->fanout_count; j++) { record->fanout[j] = static_cast<uint64_t>(task->fanout[j]); } } record->fanout_filled = 1;

Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view correlation in multi-ring profiling scenarios. Platform: - PerfRecord: task_id (u32) → mixed_task_id (u64); fanout[] int32_t → uint64_t; add fanout_filled flag to skip duplicate host-side fill - AicpuPhaseRecord: replace tasks_processed (u32) + padding with a union of mixed_task_id / tasks_processed (both u64) - Update perf_aicpu_record_phase/orch_phase signatures to u64 tensormap_and_ringbuffer: - Replace pto2_slot_states_ptr_ (single) with per-ring array pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH] - complete_perf_records: skip fanout_filled records, decode ring_id from mixed_task_id, look up per-ring slot states, store u64 fanout entries - aicpu_executor: overwrite mixed_task_id with full pto2_task_id_raw, set fanout_filled=1, register per-ring slot states host_build_graph: - Match PerfRecord by mixed_task_id; store fanout as u64; set fanout_filled swimlane_converter: - Add format_task_display() to render mixed_task_id as r{ring}t{local} - Replace orch_finalize→orch_fanin flow arrows (use fanin end time)

indigo1973 · 2026-03-18T09:50:05Z

/gemini review

gemini-code-assist

Code Review

This pull request successfully promotes the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline, a necessary change to support multi-ring scenarios. The modifications are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, data collection on both AICore and AICPU, and host-side processing and visualization tools. The implementation is robust, with notable improvements such as the efficient use of a union in AicpuPhaseRecord and the addition of an explicit conversion operator to PTO2TaskId. The updates to swimlane_converter.py for the new ID format and enhanced flow visualization are also well-executed. I have one suggestion for a minor code refactoring to improve maintainability by removing duplicated code. Overall, this is a solid and well-implemented enhancement.

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

indigo1973 changed the title ~~perf: fix orch→sched arrows and multi-ring fanout correctness~~ fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 17, 2026

gemini-code-assist bot reviewed Mar 17, 2026

View reviewed changes

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp Show resolved Hide resolved

indigo1973 changed the title ~~fix(perf): fix orch→sched arrows and multi-ring fanout correctness~~ [WIP] fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 18, 2026

indigo1973 force-pushed the prof_0317_v2 branch from 8e37bba to b03af07 Compare March 18, 2026 02:00

indigo1973 changed the title ~~[WIP] fix(perf): fix orch→sched arrows and multi-ring fanout correctness~~ fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 18, 2026

indigo1973 force-pushed the prof_0317_v2 branch from b03af07 to 8e9c3f9 Compare March 18, 2026 02:41

poursoul reviewed Mar 18, 2026

View reviewed changes

indigo1973 force-pushed the prof_0317_v2 branch 2 times, most recently from af951fd to bc72047 Compare March 18, 2026 08:52

indigo1973 changed the title ~~fix(perf): fix orch→sched arrows and multi-ring fanout correctness~~ Update: promote task_id to mixed_task_id (u64) across profiling pipeline Mar 18, 2026

gemini-code-assist bot reviewed Mar 18, 2026

View reviewed changes

indigo1973 force-pushed the prof_0317_v2 branch from bc72047 to 1b565ed Compare March 18, 2026 09:28

gemini-code-assist bot reviewed Mar 18, 2026

View reviewed changes

indigo1973 force-pushed the prof_0317_v2 branch from 1b565ed to 72fb6e8 Compare March 18, 2026 09:49

gemini-code-assist bot reviewed Mar 18, 2026

View reviewed changes

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp Show resolved Hide resolved

Conversation

indigo1973 commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot commented Mar 17, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

ChaoZheng109 commented Mar 18, 2026

Uh oh!

poursoul Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

poursoul Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

poursoul Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 commented Mar 18, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 commented Mar 18, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

indigo1973 commented Mar 18, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

indigo1973 commented Mar 17, 2026 •

edited

Loading