Skip to content

Update: promote task_id to mixed_task_id (u64) across profiling pipeline#316

Open
indigo1973 wants to merge 1 commit intohw-native-sys:mainfrom
indigo1973:prof_0317_v2
Open

Update: promote task_id to mixed_task_id (u64) across profiling pipeline#316
indigo1973 wants to merge 1 commit intohw-native-sys:mainfrom
indigo1973:prof_0317_v2

Conversation

@indigo1973
Copy link
Contributor

@indigo1973 indigo1973 commented Mar 17, 2026

Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with
the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view
correlation in multi-ring profiling scenarios.

Platform:

  • PerfRecord: task_id (u32) → mixed_task_id (u64); fanout[] int32_t →
    uint64_t; add fanout_filled flag to skip duplicate host-side fill
  • AicpuPhaseRecord: replace tasks_processed (u32) + padding with a union
    of mixed_task_id / tasks_processed (both u64)
  • Update perf_aicpu_record_phase/orch_phase signatures to u64

tensormap_and_ringbuffer:

  • Replace pto2_slot_states_ptr_ (single) with per-ring array
    pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH]
  • complete_perf_records: skip fanout_filled records, decode ring_id from
    mixed_task_id, look up per-ring slot states, store u64 fanout entries
  • aicpu_executor: overwrite mixed_task_id with full pto2_task_id_raw,
    set fanout_filled=1, register per-ring slot states

host_build_graph:

  • Match PerfRecord by mixed_task_id; store fanout as u64; set fanout_filled

swimlane_converter:

  • Add format_task_display() to render mixed_task_id as r{ring}t{local}
  • Replace orch_finalize→orch_fanin flow arrows (use fanin end time)

@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical Perfetto visualization inaccuracies by unifying task ID spaces and correctly handling multi-ring fanout dependencies. It refactors core data structures and profiling logic to ensure consistent task identification and dependency tracking across different profiling components, leading to more accurate and complete performance traces.

Highlights

  • Unified Task ID Space: The PerfRecord.task_id (uint32_t) field has been replaced with PerfRecord.mixed_task_id (uint64_t) to store the full ring_id<<32 | local_id value. This resolves ID-space mismatches that caused missing orchestrator-to-scheduler arrows in Perfetto visualizations.
  • Correct Multi-Ring Fanout: The PerfRecord.fanout[] array now stores uint64_t mixed_task_id values instead of int32_t local_id. This enables correct lookup of successor tasks across all rings, fixing issues where multi-ring fanout was incorrect or missing.
  • Orchestrator Phase Record Updates: The AicpuPhaseRecord struct has been refactored to use a union, allowing orchestrator phases to record mixed_task_id for cross-view correlation, while scheduler phases continue to record tasks_processed. The perf_aicpu_record_phase and perf_aicpu_record_orch_phase functions were updated to accept uint64_t for task IDs.
  • Perfetto Visualization Logic Improvement: The swimlane_converter.py script has been updated to build orchestrator-to-scheduler arrows from the ORCH_FANIN end time (instead of ORCH_FINALIZE start time), providing more precise correlation as ORCH_FANIN marks when dependencies are resolved.
  • Memory Footprint Adjustment: The PerfRecord structure size increased from 2176B to 4224B due to the change in fanout[] from int32_t to uint64_t. Host memory allocation adapts automatically.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • src/a2a3/platform/include/aicore/performance_collector_aicore.h
    • Updated perf_aicore_record_task to assign task_id to mixed_task_id.
  • src/a2a3/platform/include/aicpu/performance_collector_aicpu.h
    • Modified perf_aicpu_record_phase and perf_aicpu_record_orch_phase function signatures to accept uint64_t for task IDs.
    • Updated documentation for tasks_processed parameter to reflect its dual use for scheduler and orchestrator phases.
  • src/a2a3/platform/include/common/perf_profiling.h
    • Replaced PerfRecord.task_id (uint32_t) with PerfRecord.mixed_task_id (uint64_t).
    • Changed PerfRecord.fanout array type from int32_t to uint64_t.
    • Replaced AicpuPhaseRecord.tasks_processed and padding with an anonymous union containing mixed_task_id (uint64_t) and tasks_processed (uint64_t).
  • src/a2a3/platform/src/aicpu/performance_collector_aicpu.cpp
    • Updated perf_aicpu_record_phase implementation to use uint64_t tasks_processed and assign it to record->mixed_task_id.
    • Modified perf_aicpu_record_orch_phase to pass uint64_t mixed_task_id to perf_aicpu_record_phase.
  • src/a2a3/platform/src/host/performance_collector.cpp
    • Updated task sorting in export_swimlane_json to use mixed_task_id.
    • Modified LOG_WARN message to correctly print mixed_task_id using %llu format specifier.
    • Changed JSON output for task_id to use record.mixed_task_id and cast to int64_t for orchestrator phases.
  • src/a2a3/runtime/host_build_graph/aicpu/aicpu_executor.cpp
    • Updated task ID comparison in resolve_and_dispatch to use record->mixed_task_id and cast completed_task_id to uint64_t.
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Updated task ID comparison to use record->mixed_task_id and cast expected_reg_task_id to uint64_t.
    • Added logic to overwrite record->mixed_task_id with the full pto2_task_id_raw value.
    • Modified fanout population to store pto2_task_id_raw directly into record->fanout.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Updated the weak symbol declaration for perf_aicpu_record_orch_phase to accept uint64_t.
    • Modified CYCLE_COUNT_LAP_RECORD calls to pass pto2_task_id_raw(mixed_task_id) or uint64_t(0) for task identification.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp
    • Updated task ID extraction to use pto2_task_id_local on record->mixed_task_id.
    • Removed conditional logic and directly stored pto2_task_id_raw into record->fanout, eliminating the 32-bit fanout limitation.
  • tools/swimlane_converter.py
    • Changed the logic for generating orchestrator-to-scheduler arrows from orch_finalize to orch_fanin phase.
    • Updated arrow generation to use the end_time_us of the ORCH_FANIN record for more accurate timing.
    • Ensured task_id (now mixed_task_id) is an integer and valid before processing.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@indigo1973 indigo1973 changed the title perf: fix orch→sched arrows and multi-ring fanout correctness fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 17, 2026
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses two correctness issues in performance visualization: missing orchestrator-to-scheduler arrows and incorrect fanout data for multi-ring setups. The core of the fix involves changing task_id to a 64-bit mixed_task_id to uniquely identify tasks across rings and updating fanout arrays to store these full IDs. The use of a union in AicpuPhaseRecord is a clever optimization to avoid increasing its size. The changes are consistent and well-implemented across the C++ runtime and the Python trace converter. I have one suggestion regarding potentially redundant code.

@ChaoZheng109
Copy link
Collaborator

fix #311

@indigo1973 indigo1973 changed the title fix(perf): fix orch→sched arrows and multi-ring fanout correctness [WIP] fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 18, 2026
@indigo1973 indigo1973 changed the title [WIP] fix(perf): fix orch→sched arrows and multi-ring fanout correctness fix(perf): fix orch→sched arrows and multi-ring fanout correctness Mar 18, 2026
local_packed_end = (char*)local_packed_base + total_output_size;
}
CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, local_id);
CYCLE_COUNT_LAP_RECORD(g_orch_heap_cycle, AicpuPhaseId::ORCH_HEAP, pto2_task_id_raw(mixed_task_id));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

重载PTO2TaskId转uint64_t的operator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里是指在 struct PTO2TaskId 中添加 constexpr explicit operator uint64_t() const { return raw; },然后直接用static_cast<uint64_t>么。runtime.cpp和aicpu_executor.cpp中也有从PTO2TaskId中提取uint64 raw的,需要一并修改么

for (uint32_t i = 0; i < count; i++) {
PerfRecord* record = &perf_buf->records[i];
int32_t task_id = record->task_id;
int32_t task_id = static_cast<int32_t>(pto2_task_id_local(PTO2TaskId{record->mixed_task_id}));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

用task_id.local()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

采纳

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multi-ring这个功能是不使能吗,会有什么后果?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mult-ring是使能的吧。只是本次提交前,fanout 对 ring_id>0 直接 skip(旧链路里 fanout 用 32-bit local 表达),结果是多 ring 任务依赖边在 perf_swimlane/merged_swimlane 中缺边。当前依赖 aicpu_executor 这条写入链路;如果那边漏写,complete_perf_records 不会兜底,可能出现 fanout 信息缺失。这个我后续再看。

@indigo1973 indigo1973 force-pushed the prof_0317_v2 branch 2 times, most recently from af951fd to bc72047 Compare March 18, 2026 08:52
@indigo1973 indigo1973 changed the title fix(perf): fix orch→sched arrows and multi-ring fanout correctness Update: promote task_id to mixed_task_id (u64) across profiling pipeline Mar 18, 2026
@indigo1973
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed refactoring to support multi-ring profiling by promoting the 32-bit task_id to a 64-bit mixed_task_id across the entire profiling pipeline. The changes are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, device-side recording logic in AICore and AICPU, and host-side data processing and visualization tools. The implementation correctly handles the new 64-bit IDs, including overwriting partial IDs with full ones where necessary and updating data processing logic. The Python script for visualization is also appropriately updated. The changes appear correct and consistent. I have one minor suggestion for code cleanup in src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp to remove a redundant check.

Comment on lines +188 to +191
int32_t ring_id = static_cast<int32_t>(task_id.ring());
if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) {
continue;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The task_id.ring() method returns a uint8_t, which is an unsigned type. Casting it to int32_t and then checking if ring_id < 0 is redundant as this condition will always be false. It's cleaner to use the original uint8_t type and remove the unnecessary check for better code clarity and correctness.

Suggested change
int32_t ring_id = static_cast<int32_t>(task_id.ring());
if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) {
continue;
}
uint8_t ring_id = task_id.ring();
if (ring_id >= PTO2_MAX_RING_DEPTH) {
continue;
}

@indigo1973
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a significant and well-executed update to promote the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline. This change is crucial for enabling cross-view correlation in multi-ring profiling scenarios. The modifications are comprehensive, touching data structures, function signatures, and logic from device-side collection to host-side processing and visualization scripts. The changes appear correct and consistent with the goal of the pull request. I have one minor suggestion to refactor a code block for improved clarity and maintainability.

Comment on lines 237 to 247
if (task != nullptr) {
record->fanout_count = task->fanout_count;

for (int32_t j = 0; j < task->fanout_count; j++) {
record->fanout[j] = task->fanout[j];
record->fanout[j] = static_cast<uint64_t>(task->fanout[j]);
}
record->fanout_filled = 1;
} else {
record->fanout_count = 0;
record->fanout_filled = 1;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This block can be simplified by setting fanout_count to a default value before the if statement and setting fanout_filled after. This refactoring removes code duplication and improves readability.

        record->fanout_count = 0;
        if (task != nullptr) {
            record->fanout_count = task->fanout_count;

            for (int32_t j = 0; j < task->fanout_count; j++) {
                record->fanout[j] = static_cast<uint64_t>(task->fanout[j]);
            }
        }
        record->fanout_filled = 1;

Replace the 32-bit task_id field in PerfRecord and AicpuPhaseRecord with
the full 64-bit mixed_task_id (ring_id<<32 | local_id) for cross-view
correlation in multi-ring profiling scenarios.

Platform:
- PerfRecord: task_id (u32) → mixed_task_id (u64); fanout[] int32_t →
  uint64_t; add fanout_filled flag to skip duplicate host-side fill
- AicpuPhaseRecord: replace tasks_processed (u32) + padding with a union
  of mixed_task_id / tasks_processed (both u64)
- Update perf_aicpu_record_phase/orch_phase signatures to u64

tensormap_and_ringbuffer:
- Replace pto2_slot_states_ptr_ (single) with per-ring array
  pto2_ring_slot_states_ptrs_[PTO2_MAX_RING_DEPTH]
- complete_perf_records: skip fanout_filled records, decode ring_id from
  mixed_task_id, look up per-ring slot states, store u64 fanout entries
- aicpu_executor: overwrite mixed_task_id with full pto2_task_id_raw,
  set fanout_filled=1, register per-ring slot states

host_build_graph:
- Match PerfRecord by mixed_task_id; store fanout as u64; set fanout_filled

swimlane_converter:
- Add format_task_display() to render mixed_task_id as r{ring}t{local}
- Replace orch_finalize→orch_fanin flow arrows (use fanin end time)
@indigo1973
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request successfully promotes the 32-bit task_id to a 64-bit mixed_task_id throughout the profiling pipeline, a necessary change to support multi-ring scenarios. The modifications are comprehensive, affecting data structures like PerfRecord and AicpuPhaseRecord, data collection on both AICore and AICPU, and host-side processing and visualization tools. The implementation is robust, with notable improvements such as the efficient use of a union in AicpuPhaseRecord and the addition of an explicit conversion operator to PTO2TaskId. The updates to swimlane_converter.py for the new ID format and enhanced flow visualization are also well-executed. I have one suggestion for a minor code refactoring to improve maintainability by removing duplicated code. Overall, this is a solid and well-implemented enhancement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants