Refactor: encapsulate dep pool operations and reorder orchestrator pipeline by poursoul · Pull Request #320 · hw-native-sys/simpler

poursoul · 2026-03-18T03:52:23Z

Move free functions (init, reclaim, ensure_space, used, available) into PTO2DepListPool as member methods; rename pto2_dep_list_prepend to prepend(), eliminating dep_pool_cur_entries from orchestrator state
Defer tensormap sync after heap allocation to overlap with GM prefetch
Guard paged attention orchestration profiling with ENABLE_PROFILING
Fix signed/unsigned comparison in host_regs fallback loop

…peline - Move free functions (init, reclaim, ensure_space, used, available) into PTO2DepListPool as member methods; rename pto2_dep_list_prepend to prepend(), eliminating dep_pool_cur_entries from orchestrator state - Defer tensormap sync after heap allocation to overlap with GM prefetch - Guard paged attention orchestration profiling with ENABLE_PROFILING - Fix signed/unsigned comparison in host_regs fallback loop

gemini-code-assist · 2026-03-18T03:52:47Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the orchestrator's dependency management by encapsulating related functionalities into a dedicated class, improving code structure and maintainability. It also optimizes the orchestrator's execution flow by reordering key operations to enhance performance, specifically by deferring tensormap synchronization. Additionally, profiling capabilities are now conditionally compiled, and a minor type-mismatch bug in host register handling has been resolved, contributing to a more robust and efficient system.

Highlights

Dependency Pool Encapsulation: Moved free functions related to the dependency pool (init, reclaim, ensure_space, used, available) into the PTO2DepListPool class as member methods, enhancing object-oriented design and removing dep_pool_cur_entries from orchestrator state.
Orchestrator Pipeline Reordering: Deferred tensormap synchronization until after heap allocation within the orchestrator pipeline, aiming to overlap with GM prefetch operations for improved performance.
Conditional Profiling: Guarded paged attention orchestration profiling code with ENABLE_PROFILING macros, allowing profiling to be enabled or disabled at compile time.
Bug Fix: Signed/Unsigned Comparison: Corrected a signed/unsigned comparison warning in the host_regs fallback loop by changing an int loop variable to uint32_t.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a significant refactoring by encapsulating dependency pool operations into the PTO2DepListPool class, which improves code structure and maintainability. It also reorders the orchestrator pipeline to defer tensormap synchronization, aiming to improve performance by overlapping it with GM prefetch. Additionally, it fixes a signed/unsigned comparison bug and makes profiling code in a test kernel conditional, which is a good practice. The changes are well-executed. I've found one minor issue in the documentation that needs to be addressed.

src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md

hw-native-sys-bot

整体上没有发现关键的逻辑/正确性 bug，pipeline reorder 保持了 sync-before-lookup 不变量，dep_pool_cur_entries 的消除逻辑也是正确的。以下是一些代码质量和风格上的问题：

hw-native-sys-bot · 2026-03-18T07:13:49Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp

+        }
+        SPIN_WAIT_HINT();
+    }
+}


文件末尾缺少换行符（\ No newline at end of file），建议补上。

hw-native-sys-bot · 2026-03-18T07:13:49Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp


    PTO2TaskDescriptor& task = task_ring.get_task_by_slot(slot);
-    PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];
+    PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];    


行尾有多余的空格（trailing whitespace）。

hw-native-sys-bot · 2026-03-18T07:13:49Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

+    uint8_t ring_id = orch->current_ring_id();
+    auto& task_ring = orch->rings[ring_id].task_ring;
+    PTO2SchedulerState* sched = orch->scheduler;
+    PTO2RingFlowControl &fc = orch->sm_handle->header->rings[ring_id].fc;


& 放置风格不一致：这里是 PTO2RingFlowControl &fc（& 跟变量名），但同文件 line 277 auto& task_ring 以及 pto_ring_buffer.h 中 PTO2SchedulerState& sched 都是 & 跟类型。同样的问题也出现在 line 120 的 auto &fc。建议统一。

hw-native-sys-bot · 2026-03-18T07:13:49Z

src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp

 // =============================================================================
 void pto2_submit_mixed_task(
    PTO2OrchestratorState* orch, const MixedKernels& mixed_kernels, const PTOParam& params) {
+    CYCLE_COUNT_START();


CYCLE_COUNT_START() 移到了 fatal 检查之前，fatal 路径会多一次无意义的 cycle counter 读取。建议移到 fatal 检查之后（原来的位置更合理）。

src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp

hw-native-sys-bot · 2026-03-18T07:13:49Z

...nsormap_and_ringbuffer/paged_attention_unroll/kernels/orchestration/paged_attention_orch.cpp

@@ -115,8 +128,6 @@ __attribute__((visibility("default"))) void aicpu_orchestration_entry(PTO2Runtim
    DataType data_type = DataType::BFLOAT16;
    CYCLE_COUNT_LAP(prof_param_extract);


Profiling guard 模式有隐患：prof_param_extract 等变量在 #ifdef ENABLE_PROFILING 内声明（line 83-94），但这里的 CYCLE_COUNT_LAP(prof_param_extract) 以及后面的 CYCLE_COUNT_LAP(prof_param_setup) 等调用并不在 #ifdef 内。

当前能编过是因为宏展开为 (void)0 时参数不会被求值，但这个模式比较脆弱——如果将来有人把宏改成 do { (void)(acc); } while(0) 来消除 unused warning，就会编译失败。建议要么把这些 CYCLE_COUNT_LAP 调用也用 #ifdef ENABLE_PROFILING 包裹，要么把变量声明移到 guard 外面。

Synchronize A5 runtime and tests with a2a3 improvements. Follows the established sync pattern. Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e4348eb (hw-native-sys#315): move reclamation state into owning data structures - c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants

Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in hw-native-sys#250 and hw-native-sys#300. Platform (src/a5/platform/): - 2f58a2f (hw-native-sys#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt - b903e7b: sync perf_profiling.h for multi-ring support - 334d355 (hw-native-sys#254): sync performance_collector_aicore.h for slim dispatch Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 334d355 (hw-native-sys#254): slim dispatch payload in aicore_executor.cpp - dd7ada4: standardize register init and exit handshake in aicore_executor.cpp - 2f58a2f (hw-native-sys#267): AICPU affinity gate in aicpu_executor.cpp - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e2e38b9 (hw-native-sys#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md - a842263 (hw-native-sys#255): separate local ready queue by CoreType in pto_scheduler.h - cf6462c (hw-native-sys#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp) - b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h) - 5d92137 (hw-native-sys#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp) - 54d082c (hw-native-sys#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md - d305376 (hw-native-sys#277): add scope deadlock detection in pto_orchestrator - 1e41a3a (hw-native-sys#274): per-thread orchestrator phase profiling - f5da078 (hw-native-sys#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp) - 10f6415 (hw-native-sys#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md - 9c158e0 (hw-native-sys#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory) - 94f39ff (hw-native-sys#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2) - 15e6034 (hw-native-sys#308): refactor Tensor fields and pto_tensormap for cache locality - 77a81aa (hw-native-sys#306): replace PTOParam assert with orchestration error handling - e4348eb (hw-native-sys#315): move reclamation state into owning data structures - c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants Examples & tests (examples/a5/, tests/device_tests/a5/): - 8cf8981 (hw-native-sys#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels - b88eed3 (hw-native-sys#302): optimize paged attention pipeline, eliminate GM round-trips - 94f39ff (hw-native-sys#301) + 15e6034 (hw-native-sys#308): update orchestration to new PTOParam API

Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in #250 and #300. Platform (src/a5/platform/): - 2f58a2f (#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt - b903e7b: sync perf_profiling.h for multi-ring support - 334d355 (#254): sync performance_collector_aicore.h for slim dispatch Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 334d355 (#254): slim dispatch payload in aicore_executor.cpp - dd7ada4: standardize register init and exit handshake in aicore_executor.cpp - 2f58a2f (#267): AICPU affinity gate in aicpu_executor.cpp - 83473ba (#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e2e38b9 (#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md - a842263 (#255): separate local ready queue by CoreType in pto_scheduler.h - cf6462c (#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp) - b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h) - 5d92137 (#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp) - 54d082c (#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md - d305376 (#277): add scope deadlock detection in pto_orchestrator - 1e41a3a (#274): per-thread orchestrator phase profiling - f5da078 (#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp) - 10f6415 (#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md - 9c158e0 (#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory) - 94f39ff (#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2) - 15e6034 (#308): refactor Tensor fields and pto_tensormap for cache locality - 77a81aa (#306): replace PTOParam assert with orchestration error handling - e4348eb (#315): move reclamation state into owning data structures - c17770e (#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (#322): unify paged attention golden cases across test variants Examples & tests (examples/a5/, tests/device_tests/a5/): - 8cf8981 (#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels - b88eed3 (#302): optimize paged attention pipeline, eliminate GM round-trips - 94f39ff (#301) + 15e6034 (#308): update orchestration to new PTOParam API

gemini-code-assist bot reviewed Mar 18, 2026

View reviewed changes

src/a2a3/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md Show resolved Hide resolved

hw-native-sys-bot reviewed Mar 18, 2026

View reviewed changes

ChaoZheng109 approved these changes Mar 18, 2026

View reviewed changes

ChaoZheng109 merged commit c17770e into hw-native-sys:main Mar 18, 2026
5 checks passed

ChaoZheng109 mentioned this pull request Mar 19, 2026

sync(a5): align runtime and tests with a2a3 (56a2c61..HEAD) #314

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: encapsulate dep pool operations and reorder orchestrator pipeline#320

Refactor: encapsulate dep pool operations and reorder orchestrator pipeline#320
ChaoZheng109 merged 1 commit intohw-native-sys:mainfrom
poursoul:some-changes-for-orch

poursoul commented Mar 18, 2026

Uh oh!

gemini-code-assist bot commented Mar 18, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

hw-native-sys-bot left a comment

Uh oh!

hw-native-sys-bot Mar 18, 2026

Uh oh!

hw-native-sys-bot Mar 18, 2026

Uh oh!

hw-native-sys-bot Mar 18, 2026

Uh oh!

hw-native-sys-bot Mar 18, 2026

Uh oh!

Uh oh!

hw-native-sys-bot Mar 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -115,8 +128,6 @@ __attribute__((visibility("default"))) void aicpu_orchestration_entry(PTO2Runtim
		DataType data_type = DataType::BFLOAT16;
		CYCLE_COUNT_LAP(prof_param_extract);

Conversation

poursoul commented Mar 18, 2026

Uh oh!

gemini-code-assist bot commented Mar 18, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

hw-native-sys-bot left a comment

Choose a reason for hiding this comment

Uh oh!

hw-native-sys-bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

hw-native-sys-bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

hw-native-sys-bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

hw-native-sys-bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hw-native-sys-bot Mar 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants