Refactor: encapsulate dep pool operations and reorder orchestrator pipeline#320
Conversation
poursoul
commented
Mar 18, 2026
- Move free functions (init, reclaim, ensure_space, used, available) into PTO2DepListPool as member methods; rename pto2_dep_list_prepend to prepend(), eliminating dep_pool_cur_entries from orchestrator state
- Defer tensormap sync after heap allocation to overlap with GM prefetch
- Guard paged attention orchestration profiling with ENABLE_PROFILING
- Fix signed/unsigned comparison in host_regs fallback loop
…peline - Move free functions (init, reclaim, ensure_space, used, available) into PTO2DepListPool as member methods; rename pto2_dep_list_prepend to prepend(), eliminating dep_pool_cur_entries from orchestrator state - Defer tensormap sync after heap allocation to overlap with GM prefetch - Guard paged attention orchestration profiling with ENABLE_PROFILING - Fix signed/unsigned comparison in host_regs fallback loop
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the orchestrator's dependency management by encapsulating related functionalities into a dedicated class, improving code structure and maintainability. It also optimizes the orchestrator's execution flow by reordering key operations to enhance performance, specifically by deferring tensormap synchronization. Additionally, profiling capabilities are now conditionally compiled, and a minor type-mismatch bug in host register handling has been resolved, contributing to a more robust and efficient system. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant refactoring by encapsulating dependency pool operations into the PTO2DepListPool class, which improves code structure and maintainability. It also reorders the orchestrator pipeline to defer tensormap synchronization, aiming to improve performance by overlapping it with GM prefetch. Additionally, it fixes a signed/unsigned comparison bug and makes profiling code in a test kernel conditional, which is a good practice. The changes are well-executed. I've found one minor issue in the documentation that needs to be addressed.
hw-native-sys-bot
left a comment
There was a problem hiding this comment.
整体上没有发现关键的逻辑/正确性 bug,pipeline reorder 保持了 sync-before-lookup 不变量,dep_pool_cur_entries 的消除逻辑也是正确的。以下是一些代码质量和风格上的问题:
| } | ||
| SPIN_WAIT_HINT(); | ||
| } | ||
| } No newline at end of file |
There was a problem hiding this comment.
文件末尾缺少换行符(\ No newline at end of file),建议补上。
|
|
||
| PTO2TaskDescriptor& task = task_ring.get_task_by_slot(slot); | ||
| PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot]; | ||
| PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot]; |
There was a problem hiding this comment.
行尾有多余的空格(trailing whitespace)。
| uint8_t ring_id = orch->current_ring_id(); | ||
| auto& task_ring = orch->rings[ring_id].task_ring; | ||
| PTO2SchedulerState* sched = orch->scheduler; | ||
| PTO2RingFlowControl &fc = orch->sm_handle->header->rings[ring_id].fc; |
There was a problem hiding this comment.
& 放置风格不一致:这里是 PTO2RingFlowControl &fc(& 跟变量名),但同文件 line 277 auto& task_ring 以及 pto_ring_buffer.h 中 PTO2SchedulerState& sched 都是 & 跟类型。同样的问题也出现在 line 120 的 auto &fc。建议统一。
| // ============================================================================= | ||
| void pto2_submit_mixed_task( | ||
| PTO2OrchestratorState* orch, const MixedKernels& mixed_kernels, const PTOParam& params) { | ||
| CYCLE_COUNT_START(); |
There was a problem hiding this comment.
CYCLE_COUNT_START() 移到了 fatal 检查之前,fatal 路径会多一次无意义的 cycle counter 读取。建议移到 fatal 检查之后(原来的位置更合理)。
| @@ -115,8 +128,6 @@ __attribute__((visibility("default"))) void aicpu_orchestration_entry(PTO2Runtim | |||
| DataType data_type = DataType::BFLOAT16; | |||
| CYCLE_COUNT_LAP(prof_param_extract); | |||
There was a problem hiding this comment.
Profiling guard 模式有隐患:prof_param_extract 等变量在 #ifdef ENABLE_PROFILING 内声明(line 83-94),但这里的 CYCLE_COUNT_LAP(prof_param_extract) 以及后面的 CYCLE_COUNT_LAP(prof_param_setup) 等调用并不在 #ifdef 内。
当前能编过是因为宏展开为 (void)0 时参数不会被求值,但这个模式比较脆弱——如果将来有人把宏改成 do { (void)(acc); } while(0) 来消除 unused warning,就会编译失败。建议要么把这些 CYCLE_COUNT_LAP 调用也用 #ifdef ENABLE_PROFILING 包裹,要么把变量声明移到 guard 外面。
Synchronize A5 runtime and tests with a2a3 improvements. Follows the established sync pattern. Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e4348eb (hw-native-sys#315): move reclamation state into owning data structures - c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants
Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in hw-native-sys#250 and hw-native-sys#300. Platform (src/a5/platform/): - 2f58a2f (hw-native-sys#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt - b903e7b: sync perf_profiling.h for multi-ring support - 334d355 (hw-native-sys#254): sync performance_collector_aicore.h for slim dispatch Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 334d355 (hw-native-sys#254): slim dispatch payload in aicore_executor.cpp - dd7ada4: standardize register init and exit handshake in aicore_executor.cpp - 2f58a2f (hw-native-sys#267): AICPU affinity gate in aicpu_executor.cpp - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e2e38b9 (hw-native-sys#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md - a842263 (hw-native-sys#255): separate local ready queue by CoreType in pto_scheduler.h - cf6462c (hw-native-sys#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp) - b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h) - 5d92137 (hw-native-sys#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp) - 54d082c (hw-native-sys#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md - d305376 (hw-native-sys#277): add scope deadlock detection in pto_orchestrator - 1e41a3a (hw-native-sys#274): per-thread orchestrator phase profiling - f5da078 (hw-native-sys#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp) - 10f6415 (hw-native-sys#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md - 9c158e0 (hw-native-sys#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory) - 94f39ff (hw-native-sys#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2) - 15e6034 (hw-native-sys#308): refactor Tensor fields and pto_tensormap for cache locality - 77a81aa (hw-native-sys#306): replace PTOParam assert with orchestration error handling - e4348eb (hw-native-sys#315): move reclamation state into owning data structures - c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants Examples & tests (examples/a5/, tests/device_tests/a5/): - 8cf8981 (hw-native-sys#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels - b88eed3 (hw-native-sys#302): optimize paged attention pipeline, eliminate GM round-trips - 94f39ff (hw-native-sys#301) + 15e6034 (hw-native-sys#308): update orchestration to new PTOParam API
Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in #250 and #300. Platform (src/a5/platform/): - 2f58a2f (#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt - b903e7b: sync perf_profiling.h for multi-ring support - 334d355 (#254): sync performance_collector_aicore.h for slim dispatch Runtime host_build_graph (src/a5/runtime/host_build_graph/): - 334d355 (#254): slim dispatch payload in aicore_executor.cpp - dd7ada4: standardize register init and exit handshake in aicore_executor.cpp - 2f58a2f (#267): AICPU affinity gate in aicpu_executor.cpp - 83473ba (#323): replace block core assignment with round-robin in AICPU executor Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/): - e2e38b9 (#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md - a842263 (#255): separate local ready queue by CoreType in pto_scheduler.h - cf6462c (#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp) - b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h) - 5d92137 (#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp) - 54d082c (#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md - d305376 (#277): add scope deadlock detection in pto_orchestrator - 1e41a3a (#274): per-thread orchestrator phase profiling - f5da078 (#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp) - 10f6415 (#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md - 9c158e0 (#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory) - 94f39ff (#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2) - 15e6034 (#308): refactor Tensor fields and pto_tensormap for cache locality - 77a81aa (#306): replace PTOParam assert with orchestration error handling - e4348eb (#315): move reclamation state into owning data structures - c17770e (#320): encapsulate dep pool operations and reorder orchestrator pipeline - 83473ba (#323): replace block core assignment with round-robin in AICPU executor Tests (tests/device_tests/a5/): - 439ccd4 (#322): unify paged attention golden cases across test variants Examples & tests (examples/a5/, tests/device_tests/a5/): - 8cf8981 (#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels - b88eed3 (#302): optimize paged attention pipeline, eliminate GM round-trips - 94f39ff (#301) + 15e6034 (#308): update orchestration to new PTOParam API