Skip to content

Refactor: encapsulate dep pool operations and reorder orchestrator pipeline#320

Merged
ChaoZheng109 merged 1 commit intohw-native-sys:mainfrom
poursoul:some-changes-for-orch
Mar 18, 2026
Merged

Refactor: encapsulate dep pool operations and reorder orchestrator pipeline#320
ChaoZheng109 merged 1 commit intohw-native-sys:mainfrom
poursoul:some-changes-for-orch

Conversation

@poursoul
Copy link
Collaborator

  • Move free functions (init, reclaim, ensure_space, used, available) into PTO2DepListPool as member methods; rename pto2_dep_list_prepend to prepend(), eliminating dep_pool_cur_entries from orchestrator state
  • Defer tensormap sync after heap allocation to overlap with GM prefetch
  • Guard paged attention orchestration profiling with ENABLE_PROFILING
  • Fix signed/unsigned comparison in host_regs fallback loop

…peline

- Move free functions (init, reclaim, ensure_space, used, available)
  into PTO2DepListPool as member methods; rename pto2_dep_list_prepend
  to prepend(), eliminating dep_pool_cur_entries from orchestrator state
- Defer tensormap sync after heap allocation to overlap with GM prefetch
- Guard paged attention orchestration profiling with ENABLE_PROFILING
- Fix signed/unsigned comparison in host_regs fallback loop
@gemini-code-assist
Copy link

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the orchestrator's dependency management by encapsulating related functionalities into a dedicated class, improving code structure and maintainability. It also optimizes the orchestrator's execution flow by reordering key operations to enhance performance, specifically by deferring tensormap synchronization. Additionally, profiling capabilities are now conditionally compiled, and a minor type-mismatch bug in host register handling has been resolved, contributing to a more robust and efficient system.

Highlights

  • Dependency Pool Encapsulation: Moved free functions related to the dependency pool (init, reclaim, ensure_space, used, available) into the PTO2DepListPool class as member methods, enhancing object-oriented design and removing dep_pool_cur_entries from orchestrator state.
  • Orchestrator Pipeline Reordering: Deferred tensormap synchronization until after heap allocation within the orchestrator pipeline, aiming to overlap with GM prefetch operations for improved performance.
  • Conditional Profiling: Guarded paged attention orchestration profiling code with ENABLE_PROFILING macros, allowing profiling to be enabled or disabled at compile time.
  • Bug Fix: Signed/Unsigned Comparison: Corrected a signed/unsigned comparison warning in the host_regs fallback loop by changing an int loop variable to uint32_t.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant refactoring by encapsulating dependency pool operations into the PTO2DepListPool class, which improves code structure and maintainability. It also reorders the orchestrator pipeline to defer tensormap synchronization, aiming to improve performance by overlapping it with GM prefetch. Additionally, it fixes a signed/unsigned comparison bug and makes profiling code in a test kernel conditional, which is a good practice. The changes are well-executed. I've found one minor issue in the documentation that needs to be addressed.

Copy link
Collaborator

@hw-native-sys-bot hw-native-sys-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

整体上没有发现关键的逻辑/正确性 bug,pipeline reorder 保持了 sync-before-lookup 不变量,dep_pool_cur_entries 的消除逻辑也是正确的。以下是一些代码质量和风格上的问题:

}
SPIN_WAIT_HINT();
}
} No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

文件末尾缺少换行符(\ No newline at end of file),建议补上。


PTO2TaskDescriptor& task = task_ring.get_task_by_slot(slot);
PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];
PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

行尾有多余的空格(trailing whitespace)。

uint8_t ring_id = orch->current_ring_id();
auto& task_ring = orch->rings[ring_id].task_ring;
PTO2SchedulerState* sched = orch->scheduler;
PTO2RingFlowControl &fc = orch->sm_handle->header->rings[ring_id].fc;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& 放置风格不一致:这里是 PTO2RingFlowControl &fc(& 跟变量名),但同文件 line 277 auto& task_ring 以及 pto_ring_buffer.hPTO2SchedulerState& sched 都是 & 跟类型。同样的问题也出现在 line 120 的 auto &fc。建议统一。

// =============================================================================
void pto2_submit_mixed_task(
PTO2OrchestratorState* orch, const MixedKernels& mixed_kernels, const PTOParam& params) {
CYCLE_COUNT_START();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CYCLE_COUNT_START() 移到了 fatal 检查之前,fatal 路径会多一次无意义的 cycle counter 读取。建议移到 fatal 检查之后(原来的位置更合理)。

@@ -115,8 +128,6 @@ __attribute__((visibility("default"))) void aicpu_orchestration_entry(PTO2Runtim
DataType data_type = DataType::BFLOAT16;
CYCLE_COUNT_LAP(prof_param_extract);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Profiling guard 模式有隐患:prof_param_extract 等变量在 #ifdef ENABLE_PROFILING 内声明(line 83-94),但这里的 CYCLE_COUNT_LAP(prof_param_extract) 以及后面的 CYCLE_COUNT_LAP(prof_param_setup) 等调用并不在 #ifdef 内。

当前能编过是因为宏展开为 (void)0 时参数不会被求值,但这个模式比较脆弱——如果将来有人把宏改成 do { (void)(acc); } while(0) 来消除 unused warning,就会编译失败。建议要么把这些 CYCLE_COUNT_LAP 调用也用 #ifdef ENABLE_PROFILING 包裹,要么把变量声明移到 guard 外面。

@ChaoZheng109 ChaoZheng109 merged commit c17770e into hw-native-sys:main Mar 18, 2026
5 checks passed
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Mar 19, 2026
Synchronize A5 runtime and tests with a2a3 improvements. Follows the established sync pattern.

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- e4348eb (hw-native-sys#315): move reclamation state into owning data structures
- c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline
- 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor

Tests (tests/device_tests/a5/):
- 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants
ChaoZheng109 added a commit to ChaoZheng109/simpler that referenced this pull request Mar 19, 2026
Synchronize A5 tensormap_and_ringbuffer runtime and platform with
a2a3 improvements introduced after 56a2c61. Follows the sync pattern
established in hw-native-sys#250 and hw-native-sys#300.

Platform (src/a5/platform/):
- 2f58a2f (hw-native-sys#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp),
  PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp,
  CMakeLists.txt
- b903e7b: sync perf_profiling.h for multi-ring support
- 334d355 (hw-native-sys#254): sync performance_collector_aicore.h for slim dispatch

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 334d355 (hw-native-sys#254): slim dispatch payload in aicore_executor.cpp
- dd7ada4: standardize register init and exit handshake in aicore_executor.cpp
- 2f58a2f (hw-native-sys#267): AICPU affinity gate in aicpu_executor.cpp
- 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- e2e38b9 (hw-native-sys#249): cluster-based mixed-task dispatch; add pto_submit_types.h
  and SUBMIT_BY_CLUSTER.md
- a842263 (hw-native-sys#255): separate local ready queue by CoreType in pto_scheduler.h
- cf6462c (hw-native-sys#268): consolidate per-task state into PTO2TaskSlotState
  (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp)
- b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md,
  aicpu_executor.cpp, perf_profiling.h)
- 5d92137 (hw-native-sys#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp)
- 54d082c (hw-native-sys#281): replace task_id with slot-state pointer across scheduler,
  orchestrator, ring buffer, executor, RUNTIME_LOGIC.md
- d305376 (hw-native-sys#277): add scope deadlock detection in pto_orchestrator
- 1e41a3a (hw-native-sys#274): per-thread orchestrator phase profiling
- f5da078 (hw-native-sys#275): progress-aware ring buffer spin detection
  (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp)
- 10f6415 (hw-native-sys#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md
- 9c158e0 (hw-native-sys#291): emergency shutdown on fatal error
  (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory)
- 94f39ff (hw-native-sys#301): refactor PTOParam to aggregated container with parallel arrays
  (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory,
  pto_tensormap, pto_orchestrator, runtime2)
- 15e6034 (hw-native-sys#308): refactor Tensor fields and pto_tensormap for cache locality
- 77a81aa (hw-native-sys#306): replace PTOParam assert with orchestration error handling
- e4348eb (hw-native-sys#315): move reclamation state into owning data structures
- c17770e (hw-native-sys#320): encapsulate dep pool operations and reorder orchestrator pipeline
- 83473ba (hw-native-sys#323): replace block core assignment with round-robin in AICPU executor

Tests (tests/device_tests/a5/):
- 439ccd4 (hw-native-sys#322): unify paged attention golden cases across test variants

Examples & tests (examples/a5/, tests/device_tests/a5/):
- 8cf8981 (hw-native-sys#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels
- b88eed3 (hw-native-sys#302): optimize paged attention pipeline, eliminate GM round-trips
- 94f39ff (hw-native-sys#301) + 15e6034 (hw-native-sys#308): update orchestration to new PTOParam API
ChaoZheng109 added a commit that referenced this pull request Mar 19, 2026
Synchronize A5 tensormap_and_ringbuffer runtime and platform with
a2a3 improvements introduced after 56a2c61. Follows the sync pattern
established in #250 and #300.

Platform (src/a5/platform/):
- 2f58a2f (#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp),
  PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp,
  CMakeLists.txt
- b903e7b: sync perf_profiling.h for multi-ring support
- 334d355 (#254): sync performance_collector_aicore.h for slim dispatch

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 334d355 (#254): slim dispatch payload in aicore_executor.cpp
- dd7ada4: standardize register init and exit handshake in aicore_executor.cpp
- 2f58a2f (#267): AICPU affinity gate in aicpu_executor.cpp
- 83473ba (#323): replace block core assignment with round-robin in AICPU executor

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- e2e38b9 (#249): cluster-based mixed-task dispatch; add pto_submit_types.h
  and SUBMIT_BY_CLUSTER.md
- a842263 (#255): separate local ready queue by CoreType in pto_scheduler.h
- cf6462c (#268): consolidate per-task state into PTO2TaskSlotState
  (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp)
- b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md,
  aicpu_executor.cpp, perf_profiling.h)
- 5d92137 (#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp)
- 54d082c (#281): replace task_id with slot-state pointer across scheduler,
  orchestrator, ring buffer, executor, RUNTIME_LOGIC.md
- d305376 (#277): add scope deadlock detection in pto_orchestrator
- 1e41a3a (#274): per-thread orchestrator phase profiling
- f5da078 (#275): progress-aware ring buffer spin detection
  (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp)
- 10f6415 (#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md
- 9c158e0 (#291): emergency shutdown on fatal error
  (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory)
- 94f39ff (#301): refactor PTOParam to aggregated container with parallel arrays
  (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory,
  pto_tensormap, pto_orchestrator, runtime2)
- 15e6034 (#308): refactor Tensor fields and pto_tensormap for cache locality
- 77a81aa (#306): replace PTOParam assert with orchestration error handling
- e4348eb (#315): move reclamation state into owning data structures
- c17770e (#320): encapsulate dep pool operations and reorder orchestrator pipeline
- 83473ba (#323): replace block core assignment with round-robin in AICPU executor

Tests (tests/device_tests/a5/):
- 439ccd4 (#322): unify paged attention golden cases across test variants

Examples & tests (examples/a5/, tests/device_tests/a5/):
- 8cf8981 (#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels
- b88eed3 (#302): optimize paged attention pipeline, eliminate GM round-trips
- 94f39ff (#301) + 15e6034 (#308): update orchestration to new PTOParam API
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants