sync(a5): align tensormap_and_ringbuffer runtime with a2a3 (56a2c61..HEAD) #314
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces several key improvements to the PTO2 runtime, including a multi-ring buffer architecture for better memory management, an affinity gate for optimized thread usage, and enhanced profiling capabilities. It also includes validation and optimization of data structures to improve performance and prevent errors.
Activity
Code Review
This pull request is a major architectural refactoring that adds multi-ring buffers and cluster-based scheduling. The changes are extensive, touching the core runtime, orchestration API, and data structures to improve performance, resource management, and support for future hardware. The refactoring is substantial and includes significant performance optimizations, but a few critical issues need to be addressed: a potential deadlock in the paged attention kernels caused by a self-wait synchronization pattern, an incorrect cache flush operation that could lead to data corruption, and a minor configuration inconsistency that should be clarified.
examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_pv_matmul.cpp
…HEAD)

Synchronize A5 tensormap_and_ringbuffer runtime and platform with a2a3 improvements introduced after 56a2c61. Follows the sync pattern established in hw-native-sys#250 and hw-native-sys#300.

Platform (src/a5/platform/):
- 2f58a2f (hw-native-sys#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp), PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp, CMakeLists.txt
- b903e7b: sync perf_profiling.h for multi-ring support
- 334d355 (hw-native-sys#254): sync performance_collector_aicore.h for slim dispatch

Runtime host_build_graph (src/a5/runtime/host_build_graph/):
- 334d355 (hw-native-sys#254): slim dispatch payload in aicore_executor.cpp
- dd7ada4: standardize register init and exit handshake in aicore_executor.cpp
- 2f58a2f (hw-native-sys#267): AICPU affinity gate in aicpu_executor.cpp

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):
- e2e38b9 (hw-native-sys#249): cluster-based mixed-task dispatch; add pto_submit_types.h and SUBMIT_BY_CLUSTER.md
- a842263 (hw-native-sys#255): separate local ready queue by CoreType in pto_scheduler.h
- cf6462c (hw-native-sys#268): consolidate per-task state into PTO2TaskSlotState (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp)
- b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md, aicpu_executor.cpp, perf_profiling.h)
- 5d92137 (hw-native-sys#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp)
- 54d082c (hw-native-sys#281): replace task_id with slot-state pointer across scheduler, orchestrator, ring buffer, executor, RUNTIME_LOGIC.md
- d305376 (hw-native-sys#277): add scope deadlock detection in pto_orchestrator
- 1e41a3a (hw-native-sys#274): per-thread orchestrator phase profiling
- f5da078 (hw-native-sys#275): progress-aware ring buffer spin detection (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp)
- 10f6415 (hw-native-sys#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md
- 9c158e0 (hw-native-sys#291): emergency shutdown on fatal error (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory)
- 94f39ff (hw-native-sys#301): refactor PTOParam to aggregated container with parallel arrays (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory, pto_tensormap, pto_orchestrator, runtime2)
- 15e6034 (hw-native-sys#308): refactor Tensor fields and pto_tensormap for cache locality
- 77a81aa (hw-native-sys#306): replace PTOParam assert with orchestration error handling

Examples & tests (examples/a5/, tests/device_tests/a5/):
- 8cf8981 (hw-native-sys#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels
- b88eed3 (hw-native-sys#302): optimize paged attention pipeline, eliminate GM round-trips
- 94f39ff (hw-native-sys#301) + 15e6034 (hw-native-sys#308): update orchestration to new PTOParam API