sync(a5): align tensormap_and_ringbuffer runtime with a2a3 (56a2c61..HEAD)#314

Draft
ChaoZheng109 wants to merge 1 commit into hw-native-sys:main from ChaoZheng109:a5/sync1

Conversation


@ChaoZheng109 ChaoZheng109 commented Mar 17, 2026

Synchronize A5 tensormap_and_ringbuffer runtime and platform with
a2a3 improvements introduced after 56a2c61. Follows the sync pattern
established in #250 and #300.

Platform (src/a5/platform/):

  • 2f58a2f (#267): add AICPU thread affinity (platform_aicpu_affinity.h/cpp),
    PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH, device_runner, kernel.cpp,
    CMakeLists.txt
  • b903e7b: sync perf_profiling.h for multi-ring support
  • 334d355 (#254): sync performance_collector_aicore.h for slim dispatch

Runtime host_build_graph (src/a5/runtime/host_build_graph/):

  • 334d355 (#254): slim dispatch payload in aicore_executor.cpp
  • dd7ada4: standardize register init and exit handshake in aicore_executor.cpp
  • 2f58a2f (#267): AICPU affinity gate in aicpu_executor.cpp

Runtime tensormap_and_ringbuffer (src/a5/runtime/tensormap_and_ringbuffer/):

  • e2e38b9 (#249): cluster-based mixed-task dispatch; add pto_submit_types.h
    and SUBMIT_BY_CLUSTER.md
  • a842263 (#255): separate local ready queue by CoreType in pto_scheduler.h
  • cf6462c (#268): consolidate per-task state into PTO2TaskSlotState
    (pto_runtime2_types.h, pto_scheduler.cpp, pto_orchestrator.cpp)
  • b903e7b: multi-ring buffer architecture (pto_shared_memory, MULTI_RING.md,
    aicpu_executor.cpp, perf_profiling.h)
  • 5d92137 (#264): DepListPool ring buffer reclamation (pto_ring_buffer.h/cpp)
  • 54d082c (#281): replace task_id with slot-state pointer across scheduler,
    orchestrator, ring buffer, executor, RUNTIME_LOGIC.md
  • d305376 (#277): add scope deadlock detection in pto_orchestrator
  • 1e41a3a (#274): per-thread orchestrator phase profiling
  • f5da078 (#275): progress-aware ring buffer spin detection
    (pto_ring_buffer.h, pto_orchestrator.cpp, runtime_maker.cpp)
  • 10f6415 (#284): tighten PTO2_PROFILING macro guards; sync profiling_levels.md
  • 9c158e0 (#291): emergency shutdown on fatal error
    (aicpu_executor, pto_orchestration_api.h, pto_orchestrator, pto_shared_memory)
  • 94f39ff (#301): refactor PTOParam to aggregated container with parallel arrays
    (pto_types.h, pto_runtime2_types.h, pto_scheduler, pto_shared_memory,
    pto_tensormap, pto_orchestrator, runtime2)
  • 15e6034 (#308): refactor Tensor fields and pto_tensormap for cache locality
  • 77a81aa (#306): replace PTOParam assert with orchestration error handling

Examples & tests (examples/a5/, tests/device_tests/a5/):

  • 8cf8981 (#293): replace PipeSyncFunc with FULL_MEMORY_BARRIER in kernels
  • b88eed3 (#302): optimize paged attention pipeline, eliminate GM round-trips
  • 94f39ff (#301) + 15e6034 (#308): update orchestration to new PTOParam API
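For reviewers unfamiliar with the barrier swap in 8cf8981, a hedged sketch of what a full memory barrier provides in portable C++ follows. The FULL_MEMORY_BARRIER name comes from this PR; its exact device-side semantics on AICore are an assumption here, and std::atomic_thread_fence is only the closest portable analogue.

```cpp
#include <atomic>

// Sketch only: FULL_MEMORY_BARRIER is assumed to be a device-side full
// fence; a sequentially consistent thread fence is the portable stand-in.
#define FULL_MEMORY_BARRIER() std::atomic_thread_fence(std::memory_order_seq_cst)

// Example: publish data, then a ready flag. The fence keeps the store to
// `data` from being reordered past the store to `ready`.
inline void publish(int* data, std::atomic<int>& ready, int value) {
    *data = value;
    FULL_MEMORY_BARRIER();
    ready.store(1, std::memory_order_relaxed);
}
```

A consumer that observes `ready == 1` after a matching acquire fence is then guaranteed to see the published `data` value.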

@gemini-code-assist
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces several key improvements to the PTO2 runtime, including a multi-ring buffer architecture for better memory management, an affinity gate for optimized thread usage, and enhanced profiling capabilities. It also includes validation and optimization of data structures to improve performance and prevent errors.

Highlights

  • Multi-Ring Buffer Architecture: Implements a multi-ring buffer architecture to improve memory reclamation in nested scopes, enhancing performance and preventing deadlocks.
  • AICPU Affinity Gate: Introduces an affinity gate to drop excess AICPU threads, optimizing thread usage based on cluster topology.
  • Enhanced Profiling: Adds detailed profiling for orchestrator and scheduler activities, providing better insights into performance bottlenecks.
  • PTOParam Validation: Implements validation for PTOParam construction, ensuring data integrity and preventing runtime errors.
  • Slimmed Dispatch Payload: Reduces the size of the dispatch payload by moving metadata to the TaskDescriptor, optimizing memory access during kernel execution.
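The affinity gate highlighted above can be sketched as follows. This is a hypothetical illustration of the "drop excess threads" behavior, not the runtime's actual implementation: the constant mirrors PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH from this PR, but the function signature is assumed.

```cpp
#include <cstdint>

// Stand-in for PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH (value assumed).
constexpr uint32_t kMaxAicpuLaunchThreads = 4;

// Hypothetical gate: threads beyond the launch budget are told to exit
// before entering the runtime. Returns true if the calling thread
// (identified by its launch index) should proceed, false if it is surplus.
inline bool aicpu_affinity_gate(uint32_t thread_index,
                                uint32_t max_threads = kMaxAicpuLaunchThreads) {
    return thread_index < max_threads;
}
```

In the real platform code the surviving threads would additionally be pinned to cores based on cluster topology (e.g. via pthread_setaffinity_np on Linux); this sketch only shows the gating decision.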


Changelog
  • examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_pv_matmul.cpp
    • Added set_flag and wait_flag calls for synchronization.
  • examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_qk_matmul.cpp
    • Added set_flag and wait_flag calls for synchronization.
  • examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/aiv/aiv_online_update.cpp
    • Added set_flag and wait_flag calls for synchronization.
  • examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/aiv/aiv_softmax_prepare.cpp
    • Added a comment explaining the manual filling of invalid columns, added set_flag and wait_flag calls for synchronization, and reordered tensor arguments in kernel_entry.
  • examples/a5/tensormap_and_ringbuffer/paged_attention/kernels/orchestration/paged_attention_orch.cpp
    • Changed tensor shape types from uint64_t to uint32_t and migrated PTOParam usage to the add_input/add_output/add_scalar methods.
  • src/a5/platform/include/aicore/performance_collector_aicore.h
    • Removed func_id and core_type from the perf_aicore_record_task parameters; AICore now records only task_id and timestamps.
  • src/a5/platform/include/aicpu/platform_aicpu_affinity.h
    • Added a header file defining the platform_aicpu_affinity_gate function for controlling AICPU thread affinity.
  • src/a5/platform/include/common/perf_profiling.h
    • Added ring_id to PerfRecord struct and updated description of task_id.
  • src/a5/platform/include/common/platform_config.h
    • Added PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH constant to define the maximum number of AICPU launch threads.
  • src/a5/platform/onboard/aicpu/kernel.cpp
    • Included platform_aicpu_affinity.h and runtime.h, and added an affinity gate to drop excess threads before entering runtime.
  • src/a5/platform/onboard/aicpu/platform_aicpu_affinity.cpp
    • Added a source file implementing the platform_aicpu_affinity_gate function for controlling AICPU thread affinity based on cluster topology.
  • src/a5/platform/onboard/host/device_runner.cpp
    • Modified launch_aicpu_kernel to use PLATFORM_MAX_AICPU_THREADS_JUST_FOR_LAUNCH.
  • src/a5/platform/sim/aicpu/platform_aicpu_affinity.cpp
    • Added a source file implementing the platform_aicpu_affinity_gate function for controlling AICPU thread affinity in simulation.
  • src/a5/platform/sim/host/CMakeLists.txt
    • Added platform_aicpu_affinity.cpp to the HOST_RUNTIME_SOURCES.
  • src/a5/platform/sim/host/device_runner.cpp
    • Included aicpu/platform_aicpu_affinity.h and modified AICPU thread launching to over-launch for affinity gate.
  • src/a5/runtime/host_build_graph/aicore/aicore_executor.cpp
    • Modified aicore_execute to use dcci on aicore_regs_ready, removed func_id and core_type from the perf_aicore_record_task parameters, and flushed only SINGLE_CACHE_LINE before kernel exit.
  • src/a5/runtime/host_build_graph/aicpu/aicpu_executor.cpp
    • Modified AicpuExecutor::resolve_and_dispatch to include func_id and core_type in PerfRecord.
  • src/a5/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
    • Modified execute_task to directly access PTO2DispatchPayload fields and updated aicore_execute to use the payload address from Handshake.task and derive task ID from the register value.
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed unnecessary includes, added a per-core dispatch counter, and implemented affinity gating.
  • src/a5/runtime/tensormap_and_ringbuffer/docs/MULTI_RING.md
    • Added a new document describing the multi-ring buffer architecture.
  • src/a5/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
    • Added a note about the multi-ring architecture and updated the shared memory layout description.
  • src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
    • Added a new document outlining the requirements and design for cluster submission.
  • src/a5/runtime/tensormap_and_ringbuffer/docs/device_log_profiling.md
    • Clarified the scheduler loop phases and their statistics.
  • src/a5/runtime/tensormap_and_ringbuffer/docs/profiling_levels.md
    • Updated the profiling macro hierarchy and added a description for --enable-profiling.
  • src/a5/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
    • Modified init_runtime_impl to parse PTO2_RING_DEP_POOL, allocate GM heap for all rings, and update the record_tensor_pair call.
  • src/a5/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h
    • Modified the submit API to use MixedKernels and added convenience wrappers for AIC and AIV tasks.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
    • Slimmed down the dispatch payload to only include function_bin_addr and args.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Modified the orchestrator to handle per-ring resources and update profiling records.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Updated PTO2OrchestratorState to include per-ring resources and a fatal error flag.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
    • Modified pto2_dep_pool_init to use nullptr for the base entry and updated error handling in pto2_heap_ring_alloc.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Added error_code_ptr to PTO2HeapRing and PTO2TaskRing, and added PTO2RingSet struct.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
    • Implemented multi-ring support, including per-ring state initialization and error handling.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
    • Included pto_submit_types.h and updated PTO2RuntimeOps to use MixedKernels.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Implemented shape-based ready queues and updated task completion and dispatch logic.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Updated PTO2SchedulerState to include per-ring scheduling state and shape-based ready queues.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.cpp
    • Implemented per-ring shared memory layout and initialization.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
    • Updated shared memory structures to support multi-ring architecture.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/pto_submit_types.h
    • Created a new header file defining types for mixed task submission.
  • src/a5/runtime/tensormap_and_ringbuffer/runtime/tensor.h
    • Optimized Tensor struct layout for cache efficiency and added manual dependency flag.
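The PTOParam refactor referenced throughout the changelog (add_input/add_output/add_scalar over parallel arrays) can be sketched roughly as below. This is an illustrative shape only: the real struct is fixed-size and device-resident, and the member names here are assumptions, not the runtime's identifiers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an aggregated-container PTOParam: inputs,
// outputs, and scalars kept in parallel arrays behind builder methods.
// std::vector is used only for illustration of the layout idea.
struct PTOParamSketch {
    std::vector<uint64_t> input_addrs;
    std::vector<uint64_t> output_addrs;
    std::vector<uint64_t> scalars;

    void add_input(uint64_t addr)  { input_addrs.push_back(addr); }
    void add_output(uint64_t addr) { output_addrs.push_back(addr); }
    void add_scalar(uint64_t v)    { scalars.push_back(v); }

    std::size_t total_args() const {
        return input_addrs.size() + output_addrs.size() + scalars.size();
    }
};
```

Keeping the three categories in parallel arrays (rather than one tagged-union list) lets the scheduler and tensormap walk only the category they care about, which is consistent with the cache-locality motivation cited for #301 and #308.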
Activity
  • Implemented multi-ring buffer architecture
  • Added AICPU affinity gate
  • Enhanced profiling capabilities
  • Validated PTOParam construction
  • Optimized dispatch payload size
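The "separate local ready queue by CoreType" change (#255) amounts to partitioning ready tasks by the core type they target so AIC and AIV work can be dispatched independently. A minimal sketch, with illustrative names rather than the runtime's actual identifiers:

```cpp
#include <cstdint>
#include <deque>

// Hypothetical core-type enum matching the AIC/AIV split in this PR.
enum class CoreType : uint8_t { AIC, AIV };

// One ready queue per core type: the scheduler pops from the queue that
// matches the idle core's type instead of scanning a mixed queue.
struct ReadyQueues {
    std::deque<int> aic;  // task slot indices ready for AIC cores
    std::deque<int> aiv;  // task slot indices ready for AIV cores

    void push(CoreType t, int slot) {
        (t == CoreType::AIC ? aic : aiv).push_back(slot);
    }
    bool pop(CoreType t, int& slot) {
        auto& q = (t == CoreType::AIC ? aic : aiv);
        if (q.empty()) return false;
        slot = q.front();
        q.pop_front();
        return true;
    }
};
```

With mixed-task dispatch (#249) this keeps an idle AIV core from repeatedly skipping over AIC-only entries at the head of a shared queue.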

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a major architectural refactoring to support multi-ring buffers and cluster-based scheduling. The changes are extensive, touching the core runtime, orchestration API, and data structures to improve performance, resource management, and support for future hardware. While the refactoring is substantial and includes significant performance optimizations, I've identified a few critical issues that need to be addressed. Specifically, there's a potential deadlock in the paged attention kernels due to a self-wait synchronization pattern, and an incorrect cache flush operation that could lead to data corruption. There is also a minor configuration inconsistency that should be clarified.
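The self-wait pattern called out in this review (a kernel issuing wait_flag on an event its own instruction stream never set, which spins forever) can be illustrated with a small host-side tracker. This is a sketch of the bug class, not the device ISA; the event-ID model and set/wait pairing semantics are assumptions.

```cpp
#include <set>

// Illustrative host-side tracker for set_flag/wait_flag pairing. A
// wait_flag with no prior matching set_flag models the self-wait
// deadlock the review describes.
class FlagTracker {
    std::set<int> pending_;  // events set but not yet consumed by a wait
public:
    void set_flag(int event_id) { pending_.insert(event_id); }

    // Returns true if the wait can complete; false means nothing in this
    // stream has set the event, i.e. the wait would spin (potential deadlock).
    bool wait_flag(int event_id) {
        auto it = pending_.find(event_id);
        if (it == pending_.end()) return false;
        pending_.erase(it);
        return true;
    }
};
```

Running the kernel's set/wait sequence through such a tracker during review is one way to spot an unbalanced pair before it hangs on device.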

@ChaoZheng109 ChaoZheng109 changed the title A5/sync1 sync(a5): align tensormap_and_ringbuffer runtime with a2a3 (56a2c61..HEAD) Mar 18, 2026