
feat(runtime): double-buffered payload dispatch for AICore-AICPU pipeline#259

Draft
zhangqi-chen wants to merge 1 commit into hw-native-sys:main from zhangqi-chen:opt

Conversation

@zhangqi-chen
Contributor

@zhangqi-chen zhangqi-chen commented Mar 11, 2026

Summary

Two payload slots per core enable AICPU to pre-stage the next task while AICore is still executing, eliminating idle gaps between dispatches.

AICPU side

  • Double-buffered payloads: s_pto2_payload_per_core[core][2] with XOR-flip slot selection
  • 4-case completion state machine (A/B/C/D) handling all combinations of pending + running task FIN/ACK signals
  • Two-level cluster search: first pass finds fully-idle clusters, second pass finds pend-ready clusters (pending slot empty, core may be running)
  • ACK-wait guard: spin-waits in dispatch_subtask_to_core until AICore ACKs the current running task before overwriting hank->task and DATA_MAIN_BASE, preventing the race where AICore skips a task
  • Pending subslot tracking (pending_subslot_by_core_) for correct subtask_done_mask bit when promoting pending→running in Case B/D
  • complete_subtask() helper deduplicates completion logic across the four state machine cases
  • Profiling: saved running dispatch timestamp before pipeline overwrite; both running and pending perf records filled in Case A/B
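The XOR-flip slot selection described above can be sketched in isolation. This is a minimal illustrative model, not the real runtime code: `DispatchPayload`, `NUM_CORES`, and `stage_next` are stand-in names, and the real `s_pto2_payload_per_core` holds full dispatch payloads rather than a single task id.

```cpp
#include <cassert>
#include <cstdint>

constexpr int NUM_CORES = 4;

// Stand-in for the real dispatch payload structure.
struct DispatchPayload { int32_t task_id; };

// Two payload slots per core: while AICore is still reading slot s,
// AICPU can safely build the next task's payload in slot s ^ 1.
DispatchPayload payload_per_core[NUM_CORES][2];
int payload_slot[NUM_CORES] = {};

// Stage a task into the current slot, then XOR-flip so the next
// dispatch for this core lands in the other slot. Returns the slot
// that was written, so callers can see the alternation.
int stage_next(int core, int32_t task_id) {
    int slot = payload_slot[core];
    payload_per_core[core][slot].task_id = task_id;
    payload_slot[core] ^= 1;  // flip for the next dispatch
    return slot;
}
```

Because consecutive dispatches alternate slots, the payload for the still-running task is never overwritten while the next one is pre-staged.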

AICore side

  • FIN-skip protocol: after task execution, read DATA_MAIN_BASE to check if AICPU already dispatched a pending task; if pending exists, skip FIN — the next ACK implicitly signals completion
  • Per-dispatch hank->task read: invalidate full data cache and re-read payload address each dispatch (AICPU updates it per double-buffer slot)
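The FIN-skip decision reduces to a small predicate. A hedged sketch, assuming the convention visible in the dispatch code quoted later in this PR, where AICPU writes `task_id + 1` into DATA_MAIN_BASE when it pre-stages a pending task; the register read itself is abstracted to a plain argument here.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the FIN-skip decision on the AICore side. The argument is
// the value a read of DATA_MAIN_BASE would return after task execution.
bool should_send_fin(uint64_t data_main_base) {
    // AICPU writes task_id + 1 when pre-staging a pending task, so any
    // nonzero value means a pending task exists and the next ACK will
    // implicitly signal completion of the current task.
    return data_main_base == 0;
}
```

When no pending task was staged, AICore falls back to the original protocol and writes FIN as before.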

Testing

  • mixed_example passes 5/5 consecutive runs (exercises all 4 state machine cases)
  • All 12 simulation tests pass (./ci.sh -p a2a3sim)
  • Hardware tests (./ci.sh -p a2a3 -d 4-7 --parallel)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the efficiency of the AICore-AICPU pipeline by implementing a double-buffered payload dispatch system. This allows the AICPU to pre-stage the next task for an AICore while it is still processing the current one, thereby reducing idle time and improving overall throughput. The changes involve modifications to both AICore's task reception logic and AICPU's task management and dispatch mechanisms to support this pipelined execution.

Highlights

  • AICore Double-Buffering: AICore now reads from alternating payload slots and conditionally skips sending a FIN signal if a pending task is detected, allowing for continuous execution.
  • AICPU Task Management: AICPU tracks both currently executing and pending tasks per core, managing a 2-slot payload buffer and handling three distinct task completion scenarios.
  • Pipeline Dispatch: A new dispatch mechanism (dispatch_ready_tasks_to_running_cores) is introduced to proactively fill pending slots on cores that are still executing, improving pipeline utilization.
  • State Preservation: The new pending task and payload slot states are correctly preserved across thread reassignment operations.


Changelog
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
    • Updated payload pointer to an array and introduced a slot index.
    • Modified payload reading to use the current slot and invalidate cache.
    • Added logic to toggle the payload slot after kernel execution.
    • Implemented conditional FIN signal based on pending task detection via DATA_MAIN_BASE register.
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Expanded s_pto2_payload_per_core to a 2D array for double buffering.
    • Added pending_task_ids_ and payload_slot_ arrays to AicpuExecutor for per-core state tracking.
    • Introduced a complete_task helper function to centralize task completion logic.
    • Refactored check_running_cores_for_completion to handle three completion cases: pending FIN (both done), pending ACK (running implicitly done), and running FIN (no pending).
    • Modified dispatch_ready_tasks_to_idle_cores to use pending_task_ids and toggle the payload slot.
    • Added a new function dispatch_ready_tasks_to_running_cores to dispatch tasks to cores with available pending slots.
    • Updated handshake_all_cores to initialize the payload pointer to the first slot of the new array.
    • Initialized new state variables (pending_task_ids_, payload_slot_) in assign_cores_to_threads and deinit.
    • Ensured pending_task_ids_ and payload_slot_ are preserved during reassign_cores_for_all_threads.
    • Updated resolve_and_dispatch_pto2 to pass pending_task_ids and call the new pipeline dispatch function.
    • Adjusted diagnose_stuck_state to display pending task information and use the correct payload slot for debugging.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request implements a double-buffering mechanism for PTO2 task dispatch between AICPU and AICore, enabling a pending task to be dispatched while another is actively running. This involved updating payload data structures to support multiple slots, introducing pending_task_ids and payload_slot tracking, and enhancing task completion logic to manage both running and pending tasks. A new function, dispatch_ready_tasks_to_running_cores, was added to handle dispatching to already active cores. A review comment points out that this new function has significant code duplication with dispatch_ready_tasks_to_idle_cores, recommending refactoring the common dispatch logic into a shared helper function to improve maintainability.

Comment on lines +483 to +564
template <CoreType CT>
void dispatch_ready_tasks_to_running_cores(Runtime* runtime,
                                           int32_t thread_idx,
                                           CoreTypeTracker& ct,
                                           int32_t* pending_task_ids,
                                           bool& made_progress,
                                           PTO2TaskDescriptor* task_descriptors,
                                           PTO2TaskPayload* task_payloads,
                                           int32_t window_mask,
                                           PTO2LocalReadyBuffer* local_bufs
#if PTO2_PROFILING
                                           ,
                                           bool profiling_enabled,
                                           uint64_t& pop_hit,
                                           uint64_t& pop_miss,
                                           uint32_t& phase_dispatch_count
#endif
#if PTO2_SCHED_PROFILING
                                           ,
                                           uint64_t& sched_dispatch_pop_cycle,
                                           uint64_t& sched_dispatch_setup_cycle
#endif
) {
  if (ct.running_count > 0 && runtime->scheduler.ready_queues[static_cast<int32_t>(CT)].size() > 0) {
    for (int32_t i = ct.running_count - 1; i >= 0; i--) {
      int32_t core_id = ct.running[i];
      if (pending_task_ids[core_id] != AICPU_TASK_INVALID) continue;

#if PTO2_SCHED_PROFILING
      extern uint64_t g_sched_pop_atomic_count[], g_sched_pop_wait_cycle[];
      uint64_t t_pop_start = get_sys_cnt_aicpu();
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(
          local_bufs,
          g_sched_pop_atomic_count[thread_idx], g_sched_pop_wait_cycle[thread_idx]);
      sched_dispatch_pop_cycle += (get_sys_cnt_aicpu() - t_pop_start);
#else
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(local_bufs);
#endif
      if (task_id >= 0) {
#if PTO2_PROFILING
        pop_hit++;
        phase_dispatch_count++;
#endif
#if PTO2_SCHED_PROFILING
        uint64_t t_setup_start = get_sys_cnt_aicpu();
#endif
        PTO2TaskDescriptor* task = &task_descriptors[task_id & window_mask];
        PTO2TaskPayload* task_pl = &task_payloads[task_id & window_mask];
        int32_t slot = payload_slot_[thread_idx][core_id];
        PTO2DispatchPayload* payload = &s_pto2_payload_per_core[core_id][slot];
        build_pto2_payload<CT>(payload, runtime, task, task_pl);
        payload_slot_[thread_idx][core_id] ^= 1;
#if PTO2_PROFILING
        if (profiling_enabled) {
          dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
          if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
            perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
            core_dispatch_counts_[core_id] = 0;
          }
          core_dispatch_counts_[core_id]++;
        }
#endif
        write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(task_id + 1));
        pending_task_ids[core_id] = task_id;
        made_progress = true;
#if PTO2_SCHED_PROFILING
        sched_dispatch_setup_cycle += (get_sys_cnt_aicpu() - t_setup_start);
#endif
        DEV_DEBUG("Thread %d: Pipeline dispatch PTO2 task %d to %s core %d (pending)",
                  thread_idx, task_id, CT == CoreType::AIC ? "AIC" : "AIV", core_id);
      } else {
#if PTO2_PROFILING
        pop_miss++;
#endif
        break;
      }
    }
  }
}


Severity: medium

This new function dispatch_ready_tasks_to_running_cores is very similar to dispatch_ready_tasks_to_idle_cores. There's a large block of duplicated code for getting a ready task, building the payload, and dispatching it. This duplication could make future maintenance harder.

Consider refactoring the common logic into a helper function. This helper could handle fetching a task and dispatching it to a given core ID. The two dispatch_ready_tasks_to_*_cores functions would then just contain their specific looping and state transition logic (e.g., move_idle_to_running).
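One way the suggested refactor could look, modeled with simplified stand-in types. Everything here (`ToyRuntime`, `try_dispatch_one`, the `-1` empty-slot sentinel) is hypothetical and only illustrates the shape of the shared helper, not the actual runtime API:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stand-ins for the real runtime structures.
struct ToyRuntime {
    std::deque<int32_t> ready_queue;        // ready task ids
    std::vector<int32_t> pending_task_ids;  // per-core pending slot (-1 = empty)
    std::vector<int> payload_slot;          // per-core double-buffer slot
};

// Shared pop-build-dispatch helper used by both dispatch loops.
bool try_dispatch_one(ToyRuntime& rt, int32_t core_id) {
    if (rt.ready_queue.empty()) return false;
    int32_t task_id = rt.ready_queue.front();
    rt.ready_queue.pop_front();
    // (the real helper would build the payload into the current slot here)
    rt.payload_slot[core_id] ^= 1;
    rt.pending_task_ids[core_id] = task_id;
    return true;
}

// The running-core variant then keeps only its loop and skip condition;
// the idle-core variant would keep its move_idle_to_running transition.
int dispatch_to_running_cores(ToyRuntime& rt, const std::vector<int32_t>& running) {
    int dispatched = 0;
    for (int32_t core_id : running) {
        if (rt.pending_task_ids[core_id] != -1) continue;  // pending slot occupied
        if (!try_dispatch_one(rt, core_id)) break;         // ready queue drained
        ++dispatched;
    }
    return dispatched;
}
```

With the common sequence in one place, a future change to the payload-build or register-write protocol only needs to touch the helper.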

@zhangqi-chen force-pushed the opt branch 4 times, most recently from aaf5574 to 5818e39 on March 12, 2026 at 08:25
@hw-native-sys-bot
Collaborator

Rebase & Benchmark Report

Rebase onto main

Rebased the PR onto the latest main (after #281 slot_state refactor and #284 profiling macro isolation). Resolved all conflicts — the double-buffer logic has been adapted to the new PTO2TaskSlotState* API.

The rebased branch is available at: hw-native-sys-bot/simpler:pr-259-work

To pull the rebased version:

git remote add bot-fork git@github.com:hw-native-sys-bot/simpler.git  # if not already added
git fetch bot-fork pr-259-work
git reset --hard bot-fork/pr-259-work  # on your opt branch

Benchmark Results (Device 7, 10 rounds)

Example                   Base (us)   PR (us)   Delta
alternating_matmul_add      1016.8     1019.4   +0.3%
benchmark_bgemm              912.2      944.3   +3.5%
paged_attention_unroll      7751.4     7779.6   +0.4%
batch_paged_attention       4580.9     4668.0   +1.9%
paged_attention            56470.8    56469.7   -0.0%

No significant performance regression. All deltas are within noise range (< 4%).

The double-buffering infrastructure is in place. The benefit will appear with short-task workloads where the dispatch-to-start gap is a significant fraction of kernel execution time. Current benchmarks have kernels long enough that the gap is hidden.

Simulation Tests

All 12/12 simulation tests pass on the rebased branch.

Commit message:

Two payload slots per core enable AICPU to pre-stage the next task while
AICore is still executing, eliminating idle gaps between dispatches.

Key changes:

AICPU side (aicpu_executor.cpp):
- Double-buffered payload array: s_pto2_payload_per_core[core][2] with
  XOR-flip slot selection
- 4-case completion state machine (Case A/B/C/D) handling all
  combinations of pending + running task FIN/ACK signals
- Two-level cluster search: first pass finds fully-idle clusters, second
  pass finds pend-ready clusters (pending slot empty, core running)
- ACK-wait guard in dispatch_subtask_to_core: spin-waits until AICore
  ACKs the current running task before overwriting hank->task and
  DATA_MAIN_BASE, preventing the race where AICore skips a task
- Pending subslot tracking (pending_subslot_by_core_) for correct
  subtask_done_mask bit when promoting pending to running
- Extracted complete_subtask() helper to deduplicate completion logic
  across the four state machine cases
- Correct profiling: saved running dispatch timestamp before pipeline
  overwrite; both running and pending records filled in Case A/B

AICore side (aicore_executor.cpp):
- FIN-skip protocol: after task execution, read DATA_MAIN_BASE to check
  if AICPU already dispatched a pending task. If pending exists, skip
  FIN write — the next ACK implicitly signals completion
- Per-dispatch hank->task read: invalidate full data cache and re-read
  payload address each dispatch (AICPU updates it per slot)
@hw-native-sys-bot changed the title from "Add: double-buffered payload dispatch for AICore-AICPU pipeline" to "feat(runtime): double-buffered payload dispatch for AICore-AICPU pipeline" on Mar 18, 2026