
feat(runtime): double-buffered payload dispatch for AICore-AICPU pipeline#259

Draft
zhangqi-chen wants to merge 1 commit into hw-native-sys:main from zhangqi-chen:opt

Conversation

@zhangqi-chen
Contributor

@zhangqi-chen zhangqi-chen commented Mar 11, 2026

Summary

Two payload slots per core enable AICPU to pre-stage the next task while AICore is still executing, eliminating idle gaps between dispatches.

AICPU side

  • Double-buffered payloads: s_pto2_payload_per_core[core][2] with XOR-flip slot selection
  • 4-case completion state machine (A/B/C/D) handling all combinations of pending + running task FIN/ACK signals
  • Two-level cluster search: first pass finds fully-idle clusters, second pass finds pend-ready clusters (pending slot empty, core may be running)
  • ACK-wait guard: spin-waits in dispatch_subtask_to_core until AICore ACKs the current running task before overwriting hank->task and DATA_MAIN_BASE, preventing the race where AICore skips a task
  • Pending subslot tracking (pending_subslot_by_core_) for correct subtask_done_mask bit when promoting pending→running in Case B/D
  • complete_subtask() helper deduplicates completion logic across the four state machine cases
  • Profiling: saved running dispatch timestamp before pipeline overwrite; both running and pending perf records filled in Case A/B
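The XOR-flip slot selection described above can be sketched in isolation. This is a minimal illustrative model, not the real runtime code: `DispatchPayload`, `NUM_CORES`, and `stage_next` are stand-in names, and the real `s_pto2_payload_per_core` holds full dispatch payloads rather than a single task id.

```cpp
#include <cassert>
#include <cstdint>

constexpr int NUM_CORES = 4;

// Stand-in for the real dispatch payload structure.
struct DispatchPayload { int32_t task_id; };

// Two payload slots per core: while AICore is still reading slot s,
// AICPU can safely build the next task's payload in slot s ^ 1.
DispatchPayload payload_per_core[NUM_CORES][2];
int payload_slot[NUM_CORES] = {};

// Stage a task into the current slot, then XOR-flip so the next
// dispatch for this core lands in the other slot. Returns the slot
// that was written, so callers can see the alternation.
int stage_next(int core, int32_t task_id) {
    int slot = payload_slot[core];
    payload_per_core[core][slot].task_id = task_id;
    payload_slot[core] ^= 1;  // flip for the next dispatch
    return slot;
}
```

Because consecutive dispatches alternate slots, the payload for the still-running task is never overwritten while the next one is pre-staged.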

AICore side

  • FIN-skip protocol: after task execution, read DATA_MAIN_BASE to check if AICPU already dispatched a pending task; if pending exists, skip FIN — the next ACK implicitly signals completion
  • Per-dispatch hank->task read: invalidate full data cache and re-read payload address each dispatch (AICPU updates it per double-buffer slot)
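The FIN-skip decision reduces to a small predicate. A hedged sketch, assuming the convention visible in the dispatch code quoted later in this PR, where AICPU writes `task_id + 1` into DATA_MAIN_BASE when it pre-stages a pending task; the register read itself is abstracted to a plain argument here.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the FIN-skip decision on the AICore side. The argument is
// the value a read of DATA_MAIN_BASE would return after task execution.
bool should_send_fin(uint64_t data_main_base) {
    // AICPU writes task_id + 1 when pre-staging a pending task, so any
    // nonzero value means a pending task exists and the next ACK will
    // implicitly signal completion of the current task.
    return data_main_base == 0;
}
```

When no pending task was staged, AICore falls back to the original protocol and writes FIN as before.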

Testing

  • mixed_example passes 5/5 consecutive runs (exercises all 4 state machine cases)
  • All 12 simulation tests pass (./ci.sh -p a2a3sim)
  • Hardware tests (./ci.sh -p a2a3 -d 4-7 --parallel)

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the efficiency of the AICore-AICPU pipeline by implementing a double-buffered payload dispatch system. This allows the AICPU to pre-stage the next task for an AICore while it is still processing the current one, thereby reducing idle time and improving overall throughput. The changes involve modifications to both AICore's task reception logic and AICPU's task management and dispatch mechanisms to support this pipelined execution.

Highlights

  • AICore Double-Buffering: AICore now reads from alternating payload slots and conditionally skips sending a FIN signal if a pending task is detected, allowing for continuous execution.
  • AICPU Task Management: AICPU tracks both currently executing and pending tasks per core, managing a 2-slot payload buffer and handling three distinct task completion scenarios.
  • Pipeline Dispatch: A new dispatch mechanism (dispatch_ready_tasks_to_running_cores) is introduced to proactively fill pending slots on cores that are still executing, improving pipeline utilization.
  • State Preservation: The new pending task and payload slot states are correctly preserved across thread reassignment operations.


Changelog
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicore/aicore_executor.cpp
    • Updated payload pointer to an array and introduced a slot index.
    • Modified payload reading to use the current slot and invalidate cache.
    • Added logic to toggle the payload slot after kernel execution.
    • Implemented conditional FIN signal based on pending task detection via DATA_MAIN_BASE register.
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Expanded s_pto2_payload_per_core to a 2D array for double buffering.
    • Added pending_task_ids_ and payload_slot_ arrays to AicpuExecutor for per-core state tracking.
    • Introduced a complete_task helper function to centralize task completion logic.
    • Refactored check_running_cores_for_completion to handle three completion cases: pending FIN (both done), pending ACK (running implicitly done), and running FIN (no pending).
    • Modified dispatch_ready_tasks_to_idle_cores to use pending_task_ids and toggle the payload slot.
    • Added a new function dispatch_ready_tasks_to_running_cores to dispatch tasks to cores with available pending slots.
    • Updated handshake_all_cores to initialize the payload pointer to the first slot of the new array.
    • Initialized new state variables (pending_task_ids_, payload_slot_) in assign_cores_to_threads and deinit.
    • Ensured pending_task_ids_ and payload_slot_ are preserved during reassign_cores_for_all_threads.
    • Updated resolve_and_dispatch_pto2 to pass pending_task_ids and call the new pipeline dispatch function.
    • Adjusted diagnose_stuck_state to display pending task information and use the correct payload slot for debugging.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request implements a double-buffering mechanism for PTO2 task dispatch between AICPU and AICore, enabling a pending task to be dispatched while another is actively running. This involved updating payload data structures to support multiple slots, introducing pending_task_ids and payload_slot tracking, and enhancing task completion logic to manage both running and pending tasks. A new function, dispatch_ready_tasks_to_running_cores, was added to handle dispatching to already active cores. A review comment points out that this new function has significant code duplication with dispatch_ready_tasks_to_idle_cores, recommending refactoring the common dispatch logic into a shared helper function to improve maintainability.

Comment on lines +483 to +564
template <CoreType CT>
void dispatch_ready_tasks_to_running_cores(Runtime* runtime,
                                           int32_t thread_idx,
                                           CoreTypeTracker& ct,
                                           int32_t* pending_task_ids,
                                           bool& made_progress,
                                           PTO2TaskDescriptor* task_descriptors,
                                           PTO2TaskPayload* task_payloads,
                                           int32_t window_mask,
                                           PTO2LocalReadyBuffer* local_bufs
#if PTO2_PROFILING
                                           ,
                                           bool profiling_enabled,
                                           uint64_t& pop_hit,
                                           uint64_t& pop_miss,
                                           uint32_t& phase_dispatch_count
#endif
#if PTO2_SCHED_PROFILING
                                           ,
                                           uint64_t& sched_dispatch_pop_cycle,
                                           uint64_t& sched_dispatch_setup_cycle
#endif
) {
  if (ct.running_count > 0 && runtime->scheduler.ready_queues[static_cast<int32_t>(CT)].size() > 0) {
    for (int32_t i = ct.running_count - 1; i >= 0; i--) {
      int32_t core_id = ct.running[i];
      if (pending_task_ids[core_id] != AICPU_TASK_INVALID) continue;

#if PTO2_SCHED_PROFILING
      extern uint64_t g_sched_pop_atomic_count[], g_sched_pop_wait_cycle[];
      uint64_t t_pop_start = get_sys_cnt_aicpu();
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(
          local_bufs,
          g_sched_pop_atomic_count[thread_idx], g_sched_pop_wait_cycle[thread_idx]);
      sched_dispatch_pop_cycle += (get_sys_cnt_aicpu() - t_pop_start);
#else
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(local_bufs);
#endif
      if (task_id >= 0) {
#if PTO2_PROFILING
        pop_hit++;
        phase_dispatch_count++;
#endif
#if PTO2_SCHED_PROFILING
        uint64_t t_setup_start = get_sys_cnt_aicpu();
#endif
        PTO2TaskDescriptor* task = &task_descriptors[task_id & window_mask];
        PTO2TaskPayload* task_pl = &task_payloads[task_id & window_mask];
        int32_t slot = payload_slot_[thread_idx][core_id];
        PTO2DispatchPayload* payload = &s_pto2_payload_per_core[core_id][slot];
        build_pto2_payload<CT>(payload, runtime, task, task_pl);
        payload_slot_[thread_idx][core_id] ^= 1;
#if PTO2_PROFILING
        if (profiling_enabled) {
          dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
          if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
            perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
            core_dispatch_counts_[core_id] = 0;
          }
          core_dispatch_counts_[core_id]++;
        }
#endif
        write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(task_id + 1));
        pending_task_ids[core_id] = task_id;
        made_progress = true;
#if PTO2_SCHED_PROFILING
        sched_dispatch_setup_cycle += (get_sys_cnt_aicpu() - t_setup_start);
#endif
        DEV_DEBUG("Thread %d: Pipeline dispatch PTO2 task %d to %s core %d (pending)",
                  thread_idx, task_id, CT == CoreType::AIC ? "AIC" : "AIV", core_id);
      } else {
#if PTO2_PROFILING
        pop_miss++;
#endif
        break;
      }
    }
  }
}


Severity: medium

This new function dispatch_ready_tasks_to_running_cores is very similar to dispatch_ready_tasks_to_idle_cores. There's a large block of duplicated code for getting a ready task, building the payload, and dispatching it. This duplication could make future maintenance harder.

Consider refactoring the common logic into a helper function. This helper could handle fetching a task and dispatching it to a given core ID. The two dispatch_ready_tasks_to_*_cores functions would then just contain their specific looping and state transition logic (e.g., move_idle_to_running).
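One way the suggested refactor could look, modeled with simplified stand-in types. Everything here (`ToyRuntime`, `try_dispatch_one`, the `-1` empty-slot sentinel) is hypothetical and only illustrates the shape of the shared helper, not the actual runtime API:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stand-ins for the real runtime structures.
struct ToyRuntime {
    std::deque<int32_t> ready_queue;        // ready task ids
    std::vector<int32_t> pending_task_ids;  // per-core pending slot (-1 = empty)
    std::vector<int> payload_slot;          // per-core double-buffer slot
};

// Shared pop-build-dispatch helper used by both dispatch loops.
bool try_dispatch_one(ToyRuntime& rt, int32_t core_id) {
    if (rt.ready_queue.empty()) return false;
    int32_t task_id = rt.ready_queue.front();
    rt.ready_queue.pop_front();
    // (the real helper would build the payload into the current slot here)
    rt.payload_slot[core_id] ^= 1;
    rt.pending_task_ids[core_id] = task_id;
    return true;
}

// The running-core variant then keeps only its loop and skip condition;
// the idle-core variant would keep its move_idle_to_running transition.
int dispatch_to_running_cores(ToyRuntime& rt, const std::vector<int32_t>& running) {
    int dispatched = 0;
    for (int32_t core_id : running) {
        if (rt.pending_task_ids[core_id] != -1) continue;  // pending slot occupied
        if (!try_dispatch_one(rt, core_id)) break;         // ready queue drained
        ++dispatched;
    }
    return dispatched;
}
```

With the common sequence in one place, a future change to the payload-build or register-write protocol only needs to touch the helper.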

@zhangqi-chen force-pushed the opt branch 4 times, most recently from aaf5574 to 5818e39 on March 12, 2026 at 08:25
@hw-native-sys-bot
Collaborator

Rebase & Benchmark Report

Rebase onto main

Rebased the PR onto the latest main (after #281 slot_state refactor and #284 profiling macro isolation). Resolved all conflicts — the double-buffer logic has been adapted to the new PTO2TaskSlotState* API.

The rebased branch is available at: hw-native-sys-bot/simpler:pr-259-work

To pull the rebased version:

git remote add bot-fork git@github.com:hw-native-sys-bot/simpler.git  # if not already added
git fetch bot-fork pr-259-work
git reset --hard bot-fork/pr-259-work  # on your opt branch

Benchmark Results (Device 7, 10 rounds)

Example                   Base (us)   PR (us)   Delta
alternating_matmul_add      1016.8     1019.4   +0.3%
benchmark_bgemm              912.2      944.3   +3.5%
paged_attention_unroll      7751.4     7779.6   +0.4%
batch_paged_attention       4580.9     4668.0   +1.9%
paged_attention            56470.8    56469.7   -0.0%

No significant performance regression. All deltas are within noise range (< 4%).

The double-buffering infrastructure is in place. The benefit will appear with short-task workloads where the dispatch-to-start gap is a significant fraction of kernel execution time. Current benchmarks have kernels long enough that the gap is hidden.

Simulation Tests

All 12/12 simulation tests pass on the rebased branch.

Commit message:

Two payload slots per core enable AICPU to pre-stage the next task while
AICore is still executing, eliminating idle gaps between dispatches.

Key changes:

AICPU side (aicpu_executor.cpp):
- Double-buffered payload array: s_pto2_payload_per_core[core][2] with
  XOR-flip slot selection
- 4-case completion state machine (Case A/B/C/D) handling all
  combinations of pending + running task FIN/ACK signals
- Two-level cluster search: first pass finds fully-idle clusters, second
  pass finds pend-ready clusters (pending slot empty, core running)
- ACK-wait guard in dispatch_subtask_to_core: spin-waits until AICore
  ACKs the current running task before overwriting hank->task and
  DATA_MAIN_BASE, preventing the race where AICore skips a task
- Pending subslot tracking (pending_subslot_by_core_) for correct
  subtask_done_mask bit when promoting pending to running
- Extracted complete_subtask() helper to deduplicate completion logic
  across the four state machine cases
- Correct profiling: saved running dispatch timestamp before pipeline
  overwrite; both running and pending records filled in Case A/B

AICore side (aicore_executor.cpp):
- FIN-skip protocol: after task execution, read DATA_MAIN_BASE to check
  if AICPU already dispatched a pending task. If pending exists, skip
  FIN write — the next ACK implicitly signals completion
- Per-dispatch hank->task read: invalidate full data cache and re-read
  payload address each dispatch (AICPU updates it per slot)
@hw-native-sys-bot changed the title from "Add: double-buffered payload dispatch for AICore-AICPU pipeline" to "feat(runtime): double-buffered payload dispatch for AICore-AICPU pipeline" on Mar 18, 2026