feat(runtime): double-buffered payload dispatch for AICore-AICPU pipeline #259

zhangqi-chen wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of Changes (Gemini Code Assist)

This pull request enhances the efficiency of the AICore-AICPU pipeline by implementing a double-buffered payload dispatch system: the AICPU can pre-stage the next task for an AICore while the core is still processing the current one, reducing idle time and improving overall throughput. The changes touch both AICore's task reception logic and AICPU's task management and dispatch mechanisms to support pipelined execution.
Code Review
The pull request implements a double-buffering mechanism for PTO2 task dispatch between AICPU and AICore, enabling a pending task to be dispatched while another is actively running. This involved updating payload data structures to support multiple slots, introducing pending_task_ids and payload_slot tracking, and enhancing task completion logic to manage both running and pending tasks. A new function, dispatch_ready_tasks_to_running_cores, was added to handle dispatching to already active cores. A review comment points out that this new function has significant code duplication with dispatch_ready_tasks_to_idle_cores, recommending refactoring the common dispatch logic into a shared helper function to improve maintainability.
```cpp
template <CoreType CT>
void dispatch_ready_tasks_to_running_cores(Runtime* runtime,
                                           int32_t thread_idx,
                                           CoreTypeTracker& ct,
                                           int32_t* pending_task_ids,
                                           bool& made_progress,
                                           PTO2TaskDescriptor* task_descriptors,
                                           PTO2TaskPayload* task_payloads,
                                           int32_t window_mask,
                                           PTO2LocalReadyBuffer* local_bufs
#if PTO2_PROFILING
                                           ,
                                           bool profiling_enabled,
                                           uint64_t& pop_hit,
                                           uint64_t& pop_miss,
                                           uint32_t& phase_dispatch_count
#endif
#if PTO2_SCHED_PROFILING
                                           ,
                                           uint64_t& sched_dispatch_pop_cycle,
                                           uint64_t& sched_dispatch_setup_cycle
#endif
) {
  if (ct.running_count > 0 && runtime->scheduler.ready_queues[static_cast<int32_t>(CT)].size() > 0) {
    for (int32_t i = ct.running_count - 1; i >= 0; i--) {
      int32_t core_id = ct.running[i];
      // Skip cores that already have a pending task staged in the second slot.
      if (pending_task_ids[core_id] != AICPU_TASK_INVALID) continue;

#if PTO2_SCHED_PROFILING
      extern uint64_t g_sched_pop_atomic_count[], g_sched_pop_wait_cycle[];
      uint64_t t_pop_start = get_sys_cnt_aicpu();
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(
          local_bufs,
          g_sched_pop_atomic_count[thread_idx], g_sched_pop_wait_cycle[thread_idx]);
      sched_dispatch_pop_cycle += (get_sys_cnt_aicpu() - t_pop_start);
#else
      int32_t task_id = runtime->scheduler.get_ready_task<CT>(local_bufs);
#endif
      if (task_id >= 0) {
#if PTO2_PROFILING
        pop_hit++;
        phase_dispatch_count++;
#endif
#if PTO2_SCHED_PROFILING
        uint64_t t_setup_start = get_sys_cnt_aicpu();
#endif
        PTO2TaskDescriptor* task = &task_descriptors[task_id & window_mask];
        PTO2TaskPayload* task_pl = &task_payloads[task_id & window_mask];
        int32_t slot = payload_slot_[thread_idx][core_id];
        PTO2DispatchPayload* payload = &s_pto2_payload_per_core[core_id][slot];
        build_pto2_payload<CT>(payload, runtime, task, task_pl);
        payload_slot_[thread_idx][core_id] ^= 1;  // XOR-flip to the other slot
#if PTO2_PROFILING
        if (profiling_enabled) {
          dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
          if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
            perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
            core_dispatch_counts_[core_id] = 0;
          }
          core_dispatch_counts_[core_id]++;
        }
#endif
        write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(task_id + 1));
        pending_task_ids[core_id] = task_id;
        made_progress = true;
#if PTO2_SCHED_PROFILING
        sched_dispatch_setup_cycle += (get_sys_cnt_aicpu() - t_setup_start);
#endif
        DEV_DEBUG("Thread %d: Pipeline dispatch PTO2 task %d to %s core %d (pending)",
                  thread_idx,
                  task_id,
                  CT == CoreType::AIC ? "AIC" : "AIV",
                  core_id);
      } else {
#if PTO2_PROFILING
        pop_miss++;
#endif
        break;
      }
    }
  }
}
```
This new function `dispatch_ready_tasks_to_running_cores` is very similar to `dispatch_ready_tasks_to_idle_cores`. There is a large block of duplicated code for getting a ready task, building the payload, and dispatching it; this duplication could make future maintenance harder.

Consider refactoring the common logic into a helper function that fetches a task and dispatches it to a given core ID. The two `dispatch_ready_tasks_to_*_cores` functions would then contain only their specific looping and state-transition logic (e.g., `move_idle_to_running`).
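The suggested split can be sketched with a toy scheduler model rather than the real runtime types (`MiniRuntime`, `MiniScheduler`, and `try_dispatch_one_task` are hypothetical names; the real helper would also need the `<CT>` template parameter and the profiling references threaded through): the pop/build/dispatch sequence lives in one helper, and each `dispatch_ready_tasks_to_*_cores` variant keeps only its own loop and skip conditions.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

constexpr int32_t kTaskInvalid = -1;

// Stands in for the per-core-type ready queue in the real scheduler.
struct MiniScheduler {
  std::deque<int32_t> ready;
  int32_t pop() {
    if (ready.empty()) return kTaskInvalid;
    int32_t id = ready.front();
    ready.pop_front();
    return id;
  }
};

struct MiniRuntime {
  MiniScheduler scheduler;
  std::vector<int32_t> pending_task_ids;  // one pending slot per core
  std::vector<int32_t> payload_slot;      // double-buffer slot per core
};

// Shared helper: pop one ready task and stage it on `core_id`.
// Returns false when the ready queue is drained.
bool try_dispatch_one_task(MiniRuntime& rt, int32_t core_id) {
  int32_t task_id = rt.scheduler.pop();
  if (task_id == kTaskInvalid) return false;
  // build_pto2_payload(...) and the DATA_MAIN_BASE write would go here.
  rt.pending_task_ids[core_id] = task_id;
  rt.payload_slot[core_id] ^= 1;  // XOR-flip to the other payload slot
  return true;
}

// Caller keeps only its loop and state-transition logic.
void dispatch_to_running_cores(MiniRuntime& rt, const std::vector<int32_t>& running) {
  for (int32_t core_id : running) {
    if (rt.pending_task_ids[core_id] != kTaskInvalid) continue;  // already staged
    if (!try_dispatch_one_task(rt, core_id)) break;              // queue empty
  }
}
```

The idle-core variant would call the same helper inside its own loop and then perform its `move_idle_to_running` transition, so the duplicated dispatch body exists in exactly one place.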
Force-pushed aaf5574 to 5818e39
Rebase & Benchmark Report
| Example | Base (us) | PR (us) | Delta |
|---|---|---|---|
| alternating_matmul_add | 1016.8 | 1019.4 | +0.3% |
| benchmark_bgemm | 912.2 | 944.3 | +3.5% |
| paged_attention_unroll | 7751.4 | 7779.6 | +0.4% |
| batch_paged_attention | 4580.9 | 4668.0 | +1.9% |
| paged_attention | 56470.8 | 56469.7 | -0.0% |
No significant performance regression. All deltas are within noise range (< 4%).
The double-buffering infrastructure is in place. The benefit will appear with short-task workloads where the dispatch-to-start gap is a significant fraction of kernel execution time. Current benchmarks have kernels long enough that the gap is hidden.
Simulation Tests
All 12/12 simulation tests pass on the rebased branch.
Two payload slots per core enable AICPU to pre-stage the next task while AICore is still executing, eliminating idle gaps between dispatches.

Key changes:

AICPU side (aicpu_executor.cpp):
- Double-buffered payload array: `s_pto2_payload_per_core[core][2]` with XOR-flip slot selection
- 4-case completion state machine (Case A/B/C/D) handling all combinations of pending + running task FIN/ACK signals
- Two-level cluster search: first pass finds fully-idle clusters, second pass finds pend-ready clusters (pending slot empty, core running)
- ACK-wait guard in `dispatch_subtask_to_core`: spin-waits until AICore ACKs the current running task before overwriting `hank->task` and `DATA_MAIN_BASE`, preventing the race where AICore skips a task
- Pending subslot tracking (`pending_subslot_by_core_`) for correct `subtask_done_mask` bit when promoting pending to running
- Extracted `complete_subtask()` helper to deduplicate completion logic across the four state machine cases
- Correct profiling: saved running dispatch timestamp before pipeline overwrite; both running and pending records filled in Case A/B

AICore side (aicore_executor.cpp):
- FIN-skip protocol: after task execution, read `DATA_MAIN_BASE` to check if AICPU already dispatched a pending task. If pending exists, skip FIN write — the next ACK implicitly signals completion
- Per-dispatch `hank->task` read: invalidate full data cache and re-read payload address each dispatch (AICPU updates it per slot)
Summary

Two payload slots per core enable AICPU to pre-stage the next task while AICore is still executing, eliminating idle gaps between dispatches.

AICPU side

- Double-buffered payload array: `s_pto2_payload_per_core[core][2]` with XOR-flip slot selection
- ACK-wait guard: spin-wait in `dispatch_subtask_to_core` until AICore ACKs the current running task before overwriting `hank->task` and `DATA_MAIN_BASE`, preventing the race where AICore skips a task
- Pending subslot tracking (`pending_subslot_by_core_`) for the correct `subtask_done_mask` bit when promoting pending→running in Case B/D
- `complete_subtask()` helper deduplicates completion logic across the four state machine cases

AICore side

- FIN-skip protocol: after task execution, read `DATA_MAIN_BASE` to check if AICPU already dispatched a pending task; if pending exists, skip FIN — the next ACK implicitly signals completion
- Per-dispatch `hank->task` read: invalidate full data cache and re-read payload address each dispatch (AICPU updates it per double-buffer slot)

Testing

- `mixed_example` passes 5/5 consecutive runs (exercises all 4 state machine cases)
- Simulation: `./ci.sh -p a2a3sim`
- Hardware: `./ci.sh -p a2a3 -d 4-7 --parallel`
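The slot mechanics can be modeled in a few lines. This is a simplified sketch, not the shipped code; `CoreChannel` and `stage` are hypothetical names standing in for `s_pto2_payload_per_core` and the `payload_slot_` bookkeeping on the AICPU side.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Payload {
  int32_t task_id = -1;
};

// Per-core double buffer: while the core consumes one slot, the AICPU
// stages the next payload into the other.
struct CoreChannel {
  std::array<Payload, 2> slots;  // two payload slots per core
  int32_t write_slot = 0;        // slot the AICPU writes next

  // Stage the next task without touching the slot the core may still be
  // reading; returns the slot index that was written.
  int32_t stage(int32_t task_id) {
    int32_t used = write_slot;
    slots[used].task_id = task_id;
    write_slot ^= 1;  // XOR-flip for the next dispatch
    return used;
  }
};
```

Because consecutive dispatches alternate slots, the payload of a pre-staged pending task never overwrites the payload of the task the core is currently executing.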
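As a rough illustration of the AICore-side FIN-skip decision (a simplified sketch, not the shipped code; it assumes task ids increase monotonically, and uses the `task_id + 1` encoding that the dispatch path writes into `DATA_MAIN_BASE`, with 0 meaning no task dispatched):

```cpp
#include <cassert>
#include <cstdint>

// After finishing `finished_task_id`, the core checks whether the AICPU
// has already staged a newer task. If so, it skips the FIN write: the ACK
// it sends for the new task implicitly reports the old one as complete.
bool should_write_fin(uint64_t data_main_base, int32_t finished_task_id) {
  // Decode the most recently dispatched task id (register holds id + 1).
  int32_t latest_dispatched = static_cast<int32_t>(data_main_base) - 1;
  // A pending task exists iff a task newer than the finished one was staged.
  return latest_dispatched <= finished_task_id;
}
```

In this model, a core that finishes task 5 writes FIN when `DATA_MAIN_BASE` still encodes task 5, but skips it once the AICPU has staged task 6 into the other payload slot.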