[WIP] Perf+Refactor: optimize dispatch payload and split task ring into dual rings #280
zhusy54 wants to merge 1 commit into hw-native-sys:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the task dispatch and ring buffer management within the runtime system. The primary goal is to improve performance by moving dispatch payload construction to an earlier stage, and to improve memory efficiency and reduce back-pressure by introducing a dual-ring buffer architecture. These changes streamline the task lifecycle, allowing for more efficient resource allocation and reclamation, which is crucial for high-performance computing environments.
Activity
Code Review
This pull request introduces a significant performance-focused refactoring. It optimizes dispatch payload construction by moving it from dispatch time to submit time, and it splits the unified task ring into a dual-ring system (Main and Consumed) to reduce back-pressure. The changes are extensive and touch many files across the runtime, scheduler, and orchestrator. The implementation appears solid and consistent with the stated goals. The new RELEASED state and dual-watermark advancement logic are well-designed to improve resource reclamation. I have one suggestion to improve the clarity of a log message for better debuggability.
```diff
 if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
     LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
              (unsigned long)(runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE),
-             (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE));
+             (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
+             (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : 0));
 }
```
The log message for consumed_window can be misleading. When the PTO2_RING_CONSUMED_WINDOW environment variable is not set, runtime->pto2_consumed_window_size is 0, and the log will show consumed_window=0. However, the actual default value used by the runtime is 4 times the effective task window size. This discrepancy can be confusing when debugging performance or memory usage. The log message should reflect the effective default value that will be used.
```c
if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
    uint64_t eff_task_window = runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE;
    LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
             (unsigned long)eff_task_window,
             (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
             (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : eff_task_window * 4));
}
```

zhusy54 force-pushed from 1ad55e4 to 66f6d6f
Separate per-task data by lifecycle to reduce ring buffer back-pressure:

- Main Ring (freed at RELEASED): descriptor + payload (~3800B/task)
- Consumed Ring (freed at CONSUMED): task_state + fanout tracking (~64B/task)

Add RELEASED(4) and CONSUMED(5) states to extend the task state machine. Introduce dual-watermark advancement (last_task_released / last_task_consumed) with independent try-locks for each reclamation path. TensorMap validity now uses last_task_consumed; dep pool reclamation uses last_task_released. The consumed ring defaults to 4x the main ring capacity (configurable via the PTO2_RING_CONSUMED_WINDOW env var).
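The dual-watermark advancement with independent try-locks can be sketched as below. This is a minimal single-array model, not the PR's implementation: the `TaskRings` struct, `try_advance` helper, and `WINDOW` constant are hypothetical; only the `last_task_released` / `last_task_consumed` names and the RELEASED(4) / CONSUMED(5) state values come from the commit message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* State values follow the PR: RELEASED(4), CONSUMED(5). */
enum { TASK_PENDING = 0, TASK_RELEASED = 4, TASK_CONSUMED = 5 };

#define WINDOW 8  /* toy ring capacity */

typedef struct {
    _Atomic uint8_t state[WINDOW];        /* lifecycle state per task slot */
    _Atomic uint64_t last_task_released;  /* drives main-ring reclamation */
    _Atomic uint64_t last_task_consumed;  /* drives consumed-ring + heap reclamation */
    _Atomic bool released_lock;           /* independent try-lock per path */
    _Atomic bool consumed_lock;
} TaskRings;

/* Advance `mark` past every contiguous slot whose state has reached `target`.
 * If the try-lock is already held, another thread is advancing this watermark,
 * so we skip instead of blocking. */
static void try_advance(TaskRings *r, _Atomic uint64_t *mark,
                        _Atomic bool *lock, uint8_t target) {
    if (atomic_exchange(lock, true)) return;  /* try-lock failed: bail out */
    uint64_t m = atomic_load(mark);
    /* CONSUMED(5) >= RELEASED(4), so a consumed task also counts as released */
    while (m < WINDOW && atomic_load(&r->state[m]) >= target) m++;
    atomic_store(mark, m);
    atomic_store(lock, false);                /* release the try-lock */
}
```

Because CONSUMED compares above RELEASED, the consumed watermark can never overtake the released one, which matches the two-stage reclamation the commit describes.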
Summary
Key Changes
Dispatch Payload Optimization
- Build the `dispatch[PTO2_SUBTASK_SLOT_COUNT]` array in `PTO2TaskPayload` at submit time
- Pass the `func_id_to_addr` lookup table through `PTO2Runtime` to the orchestrators
- Reduce `dispatch_subtask_to_core()` to a single pointer assignment
- Remove the `s_pto2_payload_per_core` static array and `build_pto2_payload()`

Split Task Ring
- Add `RELEASED(4)` and `CONSUMED(5)` states to the task state machine
- Introduce `PTO2ConsumedRingEntry` (64B, cache-aligned) and `PTO2MainSlotState` (8B)
- `last_task_released` drives main ring reclamation; `last_task_consumed` drives consumed ring + heap reclamation
- TensorMap validity now uses `last_task_consumed`; dep pool reclamation uses `last_task_released`
- Consumed ring defaults to 4x main ring capacity (configurable via the `PTO2_RING_CONSUMED_WINDOW` env var)

Files Modified (19 files, +421/-272)
- `pto_runtime2_types.h` — New states, `PTO2ConsumedRingEntry`, `PTO2MainSlotState`
- `pto_scheduler.h/cpp` — Dual ring management, split advance functions
- `pto_orchestrator.cpp/h` — Submit-time payload build, dual ring access
- `pto_shared_memory.h/cpp` — `last_task_released`, `consumed_window_size` fields
- `pto_ring_buffer.h/cpp` — Rename `last_alive_ptr` to `last_released_ptr`
- `pto_tensormap.h/cpp` — Use `last_task_consumed` for entry validity
- `pto_runtime2.h/cpp` — `consumed_window_size` computation and propagation
- `runtime.h/cpp` — Profiling uses consumed ring for fanout traversal
- `aicpu_executor.cpp` — Simplified dispatch, consumed ring diagnostics
- `runtime_maker.cpp` — `PTO2_RING_CONSUMED_WINDOW` env var support
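As a sketch of how the `PTO2_RING_CONSUMED_WINDOW` override might be resolved: the helper name `resolve_consumed_window` is hypothetical and not from the PR; only the env var name and the 4x-of-main-window default come from the description above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: resolve the consumed-ring capacity.
 * PTO2_RING_CONSUMED_WINDOW, when set to a positive integer, overrides;
 * otherwise default to 4x the effective main task window. */
static uint64_t resolve_consumed_window(uint64_t eff_task_window) {
    const char *env = getenv("PTO2_RING_CONSUMED_WINDOW");
    if (env && env[0] != '\0') {
        uint64_t v = strtoull(env, NULL, 10);
        if (v > 0) return v;          /* explicit override wins */
    }
    return eff_task_window * 4;       /* default: 4x main ring capacity */
}
```

Resolving the default once, up front, also keeps the startup log accurate — the effective value can be printed instead of a raw 0, which is exactly the discrepancy the review comment above flags.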