
[WIP] Perf+Refactor: optimize dispatch payload and split task ring into dual rings #280

Open
zhusy54 wants to merge 1 commit into hw-native-sys:main from zhusy54:taskring

Conversation

@zhusy54 (Contributor) commented Mar 13, 2026

Summary

  • Move dispatch payload construction from AICPU dispatch time to orchestrator submit time, eliminating per-dispatch function address lookup and argument copying from the critical path
  • Split the unified task ring into Main Ring (descriptor + payload, ~3800B/task) and Consumed Ring (task_state + fanout tracking, ~64B/task), separating per-task data by lifecycle to reduce ring buffer back-pressure by ~234 MB

Key Changes

Dispatch Payload Optimization

  • Embed dispatch[PTO2_SUBTASK_SLOT_COUNT] array in PTO2TaskPayload
  • Pass func_id_to_addr lookup table through PTO2Runtime to orchestrators
  • Build per-slot dispatch payloads at submit time after output address allocation
  • Simplify dispatch_subtask_to_core() to a single pointer assignment
  • Remove s_pto2_payload_per_core static array and build_pto2_payload()
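
A minimal sketch of the idea behind these bullets, not the PR's actual code: the orchestrator resolves the kernel address and copies arguments once at submit time, so that dispatch reduces to storing a pointer. Every type, field, and constant below (`TaskPayload`, `DispatchPayload`, `Handshake`, `kSlotCount`, `kMaxArgs`, the `func_id` array, the `-1` inactive marker) is a simplified, hypothetical stand-in; only the names `dispatch`, `func_id_to_addr`, and `dispatch_subtask_to_core` are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; the real PTO2 structs are larger and differ in layout.
constexpr std::size_t kSlotCount = 3;  // stand-in for PTO2_SUBTASK_SLOT_COUNT
constexpr std::size_t kMaxArgs = 8;    // assumed argument limit for this sketch

struct DispatchPayload {
    std::uint64_t func_addr = 0;       // kernel entry address, resolved at submit time
    std::uint64_t args[kMaxArgs] = {};
    std::uint32_t num_args = 0;
};

struct TaskPayload {
    std::int32_t func_id[kSlotCount] = {-1, -1, -1};  // -1 marks an inactive slot
    std::uint64_t args[kMaxArgs] = {};
    std::uint32_t num_args = 0;
    DispatchPayload dispatch[kSlotCount];  // pre-built payload, one per subtask slot
};

struct Handshake {
    const DispatchPayload* task = nullptr;  // what the core reads when it starts work
};

// Submit time: do the address lookup and argument copy once per active slot.
inline void build_dispatch_payloads(TaskPayload& p, const std::uint64_t* func_id_to_addr) {
    assert(func_id_to_addr != nullptr);
    assert(p.num_args <= kMaxArgs);
    for (std::size_t slot = 0; slot < kSlotCount; ++slot) {
        if (p.func_id[slot] < 0) continue;  // skip inactive slots
        DispatchPayload& d = p.dispatch[slot];
        d.func_addr = func_id_to_addr[p.func_id[slot]];
        d.num_args = p.num_args;
        for (std::uint32_t i = 0; i < p.num_args; ++i) d.args[i] = p.args[i];
    }
}

// Dispatch time: nothing left to build, so this is a single pointer assignment.
inline void dispatch_subtask_to_core(Handshake& hs, const TaskPayload& p, std::size_t slot) {
    hs.task = &p.dispatch[slot];
}
```

The trade-off in this sketch is that each task carries one payload per slot for its whole main-ring lifetime, in exchange for removing the lookup and copy from the dispatch path.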

Split Task Ring

  • Add RELEASED(4) and CONSUMED(5) states to the task state machine
  • Introduce PTO2ConsumedRingEntry (64B, cache-aligned) and PTO2MainSlotState (8B)
  • Dual waterline advancement: last_task_released drives main ring reclamation, last_task_consumed drives consumed ring + heap reclamation
  • Independent try-locks for each reclamation path to reduce contention
  • TensorMap validity uses last_task_consumed; dep pool reclamation uses last_task_released
  • Consumed ring defaults to 4x main ring capacity (configurable via PTO2_RING_CONSUMED_WINDOW env var)
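
The dual waterline scheme can be sketched roughly as follows. This is an illustration under stated assumptions rather than the scheduler's real implementation: the class `DualWaterlines`, the `kPending` placeholder state, the single-producer `submit()`, and the use of an atomic flag as the try-lock are all hypothetical. It also assumes that each waterline is the first task ID that has not yet reached the corresponding state, and that the consumed window is a power of two.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Only RELEASED(4) and CONSUMED(5) come from the PR text; kPending is a
// placeholder that stands in for every earlier state.
enum TaskState : std::uint8_t { kPending = 0, kReleased = 4, kConsumed = 5 };

class DualWaterlines {
public:
    // consumed_window_size is assumed to be a power of two.
    explicit DualWaterlines(std::uint32_t consumed_window_size)
        : mask_(consumed_window_size - 1), state_(consumed_window_size) {}

    // Single-producer submit (assumption: one orchestrator thread allocates IDs).
    std::uint32_t submit() {
        const std::uint32_t id = submitted_.load(std::memory_order_relaxed);
        state_[id & mask_].store(kPending, std::memory_order_relaxed);
        submitted_.store(id + 1, std::memory_order_release);
        return id;
    }

    void set_state(std::uint32_t id, TaskState s) {
        state_[id & mask_].store(s, std::memory_order_release);
    }

    // Main ring reclamation follows tasks that are RELEASED or later.
    void advance_main_ring_pointers() { advance(main_lock_, last_task_released, kReleased); }
    // Consumed ring and heap reclamation follow tasks that are CONSUMED.
    void advance_consumed_ring_pointers() { advance(consumed_lock_, last_task_consumed, kConsumed); }

    std::atomic<std::uint32_t> last_task_released{0};
    std::atomic<std::uint32_t> last_task_consumed{0};

private:
    void advance(std::atomic<bool>& lock, std::atomic<std::uint32_t>& waterline,
                 TaskState min_state) {
        // Try-lock: if another thread is already advancing this waterline, give up.
        if (lock.exchange(true, std::memory_order_acquire)) return;
        std::uint32_t t = waterline.load(std::memory_order_relaxed);
        const std::uint32_t end = submitted_.load(std::memory_order_acquire);
        while (t != end && state_[t & mask_].load(std::memory_order_acquire) >= min_state) ++t;
        waterline.store(t, std::memory_order_release);
        lock.store(false, std::memory_order_release);
    }

    std::uint32_t mask_;
    std::vector<std::atomic<std::uint8_t>> state_;  // task_state lives in the consumed ring
    std::atomic<std::uint32_t> submitted_{0};
    std::atomic<bool> main_lock_{false};
    std::atomic<bool> consumed_lock_{false};
};
```

Because each path has its own lock in this sketch, a thread advancing the main ring never waits on one advancing the consumed ring, and `last_task_consumed` can never pass `last_task_released` since CONSUMED is a later state than RELEASED.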

Files Modified (19 files, +421/-272)

  • pto_runtime2_types.h: New states, PTO2ConsumedRingEntry, PTO2MainSlotState
  • pto_scheduler.h/cpp: Dual ring management, split advance functions
  • pto_orchestrator.cpp/h: Submit-time payload build, dual ring access
  • pto_shared_memory.h/cpp: last_task_released, consumed_window_size fields
  • pto_ring_buffer.h/cpp: Rename last_alive_ptr to last_released_ptr
  • pto_tensormap.h/cpp: Use last_task_consumed for entry validity
  • pto_runtime2.h/cpp: consumed_window_size computation and propagation
  • runtime.h/cpp: Profiling uses consumed ring for fanout traversal
  • aicpu_executor.cpp: Simplified dispatch, consumed ring diagnostics
  • runtime_maker.cpp: PTO2_RING_CONSUMED_WINDOW env var support

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the task dispatch and ring buffer management within the runtime system. The primary goal is to improve performance by optimizing the dispatch payload construction process, moving it to an earlier stage, and to enhance memory efficiency and reduce back-pressure by introducing a dual-ring buffer architecture. These changes streamline the task lifecycle, allowing for more efficient resource allocation and reclamation, which is crucial for high-performance computing environments.

Highlights

  • Dispatch Payload Optimization: Moved dispatch payload construction from AICPU dispatch time to orchestrator submit time, eliminating per-dispatch function address lookup and argument copying.
  • Dual Task Ring Implementation: Split the unified task ring into a Main Ring (for descriptors and payloads) and a Consumed Ring (for task state and fanout tracking), reducing memory footprint and back-pressure.
  • Enhanced Task State Machine: Introduced RELEASED and CONSUMED states to the task lifecycle, enabling more granular control over resource reclamation.
  • Optimized Resource Reclamation: Implemented dual waterline advancement with last_task_released for main ring reclamation and last_task_consumed for consumed ring and heap reclamation, using independent try-locks for efficiency.
  • Configurable Consumed Ring Capacity: Made the consumed ring capacity configurable, defaulting to 4x the main ring capacity.


Changelog
  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
    • Removed the static s_pto2_payload_per_core array and the build_pto2_payload function.
    • Simplified the dispatch_subtask_to_core function to directly use pre-built dispatch payloads from PTO2TaskPayload.
    • Updated initial handshake and deinitialization routines to reflect the removal of the per-core payload storage.
    • Adjusted profiling and task resolution logic to pass task_pl directly instead of task and task_pl.
  • src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
    • Updated the documentation for PTO2TaskPayload to include the dispatch array for pre-built payloads.
    • Revised the description of Phase 2 (Dispatch) to state that Handshake.task now points to the pre-built PTO2DispatchPayload.
    • Updated the PTO2DispatchPayload section to clarify that it's built by the orchestrator at submit time and embedded in PTO2TaskPayload, removing fields like mixed_task_id, subslot, kernel_id, and core_type.
    • Updated the kernel address lookup process to reflect that func_id_to_addr is passed to orchestrators at runtime creation and used at submit time.
  • src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
    • Added support for configuring pto2_consumed_window_size via the PTO2_RING_CONSUMED_WINDOW environment variable.
    • Updated the logging message for ring buffer overrides to include the new consumed_window_size.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
    • Updated comments to clarify that PTO2DispatchPayload is built at submit time by the orchestrator and embedded in PTO2TaskPayload, and that Handshake.task points to this embedded payload.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
    • Modified pto2_orchestrator_init to use last_task_released for the task ring and consumed_window_size for TensorMap initialization.
    • Added an assertion to pto2_submit_mixed_task to ensure func_id_to_addr is set.
    • Updated dep pool reclamation to use last_task_released and PTO2ConsumedRingEntry.
    • Updated task initialization to use PTO2ConsumedRingEntry and PTO2MainSlotState for managing task state and fanout/fanin.
    • Moved the packed_buffer_end field from PTO2TaskDescriptor to PTO2ConsumedRingEntry.
    • Added new logic to pto2_submit_mixed_task to build PTO2DispatchPayload for each active subtask slot at submit time, using func_id_to_addr.
    • Moved the dep_pool_mark to PTO2ConsumedRingEntry.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
    • Updated the comment for dep_pool_last_reclaimed to refer to last_task_released.
    • Added a func_id_to_addr member to PTO2OrchestratorState for kernel address lookup.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
    • Renamed the last_alive_ptr parameter and member to last_released_ptr in pto2_task_ring_init and related logic.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
    • Updated comments and variable names from last_task_alive to last_task_released throughout the PTO2TaskRing structure and its methods, reflecting the new dual ring terminology.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
    • Added logic to initialize consumed_window_size in the shared memory header during pto2_runtime_create_custom, defaulting to 4x task_window_size.
    • Modified pto2_scheduler_init calls to pass consumed_window_size.
    • Updated pto2_runtime_create_from_sm to accept and propagate func_id_to_addr to the orchestrators.
    • Added a fallback for consumed_window_size if not set by the host.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
    • Added a func_id_to_addr member to the PTO2Runtime struct for kernel address lookup.
    • Updated the pto2_runtime_create_from_sm function signature to include func_id_to_addr as a parameter.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
    • Introduced PTO2_TASK_RELEASED state (4) and updated PTO2_TASK_CONSUMED to 5, modifying the task state transition comments.
    • Defined new structs PTO2ConsumedRingEntry (for long-lived state) and PTO2MainSlotState (for short-lived fanin tracking).
    • Moved packed_buffer_end and dep_pool_mark from PTO2TaskDescriptor and PTO2TaskPayload respectively to PTO2ConsumedRingEntry.
    • Modified PTO2TaskPayload to contain an array of PTO2DispatchPayload for per-slot pre-built payloads.
    • Updated pto2_fanout_lock and pto2_fanout_unlock to operate on PTO2ConsumedRingEntry.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
    • Added PTO2_TASK_RELEASED to the pto2_task_state_name function.
    • Modified pto2_scheduler_init to accept consumed_window_size, allocate consumed_ring and main_slot_states arrays, and zero-initialize their fields.
    • Updated pto2_scheduler_destroy to deallocate both new arrays.
    • Updated pto2_scheduler_print_stats to log last_task_consumed and last_task_released.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
    • Updated the scheduler's responsibility comments to reflect dual ring management and the new RELEASED task state.
    • Modified PTO2SchedulerState to include last_task_consumed, last_task_released, consumed_window_size, consumed_window_mask, consumed_ring, main_slot_states, and main_ring_advance_lock.
    • Replaced get_slot_state_by_slot/task_id with get_consumed_entry and get_main_slot.
    • Split advance_ring_pointers into advance_consumed_ring_pointers and advance_main_ring_pointers.
    • Updated check_and_handle_consumed, release_producer, release_fanin_and_check_ready, init_task, on_mixed_task_complete, and on_task_release to use the new dual ring structures and RELEASED state.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.cpp
    • Initialized last_task_consumed, last_task_released, and consumed_window_size in pto2_sm_init_header.
    • Updated pto2_sm_print_layout and pto2_sm_validate to log and check the new last_task_consumed and last_task_released fields.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
    • Renamed last_task_alive to last_task_consumed and added last_task_released to PTO2SharedMemoryHeader.
    • Added consumed_window_size to the header.
    • Adjusted padding to accommodate new fields.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
    • Renamed last_task_alive to last_task_consumed in PTO2TensorMap initialization, statistics printing, sync_tensormap, and cleanup_retired functions, aligning with the new consumed ring logic.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
    • Updated comments and variable names from last_task_alive to last_task_consumed in PTO2TensorMap struct and its methods, reflecting that TensorMap validity is now driven by the consumed ring.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp
    • Added pto2_consumed_window_size initialization to the Runtime constructor.
    • Renamed pto2_slot_states_ptr_ to pto2_consumed_ring_ptr_ and updated its setter.
    • Modified complete_perf_records to use pto2_consumed_ring_ptr_ and consumed_window_mask for fanout traversal.
  • src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h
    • Added pto2_consumed_window_size member to the Runtime class.
    • Renamed pto2_slot_states_ptr_ to pto2_consumed_ring_ptr_ and updated its corresponding setter method.
Activity
  • No human activity has been recorded for this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance-focused refactoring. It optimizes dispatch payload construction by moving it from dispatch time to submit time, and it splits the unified task ring into a dual-ring system (Main and Consumed) to reduce back-pressure. The changes are extensive and touch many files across the runtime, scheduler, and orchestrator. The implementation appears solid and consistent with the stated goals. The new RELEASED state and dual-watermark advancement logic are well-designed to improve resource reclamation. I have one suggestion to improve the clarity of a log message for better debuggability.

Comment on lines +269 to 274
+        if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
+            LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
                      (unsigned long)(runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE),
-                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE));
+                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
+                     (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : 0));
         }


medium

The log message for consumed_window can be misleading. When the PTO2_RING_CONSUMED_WINDOW environment variable is not set, runtime->pto2_consumed_window_size is 0, and the log will show consumed_window=0. However, the actual default value used by the runtime is 4 times the effective task window size. This discrepancy can be confusing when debugging performance or memory usage. The log message should reflect the effective default value that will be used.

        if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
            uint64_t eff_task_window = runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE;
            LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
                     (unsigned long)eff_task_window,
                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
                     (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : eff_task_window * 4));
        }
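
One way to keep the log and the runtime from disagreeing is to compute the effective values in a single place and use that result for both. The following is only a sketch of that approach: the helper names, the `kDefaultTaskWindow` value, and the rule that an unset or unparsable environment variable yields 0 are assumptions made for illustration, not the project's actual behavior.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical default; stands in for PTO2_TASK_WINDOW_SIZE, whose real value is not shown here.
constexpr std::uint64_t kDefaultTaskWindow = 1024;

// An override of 0 means "not set", matching the convention described in the review comment.
inline std::uint64_t effective_task_window(std::uint64_t task_override) {
    return task_override ? task_override : kDefaultTaskWindow;
}

// The consumed window defaults to 4x the effective task window.
inline std::uint64_t effective_consumed_window(std::uint64_t consumed_override,
                                               std::uint64_t task_override) {
    return consumed_override ? consumed_override : 4 * effective_task_window(task_override);
}

// Reads a decimal override from the environment; returns 0 when the variable
// is unset, empty, or not a plain number (assumed policy for this sketch).
inline std::uint64_t read_window_override(const char* env_name) {
    const char* text = std::getenv(env_name);
    if (text == nullptr || *text == '\0') return 0;
    char* end = nullptr;
    const unsigned long long value = std::strtoull(text, &end, 10);
    return (*end == '\0') ? static_cast<std::uint64_t>(value) : 0;
}
```

With helpers like these, the log call and the shared-memory initialization could both call `effective_consumed_window`, so the printed value is the one actually used.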

@zhusy54 force-pushed the taskring branch 2 times, most recently from 1ad55e4 to 66f6d6f on March 13, 2026 06:56
Separate per-task data by lifecycle to reduce ring buffer back-pressure:
- Main Ring (freed at RELEASED): descriptor + payload (~3800B/task)
- Consumed Ring (freed at CONSUMED): task_state + fanout tracking (~64B/task)

Add RELEASED(4) and CONSUMED(5) states to extend the task state machine.
Introduce dual waterline advancement (last_task_released / last_task_consumed)
with independent try-locks for each reclamation path. TensorMap validity now
uses last_task_consumed; dep pool reclamation uses last_task_released.

Consumed ring defaults to 4x main ring capacity (configurable via
PTO2_RING_CONSUMED_WINDOW env var).
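
To illustrate why a larger consumed ring relieves pressure on the main ring, here is a small sketch of how one task ID could map into both rings and how allocation could be gated on the two waterlines. `RingGeometry` and `can_allocate` are hypothetical names, and the sketch assumes both capacities are powers of two and that each waterline is the oldest task ID still occupying that ring.

```cpp
#include <cstdint>

struct RingGeometry {
    std::uint32_t task_window_size;      // main ring capacity (power of two)
    std::uint32_t consumed_window_size;  // consumed ring capacity (power of two)

    // The same task ID indexes both rings, each with its own mask.
    std::uint32_t main_slot(std::uint32_t task_id) const {
        return task_id & (task_window_size - 1);
    }
    std::uint32_t consumed_slot(std::uint32_t task_id) const {
        return task_id & (consumed_window_size - 1);
    }
};

// A new task needs a free slot in both rings. The large main-ring slot only
// has to wait for RELEASED; the small consumed-ring entry waits for CONSUMED.
inline bool can_allocate(const RingGeometry& g, std::uint32_t next_task_id,
                         std::uint32_t last_task_released,
                         std::uint32_t last_task_consumed) {
    const bool main_has_room = next_task_id - last_task_released < g.task_window_size;
    const bool consumed_has_room = next_task_id - last_task_consumed < g.consumed_window_size;
    return main_has_room && consumed_has_room;
}
```

In this model, a task that is released but not yet consumed blocks only its small consumed-ring entry, so its main-ring slot can be reused by a later task while fanout tracking for the older task is still live.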