[WIP] Perf+Refactor: optimize dispatch payload and split task ring into dual rings by zhusy54 · Pull Request #280 · hw-native-sys/simpler

zhusy54 · 2026-03-13T01:00:55Z

Summary

Move dispatch payload construction from AICPU dispatch time to orchestrator submit time, eliminating per-dispatch function address lookup and argument copying from the critical path
Split the unified task ring into Main Ring (descriptor + payload, ~3800B/task) and Consumed Ring (task_state + fanout tracking, ~64B/task), separating per-task data by lifecycle to reduce ring buffer back-pressure by ~234 MB

Key Changes

Dispatch Payload Optimization

Embed dispatch[PTO2_SUBTASK_SLOT_COUNT] array in PTO2TaskPayload
Pass func_id_to_addr lookup table through PTO2Runtime to orchestrators
Build per-slot dispatch payloads at submit time after output address allocation
Simplify dispatch_subtask_to_core() to a single pointer assignment
Remove s_pto2_payload_per_core static array and build_pto2_payload()
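
A minimal sketch of the idea behind the list above, not the PR's actual code: the orchestrator resolves kernel addresses and copies arguments once at submit time, so dispatch reduces to storing a pointer. The struct layouts, `kSlotCount`, `kMaxArgs`, `kNoKernel`, and the `build_dispatch_payloads` helper are all assumptions made for illustration; only the names `func_id_to_addr` and `dispatch_subtask_to_core` and the `dispatch[]` array come from the description.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins; the real PTO2 structs carry more fields.
constexpr std::size_t kSlotCount = 3;  // assumed value of PTO2_SUBTASK_SLOT_COUNT
constexpr std::size_t kMaxArgs = 8;    // assumed argument capacity
constexpr int32_t kNoKernel = -1;      // hypothetical marker for an inactive slot

struct DispatchPayload {
    uint64_t function_addr = 0;
    uint64_t args[kMaxArgs] = {};
    uint32_t arg_count = 0;
};

struct TaskPayload {
    // One pre-built payload per subtask slot, embedded in the task payload.
    std::array<DispatchPayload, kSlotCount> dispatch{};
};

struct Handshake {
    const DispatchPayload* task = nullptr;
};

// Submit time: look up each active slot's kernel address and copy the
// arguments once, so nothing has to be rebuilt on the dispatch path.
inline void build_dispatch_payloads(TaskPayload& payload,
                                    const std::array<int32_t, kSlotCount>& func_ids,
                                    const uint64_t* func_id_to_addr,
                                    const uint64_t* args, uint32_t arg_count) {
    assert(func_id_to_addr != nullptr);
    assert(arg_count <= kMaxArgs);
    for (std::size_t slot = 0; slot < kSlotCount; ++slot) {
        if (func_ids[slot] == kNoKernel) continue;  // slot not used by this task
        DispatchPayload& d = payload.dispatch[slot];
        d.function_addr = func_id_to_addr[func_ids[slot]];
        for (uint32_t i = 0; i < arg_count; ++i) d.args[i] = args[i];
        d.arg_count = arg_count;
    }
}

// Dispatch time: a single pointer assignment into the handshake.
inline void dispatch_subtask_to_core(Handshake& hs, const TaskPayload& payload,
                                     std::size_t slot) {
    hs.task = &payload.dispatch[slot];
}
```

The trade-off in this sketch is a larger per-task payload in exchange for a dispatch path with no lookups or copies.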

Split Task Ring

Add RELEASED(4) and CONSUMED(5) states to the task state machine
Introduce PTO2ConsumedRingEntry (64B, cache-aligned) and PTO2MainSlotState (8B)
Dual waterline advancement: last_task_released drives main ring reclamation, last_task_consumed drives consumed ring + heap reclamation
Independent try-locks for each reclamation path to reduce contention
TensorMap validity uses last_task_consumed; dep pool reclamation uses last_task_released
Consumed ring defaults to 4x main ring capacity (configurable via PTO2_RING_CONSUMED_WINDOW env var)
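
The dual waterline scheme can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: only the RELEASED(4) and CONSUMED(5) values, the two waterline names, and the use of independent try-locks come from the description. The earlier state names, the `submitted` counter, the consumed-side lock name, and the rule that the consumed waterline never passes the released one are guesses.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Only kReleased (4) and kConsumed (5) are taken from the PR text;
// the earlier states and their values are assumed.
enum TaskState : uint8_t {
    kPending = 0, kReady = 1, kRunning = 2, kCompleted = 3,
    kReleased = 4, kConsumed = 5
};

struct DualWaterline {
    explicit DualWaterline(uint32_t consumed_window)
        : state(consumed_window), mask(consumed_window - 1) {
        // Masked indexing below assumes a power-of-two window.
        assert(consumed_window != 0 && (consumed_window & mask) == 0);
    }

    std::vector<std::atomic<uint8_t>> state;  // per-task state, indexed by id & mask
    uint32_t mask;
    std::atomic<int32_t> submitted{0};           // hypothetical: next task id to submit
    std::atomic<int32_t> last_task_released{0};  // drives main ring reclamation
    std::atomic<int32_t> last_task_consumed{0};  // drives consumed ring + heap reclamation
    std::mutex main_ring_advance_lock;
    std::mutex consumed_ring_advance_lock;       // hypothetical name

    // Advance past every leading task that has at least reached RELEASED.
    void advance_main_ring_pointers() {
        std::unique_lock<std::mutex> guard(main_ring_advance_lock, std::try_to_lock);
        if (!guard.owns_lock()) return;  // another thread is already advancing
        int32_t id = last_task_released.load(std::memory_order_relaxed);
        const int32_t end = submitted.load(std::memory_order_acquire);
        while (id < end &&
               state[static_cast<uint32_t>(id) & mask].load(std::memory_order_acquire) >= kReleased) {
            ++id;
        }
        last_task_released.store(id, std::memory_order_release);
    }

    // Advance past every leading task that is CONSUMED, bounded here by the
    // released waterline (an assumption of this sketch).
    void advance_consumed_ring_pointers() {
        std::unique_lock<std::mutex> guard(consumed_ring_advance_lock, std::try_to_lock);
        if (!guard.owns_lock()) return;
        int32_t id = last_task_consumed.load(std::memory_order_relaxed);
        const int32_t end = last_task_released.load(std::memory_order_acquire);
        while (id < end &&
               state[static_cast<uint32_t>(id) & mask].load(std::memory_order_acquire) == kConsumed) {
            ++id;
        }
        last_task_consumed.store(id, std::memory_order_release);
    }
};
```

Because each path takes its own try-lock, a thread reclaiming main ring slots never waits on one reclaiming consumed ring entries; a thread that loses the try-lock simply returns and leaves the advance to the current holder.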

Files Modified (19 files, +421/-272)

pto_runtime2_types.h — New states, PTO2ConsumedRingEntry, PTO2MainSlotState
pto_scheduler.h/cpp — Dual ring management, split advance functions
pto_orchestrator.cpp/h — Submit-time payload build, dual ring access
pto_shared_memory.h/cpp — last_task_released, consumed_window_size fields
pto_ring_buffer.h/cpp — Rename last_alive_ptr to last_released_ptr
pto_tensormap.h/cpp — Use last_task_consumed for entry validity
pto_runtime2.h/cpp — consumed_window_size computation and propagation
runtime.h/cpp — Profiling uses consumed ring for fanout traversal
aicpu_executor.cpp — Simplified dispatch, consumed ring diagnostics
runtime_maker.cpp — PTO2_RING_CONSUMED_WINDOW env var support

gemini-code-assist · 2026-03-13T01:01:42Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the task dispatch and ring buffer management within the runtime system. The primary goal is to improve performance by optimizing the dispatch payload construction process, moving it to an earlier stage, and to enhance memory efficiency and reduce back-pressure by introducing a dual-ring buffer architecture. These changes streamline the task lifecycle, allowing for more efficient resource allocation and reclamation, which is crucial for high-performance computing environments.

Highlights

Dispatch Payload Optimization: Moved dispatch payload construction from AICPU dispatch time to orchestrator submit time, eliminating per-dispatch function address lookup and argument copying.
Dual Task Ring Implementation: Split the unified task ring into a Main Ring (for descriptors and payloads) and a Consumed Ring (for task state and fanout tracking), reducing memory footprint and back-pressure.
Enhanced Task State Machine: Introduced RELEASED and CONSUMED states to the task lifecycle, enabling more granular control over resource reclamation.
Optimized Resource Reclamation: Implemented dual waterline advancement with last_task_released for main ring reclamation and last_task_consumed for consumed ring and heap reclamation, using independent try-locks for efficiency.
Configurable Consumed Ring Capacity: Made the consumed ring capacity configurable, defaulting to 4x the main ring capacity.

Changelog

src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
- Removed the static s_pto2_payload_per_core array and the build_pto2_payload function.
- Simplified the dispatch_subtask_to_core function to directly use pre-built dispatch payloads from PTO2TaskPayload.
- Updated initial handshake and deinitialization routines to reflect the removal of the per-core payload storage.
- Adjusted profiling and task resolution logic to pass task_pl directly instead of task and task_pl.
src/a2a3/runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md
- Updated the documentation for PTO2TaskPayload to include the dispatch array for pre-built payloads.
- Revised the description of Phase 2 (Dispatch) to state that Handshake.task now points to the pre-built PTO2DispatchPayload.
- Updated the PTO2DispatchPayload section to clarify that it's built by the orchestrator at submit time and embedded in PTO2TaskPayload, removing fields like mixed_task_id, subslot, kernel_id, and core_type.
- Updated the kernel address lookup process to reflect that func_id_to_addr is passed to orchestrators at runtime creation and used at submit time.
src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp
- Added support for configuring pto2_consumed_window_size via the PTO2_RING_CONSUMED_WINDOW environment variable.
- Updated the logging message for ring buffer overrides to include the new consumed_window_size.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
- Updated comments to clarify that PTO2DispatchPayload is built at submit time by the orchestrator and embedded in PTO2TaskPayload, and that Handshake.task points to this embedded payload.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
- Modified pto2_orchestrator_init to use last_task_released for the task ring and consumed_window_size for TensorMap initialization.
- Added an assertion to pto2_submit_mixed_task to ensure func_id_to_addr is set.
- Updated dep pool reclamation to use last_task_released and PTO2ConsumedRingEntry.
- Updated task initialization to use PTO2ConsumedRingEntry and PTO2MainSlotState for managing task state and fanout/fanin.
- Moved the packed_buffer_end field from PTO2TaskDescriptor to PTO2ConsumedRingEntry.
- Added new logic to pto2_submit_mixed_task to build PTO2DispatchPayload for each active subtask slot at submit time, using func_id_to_addr.
- Moved the dep_pool_mark to PTO2ConsumedRingEntry.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
- Updated the comment for dep_pool_last_reclaimed to refer to last_task_released.
- Added a func_id_to_addr member to PTO2OrchestratorState for kernel address lookup.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.cpp
- Renamed the last_alive_ptr parameter and member to last_released_ptr in pto2_task_ring_init and related logic.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_ring_buffer.h
- Updated comments and variable names from last_task_alive to last_task_released throughout the PTO2TaskRing structure and its methods, reflecting the new dual ring terminology.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
- Added logic to initialize consumed_window_size in the shared memory header during pto2_runtime_create_custom, defaulting to 4x task_window_size.
- Modified pto2_scheduler_init calls to pass consumed_window_size.
- Updated pto2_runtime_create_from_sm to accept and propagate func_id_to_addr to the orchestrators.
- Added a fallback for consumed_window_size if not set by the host.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
- Added a func_id_to_addr member to the PTO2Runtime struct for kernel address lookup.
- Updated the pto2_runtime_create_from_sm function signature to include func_id_to_addr as a parameter.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
- Introduced PTO2_TASK_RELEASED state (4) and updated PTO2_TASK_CONSUMED to 5, modifying the task state transition comments.
- Defined new structs PTO2ConsumedRingEntry (for long-lived state) and PTO2MainSlotState (for short-lived fanin tracking).
- Moved packed_buffer_end and dep_pool_mark from PTO2TaskDescriptor and PTO2TaskPayload respectively to PTO2ConsumedRingEntry.
- Modified PTO2TaskPayload to contain an array of PTO2DispatchPayload for per-slot pre-built payloads.
- Updated pto2_fanout_lock and pto2_fanout_unlock to operate on PTO2ConsumedRingEntry.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.cpp
- Added PTO2_TASK_RELEASED to the pto2_task_state_name function.
- Modified pto2_scheduler_init to accept consumed_window_size, allocate consumed_ring and main_slot_states arrays, and zero-initialize their fields.
- Updated pto2_scheduler_destroy to deallocate both new arrays.
- Updated pto2_scheduler_print_stats to log last_task_consumed and last_task_released.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
- Updated the scheduler's responsibility comments to reflect dual ring management and the new RELEASED task state.
- Modified PTO2SchedulerState to include last_task_consumed, last_task_released, consumed_window_size, consumed_window_mask, consumed_ring, main_slot_states, and main_ring_advance_lock.
- Replaced get_slot_state_by_slot/task_id with get_consumed_entry and get_main_slot.
- Split advance_ring_pointers into advance_consumed_ring_pointers and advance_main_ring_pointers.
- Updated check_and_handle_consumed, release_producer, release_fanin_and_check_ready, init_task, on_mixed_task_complete, and on_task_release to use the new dual ring structures and RELEASED state.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.cpp
- Initialized last_task_consumed, last_task_released, and consumed_window_size in pto2_sm_init_header.
- Updated pto2_sm_print_layout and pto2_sm_validate to log and check the new last_task_consumed and last_task_released fields.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_shared_memory.h
- Renamed last_task_alive to last_task_consumed and added last_task_released to PTO2SharedMemoryHeader.
- Added consumed_window_size to the header.
- Adjusted padding to accommodate new fields.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.cpp
- Renamed last_task_alive to last_task_consumed in PTO2TensorMap initialization, statistics printing, sync_tensormap, and cleanup_retired functions, aligning with the new consumed ring logic.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h
- Updated comments and variable names from last_task_alive to last_task_consumed in PTO2TensorMap struct and its methods, reflecting that TensorMap validity is now driven by the consumed ring.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.cpp
- Added pto2_consumed_window_size initialization to the Runtime constructor.
- Renamed pto2_slot_states_ptr_ to pto2_consumed_ring_ptr_ and updated its setter.
- Modified complete_perf_records to use pto2_consumed_ring_ptr_ and consumed_window_mask for fanout traversal.
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/runtime.h
- Added pto2_consumed_window_size member to the Runtime class.
- Renamed pto2_slot_states_ptr_ to pto2_consumed_ring_ptr_ and updated its corresponding setter method.
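
As a rough illustration of how the two rings might be addressed, the sketch below pairs a 64-byte, cache-line-aligned consumed entry with an 8-byte main slot state and looks both up by masking the task id. The field lists are hypothetical (only `packed_buffer_end` and `dep_pool_mark` are named in the changelog as moved into the consumed entry), and the power-of-two window sizes are an assumption suggested by the `consumed_window_mask` field rather than something stated in the PR.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Long-lived per-task state; fields other than packed_buffer_end and
// dep_pool_mark are hypothetical placeholders.
struct alignas(64) ConsumedRingEntry {
    std::atomic<uint8_t> task_state{0};
    std::atomic<int32_t> fanout_refcount{0};
    uint64_t packed_buffer_end = 0;
    int32_t dep_pool_mark = 0;
};
static_assert(sizeof(ConsumedRingEntry) == 64, "one entry per 64-byte cache line");

// Short-lived fanin tracking; both fields are hypothetical.
struct MainSlotState {
    std::atomic<int32_t> fanin_remaining{0};
    int32_t fanin_total = 0;
};
static_assert(sizeof(MainSlotState) == 8, "expected 8 bytes on common ABIs");

struct DualRings {
    // Note: std::vector honors the 64-byte alignment only from C++17 on.
    DualRings(uint32_t task_window, uint32_t consumed_window)
        : main_slot_states(task_window),
          consumed_ring(consumed_window),
          task_window_mask(task_window - 1),
          consumed_window_mask(consumed_window - 1) {
        assert(task_window != 0 && (task_window & task_window_mask) == 0);
        assert(consumed_window != 0 && (consumed_window & consumed_window_mask) == 0);
    }

    std::vector<MainSlotState> main_slot_states;
    std::vector<ConsumedRingEntry> consumed_ring;
    uint32_t task_window_mask;
    uint32_t consumed_window_mask;

    MainSlotState& get_main_slot(int32_t task_id) {
        return main_slot_states[static_cast<uint32_t>(task_id) & task_window_mask];
    }
    ConsumedRingEntry& get_consumed_entry(int32_t task_id) {
        return consumed_ring[static_cast<uint32_t>(task_id) & consumed_window_mask];
    }
};
```

With a main window of 4 and a consumed window of 16, task 5 reuses the main slot of task 1 while the two tasks still have distinct consumed entries, which is the property that lets the large main slots recycle early.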

Activity

No human activity has been recorded for this pull request yet.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a significant performance-focused refactoring. It optimizes dispatch payload construction by moving it from dispatch time to submit time, and it splits the unified task ring into a dual-ring system (Main and Consumed) to reduce back-pressure. The changes are extensive and touch many files across the runtime, scheduler, and orchestrator. The implementation appears solid and consistent with the stated goals. The new RELEASED state and dual-waterline advancement logic are well-designed to improve resource reclamation. I have one suggestion to improve the clarity of a log message for better debuggability.

gemini-code-assist · 2026-03-13T01:03:39Z

src/a2a3/runtime/tensormap_and_ringbuffer/host/runtime_maker.cpp

+        if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
+            LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
                     (unsigned long)(runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE),
-                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE));
+                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
+                     (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : 0));
        }


The log message for consumed_window can be misleading. When the PTO2_RING_CONSUMED_WINDOW environment variable is not set, runtime->pto2_consumed_window_size is 0, and the log will show consumed_window=0. However, the actual default value used by the runtime is 4 times the effective task window size. This discrepancy can be confusing when debugging performance or memory usage. The log message should reflect the effective default value that will be used.

        if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_consumed_window_size) {
            uint64_t eff_task_window = runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE;
            LOG_INFO("Ring buffer overrides: task_window=%lu heap=%lu consumed_window=%lu",
                     (unsigned long)eff_task_window,
                     (unsigned long)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
                     (unsigned long)(runtime->pto2_consumed_window_size ? runtime->pto2_consumed_window_size : eff_task_window * 4));
        }
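
One way to keep the logged value and the value the runtime actually uses from drifting apart is to compute the effective sizes in a single helper and call it from both places. The sketch below assumes exactly what the review comment describes (0 means "not overridden", and the consumed window defaults to 4x the effective task window). The helper names and `kDefaultTaskWindow` are hypothetical; the real default is `PTO2_TASK_WINDOW_SIZE`, whose value is not given in this thread.

```cpp
#include <cstdint>

// Placeholder standing in for PTO2_TASK_WINDOW_SIZE; the value is made up.
constexpr uint64_t kDefaultTaskWindow = 4096;
// Default ratio of consumed ring capacity to main ring capacity (from the PR text).
constexpr uint64_t kConsumedWindowFactor = 4;

// An override of 0 means "use the default".
constexpr uint64_t effective_task_window(uint64_t task_override) {
    return task_override ? task_override : kDefaultTaskWindow;
}

// Falls back to 4x the effective task window when no override is given.
constexpr uint64_t effective_consumed_window(uint64_t consumed_override,
                                             uint64_t task_override) {
    return consumed_override
               ? consumed_override
               : effective_task_window(task_override) * kConsumedWindowFactor;
}

// With only the task window overridden to 256, the consumed window follows it.
static_assert(effective_consumed_window(0, 256) == 1024,
              "default consumed window is 4x the effective task window");
```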

Separate per-task data by lifecycle to reduce ring buffer back-pressure:
- Main Ring (freed at RELEASED): descriptor + payload (~3800B/task)
- Consumed Ring (freed at CONSUMED): task_state + fanout tracking (~64B/task)

Add RELEASED(4) and CONSUMED(5) states to extend the task state machine. Introduce dual waterline advancement (last_task_released / last_task_consumed) with independent try-locks for each reclamation path. TensorMap validity now uses last_task_consumed; dep pool reclamation uses last_task_released. Consumed ring defaults to 4x main ring capacity (configurable via PTO2_RING_CONSUMED_WINDOW env var).

zhusy54 force-pushed the taskring branch 2 times, most recently from 1ad55e4 to 66f6d6f Compare March 13, 2026 06:56

zhusy54 force-pushed the taskring branch from 66f6d6f to 13da7a3 Compare March 13, 2026 07:59

zhusy54 wants to merge 1 commit into hw-native-sys:main from zhusy54:taskring
