Perf: precompute dispatch desc and use flat byte offset register encoding#326
zhusy54 wants to merge 2 commits into hw-native-sys:main
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the task dispatch mechanism to improve performance and robustness. By precomputing dispatch descriptors and adopting a more flexible flat byte offset register encoding, the system reduces overhead in the scheduler's critical path and removes previous limitations on task scaling. Additionally, it introduces a crucial safety check to prevent encoding overflows, enhancing system stability.
Code Review
This pull request introduces a significant performance optimization by pre-computing dispatch descriptors and adopting a more flexible flat byte offset register encoding. The changes are extensive, touching the core dispatch logic in the orchestrator, scheduler, and AICore. The addition of an overflow safety check is a great improvement for robustness. I've found a critical bug related to potential integer truncation in the new offset calculation logic, which could lead to incorrect task dispatch. I've provided comments with suggestions to fix this issue in both places it occurs.
src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto2_dispatch_payload.h
… encoding

Replace structured (slot_in_ring, ring_id, slot_idx) register encoding with a flat byte offset scheme. The new layout uses 28 bits for offset_field, enabling ~2GB of addressable dispatch descriptors and eliminating independent bit-width constraints on ring depth and task window size.

Add overflow safety check in dispatch_subtask_to_core that triggers emergency_shutdown with PTO2_ERROR_ENCODING_OVERFLOW on sentinel collision.
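The flat byte offset scheme described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names kOffsetBits, kAlignShift, kOffsetMask, encode_offset_field and decode_offset_field are hypothetical, and the 8-byte alignment (shift of 3) is an assumption chosen because 2^28 units of 8 bytes matches the stated ~2GB range. The real constants are PTO2_REG_OFFSET_SHIFT/MASK/ALIGN_SHIFT in pto2_dispatch_payload.h, whose values are not shown here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical constants for illustration only.
constexpr uint32_t kOffsetBits = 28;                        // width of offset_field (from the commit message)
constexpr uint32_t kAlignShift = 3;                         // ASSUMED: descriptors are 8-byte aligned
constexpr uint32_t kOffsetMask = (1u << kOffsetBits) - 1u;  // 0x0FFFFFFF

// Pack a descriptor's byte offset (relative to a single dispatch base)
// into the offset field. The caller is expected to have range-checked
// the offset before calling this.
inline uint32_t encode_offset_field(uint64_t desc_byte_offset) {
    assert((desc_byte_offset & ((1ull << kAlignShift) - 1)) == 0 &&
           "descriptor offset must be aligned");
    return static_cast<uint32_t>(desc_byte_offset >> kAlignShift) & kOffsetMask;
}

// Recover the byte offset on the consumer side; adding it to the cached
// dispatch base would give the descriptor address.
inline uint64_t decode_offset_field(uint32_t offset_field) {
    return static_cast<uint64_t>(offset_field & kOffsetMask) << kAlignShift;
}
```

Because the field carries one offset instead of three packed indices, ring depth and task window size no longer need their own bit budgets; only the total descriptor region has to fit in the encodable range.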
2a9b2e2 to ca41517
Perform the overflow safety check on the full-width uint64_t value before narrowing to uint32_t, preventing silent truncation from bypassing the sentinel-collision guard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
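To make the ordering issue concrete, here is a small sketch contrasting a check performed on the 64-bit value with one performed after narrowing. All names are hypothetical, the alignment shift of 3 is an assumption, and the sentinel is assumed to be the all-ones field value; the PR's actual guard lives in dispatch_subtask_to_core and is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kOffsetBits = 28;                            // from the commit message
constexpr uint32_t kAlignShift = 3;                             // ASSUMED alignment shift
constexpr uint64_t kSentinelField = (1ull << kOffsetBits) - 1;  // ASSUMED sentinel: all-ones field

// Check first, narrow second: the comparison sees every bit of the offset.
// Returns false when the value does not fit or would collide with the sentinel.
inline bool try_encode_offset(uint64_t desc_byte_offset, uint32_t* out_field) {
    assert(out_field != nullptr);
    const uint64_t wide = desc_byte_offset >> kAlignShift;  // still 64-bit
    if (wide >= kSentinelField) {
        return false;
    }
    *out_field = static_cast<uint32_t>(wide);
    return true;
}

// Narrow first, check second (the pattern the fix avoids): bits above
// bit 31 are discarded before the comparison, so a huge offset can
// wrap around to a small, valid-looking field value.
inline bool try_encode_offset_truncating(uint64_t desc_byte_offset, uint32_t* out_field) {
    assert(out_field != nullptr);
    const uint32_t narrow = static_cast<uint32_t>(desc_byte_offset >> kAlignShift);
    if (narrow >= kSentinelField) {
        return false;
    }
    *out_field = narrow;
    return true;
}
```

With an offset of 2^35 + 64, the shifted value is 2^32 + 8: the first function rejects it, while the truncating variant accepts it and silently encodes 8, which would point the consumer at the wrong descriptor.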
Summary
- Replace the structured (slot_in_ring, ring_id, slot_idx) register encoding with a flat byte offset scheme, removing independent bit-width constraints on ring depth and task window size
- Add an overflow safety check that triggers emergency_shutdown on sentinel collision

Key Changes
- pto2_dispatch_payload.h: New PTO2DispatchDesc struct with per-slot function_bin_addrs[] + unified args[]; flat byte offset encoding constants (PTO2_REG_OFFSET_SHIFT/MASK/ALIGN_SHIFT); simplified PTO2DispatchInitInfo from {dispatch_bases[4], payload_stride} to {dispatch_base}
- aicore_executor.cpp: Simplified init to cache single dispatch_base; decode uses offset_field + desc_byte_offset instead of slot_in_ring * payload_stride
- aicpu_executor.cpp: dispatch_subtask_to_core returns bool (false = fatal overflow); new dispatch_base_ and sm_header_ members; overflow triggers PTO2_ERROR_ENCODING_OVERFLOW → emergency_shutdown → clean exit
- pto_runtime2_types.h: Add PTO2_ERROR_ENCODING_OVERFLOW error code (101); update PTO2TaskPayload docstring
- pto_orchestrator.cpp/h: Build PTO2DispatchDesc at submit time with per-slot function addresses and unified args layout
- pto_shared_memory.h, runtime.h, build_config.py: Supporting changes for new dispatch descriptor and profiling infrastructure

Register Encoding
Max encodable byte offset ≈ 2GB (vs previous scheme limited by independent bit widths)
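The ≈ 2GB figure is consistent with a 28-bit offset field that counts 8-byte units. The sketch below only checks that arithmetic; the 8-byte granularity is inferred from the two stated numbers and is an assumption, not something confirmed by the PR text, and addressable_bytes is a hypothetical helper.

```cpp
#include <cstdint>

// Bytes reachable by a field of `field_bits` bits when each unit of the
// field stands for (1 << align_shift) bytes.
constexpr uint64_t addressable_bytes(uint32_t field_bits, uint32_t align_shift) {
    return (1ull << field_bits) << align_shift;
}

// 28-bit field, ASSUMED 8-byte alignment: 2^28 * 2^3 = 2^31 bytes = 2 GiB.
static_assert(addressable_bytes(28, 3) == 2ull * 1024 * 1024 * 1024,
              "28 bits of 8-byte units should span 2 GiB");

// Without any alignment shift, the same field would reach only 256 MiB.
static_assert(addressable_bytes(28, 0) == 256ull * 1024 * 1024,
              "28 bits of single bytes should span 256 MiB");
```

If one field value is reserved as a sentinel, the usable range is one unit short of the full span, which would explain the "≈" in the stated limit.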