
sync(a2a3): unify scheduling queues and optimize core allocation performance#327

Draft
Songyangyang18 wants to merge 1 commit into hw-native-sys:main from Songyangyang18:dismix

Conversation

@Songyangyang18

Key Changes

  1. Scheduler Queue Unification
  • Remove multi-resource shape classification (AIC_ONLY/AIV_X1/AIV_X2/AIC_AIV_X1/AIC_AIV_X2), simplifying to a single UNIFIED type
  • Merge 5 independent ready queues into 1 global queue, reducing queue-traversal and lock-contention overhead
  • Unify the local buffer structure, simplifying task enqueue logic
  2. Break Cluster Resource Constraints
  • Change AIC and AIV cores from cluster-bound to global resource pool management
  • Independently allocate the required AIC and AIV cores during task dispatch, no longer requiring them to be in the same cluster
  • Significantly improve resource utilization for mixed (AIC+AIV) tasks
  3. Core Allocation Performance Optimization
  • Add a take_last_idle() method, optimizing core allocation from O(n) to O(1)
  • Add a dispatch_from_idle_pool() fast-path function, avoiding double state updates
  • Fix a bug where a double decrement caused an incorrect idle core count

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the scheduling and resource allocation mechanisms to enhance performance and flexibility. By unifying scheduling queues and breaking traditional cluster-based resource constraints, it streamlines task dispatch and improves the utilization of AIC and AIV cores. The changes also introduce specific optimizations for core allocation, leading to a more efficient and robust system.

Highlights

  • Scheduler Queue Unification: Consolidated multiple resource shape classifications into a single UNIFIED type and merged five independent ready queues into one global queue, reducing overhead.
  • Resource Constraint Relaxation: Transformed AIC and AIV core management from cluster-bound to a global resource pool, enabling independent allocation and improving utilization for mixed tasks.
  • Core Allocation Optimization: Introduced take_last_idle() for O(1) core allocation and dispatch_from_idle_pool() for a fast path, while also correcting an idle core count bug.

@gemini-code-assist

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@writalnaie
Contributor

/gemini review

@gemini-code-assist

Code Review

This pull request successfully unifies the scheduling queues and optimizes core allocation performance, aligning with the stated objectives. The changes effectively remove multi-resource shape classification, merge independent ready queues into a single global queue, and introduce fast-path core allocation methods. The code refactoring for PTO2ResourceShape and related functions is clean and directly addresses the goal of simplifying the scheduling model. The introduction of take_last_idle() and dispatch_from_idle_pool() provides the intended performance benefits. Overall, the changes are well-implemented and improve the system's efficiency and resource utilization.

Comment on lines +476 to +477
    PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
        return ready_queues[0].pop();
    }


medium

The get_ready_task(PTO2ResourceShape) overload now takes a PTO2ResourceShape parameter but explicitly ignores it. This can be misleading for callers who might expect shape-based filtering. If this overload is no longer needed, consider removing it. If it's kept for backward compatibility, a comment explaining its behavior (e.g., "Legacy adapter, always returns from unified queue") would improve clarity.

    PTO2TaskSlotState* get_ready_task() {
        return ready_queues[0].pop();
    }

    // Legacy adapter, always returns from unified queue
    PTO2TaskSlotState* get_ready_task(PTO2ResourceShape) {
        return ready_queues[0].pop();
    }

Comment on lines +494 to +533
    void dispatch_from_idle_pool(
        Runtime* runtime, CoreStateTracker& tracker, int32_t* executing_reg_task_ids,
        CoreType core_type, PTO2TaskSlotState& slot_state,
        PTO2SubtaskSlot subslot
    #if PTO2_PROFILING
        , bool profiling_enabled, int32_t thread_idx
    #endif
    ) {
        CoreTypeTracker& ct = tracker.by_type[static_cast<int32_t>(core_type)];
        int32_t core_id = ct.take_last_idle();
        if (core_id < 0) return;

        PTO2DispatchPayload& payload = s_pto2_payload_per_core[core_id];
        PTO2TaskDescriptor& task = *slot_state.task;
        int32_t slot_idx = static_cast<int32_t>(subslot);
        build_pto2_payload(payload, task.kernel_id[slot_idx], *slot_state.payload);
        executing_subslot_by_core_[core_id] = subslot;
        executing_slot_state_by_core_[core_id] = &slot_state;
    #if PTO2_PROFILING
        if (profiling_enabled) {
            dispatch_timestamps_[core_id] = get_sys_cnt_aicpu();
            if (core_dispatch_counts_[core_id] >= PLATFORM_PROF_BUFFER_SIZE) {
                perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
                core_dispatch_counts_[core_id] = 0;
            }
            core_dispatch_counts_[core_id]++;
        }
    #endif
        dispatch_seq_by_core_[core_id]++;
        uint32_t reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
        while (reg_task_id == AICORE_IDLE_TASK_ID ||
               (reg_task_id + 1) == AICORE_EXIT_SIGNAL) {
            dispatch_seq_by_core_[core_id]++;
            reg_task_id = dispatch_seq_by_core_[core_id] & TASK_ID_MASK;
        }
        write_reg(core_id_to_reg_addr_[core_id], RegId::DATA_MAIN_BASE, static_cast<uint64_t>(reg_task_id));

        tracker.core_idle[core_id] = false;
        executing_reg_task_ids[core_id] = reg_task_id;
    }


medium

The dispatch_from_idle_pool method uses ct.take_last_idle() to select a core. While this provides an O(1) allocation, it always picks the last available core from the idle array. Depending on how cores are added to this array, this might not always be the most optimal choice in terms of cache locality or other performance characteristics if multiple idle cores are available. This is a design trade-off for simplicity and O(1) performance, but it's worth noting that a more sophisticated core selection strategy (e.g., based on recent usage or physical proximity) could potentially yield further performance gains in specific scenarios, though it would likely increase complexity.

@ChaoWao marked this pull request as draft on March 19, 2026 at 07:38