
Fix tensor dimension issues and refactor attention scope in Qwen3 prefill#22

Merged
xzhxzhxzh123 merged 2 commits into hw-native-sys:main from xzhxzhxzh123:main
Mar 19, 2026

Conversation


@xzhxzhxzh123 xzhxzhxzh123 commented Mar 19, 2026

  • Fix hidden_states slice dimension from 2D to 3D to match tensor shape
  • Remove unsupported valid_shape parameter from create_tensor calls
  • Add reshape after slice to adapt dimensions for downstream ops
  • Separate KV cache update loop from attention loop in Scope 2
  • Fix down_acc_3d usage in output assembly
  • Adjust MLP_OUT_CHUNK from 256 to 64
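The first three bullets can be illustrated with a small NumPy sketch (NumPy stands in for the project's `pl` tile API here; the `hidden_states` layout `[batch, tokens, hidden]` and the tile names `TOK_TILE`/`K_CHUNK` are taken from the diff, the concrete sizes are made up):

```python
import numpy as np

# hidden_states is rank-3: [batch, tokens, hidden]
B, TOK_TILE, K_CHUNK = 2, 4, 8
hidden_states = np.arange(B * TOK_TILE * K_CHUNK, dtype=np.float16).reshape(B, TOK_TILE, K_CHUNK)

b, p0, k0 = 1, 0, 0
# A 2D slice shape [TOK_TILE, K_CHUNK] does not match a rank-3 tensor;
# the fix slices with a rank-3 shape [1, TOK_TILE, K_CHUNK] instead...
x3 = hidden_states[b:b + 1, p0:p0 + TOK_TILE, k0:k0 + K_CHUNK]
assert x3.shape == (1, TOK_TILE, K_CHUNK)

# ...then reshapes the result down to the 2D tile that downstream ops expect.
x_chunk = x3.astype(np.float32).reshape(TOK_TILE, K_CHUNK)
assert x_chunk.shape == (TOK_TILE, K_CHUNK)
```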

Summary by CodeRabbit

  • New Features

    • Added support for the Ascend950 backend in example workflows.
  • Improvements

    • Optimized tensor processing and parallel execution for better performance.
    • Improved caching and attention update flow to reduce redundant work.
    • Refined attention masking and output assembly for more robust results.
    • Simplified API by removing the optional work_dir parameter.


coderabbitai bot commented Mar 19, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a3407eac-3493-41d9-98a6-8874b89a818c

📥 Commits

Reviewing files that changed from the base of the PR and between 4feb548 and e8d3dcb.

📒 Files selected for processing (1)
  • examples/qwen3/qwen3_32b_prefill.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/qwen3/qwen3_32b_prefill.py

📝 Walkthrough

Walkthrough

This PR updates the Qwen3 32B prefill example: reduces MLP output tiling granularity, changes batch-parallel and KV‑cache update scheduling, refactors tensor slicing/reshape patterns and valid_shape usages, adjusts attention masking and output assembly, and removes work_dir while switching the public compile_and_run backend to Ascend950.

Changes

Cohort / File(s): Qwen3 Prefill Example — examples/qwen3/qwen3_32b_prefill.py

Summary:

  • Reduced MLP_OUT_CHUNK from 256 to 64 (affects derived blocks).
  • Removed the explicit chunk=4 from the batch-parallel call.
  • Replaced 2D pl.slice(..., valid_shape=...) patterns with rank‑3 slices plus pl.reshape for the RMS/Q/K/V/residual paths.
  • Removed several valid_shape args when creating tile tensors.
  • Split the KV-cache update into a dedicated first parallel loop over NUM_KV_HEADS_CFG, followed by a separate query-attention loop.
  • Adjusted the attention scores_valid slice to valid_shape=[1, valid_len].
  • Added down_acc_3d = pl.reshape(down_acc, [1, TOK_TILE, Q_OUT_CHUNK]) before writing outputs.
  • The public compile_and_run signature no longer accepts work_dir and now uses BackendType.Ascend950 instead of BackendType.CCE.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With tiles cut small and loops rearranged,

I hop through slices neatly reshaped and changed,
KV caches first, then queries sing on cue,
Ascend950 hums — the prefill hops anew! 🥕✨

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title accurately summarizes the main changes: fixing tensor dimension issues (3D slice patterns, reshape operations, valid_shape removals) and refactoring the attention scope (separating the KV cache update from the attention loop).


@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses several critical issues and refactorings within the Qwen3 prefill model, primarily focusing on tensor dimension correctness and optimizing the attention mechanism. The changes ensure proper tensor shape handling throughout the computation, refine the attention scope for better clarity and stability, and update build configurations to target a specific backend, ultimately enhancing the model's robustness and execution on the intended platform.

Highlights

  • Tensor Dimension Correction: Fixed hidden_states slice dimensions from 2D to 3D and added reshape operations to correctly adapt tensor shapes for downstream computations, resolving potential dimension mismatches.
  • Removed Unsupported Parameter: The valid_shape parameter was removed from pl.create_tensor calls, streamlining tensor creation and potentially relying on implicit shape inference or other validation mechanisms.
  • Attention Mechanism Refactoring: The KV cache update loop was separated from the main attention computation loop within Scope 2, aiming to improve code clarity and address 'tensor view issues'.
  • MLP Configuration Adjustment: The MLP_OUT_CHUNK constant was adjusted from 256 to 64, which may impact performance and resource utilization for the MLP layer.
  • Output Assembly Fix: Corrected the usage of down_acc_3d in the output assembly by reshaping down_acc before casting and assembling it into the final out tensor.
  • Build and Backend Configuration Update: Removed the work_dir parameter from the compile_and_run function and explicitly set the backend type to BackendType.Ascend950, simplifying the build process and targeting a specific hardware platform.
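A minimal NumPy sketch of the output-assembly fix in the highlights above (the tile sizes and the rank-3 `out` layout are assumptions; the real code uses pl.reshape and pl.cast on device tiles):

```python
import numpy as np

TOK_TILE, Q_OUT_CHUNK = 4, 8
out = np.zeros((1, TOK_TILE, Q_OUT_CHUNK), dtype=np.float16)  # rank-3 output tile

# down_acc is the 2D FP32 accumulator produced by the MLP down projection.
down_acc = np.ones((TOK_TILE, Q_OUT_CHUNK), dtype=np.float32)

# Mirrors `down_acc_3d = pl.reshape(down_acc, [1, TOK_TILE, Q_OUT_CHUNK])`:
# lift the accumulator to rank 3 so it can be cast and written into `out`.
down_acc_3d = down_acc.reshape(1, TOK_TILE, Q_OUT_CHUNK)
out[0:1, :, :] = down_acc_3d.astype(np.float16)
assert out.shape == (1, TOK_TILE, Q_OUT_CHUNK)
```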

@xzhxzhxzh123 changed the title from "Fix tensor dimension issues and refactor attention scope in Qwen3 prefill" to "[Fix] tensor dimension issues and refactor attention scope in Qwen3 prefill" on Mar 19, 2026
@xzhxzhxzh123 changed the title from "[Fix] tensor dimension issues and refactor attention scope in Qwen3 prefill" to "[Fix] Fix tensor dimension issues and refactor attention scope in Qwen3 prefill" on Mar 19, 2026
@xzhxzhxzh123 changed the title from "[Fix] Fix tensor dimension issues and refactor attention scope in Qwen3 prefill" to "Fix tensor dimension issues and refactor attention scope in Qwen3 prefill" on Mar 19, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several fixes and refactorings to the Qwen3 prefill implementation. Key changes include correcting tensor dimensions for slicing operations, removing unsupported API parameters, and refactoring the attention scope for clarity and correctness. The changes appear to be solid improvements. I've identified one area of code duplication that could be refactored to improve maintainability.

Comment on lines +138 to +145
x_chunk = pl.reshape(
pl.cast(
    pl.slice(hidden_states, [1, TOK_TILE, K_CHUNK], [b, p0, k0],
            valid_shape=[1, valid_tok, K_CHUNK]),
    target_type=pl.FP32,
),
[TOK_TILE, K_CHUNK]
)


Severity: medium

This block of code to calculate x_chunk is duplicated three times in this function (here, at lines 117-124, and at lines 160-167). To improve maintainability and reduce redundancy, consider extracting this logic into a helper method.
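One possible shape for the suggested helper, sketched with NumPy stand-ins for the pl.slice/pl.cast/pl.reshape calls (the function name and signature are hypothetical; the real version would take valid_tok and call the pl API):

```python
import numpy as np

def load_x_chunk(hidden_states, b, p0, k0, tok_tile, k_chunk):
    """Slice a [1, tok_tile, k_chunk] tile, cast to FP32, flatten to 2D.

    NumPy stand-in for the repeated pl.reshape(pl.cast(pl.slice(...)))
    pattern flagged in the review.
    """
    tile = hidden_states[b:b + 1, p0:p0 + tok_tile, k0:k0 + k_chunk]
    return tile.astype(np.float32).reshape(tok_tile, k_chunk)

hs = np.ones((2, 16, 32), dtype=np.float16)
x_chunk = load_x_chunk(hs, 0, 0, 0, 8, 16)
assert x_chunk.shape == (8, 16) and x_chunk.dtype == np.float32
```

Calling this helper at all three sites would leave a single place to adjust if the slice or cast logic changes again.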


@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/qwen3/qwen3_32b_prefill.py (1)

433-443: ⚠️ Potential issue | 🟡 Minor

API inconsistency: work_dir parameter removed while other qwen3 modules retain it.

The work_dir: str | None = None parameter was removed from compile_and_run, but the related modules (qwen3_32b_decode.py at line 411 and qwen3_32b_training_forward_and_backward.py at line 931) still include this parameter. This creates an inconsistent API surface across the qwen3 examples.

Consider either:

  1. Retaining work_dir for consistency with sibling modules, or
  2. Removing work_dir from all qwen3 modules if it's no longer needed
🧹 Nitpick comments (2)
examples/qwen3/qwen3_32b_prefill.py (2)

160-167: Same indentation issue as above.

This block has the same inconsistent indentation pattern as lines 138-145.


138-145: Inconsistent indentation in multi-line expression.

The pl.cast( and its contents are not properly indented relative to pl.reshape(. While Python parses this correctly due to parentheses, it harms readability.

🔧 Suggested indentation fix
-                                x_chunk = pl.reshape(
-                                pl.cast(
-                                    pl.slice(hidden_states, [1, TOK_TILE, K_CHUNK], [b, p0, k0],
-                                            valid_shape=[1, valid_tok, K_CHUNK]),
+                                x_chunk = pl.reshape(
+                                    pl.cast(
+                                        pl.slice(hidden_states, [1, TOK_TILE, K_CHUNK], [b, p0, k0],
+                                                valid_shape=[1, valid_tok, K_CHUNK]),
                                     target_type=pl.FP32,
-                                ),
-                                [TOK_TILE, K_CHUNK]
-                            )
+                                    ),
+                                    [TOK_TILE, K_CHUNK]
+                                )

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ee21d379-0d7d-483c-b9c5-887467c864e8

📥 Commits

Reviewing files that changed from the base of the PR and between 4eacdc9 and 4feb548.

📒 Files selected for processing (1)
  • examples/qwen3/qwen3_32b_prefill.py

@xzhxzhxzh123 xzhxzhxzh123 merged commit 6b865d2 into hw-native-sys:main Mar 19, 2026
4 checks passed
