
use reshape in deepseek#19

Merged
zhangqi-chen merged 1 commit into hw-native-sys:main from Inspiron-st:dev
Mar 17, 2026

Conversation

@Inspiron-st (Collaborator) commented Mar 17, 2026

use reshape
restore large buffer

Summary by CodeRabbit

  • Refactor
    • Optimized tensor memory allocation and reshaping operations in the DeepSeek example for improved resource efficiency.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the deepseek_v3_2_decode_front_layer function by adjusting memory allocation strategies and ensuring correct tensor dimensions. It moves several tensor creations to a broader scope, which aligns with restoring larger buffer sizes, and applies explicit reshaping to sliced tensors to maintain compatibility and correctness for subsequent computations.

Highlights

  • Memory Management: Tensor allocations for qr, q_proj, kv_a, and attn_front were moved outside pl.auto_incore() blocks, potentially restoring larger buffer allocations and optimizing memory usage.
  • Tensor Reshaping: Explicit pl.reshape operations were introduced after pl.slice calls for w_q_nope_to_latent and w_latent_to_v tensors to correct their shapes, likely removing singleton dimensions introduced by slicing.
Changelog
  • examples/deepseek_v3_2_decode_front.py
    • Adjusted tensor allocation scope for qr, q_proj, kv_a, and attn_front by moving them out of pl.auto_incore() blocks.
    • Applied pl.reshape to the output of pl.slice for w_q_nope_to_latent and w_latent_to_v to ensure correct tensor dimensions.
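The slice-then-reshape pattern described above can be illustrated with NumPy. This is an analogy only — the exact slicing semantics of the `pl` API are not shown in this PR, and the shapes below are hypothetical stand-ins for the config constants (e.g. num_heads=4, qk_nope_head_dim=8, kv_lora_rank=16). The point is that slicing a 3-D weight with a 1-element leading dimension keeps rank 3, and a reshape then drops the singleton so the tile becomes a valid 2-D matmul operand:

```python
import numpy as np

# Hypothetical 3-D weight: [num_heads, qk_nope_head_dim, kv_lora_rank].
w_q_nope_to_latent = np.arange(4 * 8 * 16, dtype=np.float32).reshape(4, 8, 16)

h = 2  # head index being sliced out
# Slicing with a 1-element leading dimension keeps the result rank-3.
tile_3d = w_q_nope_to_latent[h:h + 1, :, :]
assert tile_3d.shape == (1, 8, 16)

# Reshape away the singleton so the tile is a 2-D matmul operand.
tile_2d = tile_3d.reshape(8, 16)
q_nope = np.ones((1, 8), dtype=np.float32)  # [batch, qk_nope_head_dim]
q_nope_latent = q_nope @ tile_2d
assert q_nope_latent.shape == (1, 16)
```

Without the reshape, the rank-3 tile would not match the 2-D operand shape the matrix multiply expects — which is the mismatch the PR's explicit `pl.reshape` calls guard against.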
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review — /gemini review: Performs a code review for the current pull request in its current state.
  • Pull Request Summary — /gemini summary: Provides a summary of the current pull request in its current state.
  • Comment — @gemini-code-assist: Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help — /gemini help: Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Optimizes tensor memory allocation in the DeepSeek V3.2 decode example by preallocating RMSNorm projection tensors (qr, q_proj, kv_a) and attn_front outside their respective inner scopes. Refactors multi-dimensional slice operations to use 1-element-leading dimensions followed by reshape operations for consistent slicing semantics.

Changes

  • Tensor Allocation & Slicing Optimization — examples/deepseek_v3_2_decode_front.py: Preallocates RMSNorm projection tensors and attn_front outside auto_incore scopes; replaces direct multi-dimensional slicing with 1-element-leading-dimension slicing followed by reshape operations for the w_q_nope_to_latent and w_latent_to_v tensors; adjusts wv_tile reshape patterns accordingly.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With whiskers twitch and eager paws,
I hop through tensors without flaws,
Preallocate and reshape with care,
Each slice now flows through padded air,
Optimization hops declare! ✨

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title Check — ❓ Inconclusive: The title 'use reshape in deepseek' is partially related to the changeset, referring to a real aspect of the changes (introducing reshape operations), but doesn't capture the main objective of the preallocation optimization mentioned in the PR description. Resolution: consider a more specific title that captures the primary intent, such as 'Optimize deepseek tensor allocation with preallocation and reshape' or 'Refactor deepseek_v3_2 with tensor preallocation strategy'.

✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces the reshape operation in the deepseek_v3_2_decode_front.py file to adjust tensor dimensions before matrix multiplication, and also restores a large buffer. The reshape operation is used to ensure that the dimensions of the sliced tensor are compatible with the matmul operation. The buffer restoration involves moving tensor creation outside the auto_incore scope to potentially improve performance by reducing memory allocation overhead within the scope.

Comment on lines +156 to 159:

```python
qr = pl.create_tensor([BATCH_CFG, Q_LORA_RANK_CFG], dtype=pl.BF16)
q_proj = pl.create_tensor([BATCH_CFG, NUM_HEADS_CFG * QK_HEAD_DIM_CFG], dtype=pl.BF16)
kv_a = pl.create_tensor([BATCH_CFG, KV_A_OUT], dtype=pl.BF16)
with pl.auto_incore():
```

Severity: high

Moving the tensor creation outside the `with pl.auto_incore():` block could lead to performance improvements by reducing memory allocation overhead within the incore scope. However, it's crucial to ensure that these tensors are still properly managed and accessible within the incore scope.

Double-check that this change doesn't introduce any data consistency issues or unexpected behavior due to the tensors being created outside the scope.

Suggested change:

```python
# Scope 1: input RMSNorm + Q/K/V projection.
qr = pl.create_tensor([BATCH_CFG, Q_LORA_RANK_CFG], dtype=pl.BF16)
q_proj = pl.create_tensor([BATCH_CFG, NUM_HEADS_CFG * QK_HEAD_DIM_CFG], dtype=pl.BF16)
kv_a = pl.create_tensor([BATCH_CFG, KV_A_OUT], dtype=pl.BF16)
with pl.auto_incore():
```

Comment on lines +229 to 231:

```python
attn_front = pl.create_tensor([BATCH_CFG, ATTN_OUT_CFG], dtype=pl.FP32)
with pl.auto_incore():
    layer_id = pl.tensor.read(layer_id_t, [0])
```

Severity: high

Moving the tensor creation outside the `with pl.auto_incore():` block could lead to performance improvements by reducing memory allocation overhead within the incore scope. However, it's crucial to ensure that this tensor is still properly managed and accessible within the incore scope.

Double-check that this change doesn't introduce any data consistency issues or unexpected behavior due to the tensor being created outside the scope.

Surrounding context from the diff:

```python
            # - C: sparse attention consumes merged topk immediately
            # This avoids materializing topk intermediates across kernel boundaries.
            attn_front = pl.create_tensor([BATCH_CFG, ATTN_OUT_CFG], dtype=pl.FP32)
            with pl.auto_incore():
```

Comment on lines 297 to +299:

```diff
 q0_nope_latent = pl.matmul(
     pl.cast(q_nope0, target_type=pl.BF16),
-    pl.slice(w_q_nope_to_latent, [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [0, 0, 0]),
+    pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [0, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),
```

Severity: medium

The reshape operation is used here to ensure that the dimensions of the sliced tensor are compatible with the matmul operation. It is crucial to verify that the dimensions specified in reshape are correct and consistent with the expected input shape of matmul to avoid runtime errors or incorrect results.

Consider adding an inline comment to explain the purpose of the reshape and the expected dimensions.

Suggested change:

```python
q0_nope_latent = pl.matmul(
    pl.cast(q_nope0, target_type=pl.BF16),
    pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [0, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),  # Reshape to [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG] for matmul compatibility
)
```

Comment on lines 405 to +407:

```diff
 q_nope_latent = pl.matmul(
     pl.cast(q_nope, target_type=pl.BF16),
-    pl.slice(w_q_nope_to_latent, [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]),
+    pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),
```

Severity: medium

Similar to the previous reshape operation, it's important to ensure that the dimensions specified in this reshape are correct and consistent with the expected input shape of matmul. A mismatch in dimensions can lead to runtime errors or incorrect results.

Consider adding an inline comment to explain the purpose of the reshape and the expected dimensions.

Suggested change:

```python
q_nope_latent = pl.matmul(
    pl.cast(q_nope, target_type=pl.BF16),
    pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),  # Reshape to [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG] for matmul compatibility
)
```

Comment on lines +447 to +448:

```python
wv_tile = pl.slice(w_latent_to_v, [1, KV_LORA_RANK_CFG, V_OUT_CHUNK], [h, 0, v0])
wv_tile = pl.reshape(wv_tile, [KV_LORA_RANK_CFG, V_OUT_CHUNK])
```

Severity: medium

The reshape operation is used here to ensure that the dimensions of the sliced tensor are compatible with the matmul operation. It is crucial to verify that the dimensions specified in reshape are correct and consistent with the expected input shape of matmul to avoid runtime errors or incorrect results.

Consider adding an inline comment to explain the purpose of the reshape and the expected dimensions.

Suggested change:

```python
wv_tile = pl.slice(w_latent_to_v, [1, KV_LORA_RANK_CFG, V_OUT_CHUNK], [h, 0, v0])
wv_tile = pl.reshape(wv_tile, [KV_LORA_RANK_CFG, V_OUT_CHUNK])  # Reshape to [KV_LORA_RANK_CFG, V_OUT_CHUNK] for matmul compatibility
```

@coderabbitai bot left a comment

🧹 Nitpick comments (1)
examples/deepseek_v3_2_decode_front.py (1)

297-300: Inconsistency with prefill implementation and codebase style guidance.

The decode implementation uses pl.reshape(pl.slice(...)) while the prefill implementation (lines 283-285 of deepseek_v3_2_prefill_front.py) uses direct 2D slicing. The codebase documentation in type_layout.py and the recommendations in qwen3-32b.py ("avoid unnecessary reshape; prefer direct view slicing") suggest consistency with the prefill approach. Since pl.reshape is metadata-only (zero-copy), this is a style preference rather than a performance concern, but aligning with the existing pattern would improve clarity.
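The zero-copy claim in the comment above can be checked concretely in NumPy, which has the view semantics the reviewer alludes to. This is an illustration under that analogy, not the `pl` implementation: reshaping a contiguous slice produces a view over the same buffer, so no data is copied and the choice between reshape-after-slice and a direct 2-D view is stylistic.

```python
import numpy as np

# Stand-in weight tensor: [num_heads, kv_lora_rank, v_head_dim] shapes are hypothetical.
w = np.zeros((4, 8, 16), dtype=np.float32)

# Rank-3 slice followed by reshape: both steps return views, no data copy.
tile = w[1:2, :, :].reshape(8, 16)
assert np.shares_memory(tile, w)

# Writing through the view is visible in the original weight tensor.
tile[0, 0] = 7.0
assert w[1, 0, 0] == 7.0

# A direct 2-D view of the same head is equivalent.
direct = w[1]
assert np.array_equal(direct, tile)
```

Because both forms alias the same memory, the prefill-style direct slice and the decode-style slice-plus-reshape should produce identical operands; the difference is only readability.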

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/deepseek_v3_2_decode_front.py` around lines 297 - 300, The decode
path creates q0_nope_latent by reshaping a 3D slice of w_q_nope_to_latent but
the prefill code uses a direct 2D view; update the q0_nope_latent expression
(the use of pl.reshape(pl.slice(...)) around pl.slice(w_q_nope_to_latent, ...))
to match the prefill style by taking the corresponding 2D slice/view of
w_q_nope_to_latent directly (keep operands q_nope0 and w_q_nope_to_latent, and
preserve the target shape [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG] semantics) to
avoid the explicit pl.reshape and align with the codebase slicing convention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd1d3ccd-1aae-48c0-bf0b-9538f8d20d28

📥 Commits

Reviewing files that changed from the base of the PR and between 7e00d41 and e281d8b.

📒 Files selected for processing (1)
  • examples/deepseek_v3_2_decode_front.py

zhangqi-chen merged commit 8a51018 into hw-native-sys:main on Mar 17, 2026
4 checks passed

3 participants