
test(st): Add dynamic-shape paged attention and emit layout attrs#604

Open
Crystal-wzy wants to merge 2 commits into hw-native-sys:main from Crystal-wzy:main

Conversation

@Crystal-wzy
Contributor

Summary

  • Add examples/ir_parser/dynamic_paged_attention_example.py with build_dynamic_paged_attention_program() builder that defines five InCore kernel closures (dyn_kernel_init_inplace, dyn_kernel_qk_matmul, dyn_kernel_softmax_prepare, dyn_kernel_pv_matmul, dyn_kernel_online_update); type annotations use module-level pl.dynamic() variables (Q_HEADS, HEAD_DIM_DYN, BLOCK_SIZE_DYN), load ops use concrete closure variables (_Q_TILE, _HEAD_DIM, _BLOCK_SIZE)
  • Add tests/st/codegen/test_dynamic_paged_attention.py with DynamicPagedAttentionTestCase inheriting golden reference and tensor definitions from PagedAttentionTestCase, targeting Ascend910B PTO backend; 4 parametrized configurations (including one with per-request variable context lengths)
  • Extend tests/st/codegen/test_paged_attention.py: PagedAttentionTestCase accepts context_len: int | list[int] and constructs context_lens tensor from a list when heterogeneous per-request lengths are needed (required by DynamicPagedAttentionTestCase)
  • Fix PTOCodegen::EmitMakeTensorViews to emit {layout = #pto.layout<nd|dn|nz>} attribute on tensor view assignments so DN-layout tensors (e.g. transposed key_cache in paged attention) are correctly represented in generated PTO IR
  • Fix examples/ir_parser/paged_attention_example.py: annotate [16, 1] output tensors in kernel_init_inplace and kernel_softmax_prepare with pl.DN; remove pl.DN from kj parameter in kernel_qk_matmul (key_cache is ND, not DN)
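The context_len handling described in the test_paged_attention.py bullet above can be sketched in plain Python. This is a minimal illustration: the helper name and the explicit length check are assumptions, and the actual test wraps the resulting list in a torch.int32 tensor rather than returning a plain list.

```python
def build_context_lens(context_len, batch_size):
    """Normalize an int-or-list context_len into one length per request."""
    if isinstance(context_len, int):
        # Homogeneous case: broadcast a single length to every request.
        return [context_len] * batch_size
    if len(context_len) != batch_size:
        # Heterogeneous lists must supply exactly one length per request.
        raise ValueError(
            f"expected {batch_size} context lengths, got {len(context_len)}"
        )
    return list(context_len)

print(build_context_lens(128, 4))                     # [128, 128, 128, 128]
print(build_context_lens([512, 4096, 8192, 768], 4))  # [512, 4096, 8192, 768]
```

In the real test the returned lengths would feed torch.tensor(..., dtype=torch.int32) to match the declared context_lens TensorSpec of shape [B].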

Testing

  • DynamicPagedAttentionTestCase passes on 910B PTO hardware
  • Pre-commit hooks pass (ruff, pyright, clang-format, cpplint)

@coderabbitai

coderabbitai bot commented Mar 19, 2026


Walkthrough

Adds a new dynamic-shape paged-attention example and integration test, updates an existing paged-attention example's kernel type annotations and orchestration signature, and modifies PTO codegen tensor-view/stride emission and type/layout printing.

Changes

  • examples/ir_parser/dynamic_paged_attention_example.py (Dynamic Paged-Attention Example): New example that builds a dynamic-shape paged-attention @pl.program with three pl.dynamic(...) variables and five InCore kernels wired into a DynamicPagedAttentionProgram.paged_attention(...) orchestration.
  • examples/ir_parser/paged_attention_example.py (Paged-Attention Example Edits): Updated InCore kernel tensor annotations to add pl.DN on li/mi accumulator outputs and removed pl.DN from kj in the QK matmul; removed three size parameters from PagedAttentionProgram.paged_attention and stopped returning their TensorSpec entries.
  • src/codegen/pto/pto_codegen.cpp (PTO Codegen): Removed N-D row-stride precomputation and simplified stride emission (explicit rank-2 and rank-1 handling); emits layout metadata in !pto.tensor_view<...> annotations with a DN/NZ/ND mapping; narrowed index-constant emission conditions.
  • tests/st/codegen/test_dynamic_paged_attention.py, tests/st/codegen/test_paged_attention.py (Tests, Dynamic & Base): Added test_dynamic_paged_attention.py with DynamicPagedAttentionTestCase and parameterized runs; PagedAttentionTestCase now accepts context_len: int | list[int] and builds the context_lens tensor from a list when per-request lengths differ.
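The layout-attribute emission described in the PTO codegen entry can be illustrated with a small lookup. This is a hedged Python sketch of the attribute text only; the real emitter is C++, and the dictionary and function names here are assumptions, not repository symbols.

```python
# Hypothetical mirror of the DN/NZ/ND mapping used when printing tensor views.
LAYOUT_TO_ATTR = {"ND": "nd", "DN": "dn", "NZ": "nz"}

def layout_attr(layout):
    """Render the attribute suffix emitted on a tensor view assignment."""
    return "{layout = #pto.layout<%s>}" % LAYOUT_TO_ATTR[layout]

print(layout_attr("DN"))  # {layout = #pto.layout<dn>}
```

The point of the fix is that a DN-layout tensor (such as the transposed key_cache) carries this attribute in the generated PTO IR instead of silently defaulting to ND.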

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • Hzfengsy

Poem

🐰 In pages of queries and blocks I leap,
Five tiny kernels wake from sleep,
Shapes bend and stride rules softly shift,
Tests hop in to give them a lift —
A rabbit cheers: "New attention, neat!" 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.87%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title 'test(st): Add dynamic-shape paged attention and emit layout attrs' accurately summarizes the main changes: adding dynamic-shape paged attention tests and emitting layout attributes in code generation.
  • Description check ✅ Passed: The description is comprehensive and directly related to the changeset, detailing the new files, modifications, and fixes across all changed components, with testing information.



@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the PyPTO framework by introducing robust support for dynamic-shape paged attention. It provides a new example demonstrating how InCore kernels can utilize dynamic variables for tensor shapes, enabling greater flexibility in handling varying input dimensions at runtime. Concurrently, it addresses a critical code generation issue by ensuring that the PTO backend correctly emits layout attributes for tensor views, which is essential for accurate representation of data layouts like DN-layout in the generated IR. These changes collectively improve the framework's capability to handle more complex and dynamic deep learning models, particularly in attention mechanisms.

Highlights

  • Dynamic-Shape Paged Attention Example: A new example, examples/ir_parser/dynamic_paged_attention_example.py, was added to demonstrate dynamic-shape paged attention. This includes a builder function build_dynamic_paged_attention_program() that defines five InCore kernel closures, utilizing module-level pl.dynamic() variables for type annotations and concrete closure variables for load operations.
  • Dynamic Paged Attention Tests: New system tests were introduced in tests/st/codegen/test_dynamic_paged_attention.py to validate the dynamic-shape paged attention. These tests include four parametrized configurations, one of which supports per-request variable context lengths, targeting the Ascend910B PTO backend.
  • Enhanced Paged Attention Test Case: The existing tests/st/codegen/test_paged_attention.py was extended to allow PagedAttentionTestCase to accept context_len as either an integer or a list of integers. This enables the construction of context_lens tensors from a list, supporting heterogeneous per-request lengths required by the new dynamic paged attention tests.
  • PTO Codegen Layout Attribute Fix: A fix was implemented in PTOCodegen::EmitMakeTensorViews to ensure that {layout = #pto.layout<nd|dn|nz>} attributes are correctly emitted on tensor view assignments. This is crucial for accurately representing DN-layout tensors, such as transposed key_cache in paged attention, within the generated PTO IR.
  • Paged Attention Example Annotations Refinement: The examples/ir_parser/paged_attention_example.py file was updated to correctly annotate [16, 1] output tensors in kernel_init_inplace and kernel_softmax_prepare with pl.DN layout. Additionally, the pl.DN annotation was removed from the kj parameter in kernel_qk_matmul, as key_cache is ND, not DN.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully introduces dynamic-shape paged attention, enhancing the flexibility of the PyPTO framework. The new dynamic_paged_attention_example.py demonstrates the usage of dynamic shapes in InCore kernels, and the accompanying test_dynamic_paged_attention.py provides comprehensive test coverage, including configurations with variable context lengths. The fix in src/codegen/pto/pto_codegen.cpp to correctly emit layout attributes for pto.make_tensor_view is crucial for accurate representation of DN-layout tensors in the generated IR. Overall, the changes are well-implemented and align with the stated objectives of the pull request. The recurring type-checking issues highlighted in the example file should be addressed by refining type annotations or the DSL's API, aligning with best practices for DSL argument validation.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/codegen/pto/pto_codegen.cpp (1)

351-357: ⚠️ Potential issue | 🟠 Major

Restore the generic row-major stride path for rank > 2 tensors.

After this split, any tensor parameter with rank greater than 2 now falls through to strides = []. pto.make_tensor_view used to receive a full N-D row-major stride vector here, so this is a silent regression for every 3D+ entrypoint.

Also applies to: 440-457

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/pto/pto_codegen.cpp` around lines 351 - 357, The change removed
the generic N-D row-major stride emission for tensors with rank > 2, causing
strides to be left as an empty vector for 3D+ tensors; restore the original
generic path by computing and emitting row-major strides for all ranks > 1 (not
just 2), using tensor_type->shape_ to compute cumulative product of trailing
dimensions (use GetConstIntValue and GetOrEmitIndexConstant to emit each stride
value) so that pto.make_tensor_view receives a full N-D stride vector; apply the
same fix in the corresponding mirrored block around the later copy (the block
that currently handles shape sizes 2 and 1 at the other location).
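The generic row-major stride computation this comment asks to restore (a cumulative product of trailing dimensions) can be sketched in a few lines of Python. This mirrors only the described arithmetic, not the C++ emitter's GetConstIntValue/GetOrEmitIndexConstant plumbing.

```python
def row_major_strides(shape):
    """Compute row-major (C-order) strides, in elements, for a shape.

    stride[i] is the product of all dimensions after i, so the last
    dimension always has stride 1. This is the generic N-D path that
    also covers the rank > 2 tensors the review flags as regressed.
    """
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

print(row_major_strides([4, 16, 128]))  # [2048, 128, 1] (rank-3 case)
print(row_major_strides([16, 128]))     # [128, 1] (rank-2 case)
```

With this generic path, pto.make_tensor_view would again receive a full stride vector for any rank instead of an empty one for rank > 2.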
🧹 Nitpick comments (1)
tests/st/codegen/test_dynamic_paged_attention.py (1)

89-112: Consider pinning the emitted #pto.layout<...> text in a codegen-only check.

These STs validate numerics on hardware, but they do not lock down the exact EmitMakeTensorViews regression this PR fixes. A small codegen-only assertion on the generated PTO IR would catch attribute-loss regressions much earlier.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/st/codegen/test_dynamic_paged_attention.py` around lines 89 - 112, Add
a codegen-only assertion in test_dynamic_paged_attention to pin the emitted PTO
IR layout text (the "#pto.layout<...>" emitted by the EmitMakeTensorViews
regression) so attribute-loss regressions are caught; after calling
test_runner.run(test_case) (and before numeric asserts), retrieve the generated
PTO/IR string from the test result or test_runner API (e.g., a field on result
or a helper that returns the generated IR), and assert that it contains the
expected "#pto.layout<...>" fragment (or a small canonical layout pattern) for
the tensors involved in DynamicPagedAttentionTestCase, failing the test if the
fragment is missing. Ensure you reference the EmitMakeTensorViews-related layout
pattern and keep this check confined to codegen-only (skip on hardware
numeric-only runs if necessary).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/ir_parser/dynamic_paged_attention_example.py`:
- Around line 83-90: The dynamic path currently types the mi/li accumulators as
plain ND tensors in dyn_kernel_init_inplace (and the similar signatures later),
which mismatches the static paged-attention builder that marks these as pl.DN to
indicate DN-shaped layout; update the function signatures (e.g.,
dyn_kernel_init_inplace and the other dynamic kernel init/online-update
signatures referenced) to annotate mi and li as pl.DN[pl.Tensor[[Q_HEADS, 1],
pl.FP32]] (or the equivalent pl.DN wrapper used in the repo) instead of plain
pl.Tensor so the downstream dyn_kernel_online_update consumers see the same
DN-shaped layout semantics. Ensure all other dynamic-kernel declarations noted
in the comment are changed consistently.

In `@tests/st/codegen/test_paged_attention.py`:
- Around line 419-422: The code builds context_lens from self.context_len but
doesn't validate when self.context_len is a list: ensure that in the branch
handling list-valued self.context_len you check that len(self.context_len) == B
(the batch size) and raise a clear ValueError if it differs (or alternatively
broadcast/trim only if that behavior is intended); update the logic around the
context_lens creation (the branch that sets context_lens =
torch.tensor(self.context_len, dtype=torch.int32)) to perform this length check
so the created Tensor matches the declared TensorSpec("context_lens", [B], ...).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d36470db-25be-4743-8025-90fa2590573f

📥 Commits

Reviewing files that changed from the base of the PR and between c0a17f9 and e84dd9b.

📒 Files selected for processing (5)
  • examples/ir_parser/dynamic_paged_attention_example.py
  • examples/ir_parser/paged_attention_example.py
  • src/codegen/pto/pto_codegen.cpp
  • tests/st/codegen/test_dynamic_paged_attention.py
  • tests/st/codegen/test_paged_attention.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/st/codegen/test_dynamic_paged_attention.py (1)

89-97: Add at least one non-block-aligned heterogeneous context length.

The current variable-length case uses [512, 4096, 8192, 768], all exact multiples of block_size=128, so it never exercises the partial-tail block path (valid_len < block_size). A non-multiple here would give this new coverage real teeth.

Suggested tweak
-            # Variable context lengths: each of 4 requests has a different length
-            (4, 16, 128, 128, [512, 4096, 8192, 768], 32768),
+            # Variable context lengths, including partial last blocks
+            (4, 16, 128, 128, [513, 4097, 8192, 770], 32768),
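The partial-tail condition behind this suggestion reduces to simple arithmetic: a context length that is not a multiple of block_size leaves a final block with valid_len < block_size. A quick sketch (helper name is illustrative):

```python
def block_split(context_len, block_size):
    """Return (num_full_blocks, tail_len).

    tail_len > 0 means the last block is only partially valid,
    i.e. the kernel must handle valid_len < block_size.
    """
    return context_len // block_size, context_len % block_size

for n in (768, 770):
    print(n, block_split(n, 128))  # 768 -> (6, 0); 770 -> (6, 2)
```

So with context lengths like [512, 4096, 8192, 768] and block_size=128, every tail is zero and the partial-block path never runs, which is exactly why the review suggests values such as 770.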
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/st/codegen/test_dynamic_paged_attention.py` around lines 89 - 97, The
variable-length param case in the pytest.mark.parametrize tuple uses context_len
= [512, 4096, 8192, 768], which are all multiples of block_size (128) and thus
never exercises the partial-tail block path; update the heterogeneous context
length list in the parameter tuple (the case passed to pytest.mark.parametrize)
so at least one entry is not a multiple of block_size (e.g., change 768 to 770
or another non-multiple) so the test for dynamic_paged_attention covers the
partial-tail scenario.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5234711b-59d6-4b69-967b-d38085b28bfe

📥 Commits

Reviewing files that changed from the base of the PR and between e84dd9b and 7701d5d.

📒 Files selected for processing (5)
  • examples/ir_parser/dynamic_paged_attention_example.py
  • examples/ir_parser/paged_attention_example.py
  • src/codegen/pto/pto_codegen.cpp
  • tests/st/codegen/test_dynamic_paged_attention.py
  • tests/st/codegen/test_paged_attention.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/st/codegen/test_paged_attention.py
  • examples/ir_parser/dynamic_paged_attention_example.py
  • src/codegen/pto/pto_codegen.cpp

Additional item from the PR description not listed in the summary above:

  • Remove unused size_query, size_key_cache, size_value_cache parameters from orchestration function signatures in paged_attention_example.py; also remove their construction and TensorSpec entries from build_tensor_specs in paged_attention_example.py
@Crystal-wzy force-pushed the main branch 2 times, most recently from 6b3b68a to 427912a (March 19, 2026, 07:54)