test(st): Add dynamic-shape paged attention and emit layout attrs #604
Crystal-wzy wants to merge 2 commits into hw-native-sys:main from
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds a new dynamic-shape paged-attention example and integration test, updates an existing paged-attention example's kernel type annotations and orchestration signature, and modifies PTO codegen tensor-view/stride emission and type/layout printing.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the PyPTO framework by introducing support for dynamic-shape paged attention. It adds a new example demonstrating how InCore kernels can use dynamic variables for tensor shapes, enabling greater flexibility in handling varying input dimensions at runtime. It also fixes a code generation issue by ensuring that the PTO backend emits layout attributes for tensor views, which is needed to correctly represent data layouts such as DN layout in the generated IR. Together, these changes improve the framework's ability to handle more complex and dynamic deep learning models, particularly attention mechanisms.
Code Review
This pull request successfully introduces dynamic-shape paged attention, enhancing the flexibility of the PyPTO framework. The new dynamic_paged_attention_example.py demonstrates the usage of dynamic shapes in InCore kernels, and the accompanying test_dynamic_paged_attention.py provides comprehensive test coverage, including configurations with variable context lengths. The fix in src/codegen/pto/pto_codegen.cpp to correctly emit layout attributes for pto.make_tensor_view is crucial for accurate representation of DN-layout tensors in the generated IR. Overall, the changes are well-implemented and align with the stated objectives of the pull request. The recurring type-checking issues highlighted in the example file should be addressed by refining type annotations or the DSL's API, aligning with best practices for DSL argument validation.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/codegen/pto/pto_codegen.cpp (1)
351-357: ⚠️ Potential issue | 🟠 Major — Restore the generic row-major stride path for rank > 2 tensors.
After this split, any tensor parameter with rank greater than 2 now falls through to `strides = []`. `pto.make_tensor_view` used to receive a full N-D row-major stride vector here, so this is a silent regression for every 3D+ entrypoint.
Also applies to: 440-457
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/codegen/pto/pto_codegen.cpp` around lines 351 - 357, The change removed the generic N-D row-major stride emission for tensors with rank > 2, causing strides to be left as an empty vector for 3D+ tensors; restore the original generic path by computing and emitting row-major strides for all ranks > 1 (not just 2), using tensor_type->shape_ to compute cumulative product of trailing dimensions (use GetConstIntValue and GetOrEmitIndexConstant to emit each stride value) so that pto.make_tensor_view receives a full N-D stride vector; apply the same fix in the corresponding mirrored block around the later copy (the block that currently handles shape sizes 2 and 1 at the other location).
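For reference, the row-major rule this comment describes (stride[i] is the product of all trailing dimensions) can be written out as a small sketch. It is plain Python rather than the C++ in `PTOCodegen::EmitMakeTensorViews`, and the real fix should go through the repo's own helpers (`GetConstIntValue`, `GetOrEmitIndexConstant`); this only illustrates the stride math being restored.

```python
# Illustrative only: the generic N-D row-major stride rule described above.
def row_major_strides(shape):
    """stride[i] = product of shape[i+1:]; the innermost stride is 1."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# A rank-3 tensor such as [num_blocks, block_size, head_dim] gets a full
# 3-entry stride vector instead of falling through to strides = [].
assert row_major_strides([4, 128, 128]) == [16384, 128, 1]
```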
🧹 Nitpick comments (1)
tests/st/codegen/test_dynamic_paged_attention.py (1)
89-112: Consider pinning the emitted `#pto.layout<...>` text in a codegen-only check.
These STs validate numerics on hardware, but they do not lock down the exact `EmitMakeTensorViews` regression this PR fixes. A small codegen-only assertion on the generated PTO IR would catch attribute-loss regressions much earlier.
Verify each finding against the current code and only fix it if needed. In `@tests/st/codegen/test_dynamic_paged_attention.py` around lines 89 - 112, Add a codegen-only assertion in test_dynamic_paged_attention to pin the emitted PTO IR layout text (the "#pto.layout<...>" emitted by the EmitMakeTensorViews regression) so attribute-loss regressions are caught; after calling test_runner.run(test_case) (and before numeric asserts), retrieve the generated PTO/IR string from the test result or test_runner API (e.g., a field on result or a helper that returns the generated IR), and assert that it contains the expected "#pto.layout<...>" fragment (or a small canonical layout pattern) for the tensors involved in DynamicPagedAttentionTestCase, failing the test if the fragment is missing. Ensure you reference the EmitMakeTensorViews-related layout pattern and keep this check confined to codegen-only (skip on hardware numeric-only runs if necessary).
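A minimal sketch of what such a codegen-only check could look like. The way the generated PTO IR is retrieved here (a `generated_ir` field on the run result) and the test-case constructor arguments are assumptions for illustration; only the `#pto.layout<dn>` fragment comes from this PR.

```python
# Sketch only: the IR accessor and constructor arguments are hypothetical.
def test_dn_layout_attr_emitted(test_runner):
    test_case = DynamicPagedAttentionTestCase(
        batch=4, q_heads=16, head_dim=128, block_size=128,
        context_len=512, max_context_len=32768,
    )
    result = test_runner.run(test_case)
    ir_text = result.generated_ir  # hypothetical: however the harness exposes the PTO IR
    # The transposed key_cache view should carry an explicit DN layout attribute.
    assert "#pto.layout<dn>" in ir_text
```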
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/ir_parser/dynamic_paged_attention_example.py`:
- Around line 83-90: The dynamic path currently types the mi/li accumulators as
plain ND tensors in dyn_kernel_init_inplace (and the similar signatures later),
which mismatches the static paged-attention builder that marks these as pl.DN to
indicate DN-shaped layout; update the function signatures (e.g.,
dyn_kernel_init_inplace and the other dynamic kernel init/online-update
signatures referenced) to annotate mi and li as pl.DN[pl.Tensor[[Q_HEADS, 1],
pl.FP32]] (or the equivalent pl.DN wrapper used in the repo) instead of plain
pl.Tensor so the downstream dyn_kernel_online_update consumers see the same
DN-shaped layout semantics. Ensure all other dynamic-kernel declarations noted
in the comment are changed consistently.
In `@tests/st/codegen/test_paged_attention.py`:
- Around line 419-422: The code builds context_lens from self.context_len but
doesn't validate when self.context_len is a list: ensure that in the branch
handling list-valued self.context_len you check that len(self.context_len) == B
(the batch size) and raise a clear ValueError if it differs (or alternatively
broadcast/trim only if that behavior is intended); update the logic around the
context_lens creation (the branch that sets context_lens =
torch.tensor(self.context_len, dtype=torch.int32)) to perform this length check
so the created Tensor matches the declared TensorSpec("context_lens", [B], ...).
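To make the first prompt above concrete, here is a rough before/after of the accumulator annotations it asks for. The `pl.DN[pl.Tensor[...]]` form is taken from the review text; the surrounding signature is a simplified stand-in for the real `dyn_kernel_init_inplace` and assumes the repo's `pl` DSL is imported as in the example file.

```python
# Simplified sketch; not the repo's actual kernel signature.
Q_HEADS = pl.dynamic()

# Before: mi / li typed as plain ND tensors.
# def dyn_kernel_init_inplace(
#     mi: pl.Tensor[[Q_HEADS, 1], pl.FP32],
#     li: pl.Tensor[[Q_HEADS, 1], pl.FP32],
# ): ...

# After: mark both accumulators as DN so dyn_kernel_online_update and the
# other dynamic kernels see the same DN-shaped layout as the static builder.
def dyn_kernel_init_inplace(
    mi: pl.DN[pl.Tensor[[Q_HEADS, 1], pl.FP32]],
    li: pl.DN[pl.Tensor[[Q_HEADS, 1], pl.FP32]],
): ...
```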
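And for the second prompt, a minimal sketch of the length check it describes, written as a standalone helper rather than the in-method code in `PagedAttentionTestCase` (the helper name is hypothetical):

```python
import torch

def build_context_lens(context_len, batch_size: int) -> torch.Tensor:
    """Build the [B]-shaped context_lens tensor, validating list inputs."""
    if isinstance(context_len, list):
        if len(context_len) != batch_size:
            raise ValueError(
                f"context_len has {len(context_len)} entries, "
                f"expected {batch_size} (one per request)"
            )
        return torch.tensor(context_len, dtype=torch.int32)
    return torch.full((batch_size,), context_len, dtype=torch.int32)
```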
📒 Files selected for processing (5)
- examples/ir_parser/dynamic_paged_attention_example.py
- examples/ir_parser/paged_attention_example.py
- src/codegen/pto/pto_codegen.cpp
- tests/st/codegen/test_dynamic_paged_attention.py
- tests/st/codegen/test_paged_attention.py
🧹 Nitpick comments (1)
tests/st/codegen/test_dynamic_paged_attention.py (1)
89-97: Add at least one non-block-aligned heterogeneous context length.
The current variable-length case uses `[512, 4096, 8192, 768]`, all exact multiples of `block_size=128`, so it never exercises the partial-tail block path (valid_len < block_size). A non-multiple here would give this new coverage real teeth.
Suggested tweak:
- # Variable context lengths: each of 4 requests has a different length
- (4, 16, 128, 128, [512, 4096, 8192, 768], 32768),
+ # Variable context lengths, including partial last blocks
+ (4, 16, 128, 128, [513, 4097, 8192, 770], 32768),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/st/codegen/test_dynamic_paged_attention.py` around lines 89 - 97, The variable-length param case in the pytest.mark.parametrize tuple uses context_len = [512, 4096, 8192, 768], which are all multiples of block_size (128) and thus never exercises the partial-tail block path; update the heterogeneous context length list in the parameter tuple (the case passed to pytest.mark.parametrize) so at least one entry is not a multiple of block_size (e.g., change 768 to 770 or another non-multiple) so the test for dynamic_paged_attention covers the partial-tail scenario.
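To spell out why the swapped-in values matter: with `block_size = 128`, every original length divides evenly, so each block is completely full; a length like 770 leaves a short tail.

```python
block_size = 128
for n in (512, 4096, 8192, 768):
    assert n % block_size == 0   # every block full: partial-tail path never runs
assert 770 % block_size == 2     # last block holds only 2 valid rows (valid_len < block_size)
```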
📒 Files selected for processing (5)
- examples/ir_parser/dynamic_paged_attention_example.py
- examples/ir_parser/paged_attention_example.py
- src/codegen/pto/pto_codegen.cpp
- tests/st/codegen/test_dynamic_paged_attention.py
- tests/st/codegen/test_paged_attention.py
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/st/codegen/test_paged_attention.py
- examples/ir_parser/dynamic_paged_attention_example.py
- src/codegen/pto/pto_codegen.cpp
## Summary
- Add `examples/ir_parser/dynamic_paged_attention_example.py` with
`build_dynamic_paged_attention_program()` builder that defines five InCore
kernel closures (`dyn_kernel_init_inplace`, `dyn_kernel_qk_matmul`,
`dyn_kernel_softmax_prepare`, `dyn_kernel_pv_matmul`, `dyn_kernel_online_update`);
type annotations use module-level `pl.dynamic()` variables
(`Q_HEADS`, `HEAD_DIM_DYN`, `BLOCK_SIZE_DYN`), load ops use concrete
closure variables (`_Q_TILE`, `_HEAD_DIM`, `_BLOCK_SIZE`); see the sketch after this list
- Add `tests/st/codegen/test_dynamic_paged_attention.py` with
`DynamicPagedAttentionTestCase` inheriting golden reference and tensor
definitions from `PagedAttentionTestCase`, targeting Ascend910B PTO backend;
4 parametrized configurations (including one with per-request variable context lengths)
- Extend `tests/st/codegen/test_paged_attention.py`: `PagedAttentionTestCase`
accepts `context_len: int | list[int]` and constructs `context_lens` tensor
from a list when heterogeneous per-request lengths are needed (required by
`DynamicPagedAttentionTestCase`)
- Fix `PTOCodegen::EmitMakeTensorViews` to emit `{layout = #pto.layout<nd|dn|nz>}`
attribute on tensor view assignments so DN-layout tensors (e.g. transposed
key_cache in paged attention) are correctly represented in generated PTO IR
- Fix `examples/ir_parser/paged_attention_example.py`: annotate `[16, 1]`
output tensors in `kernel_init_inplace` and `kernel_softmax_prepare` with `pl.DN`;
remove `pl.DN` from `kj` parameter in `kernel_qk_matmul` (key_cache is ND, not DN)
- Remove unused `size_query`, `size_key_cache`, `size_value_cache` parameters from the
orchestration function signatures in `paged_attention_example.py`, along with their construction and their `TensorSpec` entries in `build_tensor_specs`
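The split between dynamic annotation variables and concrete closure variables mentioned in the first bullet can be sketched roughly as follows. This is assembled only from the names in this description (`pl.dynamic()`, `Q_HEADS`, `_Q_TILE`, etc.); the actual `build_dynamic_paged_attention_program()` in the example file is more involved and may differ in detail.

```python
# Loose sketch, not the example's real code; assumes the repo's `pl` DSL.
Q_HEADS = pl.dynamic()        # module-level dynamic shape variables,
HEAD_DIM_DYN = pl.dynamic()   # used only in kernel type annotations

def build_dynamic_paged_attention_program(q_tile: int, head_dim: int):
    _Q_TILE, _HEAD_DIM = q_tile, head_dim   # concrete closure values for load ops

    def dyn_kernel_init_inplace(
        mi: pl.DN[pl.Tensor[[Q_HEADS, 1], pl.FP32]],   # running max accumulator
        li: pl.DN[pl.Tensor[[Q_HEADS, 1], pl.FP32]],   # running sum accumulator
    ):
        ...  # loads inside the body use the concrete _Q_TILE / _HEAD_DIM values

    # The real builder also defines dyn_kernel_qk_matmul, dyn_kernel_softmax_prepare,
    # dyn_kernel_pv_matmul and dyn_kernel_online_update.
    return dyn_kernel_init_inplace
```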
## Testing
- [x] DynamicPagedAttentionTestCase passes on 910B PTO hardware
- [x] Pre-commit hooks pass (ruff, pyright, clang-format, cpplint)
6b3b68a to 427912a