feat(ascend): op-cache-attn group — ReshapeAndCache, FlashAttention, PagedAttention, TopkToppSampling #67
Open
zhangyue207 wants to merge 1 commit into master from
Conversation
Collaborator (Author): merge test

Force-pushed from 083573a to e3b6f16
Ziminli requested changes (Apr 21, 2026)
//
// When cu_seqlens is a CPU tensor (device type kCpu), the data pointer is
// already on the host and can be read directly — no D2H sync needed.
inline aclIntArray* extractSeqLengths(const Tensor& cu_seqlens,
// convention for npu_fused_infer_attention_score actual_seq_lengths.
//
// When cu_seqlens is a CPU tensor, reads directly from host memory.
inline aclIntArray* cumSeqLengths(const Tensor& cu_seqlens,
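As an aside, a minimal sketch of the host-side read these comments describe, assuming CANN's `aclCreateIntArray`; the helper name and raw-pointer interface are illustrative, not the PR's actual signature:

```cpp
#include <cstdint>

#include "aclnn/aclnn_base.h"

// When cu_seqlens is CPU-resident, its data pointer is ordinary host memory,
// so the lengths can be wrapped into an aclIntArray directly: no D2H copy
// and no stream synchronization on the way.
inline aclIntArray* seqLengthsFromHost(const int64_t* host_ptr,
                                       uint64_t batch) {
  return aclCreateIntArray(host_ptr, batch);  // copies host ints, no device I/O
}
```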
Comment on lines +331 to +338
assert(gws == ACL_SUCCESS &&
       "aclnnFusedInferAttentionScoreV4GetWorkspaceSize failed (decode)");

auto& arena = ascend::GetWorkspacePool().Ensure(stream, ws_needed);
aclError ret =
    aclnnFusedInferAttentionScoreV4(arena.buf, ws_needed, executor, stream);
assert(ret == ACL_SUCCESS &&
       "aclnnFusedInferAttentionScoreV4 failed (decode)");
if (!has_block_table_host_) {
  bt_host_ = std::malloc(bt_host_bytes_);
  assert(bt_host_ && "Host buffer allocation for `block_table` failed");
Collaborator
Error messages should start with a lowercase letter; the other related files should be checked and fixed for consistency as well.
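For illustration, the requested convention applied to the snippet above (hypothetical suggested change, not part of the diff):

```cpp
// before
assert(bt_host_ && "Host buffer allocation for `block_table` failed");
// after: error message starts lowercase, per review convention
assert(bt_host_ && "host buffer allocation for `block_table` failed");
```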
Comment on lines +59 to +62
int64_t B = static_cast<int64_t>(batch_size_);
int64_t N = num_heads_;
int64_t Nkv = num_kv_heads_;
int64_t D = head_size_;
…PagedAttention, TopkToppSampling

Four KV-cache and attention operators:

| op | impl |
|---|---|
| ReshapeAndCache | 3 impls: aclnnInplaceIndexCopy (kernel.h); custom AscendC (kernel_v2.h); ATB `ReshapeAndCacheParam` (kernel_atb.h, int64 `slot_mapping` handled via cached async `aclnnCast`) |
| FlashAttention | `aclnnFusedInferAttentionScoreV4` (prefill + paged decode). Supports both the native `(window_left, window_right)` pair and a new `std::optional<int64_t> sliding_window` entry (additive, vLLM-style) |
| PagedAttention | ATB `PagedAttentionParam` with optional CPU-pinned host tensors (`seq_lens_host` / `block_table_host`) that make the op NPUGraph-capturable |
| TopkToppSampling | ATB `TopkToppSamplingParam` |

Includes vLLM API alignment commits:

- `perf(reshape_and_cache)`: int64 slot_mapping routed through cached async `aclnnCast` (no D2H sync, NPUGraph-compatible)
- `feat(flash_attention)`: add `sliding_window` entry, additive
- `docs(paged_attention)`: base class comment explains the CPU-host tensor contract

New `src/base/<op>.h`: paged_attention, topk_topp_sampling. Modified: reshape_and_cache, flash_attention.
Force-pushed from e3b6f16 to 6b8b32f
Summary
Four KV-cache and attention Ascend operators — ReshapeAndCache,
FlashAttention, PagedAttention, TopkToppSampling — completing the Ascend
operator set needed for transformer decode.
Part 4 of 4 in the Ascend operator split. Parallel-reviewable with
op-simple and op-norm-rope (operator sets are disjoint).
Depends on: feat/ascend-framework-pr must merge first.

Operators

- ReshapeAndCache: 3 impls (`aclnnInplaceIndexCopy` in kernel.h; custom AscendC in kernel_v2.h; ATB `ReshapeAndCacheParam` in kernel_atb.h). int64 `slot_mapping` handled via cached async `aclnnCast` — no D2H sync, NPUGraph-compatible.
- FlashAttention: `aclnnFusedInferAttentionScoreV4` (prefill + paged decode). Supports both the native `(window_left, window_right)` pair AND a `std::optional<int64_t> sliding_window` entry (vLLM-style, additive).
- PagedAttention: ATB `PagedAttentionParam` (impl=0). Optional CPU-pinned host tensors (`seq_lens_host` / `block_table_host`) enable NPUGraph capture by avoiding per-layer sync D2H.
- TopkToppSampling: ATB `TopkToppSamplingParam`.

vLLM API alignment
perf(reshape_and_cache): async int64→int32 slot_mapping

ATB `ReshapeAndCacheParam` requires an int32 `slot_mapping`. The previous implementation handled int64 (PyTorch / vLLM's native dtype) via D2H + CPU cast + H2D + `aclrtSynchronizeStream`, which stalled the stream and made the int64 path NPUGraph-incapturable. Replaced with a cached `aclnnCast` async conversion on-stream; performance matches the int32 pass-through, and the whole op is now graph-capturable.
feat(flash_attention): add `sliding_window` entry (additive)

The native `window_left` / `window_right` pair is kept as-is; an optional `std::optional<int64_t> sliding_window` was added. When only `sliding_window` is given, it is normalized to `(sliding_window - 1, 0)`, a causal sliding window (vLLM convention).

`test_flash_attention_sliding_window_equivalence` asserts bit-exact equivalence between the two entry points.
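A minimal sketch of the normalization rule described above; the helper name is illustrative, not the PR's actual API:

```cpp
#include <cstdint>
#include <optional>
#include <utility>

// Resolves the effective (window_left, window_right) pair. A lone
// sliding_window of W means "attend causally to the last W tokens"
// (vLLM convention), i.e. window_left = W - 1, window_right = 0.
inline std::pair<int64_t, int64_t> ResolveWindow(
    int64_t window_left, int64_t window_right,
    std::optional<int64_t> sliding_window) {
  if (sliding_window.has_value()) {
    return {*sliding_window - 1, 0};  // normalized causal sliding window
  }
  return {window_left, window_right};  // native pair passes through unchanged
}
```

For example, `ResolveWindow(0, 0, 128)` yields `(127, 0)` under this rule.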
docs(paged_attention): host tensor contract

The `src/base/paged_attention.h` class comment explains why `seq_lens_host` / `block_table_host` exist (CANN's `qSeqLens` CPU-resident contract + ATB `hostData` + NPUGraph capture prerequisite), so future backend implementors understand the API contract.
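A hypothetical sketch of what that documented contract could look like in the base header; the member names `seq_lens_host` / `block_table_host` come from the PR, everything else is illustrative:

```cpp
#include <optional>

// Illustrative stand-in for the backend tensor type; the real class lives in
// the project's base headers.
struct Tensor {
  void* data = nullptr;
};

// Hypothetical shape of the contract documented in src/base/paged_attention.h.
class PagedAttentionBase {
 public:
  // Contract: when set, these tensors must be CPU-resident (ideally pinned).
  // CANN requires qSeqLens to live on the host, ATB consumes them through
  // hostData, and keeping them host-side removes the per-layer D2H sync that
  // would otherwise make NPUGraph capture impossible.
  std::optional<Tensor> seq_lens_host;
  std::optional<Tensor> block_table_host;
};
```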
Base headers

New: `src/base/paged_attention.h`, `src/base/topk_topp_sampling.h`
Modified: `src/base/reshape_and_cache.h`, `src/base/flash_attention.h`

Verification
python3 .ci/run.py --local --gpu-id <N> (Ascend 910B + CANN 8.5.1): 3129 passed / 1798 skipped / 0 failed
Test plan
- python3 .ci/run.py --local
- test_flash_attention_sliding_window_equivalence (pair vs sliding_window, bit-exact): 2 passed
- test_reshape_and_cache (int32 + int64 paths): 32 passed
- test_paged_attention (910B skip removed after CANN 8.5.1 fix): 10 passed
- clang-format passes locally