[example] Linalg to XeGPU fused attention implementation. #148

Draft

charithaintc wants to merge 73 commits into llvm:main from charithaintc:flash_attention_tiling_imex_version

Conversation

@charithaintc
Contributor

This example demonstrates how to optimize a standard attention kernel, written at the linalg level, into a fused attention kernel that can run on the GPU.
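
For reference, the standard (scaled dot-product) attention computation on each (batch, head) slice, with Q, K, V of shape ctx_len x d_head, is

```math
O = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_{head}}}\right) V
```

The fused version avoids materializing the full ctx_len x ctx_len score matrix in memory.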

Main steps involved (a rough sketch of the transform script is shown after this list):

  1. Generate the standard attention payload on 4d tensors (batch x head x ctx_len x d_head).
  2. Tile and fuse the outer parallel dims (batch and head).
  3. Vectorize/bufferize.
  4. Use transform extensions to generate the inner tiled reduction loop (until we have a better solution).
  5. Distribute to GPU workgroups.
  6. Set xegpu layouts and lower to binary.
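
As a minimal, hypothetical sketch of steps 2 and 3 using the upstream transform dialect (the matched op names, tile sizes, and structure are placeholders for illustration, not the exact script in this PR; the XeGPU-specific steps are omitted):

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    // Hypothetical match: assume the attention payload is expressed as
    // linalg.generic ops over the 4d tensors described above.
    %attn = transform.structured.match ops{["linalg.generic"]} in %root
        : (!transform.any_op) -> !transform.any_op
    // Step 2: tile the outer parallel dims (batch, head) into an scf.forall;
    // the results are handles to the tiled op and the generated scf.forall.
    %tiled:2 = transform.structured.tile_using_forall %attn tile_sizes [1, 1, 0, 0]
        : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    // Step 3: vectorize all linalg ops inside the containing function.
    %f = transform.structured.match ops{["func.func"]} in %root
        : (!transform.any_op) -> !transform.any_op
    %vectorized = transform.structured.vectorize_children_and_apply_patterns %f
        : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```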

Currently this depends on a fix for #147.

@charithaintc changed the title from "[example] XeGPU fused attention implementation." to "[example] Linalg to XeGPU fused attention implementation." on May 15, 2026