CUDA FA: run KV_max mask scan for all Q batch sizes#22137

Open
ssam18 wants to merge 1 commit into ggml-org:master from ssam18:fix/fattn-kv-max-small-batch
Conversation

@ssam18
Contributor

@ssam18 ssam18 commented Apr 20, 2026

The KV_max mask scan was gated behind Q->ne[1] >= 1024, so decode steps never ran it. This caused the flash attention vec kernel on Pascal GPUs to read past valid KV entries into uninitialized memory beyond the current context, triggering device-side exceptions above 24K tokens. Removing the batch-size guard fixes the crash with negligible overhead since the scan is a single small kernel launch. Fixes #22032

@ssam18 ssam18 requested a review from a team as a code owner April 20, 2026 00:11
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026
@JohannesGaessler
Contributor

This is the wrong fix. Those indices are solely an optimization; any out-of-bounds checks should work regardless.
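The reviewer's point can be illustrated with a minimal host-side sketch, again with hypothetical names rather than the real kernel code: even if `kv_max` overshoots (because the scan was skipped or imprecise), the inner loop must clamp against the true KV length (`ne11` in ggml terms), so correctness never depends on the optimization bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: reduce over KV positions up to kv_max, where kv_max is only an
// upper bound used for early termination. The explicit (kv < ne11) check is
// what keeps the loop correct even when kv_max exceeds the valid KV length.
float max_score(const std::vector<float> &scores, int ne11, int kv_max) {
    float m = -INFINITY;
    for (int kv = 0; kv < kv_max; ++kv) {
        // Out-of-bounds KV positions contribute -inf, no matter how far
        // kv_max lets the loop run.
        const float s = (kv < ne11) ? scores[kv] : -INFINITY;
        m = std::max(m, s);
    }
    return m;
}
```

With this guard in place, `max_score(s, ne11, ne11)` and `max_score(s, ne11, padded)` yield the same result; if they ever differ, the bug is in the bounds check, not in how `kv_max` was computed.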



Development

Successfully merging this pull request may close these issues.

Eval bug: Flash attention crash (MUL_MAT failed / cudaStreamSynchronize) on Pascal GPUs with MiniMax-M2.7 at >24K context
