CUDA FA: run KV_max mask scan for all Q batch sizes#22137

Open
ssam18 wants to merge 1 commit into ggml-org:master from ssam18:fix/fattn-kv-max-small-batch
Conversation

@ssam18
Contributor

@ssam18 ssam18 commented Apr 20, 2026

The KV_max mask scan was gated behind Q->ne[1] >= 1024, so decode steps never ran it. This caused the flash attention vec kernel on Pascal GPUs to read past valid KV entries into uninitialized memory beyond the current context, triggering device-side exceptions above 24K tokens. Removing the batch-size guard fixes the crash with negligible overhead since the scan is a single small kernel launch. Fixes #22032

@ssam18 ssam18 requested a review from a team as a code owner April 20, 2026 00:11
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026
@JohannesGaessler
Contributor

This is the wrong fix. Those indices are solely an optimization; any out-of-bounds checks should work regardless.
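The reviewer's point can be illustrated with a minimal host-side sketch, again with hypothetical names rather than the real kernel code: even if `kv_max` overshoots (because the scan was skipped or imprecise), the inner loop must clamp against the true KV length (`ne11` in ggml terms), so correctness never depends on the optimization bound.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: reduce over KV positions up to kv_max, where kv_max is only an
// upper bound used for early termination. The explicit (kv < ne11) check is
// what keeps the loop correct even when kv_max exceeds the valid KV length.
float max_score(const std::vector<float> &scores, int ne11, int kv_max) {
    float m = -INFINITY;
    for (int kv = 0; kv < kv_max; ++kv) {
        // Out-of-bounds KV positions contribute -inf, no matter how far
        // kv_max lets the loop run.
        const float s = (kv < ne11) ? scores[kv] : -INFINITY;
        m = std::max(m, s);
    }
    return m;
}
```

With this guard in place, `max_score(s, ne11, ne11)` and `max_score(s, ne11, padded)` yield the same result; if they ever differ, the bug is in the bounds check, not in how `kv_max` was computed.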



Development

Successfully merging this pull request may close these issues.

Eval bug: Flash attention crash (MUL_MAT failed / cudaStreamSynchronize) on Pascal GPUs with MiniMax-M2.7 at >24K context
