hip: direct alloc for FA f16 temp buffers #22185

Open
TheTom wants to merge 1 commit into ggml-org:master from TheTom:fix/hip-fa-pool-retention

Conversation


@TheTom TheTom commented Apr 20, 2026

Fixes #22107. On HIP without VMM, the legacy pool holds FA f16 temp
buffers at peak size after use, so quantized KV OOMs before f16 at
the same context length.
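To illustrate the retention behavior being fixed, here is a minimal host-side toy model of a legacy-style caching pool. It is not the actual ggml_cuda_pool_leg implementation; malloc/free stand in for hipMalloc/hipFree, and all names are illustrative. The point it demonstrates: freed buffers are only marked reusable, not released, so one large transient allocation pins peak-sized memory for the pool's lifetime.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>
#include <vector>

// Toy model of a legacy-style caching pool (illustrative, not the ggml code).
// malloc/free stand in for hipMalloc/hipFree so this runs on the host.
struct toy_legacy_pool {
    struct buf { void * ptr; size_t size; bool in_use; };
    std::vector<buf> bufs;
    size_t held = 0; // bytes currently held by the pool (in use or cached)

    void * alloc(size_t size) {
        // reuse the first cached buffer that is large enough
        for (auto & b : bufs) {
            if (!b.in_use && b.size >= size) { b.in_use = true; return b.ptr; }
        }
        void * p = std::malloc(size);
        bufs.push_back({p, size, true});
        held += size;
        return p;
    }

    // does NOT return memory to the allocator, only marks it reusable:
    // this is why the pool stays at peak size after a large temp buffer
    void release(void * ptr) {
        for (auto & b : bufs) {
            if (b.ptr == ptr) { b.in_use = false; return; }
        }
    }

    ~toy_legacy_pool() {
        for (auto & b : bufs) std::free(b.ptr);
    }
};
```

After one FA pass allocates and releases a large f16 temp buffer, `held` stays at the peak, so a subsequent quantized-KV allocation has to fit alongside the retained buffer rather than in its place.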

Adds ggml_cuda_direct_alloc in common.cuh (mirrors pool_alloc
interface) and uses it for K_f16/V_f16 in launch_fattn. HIP-only,
two files changed.
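A rough sketch of the shape such a direct-allocation helper could take, assuming it mirrors the get()/alloc() surface of ggml_cuda_pool_alloc. This is not the code from the PR: malloc/free stand in for hipMalloc/hipFree so the sketch runs on the host, and the exact interface in common.cuh may differ. The key difference from the pool is that the destructor returns memory immediately instead of caching it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>

// Hypothetical RAII wrapper for a direct (non-pooled) device allocation.
// In the real HIP path, std::malloc/std::free would be hipMalloc/hipFree.
template <typename T>
struct direct_alloc {
    T * ptr = nullptr;

    direct_alloc() = default;
    explicit direct_alloc(size_t n) { alloc(n); }

    T * alloc(size_t n) {
        ptr = static_cast<T *>(std::malloc(n * sizeof(T))); // hipMalloc in the real code
        return ptr;
    }

    T * get() { return ptr; }

    // memory goes straight back to the allocator when the buffer dies,
    // rather than being retained at peak size by a pool
    ~direct_alloc() {
        if (ptr != nullptr) std::free(ptr); // hipFree in the real code
    }

    direct_alloc(const direct_alloc &) = delete;
    direct_alloc & operator=(const direct_alloc &) = delete;
};
```

With this shape, the K_f16/V_f16 temporaries in launch_fattn would live only for the duration of the kernel launch path, so their peak footprint is never cached across decode steps.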

Complements #22155 which is approved as a general pool OOM safety net.
This avoids the OOM in the first place so the flush-retry path
doesn't trigger. Tested both side by side, no perf regression at
depth.

Tested on gfx1100 (RX 7900 XTX), gfx1200 (RX 9060 XT), gfx1201
(RX 9070 XT) by multiple community testers.

Requirements

On HIP without VMM, the legacy pool retains these at peak size
causing quantized KV to OOM before f16. ggml_cuda_direct_alloc<T>
uses raw hipMalloc/hipFree instead. HIP-only, complements ggml-org#22155.

Fixes ggml-org#22107 without performance degradation.
Tested: gfx1100, gfx1200, gfx1201.
@TheTom TheTom marked this pull request as ready for review April 20, 2026 20:52
@TheTom TheTom requested a review from a team as a code owner April 20, 2026 20:52

ggml-gh-bot bot commented Apr 20, 2026

Hi @TheTom, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.


TheTom commented Apr 20, 2026

> Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

One is a draft (#21119), and one has been waiting on review for 2 weeks (#21452). This PR is a bug fix for an OOM affecting all HIP users with quantized KV at long context. Happy to prioritize however maintainers prefer.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026


Development

Successfully merging this pull request may close these issues.

Misc. bug: [CUDA/ROCm] VRAM leak/fragmentation in ggml_cuda_pool_leg when using Flash Attention
