hip: direct alloc for FA f16 temp buffers #22185

Open
TheTom wants to merge 1 commit into ggml-org:master from TheTom:fix/hip-fa-pool-retention

Conversation


@TheTom TheTom commented Apr 20, 2026

Fixes #22107. On HIP without VMM, the legacy pool holds FA f16 temp
buffers at peak size after use, so quantized KV OOMs before f16 at
the same context length.
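To illustrate the retention behavior being fixed, here is a minimal host-side toy model of a legacy-style caching pool. It is not the actual ggml_cuda_pool_leg implementation; malloc/free stand in for hipMalloc/hipFree, and all names are illustrative. The point it demonstrates: freed buffers are only marked reusable, not released, so one large transient allocation pins peak-sized memory for the pool's lifetime.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>
#include <vector>

// Toy model of a legacy-style caching pool (illustrative, not the ggml code).
// malloc/free stand in for hipMalloc/hipFree so this runs on the host.
struct toy_legacy_pool {
    struct buf { void * ptr; size_t size; bool in_use; };
    std::vector<buf> bufs;
    size_t held = 0; // bytes currently held by the pool (in use or cached)

    void * alloc(size_t size) {
        // reuse the first cached buffer that is large enough
        for (auto & b : bufs) {
            if (!b.in_use && b.size >= size) { b.in_use = true; return b.ptr; }
        }
        void * p = std::malloc(size);
        bufs.push_back({p, size, true});
        held += size;
        return p;
    }

    // does NOT return memory to the allocator, only marks it reusable:
    // this is why the pool stays at peak size after a large temp buffer
    void release(void * ptr) {
        for (auto & b : bufs) {
            if (b.ptr == ptr) { b.in_use = false; return; }
        }
    }

    ~toy_legacy_pool() {
        for (auto & b : bufs) std::free(b.ptr);
    }
};
```

After one FA pass allocates and releases a large f16 temp buffer, `held` stays at the peak, so a subsequent quantized-KV allocation has to fit alongside the retained buffer rather than in its place.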

Adds ggml_cuda_direct_alloc in common.cuh (mirrors pool_alloc
interface) and uses it for K_f16/V_f16 in launch_fattn. HIP-only,
two files changed.
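A rough sketch of the shape such a direct-allocation helper could take, assuming it mirrors the get()/alloc() surface of ggml_cuda_pool_alloc. This is not the code from the PR: malloc/free stand in for hipMalloc/hipFree so the sketch runs on the host, and the exact interface in common.cuh may differ. The key difference from the pool is that the destructor returns memory immediately instead of caching it.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstddef>

// Hypothetical RAII wrapper for a direct (non-pooled) device allocation.
// In the real HIP path, std::malloc/std::free would be hipMalloc/hipFree.
template <typename T>
struct direct_alloc {
    T * ptr = nullptr;

    direct_alloc() = default;
    explicit direct_alloc(size_t n) { alloc(n); }

    T * alloc(size_t n) {
        ptr = static_cast<T *>(std::malloc(n * sizeof(T))); // hipMalloc in the real code
        return ptr;
    }

    T * get() { return ptr; }

    // memory goes straight back to the allocator when the buffer dies,
    // rather than being retained at peak size by a pool
    ~direct_alloc() {
        if (ptr != nullptr) std::free(ptr); // hipFree in the real code
    }

    direct_alloc(const direct_alloc &) = delete;
    direct_alloc & operator=(const direct_alloc &) = delete;
};
```

With this shape, the K_f16/V_f16 temporaries in launch_fattn would live only for the duration of the kernel launch path, so their peak footprint is never cached across decode steps.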

Complements #22155 which is approved as a general pool OOM safety net.
This avoids the OOM in the first place so the flush-retry path
doesn't trigger. Tested both side by side, no perf regression at
depth.

Tested on gfx1100 (RX 7900 XTX), gfx1200 (RX 9060 XT), gfx1201
(RX 9070 XT) by multiple community testers.

Requirements

On HIP without VMM, the legacy pool retains these at peak size
causing quantized KV to OOM before f16. ggml_cuda_direct_alloc<T>
uses raw hipMalloc/hipFree instead. HIP-only, complements ggml-org#22155.

Fixes ggml-org#22107 without performance degradation.
Tested: gfx1100, gfx1200, gfx1201.
@TheTom TheTom marked this pull request as ready for review April 20, 2026 20:52
@TheTom TheTom requested a review from a team as a code owner April 20, 2026 20:52

ggml-gh-bot bot commented Apr 20, 2026

Hi @TheTom, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.


TheTom commented Apr 20, 2026

> Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

One is a draft (#21119), and one has been waiting on review for 2 weeks (#21452). This PR is a bug fix for an OOM affecting all HIP users with quantized KV at long context. Happy to prioritize however maintainers prefer.

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Apr 20, 2026


Development

Successfully merging this pull request may close these issues.

Misc. bug: [CUDA/ROCm] VRAM leak/fragmentation in ggml_cuda_pool_leg when using Flash Attention
