hip: direct alloc for FA f16 temp buffers #22185
TheTom wants to merge 1 commit into ggml-org:master
On HIP without VMM, the legacy pool retains the FA (flash attention) f16 temp buffers at peak size after use, causing quantized KV to OOM before f16 at the same context length. ggml_cuda_direct_alloc<T> allocates them with raw hipMalloc/hipFree instead. HIP-only, complements ggml-org#22155. Fixes ggml-org#22107 without performance degradation. Tested: gfx1100, gfx1200, gfx1201.
Hi @TheTom, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
One is a draft (#21119), one has been waiting on review for 2 weeks (#21452). This is a bug fix for an OOM affecting all HIP users with quantized KV at long context. Happy to prioritize however maintainers prefer.
Fixes #22107. On HIP without VMM, the legacy pool holds FA f16 temp
buffers at peak size after use, so quantized KV OOMs before f16 at
the same context length.
Adds ggml_cuda_direct_alloc in common.cuh (mirrors pool_alloc
interface) and uses it for K_f16/V_f16 in launch_fattn. HIP-only,
two files changed.
Complements #22155 which is approved as a general pool OOM safety net.
This avoids the OOM in the first place so the flush-retry path
doesn't trigger. Tested both side by side, no perf regression at
depth.
Tested on gfx1100 (RX 7900 XTX), gfx1200 (RX 9060 XT), gfx1201
(RX 9070 XT) by multiple community testers.
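The direct-allocation idea described in the commit message can be illustrated with a minimal host-only sketch. This is not the code from the PR: direct_alloc_sketch, raw_alloc, raw_free and g_live_bytes are hypothetical names, and std::malloc/std::free stand in for hipMalloc/hipFree so the example runs without a GPU. It only assumes an RAII wrapper with alloc() and get() methods, and shows that the buffer is released as soon as the wrapper goes out of scope instead of being kept by a pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-ins for hipMalloc/hipFree so the sketch runs on the host.
// g_live_bytes tracks how much memory is currently allocated through them.
static size_t g_live_bytes = 0;

static void * raw_alloc(size_t nbytes) {
    void * p = std::malloc(nbytes);
    if (p != nullptr) {
        g_live_bytes += nbytes;
    }
    return p;
}

static void raw_free(void * p, size_t nbytes) {
    if (p != nullptr) {
        std::free(p);
        g_live_bytes -= nbytes;
    }
}

// Hypothetical RAII wrapper: allocates directly and frees in the destructor,
// so nothing is retained after the temporary buffer goes out of scope.
template <typename T>
struct direct_alloc_sketch {
    T *    ptr    = nullptr;
    size_t nbytes = 0;

    direct_alloc_sketch() = default;
    direct_alloc_sketch(const direct_alloc_sketch &) = delete;
    direct_alloc_sketch & operator=(const direct_alloc_sketch &) = delete;

    ~direct_alloc_sketch() {
        raw_free(ptr, nbytes);
    }

    // Allocate room for n elements of T; may be called once per object.
    T * alloc(size_t n) {
        assert(ptr == nullptr);
        nbytes = n * sizeof(T);
        ptr    = static_cast<T *>(raw_alloc(nbytes));
        if (ptr == nullptr) {
            nbytes = 0;
        }
        return ptr;
    }

    T * get() {
        return ptr;
    }
};

// Returns the number of live bytes while a temporary 16-bit buffer is in scope.
static size_t live_bytes_during_use(size_t n) {
    direct_alloc_sketch<uint16_t> k_f16; // uint16_t stands in for a device half type
    k_f16.alloc(n);
    return g_live_bytes;
}
```

Calling live_bytes_during_use(n) reports the buffer size while it is live, and g_live_bytes drops back to zero once the function returns. A pooled allocator would instead keep the buffer reserved for reuse, which is the retention behavior the commit message attributes to the legacy pool.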
Requirements