fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support ch… #86

Open
tomwolfe wants to merge 1 commit into TheTom:feature/turboquant-kv-cache from tomwolfe:bugfix/metal

Conversation

@tomwolfe

Fix: Enable GPU offloading for TurboQuant KV cache types in Metal backend

  • Issue: GGML_OP_FLASH_ATTN_EXT and GGML_OP_CPY fell back to the CPU for TurboQuant KV cache types because the Metal backend's supports_op check did not list those types.
  • Solution: Updated ggml/src/ggml-metal/ggml-metal-device.m to explicitly support the following types for these operations:
    • GGML_TYPE_TURBO2_0
    • GGML_TYPE_TURBO3_0
    • GGML_TYPE_TURBO4_0
  • Result: Both operations are now offloaded to the GPU for these types.
  • Verification: The project builds successfully with these changes.
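
The shape of the change described above can be sketched roughly as follows. This is a self-contained illustration, not the code in ggml-metal-device.m: the `sketch_*` enums and functions are hypothetical stand-ins for the real ggml types and the Metal backend's supports_op check, and the non-TurboQuant types listed as already supported are assumptions made only for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for ggml tensor types; values are illustrative only. */
enum sketch_type {
    SKETCH_TYPE_F16,
    SKETCH_TYPE_Q8_0,
    SKETCH_TYPE_TURBO2_0,
    SKETCH_TYPE_TURBO3_0,
    SKETCH_TYPE_TURBO4_0,
    SKETCH_TYPE_OTHER
};

/* Hypothetical stand-ins for the two operations named in this PR. */
enum sketch_op {
    SKETCH_OP_FLASH_ATTN_EXT,
    SKETCH_OP_CPY,
    SKETCH_OP_OTHER
};

/* True for the three TurboQuant KV cache types. */
static bool sketch_is_turbo(enum sketch_type t) {
    return t == SKETCH_TYPE_TURBO2_0 ||
           t == SKETCH_TYPE_TURBO3_0 ||
           t == SKETCH_TYPE_TURBO4_0;
}

/*
 * Simplified support check. If this returns false for an (op, type) pair,
 * the op is assumed to run on the CPU instead of the GPU.
 * Before the fix, the sketch_is_turbo() term would have been absent.
 */
static bool sketch_supports_op(enum sketch_op op, enum sketch_type kv_type) {
    switch (op) {
        case SKETCH_OP_FLASH_ATTN_EXT:
        case SKETCH_OP_CPY:
            /* F16 and Q8_0 are assumed here to be already supported. */
            return kv_type == SKETCH_TYPE_F16 ||
                   kv_type == SKETCH_TYPE_Q8_0 ||
                   sketch_is_turbo(kv_type);
        default:
            return false;
    }
}
```

In this sketch, `sketch_supports_op(SKETCH_OP_CPY, SKETCH_TYPE_TURBO3_0)` returns true only because of the added `sketch_is_turbo()` term; a type outside the list, such as `SKETCH_TYPE_OTHER`, is still reported as unsupported.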
