[cuda] GGUF Q6_K real packed INT6 (W6A8 dp4a) + GGUF CI export#20229

Draft
Gasoonjia wants to merge 4 commits into g4-opt-prefill-window-sdpa from g4-int6-gguf

@Gasoonjia

Contributor

Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend, parallel to the int4/int8 plain_mm paths:

  • int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
  • CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
  • int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
  • backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
  • route GGUF Q6_K -> CudaPackedInt6Tensor (gguf_loader, pack_cuda, dequantize_weight)
  • tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts
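A minimal sketch of the two-plane idea, for illustration only. It assumes a simplified layout (two low nibbles per `ql` byte, four high 2-bit fields per `qh` byte) that is not necessarily the byte order used by `CudaPackedInt6Tensor` or by GGUF Q6_K blocks; `pack_int6` and `unpack_int6` are hypothetical helper names. It shows how the -32 offset maps signed values in [-32, 31] to unsigned 6-bit codes, and that the packed form costs 6 bits per weight.

```python
OFFSET = 32  # symmetric offset: stored code = value + 32, so values span [-32, 31]


def pack_int6(values):
    """Pack signed 6-bit values into a low-nibble plane (ql) and a high-2-bit plane (qh).

    Simplified layout (an assumption for this sketch, not the PR's actual layout):
    ql byte i//2 holds the low 4 bits of code i, qh byte i//4 holds its high 2 bits.
    """
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4 in this simplified layout")
    ql = bytearray(len(values) // 2)
    qh = bytearray(len(values) // 4)
    for i, v in enumerate(values):
        if not -32 <= v <= 31:
            raise ValueError(f"{v} does not fit in 6 signed bits")
        code = v + OFFSET  # unsigned code in [0, 63]
        ql[i // 2] |= (code & 0x0F) << (4 * (i % 2))
        qh[i // 4] |= (code >> 4) << (2 * (i % 4))
    return bytes(ql), bytes(qh)


def unpack_int6(ql, qh, n):
    """Inverse of pack_int6: rebuild n signed values from the two planes."""
    out = []
    for i in range(n):
        lo = (ql[i // 2] >> (4 * (i % 2))) & 0x0F
        hi = (qh[i // 4] >> (2 * (i % 4))) & 0x03
        out.append((lo | (hi << 4)) - OFFSET)
    return out
```

Eight weights occupy 4 bytes of `ql` plus 2 bytes of `qh`, that is 6 bytes in total, and `unpack_int6(*pack_int6(vals), len(vals))` returns `vals` unchanged.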

CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it) and run the inference sanity check + export via the GGUF loader (--gguf) instead of the prequantized HF checkpoint.

@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20229

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 5 Pending

As of commit eaea4a7 with merge base c5bf380:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
prefill-dev2 and others added 2 commits June 12, 2026 10:56
Block-sparse early-exit in _sdpa_fwd_kernel_body: skip KV blocks that are
entirely masked (sliding-window via HAS_MASK sum==0, causal via start_n>max_seq_pos).
Exact (skipped blocks are x1,+0 no-ops). Prefill +46-88% all lengths; decode safe;
SDPA nsys 58.1%->18.5%. Numerically bf16-exact vs dense+mask (unit test).
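
The exactness claim above can be illustrated with a small pure-Python sketch of online-softmax attention for a single query row. This is not the Triton kernel from the commit; `attend_row` and its parameters are hypothetical names. It assumes at least one key is visible and shows that a fully masked KV block leaves the running state untouched (rescale factor 1, contribution 0), so skipping it gives the same result.

```python
import math


def attend_row(q, keys, values, mask, block=2, skip_masked=True):
    """Online-softmax attention for one query row, processed in KV blocks.

    mask[j] is True when key j is visible. Assumes at least one visible key.
    """
    m = -math.inf                 # running max of the scores seen so far
    l = 0.0                       # running softmax denominator
    acc = [0.0] * len(values[0])  # running weighted sum of values
    for start in range(0, len(keys), block):
        idx = list(range(start, min(start + block, len(keys))))
        if skip_masked and not any(mask[j] for j in idx):
            continue  # early exit: the whole block is masked
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) if mask[j] else -math.inf
            for j in idx
        ]
        m_new = max(m, max(scores))
        if m_new == -math.inf:
            continue  # nothing visible yet; avoids computing -inf - (-inf)
        alpha = math.exp(m - m_new)  # 1.0 when the running max is unchanged
        probs = [math.exp(s - m_new) for s in scores]  # 0.0 for masked keys
        l = alpha * l + sum(probs)
        acc = [
            alpha * a + sum(p * values[j][d] for p, j in zip(probs, idx))
            for d, a in enumerate(acc)
        ]
        m = m_new
    return [a / l for a in acc]
```

Calling `attend_row` with `skip_masked=True` and with `skip_masked=False` on the same sliding-window mask returns identical lists, because a dense pass over a masked block multiplies the accumulators by 1 and adds 0.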
Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend,
parallel to the int4/int8 plain_mm paths:
- int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
- CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
- int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
- backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
- GGUF Q6_K: gguf_loader returns the native torchao IntxUnpackedToInt8Tensor and
  the backend packer (pack_cuda.pack_linear_for_cuda) repacks a symmetric Q6_K
  weight into CudaPackedInt6Tensor -- mirroring Int4Tensor -> CudaCoalescedInt4Tensor,
  so the loader stays backend-agnostic; dequantize_weight handles the tied embedding
- tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts
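
As a rough reference for the routing and the W6A8 arithmetic listed above, the sketch below models both paths in plain Python. All names are hypothetical; the group size, the per-row symmetric int8 activation quantization, and the way scales are applied are assumptions of this sketch rather than details taken from the shim. `dp4a` here only mimics the 4-way int8 dot-product-accumulate that the CUDA intrinsic performs.

```python
DECODE_MAX_M = 4  # from the description: M <= 4 -> packed op, M > 4 -> dequant


def choose_path(m):
    """Pick the path for an activation with m rows."""
    return "int6_plain_mm" if m <= DECODE_MAX_M else "dequant"


def dp4a(a4, b4, acc):
    """Reference for a 4-way integer dot product accumulated into acc."""
    return acc + sum(x * y for x, y in zip(a4, b4))


def dot_w6a8(w_int6, scales, x, group):
    """Integer path: int6 weights times int8-quantized activations, scaled per group.

    Assumes group is a multiple of 4 and len(w_int6) is a multiple of group.
    """
    amax = max(abs(v) for v in x) or 1.0
    a_scale = amax / 127.0                 # symmetric per-row activation scale (assumed)
    x_q = [round(v / a_scale) for v in x]  # int8 activations
    total = 0.0
    for g in range(0, len(w_int6), group):
        acc = 0
        for k in range(g, g + group, 4):
            acc = dp4a(w_int6[k:k + 4], x_q[k:k + 4], acc)
        total += acc * scales[g // group] * a_scale
    return total


def dot_dequant(w_int6, scales, x, group):
    """Fallback path: dequantize each weight with its group scale, then multiply."""
    return sum(w * scales[i // group] * v for i, (w, v) in enumerate(zip(w_int6, x)))
```

The two paths agree exactly when the activations are already representable as int8 multiples of the row scale; otherwise the integer path carries the usual activation-quantization error.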

CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from
unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it) and run the
inference sanity check + export via the GGUF loader (--gguf) instead of the
prequantized HF checkpoint.

Signed-off-by: gasoonjia <gasoonjia@icloud.com>
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 12, 2026

CLA Missing ID
