[cuda] GGUF Q6_K real packed INT6 (W6A8 dp4a) + GGUF CI export by Gasoonjia · Pull Request #20229 · pytorch/executorch

Gasoonjia · 2026-06-12T04:49:56Z

Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend, parallel to the int4/int8 plain_mm paths:

- int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
- CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
- int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
- backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
- route GGUF Q6_K -> CudaPackedInt6Tensor (gguf_loader, pack_cuda, dequantize_weight)
- tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts
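To make the ql/qh split and the -32 symmetric offset concrete, here is a minimal pure-Python sketch. It is not the PR's kernel or its actual memory layout: the helper names (`pack_int6`, `unpack_int6`, `w6a8_dot`, `choose_path`) and the bit ordering inside each byte are assumptions for illustration only. It stores each signed weight in [-32, 31] as an unsigned 6-bit code, puts the low 4 bits in a `ql` plane and the high 2 bits in a `qh` plane, and accumulates the W6A8 dot product in integers before applying the scales, which is the role dp4a plays on the GPU.

```python
def pack_int6(weights):
    """Pack signed 6-bit weights into a low-nibble plane (ql) and a high-2-bit plane (qh).

    Hypothetical layout: 2 weights per ql byte, 4 weights per qh byte.
    """
    if len(weights) % 4 != 0:
        raise ValueError("weight count must be a multiple of 4")
    if any(w < -32 or w > 31 for w in weights):
        raise ValueError("weights must lie in [-32, 31]")
    codes = [w + 32 for w in weights]  # symmetric offset: store as unsigned 0..63
    ql = bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )
    qh = bytes(
        sum((codes[i + k] >> 4) << (2 * k) for k in range(4))
        for i in range(0, len(codes), 4)
    )
    return ql, qh


def unpack_int6(ql, qh, count):
    """Inverse of pack_int6: rebuild the signed weights from the two planes."""
    out = []
    for i in range(count):
        low = (ql[i // 2] >> (4 * (i % 2))) & 0xF
        high = (qh[i // 4] >> (2 * (i % 4))) & 0x3
        out.append((low | (high << 4)) - 32)  # undo the -32 offset
    return out


def w6a8_dot(ql, qh, count, weight_scale, acts_int8, act_scale):
    """Integer dot product of int6 weights and int8 activations, scaled once at the end."""
    weights = unpack_int6(ql, qh, count)
    int_acc = sum(w * a for w, a in zip(weights, acts_int8))
    return int_acc * weight_scale * act_scale


def choose_path(m):
    """Routing rule from the description: small M uses the packed op, larger M dequantizes."""
    return "int6_plain_mm" if m <= 4 else "dequant"


weights = [-32, -1, 0, 31, 5, -17, 12, 7]
acts = [1, -2, 3, -4, 5, -6, 7, -8]
ql, qh = pack_int6(weights)
restored = unpack_int6(ql, qh, len(weights))
result = w6a8_dot(ql, qh, len(weights), 0.5, acts, 0.25)
```

Eight weights occupy 4 + 2 = 6 bytes here, i.e. 6 bits per weight, and because the format is symmetric no zero-point tensor is needed: subtracting 32 recovers the signed value directly.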

CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it) and run the inference sanity check + export via the GGUF loader (--gguf) instead of the prequantized HF checkpoint.

pytorch-bot · 2026-06-12T04:50:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20229

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 1 New Failure, 5 Pending

As of commit eaea4a7 with merge base c5bf380:

NEW FAILURE - The following job has failed:

Test CUDA Windows Export and E2E / test-model-cuda-windows-e2e (facebook, dinov2-small-imagenet1k-1-layer, non-quantized) / windows-job (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Block-sparse early-exit in _sdpa_fwd_kernel_body: skip KV blocks that are entirely masked (sliding-window via HAS_MASK sum==0, causal via start_n>max_seq_pos). Exact (skipped blocks are x1,+0 no-ops). Prefill +46-88% all lengths; decode safe; SDPA nsys 58.1%->18.5%. Numerically bf16-exact vs dense+mask (unit test).
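The claim that skipped blocks are "x1,+0 no-ops" can be checked with a small pure-Python model of blockwise attention with an online softmax. This is only a sketch under stated assumptions, not the `_sdpa_fwd_kernel_body` code: the function name `attend`, the block size, and the toy data are hypothetical. For a fully masked block every score is -inf, so the running maximum is unchanged, the rescale factor is exp(0) = 1, and every probability is exp(-inf) = 0; skipping the block therefore leaves the accumulator bit-for-bit identical.

```python
import math

NEG_INF = float("-inf")


def attend(q, keys, values, mask, block=2, skip_masked=True):
    """Single-query attention over KV blocks using an online softmax.

    mask[j] is True where key j is visible. Returns (output, blocks_processed).
    """
    running_max = NEG_INF          # running maximum of visible scores
    denom = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])   # running weighted sum of values
    processed = 0
    for start in range(0, len(keys), block):
        idx = range(start, min(start + block, len(keys)))
        if skip_masked and not any(mask[j] for j in idx):
            continue               # early exit: the whole block is masked
        processed += 1
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) if mask[j] else NEG_INF
            for j in idx
        ]
        new_max = max(running_max, max(scores))
        if new_max == NEG_INF:
            continue               # nothing visible yet, state stays untouched
        alpha = math.exp(running_max - new_max)   # 1.0 for a fully masked block
        probs = [math.exp(s - new_max) for s in scores]  # 0.0 for masked keys
        denom = denom * alpha + sum(probs)
        acc = [
            a * alpha + sum(p * values[j][d] for p, j in zip(probs, idx))
            for d, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / denom for a in acc], processed


q = [0.5, -1.0]
keys = [[0.1 * i, 0.2 - 0.05 * i] for i in range(8)]
values = [[float(i), float(i * i)] for i in range(8)]
causal_mask = [j <= 2 for j in range(8)]   # only the first three keys are visible

dense_out, dense_blocks = attend(q, keys, values, causal_mask, skip_masked=False)
sparse_out, sparse_blocks = attend(q, keys, values, causal_mask, skip_masked=True)
```

With this toy causal mask the dense path processes all four blocks and the early-exit path processes two, while both return the same output.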

Add a genuine 6-bit packed weight path for GGUF Q6_K on the CUDA backend, parallel to the int4/int8 plain_mm paths:

- int6_plain_mm CUDA shim (W6A8 dp4a; ql/qh planes; spread2; -32 symmetric offset)
- CudaPackedInt6Tensor (ql/qh + per-group bf16 scale; symmetric, no zero tensor)
- int6_dispatch: F.linear routing (M<=4 -> executorch_cuda::int6_plain_mm op, M>4 -> dequant)
- backend fallback-kernel + custom_ops_to_c_shims registration; CMake build
- GGUF Q6_K: gguf_loader returns the native torchao IntxUnpackedToInt8Tensor and the backend packer (pack_cuda.pack_linear_for_cuda) repacks a symmetric Q6_K weight into CudaPackedInt6Tensor -- mirroring Int4Tensor -> CudaCoalescedInt4Tensor, so the loader stays backend-agnostic; dequantize_weight handles the tied embedding
- tests: int6 gtest, test_int6_dispatch.py, pack round-trip; fix stale int4/int6 type asserts

CI (export_model_artifact.sh, gemma4_31b): download the Q4_K_M GGUF from unsloth/gemma-4-31B-it-GGUF (tokenizer from unsloth/gemma-4-31B-it) and run the inference sanity check + export via the GGUF loader (--gguf) instead of the prequantized HF checkpoint.

Signed-off-by: gasoonjia <gasoonjia@icloud.com>

linux-foundation-easycla · 2026-06-12T20:08:08Z

✅ login: Gasoonjia / name: gasoonjia (390238e, 7db4bba, eaea4a7)
❌ The email address for the commit (2f1203e) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

meta-cla Bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jun 12, 2026

mergennachin requested review from digantdesai, mergennachin and metascroy June 12, 2026 17:27

prefill-dev2 and others added 2 commits June 12, 2026 10:56

Gasoonjia force-pushed the g4-int6-gguf branch from 7ad0a7e to 390238e Compare June 12, 2026 20:08

Gasoonjia temporarily deployed to cadence June 12, 2026 20:08 — with GitHub Actions Inactive

Gasoonjia added 2 commits June 12, 2026 13:41

remove comment (7db4bba)

lin (eaea4a7)

Gasoonjia temporarily deployed to cadence June 12, 2026 21:45 — with GitHub Actions Inactive

Gasoonjia wants to merge 4 commits into g4-opt-prefill-window-sdpa from g4-int6-gguf
