MXFP8 quantization fused into the SM100 GEMM epilogue: TMEM → register → quantize → direct write to global memory. The TMA store pipeline is bypassed entirely via if constexpr. Co-Authored-By: Claude <noreply@anthropic.com>
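For reference, the MXFP8 math that the fused epilogue applies to its FP32 accumulators can be written as a host-side PyTorch sketch. This is a readability aid, not the kernel: the block size of 32, the OCP MX power-of-two (E8M0-style) scale rule, and the function name `quantize_mxfp8_ref` are assumptions.

```python
import torch

def quantize_mxfp8_ref(x: torch.Tensor, block: int = 32):
    # Host-side reference of the MXFP8 math only (assumes the last dim is a
    # multiple of `block`): every 32-element block shares a power-of-two scale,
    # and the scaled values are stored as float8_e4m3fn.
    xb = x.float().reshape(-1, block)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    # Assumed OCP MX convention: shared exponent = floor(log2(amax)) - emax(e4m3), emax = 8
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 8.0)
    q = (xb / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale.reshape(*x.shape[:-1], -1)
```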
Switch the MXFP8 epilogue from thread-level global memory writes to a TMA store via SMEM. Uses an independent s-loop with the same pipeline structure as the BF16 path (wait → write → fence → sync → TMA → arrive). Kernel time: 362us → 273us (matches the baseline GEMM kernel). Co-Authored-By: Claude <noreply@anthropic.com>
Wrap the DeepGEMM GEMM calls with torch.library.custom_op so the baseline (GEMM + quantize) runs as a single compiled fx graph, eliminating Python dispatch overhead between the two ops. The fused kernel still wins by 1.34-2.12x vs the compiled-graph baseline. Co-Authored-By: Claude <noreply@anthropic.com>
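A minimal sketch of that wrapping pattern is shown below. The custom-op name `deep_gemm_ext::fp8_gemm_nt_bf16out`, the `(values, scales)` argument convention of `deep_gemm.fp8_gemm_nt`, and the reuse of `quantize_mxfp8_ref` from the earlier sketch are all assumptions; only the use of `torch.library.custom_op` plus a fake kernel to keep `fullgraph=True` tracing happy reflects the described approach.

```python
import torch
import deep_gemm  # assumed import name for the DeepGEMM Python package

# Registering the opaque DeepGEMM call as a custom op lets torch.compile keep
# GEMM + quantize in one fx graph instead of graph-breaking at the extension call.
@torch.library.custom_op("deep_gemm_ext::fp8_gemm_nt_bf16out", mutates_args=())
def fp8_gemm_nt_bf16out(a: torch.Tensor, a_sf: torch.Tensor,
                        b: torch.Tensor, b_sf: torch.Tensor) -> torch.Tensor:
    d = torch.empty(a.shape[0], b.shape[0], dtype=torch.bfloat16, device=a.device)
    deep_gemm.fp8_gemm_nt((a, a_sf), (b, b_sf), d)  # assumed call convention
    return d

# Fake (meta) kernel so fullgraph=True tracing knows the output shape/dtype.
@fp8_gemm_nt_bf16out.register_fake
def _(a, a_sf, b, b_sf):
    return torch.empty(a.shape[0], b.shape[0], dtype=torch.bfloat16, device=a.device)

@torch.compile(fullgraph=True)
def baseline(a, a_sf, b, b_sf):
    d_bf16 = fp8_gemm_nt_bf16out(a, a_sf, b, b_sf)
    return quantize_mxfp8_ref(d_bf16)  # quantize step compiled into the same graph
```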
Summary
1.34-2.12x speedup over the torch.compile(fullgraph=True) compiled baseline (GEMM + quantize as a single fx graph).

Added APIs

fp8_gemm_nt_mxfp8out(a, b, d_fp8, d_sf)
m_grouped_fp8_gemm_nt_contiguous_mxfp8out(a, b, d_fp8, d_sf, layout)
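A hypothetical usage sketch of the first new entry point follows. The problem sizes, dtypes, scale-factor layouts, and the assumption that `a` and `b` are `(values, scales)` pairs are illustrative guesses, not the confirmed signature; only the function name and argument order come from the list above.

```python
import torch
import deep_gemm  # assumed import name

m, n, k = 4096, 4096, 7168
a = (torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn),
     torch.ones(m, k // 32, device="cuda"))   # (values, scales) pair -- assumed convention
b = (torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn),
     torch.ones(n, k // 32, device="cuda"))

d_fp8 = torch.empty(m, n, dtype=torch.float8_e4m3fn, device="cuda")  # MXFP8 output values
d_sf = torch.empty(m, n // 32, dtype=torch.float32, device="cuda")   # shared output scales (assumed layout)

# Single kernel: the GEMM epilogue quantizes the FP32 accumulators straight to MXFP8,
# so no separate BF16 output + quantize launch is needed.
deep_gemm.fp8_gemm_nt_mxfp8out(a, b, d_fp8, d_sf)
```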
B200 performance (vs the compiled-graph baseline, fullgraph=True)
Normal GEMM:
Grouped GEMM:
Trace: /mair/team-sys/jangwoong/DeepGEMM/traces/tma_compiled_1775813473/jangwoong-dg-mxfp8v7-node-0-0_628.1775813473554960843.pt.trace.json

The ~0.02 output diff comes from the fused path quantizing directly from FP32 in TMEM, while the baseline quantizes after BF16 truncation; the fused path is the more accurate of the two.
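A toy comparison (not the PR's benchmark data, and using a single made-up scale rather than MXFP8 block scales) shows why the two paths round a fraction of elements differently:

```python
import torch

# Quantizing straight from FP32 vs. after a BF16 round-trip pushes some values
# across a different FP8 rounding boundary, which is the source of the small diff.
x = torch.randn(1 << 20)
scale = x.abs().max() / 448.0                       # single global scale, demo only
q_from_fp32 = (x / scale).to(torch.float8_e4m3fn).float()
q_from_bf16 = (x.to(torch.bfloat16).float() / scale).to(torch.float8_e4m3fn).float()
print("fraction rounding differently:", (q_from_fp32 != q_from_bf16).float().mean().item())
```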
Key design
Complete bypass via if constexpr

Tests