compile_lib() in utils.py bypasses core.py's build_module(), so it never received the ENABLE_CK=1 default. As a result, PA and other template-compiled modules used the shim instead of the real CK headers. Made-with: Cursor
valarLip
approved these changes
Mar 19, 2026
Motivation
The FMHA V3 forward and backward modules (module_fmha_v3_fwd, module_fmha_v3_varlen_fwd, module_fmha_v3_bwd, module_fmha_v3_varlen_bwd) use only precompiled ASM kernel binaries (.co files) at runtime -- they have no source-level dependency on the Composable Kernel (CK) library. However, the build system still required the full CK submodule at 3rdparty/composable_kernel/ for header includes (fmha_fwd.hpp, fmha_bwd.hpp, mask.hpp, ck_tile/core.hpp), CK example include paths, and CK-specific compile flags.
This unnecessary coupling caused:
Slower builds: CK headers pull in a deep include chain, increasing compile time for modules that don't use any CK code paths.
Larger dependency footprint: The full CK submodule (~600MB) must be cloned and available even when only V3 ASM kernels are needed.
Blocked CK-free packaging: setup.py supports ENABLE_CK=0 to build without CK, but V3 modules still failed to compile without it.
Technical Details
Replaced the module-specific ONLY_FAV3 guards in public headers with a unified ENABLE_CK macro:
ENABLE_CK=1 (default): use real CK headers -- for full CK modules (fwd, bwd, splitkv, batch_prefill)
ENABLE_CK=0: use the lightweight ck_tile_shim.h -- for V3 ASM-only modules
The default is injected by core.py's build_module(), reading the same ENABLE_CK env var that setup.py uses:
enable_ck = int(os.environ.get("ENABLE_CK", "1"))
if not any("ENABLE_CK" in f for f in flags_extra_cc):
    flags_cc.append(f"-DENABLE_CK={enable_ck}")
V3 modules explicitly set -DENABLE_CK=0 in their optCompilerConfig.json.
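The default-injection rule can be exercised in isolation with a minimal sketch like the one below. The function name resolve_cc_flags and the environ parameter are hypothetical and exist only for illustration. The sketch also folds flags_cc and flags_extra_cc into a single list, which is a simplification of what build_module() does.

```python
import os


def resolve_cc_flags(flags_extra_cc, environ=None):
    """Return compile flags, adding an ENABLE_CK default if the module set none.

    Hypothetical helper: it mirrors the snippet from build_module() but treats
    the module's extra flags as the starting flag list (an assumption).
    """
    environ = os.environ if environ is None else environ
    # Same env var and default as setup.py: CK is enabled unless told otherwise.
    enable_ck = int(environ.get("ENABLE_CK", "1"))
    flags_cc = list(flags_extra_cc)
    # A module that already mentions ENABLE_CK keeps its own explicit value.
    if not any("ENABLE_CK" in f for f in flags_extra_cc):
        flags_cc.append(f"-DENABLE_CK={enable_ck}")
    return flags_cc


default_flags = resolve_cc_flags([], environ={})
v3_flags = resolve_cc_flags(["-DENABLE_CK=0"], environ={"ENABLE_CK": "1"})
ck_free_env_flags = resolve_cc_flags(["-O3"], environ={"ENABLE_CK": "0"})
```

An explicit per-module flag wins over the environment, which is why a V3 module stays at ENABLE_CK=0 even in a build where the environment requests ENABLE_CK=1.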
aiter_hip_common.h: #if !ENABLE_CK selects ck_tile_shim.h vs ck_tile/core.hpp
mha_fwd.h: #if ENABLE_CK guards fmha_fwd.hpp/mask.hpp includes, CK trait structs (mha_fwd_traits, mha_batch_prefill_traits, mha_fwd_splitkv_traits), and CK function declarations (mha_fwd_splitkv(), mha_batch_prefill())
mha_bwd.h: #if ENABLE_CK guards fmha_bwd.hpp include
py_itfs_common.h: #if ENABLE_CK guards CK type mapping templates (t2ck)
Structs and declarations used by the V3 path (mha_fwd_args, fmha_fwd_v3_args, mha_fwd(), fmha_fwd_v3()) remain always available -- they use ck_tile::index_t and ck_tile::stream_config which the shim provides.
For module_fmha_v3_fwd and module_fmha_v3_varlen_fwd:
Cleared extra_include (removed CK_DIR/example/ck_tile/01_fmha)
Cleared flags_extra_hip (removed -DCK_TILE_FMHA_FWD_FAST_EXP2=1 and related CK flags)
Added '-DENABLE_CK=0' to flags_extra_cc
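As a rough illustration of the three config changes above, the sketch below applies them to a dictionary standing in for one module entry. The helper name make_ck_free is hypothetical, and representing each field as a plain list of strings is an assumption: the actual encoding in optCompilerConfig.json may differ.

```python
import copy


def make_ck_free(entry):
    """Return a copy of a module config entry with CK-specific settings removed.

    Hypothetical helper; field names come from the PR text, the list-of-strings
    layout is assumed for illustration.
    """
    updated = copy.deepcopy(entry)
    updated["extra_include"] = []      # drop the CK example include path
    updated["flags_extra_hip"] = []    # drop CK-specific HIP flags
    cc_flags = updated.setdefault("flags_extra_cc", [])
    if "-DENABLE_CK=0" not in cc_flags:
        cc_flags.append("-DENABLE_CK=0")  # select the shim explicitly
    return updated


before = {
    "extra_include": ["CK_DIR/example/ck_tile/01_fmha"],
    "flags_extra_hip": ["-DCK_TILE_FMHA_FWD_FAST_EXP2=1"],
    "flags_extra_cc": [],
}
after = make_ck_free(before)
```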
4. Updated C++ test infrastructure
compile.py: Added -DENABLE_CK=0 when ck_exclude=True for both fwd and bwd paths
build_mha.sh: Added -DENABLE_CK=1 to benchmark binary hipcc commands (benchmarks need real CK for reference computation)
Test Plan
CK-free compilation: Hide 3rdparty/composable_kernel/, clean build artifacts, compile all 4 V3 modules -- verify they build successfully without CK headers.
ENABLE_CK default verification: Inspect generated ninja build files to confirm ENABLE_CK=1 for non-V3 modules and ENABLE_CK=0 for V3 modules.
Symbol leak check: Run nm -D on V3 .so files to confirm zero CK symbols leaked.
C++ standalone smoke test: Build and run smoke_test_fwd_v3.sh -a gfx942 and smoke_test_bwd_v3.sh -a gfx942.
Non-regression: Full CK modules (module_mha_fwd, module_mha_bwd) still compile and work with ENABLE_CK=1 (default).
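The ninja-file inspection and the symbol leak check from the plan above could be scripted along the following lines. This is a sketch under stated assumptions: both helper names are hypothetical, and the "ck_tile" search pattern is a guess at what counts as a CK symbol. Because the shim itself provides ck_tile::index_t and ck_tile::stream_config, some names containing ck_tile may appear legitimately, so the pattern may need tightening.

```python
import re

# Matches the define injected by the build, e.g. "-DENABLE_CK=0".
_ENABLE_CK_RE = re.compile(r"-DENABLE_CK=(\d+)")


def enable_ck_values(ninja_text):
    """Return the set of ENABLE_CK values found in a ninja build file's text."""
    return {int(value) for value in _ENABLE_CK_RE.findall(ninja_text)}


def leaked_symbols(nm_output, pattern="ck_tile"):
    """Return the lines of `nm -D` output that contain the pattern.

    Mangled C++ names embed the namespace text, so a plain substring
    search also works without demangling.
    """
    return [line for line in nm_output.splitlines() if pattern in line]


# Illustrative inputs only; real data would come from the build directory.
sample_ninja = "cflags = -O3 -DENABLE_CK=0 -std=c++17\n"
sample_nm = (
    "0000000000001139 T _Z11fmha_fwd_v3v\n"
    "                 U _ZN7ck_tile4fooEv\n"
)
values = enable_ck_values(sample_ninja)
leaks = leaked_symbols(sample_nm)
```

In practice the nm text would be captured with something like subprocess.run(["nm", "-D", so_path], capture_output=True, text=True, check=True).stdout, and a V3 module passes when enable_ck_values returns {0} and leaked_symbols returns an empty list.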
Test Result
Submission Checklist