
Enable autotuner rocblas/hipblaslt fission backends #663

Open
Eetusjo wants to merge 2 commits into rocm-jaxlib-v0.9.0 from ci_enable_autotuner_fission_090

Conversation

@Eetusjo commented Mar 10, 2026

Enables rocBLAS/hipBLASLt fission backends in autotuner.


Example using run_hlo_module

Running input HLO:

HloModule test

triton_gemm {
  p0 = bf16[4096,4096] parameter(0)
  p1 = bf16[4096,4096] parameter(1)
  ROOT dot = bf16[4096,4096] dot(p0, p1),
    lhs_contracting_dims={1}, rhs_contracting_dims={0}
}

ENTRY main {
  p0 = bf16[4096,4096] parameter(0)
  p1 = bf16[4096,4096] parameter(1)
  ROOT r = bf16[4096,4096] fusion(p0, p1),
    kind=kCustom, calls=triton_gemm,
    backend_config={"fusion_backend_config":{"kind":"__triton_gemm"}}
}

Run:

bazel run --config=rocm_ci //xla/tools:run_hlo_module -- \
  --input_format=hlo --platform=gpu $(realpath triton_gemm.hlo) \
  --xla_gpu_enable_cublaslt \
  --xla_gpu_dump_autotune_logs_to=/tmp/at_logs

with --xla_gpu_cublas_fallback={true,false} to enable or disable fission, respectively.

With fission disabled, the autotune results show Triton:

results {
  result {
    triton {
      ...
    }
  }
  ...
}

With fission enabled, we get hipBLASLt:

results {
  ...
  result {
    other {
      name: "HipblasLt_fission"
      ...
    }
  }
}

Comment on lines 381 to 392

   executable_candidates.erase(
-      std::remove_if(executable_candidates.begin(),
-                     executable_candidates.end(),
-                     [](const ExecutableCandidate& candidate) {
-                       return candidate.config.codegen_backend->name() ==
-                              "Cublas_fission";
-                     }),
+      std::remove_if(
+          executable_candidates.begin(), executable_candidates.end(),
+          [](const ExecutableCandidate& candidate) {
+            const auto& name = candidate.config.codegen_backend->name();
+            return name == "Cublas_fission" ||
+                   name == "Rocblas_fission" ||
+                   name == "HipblasLt_fission";
+          }),
       executable_candidates.end());
 }


exclude_cublas_config name no longer matches its scope

This field now also excludes Rocblas_fission and HipblasLt_fission, but the name and comment ("autotuner will not select cublas configs for fusions") still suggest cuBLAS only. Consider renaming to something vendor-neutral like exclude_fission_configs, or at minimum updating the comment in autotuner.h:75 to clarify it covers all BLAS library fission backends.

Comment on lines +84 to +98
  std::vector<std::unique_ptr<CodegenBackend>> backends;
  backends.push_back(std::make_unique<FissionBackend>(
      debug_options, compiler, target_config,
      std::make_unique<RocblasBackend>(stream_executor, debug_options, compiler,
                                       target_config),
      GetCublasRewriterPipeline(target_config->device_description),
      mlir_context));
  backends.push_back(std::make_unique<FissionBackend>(
      debug_options, compiler, target_config,
      std::make_unique<HipblasLtBackend>(stream_executor, debug_options,
                                         compiler, target_config),
      GetCublasRewriterPipeline(target_config->device_description,
                                /*enable_cublaslt=*/true),
      mlir_context));
  return backends;

No tests for the new ROCm fission backends

This PR enables two new backend registrations (Rocblas_fission, HipblasLt_fission) and modifies the exclusion filter, but adds no tests. The existing test at autotuner_test.cc:713 only mocks Cublas_fission. At minimum, a unit test verifying that GetFissionBackendsForROCm returns the correct number/types of backends and a test covering the exclusion of the ROCm backend names when exclude_cublas_config is set would help prevent regressions.

@claude claude bot commented Mar 10, 2026

Re-review: 2/4 prior issues resolved (pass ordering fixed). 2 remain open (exclude_cublas_config naming, missing tests). No new findings.

@Eetusjo force-pushed the ci_enable_autotuner_fission_090 branch from 7a3d591 to 942000b on March 10, 2026 11:36
@Eetusjo Eetusjo marked this pull request as ready for review March 10, 2026 11:39
@Eetusjo (Author) commented Mar 10, 2026

This downstream branch was branched out from upstream when the codegen backend factories were added, but not used in the compilers. Currently I'm duplicating some functionality to enable both the compiler autotuner path and the autotuner_main target. @draganmladjenovic do you have an opinion on whether I should refactor this to the current upstream way which uses the factory everywhere?

Comment on lines +112 to +123
  auto fission_backends_or =
      registry.FindObject<GetFissionBackends>(platform->id());
  if (fission_backends_or.ok()) {
    const GetFissionBackends::Type& get_fission_backends =
        fission_backends_or.value();
    std::vector<std::unique_ptr<CodegenBackend>> fission_backends =
        get_fission_backends(stream_executor, &debug_options, compiler.get(),
                             &target_config, mlir_context);
    backends.insert(backends.end(),
                    std::make_move_iterator(fission_backends.begin()),
                    std::make_move_iterator(fission_backends.end()));
  }

Fission backends loaded unconditionally, bypassing xla_gpu_cublas_fallback gate

In amdgpu_compiler.cc (and nvptx_compiler.cc), fission backends are only added when debug_options.xla_gpu_cublas_fallback() is true. Here in autotuner_main, fission backends are loaded unconditionally whenever a platform has registered them. This means running via autotuner_main will always include fission backends regardless of the xla_gpu_cublas_fallback flag, while the compiler path respects that flag.

Consider gating this block behind the same debug_options.xla_gpu_cublas_fallback() check, e.g.:

if (debug_options.xla_gpu_cublas_fallback()) {
  auto fission_backends_or = ...;
  // ...
}

@i-chaochen (Collaborator) commented Mar 11, 2026

> This downstream branch was branched out from upstream when the codegen backend factories were added, but not used in the compilers. Currently I'm duplicating some functionality to enable both the compiler autotuner path and the autotuner_main target. @draganmladjenovic do you have an opinion on whether I should refactor this to the current upstream way which uses the factory everywhere?

IMHO we don't need to refactor this on 0.9.0, as we're planning to have a 0.9.1 release soon: https://github.com/ROCm/xla/commits/rocm-jaxlib-v0.9.1/

Let's keep what's in 0.9.0 and make the enablement change as small as possible, since we don't want to confuse future debugging or cause more divergences. Hope that makes sense.

If things are in much better shape in 0.9.1, I guess we should land this feature in our 0.9.1 release?

We should restrict refactoring on release branches: https://github.com/ROCm/xla-internal/pull/6/changes

cc @nurmukhametov @hsharsha

@hsharsha (Collaborator)

I tend to agree with @i-chaochen to keep the release branch as close to the point of release, with only bug fixes to them rather than bringing in new code or features.

@nurmukhametov (Member)

> I tend to agree with @i-chaochen to keep the release branch as close to the point of release, with only bug fixes to them rather than bringing in new code or features.

I agree
