[FEATURE REQUEST] gemm CuTe kernel implementation #33

@LoserCheems

Problem statement

The BLAS level-3 gemm kernel has no CuTe backend implementation in this project. The CuTe column for gemm in the README BLAS table is empty, even though GEMM is the primary workload that CuTe/CUTLASS is designed to optimize.

Without a CuTe gemm kernel:

  • users cannot see how GEMM is expressed using CuTe’s layout and tiling abstractions,
  • there is no CuTe GEMM baseline for performance comparison with PyTorch and Triton,
  • CuTe-based examples are missing the most important building block for many workloads.

Proposed solution

Implement a CuTe-based gemm kernel that matches the Python reference semantics and aligns with the project’s backend structure.

Concretely:

  • Add a CuTe gemm kernel in the appropriate CuTe backend directory, implementing $C = \alpha A B + \beta C$.
  • Use CuTe primitives to describe matrix layouts, threadblock tiling, and memory movement.
  • Align the public API with other backends so callers can dispatch to CuTe GEMM uniformly.
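
For orientation, the target semantics $C = \alpha A B + \beta C$ can be written as a minimal pure-Python sketch. The function name `gemm_reference` and the list-of-lists, row-major matrix representation are assumptions for illustration only; they are not the project's actual Python reference or API.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c.

    Hypothetical helper: matrices are row-major lists of lists,
    with a of shape (m, k), b of shape (k, n), and c of shape (m, n).
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions of a and b must match")
    n = len(b[0])
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (m, n)")

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Dot product of row i of a with column j of b.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Scale the product and blend in the existing c entry.
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm_reference(2.0, a, b, 3.0, c)
```

A CuTe kernel would be expected to reproduce this result up to floating-point rounding, since a GPU kernel typically accumulates the inner products in a different order.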

Alternatives considered

Alternatives such as omitting CuTe gemm or relying on other backends would:

  • significantly reduce the educational and practical value of including CuTe as a backend,
  • leave the CuTe column incomplete in the README BLAS table for the most important BLAS-3 kernel,
  • limit opportunities to demonstrate high-performance GEMM implementation details in CuTe.

Implementation details

  • Establish file layout and build integration for CuTe kernels.
  • Implement GEMM using CuTe abstractions tuned for GPU execution, potentially leveraging CUTLASS patterns.
  • Ensure numerical equivalence with the Python reference and harmonize with PyTorch/Triton semantics.
  • Integrate with planned tests and benchmarks for GEMM.
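
The tiling structure and the equivalence check mentioned above can be illustrated on the host, without any CuTe code. The sketch below is only an analogy: each `(i0, j0)` output tile plays the role of one threadblock, the loop over `p0` mirrors a kernel's main loop over K tiles, and the final scaling mirrors an epilogue. All names (`gemm_tiled`, `matmul_naive`, `max_abs_diff`) and the tile sizes are hypothetical, and the code says nothing about how the actual CuTe kernel will be organized.

```python
def matmul_naive(a, b):
    """Plain triple-loop product used as the comparison baseline."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_tiled(alpha, a, b, beta, c, tile_m=2, tile_n=2, tile_k=2):
    """Blocked alpha * (a @ b) + beta * c; tile sizes are illustrative."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            # Clamp tile bounds so shapes need not divide evenly.
            i1, j1 = min(i0 + tile_m, m), min(j0 + tile_n, n)
            # Per-tile accumulator, analogous to register fragments.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for p0 in range(0, k, tile_k):  # main loop over K tiles
                p1 = min(p0 + tile_k, k)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for p in range(p0, p1):
                            acc[i - i0][j - j0] += a[i][p] * b[p][j]
            # Epilogue: scale the accumulator and add it to beta * c.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    out[i][j] += alpha * acc[i - i0][j - j0]
    return out


def max_abs_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


# Shapes deliberately not multiples of the tile sizes.
a = [[0.1 * (i + p) for p in range(5)] for i in range(3)]
b = [[0.2 * (p - j) for j in range(4)] for p in range(5)]
c = [[1.0] * 4 for _ in range(3)]
ab = matmul_naive(a, b)
expected = [[1.5 * ab[i][j] + 0.5 * c[i][j] for j in range(4)] for i in range(3)]
error = max_abs_diff(gemm_tiled(1.5, a, b, 0.5, c), expected)
```

Comparing with a tolerance rather than exact equality is a deliberate choice here: a tiled kernel may sum the K dimension in a different order than the reference, so bitwise-identical results should not be assumed in general.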

Use case

The CuTe gemm kernel will:

  • demonstrate a high-performance GEMM implementation in CuTe,
  • enable direct performance comparisons with the PyTorch and Triton backends,
  • act as a cornerstone for Transformer modules and other large-scale linear algebra workloads.

Related work

  • CuTe/CUTLASS GEMM examples and reference kernels.
  • Standard BLAS gemm implementations.

Additional context

This issue complements the gemm Python/PyTorch/Triton feature requests and aims to make CuTe a first-class backend for the project’s most important BLAS-3 operation.
