[FEATURE REQUEST] `gemm` CuTe kernel implementation #33

### Problem statement

The BLAS level-3 `gemm` kernel has no CuTe backend implementation in this project. The README BLAS table lists an empty CuTe column for `gemm`, even though GEMM is the primary workload that CuTe/CUTLASS is designed to optimize.

Without a CuTe `gemm` kernel:
- users cannot see how GEMM is expressed using CuTe’s layout and tiling abstractions,
- there is no CuTe GEMM baseline for performance comparison with PyTorch and Triton,
- CuTe-based examples are missing the most important building block for many workloads.

### Proposed solution

Implement a CuTe-based `gemm` kernel that matches the Python reference semantics and aligns with the project’s backend structure.

Concretely:
- Add a CuTe `gemm` kernel in the appropriate CuTe backend directory, implementing $C = \alpha A B + \beta C$.
- Use CuTe primitives to describe matrix layouts, threadblock tiling, and memory movement.
- Align the public API with other backends so callers can dispatch to CuTe GEMM uniformly.
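To pin down the semantics the CuTe kernel would have to match, here is a minimal pure-Python sketch of $C = \alpha A B + \beta C$. It is not the project's actual Python reference: the name `gemm_reference`, the row-major list-of-lists representation, and the choice to return a new matrix instead of updating `C` in place are assumptions made for illustration. The `beta == 0` branch follows the usual BLAS convention that `C` need not be initialized when `beta` is zero.

```python
from typing import List

Matrix = List[List[float]]


def gemm_reference(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c as a new row-major matrix.

    Hypothetical helper for illustration; shapes are A (M x K), B (K x N), C (M x N).
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have shape (M, N)")

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # BLAS convention: when beta == 0, C is not read, so NaN or Inf
            # already stored in C does not leak into the result.
            out[i][j] = alpha * acc if beta == 0 else alpha * acc + beta * c[i][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm_reference(2.0, a, b, 3.0, c))  # [[41.0, 47.0], [89.0, 103.0]]
```

Whether the CuTe backend should also skip reading `C` when `beta` is zero is a design decision for this issue; the sketch only shows one common choice.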

### Alternatives considered

Alternatives such as omitting CuTe `gemm` or relying on other backends would:
- significantly reduce the educational and practical value of including CuTe as a backend,
- leave the CuTe column incomplete in the README BLAS table for the most important BLAS-3 kernel,
- limit opportunities to demonstrate high-performance GEMM implementation details in CuTe.

### Implementation details

- Establish file layout and build integration for CuTe kernels.
- Implement GEMM using CuTe abstractions tuned for GPU execution, potentially leveraging CUTLASS patterns.
- Verify that results agree with the Python reference within a floating-point tolerance appropriate to the data type, and keep the semantics consistent with the PyTorch and Triton backends.
- Integrate with planned tests and benchmarks for GEMM.
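The loop structure that a tiled GPU GEMM follows, and the kind of tolerance check a test could apply, can be sketched in plain Python. This is only a model of the control flow, not CuTe code: the real kernel would be written in C++ with CuTe layouts and copy/MMA primitives. The names `gemm_tiled` and `allclose`, the tile sizes, and the tolerance values are all hypothetical placeholders.

```python
import math
from typing import List

Matrix = List[List[float]]


def gemm_tiled(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix,
               tile_m: int = 2, tile_n: int = 2, tile_k: int = 2) -> Matrix:
    """Blocked alpha * (a @ b) + beta * c; tile sizes are illustrative only."""
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # Each (i0, j0) pair owns one output tile, the role a thread block
    # plays on the GPU.
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            i1, j1 = min(i0 + tile_m, m), min(j0 + tile_n, n)
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            # Main loop: walk the K dimension one tile_k slice at a time,
            # accumulating partial products into the tile-local buffer.
            for p0 in range(0, k, tile_k):
                p1 = min(p0 + tile_k, k)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for p in range(p0, p1):
                            acc[i - i0][j - j0] += a[i][p] * b[p][j]
            # Epilogue: apply alpha and beta once per output element.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    scaled = alpha * acc[i - i0][j - j0]
                    out[i][j] = scaled if beta == 0 else scaled + beta * c[i][j]
    return out


def allclose(x: Matrix, y: Matrix, rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Element-wise comparison within a tolerance; the defaults are placeholders."""
    return len(x) == len(y) and all(
        len(rx) == len(ry)
        and all(math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol) for u, v in zip(rx, ry))
        for rx, ry in zip(x, y)
    )
```

Because tiling changes the order in which partial sums are accumulated, a tiled kernel generally will not match an untiled reference bit for bit in floating point, which is why a test would compare with something like `allclose` and tolerances chosen per data type instead of exact equality.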

### Use case

The CuTe `gemm` kernel will:
- demonstrate a high-performance GEMM implementation in CuTe,
- enable rich performance comparisons across backends,
- act as a cornerstone for Transformer modules and other large-scale linear algebra workloads.

### Related work

- CuTe/CUTLASS GEMM examples and reference kernels.
- Standard BLAS `gemm` implementations.

### Additional context

This issue complements the `gemm` Python/PyTorch/Triton feature requests and aims to make CuTe a first-class backend for the project’s most important BLAS-3 operation.
