[FEATURE REQUEST] geru CuTe kernel implementation #28

@LoserCheems

Problem statement

The BLAS level-2 geru kernel (general rank-1 update) does not yet have a CuTe backend implementation in this project. The CuTe column for geru in the README BLAS table is empty, so the operation lacks full cross-backend coverage.

Without a CuTe geru kernel:

  • users cannot learn how rank-1 updates are expressed using CuTe’s layout and tiling abstractions,
  • there is no CuTe performance baseline to compare against PyTorch and Triton geru implementations,
  • CuTe-based higher-level examples lack a standard rank-1 update primitive.

Proposed solution

Implement a CuTe-based geru kernel that matches the Python reference semantics and fits within the project’s backend structure.

Concretely:

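  • Add a CuTe geru kernel in the appropriate CuTe backend directory, implementing the in-place update $A \leftarrow A + \alpha x y^\top$, where $A$ is $m \times n$, $x$ has length $m$, and $y$ has length $n$.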
  • Use CuTe primitives to describe matrix layout, vector access, and thread scheduling for rank-1 updates.
  • Align the public API with other backends to allow uniform dispatch.
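
For orientation, the update that every backend has to reproduce can be written out element by element: $A_{ij} \leftarrow A_{ij} + \alpha x_i y_j$. The snippet below is a minimal pure-Python sketch of that semantics. The name `geru_reference` and the list-of-lists representation are assumptions made for illustration and are not taken from the project's actual Python reference.

```python
def geru_reference(alpha, x, y, a):
    """Return A + alpha * x * y^T for an m x n matrix given as a list of rows.

    Hypothetical helper for illustration only. No conjugation is applied
    to y, which is what distinguishes geru from gerc for complex inputs.
    """
    m, n = len(x), len(y)
    if len(a) != m or any(len(row) != n for row in a):
        raise ValueError("A must have shape (len(x), len(y))")
    # Element (i, j) receives the product of the i-th entry of x
    # and the j-th entry of y, scaled by alpha.
    return [[a[i][j] + alpha * x[i] * y[j] for j in range(n)] for i in range(m)]


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 2.0]
y = [1.0, 0.0, -1.0]
updated = geru_reference(2.0, x, y, a)
```

Each output element depends on exactly one entry of $x$, one entry of $y$, and one entry of $A$, so the operation is memory-bound and parallelizes trivially over $(i, j)$. That property is what a CuTe kernel would exploit when assigning tiles of $A$ to thread blocks.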

Alternatives considered

Omitting a CuTe geru kernel, or routing CuTe calls to another backend, would:

  • reduce the educational impact of comparing CuTe to PyTorch/Triton on BLAS-2 operations,
  • leave the CuTe column incomplete in the README BLAS table,
  • limit CuTe’s role as a first-class backend.

Implementation details

  • Establish file layout and build rules for CuTe kernels.
  • Implement geru using CuTe abstractions for rank-1 updates over 2D layouts.
  • Ensure results match the Python reference within a dtype-appropriate tolerance.
  • Integrate with planned tests and benchmarks for geru.
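
The idea behind "rank-1 updates over 2D layouts" can be sketched without any GPU code. The Python emulation below is not CuTe and does not use its API. It only mimics two concepts, under stated assumptions: a layout as a (shape, stride) pair that maps a 2D coordinate to a linear offset, and a tiling of the $m \times n$ index space where the outer loops stand in for thread blocks and the inner loops for threads. The names `layout_offset` and `geru_tiled` and the tile size are hypothetical.

```python
def layout_offset(coord, stride):
    """Map a 2D coordinate to a linear offset for a given stride pair."""
    i, j = coord
    return i * stride[0] + j * stride[1]


def geru_tiled(alpha, x, y, a_flat, shape, stride, tile=(2, 2)):
    """Apply A += alpha * x * y^T in place on a flat buffer, tile by tile."""
    m, n = shape
    tm, tn = tile
    # Outer loops: one iteration per tile (stand-in for thread blocks).
    for bi in range(0, m, tm):
        for bj in range(0, n, tn):
            # Inner loops: one iteration per element of the tile
            # (stand-in for threads within a block).
            for ti in range(tm):
                for tj in range(tn):
                    i, j = bi + ti, bj + tj
                    # Predicate: skip coordinates past the matrix edge
                    # when the shape is not a multiple of the tile.
                    if i < m and j < n:
                        a_flat[layout_offset((i, j), stride)] += alpha * x[i] * y[j]
    return a_flat


shape = (3, 5)
x = [1.0, 2.0, 3.0]
y = [1.0, -1.0, 2.0, 0.0, 0.5]
alpha = 0.5
# Same logical update on a row-major and a column-major buffer.
row_major = geru_tiled(alpha, x, y, [0.0] * 15, shape, stride=(5, 1))
col_major = geru_tiled(alpha, x, y, [0.0] * 15, shape, stride=(1, 3))
```

Only the stride pair changes between the row-major and column-major calls, while the update logic stays the same. Separating the indexing from the arithmetic in this way is the property a layout-based kernel is meant to demonstrate, and the bounds predicate shows the edge handling needed when $m$ or $n$ is not a multiple of the tile size.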

Use case

The CuTe geru kernel will:

  • demonstrate rank-1 updates in CuTe,
  • enable detailed performance comparisons across backends,
  • serve as a building block for more complex CuTe-based kernels.

Related work

  • CuTe/CUTLASS examples of rank-1 updates.
  • Standard BLAS geru implementations.

Additional context

This issue complements the geru Python/PyTorch/Triton feature requests and contributes to full CuTe coverage of BLAS-2 operations.
