[FEATURE REQUEST] dot CuTe kernel implementation #18

@LoserCheems

Problem statement

The BLAS level-1 dot kernel has no CuTe backend implementation in this project. In the README BLAS table, the CuTe column for dot is empty, which prevents a full cross-backend comparison for one of the most fundamental BLAS-1 operations.

Without a CuTe dot kernel:

  • users cannot see how dot products are expressed with CuTe primitives and thread scheduling,
  • there is no CuTe baseline for performance comparison against PyTorch and Triton dot products,
  • CuTe-based examples remain incomplete for basic BLAS-1 coverage.

Proposed solution

Implement a CuTe-based dot kernel that matches the mathematical semantics of the Python reference and fits within the project’s backend structure.

Concretely:

  • Add a CuTe dot product kernel in the appropriate CuTe backend directory (once established), implementing $z = x^\top y$ for 1D vectors.
  • Use CuTe constructs suited to reductions and vector operations, such as its layout and tensor abstractions.
  • Align the public entry-point API with other backends to allow uniform dispatch.
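
The reduction such a kernel would perform can be sketched in plain Python. This is not CuTe code: it only emulates a common GPU reduction pattern (strided per-thread partial sums followed by a pairwise tree reduction) so that the intended semantics of $z = x^\top y$ are explicit. The names `dot_reference`, `dot_blocked`, and `num_threads` are hypothetical and not part of the project.

```python
from typing import Sequence


def dot_reference(x: Sequence[float], y: Sequence[float]) -> float:
    """Sequential reference: z = sum_i x[i] * y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(a * b for a, b in zip(x, y))


def dot_blocked(x: Sequence[float], y: Sequence[float], num_threads: int = 8) -> float:
    """Emulate a GPU-style reduction (hypothetical structure, not CuTe code)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if num_threads <= 0 or num_threads & (num_threads - 1):
        raise ValueError("num_threads must be a positive power of two")

    # Stage 1: emulated thread t accumulates elements t, t + T, t + 2T, ...
    partial = [0.0] * num_threads
    for t in range(num_threads):
        for i in range(t, len(x), num_threads):
            partial[t] += x[i] * y[i]

    # Stage 2: pairwise tree reduction, halving the active range each step.
    stride = num_threads // 2
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]
```

For example, `dot_blocked([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])` returns `32.0`, the same value as `dot_reference`. On a GPU the stage-1 loops would run concurrently and stage 2 would need synchronization between steps; both are omitted here because the emulation is sequential.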

Alternatives considered

Omitting the CuTe dot kernel, or relying only on other backends for performance comparisons, would:

  • limit the educational value of demonstrating reductions in CuTe,
  • leave the CuTe column incomplete in the README BLAS table,
  • reduce CuTe’s role as a first-class backend in the project.

Implementation details

  • Decide on file layout and build integration for CuTe kernels.
  • Implement the dot product using CuTe abstractions for memory and thread scheduling.
  • Ensure numerical equivalence with the Python reference and compatibility with the project’s testing and benchmarking utilities.
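
One way to phrase the numerical-equivalence requirement is a tolerance-based comparison against a high-precision reference, since a parallel reduction sums in a different order than a sequential loop and will generally not match bit-for-bit. The following is a minimal sketch, assuming a hypothetical helper `check_dot` and illustrative tolerances; the project's actual testing utilities and thresholds are not shown in this issue.

```python
import math
import random
from typing import Callable, Sequence

DotFn = Callable[[Sequence[float], Sequence[float]], float]


def check_dot(candidate: DotFn, n: int = 1024, seed: int = 0,
              rel_tol: float = 1e-9, abs_tol: float = 1e-9) -> bool:
    """Compare a candidate dot product against an accurately summed reference.

    Hypothetical helper: the name, defaults, and tolerances are assumptions.
    """
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    y = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # math.fsum adds the products without accumulating rounding error.
    expected = math.fsum(a * b for a, b in zip(x, y))
    return math.isclose(candidate(x, y), expected,
                        rel_tol=rel_tol, abs_tol=abs_tol)


def reversed_dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Stand-in for a backend that sums in a different order."""
    total = 0.0
    for a, b in zip(reversed(x), reversed(y)):
        total += a * b
    return total
```

Here `check_dot(reversed_dot)` is expected to pass even though the summation order differs from the reference. The tolerances above are chosen for double precision; a kernel accumulating in a lower-precision dtype would need looser ones.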

Use case

The CuTe dot kernel will:

  • showcase a simple yet important reduction in CuTe,
  • enable performance and implementation comparisons across backends,
  • act as a basis for more advanced CuTe kernels (e.g. matrix multiplications).

Related work

  • CuTe/CUTLASS examples of dot products and reductions.
  • Standard BLAS ddot/sdot implementations.

Additional context

This issue complements the dot Python/PyTorch/Triton feature requests and contributes to full CuTe coverage of BLAS-1 operations.
