[FEA] Detect transposed tensor on reductions and switch to fast path

Currently, a reduction over a transposed tensor (for example, reducing across columns instead of rows) is slow: the reduction naively takes the `transpose` operator as input and indexes it through a random-access iterator. As a result, adjacent threads read memory with a stride equal to the second dimension, so the loads are not coalesced.
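To make the stride concrete, here is a minimal host-side C++ sketch. It is not MatX code: it assumes a row-major 2D buffer of shape `rows x cols` read through a transposed view, with thread `t` reading view element `(row, t)`. All function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Linear offset of physical element (i, j) in a row-major buffer with `cols` columns.
std::size_t row_major_offset(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}

// Offset touched when reading element (i, j) of the transposed view:
// view element (i, j) is physical element (j, i).
std::size_t transposed_offset(std::size_t i, std::size_t j, std::size_t cols) {
    return row_major_offset(j, i, cols);
}

// Distance, in elements, between the addresses read by adjacent threads
// t and t + 1 when each thread reads view element (row, t).
std::size_t adjacent_thread_stride(std::size_t row, std::size_t t, std::size_t cols) {
    assert(cols > 0);
    return transposed_offset(row, t + 1, cols) - transposed_offset(row, t, cols);
}
```

Under these assumptions the stride between neighbouring threads is `cols` elements rather than 1, which is the access pattern described above.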

We experimented with performing the transpose through `einsum` first and then running the reduction. The results on an A30 were as follows:

`permute/sum`
```
|  T  | Tensor Size | NumElements |  DataSize   | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-----|-------------|-------------|-------------|---------|-----------|-------|-----------|-------|---------|--------------|--------|---------|-----------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    608x |  1.534 ms | 6.09% |  1.446 ms | 0.63% | 44.248K |  46.397 GB/s |  4.97% |    609x |  1.457 ms |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |     17x | 31.070 ms | 0.31% | 30.981 ms | 0.09% |  4.132K |  34.658 GB/s |  3.71% |     18x | 30.965 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |    254x |  2.058 ms | 4.40% |  1.971 ms | 0.18% | 32.467K |  68.088 GB/s |  7.30% |    267x |  1.944 ms |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |     16x | 32.480 ms | 0.29% | 32.391 ms | 0.07% |  3.952K |  66.298 GB/s |  7.11% |     17x | 32.367 ms |
```

`einsum/sum`
```
|  T  | Tensor Size | NumElements |  DataSize   | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|-----|-------------|-------------|-------------|---------|------------|--------|------------|--------|----------|--------------|--------|---------|------------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    944x | 585.716 us | 14.30% | 577.762 us | 14.06% | 110.772K | 116.153 GB/s | 12.45% |    991x | 504.905 us |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |   1846x |   8.025 ms |  0.82% |   8.017 ms |  0.81% |  15.966K | 133.933 GB/s | 14.35% |   1847x |   8.007 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |   3136x | 668.751 us |  1.46% | 661.259 us |  0.91% |  96.785K | 202.973 GB/s | 21.75% |   3137x | 649.249 us |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |   1380x |  10.776 ms |  1.44% |  10.768 ms |  1.44% |  11.887K | 199.424 GB/s | 21.37% |   1381x |  10.760 ms |
```

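The einsum version is roughly 2.5x to 3.9x faster in GPU time, since the transpose itself runs at speed of light (SoL). However, there is likely still room for improvement: a single kernel that is aware of both the transpose and the reduction could tile the input and may be faster still. This issue tracks implementing that kernel and comparing its performance against the two approaches above.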
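As a rough illustration of the idea, the host-side C++ sketch below reduces a transposed view without materializing the transpose. It is only a sketch under stated assumptions, not the proposed kernel: it handles a 2D row-major `rows x cols` matrix, the tile height `kTileRows` and both function names are hypothetical, and the CPU loops merely stand in for how a CUDA kernel might map the inner column loop to adjacent threads so that reads stay contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile height; a real kernel would tune this per architecture.
constexpr std::size_t kTileRows = 32;

// Sums each column of a row-major rows x cols matrix, which equals reducing
// the transposed view along its innermost dimension. Memory is walked in
// physical order, one tile of rows at a time.
std::vector<double> column_sums_tiled(const std::vector<double>& a,
                                      std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<double> out(cols, 0.0);
    for (std::size_t r0 = 0; r0 < rows; r0 += kTileRows) {
        const std::size_t r1 = (r0 + kTileRows < rows) ? r0 + kTileRows : rows;
        for (std::size_t r = r0; r < r1; ++r) {
            const double* row = a.data() + r * cols;
            // Contiguous reads: on a GPU, c would map to adjacent threads.
            for (std::size_t c = 0; c < cols; ++c) {
                out[c] += row[c];
            }
        }
    }
    return out;
}

// Reference that mimics the slow path: each output walks memory with
// a stride of `cols` elements.
std::vector<double> column_sums_strided(const std::vector<double>& a,
                                        std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<double> out(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            out[c] += a[r * cols + c];
        }
    }
    return out;
}
```

Both functions compute the same sums; they differ only in traversal order. Whether a fused GPU kernel built along these lines beats the `einsum` followed by `sum` path is exactly what this issue proposes to measure.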
