Skip to content

[FEA] Detect transposed tensor on reductions and switch to fast path #482

@cliffburdick

Description

@cliffburdick

Currently a reduction across columns vs rows is slow since it naively takes in a transpose operator and indexes it using a random access iterator. This causes adjacent threads to have a strided access equal to the second dimension.

We experimented with an einsum transpose, followed by a reduction, and the results were as follows on an A30:

permute/sum

|  T  | Tensor Size | NumElements |  DataSize   | Samples | CPU Time  | Noise | GPU Time  | Noise | Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-----|-------------|-------------|-------------|---------|-----------|-------|-----------|-------|---------|--------------|--------|---------|-----------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    608x |  1.534 ms | 6.09% |  1.446 ms | 0.63% | 44.248K |  46.397 GB/s |  4.97% |    609x |  1.457 ms |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |     17x | 31.070 ms | 0.31% | 30.981 ms | 0.09% |  4.132K |  34.658 GB/s |  3.71% |     18x | 30.965 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |    254x |  2.058 ms | 4.40% |  1.971 ms | 0.18% | 32.467K |  68.088 GB/s |  7.30% |    267x |  1.944 ms |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |     16x | 32.480 ms | 0.29% | 32.391 ms | 0.07% |  3.952K |  66.298 GB/s |  7.11% |     17x | 32.367 ms |

einsum/sum

|  T  | Tensor Size | NumElements |  DataSize   | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  | GlobalMem BW | BWUtil | Samples | Batch GPU  |
|-----|-------------|-------------|-------------|---------|------------|--------|------------|--------|----------|--------------|--------|---------|------------|
| F32 |    2^6 = 64 |          64 |  64.000 MiB |    944x | 585.716 us | 14.30% | 577.762 us | 14.06% | 110.772K | 116.153 GB/s | 12.45% |    991x | 504.905 us |
| F32 |   2^7 = 128 |         128 |   1.000 GiB |   1846x |   8.025 ms |  0.82% |   8.017 ms |  0.81% |  15.966K | 133.933 GB/s | 14.35% |   1847x |   8.007 ms |
| F64 |    2^6 = 64 |          64 | 128.000 MiB |   3136x | 668.751 us |  1.46% | 661.259 us |  0.91% |  96.785K | 202.973 GB/s | 21.75% |   3137x | 649.249 us |
| F64 |   2^7 = 128 |         128 |   2.000 GiB |   1380x |  10.776 ms |  1.44% |  10.768 ms |  1.44% |  11.887K | 199.424 GB/s | 21.37% |   1381x |  10.760 ms |

The einsum version is quite a bit faster since it hits SoL on the transpose. However, there can be room for improvement where a kernel aware of both transpose and reductions can be faster by tiling. This issue is to implement that kernel and compare performance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions