Currently a reduction across columns vs rows is slow since it naively takes in a transpose operator and indexes it using a random access iterator. This causes adjacent threads to have a strided access equal to the second dimension.
We experimented with an einsum transpose, followed by a reduction, and the results were as follows on an A30:
permute/sum
| T | Tensor Size | NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-----|-------------|-------------|-------------|---------|-----------|-------|-----------|-------|---------|--------------|--------|---------|-----------|
| F32 | 2^6 = 64 | 64 | 64.000 MiB | 608x | 1.534 ms | 6.09% | 1.446 ms | 0.63% | 44.248K | 46.397 GB/s | 4.97% | 609x | 1.457 ms |
| F32 | 2^7 = 128 | 128 | 1.000 GiB | 17x | 31.070 ms | 0.31% | 30.981 ms | 0.09% | 4.132K | 34.658 GB/s | 3.71% | 18x | 30.965 ms |
| F64 | 2^6 = 64 | 64 | 128.000 MiB | 254x | 2.058 ms | 4.40% | 1.971 ms | 0.18% | 32.467K | 68.088 GB/s | 7.30% | 267x | 1.944 ms |
| F64 | 2^7 = 128 | 128 | 2.000 GiB | 16x | 32.480 ms | 0.29% | 32.391 ms | 0.07% | 3.952K | 66.298 GB/s | 7.11% | 17x | 32.367 ms |
einsum/sum
| T | Tensor Size | NumElements | DataSize | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
|-----|-------------|-------------|-------------|---------|------------|--------|------------|--------|----------|--------------|--------|---------|------------|
| F32 | 2^6 = 64 | 64 | 64.000 MiB | 944x | 585.716 us | 14.30% | 577.762 us | 14.06% | 110.772K | 116.153 GB/s | 12.45% | 991x | 504.905 us |
| F32 | 2^7 = 128 | 128 | 1.000 GiB | 1846x | 8.025 ms | 0.82% | 8.017 ms | 0.81% | 15.966K | 133.933 GB/s | 14.35% | 1847x | 8.007 ms |
| F64 | 2^6 = 64 | 64 | 128.000 MiB | 3136x | 668.751 us | 1.46% | 661.259 us | 0.91% | 96.785K | 202.973 GB/s | 21.75% | 3137x | 649.249 us |
| F64 | 2^7 = 128 | 128 | 2.000 GiB | 1380x | 10.776 ms | 1.44% | 10.768 ms | 1.44% | 11.887K | 199.424 GB/s | 21.37% | 1381x | 10.760 ms |
The einsum version is quite a bit faster since it hits SoL on the transpose. However, there can be room for improvement where a kernel aware of both transpose and reductions can be faster by tiling. This issue is to implement that kernel and compare performance.
Currently a reduction across columns vs rows is slow since it naively takes in a
transposeoperator and indexes it using a random access iterator. This causes adjacent threads to have a strided access equal to the second dimension.We experimented with an
einsumtranspose, followed by a reduction, and the results were as follows on an A30:permute/sumeinsum/sumThe einsum version is quite a bit faster since it hits SoL on the transpose. However, there can be room for improvement where a kernel aware of both transpose and reductions can be faster by tiling. This issue is to implement that kernel and compare performance.