tensor4all · shinaoka · Feb 19, 2026
diff --git a/THIRD-PARTY-LICENSES b/THIRD-PARTY-LICENSES
@@ -0,0 +1,47 @@
+This file lists third-party works whose algorithms or code influenced this
+project. Each entry includes the original license text.
+
+================================================================================
+HPTT — High-Performance Tensor Transpose
+https://github.com/springer13/hptt
+================================================================================
+
+The strided-perm/src/hptt/ module implements an algorithm based on the HPTT
+library by Paul Springer, Tong Su, and Paolo Bientinesi. This is an
+independent Rust reimplementation; no C++ source code was copied.
+
+Reference:
+  Paul Springer, Tong Su, and Paolo Bientinesi.
+  "HPTT: A High-Performance Tensor Transpose C++ Library."
+  In Proceedings of the 4th ACM SIGPLAN International Workshop on
+  Libraries, Languages, and Compilers for Array Programming (ARRAY), 2017.
+
+License (BSD-3-Clause):
+
+  Copyright 2018 Paul Springer
+
+  Redistribution and use in source and binary forms, with or without
+  modification, are permitted provided that the following conditions are met:
+
+  1. Redistributions of source code must retain the above copyright notice,
+     this list of conditions and the following disclaimer.
+
+  2. Redistributions in binary form must reproduce the above copyright notice,
+     this list of conditions and the following disclaimer in the documentation
+     and/or other materials provided with the distribution.
+
+  3. Neither the name of the copyright holder nor the names of its
+     contributors may be used to endorse or promote products derived from this
+     software without specific prior written permission.
+
+  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+  ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE
+  LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+  CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+  SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+  INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+  CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+  ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+  POSSIBILITY OF SUCH DAMAGE.
diff --git a/coverage-thresholds.json b/coverage-thresholds.json
@@ -1,4 +1,8 @@
 {
   "_comment": "Per-file line coverage thresholds (%). Files not listed default to 'default'.",
-  "default": 80
+  "default": 80,
+  "files": {
+    "strided-perm/src/hptt/execute.rs": 65,
+    "strided-perm/src/hptt/macro_kernel.rs": 60
+  }
 }
diff --git a/strided-perm/README.md b/strided-perm/README.md
@@ -6,50 +6,62 @@ Cache-efficient tensor permutation / transpose, inspired by
 ## Techniques
 
 1. **Bilateral dimension fusion** -- fuse consecutive dimensions that are
-   contiguous in *both* source and destination stride patterns.
-2. **Cache-aware blocking** -- tile iterations to fit in L1 cache (32 KB).
-3. **Optimal loop ordering** -- place the stride-1 dimension innermost for
-   sequential memory access; sort outer dimensions by descending stride.
-4. **Rank-specialized kernels** -- tight 1D/2D/3D blocked loops with no
-   allocation overhead; generic N-D fallback with pre-allocated odometer.
-5. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
-   outermost block loop via `rayon::par_iter`.
+   contiguous in *both* source and destination stride patterns
+   (equivalent to HPTT's `fuseIndices`).
+2. **2D micro-kernel transpose** -- 4×4 scalar kernel for f64, 8×8 for f32.
+3. **Macro-kernel blocking** -- BLOCK × BLOCK tile (16 for f64, 32 for f32)
+   processed as a grid of micro-kernel calls, with scalar edge handling.
+4. **Recursive ComputeNode loop nest** -- mirrors HPTT's linked-list loop
+   structure; only stride-1 dims get blocked.
+5. **ConstStride1 fast path** -- when src and dst stride-1 dims coincide,
+   uses memcpy/strided-copy instead of the 2D transpose kernel.
+6. **Optional Rayon parallelism** (`parallel` feature) -- parallelize the
+   outermost ComputeNode dimension via Rayon's `par_iter`.
+
+### TODO
+
+- **SIMD micro-kernels** -- the current scalar 4×4/8×8 kernels rely on LLVM
+  auto-vectorization. Dedicated AVX2/NEON intrinsic kernels could further
+  close the gap with HPTT C++.
 
 ## Benchmark Results
 
-Environment: Linux, AMD 64-core server, `RUSTFLAGS="-C target-cpu=native"`.
+Environment: Apple M2, 8 cores, macOS.
 
 All tensors use `f64` (8 bytes). "16M elements" = 128 MB read + 128 MB write.
 
 ### Single-threaded (1T)
 
 | Scenario | strided-perm | naive | Speedup |
 |---|---:|---:|---:|
-| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
-| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 84 ms (3.2 GB/s) | 2.8x |
-| Small tensor (13d, 8K elems) | 0.023 ms (5.7 GB/s) | 0.039 ms (3.4 GB/s) | 1.7x |
-| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 73 ms (3.7 GB/s) | ~1x |
-| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | -- | -- |
-| memcpy baseline | 5.8 ms (46 GB/s) | -- | -- |
+| Scattered 24d (16M elems) | 11.0 ms (24 GB/s) | 38 ms (7.0 GB/s) | 3.5x |
+| Contig→contig perm (24d) | 6.0 ms (45 GB/s) | 30 ms (9.1 GB/s) | 5.0x |
+| Small tensor reverse (13d, 8K elems) | 0.035 ms (3.7 GB/s) | 0.015 ms (8.9 GB/s) | 0.4x |
+| Small tensor cyclic (13d, 8K elems) | 0.004 ms (29 GB/s) | -- | -- |
+| 256^3 transpose [2,0,1] | 17.1 ms (16 GB/s) | 45 ms (6.0 GB/s) | 2.6x |
+| 256^3 transpose [1,0,2] | 15.0 ms (18 GB/s) | -- | -- |
+| memcpy baseline | 4.5 ms (59 GB/s) | -- | -- |
 
-### Multi-threaded (64T, `parallel` feature)
+### Multi-threaded (8T, `parallel` feature)
 
-| Scenario | 1T | 64T | Speedup |
+| Scenario | 1T | 8T | Speedup |
 |---|---:|---:|---:|
-| Scattered 24d (16M elems) | 30 ms (9.0 GB/s) | 23 ms (11.7 GB/s) | 1.3x |
-| Contig->contig perm (24d) | 30 ms (8.9 GB/s) | 24 ms (11.4 GB/s) | 1.3x |
-| Small tensor (13d, 8K elems) | 0.023 ms | 0.023 ms | 1.0x (below threshold) |
-| 256^3 transpose [2,0,1] | 76 ms (3.6 GB/s) | 4.7 ms (56.8 GB/s) | 16x |
-| 256^3 transpose [1,0,2] | 37 ms (7.3 GB/s) | 4.2 ms (64.1 GB/s) | 8.8x |
+| Scattered 24d (16M elems) | 15.7 ms (17 GB/s) | 7.8 ms (35 GB/s) | 2.0x |
+| Contig→contig perm (24d) | 6.3 ms (43 GB/s) | 6.5 ms (42 GB/s) | ~1x |
+| Small tensor reverse (13d, 8K elems) | 0.033 ms | 0.033 ms | 1.0x (below threshold) |
+| 256^3 transpose [2,0,1] | 17.0 ms (16 GB/s) | 17.5 ms (15 GB/s) | ~1x |
+| 256^3 transpose [1,0,2] | 15.8 ms (17 GB/s) | 6.3 ms (42 GB/s) | 2.5x |
 
 ### Notes
 
 - **Scattered 24d**: 24 binary dimensions with non-contiguous strides from a
   real tensor-network workload. Parallel improvement is modest because bilateral
   fusion leaves few outer blocks to distribute.
-- **256^3 transpose**: Parallel execution yields dramatic speedup (16x) by
-  exploiting the large L3 cache and memory bandwidth of the 64-core machine.
-  Single-threaded performance is TLB-limited due to stride-65536 access.
+- **Small tensor reverse**: Slower than naive because plan construction overhead
+  dominates at 8K elements. The cyclic permutation fuses to fewer dims and is
+  much faster.
+- **256^3 transpose [2,0,1]**: Parallel speedup is limited because the outermost
+  ComputeNode dimension is small after bilateral fusion.
 - **Small tensor**: Below `MINTHREADLENGTH` (32K elements), the parallel path
   falls back to single-threaded, incurring no overhead.
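
The micro-kernel and macro-kernel techniques named in the README's Techniques list (the 4×4 scalar kernel and the 16 × 16 tile for `f64`, with scalar edge handling) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `strided-perm/src/hptt/`: the names `micro_kernel`, `macro_kernel`, and `transpose_blocked` are hypothetical, the input is assumed to be a dense row-major 2D matrix rather than a general strided tensor, and only the sizes 4 and 16 are taken from the README.

```rust
/// Micro-kernel edge for f64 (the README's "4x4 scalar kernel").
const MICRO: usize = 4;
/// Macro-kernel tile edge for f64 (the README's "16 for f64").
const BLOCK: usize = 16;

/// Transpose one MICRO x MICRO tile: dst[j][i] = src[i][j].
/// `lds` and `ldd` are the row strides of `src` and `dst`, in elements.
fn micro_kernel(src: &[f64], lds: usize, dst: &mut [f64], ldd: usize) {
    for i in 0..MICRO {
        for j in 0..MICRO {
            dst[j * ldd + i] = src[i * lds + j];
        }
    }
}

/// Process the tile rows i0..i1, cols j0..j1 as a grid of micro-kernel
/// calls, then finish the leftover rows and columns with scalar copies.
#[allow(clippy::too_many_arguments)]
fn macro_kernel(
    src: &[f64],
    dst: &mut [f64],
    rows: usize,
    cols: usize,
    i0: usize,
    i1: usize,
    j0: usize,
    j1: usize,
) {
    // Largest sub-tile whose edges are multiples of MICRO.
    let i_full = i0 + (i1 - i0) / MICRO * MICRO;
    let j_full = j0 + (j1 - j0) / MICRO * MICRO;

    for i in (i0..i_full).step_by(MICRO) {
        for j in (j0..j_full).step_by(MICRO) {
            micro_kernel(&src[i * cols + j..], cols, &mut dst[j * rows + i..], rows);
        }
    }
    // Scalar edge: leftover columns, all rows of the tile.
    for i in i0..i1 {
        for j in j_full..j1 {
            dst[j * rows + i] = src[i * cols + j];
        }
    }
    // Scalar edge: leftover rows, columns already covered by full tiles.
    for i in i_full..i1 {
        for j in j0..j_full {
            dst[j * rows + i] = src[i * cols + j];
        }
    }
}

/// Transpose a row-major `rows x cols` matrix into a row-major
/// `cols x rows` matrix, one BLOCK x BLOCK tile at a time.
fn transpose_blocked(src: &[f64], rows: usize, cols: usize, dst: &mut [f64]) {
    assert_eq!(src.len(), rows * cols);
    assert_eq!(dst.len(), rows * cols);
    for bi in (0..rows).step_by(BLOCK) {
        for bj in (0..cols).step_by(BLOCK) {
            let bi_end = (bi + BLOCK).min(rows);
            let bj_end = (bj + BLOCK).min(cols);
            macro_kernel(src, dst, rows, cols, bi, bi_end, bj, bj_end);
        }
    }
}
```

A call such as `transpose_blocked(&src, 19, 37, &mut dst)` exercises both edge loops, since neither 19 nor 37 is a multiple of 4. The fixed-size inner loops are the kind of code the README's TODO expects LLVM to auto-vectorize; whether it does so depends on the target and compiler flags.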