refactor: rewrite hptt module as 2D micro-kernel architecture by shinaoka · Pull Request #112 · tensor4all/strided-rs

shinaoka · 2026-02-19T02:04:28Z

Summary

Replace monolithic hptt.rs (934 lines) with modular hptt/ directory (6 files)
Implement HPTT-faithful 2D micro-kernel architecture: 4×4 f64 / 8×8 f32 scalar kernels → BLOCK×BLOCK macro-kernel tiles → recursive ComputeNode loop nest
Unify ConstStride1 path to use the same recursive ComputeNode traversal as Transpose (matching HPTT C++ structure), removing ~210 lines of ad-hoc rank-specialized flat loops
Remove unnecessary dispatch_transpose wrapper
Update README with current Apple M2 benchmarks and document SIMD micro-kernel TODO
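
For readers unfamiliar with the layering, the micro-kernel and macro-kernel levels can be sketched roughly as below. This is an illustration under stated assumptions, not the code in this PR: the function names, the `BLOCK = 16` value, and the restriction to a square row-major matrix whose side is a multiple of `BLOCK` are all hypothetical simplifications.

```rust
// Illustrative sketch only: names and constants are assumptions, not the PR's API.
const MICRO: usize = 4; // 4x4 scalar micro-kernel for f64
const BLOCK: usize = 16; // assumed macro-kernel tile edge

/// Micro-kernel: transpose one 4x4 block, dst[j][i] = src[i][j].
/// `lda` and `ldb` are the row strides (in elements) of src and dst.
fn micro_kernel_4x4(src: &[f64], lda: usize, dst: &mut [f64], ldb: usize) {
    for i in 0..MICRO {
        for j in 0..MICRO {
            dst[j * ldb + i] = src[i * lda + j];
        }
    }
}

/// Macro-kernel: cover one BLOCK x BLOCK tile with a grid of micro-kernel calls.
fn macro_kernel(src: &[f64], lda: usize, dst: &mut [f64], ldb: usize) {
    for bi in (0..BLOCK).step_by(MICRO) {
        for bj in (0..BLOCK).step_by(MICRO) {
            micro_kernel_4x4(&src[bi * lda + bj..], lda, &mut dst[bj * ldb + bi..], ldb);
        }
    }
}

/// Outer loop nest: transpose an n x n row-major matrix tile by tile.
/// Requires n to be a multiple of BLOCK (no remainder handling in this sketch).
fn transpose_blocked(src: &[f64], dst: &mut [f64], n: usize) {
    assert!(n % BLOCK == 0 && src.len() == n * n && dst.len() == n * n);
    for ti in (0..n).step_by(BLOCK) {
        for tj in (0..n).step_by(BLOCK) {
            macro_kernel(&src[ti * n + tj..], n, &mut dst[tj * n + ti..], n);
        }
    }
}
```

Each micro-kernel call moves 16 elements, so the number of kernel invocations drops by that factor compared with a per-element loop. A real implementation would also handle remainder tiles and arbitrary strides.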

Benchmark (Apple M2, 1T)

| Scenario | Before | After | Speedup |
|---|---|---|---|
| Scattered 24d (16M f64) | 30 ms (9 GB/s) | 11 ms (24 GB/s) | 2.7× |
| Contig→contig perm (24d) | 30 ms (9 GB/s) | 6 ms (45 GB/s) | 5.0× |
| 256³ transpose [2,0,1] | 76 ms (3.6 GB/s) | 17 ms (16 GB/s) | 4.5× |
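
The GB/s figures appear to count every element as one read plus one write. That convention is an assumption here, not something stated in the PR; the helper below (a hypothetical name) shows the arithmetic under it.

```rust
/// Effective bandwidth in GB/s, ASSUMING each element is read once and
/// written once (2 * elements * elem_bytes bytes moved in total).
fn effective_gbps(elements: u64, elem_bytes: u64, millis: f64) -> f64 {
    let bytes_moved = 2 * elements * elem_bytes; // one read + one write
    bytes_moved as f64 / (millis * 1e-3) / 1e9
}
```

With 2^24 f64 elements (the "16M" case, also the size of a 256³ tensor), this gives roughly 24 GB/s at 11 ms and roughly 45 GB/s at 6 ms, in line with the table. Small differences, such as the 76 ms row, are expected because the reported timings are rounded.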

Test plan

cargo test -p strided-perm — 78 tests pass
cargo test -p strided-perm --features parallel — 80 tests pass
cargo bench --bench permute -p strided-perm — correctness checks pass
cargo test -p strided-kernel — downstream crate
cargo test -p strided-einsum2 — full pipeline

🤖 Generated with Claude Code

Replace monolithic hptt.rs (934 lines) with modular hptt/ directory:
- micro_kernel/: MicroKernel trait + scalar 4x4 f64 / 8x8 f32 kernels
- macro_kernel.rs: BLOCK×BLOCK tile processing via micro-kernel grid
- plan.rs: PermutePlan with ComputeNode chain, bilateral fusion, ExecMode
- execute.rs: recursive ComputeNode traversal for both Transpose and ConstStride1 paths (mirrors HPTT C++ structure)

Key improvements:
- 2D blocking (BLOCK×BLOCK tiles) reduces function call overhead ~16x
- ConstStride1 loop ordering by dst-stride descending for sequential writes
- Removed ad-hoc rank-specialized flat loops in favor of HPTT-style recursion
- Removed unnecessary dispatch_transpose wrapper

Update README with current benchmark results on Apple M2 and document SIMD micro-kernel TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
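
As a rough picture of the recursive traversal and the dst-stride ordering described in this commit message, here is a minimal sketch. The struct fields, the function names `build_chain` and `execute`, and the per-element leaf copy are assumptions for illustration; the actual `ComputeNode` in plan.rs and execute.rs may differ, and its leaf would presumably dispatch to the macro-kernel rather than copy single elements.

```rust
/// Hypothetical loop-nest node: one loop level of the permutation.
struct ComputeNode {
    extent: usize,
    src_stride: usize,
    dst_stride: usize,
    next: Option<Box<ComputeNode>>, // inner loop, None at the innermost level
}

/// Build a chain from (extent, src_stride, dst_stride) triples.
/// Loops are ordered by dst stride descending, so the innermost loop
/// walks the destination with the smallest stride (sequential writes).
fn build_chain(mut dims: Vec<(usize, usize, usize)>) -> Option<Box<ComputeNode>> {
    dims.sort_by(|a, b| b.2.cmp(&a.2)); // largest dst stride first (outermost)
    let mut chain = None;
    // Construct from the innermost level outwards.
    for &(extent, src_stride, dst_stride) in dims.iter().rev() {
        chain = Some(Box::new(ComputeNode { extent, src_stride, dst_stride, next: chain }));
    }
    chain
}

/// Recursive traversal: each level loops over its extent and recurses;
/// the innermost level copies one element per iteration.
fn execute(node: &ComputeNode, src: &[f64], src_off: usize, dst: &mut [f64], dst_off: usize) {
    for i in 0..node.extent {
        let s = src_off + i * node.src_stride;
        let d = dst_off + i * node.dst_stride;
        match &node.next {
            Some(inner) => execute(inner, src, s, dst, d),
            None => dst[d] = src[s],
        }
    }
}
```

One recursive routine covers every rank, which is the property that lets rank-specialized flat loops be dropped.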

…e thresholds

- Add THIRD-PARTY-LICENSES with HPTT BSD-3-Clause license text
- Add attribution comment in hptt/mod.rs referencing original work
- Apply rustfmt to all new hptt/ files
- Set per-file coverage thresholds for execute.rs (65%) and macro_kernel.rs (60%), since unsafe pointer-heavy code is hard to instrument with llvm-cov

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shinaoka and others added 2 commits February 19, 2026 11:03

shinaoka merged commit 8f3cfee into main Feb 19, 2026
5 checks passed

shinaoka deleted the refactor/hptt-cleanup-and-benchmarks branch February 19, 2026 02:11
