
refactor: rewrite hptt module as 2D micro-kernel architecture #112

Merged: shinaoka merged 2 commits into main from refactor/hptt-cleanup-and-benchmarks on Feb 19, 2026

@shinaoka

Summary

  • Replace monolithic hptt.rs (934 lines) with modular hptt/ directory (6 files)
  • Implement HPTT-faithful 2D micro-kernel architecture: 4×4 f64 / 8×8 f32 scalar kernels → BLOCK×BLOCK macro-kernel tiles → recursive ComputeNode loop nest
  • Unify ConstStride1 path to use the same recursive ComputeNode traversal as Transpose (matching HPTT C++ structure), removing ~210 lines of ad-hoc rank-specialized flat loops
  • Remove unnecessary dispatch_transpose wrapper
  • Update README with current Apple M2 benchmarks and document SIMD micro-kernel TODO
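The three layers named above (micro-kernel, macro-kernel, recursive loop nest) can be illustrated with a small safe-Rust sketch. This is not the code from the PR: every identifier (`ComputeNode` fields, `micro_kernel`, `macro_kernel`, `execute`, `permute_210`) and the value `BLOCK = 16` are assumptions chosen for illustration, and the sketch skips remainder handling by requiring extents that are multiples of `BLOCK`.

```rust
// Hypothetical sketch: names and BLOCK = 16 are assumptions, not the PR's code.
const MICRO: usize = 4; // 4x4 tile, the f64 micro-kernel size named in the PR
const BLOCK: usize = 16; // assumed macro-tile edge

/// One loop level: iterate `0..end` in steps of `inc`, advancing the source
/// offset by `lda` and the destination offset by `ldb` per index.
struct ComputeNode {
    end: usize,
    inc: usize,
    lda: usize,
    ldb: usize,
}

/// Scalar micro-kernel: transpose one MICRO x MICRO tile.
fn micro_kernel(src: &[f64], s: usize, lda: usize, dst: &mut [f64], d: usize, ldb: usize) {
    for b in 0..MICRO {
        for a in 0..MICRO {
            dst[d + b + a * ldb] = src[s + a + b * lda];
        }
    }
}

/// Macro-kernel: cover one BLOCK x BLOCK tile with a grid of micro-kernel calls.
fn macro_kernel(src: &[f64], s: usize, lda: usize, dst: &mut [f64], d: usize, ldb: usize) {
    for b in (0..BLOCK).step_by(MICRO) {
        for a in (0..BLOCK).step_by(MICRO) {
            micro_kernel(src, s + a + b * lda, lda, dst, d + b + a * ldb, ldb);
        }
    }
}

/// Recursive loop nest: each node is one loop; the leaf runs the macro-kernel.
fn execute(
    nodes: &[ComputeNode],
    src: &[f64],
    s: usize,
    dst: &mut [f64],
    d: usize,
    lda: usize,
    ldb: usize,
) {
    match nodes.split_first() {
        None => macro_kernel(src, s, lda, dst, d, ldb),
        Some((node, rest)) => {
            for i in (0..node.end).step_by(node.inc) {
                execute(rest, src, s + i * node.lda, dst, d + i * node.ldb, lda, ldb);
            }
        }
    }
}

/// Permute a column-major (n, m, n) tensor with permutation [2, 1, 0].
fn permute_210(src: &[f64], n: usize, m: usize) -> Vec<f64> {
    assert!(n % BLOCK == 0 && src.len() == n * m * n);
    let mut dst = vec![0.0; src.len()];
    let nodes = [
        ComputeNode { end: m, inc: 1, lda: n, ldb: n },         // middle dimension
        ComputeNode { end: n, inc: BLOCK, lda: 1, ldb: n * m }, // tiles along src dim 0
        ComputeNode { end: n, inc: BLOCK, lda: n * m, ldb: 1 }, // tiles along src dim 2
    ];
    execute(&nodes, src, 0, &mut dst, 0, n * m, n * m);
    dst
}
```

Calling `permute_210(&src, 32, 3)` on a 32×3×32 buffer walks three loop levels and runs the macro-kernel once per 16×16 tile. A real implementation would also need edge tiles, raw-pointer access, and a SIMD micro-kernel, none of which are shown here.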

Benchmark (Apple M2, 1T)

| Scenario | Before | After | Speedup |
|---|---|---|---|
| Scattered 24d (16M f64) | 30 ms (9 GB/s) | 11 ms (24 GB/s) | 2.7× |
| Contig→contig perm (24d) | 30 ms (9 GB/s) | 6 ms (45 GB/s) | 5.0× |
| 256³ transpose [2,0,1] | 76 ms (3.6 GB/s) | 17 ms (16 GB/s) | 4.5× |
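The GB/s figures are consistent with counting every element once as a read and once as a write. That accounting is an assumption, since the PR does not state how bandwidth was measured, and `bandwidth_gb_s` is a hypothetical helper written only to show the arithmetic.

```rust
/// Effective bandwidth in GB/s (1 GB = 1e9 bytes), assuming each element
/// is read once from the source and written once to the destination.
fn bandwidth_gb_s(elements: usize, elem_bytes: usize, millis: f64) -> f64 {
    let bytes_moved = 2.0 * (elements * elem_bytes) as f64; // read + write
    bytes_moved / (millis * 1e-3) / 1e9
}
```

For the 256³ f64 case, `bandwidth_gb_s(256 * 256 * 256, 8, 17.0)` gives roughly 15.8, which rounds to the 16 GB/s in the table. The small mismatches elsewhere (for example about 3.5 against the reported 3.6) are what one would expect if the listed times were rounded to whole milliseconds.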

Test plan

  • cargo test -p strided-perm — 78 tests pass
  • cargo test -p strided-perm --features parallel — 80 tests pass
  • cargo bench --bench permute -p strided-perm — correctness checks pass
  • cargo test -p strided-kernel — downstream crate
  • cargo test -p strided-einsum2 — full pipeline

🤖 Generated with Claude Code

shinaoka and others added 2 commits February 19, 2026 11:03
Replace monolithic hptt.rs (934 lines) with modular hptt/ directory:
- micro_kernel/: MicroKernel trait + scalar 4x4 f64 / 8x8 f32 kernels
- macro_kernel.rs: BLOCK×BLOCK tile processing via micro-kernel grid
- plan.rs: PermutePlan with ComputeNode chain, bilateral fusion, ExecMode
- execute.rs: recursive ComputeNode traversal for both Transpose and
  ConstStride1 paths (mirrors HPTT C++ structure)

Key improvements:
- 2D blocking (BLOCK×BLOCK tiles) reduces function call overhead ~16x
- ConstStride1 loop ordering by dst-stride descending for sequential writes
- Removed ad-hoc rank-specialized flat loops in favor of HPTT-style recursion
- Removed unnecessary dispatch_transpose wrapper

Update README with current benchmark results on Apple M2 and document
SIMD micro-kernel TODO.
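
The dst-stride ordering listed under key improvements can be sketched as follows. This is an illustration under assumptions, not the PR's `plan.rs`: `LoopDim`, `order_for_sequential_writes`, and `walk` are hypothetical names, and the sketch copies one element per leaf instead of dispatching to a kernel.

```rust
/// One loop dimension with its extent and the strides it implies.
#[derive(Debug, Clone, Copy, PartialEq)]
struct LoopDim {
    size: usize,
    src_stride: usize,
    dst_stride: usize,
}

/// Put the largest destination stride outermost, so the innermost loop
/// advances the destination by the smallest stride.
fn order_for_sequential_writes(mut dims: Vec<LoopDim>) -> Vec<LoopDim> {
    dims.sort_by(|x, y| y.dst_stride.cmp(&x.dst_stride));
    dims
}

/// Recursive loop nest: calls `f(src_offset, dst_offset)` once per element,
/// in the order the loops visit them.
fn walk(dims: &[LoopDim], s: usize, d: usize, f: &mut dyn FnMut(usize, usize)) {
    match dims.split_first() {
        None => f(s, d),
        Some((dim, rest)) => {
            for i in 0..dim.size {
                walk(rest, s + i * dim.src_stride, d + i * dim.dst_stride, f);
            }
        }
    }
}
```

For a column-major 2×3×4 tensor permuted with [2,0,1], the per-dimension (size, src stride, dst stride) triples are (2, 1, 4), (3, 2, 8), and (4, 6, 1). After sorting, the loop order is dst stride 8, then 4, then 1, and the nest visits destination offsets 0, 1, 2, and so on in order, at the cost of strided reads from the source.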

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e thresholds

- Add THIRD-PARTY-LICENSES with HPTT BSD-3-Clause license text
- Add attribution comment in hptt/mod.rs referencing original work
- Apply rustfmt to all new hptt/ files
- Set per-file coverage thresholds for execute.rs (65%) and
  macro_kernel.rs (60%) — unsafe pointer-heavy code is hard to
  instrument with llvm-cov

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shinaoka shinaoka merged commit 8f3cfee into main Feb 19, 2026
5 checks passed
@shinaoka shinaoka deleted the refactor/hptt-cleanup-and-benchmarks branch February 19, 2026 02:11