perf: source-stride-order copy in prepare_input_owned by shinaoka · Pull Request #117 · tensor4all/strided-rs

shinaoka · 2026-02-19T11:14:16Z

Summary

Replace HPTT-based copy_into_col_major with source-stride-order copy (copy_strided_src_order) in prepare_input_owned. HPTT iterates in destination-stride order, causing scattered reads from cold L3 cache when the source is contiguous with many small permuted dimensions (e.g. 24 binary dims of size 2). Source-stride-order iteration gives sequential reads that exploit the hardware prefetcher.
Add optional rayon parallelization (--features parallel) via copy_strided_src_order_par, which splits the outer source-stride dimensions across threads and falls back automatically to the sequential copy for small tensors or when RAYON_NUM_THREADS=1.
Update docs/permutation-optimization.md to reflect the new strategy and benchmark results.
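
To make the iteration-order argument in the summary concrete, here is a minimal sketch of a source-stride-order copy. It is not the code from this PR: the name copy_src_order_sketch, its signature, the f64 element type, and the use of non-negative element strides are all assumptions made for illustration. The only idea taken from the summary is that the loop nest is ordered by source stride, so reads walk the source sequentially while writes are the scattered side.

```rust
/// Hypothetical helper (not the PR's `copy_strided_src_order`): copy a strided
/// view of `src` into a dense column-major `dst`, iterating so that the axis
/// with the smallest source stride is the innermost loop.
fn copy_src_order_sketch(src: &[f64], dims: &[usize], src_strides: &[usize], dst: &mut [f64]) {
    let rank = dims.len();
    let total: usize = dims.iter().product();
    assert!(dst.len() >= total, "destination too small");

    // Column-major strides of the destination for the same `dims`.
    let mut dst_strides = vec![1usize; rank];
    for i in 1..rank {
        dst_strides[i] = dst_strides[i - 1] * dims[i - 1];
    }

    // Loop order: axes sorted by ascending source stride (fastest first).
    let mut order: Vec<usize> = (0..rank).collect();
    order.sort_by_key(|&ax| src_strides[ax]);

    let mut idx = vec![0usize; rank];
    let (mut s, mut d) = (0usize, 0usize);
    for _ in 0..total {
        dst[d] = src[s];
        // Odometer-style increment; the fastest digit is the smallest source stride.
        for &ax in &order {
            idx[ax] += 1;
            s += src_strides[ax];
            d += dst_strides[ax];
            if idx[ax] < dims[ax] {
                break;
            }
            // Carry: rewind this axis and move on to the next slower one.
            s -= src_strides[ax] * dims[ax];
            d -= dst_strides[ax] * dims[ax];
            idx[ax] = 0;
        }
    }
}

/// Usage: materialize the transpose of a column-major 2x3 matrix.
/// The transposed view has dims (3, 2) and source strides (2, 1).
fn example_transpose() -> Vec<f64> {
    let src = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0];
    let mut dst = vec![0.0; 6];
    copy_src_order_sketch(&src, &[3, 2], &[2, 1], &mut dst);
    dst // [0, 2, 4, 1, 3, 5]; `src` was read at offsets 0, 1, 2, 3, 4, 5 in that order
}
```

In the transpose example the source offsets are visited strictly in increasing order, which is the access pattern the summary credits with exploiting the hardware prefetcher. A destination-stride-order loop would instead read offsets 0, 2, 4, 1, 3, 5.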

Benchmark results

tensornetwork_permutation_light_415 (415 tensors, 24 binary dims, AMD EPYC 7713P):

Configuration	opt_flops (ms)	Change
HPTT (original) 1T	455	baseline
Source-order copy 1T	298	-34%
Source-order + parallel 4T	228	-50%

Full benchmark suite (10 instances, 1T): no regressions on other instances.

4T parallel copy effect (isolated from faer GEMM parallelism):

Instance	4T no-parallel	4T parallel	Change
tensornetwork_light_415	318 ms	228 ms	-28%
tensornetwork_focus_409	319 ms	226 ms	-29%
mera_closed_120	987 ms	797 ms	-19%
mera_open_26	605 ms	529 ms	-12%
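
The parallel variant measured above splits the outer source-stride dimensions across threads. The sketch below illustrates one way such a split could look; it is not the PR's copy_strided_src_order_par. It uses std::thread::scope instead of rayon so that it compiles without external crates, and the names DstPtr, copy_axes, copy_src_order_par_sketch, and the min_par_len threshold parameter are hypothetical. Each task fixes one combination of outer-axis indices and copies the remaining inner axes in source-stride order; because column-major destination offsets are unique per multi-index, no two tasks write the same element.

```rust
use std::thread;

/// Destination pointer plus length, copyable into worker threads.
#[derive(Clone, Copy)]
struct DstPtr {
    ptr: *mut f64,
    len: usize,
}
// SAFETY: workers only ever write disjoint offsets (one multi-index per task).
unsafe impl Send for DstPtr {}

impl DstPtr {
    /// Caller must ensure no other thread accesses offset `off` concurrently.
    unsafe fn write(self, off: usize, v: f64) {
        assert!(off < self.len, "destination offset out of bounds");
        unsafe { self.ptr.add(off).write(v) }
    }
}

/// Copy the sub-tensor spanned by `axes` (fastest source axis first),
/// starting at element offsets `s0` in `src` and `d0` in the destination.
fn copy_axes(src: &[f64], dst: DstPtr, dims: &[usize], ss: &[usize], ds: &[usize],
             axes: &[usize], s0: usize, d0: usize) {
    let count: usize = axes.iter().map(|&a| dims[a]).product();
    let mut idx = vec![0usize; dims.len()];
    let (mut s, mut d) = (s0, d0);
    for _ in 0..count {
        // SAFETY: offset `d` belongs to exactly one task; bounds are asserted in `write`.
        unsafe { dst.write(d, src[s]) };
        for &ax in axes {
            idx[ax] += 1;
            s += ss[ax];
            d += ds[ax];
            if idx[ax] < dims[ax] {
                break;
            }
            s -= ss[ax] * dims[ax];
            d -= ds[ax] * dims[ax];
            idx[ax] = 0;
        }
    }
}

/// Hypothetical parallel copy into a dense column-major `dst`.
/// Falls back to one sequential pass when `n_threads <= 1` or the tensor is small.
fn copy_src_order_par_sketch(src: &[f64], dims: &[usize], src_strides: &[usize],
                             dst: &mut [f64], n_threads: usize, min_par_len: usize) {
    let rank = dims.len();
    let total: usize = dims.iter().product();
    assert_eq!(dst.len(), total, "destination must be dense");
    if total == 0 {
        return;
    }
    let mut dst_strides = vec![1usize; rank];
    for i in 1..rank {
        dst_strides[i] = dst_strides[i - 1] * dims[i - 1];
    }
    let mut order: Vec<usize> = (0..rank).collect();
    order.sort_by_key(|&ax| src_strides[ax]);
    let out = DstPtr { ptr: dst.as_mut_ptr(), len: dst.len() };

    if n_threads <= 1 || total < min_par_len {
        copy_axes(src, out, dims, src_strides, &dst_strides, &order, 0, 0);
        return;
    }
    // Peel off the slowest source axes until there is at least one task per thread.
    let (mut cut, mut tasks) = (rank, 1usize);
    while cut > 0 && tasks < n_threads {
        cut -= 1;
        tasks *= dims[order[cut]];
    }
    let (inner, outer) = order.split_at(cut);
    let chunk = (tasks + n_threads - 1) / n_threads;
    let ds = &dst_strides;
    thread::scope(|scope| {
        for start in (0..tasks).step_by(chunk) {
            let end = (start + chunk).min(tasks);
            scope.spawn(move || {
                for t in start..end {
                    // Decode the linear task id into indices along the outer axes.
                    let (mut rem, mut s0, mut d0) = (t, 0usize, 0usize);
                    for &ax in outer {
                        s0 += (rem % dims[ax]) * src_strides[ax];
                        d0 += (rem % dims[ax]) * ds[ax];
                        rem /= dims[ax];
                    }
                    copy_axes(src, out, dims, src_strides, ds, inner, s0, d0);
                }
            });
        }
    });
}
```

Fusing several outer axes into one task index matters for the benchmark's shape: with 24 dimensions of size 2, a single outer axis would yield only two tasks, whereas peeling axes until the task count reaches the thread count keeps every thread busy. Each thread still reads a contiguous region of the source when the source itself is contiguous.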

Test plan

cargo test -p strided-einsum2 --features parallel (84 tests pass)
cargo test -p strided-opteinsum --features parallel (163 tests pass)
Full benchmark suite 1T: no regressions
Full benchmark suite 4T: parallel copy improves large instances
CI tests (default features + parallel feature)

🤖 Generated with Claude Code

Replace HPTT-based copy_into_col_major with a source-stride-order copy for GEMM input preparation. HPTT iterates in destination-stride order, causing scattered reads from cold L3 cache when the source is a large contiguous tensor with many small permuted dimensions (e.g. 24 binary dims of size 2). Source-stride-order iteration gives sequential reads that exploit the hardware prefetcher.

Add optional rayon parallelization (--features parallel) that splits the outer source-stride dimensions across threads, with automatic fallback to single-threaded for small tensors or RAYON_NUM_THREADS=1.

Benchmark (tensornetwork_permutation_light_415, AMD EPYC 7713P):
- HPTT (original) 1T: 455 ms
- Source-order copy 1T: 298 ms (-34%)
- Source-order copy + parallel 4T: 228 ms (-50%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shinaoka mentioned this pull request Feb 19, 2026

feat: add parallel feature and micro_bench scripts tensor4all/strided-rs-benchmark-suite#26

Merged

2 tasks

shinaoka merged commit b28e07d into main Feb 19, 2026
5 checks passed

shinaoka deleted the perf/source-order-copy branch February 19, 2026 11:16
