perf: source-stride-order copy in prepare_input_owned #117

Merged
shinaoka merged 1 commit into main from perf/source-order-copy
Feb 19, 2026

Conversation

@shinaoka
Member

Summary

  • Replace HPTT-based copy_into_col_major with source-stride-order copy (copy_strided_src_order) in prepare_input_owned. HPTT iterates in destination-stride order, causing scattered reads from cold L3 cache when the source is contiguous with many small permuted dimensions (e.g. 24 binary dims of size 2). Source-stride-order iteration gives sequential reads that exploit the hardware prefetcher.
  • Add optional rayon parallelization (--features parallel) via copy_strided_src_order_par that splits outer source-stride dimensions across threads, with automatic fallback to sequential for small tensors or RAYON_NUM_THREADS=1.
  • Update docs/permutation-optimization.md to reflect the new strategy and benchmark results.
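
To illustrate the idea behind the first bullet, here is a minimal sketch of a source-stride-order copy. It is not the code from this PR: the function name `copy_src_order_sketch`, the `f64` element type, and the plain slice-and-stride signature are assumptions made for the example. The loop nest is ordered by increasing source stride, so a contiguous source is read at offsets 0, 1, 2, ... while the writes to the destination are the scattered side.

```rust
/// Hypothetical sketch (not the PR's `copy_strided_src_order`): copy a strided
/// view into a strided destination, with the loop nest ordered by increasing
/// source stride so that reads walk the source as sequentially as possible.
fn copy_src_order_sketch(
    src: &[f64],
    dst: &mut [f64],
    dims: &[usize],
    src_strides: &[usize],
    dst_strides: &[usize],
) {
    // Dimension indices sorted by source stride: order[0] is the innermost loop.
    let mut order: Vec<usize> = (0..dims.len()).collect();
    order.sort_by_key(|&d| src_strides[d]);

    let total: usize = dims.iter().product();
    // Odometer-style multi-index, stored in `order` sequence.
    let mut idx = vec![0usize; dims.len()];
    let (mut s_off, mut d_off) = (0usize, 0usize);

    for _ in 0..total {
        dst[d_off] = src[s_off];
        // Advance the odometer, smallest source stride first.
        for (k, &d) in order.iter().enumerate() {
            idx[k] += 1;
            s_off += src_strides[d];
            d_off += dst_strides[d];
            if idx[k] < dims[d] {
                break;
            }
            // This dimension wrapped: rewind it and carry into the next one.
            s_off -= dims[d] * src_strides[d];
            d_off -= dims[d] * dst_strides[d];
            idx[k] = 0;
        }
    }
}
```

For a 2x3 row-major source (source strides `[3, 1]`) copied into a column-major destination (destination strides `[1, 2]`), the source offsets are visited in the order 0 through 5. A destination-stride-order loop would instead read the source at offsets 0, 3, 1, 4, 2, 5, which is the access pattern the summary attributes to HPTT.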

Benchmark results

tensornetwork_permutation_light_415 (415 tensors, 24 binary dims, AMD EPYC 7713P):

| Configuration | opt_flops (ms) | Change |
|---|---|---|
| HPTT (original) 1T | 455 | baseline |
| Source-order copy 1T | 298 | -34% |
| Source-order + parallel 4T | 228 | -50% |

Full benchmark suite (10 instances, 1T): no regressions on other instances.

4T parallel copy effect (isolated from faer GEMM parallelism):

| Instance | 4T no-parallel | 4T parallel | Change |
|---|---|---|---|
| tensornetwork_light_415 | 318 ms | 228 ms | -28% |
| tensornetwork_focus_409 | 319 ms | 226 ms | -29% |
| mera_closed_120 | 987 ms | 797 ms | -19% |
| mera_open_26 | 605 ms | 529 ms | -12% |
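
The parallel path measured above splits the outer source-stride dimensions across threads and falls back to the sequential copy for small tensors or a single thread. The sketch below shows one way such a decision could be made. It is an assumption-laden illustration, not the logic in `copy_strided_src_order_par`: the function name `outer_dims_to_split`, the `min_elems` threshold, and the rule "fuse outer dimensions until there is at least one task per thread" are all hypothetical.

```rust
/// Hypothetical sketch: decide how many of the outermost dimensions (those with
/// the largest source strides) to fuse into parallel tasks.
/// Returns 0 when the copy should stay sequential.
fn outer_dims_to_split(
    dims_by_src_stride: &[usize], // extents, sorted by increasing source stride
    n_threads: usize,
    min_elems: usize, // assumed size threshold below which threading is not worth it
) -> usize {
    let total: usize = dims_by_src_stride.iter().product();
    // Fallback: a single thread or a small tensor stays on the sequential path.
    if n_threads <= 1 || total < min_elems {
        return 0;
    }
    // Walk from the outermost dimension inward, fusing dimensions until the
    // number of independent tasks covers the available threads.
    let mut tasks = 1usize;
    let mut taken = 0usize;
    for &extent in dims_by_src_stride.iter().rev() {
        if tasks >= n_threads {
            break;
        }
        tasks *= extent;
        taken += 1;
    }
    taken
}
```

With 24 binary dimensions and 4 threads, a single outer dimension yields only 2 tasks, so this rule fuses the two outermost dimensions into 4 tasks. Each task then covers a contiguous quarter of a contiguous source, which keeps the per-thread reads sequential.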

Test plan

  • cargo test -p strided-einsum2 --features parallel (84 tests pass)
  • cargo test -p strided-opteinsum --features parallel (163 tests pass)
  • Full benchmark suite 1T: no regressions
  • Full benchmark suite 4T: parallel copy improves large instances
  • CI tests (default features + parallel feature)

🤖 Generated with Claude Code

Replace HPTT-based copy_into_col_major with a source-stride-order copy
for GEMM input preparation. HPTT iterates in destination-stride order,
causing scattered reads from cold L3 cache when the source is a large
contiguous tensor with many small permuted dimensions (e.g. 24 binary
dims of size 2). Source-stride-order iteration gives sequential reads
that exploit the hardware prefetcher.

Add optional rayon parallelization (--features parallel) that splits the
outer source-stride dimensions across threads, with automatic fallback
to single-threaded for small tensors or RAYON_NUM_THREADS=1.

Benchmark (tensornetwork_permutation_light_415, AMD EPYC 7713P):
- HPTT (original) 1T:            455 ms
- Source-order copy 1T:           298 ms (-34%)
- Source-order copy + parallel 4T: 228 ms (-50%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shinaoka shinaoka merged commit b28e07d into main Feb 19, 2026
5 checks passed
@shinaoka shinaoka deleted the perf/source-order-copy branch February 19, 2026 11:16