perf: source-stride-order copy in prepare_input_owned #117
Merged
Replace HPTT-based copy_into_col_major with a source-stride-order copy for GEMM input preparation. HPTT iterates in destination-stride order, causing scattered reads from cold L3 cache when the source is a large contiguous tensor with many small permuted dimensions (e.g. 24 binary dims of size 2). Source-stride-order iteration gives sequential reads that exploit the hardware prefetcher.

Add optional rayon parallelization (--features parallel) that splits the outer source-stride dimensions across threads, with automatic fallback to single-threaded for small tensors or RAYON_NUM_THREADS=1.

Benchmark (tensornetwork_permutation_light_415, AMD EPYC 7713P):
- HPTT (original) 1T: 455 ms
- Source-order copy 1T: 298 ms (-34%)
- Source-order copy + parallel 4T: 228 ms (-50%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
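
For readers unfamiliar with the idea, here is a minimal single-threaded sketch of a source-stride-order copy. It is an illustration only, not the `copy_strided_src_order` implementation from this PR: the function name `copy_src_order_sketch`, its signature, and the use of `f64` elements with non-negative strides are all assumptions made for the example. The dimensions are sorted by source stride so that the innermost loop advances through the source with the smallest stride, which makes the reads sequential while the writes are the ones that scatter.

```rust
/// Hypothetical sketch (not the PR's code): copy `src` into `dst`,
/// iterating dimensions in order of increasing *source* stride so that
/// reads walk the source as sequentially as possible.
///
/// Assumes non-negative strides and that every computed offset is in bounds.
fn copy_src_order_sketch(
    src: &[f64],
    dst: &mut [f64],
    shape: &[usize],
    src_strides: &[usize],
    dst_strides: &[usize],
) {
    let ndim = shape.len();
    if shape.iter().any(|&n| n == 0) {
        return; // empty tensor, nothing to copy
    }

    // order[0] is the dimension with the smallest source stride (innermost).
    let mut order: Vec<usize> = (0..ndim).collect();
    order.sort_by_key(|&d| src_strides[d]);

    let total: usize = shape.iter().product();
    let mut idx = vec![0usize; ndim]; // odometer, indexed by position in `order`
    let mut s = 0usize; // current source offset
    let mut d = 0usize; // current destination offset

    for _ in 0..total {
        dst[d] = src[s];

        // Advance the odometer; the fastest digit has the smallest source stride.
        for (k, &dim) in order.iter().enumerate() {
            idx[k] += 1;
            s += src_strides[dim];
            d += dst_strides[dim];
            if idx[k] < shape[dim] {
                break;
            }
            // This digit overflowed: rewind it and carry into the next one.
            s -= src_strides[dim] * shape[dim];
            d -= dst_strides[dim] * shape[dim];
            idx[k] = 0;
        }
    }
}
```

As a usage example, copying a contiguous 2x2x2 tensor (source strides `[4, 2, 1]`) into a layout with reversed strides `[1, 2, 4]` reads source offsets 0, 1, 2, ... in order and writes to the bit-reversed destination offsets. A destination-stride-order loop would do the opposite, which is the scattered-read pattern described above.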
Summary
- Replace HPTT-based `copy_into_col_major` with source-stride-order copy (`copy_strided_src_order`) in `prepare_input_owned`. HPTT iterates in destination-stride order, causing scattered reads from cold L3 cache when the source is contiguous with many small permuted dimensions (e.g. 24 binary dims of size 2). Source-stride-order iteration gives sequential reads that exploit the hardware prefetcher.
- Add optional rayon parallelization (`--features parallel`) via `copy_strided_src_order_par` that splits outer source-stride dimensions across threads, with automatic fallback to sequential for small tensors or `RAYON_NUM_THREADS=1`.
- Update `docs/permutation-optimization.md` to reflect the new strategy and benchmark results.

Benchmark results
`tensornetwork_permutation_light_415` (415 tensors, 24 binary dims, AMD EPYC 7713P):

Full benchmark suite (10 instances, 1T): no regressions on other instances.
4T parallel copy effect (isolated from faer GEMM parallelism):
Test plan
- `cargo test -p strided-einsum2 --features parallel` (84 tests pass)
- `cargo test -p strided-opteinsum --features parallel` (163 tests pass)

🤖 Generated with Claude Code