x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693
Open
crafcat7 wants to merge 2 commits into Tencent:master from
Conversation
Summary:
Add an x86 SIMD fast path for CumulativeSum targeting the serial-scan axes (dims=1, dims=2 axis=1, dims=3 axis=2), where the base scalar code carries a true prefix-sum data dependency and the compiler cannot auto-vectorize. The kernel performs an in-register Kogge-Stone scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound axes with no inner dependency are left to the base implementation, since the compiler already auto-vectorizes them at memory bandwidth.
Changes:
1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86
2. Implement prefix_sum_row() with an AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and an SSE2 4-wide fallback
3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2)
4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes
Summary:
Add dedicated boundary test cases that exercise w at, just below, and
just above common SIMD vector widths (4 / 8 / 16). The cases cover the
single-tile no-tail path, the running base propagation across multiple
tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
tests only hit these paths incidentally; the new cases make the coverage
explicit and remain valid for any vectorized backend.
Changes:
1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
3. Wire the new function into main() alongside the existing 1d/2d/3d groups
Summary
Adds an x86-specific implementation of CumulativeSum (src/layer/x86/cumulativesum_x86.{h,cpp}) that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone parallel prefix scan (SSE2 4-lane fallback), yielding ~2.5× single-thread speedup and ~1.7× 8-thread speedup on pure-scan workloads. Paths that the compiler can already auto-vectorize are left untouched.
Motivation
CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks the data with an inner serial recurrence (each element adds the one before it). Of the 6 (dims, axis) cases handled by the base, three have the scan along the inner contiguous dimension and suffer from this serial dependency:
- dims == 1
- dims == 2, axis == 1
- dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each iteration touches independent rows/channels and GCC -O3 already vectorizes the inner loop to memory-bandwidth bound. Adding SIMD there yields zero measurable gain (measured: within ±2%), so this patch keeps them on the base path and only specializes the three dependent cases.
Algorithm
An inner row of w floats gets an in-place SIMD prefix sum using a classic Kogge-Stone tree on 8 lanes:
- Stage 1: shift lanes up by 1 (_mm256_slli_si256 by 4 B), then add.
- Stage 2: shift lanes up by 2 (8 B), then add. After stages 1–2 each 128-bit half of the register holds the prefix sum of its 4 lanes.
- Stage 3: broadcast the low half's last lane across the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.

The tail (<8 lanes) is finished with a scalar accumulator. The SSE2 path (no AVX2) uses the same structure on 4 lanes with just stages 1 and 2.
We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.
Dispatch
The x86 layer fast-paths the three serial-scan cases and calls the base implementation for everything else (any other dims/axis, non-fp32, or non-pack1). support_packing is left at its inherited default, so the packing machinery auto-unpacks to pack1 before forward_inplace.
Multi-threading:
- dims == 2, axis == 1: #pragma omp parallel for over rows.
- dims == 3, axis == 2: #pragma omp parallel for collapse(2) over (channel, row), which gives all 8 threads useful work even when c is small.
Correctness
tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:
- axis=1: widths 1, 3, 8, 16, 17 with 5 rows.
- axis=2: widths 1, 3, 8, 16, 17 with 5 rows × 3 channels, verifying the collapse(2) parallel region.

All 22 cases pass on the SIMD build.
Performance
Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen 5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).
Baseline build is identical but with src/layer/x86/cumulativesum_x86.{h,cpp} removed, so the base scalar implementation is used.
Benchmark matrix
Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:
- cumsum_1d_demo [65536]
- cumsum_2d_axis1_demo [512, 512]
- cumsum_axis2_demo [256, 256, 32]
- cumsum_demo [256, 256, 32]

All numbers are min milliseconds over 100 iterations on cores 0-7.
Interpretation
Only the axis=2 layer uses the SIMD kernel; the first two (axis=0, axis=1) fall back to the auto-vectorized base by design. The overall speedup matches the weighted contribution of the axis=2 layer.
No regressions
All other layers and networks are unaffected: the new class only overrides forward_inplace for CumulativeSum and falls back to the base for any case it does not specialize.