x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693

Open
crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:feat/x86-cumulativesum

Conversation

@crafcat7
Contributor

Summary

Adds an x86-specific implementation of CumulativeSum (src/layer/x86/cumulativesum_x86.{h,cpp}) that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone parallel prefix scan (SSE2 4-lane fallback), yielding ~2.5× single-thread speedup and ~1.7× 8-thread speedup on pure-scan workloads. Paths that the compiler can already auto-vectorize are left untouched.

Motivation

CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks data with an inner serial recurrence:

for (int k = 1; k < w; k++)
    ptr[k] = ptr[k] + ptr[k - 1];  // RAW dependency → cannot auto-vectorize

Of the six (dims, axis) cases handled by the base implementation, three scan along the inner contiguous dimension and suffer from this serial dependency:

  • dims == 1
  • dims == 2, axis == 1
  • dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each iteration touches independent rows/channels, and GCC -O3 already vectorizes the inner loop until it is memory-bandwidth bound. Adding SIMD there yields no measurable gain (within ±2% in measurement), so this patch keeps them on the base path and only specializes the three dependent cases.
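For contrast, this is roughly the shape of the dims=2 axis=0 case that the compiler already handles well (an illustrative sketch, not the exact base-layer code; bottom_top_blob follows ncnn's Mat API):

for (int i = 1; i < h; i++)
{
    const float* prev = bottom_top_blob.row(i - 1);
    float* ptr = bottom_top_blob.row(i);
    for (int j = 0; j < w; j++)
    {
        ptr[j] += prev[j];  // no cross-iteration dependency → auto-vectorized
    }
}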

Algorithm

Each inner row of w floats is scanned in place with a classic Kogge-Stone tree on 8 lanes:

  1. Stage 1 — shift-by-1 within each 128-bit half (_mm256_slli_si256 by 4B), then add.
  2. Stage 2 — shift-by-2 within each 128-bit half (8B), then add.
    After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
  3. Stage 3 — broadcast the last lane of the low half into all 4 lanes of the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.
  4. Running base — add a broadcast of the previous tile's last lane, then update the base with the new last lane for the next tile.
lanes:   a b c d | e f g h
stage 1: a ab bc cd | e ef fg gh
stage 2: a ab abc abcd | e ef efg efgh
stage 3: a ab abc abcd | abcd+e abcd+ef abcd+efg abcd+efgh
+ base : propagated across tiles

Tail (<8 lanes) is finished with a scalar accumulator. SSE2 path (no AVX2) uses the same structure on 4 lanes with just stages 1 and 2.
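A condensed sketch of the AVX2 tile loop described above, assuming the kernel name prefix_sum_row from the commit message (the actual code in src/layer/x86/cumulativesum_x86.cpp may differ in detail):

#include <immintrin.h>

static void prefix_sum_row(float* ptr, int w)
{
    __m256 base = _mm256_setzero_ps();
    int k = 0;
    for (; k + 8 <= w; k += 8)
    {
        __m256 v = _mm256_loadu_ps(ptr + k);

        // stage 1: shift by 1 lane (4B) within each 128-bit half, then add
        v = _mm256_add_ps(v, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 4)));
        // stage 2: shift by 2 lanes (8B) within each 128-bit half, then add
        v = _mm256_add_ps(v, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 8)));

        // stage 3: broadcast the last lane of the low half into the high half, then add
        __m256 t = _mm256_permute2f128_ps(v, v, 0x08);  // [zero | low half]
        t = _mm256_shuffle_ps(t, t, 0xff);              // lane 3 of each half
        v = _mm256_add_ps(v, t);

        // running base: carry the total of all previous tiles into this one
        v = _mm256_add_ps(v, base);
        _mm256_storeu_ps(ptr + k, v);

        // next base = last lane of this tile, broadcast to all 8 lanes
        __m256 hi = _mm256_permute2f128_ps(v, v, 0x11); // [high half | high half]
        base = _mm256_shuffle_ps(hi, hi, 0xff);
    }

    // scalar tail (<8 remaining elements), seeded from the last SIMD lane
    float sum = (k > 0) ? ptr[k - 1] : 0.f;
    for (; k < w; k++)
    {
        sum += ptr[k];
        ptr[k] = sum;
    }
}

The SSE2 fallback is the same loop on __m128 with only stages 1 and 2 (_mm_slli_si128 by 4B and 8B) and no cross-half step.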

We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.

Dispatch

The x86 layer fast-paths the three serial-scan cases and calls the base implementation for everything else (any other dims/axis, non-fp32, or non-pack1). support_packing is left at its inherited default so the packing machinery auto-unpacks to pack1 before forward_inplace.

Multi-threading:

  • dims == 2, axis == 1: #pragma omp parallel for over rows.
  • dims == 3, axis == 2: #pragma omp parallel for collapse(2) over (channel, row) — gives all 8 threads useful work even when c is small (see the sketch below).
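A hedged sketch of the routing plus the collapse(2) region (the axis member and the prefix_sum_row kernel name are assumed from the descriptions above, not copied from the PR):

int CumulativeSum_x86::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
    if (bottom_top_blob.dims == 3 && axis == 2 && bottom_top_blob.elempack == 1)
    {
        const int w = bottom_top_blob.w;
        const int h = bottom_top_blob.h;
        const int channels = bottom_top_blob.c;

        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)
        for (int q = 0; q < channels; q++)
        {
            for (int i = 0; i < h; i++)
            {
                // each (channel, row) pair is an independent serial scan
                prefix_sum_row(bottom_top_blob.channel(q).row(i), w);
            }
        }
        return 0;
    }
    // dims=1 and dims=2 axis=1 fast paths omitted here;
    // every other case falls back to the base implementation
    return CumulativeSum::forward_inplace(bottom_top_blob, opt);
}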

Correctness

ctest --test-dir build -R test_cumulativesum --output-on-failure
1/1 Test #44: test_cumulativesum ............... Passed

tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:

  • 1D lengths 1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17, 32 — covers tail-only, exact-vector, vector-plus-partial-tail, and multi-tile lengths that exercise the running-base propagation.
  • 2D axis=1 widths 1, 3, 8, 16, 17 with 5 rows.
  • 3D axis=2 widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the collapse(2) parallel region.

All 22 cases pass on the SIMD build.
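For illustration, one 1-D boundary case in the style of ncnn's testutil helpers (test_layer and RandomMat come from tests/testutil.h; the exact wrapper added in this PR may be shaped differently):

#include "testutil.h"  // test_layer(), RandomMat()

static int test_cumulativesum_boundary_1d(int w)
{
    ncnn::ParamDict pd;
    pd.set(0, 0);  // param 0 = axis (assumed id); 0 is the only axis for 1-D input

    std::vector<ncnn::Mat> weights(0);

    int ret = test_layer("CumulativeSum", pd, weights, RandomMat(w));
    if (ret != 0)
        fprintf(stderr, "test_cumulativesum_boundary_1d failed w=%d\n", w);

    return ret;
}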

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Baseline build is identical but with src/layer/x86/cumulativesum_x86.{h,cpp} removed so the base scalar implementation is used.

Benchmark matrix

Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:

Demo                  Shape           Axis path         1T base  1T opt  1T ×   8T base  8T opt  8T ×
cumsum_1d_demo        [65536]         dims=1            0.07     0.04    1.75×  0.08     0.04    2.00×
cumsum_2d_axis1_demo  [512, 512]      dims=2 axis=1     0.29     0.13    2.23×  0.06     0.03    2.00×
cumsum_axis2_demo     [256, 256, 32]  dims=3 axis=2     2.25     0.90    2.50×  0.47     0.28    1.68×
cumsum_demo           [256, 256, 32]  mixed axis=0→1→2  1.07     0.59    1.81×  0.40     0.32    1.25×

All numbers are min milliseconds over 100 iterations on cores 0-7.

Interpretation

  • RAW-dependency paths (dims=1, dims=2 axis=1, dims=3 axis=2): single-thread speedups of 1.75×–2.50× are consistent with the theoretical 8× lane count, discounted by the 3-stage Kogge-Stone shuffle chain and the per-tile running-base add.
  • Multi-thread scaling: the base implementation already parallelizes over the channel dimension for dims=3. At 8 threads both versions share the same outer parallelism and the bottleneck shifts toward L3 bandwidth. The 1.68× at 8T on the pure axis=2 demo is the single-core ALU saving carried through.
  • Mixed demo (1.81× at 1T, 1.25× at 8T): only the final axis=2 layer uses the SIMD kernel; the first two (axis=0, axis=1) fall back to the auto-vectorized base by design. The overall speedup matches the weighted contribution of the axis=2 layer.

No regressions

All other layers and networks are unaffected — the new class only overrides forward_inplace for CumulativeSum and falls back to the base for any case it does not specialize.

Commits

Commit 1 — Summary:
  Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan
  axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar
  code carries a true prefix-sum data dependency and the compiler
  cannot auto-vectorize. The kernel performs an in-register Kogge-Stone
  scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning
  N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound
  axes with no inner dependency are left to the base implementation
  since the compiler already auto-vectorizes them at memory bandwidth.

Changes:
  1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86
  2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback
  3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2)
  4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes

Commit 2 — Summary:
  Add dedicated boundary test cases that exercise w at, just below, and
  just above common SIMD vector widths (4 / 8 / 16). The cases cover the
  single-tile no-tail path, the running base propagation across multiple
  tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
  tests only hit these paths incidentally; the new cases make the coverage
  explicit and remain valid for any vectorized backend.

Changes:
  1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
  2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
  3. Wire the new function into main() alongside the existing 1d/2d/3d groups