x86: optimize CumulativeSum with SIMD Kogge-Stone prefix scan#6693

Open
crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:feat/x86-cumulativesum

Conversation

@crafcat7
Contributor

Summary

Adds an x86-specific implementation of CumulativeSum (src/layer/x86/cumulativesum_x86.{h,cpp}) that replaces the scalar prefix-sum loop on inner-dim scans with an AVX2 8-lane Kogge-Stone parallel prefix scan (SSE2 4-lane fallback), yielding ~2.5× single-thread speedup and ~1.7× 8-thread speedup on pure-scan workloads. Paths that the compiler can already auto-vectorize are left untouched.

Motivation

CumulativeSum::forward_inplace in src/layer/cumulativesum.cpp walks data with an inner serial recurrence:

for (int k = 1; k < w; k++)
    ptr[k] = ptr[k] + ptr[k - 1];  // RAW dependency → cannot auto-vectorize

Of the six (dims, axis) cases handled by the base implementation, three scan along the inner contiguous dimension and suffer from this serial dependency:

  • dims == 1
  • dims == 2, axis == 1
  • dims == 3, axis == 2

The other three cases (dims=2 axis=0, dims=3 axis=0/1) scan across the outer dimension, so each iteration touches independent rows/channels, and GCC -O3 already vectorizes the inner loop until it is memory-bandwidth bound. Adding SIMD there yields no measurable gain (within ±2% in measurement), so this patch keeps them on the base path and only specializes the three dependent cases.
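For contrast, this is roughly the shape of the dims=2 axis=0 case that the compiler already handles well (an illustrative sketch, not the exact base-layer code; bottom_top_blob follows ncnn's Mat API):

for (int i = 1; i < h; i++)
{
    const float* prev = bottom_top_blob.row(i - 1);
    float* ptr = bottom_top_blob.row(i);
    for (int j = 0; j < w; j++)
    {
        ptr[j] += prev[j];  // no cross-iteration dependency → auto-vectorized
    }
}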

Algorithm

Each inner row of w floats is scanned in place with a classic Kogge-Stone tree on 8 lanes:

  1. Stage 1 — shift-by-1 within each 128-bit half (_mm256_slli_si256 by 4B), then add.
  2. Stage 2 — shift-by-2 within each 128-bit half (8B), then add.
    After stages 1–2 each half of the register holds the prefix sum of its 4 lanes.
  3. Stage 3 — broadcast the last lane of the low half into all 4 lanes of the high half via _mm256_permute2f128_ps(v, v, 0x08) + _mm256_shuffle_ps(_, _, 0xff), then add. The full 8-lane prefix is now in v.
  4. Running base — add a broadcast of the previous tile's last lane, then update the base with the new last lane for the next tile.
lanes:   a b c d | e f g h
stage 1: a ab bc cd | e ef fg gh
stage 2: a ab abc abcd | e ef efg efgh
stage 3: a ab abc abcd | abcd+e abcd+ef abcd+efg abcd+efgh
+ base : propagated across tiles

Tail (<8 lanes) is finished with a scalar accumulator. SSE2 path (no AVX2) uses the same structure on 4 lanes with just stages 1 and 2.
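A condensed sketch of the AVX2 tile loop described above, assuming the kernel name prefix_sum_row from the commit message (the actual code in src/layer/x86/cumulativesum_x86.cpp may differ in detail):

#include <immintrin.h>

static void prefix_sum_row(float* ptr, int w)
{
    __m256 base = _mm256_setzero_ps();
    int k = 0;
    for (; k + 8 <= w; k += 8)
    {
        __m256 v = _mm256_loadu_ps(ptr + k);

        // stage 1: shift by 1 lane (4B) within each 128-bit half, then add
        v = _mm256_add_ps(v, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 4)));
        // stage 2: shift by 2 lanes (8B) within each 128-bit half, then add
        v = _mm256_add_ps(v, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(v), 8)));

        // stage 3: broadcast the last lane of the low half into the high half, then add
        __m256 t = _mm256_permute2f128_ps(v, v, 0x08);  // [zero | low half]
        t = _mm256_shuffle_ps(t, t, 0xff);              // lane 3 of each half
        v = _mm256_add_ps(v, t);

        // running base: carry the total of all previous tiles into this one
        v = _mm256_add_ps(v, base);
        _mm256_storeu_ps(ptr + k, v);

        // next base = last lane of this tile, broadcast to all 8 lanes
        __m256 hi = _mm256_permute2f128_ps(v, v, 0x11); // [high half | high half]
        base = _mm256_shuffle_ps(hi, hi, 0xff);
    }

    // scalar tail (<8 remaining elements), seeded from the last SIMD lane
    float sum = (k > 0) ? ptr[k - 1] : 0.f;
    for (; k < w; k++)
    {
        sum += ptr[k];
        ptr[k] = sum;
    }
}

The SSE2 fallback is the same loop on __m128 with only stages 1 and 2 (_mm_slli_si128 by 4B and 8B) and no cross-half step.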

We intentionally do not extend this to a 16-lane AVX-512 version: that requires a 4th stage plus 3 cross-lane permutes, lengthening the serial dependency chain faster than the lane count grows. Empirically the 8-lane AVX2 path already saturates the single-core ALU throughput on Zen5.

Dispatch

The x86 layer fast-paths the three serial-scan cases and calls the base implementation for everything else (any other dims/axis, non-fp32, or non-pack1). support_packing is left at its inherited default so the packing machinery auto-unpacks to pack1 before forward_inplace.

Multi-threading:

  • dims == 2, axis == 1: #pragma omp parallel for over rows.
  • dims == 3, axis == 2: #pragma omp parallel for collapse(2) over (channel, row) — gives all 8 threads useful work even when c is small (see the sketch below).
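A hedged sketch of the routing plus the collapse(2) region (the axis member and the prefix_sum_row kernel name are assumed from the descriptions above, not copied from the PR):

int CumulativeSum_x86::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
    if (bottom_top_blob.dims == 3 && axis == 2 && bottom_top_blob.elempack == 1)
    {
        const int w = bottom_top_blob.w;
        const int h = bottom_top_blob.h;
        const int channels = bottom_top_blob.c;

        #pragma omp parallel for collapse(2) num_threads(opt.num_threads)
        for (int q = 0; q < channels; q++)
        {
            for (int i = 0; i < h; i++)
            {
                // each (channel, row) pair is an independent serial scan
                prefix_sum_row(bottom_top_blob.channel(q).row(i), w);
            }
        }
        return 0;
    }
    // dims=1 and dims=2 axis=1 fast paths omitted here;
    // every other case falls back to the base implementation
    return CumulativeSum::forward_inplace(bottom_top_blob, opt);
}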

Correctness

ctest --test-dir build -R test_cumulativesum --output-on-failure
1/1 Test #44: test_cumulativesum ............... Passed

tests/test_cumulativesum.cpp is extended with boundary cases that exercise every edge of the SIMD tiling:

  • 1D lengths 1, 2, 3, 4, 5, 7, 8, 9, 15, 16, 17, 32 — covers tail-only, exact-vector, vector-plus-partial-tail, and multi-tile lengths that exercise the running-base propagation.
  • 2D axis=1 widths 1, 3, 8, 16, 17 with 5 rows.
  • 3D axis=2 widths 1, 3, 8, 16, 17 with 5 rows × 3 channels — verifies the collapse(2) parallel region.

All 22 cases pass on the SIMD build.
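For illustration, one 1-D boundary case in the style of ncnn's testutil helpers (test_layer and RandomMat come from tests/testutil.h; the exact wrapper added in this PR may be shaped differently):

#include "testutil.h"  // test_layer(), RandomMat()

static int test_cumulativesum_boundary_1d(int w)
{
    ncnn::ParamDict pd;
    pd.set(0, 0);  // param 0 = axis (assumed id); 0 is the only axis for 1-D input

    std::vector<ncnn::Mat> weights(0);

    int ret = test_layer("CumulativeSum", pd, weights, RandomMat(w));
    if (ret != 0)
        fprintf(stderr, "test_cumulativesum_boundary_1d failed w=%d\n", w);

    return ret;
}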

Performance

Environment: Linux / WSL2, AMD Ryzen 7 9800X3D (Zen5, full AVX-512 family), g++, -O3 -DNDEBUG, AVX/AVX2/FMA/AVX-512 all enabled. Measurements use benchncnn with loop=100, cooldown=0, taskset -c 0-7; reported as the min metric (most stable at sub-millisecond workloads).

Baseline build is identical but with src/layer/x86/cumulativesum_x86.{h,cpp} removed so the base scalar implementation is used.

Benchmark matrix

Four demo graphs, each stacking 3 CumulativeSum layers to amortize benchmark harness noise:

Demo                  Shape           Axis path         1T base  1T opt  1T ×   8T base  8T opt  8T ×
cumsum_1d_demo        [65536]         dims=1            0.07     0.04    1.75×  0.08     0.04    2.00×
cumsum_2d_axis1_demo  [512, 512]      dims=2 axis=1     0.29     0.13    2.23×  0.06     0.03    2.00×
cumsum_axis2_demo     [256, 256, 32]  dims=3 axis=2     2.25     0.90    2.50×  0.47     0.28    1.68×
cumsum_demo           [256, 256, 32]  mixed axis=0→1→2  1.07     0.59    1.81×  0.40     0.32    1.25×

All numbers are min milliseconds over 100 iterations on cores 0-7.

Interpretation

  • RAW-dependency paths (dims=1, dims=2 axis=1, dims=3 axis=2): single-thread speedups of 1.75×–2.50× are consistent with the theoretical 8× lane count, discounted by the 3-stage Kogge-Stone shuffle chain and the per-tile running-base add.
  • Multi-thread scaling: the base implementation already parallelizes over the channel dimension for dims=3. At 8 threads both versions share the same outer parallelism and the bottleneck shifts toward L3 bandwidth. The 1.68× at 8T on the pure axis=2 demo is the single-core ALU saving carried through.
  • Mixed demo (1.81× at 1T, 1.25× at 8T): only the final axis=2 layer uses the SIMD kernel; the first two (axis=0, axis=1) fall back to the auto-vectorized base by design. The overall speedup matches the weighted contribution of the axis=2 layer.

No regressions

All other layers and networks are unaffected — the new class only overrides forward_inplace for CumulativeSum and falls back to the base for any case it does not specialize.

Commits

Commit 1 — Summary:
  Add x86 SIMD fast-path for CumulativeSum targeting the serial-scan
  axes (dims=1, dims=2 axis=1, dims=3 axis=2) where the base scalar
  code carries a true prefix-sum data dependency and the compiler
  cannot auto-vectorize. The kernel performs an in-register Kogge-Stone
  scan (AVX2 8-lane / SSE2 4-lane) with a running tile base, turning
  N scalar adds into log2(vec) SIMD adds per tile. Bandwidth-bound
  axes with no inner dependency are left to the base implementation
  since the compiler already auto-vectorizes them at memory bandwidth.

Changes:
  1. Add src/layer/x86/cumulativesum_x86.h declaring CumulativeSum_x86
  2. Implement prefix_sum_row() with AVX2 8-wide Kogge-Stone scan (3 stages + cross-128 propagation) and SSE2 4-wide fallback
  3. Route dims=1 / dims=2 axis=1 / dims=3 axis=2 to the SIMD scan path; dims=3 axis=2 runs in parallel over (channel, row) via OpenMP collapse(2)
  4. Fall back to CumulativeSum::forward_inplace for non-pack1, non-fp32, and bandwidth-bound axes

Commit 2 — Summary:
  Add dedicated boundary test cases that exercise w at, just below, and
  just above common SIMD vector widths (4 / 8 / 16). The cases cover the
  single-tile no-tail path, the running base propagation across multiple
  tiles, and the scalar tail seam (sum carried via ptr[j-1]). The existing
  tests only hit these paths incidentally; the new cases make the coverage
  explicit and remain valid for any vectorized backend.

Changes:
  1. Add test_cumulativesum_boundary() covering dims=1/2/3 with w in {7, 8, 9, 15, 16, 17, 32}
  2. Include axis=1 (row-parallel) and axis=2 (collapse(2) channel+row) variants to validate the OpenMP parallel paths
  3. Wire the new function into main() alongside the existing 1d/2d/3d groups