Accelerate SVE128 SBGEMM/BGEMM by fadara01 · Pull Request #5667 · OpenMathLib/OpenBLAS

fadara01 · 2026-03-05T13:52:28Z

This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to 8x8 (unrolling N by 8)
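For readers unfamiliar with why a wider output tile can help, here is a minimal sketch of generic register blocking. It assumes "8x4" and "8x8" denote M x N output tiles, and it only models operand reuse: the actual SVE kernel, its instructions, and its data layout are not shown in this thread. The names `naive_gemm` and `tiled_gemm` are hypothetical.

```python
def naive_gemm(a, b):
    """Reference C = A @ B on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_gemm(a, b, mr=8, nr=8):
    """C = A @ B computed in mr x nr output tiles.

    Also counts operand loads and multiply-adds, so the reuse of each
    loaded value can be compared between tile shapes.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    loads = fmas = 0
    for i0 in range(0, m, mr):
        for j0 in range(0, n, nr):
            rows = range(i0, min(i0 + mr, m))
            cols = range(j0, min(j0 + nr, n))
            for p in range(k):
                a_col = [a[i][p] for i in rows]  # up to mr loads from A
                b_row = [b[p][j] for j in cols]  # up to nr loads from B
                loads += len(a_col) + len(b_row)
                for ii, i in enumerate(rows):
                    for jj, j in enumerate(cols):
                        c[i][j] += a_col[ii] * b_row[jj]
                        fmas += 1
    return c, loads, fmas
```

In this model an 8x4 tile performs 32 multiply-adds per 12 loaded operands at each K step, while an 8x8 tile performs 64 per 16, so the wider tile does the same arithmetic with fewer loads. Whether that is the dominant effect in the real kernel is an assumption.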

Not sure if it's a good idea to delete the previous 8x4 kernel?

Here are the speedups on a single Neoverse-V2 core (SVE128) compared to the previous state:

  M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
  M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
  M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
  M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
  M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
  M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)

and here are the speedups for the same benchmark on Neoverse-N2:

  M=N=K=64: SBGEMM 1.019x (1.93%), BGEMM 1.055x (5.49%)
  M=N=K=128: SBGEMM 1.030x (3.02%), BGEMM 1.053x (5.31%)
  M=N=K=256: SBGEMM 1.129x (12.90%), BGEMM 1.121x (12.06%)
  M=N=K=512: SBGEMM 1.143x (14.28%), BGEMM 1.132x (13.25%)
  M=N=K=1024: SBGEMM 1.144x (14.41%), BGEMM 1.137x (13.69%)
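Each entry above pairs a speedup ratio with a percentage gain. A minimal sketch of how the two relate, assuming the percentage is `(speedup - 1) * 100` computed before the ratio was rounded to three decimals; the helper names are hypothetical and the numbers are copied from the lists above.

```python
# (M=N=K, SBGEMM ratio, SBGEMM %, BGEMM ratio, BGEMM %)
NEOVERSE_V2 = [
    (64, 1.164, 16.42, 1.133, 13.30),
    (128, 1.220, 22.02, 1.186, 18.56),
    (256, 1.241, 24.08, 1.235, 23.54),
    (512, 1.240, 23.95, 1.227, 22.75),
    (1024, 1.251, 25.11, 1.232, 23.23),
    (2048, 1.235, 23.47, 1.246, 24.64),
]
NEOVERSE_N2 = [
    (64, 1.019, 1.93, 1.055, 5.49),
    (128, 1.030, 3.02, 1.053, 5.31),
    (256, 1.129, 12.90, 1.121, 12.06),
    (512, 1.143, 14.28, 1.132, 13.25),
    (1024, 1.144, 14.41, 1.137, 13.69),
]


def gain_percent(speedup):
    """Convert a speedup ratio such as 1.25 into a percentage gain such as 25.0."""
    return (speedup - 1.0) * 100.0


def consistent(rows, tol=0.06):
    """Check that every listed percentage matches its ratio up to rounding."""
    return all(
        abs(gain_percent(ratio) - pct) <= tol
        for _, s_ratio, s_pct, b_ratio, b_pct in rows
        for ratio, pct in ((s_ratio, s_pct), (b_ratio, b_pct))
    )
```

The tolerance of 0.06 percentage points covers the rounding of the ratios to three decimal places.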

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>

fadara01 · 2026-03-05T13:53:51Z

Hi @martin-frbg - could you please have a look?

(This currently copies the 8x4 kernel and extends it to 8x8 - please let me know if it's a good idea to remove the 8x4 kernel)

martin-frbg · 2026-03-05T21:00:54Z

Looks good to me, thanks. And I still think it makes sense to leave older kernels around, even if nothing uses them at the moment - they might still turn out to be adequate (or at least provide inspiration for kernels) for other cpus (or data types) that do not benefit from the latest enhancement or unrolling pattern

fadara01 · 2026-03-05T23:55:28Z

thanks for reviewing!
It would be great if we could get this merged before the next OpenBLAS release so that we can pick it up in PyTorch.

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.
OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on by #172945 to further accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means bf16 x bf16 -> fp32

ghstack-source-id: cf38a01
Pull-Request: #177012

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things.
OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.
OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3.

This accelerates SDPA, and will be capitalized on by #172945 to further accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means bf16 x bf16 -> fp32

Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
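The PS note distinguishes the two routines only by output type. A minimal sketch of that difference in pure Python, emulating bf16 by rounding a float32 bit pattern to its top 16 bits (round to nearest, ties to even). The function names are hypothetical, NaN and infinity are not handled, and accumulation here happens in Python floats, which is an assumption of the sketch and not a statement about how the OpenBLAS kernels accumulate.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to bf16: keep the top 16 bits of the float32 pattern (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def _matmul_bf16_inputs(a, b):
    """Multiply matrices whose entries are first rounded to bf16."""
    a = [[to_bf16(v) for v in row] for row in a]
    b = [[to_bf16(v) for v in row] for row in b]
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def sbgemm_ref(a, b):
    """bf16 x bf16 -> fp32: the result is stored as float32."""
    return [[to_f32(v) for v in row] for row in _matmul_bf16_inputs(a, b)]


def bgemm_ref(a, b):
    """bf16 x bf16 -> bf16: the result is rounded back to bf16."""
    return [[to_bf16(v) for v in row] for row in _matmul_bf16_inputs(a, b)]
```

For example, with `a = [[1.0, 1.0]]` and `b = [[1.0], [0.00390625]]`, the exact sum 1.00390625 fits in fp32 but not in bf16, so the two reference functions return different values.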

Fixes #182091
Fixes #177251
ghstack-source-id: e7a80f4
Pull-Request: #177012

martin-frbg added this to the 0.3.32 milestone Mar 5, 2026

martin-frbg merged commit 3726265 into OpenMathLib:develop Mar 6, 2026
100 of 102 checks passed

fadara01 mentioned this pull request Mar 10, 2026

Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.33 pytorch/pytorch#177012

Closed

martin-frbg merged 1 commit into OpenMathLib:develop from fadara01:accelerate_sve128_sbgemm
