Optimize SBGEMM / BGEMM for NEOVERSEV1 further by Mousius · Pull Request #5419 · OpenMathLib/OpenBLAS

Mousius · 2025-08-11T09:29:52Z

This changes the kernels to pack full SVE vectors and reduces the overall complexity of the inner GEMM loop.
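As a rough illustration of what "packing full vectors" can mean for a GEMM operand, here is a minimal sketch. It is not the layout used by the OpenBLAS kernels, which this page does not show. The function name `pack_panels` and the `lanes` parameter are hypothetical. The sketch assumes a row-major K x N operand and copies it into column panels that are exactly `lanes` wide, zero-padding the last panel so that every step of the inner loop can load one complete vector.

```python
def pack_panels(matrix, lanes):
    """Pack a row-major matrix into column panels that are `lanes` wide.

    Hypothetical helper for illustration only. Each panel stores, for every
    row, `lanes` contiguous values; the final panel is zero-padded so that
    every chunk fills a whole vector.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    packed = []
    for start in range(0, cols, lanes):          # one panel per vector width
        for r in range(rows):                    # walk down the K dimension
            chunk = list(matrix[r][start:start + lanes])
            chunk += [0.0] * (lanes - len(chunk))  # pad the tail panel
            packed.extend(chunk)
    return packed


if __name__ == "__main__":
    # A 2 x 3 operand packed for a 2-lane vector (a real 256-bit SVE vector
    # would hold 16 bf16 values; 2 keeps the output readable).
    print(pack_panels([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 2))
```

With a layout like this, the inner loop needs no tail handling for partial vectors, which is one way a packing change can simplify a kernel.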

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines, plus general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% (OpenMathLib/OpenBLAS#5667).

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.

ghstack-source-id: cf38a01
Pull-Request: #177012
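The difference between the two routines named in the PS can be sketched in a few lines. This is a minimal emulation and not OpenBLAS code: the names `to_bf16`, `to_fp32`, `sbgemm_ref` and `bgemm_ref` are hypothetical. It assumes round-to-nearest-even for the bf16 conversion and ignores NaN and overflow handling. The products are summed in Python floats and rounded once at the end, which is a simplification of whatever accumulation the real kernels perform.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Assumption: NaN and overflow are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round on the 16 discarded bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def _gemm(a, b, round_out):
    """C = A x B with bf16-rounded inputs and a caller-chosen output rounding."""
    k_dim, n_dim = len(b), len(b[0])
    return [
        [
            round_out(sum(to_bf16(row[k]) * to_bf16(b[k][n]) for k in range(k_dim)))
            for n in range(n_dim)
        ]
        for row in a
    ]


def sbgemm_ref(a, b):
    """bf16 x bf16 -> fp32 (hypothetical reference)."""
    return _gemm(a, b, to_fp32)


def bgemm_ref(a, b):
    """bf16 x bf16 -> bf16 (hypothetical reference)."""
    return _gemm(a, b, to_bf16)


if __name__ == "__main__":
    a = [[1.0, 1.0]]
    b = [[1.0], [2.0 ** -8]]
    print(sbgemm_ref(a, b), bgemm_ref(a, b))
```

In the example at the bottom the exact sum is 1 + 2^-8. That value fits in fp32 but falls halfway between two bf16 values, so the two output types give different results.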


OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines, plus general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ...), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3.

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.

Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet


Fixes #182091
Fixes #177251

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines, plus general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% (OpenMathLib/OpenBLAS#5667).

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.

ghstack-source-id: e7a80f4
Pull-Request: #177012


Fixes #182091
Fixes SVE128 part of #182091

OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines, plus general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ...), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3.

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.

Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet


Mousius added 2 commits August 11, 2025 09:25

Optimize SBGEMM / BGEMM for NEOVERSEV1 further

114316f

This changes the kernels to pack full SVE vectors and reduces the overall complexity of the inner GEMM loop.

Remove older kernels for BGEMM on NEOVERSEV1

5f47b87

martin-frbg added this to the 0.3.31 milestone Aug 13, 2025

martin-frbg merged commit 5e43ba9 into OpenMathLib:develop Aug 13, 2025
88 checks passed

fadara01 mentioned this pull request Mar 10, 2026

Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.33 pytorch/pytorch#177012

Closed


Optimize SBGEMM / BGEMM for NEOVERSEV1 further #5419
martin-frbg merged 2 commits into OpenMathLib:develop from Mousius:bgemm-optimisation
