Optimize SBGEMM / BGEMM for NEOVERSEV1 further #5419

Merged
martin-frbg merged 2 commits into OpenMathLib:develop from Mousius:bgemm-optimisation
Aug 13, 2025

@Mousius (Contributor) commented Aug 11, 2025

This changes the kernels to pack full SVE vectors and reduces the overall complexity of the inner GEMM loop.
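
For readers unfamiliar with GEMM packing, here is a minimal sketch of the general idea of packing operands into full-width vector panels. It is not the layout used by the NEOVERSEV1 kernels in this PR, which is not shown on this page. The function name `pack_panels` and the parameter `vl` (elements per vector) are hypothetical, and the zero-padded tail is an assumption made for illustration.

```python
def pack_panels(matrix, vl):
    """Pack a row-major matrix (list of rows) into column panels of width vl.

    Illustrative only: each panel holds vl consecutive columns for every row,
    stored contiguously. The last panel is zero-padded so that it is also
    exactly vl wide, which lets an inner loop always read full vectors.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    packed = []
    for start in range(0, cols, vl):
        for row in matrix:
            chunk = list(row[start:start + vl])
            # Pad a short tail chunk up to a full vector (assumed behaviour).
            chunk += [0.0] * (vl - len(chunk))
            packed.extend(chunk)
    return packed


if __name__ == "__main__":
    # 2x3 matrix packed with a hypothetical vector length of 2 elements.
    print(pack_panels([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], 2))
```

With every panel a full vector wide, the inner loop needs no partial-vector handling, which is one way a packing change can simplify a GEMM inner loop.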

martin-frbg added this to the 0.3.31 milestone Aug 13, 2025
martin-frbg merged commit 5e43ba9 into OpenMathLib:develop Aug 13, 2025
88 checks passed
fadara01 added a commit to pytorch/pytorch that referenced this pull request Mar 10, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines,
plus general optimizations for SBGEMM/BGEMM (OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399), among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.
ghstack-source-id: cf38a01
Pull-Request: #177012
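
The BGEMM/SBGEMM distinction in the PS above can be sketched in pure Python. This is an emulation for illustration, not OpenBLAS code: the helper names (`to_bf16`, `to_fp32`, `sbgemm`, `bgemm`) are hypothetical, NaN and infinity are not handled, and accumulating in higher precision with a single final rounding is an assumption, since the commit message only states the input and output types.

```python
import struct


def to_fp32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite Python float to the nearest bfloat16 value (ties to even).

    bfloat16 is the upper 16 bits of a float32, so we round the float32 bit
    pattern at bit 16 and clear the lower half.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def _matmul_bf16_inputs(a, b):
    """Multiply matrices whose entries are first rounded to bf16.

    Accumulation happens in Python floats (double precision), which is a
    simplification of whatever a real kernel does internally.
    """
    k = len(b)
    cols = len(b[0])
    return [
        [sum(to_bf16(row[p]) * to_bf16(b[p][j]) for p in range(k)) for j in range(cols)]
        for row in a
    ]


def sbgemm(a, b):
    """bf16 x bf16 -> fp32: the result keeps float32 precision."""
    return [[to_fp32(v) for v in row] for row in _matmul_bf16_inputs(a, b)]


def bgemm(a, b):
    """bf16 x bf16 -> bf16: the result is rounded back to bfloat16."""
    return [[to_bf16(v) for v in row] for row in _matmul_bf16_inputs(a, b)]


if __name__ == "__main__":
    a = [[1.25]]
    b = [[1.0078125]]  # 1 + 2**-7, exactly representable in bf16
    print(sbgemm(a, b))
    print(bgemm(a, b))
```

Both inputs here are exactly representable in bf16, but their product is not, so the two routines return different values for the same operands.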
fadara01 added a commit to pytorch/pytorch that referenced this pull request Mar 10, 2026
ghstack-source-id: 952fd9e
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request Mar 10, 2026
ghstack-source-id: 596be25
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request Mar 16, 2026
ghstack-source-id: 545189c
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request Apr 23, 2026
ghstack-source-id: 38dd7dc
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 30, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128 and SVE256 machines,
plus general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3.

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.
Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes: #182091

ghstack-source-id: 55543d4
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes: #182091

ghstack-source-id: 97dd48e
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: e7a80f4
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: df3caad
Pull-Request: #177012
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: 236a040
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 3, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: 236a040
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 3, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: 098874c
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 6, 2026
Fixes #182091
Fixes #177251

ghstack-source-id: 246c17f
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 8, 2026
Fixes #182091
Fixes SVE128 part of #182091

Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
Alokksinha00 pushed a commit to Alokksinha00/pytorch that referenced this pull request May 15, 2026
Pull Request resolved: pytorch#177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
Alokksinha00 pushed a commit to Alokksinha00/pytorch that referenced this pull request May 15, 2026
Fixes pytorch#182091
Fixes SVE128 part of pytorch#182091

Pull Request resolved: pytorch#177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet