Accelerate SVE128 SBGEMM/BGEMM #5667

Merged: martin-frbg merged 1 commit into OpenMathLib:develop from fadara01:accelerate_sve128_sbgemm on Mar 6, 2026

fadara01 (Contributor) commented Mar 5, 2026

This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to 8x8 (unrolling N by 8).

Not sure if it's a good idea to delete the previous 8x4 kernel?

Here are the speedups on a single Neoverse-V2 core (SVE128) compared to the previous state:

  M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
  M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
  M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
  M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
  M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
  M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)

and here are the speedups for the same benchmark on Neoverse-N2:

  M=N=K=64: SBGEMM 1.019x (1.93%), BGEMM 1.055x (5.49%)
  M=N=K=128: SBGEMM 1.030x (3.02%), BGEMM 1.053x (5.31%)
  M=N=K=256: SBGEMM 1.129x (12.90%), BGEMM 1.121x (12.06%)
  M=N=K=512: SBGEMM 1.143x (14.28%), BGEMM 1.132x (13.25%)
  M=N=K=1024: SBGEMM 1.144x (14.41%), BGEMM 1.137x (13.69%)
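
As a rough illustration of why widening the output tile from 8x4 to 8x8 can help, here is a minimal pure-Python sketch of register blocking. It is not the SVE kernel from this PR: `tiled_matmul` and its `loads` counter are hypothetical names, and the model assumes one load per A or B element per K step, ignoring caches, packing, and the actual bf16 instructions. Under those assumptions, a larger tile reuses each loaded element for more multiply-accumulates.

```python
def tiled_matmul(a, b, mr, nr):
    """Multiply a (M x K) by b (K x N), one mr x nr output tile at a time.

    Returns (c, loads), where loads counts the A and B elements read
    inside the K loop. Illustration only, not a performance model.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    loads = 0
    for i0 in range(0, m, mr):
        for j0 in range(0, n, nr):
            rows = range(i0, min(i0 + mr, m))
            cols = range(j0, min(j0 + nr, n))
            # The tile's accumulators stay "in registers" across the K loop.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for p in range(k):
                a_col = [a[i][p] for i in rows]  # up to mr loads from A
                b_row = [b[p][j] for j in cols]  # up to nr loads from B
                loads += len(a_col) + len(b_row)
                for ii, i in enumerate(rows):
                    for jj, j in enumerate(cols):
                        acc[(i, j)] += a_col[ii] * b_row[jj]
            # Write the finished tile back to C once.
            for (i, j), value in acc.items():
                c[i][j] = value
    return c, loads


# Small integer-valued inputs keep the arithmetic exact.
a = [[float((i + p) % 3) for p in range(16)] for i in range(16)]
b = [[float((p + 2 * j) % 5) for j in range(16)] for p in range(16)]

c_8x4, loads_8x4 = tiled_matmul(a, b, 8, 4)
c_8x8, loads_8x8 = tiled_matmul(a, b, 8, 8)
print(loads_8x4, loads_8x8)  # 1536 1024
```

Both tilings produce the same result; in this toy model the 8x8 tile performs 64 multiply-accumulates per 16 loads, against 32 per 12 loads for 8x4.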

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>
fadara01 (Contributor, Author) commented Mar 5, 2026

Hi @martin-frbg - could you please have a look?

(This currently copies the 8x4 kernel and extends it to 8x8 - please let me know if it's a good idea to remove the 8x4 kernel)

martin-frbg (Collaborator) commented

Looks good to me, thanks. And I still think it makes sense to leave older kernels around, even if nothing uses them at the moment: they might still turn out to be adequate (or at least provide inspiration for kernels) for other CPUs (or data types) that do not benefit from the latest enhancement or unrolling pattern.

@martin-frbg martin-frbg added this to the 0.3.32 milestone Mar 5, 2026
fadara01 (Contributor, Author) commented Mar 5, 2026

Thanks for reviewing!
It would be great if we could get this merged before the next OpenBLAS release, so that we can pick it up in PyTorch.

@martin-frbg martin-frbg merged commit 3726265 into OpenMathLib:develop Mar 6, 2026
100 of 102 checks passed
fadara01 added a commit to pytorch/pytorch that referenced this pull request Mar 10, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines
and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.
ghstack-source-id: cf38a01
Pull-Request: #177012
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 30, 2026
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines
and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3

This accelerates SDPA, and #172945 will build on it to further accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled dot-product attention (SDPA) speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16, and SBGEMM means bf16 x bf16 -> fp32.
Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
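
To make the BGEMM/SBGEMM distinction from the PS concrete, here is a minimal pure-Python reference sketch. The helper names (`to_bf16`, `f32`, `sbgemm_ref`, `bgemm_ref`) are hypothetical and this is not OpenBLAS code. It assumes finite inputs, round-to-nearest-even conversion to bf16, and fp32 accumulation in both routines, with BGEMM rounding only the final result to bf16; a real BGEMM kernel may accumulate differently.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite float to the nearest bf16 value (ties to even).

    The result is returned as a Python float holding a bf16-representable value.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of fp32; round on the 16 bits being dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def sbgemm_ref(a, b):
    """bf16 x bf16 -> fp32: inputs rounded to bf16, output kept in fp32."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc = f32(acc + to_bf16(a[i][p]) * to_bf16(b[p][j]))
            c[i][j] = acc
    return c


def bgemm_ref(a, b):
    """bf16 x bf16 -> bf16: same product, final result rounded to bf16.

    Assumption: fp32 accumulation, then a single rounding at the end.
    """
    return [[to_bf16(v) for v in row] for row in sbgemm_ref(a, b)]


a = [[1.0, 1.0]]
b = [[1.0], [2.0 ** -8]]
print(sbgemm_ref(a, b))  # [[1.00390625]]
print(bgemm_ref(a, b))   # [[1.0]]
```

In this example the exact sum 1 + 2^-8 fits in fp32 but falls halfway between two bf16 values, so the bf16 output rounds back to 1.0.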
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes: #182091
fadara01 added a commit to pytorch/pytorch that referenced this pull request May 1, 2026
Fixes #182091
Fixes #177251
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request May 8, 2026
Fixes #182091
Fixes SVE128 part of #182091