Add optimized BGEMM for NEOVERSEN2 target by Mousius · Pull Request #5399 · OpenMathLib/OpenBLAS

Mousius · 2025-07-24T11:05:17Z

This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
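To illustrate how the two routines relate, here is a minimal reference sketch in plain Python. It is not the NEOVERSEN2 kernel. It assumes BGEMM can be modeled as the SBGEMM result (fp32) rounded to bf16 with round-to-nearest-even. The helper names (`f32_to_bf16_bits`, `sbgemm_ref`, `bgemm_ref`) are hypothetical, alpha/beta scaling is omitted, and NaN inputs are not handled.

```python
import struct


def round_f32(x: float) -> float:
    """Round a Python float (double) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_to_bf16_bits(x: float) -> int:
    """Round an fp32 value to bf16 (nearest-even); returns the 16 raw bits.

    Assumption: x is finite; NaN is not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even bf16 value
    return ((bits + bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen bf16 to fp32: the bf16 bits are the top half of the fp32 word."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def sbgemm_ref(a, b, m, n, k):
    """C = A * B with bf16 inputs (row-major lists of raw bits), fp32 output."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                prod = bf16_bits_to_f32(a[i * k + p]) * bf16_bits_to_f32(b[p * n + j])
                # Approximates fp32 accumulation by rounding after each add.
                acc = round_f32(acc + prod)
            c[i * n + j] = acc
    return c


def bgemm_ref(a, b, m, n, k):
    """Same product, but every output element is rounded to bf16 bits."""
    return [f32_to_bf16_bits(v) for v in sbgemm_ref(a, b, m, n, k)]


one = f32_to_bf16_bits(1.0)
tiny = f32_to_bf16_bits(2.0 ** -8)
# 1*1 + 1*2^-8 fits in fp32 but not in bf16's 8 significand bits.
fp32_result = sbgemm_ref([one, one], [one, tiny], 1, 1, 2)
bf16_result = bgemm_ref([one, one], [one, tiny], 1, 1, 2)
```

In this example the fp32 output keeps the small term while the bf16 output rounds it away, which is the practical difference between the two output types.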

mattip · 2025-08-03T07:01:27Z

I think there is a missing include here: it does not build in the weekly openblas-libs tests because `vcvtah_f32_bf16` is not declared. It seems that both `#include <arm_bf16.h>` and `#include <arm_neon.h>` are needed. The failed run is here. Is there something different about the openblas-libs CI that causes those includes to be missing?

martin-frbg · 2025-08-03T12:54:50Z

Maybe it is a toolchain question, or you are using additional code checking options? I have only limited options for testing the most recent Neoverse CPUs: our Cirun job uses an Ubuntu Jammy image that appears to be stuck at gcc 11, and the most modern hardware in the GCC Compile Farm is an N1. The code in question still appears to compile on my Pixel 8 phone with gcc-15, though.

Mousius · 2025-08-03T15:33:29Z

@mattip potentially a bigger issue is that this is building BGEMM/SBGEMM and GEMV variants by default thanks to #5396 - which likely increases the binary size for no gain to numpy.

mattip · 2025-08-03T22:50:18Z

> maybe it is a toolchain question, or you are using additional code checking options?

This fails compilation on CI for macos-arm64. When I run it locally on a macbook M1, I do not see compilation of the sbgemm_kernel_8x4_neoversen2.c kernel.

CFLAGS=' -ftrapping-math -mmacos-version-min=11.0 -fvisibility=protected -Wno-uninitialized'
make BUFFERSIZE=20 DYNAMIC_ARCH=1 QUIET_MAKE=1 USE_OPENMP=0 NUM_THREADS=64 BINARY=64 INTERFACE64=1 SYMBOLSUFFIX=64_ LIBNAMESUFFIX=64_ OBJCONV=/Users/runner/work/openblas-libs/openblas-libs/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 TARGET=VORTEX

The similar command on linux-arm64 fails the tests, since it does not actually have bfloat16:

CFLAGS=' -fvisibility=protected -Wno-uninitialized'
make BUFFERSIZE=20 DYNAMIC_ARCH=1 QUIET_MAKE=1 USE_OPENMP=0 NUM_THREADS=64 BINARY=64 OBJCONV=/io/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 TARGET=ARMV8

> building BGEMM/SBGEMM and GEMV variants by default

Right, we should probably use a DYNAMIC_LIST to not use the bfloat kernels on linux-arm64

mattip · 2025-08-03T23:16:52Z

Ahh, the difference is that the CI run specifies MACOSX_DEPLOYMENT_TARGET="11.0", which then allows SVE and SME here

OpenBLAS/Makefile.system

Lines 425 to 431 in d23680b

    
    ifeq ($(OSNAME), Darwin)
    ifndef MACOSX_DEPLOYMENT_TARGET
    ifeq ($(ARCH), arm64)
    export MACOSX_DEPLOYMENT_TARGET=11.0
    export NO_SVE = 1
    export NO_SME = 1
    else

So for me a minimal reproducer for the build failure is this. Maybe the CI here does not care about the undefined function warning.

export MACOSX_DEPLOYMENT_TARGET="11.0"
export CFLAGS=-Werror
make TARGET=NEOVERSEN2
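
The effect of the `Makefile.system` excerpt above can be modeled with a small Python sketch. This is only an illustration of the quoted lines, not OpenBLAS code: the function name `darwin_arm64_defaults` is hypothetical, and the `else` branch is cut off in the excerpt, so it is deliberately not modeled.

```python
def darwin_arm64_defaults(osname: str, arch: str, env: dict) -> dict:
    """Return the variables the quoted Makefile.system lines would export.

    Hypothetical helper that models only the branch visible in the excerpt.
    """
    exports = {}
    # GNU make's `ifndef` is also true for a variable defined as empty,
    # so an empty string is treated the same as "not set" here.
    if osname == "Darwin" and not env.get("MACOSX_DEPLOYMENT_TARGET"):
        if arch == "arm64":
            exports["MACOSX_DEPLOYMENT_TARGET"] = "11.0"
            exports["NO_SVE"] = "1"
            exports["NO_SME"] = "1"
        # The `else` branch is elided in the excerpt and is not modeled.
    return exports


# Local build without the variable: SVE and SME get disabled.
local = darwin_arm64_defaults("Darwin", "arm64", {})
# CI build that sets the variable explicitly: the whole block is skipped,
# so NO_SVE / NO_SME are never exported.
ci = darwin_arm64_defaults("Darwin", "arm64", {"MACOSX_DEPLOYMENT_TARGET": "11.0"})
```

Under this model, setting `MACOSX_DEPLOYMENT_TARGET` in the environment is enough to skip the block that disables SVE and SME, which matches the difference between the CI run and the local build.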

martin-frbg · 2025-08-04T06:33:36Z

Probably the Xcode 15.4 toolchain?
And DYNAMIC_LIST won't change anything about bfloat16 support; you need to build with BUILD_BFLOAT16=0.

martin-frbg · 2025-08-04T07:17:48Z

Looks like arm_neon.h should be included, while arm_bf16.h would be included automatically from either this or arm_sve.h if needed - #5396 had already fixed this in the N1/V1 kernels, but not here

OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32

ghstack-source-id: cf38a01
Pull-Request: #177012

OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667.

OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3.

This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear, mm, bmm, etc.

## Performance

Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-product-attention speedups achieved with 16 Neoverse-V2 cores:

| B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881, #177009 and this PR vs current |
|---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:|
| 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% |
| 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95% |
| 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% |
| 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% |
| 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% |

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32

Pull Request resolved: #177012
Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet

Fixes: #182091

OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32

ghstack-source-id: 55543d4
Pull-Request: #177012

Fixes #182091
Fixes #177251

OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things.

OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667

This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear, mm, bmm, etc.

PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32

ghstack-source-id: e7a80f4
Pull-Request: #177012

Add optimized BGEMM for NEOVERSEN2 target

ea2faf0

This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.

martin-frbg added this to the 0.3.31 milestone Jul 24, 2025

martin-frbg merged commit c9204f7 into OpenMathLib:develop Jul 25, 2025
87 checks passed

mattip mentioned this pull request Aug 4, 2025

disallow BFLOAT16 kernels on linux-aarch64 MacPython/openblas-libs#212

Closed

1 task

martin-frbg mentioned this pull request Aug 4, 2025

Fix compilation of the NeoverseN2 SBGEMM kernel #5415

Merged

martin-frbg mentioned this pull request Feb 22, 2026

Fix SGEMM returning wrong results in multithreading on NeoverseV2 #5643

Merged

fadara01 mentioned this pull request Mar 10, 2026

Accelerate SDPA on Arm CPUs: Update OpenBLAS to v0.3.33 pytorch/pytorch#177012

Closed

martin-frbg merged 1 commit into OpenMathLib:develop from Mousius:bgemm-8x4
