Add optimized BGEMM for NEOVERSEN2 target#5399
Conversation
This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
|
I think there is a missing include here: it does not build in the weekly openblas-libs tests because |
|
maybe it is a toolchain question, or you are using additional code checking options ? I have only limited options for testing the most recent Neoverse cpus - our Cirun job uses an Ubuntu Jammy image that appears to be stuck at gcc11, and the most modern hardware in the GCC Compile Farm is a N1. The code in question still appears to compile on my Pixel8 phone with gcc-15 though |
This fails compilation on CI for macos-arm64. When I run it locally on a macbook M1, I do not see compilation of the and the similar command on linux-arm64 fails tests, since it does not actually have bfloat
Right, we should probably use a |
|
Ahh, the difference is that the CI run specifies Lines 425 to 431 in d23680b So for me a minimal reproducer for the build failure is this. Maybe the CI here does not care about the undefined function warning. |
|
probably the xcode 15.4 toolchain ? |
|
Looks like arm_neon.h should be included, while arm_bf16.h would be included automatically from either this or arm_sve.h if needed - #5396 had already fixed this in the N1/V1 kernels, but not here |
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: cf38a01 Pull-Request: #177012
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 952fd9e Pull-Request: #177012
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 596be25 Pull-Request: #177012
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 545189c Pull-Request: #177012
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 38dd7dc Pull-Request: #177012
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667. OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc ## Performance Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-production-attention speedups achieved with 16 Neoverse-V2 cores: | B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881 , #177009 and this PR vs current | |---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:| | 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% | | 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95%% | | 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% | | 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% | | 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% | PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 Pull Request resolved: #177012 Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
Fixes: #182091 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 55543d4 Pull-Request: #177012
Fixes: #182091 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 97dd48e Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: e7a80f4 Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: df3caad Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 236a040 Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 236a040 Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 098874c Pull-Request: #177012
Fixes #182091 Fixes #177251 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20%: OpenMathLib/OpenBLAS#5667 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 ghstack-source-id: 246c17f Pull-Request: #177012
Fixes #182091 Fixes SVE128 part of #182091 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667. OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3 This accelerates SDPA, and will be capitalized on by #172945 further to accelerate linear,mm, bmm, etc ## Performance Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-production-attention speedups achieved with 16 Neoverse-V2 cores: | B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from #176881 vs current | Speedup from #176881 and this PR vs current | Speedup from #176881 , #177009 and this PR vs current | |---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:| | 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% | | 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95%% | | 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% | | 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% | | 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% | PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 Pull Request resolved: #177012 Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667. OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3 This accelerates SDPA, and will be capitalized on by pytorch#172945 further to accelerate linear,mm, bmm, etc ## Performance Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-production-attention speedups achieved with 16 Neoverse-V2 cores: | B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from pytorch#176881 vs current | Speedup from pytorch#176881 and this PR vs current | Speedup from pytorch#176881 , pytorch#177009 and this PR vs current | |---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:| | 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% | | 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95%% | | 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% | | 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% | | 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% | PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 Pull Request resolved: pytorch#177012 Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
Fixes pytorch#182091 Fixes SVE128 part of pytorch#182091 OpenBLAS v0.3.31 adds support for BGEMM on SVE128, SVE256 machines and general optimizations for SBGEMM/BGEMM: OpenMathLib/OpenBLAS#5419, OpenMathLib/OpenBLAS#5399 ... among other things. OpenBLAS v0.3.32 accelerates SBGEMM/BGEMM on SVE128 machines by ~20% through OpenMathLib/OpenBLAS#5667. OpenBLAS v0.3.33 contains an SBGEMM fix for non-SVE machines and adds detection logic for Neoverse-V3 This accelerates SDPA, and will be capitalized on by pytorch#172945 further to accelerate linear,mm, bmm, etc ## Performance Using [this SDPA benchmark](https://gist.github.com/fadara01/5357a52299a3722587f6691d145e71e9), here are the scaled-dot-production-attention speedups achieved with 16 Neoverse-V2 cores: | B | Hq | Hkv | Lq | Lk | D | causal | gqa | Speedup from pytorch#176881 vs current | Speedup from pytorch#176881 and this PR vs current | Speedup from pytorch#176881 , pytorch#177009 and this PR vs current | |---:|---:|---:|---:|---:|---:|---|---|---:|---:|---:| | 1 | 32 | 8 | 2048 | 2048 | 128 | True | True | +9.48% | +14.91% | +35.60% | | 1 | 32 | 8 | 1 | 2048 | 128 | False | True | -1.42% | -2.79% | -0.95%% | | 1 | 16 | 16 | 6400 | 6400 | 80 | False | False | +5.18% | +11.60% | +27.95% | | 1 | 20 | 20 | 1500 | 1500 | 64 | False | False | +6.63% | +11.80% | +24.86% | | 8 | 20 | 20 | 1500 | 1500 | 64 | False | False | +9.31% | +17.12% | +31.82% | PS: BGEMM means bf16 x bf16 -> bf16 and SBGEMM means: bf16 x bf16 -> fp32 Pull Request resolved: pytorch#177012 Approved by: https://github.com/jgong5, https://github.com/aditew01, https://github.com/malfet
This re-uses the existing NEOVERSEN2 8x4
sbgemmkernel to implementbgemm.