Add alpha support for 9.7 data processing intrinsics #428
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -11,7 +11,7 @@ toc: true | |
| --- | ||
|
|
||
| <!-- | ||
| SPDX-FileCopyrightText: Copyright 2011-2025 Arm Limited and/or its affiliates <open-source-office@arm.com> | ||
| SPDX-FileCopyrightText: Copyright 2011-2026 Arm Limited and/or its affiliates <open-source-office@arm.com> | ||
| SPDX-FileCopyrightText: Copyright 2022 Google LLC. | ||
| CC-BY-SA-4.0 AND Apache-Patent-License | ||
| See LICENSE.md file for details | ||
|
|
@@ -487,6 +487,9 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin | |
| * Removed all references to Transactional Memory Extension (TME). | ||
| * Added [**Alpha**](#current-status-and-anticipated-changes) support | ||
| for Brain 16-bit floating-point vector multiplication intrinsics. | ||
| * Added [**Alpha**](#current-status-and-anticipated-changes) | ||
| support for SVE2.3 (FEAT_SVE2p3), SME2.3 (FEAT_SME2p3), FEAT_F16F32DOT, | ||
| FEAT_F16F32MM, FEAT_F16MM and FEAT_SVE_B16MM intrinsics. | ||
|
|
||
| ### References | ||
|
|
||
|
|
@@ -2003,6 +2006,10 @@ are available. This implies that `__ARM_FEATURE_SVE` is nonzero. | |
| are available and if the associated [ACLE features] | ||
| (#sme-language-extensions-and-intrinsics) are supported. | ||
|
|
||
| `__ARM_FEATURE_SVE2p3` is defined to 1 if the FEAT_SVE2p3 instructions | ||
| are available and if the associated [ACLE features] | ||
| (#sme-language-extensions-and-intrinsics) are supported. | ||
|
|
||
| #### NEON-SVE Bridge macro | ||
|
|
||
| `__ARM_NEON_SVE_BRIDGE` is defined to 1 if the [`<arm_neon_sve_bridge.h>`](#arm_neon_sve_bridge.h) | ||
|
|
@@ -2026,6 +2033,7 @@ of SME has an associated preprocessor macro, given in the table below: | |
| | FEAT_SME2 | __ARM_FEATURE_SME2 | | ||
| | FEAT_SME2p1 | __ARM_FEATURE_SME2p1 | | ||
| | FEAT_SME2p2 | __ARM_FEATURE_SME2p2 | | ||
| | FEAT_SME2p3 | __ARM_FEATURE_SME2p3 | | ||
|
|
||
| Each macro is defined if there is hardware support for the associated | ||
| architecture feature and if all of the [ACLE | ||
|
|
@@ -2162,6 +2170,20 @@ See [Half-precision brain | |
| floating-point](#half-precision-brain-floating-point) for details | ||
| of half-precision brain floating-point types. | ||
|
|
||
| #### Brain 16-bit floating-point matrix multiplication support | ||
|
|
||
| This section is in | ||
| [**Alpha** state](#current-status-and-anticipated-changes) and might change or be | ||
| extended in the future. | ||
|
|
||
| `__ARM_FEATURE_SVE_B16MM` is defined to `1` if there is hardware | ||
| support for the SVE BF16 matrix multiply extension and if the | ||
| associated ACLE intrinsics are available. | ||
|
|
||
| See [Half-precision brain | ||
| floating-point](#half-precision-brain-floating-point) for details | ||
| of half-precision brain floating-point types. | ||
|
|
||
| ### Cryptographic extensions | ||
|
|
||
| #### “Crypto” extension | ||
|
|
@@ -2380,6 +2402,18 @@ this implies: | |
|
|
||
| * `__ARM_NEON == 1` | ||
|
|
||
| #### Half-precision to single-precision dot product extension | ||
|
|
||
| This section is in | ||
| [**Alpha** state](#current-status-and-anticipated-changes) and might change or be | ||
| extended in the future. | ||
|
|
||
| `__ARM_FEATURE_F16F32DOT` is defined if the half-precision dot product | ||
| accumulating to single-precision instructions are supported and the vector | ||
| intrinsics are available. Note that this implies: | ||
|
|
||
| * `__ARM_NEON == 1` | ||
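As a minimal sketch of how user code might test for this macro: the helper name `have_f16f32dot` is hypothetical, and no FEAT_F16F32DOT intrinsic is called because none are named in this part of the diff. The snippet compiles on any target and checks the stated implication at compile time.

```c
/* Compile-time check of the implication stated above: if the toolchain
   defines __ARM_FEATURE_F16F32DOT, it is expected to define __ARM_NEON as 1. */
#if defined(__ARM_FEATURE_F16F32DOT) && !(defined(__ARM_NEON) && __ARM_NEON == 1)
#error "__ARM_FEATURE_F16F32DOT is expected to imply __ARM_NEON == 1"
#endif

/* Hypothetical helper (not part of ACLE): returns 1 when the compiler
   advertises the feature, 0 otherwise. */
static int have_f16f32dot(void) {
#if defined(__ARM_FEATURE_F16F32DOT)
    return 1;
#else
    return 0;
#endif
}
```

A caller could use the helper to choose between an intrinsic-based path and a portable fallback, although selecting the path directly with `#if` is usually preferable when the intrinsics would otherwise fail to compile.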
|
|
||
| #### Complex number intrinsics | ||
|
|
||
| `__ARM_FEATURE_COMPLEX` is defined if the complex addition and complex | ||
|
|
@@ -2425,6 +2459,21 @@ instructions and if the associated ACLE intrinsics are available. | |
| for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add | ||
| (FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available. | ||
|
|
||
| ##### Multiplication of 16-bit floating-point matrices (AdvSIMD) | ||
|
|
||
| This section is in | ||
| [**Alpha** state](#current-status-and-anticipated-changes) and might change or be | ||
| extended in the future. | ||
|
|
||
| `__ARM_FEATURE_F16F32MM` is defined if the NEON half-precision matrix multiply | ||
| accumulating to single-precision instruction is supported. Note that this implies: | ||
|
|
||
| * `__ARM_NEON == 1` | ||
|
|
||
| `__ARM_FEATURE_F16MM` is defined if the NEON non-widening half-precision matrix multiply instruction is supported. Note that this implies: | ||
|
|
||
| * `__ARM_NEON == 1` | ||
|
|
||
| ##### Multiplication of 32-bit floating-point matrices | ||
|
|
||
| `__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support | ||
|
|
@@ -2662,6 +2711,9 @@ be found in [[BA]](#BA). | |
| | [`__ARM_FEATURE_DIRECTED_ROUNDING`](#directed-rounding) | Directed Rounding | 1 | | ||
| | [`__ARM_FEATURE_DOTPROD`](#availability-of-dot-product-intrinsics) | Dot product extension (ARM v8.2-A) | 1 | | ||
| | [`__ARM_FEATURE_DSP`](#dsp-instructions) | DSP instructions (Arm v5E) (32-bit-only) | 1 | | ||
| | [`__ARM_FEATURE_F16F32DOT`](#half-precision-to-single-precision-dot-product-extension) | Half-precision to single-precision dot product extension (FEAT_F16F32DOT) | 1 | | ||
| | [`__ARM_FEATURE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision to single-precision matrix multiply accumulating extension (FEAT_F16F32MM) | 1 | | ||
| | [`__ARM_FEATURE_F16MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision matrix multiply accumulating extension (FEAT_F16MM) | 1 | | ||
| | [`__ARM_FEATURE_FAMINMAX`](#floating-point-absolute-minimum-and-maximum-extension) | Floating-point absolute minimum and maximum extension | 1 | | ||
| | [`__ARM_FEATURE_FMA`](#fused-multiply-accumulate-fma) | Floating-point fused multiply-accumulate | 1 | | ||
| | [`__ARM_FEATURE_FP16_FML`](#fp16-fml-extension) | FP16 FML extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) | 1 | | ||
|
|
@@ -2714,6 +2766,7 @@ be found in [[BA]](#BA). | |
| | [`__ARM_FEATURE_SSVE_FP8FMA`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 | | ||
| | [`__ARM_FEATURE_SVE`](#scalable-vector-extension-sve) | Scalable Vector Extension (FEAT_SVE) | 1 | | ||
| | [`__ARM_FEATURE_SVE_B16B16`](#non-widening-brain-16-bit-floating-point-support) | Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16) | 1 | | ||
| | [`__ARM_FEATURE_SVE_B16MM`](#brain-16-bit-floating-point-matrix-multiplication-support) | SVE brain 16-bit floating-point matrix multiply extension (FEAT_SVE_B16MM) | 1 | | ||
| | [`__ARM_FEATURE_SVE_BF16`](#brain-16-bit-floating-point-support) | SVE support for the 16-bit brain floating-point extension (FEAT_BF16) | 1 | | ||
| | [`__ARM_FEATURE_SVE_BFSCALE`](#brain-16-bit-floating-point-vector-multiplication-support) | SVE support for the 16-bit brain floating-point vector multiplication extension (FEAT_SVE_BFSCALE) | 1 | | ||
| | [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 | | ||
|
|
@@ -2737,6 +2790,7 @@ be found in [[BA]](#BA). | |
| | [`__ARM_FEATURE_SVE2_SM4`](#sm4-extension) | SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4) | 1 | | ||
| | [`__ARM_FEATURE_SVE2p1`](#sve2) | SVE version 2.1 (FEAT_SVE2p1) | ||
| | [`__ARM_FEATURE_SVE2p2`](#sve2) | SVE version 2.2 (FEAT_SVE2p2) | ||
| | [`__ARM_FEATURE_SVE2p3`](#sve2) | SVE version 2.3 (FEAT_SVE2p3) | ||
| | [`__ARM_FEATURE_SYSREG128`](#bit-system-registers) | Support for 128-bit system registers (FEAT_SYSREG128) | 1 | | ||
| | [`__ARM_FEATURE_UNALIGNED`](#unaligned-access-supported-in-hardware) | Hardware support for unaligned access | 1 | | ||
| | [`__ARM_FP`](#hardware-floating-point) | Hardware floating-point | 1 | | ||
|
|
@@ -9594,6 +9648,15 @@ BFloat16 floating-point adjust exponent vectors. | |
|
|
||
| ### SVE2 floating-point matrix multiply-accumulate instructions. | ||
|
|
||
| #### BFMMLA, FMMLA (non-widening) | ||
|
|
||
| 16-bit floating-point matrix multiply-accumulate. | ||
| ```c | ||
| // Only if __ARM_FEATURE_SVE_B16MM | ||
| // Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM) | ||
| svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm); | ||
| ``` | ||
|
|
||
| #### FMMLA (widening, FP8 to FP16) | ||
|
|
||
| Modal 8-bit floating-point matrix multiply-accumulate to half-precision. | ||
|
|
@@ -9955,6 +10018,23 @@ Lookup table read with 4-bit indices. | |
| svint16_t svluti4_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx); | ||
| ``` | ||
|
|
||
| ### SVE2.3 lookup table | ||
|
|
||
| The intrinsics in this section are defined by the header file | ||
| [`<arm_sve.h>`](#arm_sve.h) when `__ARM_FEATURE_SVE2p3` is defined to 1. | ||
|
|
||
| #### LUTI6 | ||
|
|
||
| Lookup table read with 6-bit indices (8-bit). | ||
|
|
||
| Use of this intrinsic if `svcntb() * 8 < 256` results in undefined behaviour. | ||
|
|
||
| ```c | ||
| // Variant is also available for: _u8 _mf8 | ||
| svint8_t svluti6[_s8](svint8x2_t table, svuint8_t indices); | ||
| ``` | ||
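The vector-length precondition above can be sketched as a plain arithmetic check. This is an illustration only: `luti6_vl_ok` is a hypothetical helper, not an ACLE function, and it takes the vector length in bytes as a parameter so that it compiles on any target. On an SVE target the value would come from `svcntb()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of ACLE). vector_bytes is the vector length
   in bytes, as svcntb() would report on an SVE target. Returns 1 when the
   vector is at least min_bits wide, so the documented precondition holds. */
static int luti6_vl_ok(uint64_t vector_bytes, uint64_t min_bits) {
    assert(min_bits % 8 == 0); /* the thresholds in the text are whole bytes */
    return vector_bytes * 8 >= min_bits;
}
```

For the 8-bit form the threshold is 256 bits, so a 128-bit implementation (16 bytes) fails the check and a 256-bit implementation (32 bytes) passes it.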
|
|
||
|
|
||
| ### SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions | ||
|
|
||
| The specification for SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions is in | ||
|
|
@@ -13548,6 +13628,7 @@ Multi-vector saturating rounding shift right narrow and interleave | |
|
|
||
| ``` c | ||
| // Variants are also available for _u16[_u32_x2] | ||
| // and _s8[_s16_x2] _u8[_u16_x2] if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3 | ||
| svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm); | ||
| ``` | ||
|
|
||
|
|
@@ -13556,6 +13637,7 @@ Multi-vector saturating rounding shift right narrow and interleave | |
| Multi-vector saturating rounding shift right unsigned narrow and interleave | ||
|
|
||
| ``` c | ||
| // Variant for _u8[_u16_x2] is available if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3 | ||
| svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm); | ||
| ``` | ||
|
|
||
|
|
@@ -13857,6 +13939,128 @@ Scalar index of first/last true predicate element (predicated). | |
|
|
||
| ``` | ||
|
|
||
| ### SVE2.3 and SME2.3 instruction intrinsics | ||
|
|
||
| The specifications for SVE2.3 and SME2.3 are in | ||
| [**Alpha** state](#current-status-and-anticipated-changes) and might change or be | ||
| extended in the future. | ||
|
|
||
| The functions in this section are defined by either the header file | ||
| [`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h) | ||
| when `__ARM_FEATURE_SVE2p3` or `__ARM_FEATURE_SME2p3` is defined, respectively. | ||
|
|
||
| #### ADDQP | ||
|
|
||
| Add pairwise within quadword vector segments. | ||
|
|
||
| ``` c | ||
| // Variants are also available for _s16, _s32 and _s64 | ||
| svint8_t svaddqp[_s8](svint8_t zn, svint8_t zm); | ||
| ``` | ||
|
|
||
| #### ADDSUBP | ||
|
|
||
| Add subtract pairwise. | ||
|
|
||
| ``` c | ||
| // Variants are also available for _s16, _s32 and _s64 | ||
|
Contributor
We should also have unsigned variants here? |
||
| svint8_t svaddsubp[_s8](svint8_t zn, svint8_t zm); | ||
| ``` | ||
|
|
||
| #### LUTI6 | ||
|
|
||
| Lookup table read with 6-bit indices (16-bit). | ||
|
|
||
| Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour. | ||
|
|
||
| ``` c | ||
| // Variants are also available for _u16_x2 _f16_x2 | ||
| svint16_t svluti6_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx); | ||
| ``` | ||
|
|
||
| #### FCVTZSN, FCVTZUN | ||
|
|
||
| Floating-point narrowing convert to interleaved integer, rounding toward zero | ||
|
|
||
| ``` c | ||
| // Variants are also available for | ||
| // s16[_f32_x2], s32[_f64_x2], | ||
| // u8[_f16_x2], u16[_f32_x2], u32[_f64_x2] | ||
| svint8_t svcvt_s8[_f16_x2](svfloat16x2_t zn); | ||
| ``` | ||
|
|
||
| #### SABAL, UABAL | ||
|
|
||
| Two-way absolute difference sum and accumulate long. | ||
|
|
||
| ``` c | ||
| // Variants are also available for | ||
| // s32[_s16], s64[_s32], | ||
| // u16[_u8], u32[_u16], u64[_u32] | ||
| svint16_t svaba_s16[_s8](svint16_t zda, svint8_t zn, svint8_t zm); | ||
| ``` | ||
|
|
||
| #### SCVTF, SCVTFLT, UCVTF, UCVTFLT | ||
|
|
||
| Integer convert to floating-point (top and bottom). | ||
|
|
||
| ``` c | ||
| // Variants are also available for | ||
| // f32[_s16], f64[_s32], | ||
| // f16[_u8], f32[_u16], f64[_u32] | ||
|
|
||
| svfloat16_t svcvtt_f16[_s8](svint8_t zn); | ||
| svfloat16_t svcvtb_f16[_s8](svint8_t zn); | ||
| ``` | ||
|
|
||
| #### SDOT, UDOT (vectors) | ||
|
Contributor
There should already be a section for the _s32_s16 etc variant of this added with |
||
|
|
||
| Integer dot product (two-way, vectors). | ||
|
Contributor
"2-way" vs "two-way" are used inconsistently, pick one? Personal preference for the latter, but fine with either. |
||
|
|
||
| ``` c | ||
| // Variants are also available for _u16_u8 | ||
| svint16_t svdot[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm); | ||
| ``` | ||
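A scalar reference model may help when reading the `svdot[_s16_s8]` prototype above. This is a sketch under assumptions, not the specified behaviour: it assumes the usual two-way dot-product layout, where accumulator lane `i` receives the products of source lanes `2i` and `2i + 1`, and it assumes wrap-around accumulation. The name `sdot2_ref` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model, not an ACLE function.
   acc has n 16-bit lanes; zn and zm each have 2 * n 8-bit lanes. */
static void sdot2_ref(int16_t *acc, const int8_t *zn, const int8_t *zm, size_t n) {
    assert(acc != NULL && zn != NULL && zm != NULL);
    for (size_t i = 0; i < n; ++i) {
        /* Two adjacent 8-bit products, widened so the sum cannot overflow. */
        int32_t sum = (int32_t)zn[2 * i] * zm[2 * i]
                    + (int32_t)zn[2 * i + 1] * zm[2 * i + 1];
        /* Accumulate modulo 2^16 (assumed wrap-around, no saturation). */
        acc[i] = (int16_t)(uint16_t)((uint32_t)(uint16_t)acc[i] + (uint32_t)sum);
    }
}
```

With `acc = {10, -5}`, `zn = {1, 2, 3, 4}` and `zm = {5, 6, -7, 8}`, the model produces `{27, 6}`.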
|
|
||
| #### SDOT, UDOT (indexed) | ||
|
|
||
| Integer dot product by indexed element (two-way). | ||
|
|
||
| ``` c | ||
| // Variants are also available for _u16_u8 | ||
| svint16_t svdot_lane[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm, | ||
| uint64_t imm_idx); | ||
| ``` | ||
|
|
||
| #### SQSHRN, UQSHRN | ||
|
|
||
| Multi-vector saturating shift right narrow and interleave. | ||
|
|
||
| ``` c | ||
| // Variants are also available for | ||
| // _u16[_u32_x2] | ||
|
Contributor
Rework this so that signed/unsigned are on their own lines instead of grouping by 8/16, then we can fix the alignment here? |
||
| // _s8[_s16_x2] _u8[_u16_x2] | ||
| svint16_t svqshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm); | ||
| ``` | ||
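The per-element arithmetic of a signed saturating shift right narrow from 32 to 16 bits can be sketched in scalar C. This is an illustration under assumptions: the helper name `sqshrn_s32_to_s16` is hypothetical, the shift range of 1 to 16 is assumed from the 16-bit destination width, and the model covers only one element, not the order in which results from the two source vectors are interleaved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of one element, not an ACLE function. */
static int16_t sqshrn_s32_to_s16(int32_t x, unsigned shift) {
    assert(shift >= 1 && shift <= 16); /* assumed immediate range */
    /* Truncating (non-rounding) arithmetic shift, written so that it does not
       rely on how the compiler shifts negative values: ~x is non-negative
       whenever x is negative. */
    int32_t q = x >= 0 ? x >> shift : ~(~x >> shift);
    /* Saturate to the signed 16-bit range. */
    if (q > INT16_MAX) return INT16_MAX;
    if (q < INT16_MIN) return INT16_MIN;
    return (int16_t)q;
}
```

The shift does not round, so `sqshrn_s32_to_s16(7, 1)` gives 3, whereas a rounding variant such as SQRSHRN would be expected to give 4.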
|
|
||
| #### SQSHRUN | ||
|
|
||
| Signed saturating shift right narrow by immediate to interleaved unsigned integer. | ||
|
|
||
| ``` c | ||
| // Variant for _u8[_s16_x2] is also available. | ||
| svuint16_t svqshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm); | ||
| ``` | ||
|
|
||
| #### SUBP | ||
|
|
||
| Subtract pairwise. | ||
|
|
||
| ``` c | ||
| // Variants are also available for _s16, _s32 and _s64 | ||
|
Contributor
This should have unsigned variants? |
||
| svint8_t svsubp[_s8](svbool_t pg, svint8_t zn, svint8_t zm); | ||
| ``` | ||
|
|
||
| ### SME2 maximum and minimum absolute value | ||
|
|
||
| The intrinsics in this section are defined by the header file | ||
|
|
@@ -14581,6 +14785,36 @@ non-overloaded names to indicate which vector argument is a vector register pair | |
| __arm_streaming __arm_inout("za"); | ||
| ``` | ||
|
|
||
| ### SME2.3 lookup table | ||
|
|
||
| The intrinsics in this section are defined by the header file | ||
| [`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2p3` is defined to 1. | ||
|
|
||
| #### LUTI6 | ||
|
|
||
| Lookup table read with 6-bit indices (16-bit) | ||
|
|
||
| Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour. | ||
|
|
||
| ```c | ||
| // Variants are also available for: _u16, _f16 and _bf16 | ||
| svint16x4_t svluti6_lane_s16_x4[_s16_x2](svint16x2_t table, svuint8x2_t indices, uint64_t imm_idx); | ||
| ``` | ||
|
|
||
| Lookup table read with 6-bit indices (four registers, 8-bit) | ||
|
|
||
| ``` c | ||
| // Variants are also available for: _u8 _mf8 | ||
| svint8x4_t svluti6_zt_s8_x4(uint64_t zt0, svuint8x3_t zn) __arm_streaming __arm_in("zt0"); | ||
| ``` | ||
|
|
||
| Lookup table read with 6-bit indices (table, single, 8-bit) | ||
|
|
||
| ``` c | ||
| // Variants are also available for: _u8 _mf8 | ||
| svint8_t svluti6_zt_s8(uint64_t zt0, svuint8_t zn) __arm_streaming __arm_in("zt0"); | ||
| ``` | ||
|
|
||
| # M-profile Vector Extension (MVE) intrinsics | ||
|
|
||
| The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single | ||
|
|
||
We should also have unsigned variants here.