236 changes: 235 additions & 1 deletion main/acle.md
@@ -11,7 +11,7 @@ toc: true
---

<!--
SPDX-FileCopyrightText: Copyright 2011-2025 Arm Limited and/or its affiliates <open-source-office@arm.com>
SPDX-FileCopyrightText: Copyright 2011-2026 Arm Limited and/or its affiliates <open-source-office@arm.com>
SPDX-FileCopyrightText: Copyright 2022 Google LLC.
CC-BY-SA-4.0 AND Apache-Patent-License
See LICENSE.md file for details
@@ -487,6 +487,9 @@ Armv8.4-A [[ARMARMv84]](#ARMARMv84). Support is added for the Dot Product intrin
* Removed all references to Transactional Memory Extension (TME).
* Added [**Alpha**](#current-status-and-anticipated-changes) support
for Brain 16-bit floating-point vector multiplication intrinsics.
* Added [**Alpha**](#current-status-and-anticipated-changes)
support for SVE2.3 (FEAT_SVE2p3), SME2.3 (FEAT_SME2p3), FEAT_F16F32DOT,
FEAT_F16F32MM, FEAT_F16MM and FEAT_SVE_B16MM intrinsics.

### References

@@ -2003,6 +2006,10 @@ are available. This implies that `__ARM_FEATURE_SVE` is nonzero.
are available and if the associated [ACLE features]
(#sme-language-extensions-and-intrinsics) are supported.

`__ARM_FEATURE_SVE2p3` is defined to 1 if the FEAT_SVE2p3 instructions
are available and if the associated [ACLE features]
(#sme-language-extensions-and-intrinsics) are supported.

#### NEON-SVE Bridge macro

`__ARM_NEON_SVE_BRIDGE` is defined to 1 if the [`<arm_neon_sve_bridge.h>`](#arm_neon_sve_bridge.h)
@@ -2026,6 +2033,7 @@ of SME has an associated preprocessor macro, given in the table below:
| FEAT_SME2 | __ARM_FEATURE_SME2 |
| FEAT_SME2p1 | __ARM_FEATURE_SME2p1 |
| FEAT_SME2p2 | __ARM_FEATURE_SME2p2 |
| FEAT_SME2p3 | __ARM_FEATURE_SME2p3 |

Each macro is defined if there is hardware support for the associated
architecture feature and if all of the [ACLE
@@ -2162,6 +2170,20 @@ See [Half-precision brain
floating-point](#half-precision-brain-floating-point) for details
of half-precision brain floating-point types.

#### Brain 16-bit floating-point matrix multiplication support

This section is in
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
extended in the future.

`__ARM_FEATURE_SVE_B16MM` is defined to `1` if there is hardware
support for the SVE BF16 matrix multiply extension and if the
associated ACLE intrinsics are available.

See [Half-precision brain
floating-point](#half-precision-brain-floating-point) for details
of half-precision brain floating-point types.

### Cryptographic extensions

#### “Crypto” extension
@@ -2380,6 +2402,18 @@ this implies:

* `__ARM_NEON == 1`

#### Half-precision to single-precision dot product extension

This section is in
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
extended in the future.

`__ARM_FEATURE_F16F32DOT` is defined if the half-precision dot product
accumulating to single-precision instructions are supported and the vector
intrinsics are available. Note that this implies:

* `__ARM_NEON == 1`
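
A minimal sketch of how user code might test this macro at compile time. The helper names `have_f16f32dot` and `f16f32dot_implies_neon` are hypothetical and not part of ACLE; the sketch only reports what the compiler defines and does not call any intrinsic.

```c
#include <stdbool.h>

/* Hypothetical helper: reports whether the compiler advertises the
   FEAT_F16F32DOT intrinsics for this translation unit. */
static bool have_f16f32dot(void) {
#if defined(__ARM_FEATURE_F16F32DOT)
    return true;
#else
    return false;
#endif
}

/* Hypothetical helper: checks the implication stated above, namely that
   __ARM_FEATURE_F16F32DOT being defined implies __ARM_NEON == 1. */
static bool f16f32dot_implies_neon(void) {
#if defined(__ARM_FEATURE_F16F32DOT) && !(defined(__ARM_NEON) && __ARM_NEON == 1)
    return false;
#else
    return true;
#endif
}
```

Code that uses the intrinsics would typically sit behind the same `#if defined(__ARM_FEATURE_F16F32DOT)` guard, with a portable fallback in the `#else` branch.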

#### Complex number intrinsics

`__ARM_FEATURE_COMPLEX` is defined if the complex addition and complex
@@ -2425,6 +2459,21 @@ instructions and if the associated ACLE intrinsics are available.
for the SVE 16-bit floating-point to 32-bit floating-point matrix multiply and add
(FEAT_SVE_F16F32MM) instructions and if the associated ACLE intrinsics are available.

##### Multiplication of 16-bit floating-point matrices (AdvSIMD)

This section is in
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
extended in the future.

`__ARM_FEATURE_F16F32MM` is defined if the NEON half-precision matrix multiply
accumulating to single-precision instruction is supported. Note that this implies:

* `__ARM_NEON == 1`

`__ARM_FEATURE_F16MM` is defined if the NEON non-widening half-precision matrix
multiply instruction is supported. Note that this implies:

* `__ARM_NEON == 1`

##### Multiplication of 32-bit floating-point matrices

`__ARM_FEATURE_SVE_MATMUL_FP32` is defined to `1` if there is hardware support
@@ -2662,6 +2711,9 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_DIRECTED_ROUNDING`](#directed-rounding) | Directed Rounding | 1 |
| [`__ARM_FEATURE_DOTPROD`](#availability-of-dot-product-intrinsics) | Dot product extension (ARM v8.2-A) | 1 |
| [`__ARM_FEATURE_DSP`](#dsp-instructions) | DSP instructions (Arm v5E) (32-bit-only) | 1 |
| [`__ARM_FEATURE_F16F32DOT`](#half-precision-to-single-precision-dot-product-extension) | Half-precision to single-precision dot product extension (FEAT_F16F32DOT) | 1 |
| [`__ARM_FEATURE_F16F32MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision to single-precision matrix multiply accumulating extension (FEAT_F16F32MM) | 1 |
| [`__ARM_FEATURE_F16MM`](#multiplication-of-16-bit-floating-point-matrices-advsimd) | Half-precision matrix multiply accumulating extension (FEAT_F16MM) | 1 |
| [`__ARM_FEATURE_FAMINMAX`](#floating-point-absolute-minimum-and-maximum-extension) | Floating-point absolute minimum and maximum extension | 1 |
| [`__ARM_FEATURE_FMA`](#fused-multiply-accumulate-fma) | Floating-point fused multiply-accumulate | 1 |
| [`__ARM_FEATURE_FP16_FML`](#fp16-fml-extension) | FP16 FML extension (Arm v8.4-A, optional Armv8.2-A, Armv8.3-A) | 1 |
@@ -2714,6 +2766,7 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SSVE_FP8FMA`](#modal-8-bit-floating-point-extensions) | Modal 8-bit floating-point extensions | 1 |
| [`__ARM_FEATURE_SVE`](#scalable-vector-extension-sve) | Scalable Vector Extension (FEAT_SVE) | 1 |
| [`__ARM_FEATURE_SVE_B16B16`](#non-widening-brain-16-bit-floating-point-support) | Non-widening brain 16-bit floating-point intrinsics (FEAT_SVE_B16B16) | 1 |
| [`__ARM_FEATURE_SVE_B16MM`](#brain-16-bit-floating-point-matrix-multiplication-support) | SVE brain 16-bit floating-point matrix multiply extension (FEAT_SVE_B16MM) | 1 |
| [`__ARM_FEATURE_SVE_BF16`](#brain-16-bit-floating-point-support) | SVE support for the 16-bit brain floating-point extension (FEAT_BF16) | 1 |
| [`__ARM_FEATURE_SVE_BFSCALE`](#brain-16-bit-floating-point-vector-multiplication-support) | SVE support for the 16-bit brain floating-point vector multiplication extension (FEAT_SVE_BFSCALE) | 1 |
| [`__ARM_FEATURE_SVE_BITS`](#scalable-vector-extension-sve) | The number of bits in an SVE vector, when known in advance | 256 |
@@ -2737,6 +2790,7 @@ be found in [[BA]](#BA).
| [`__ARM_FEATURE_SVE2_SM4`](#sm4-extension) | SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4) | 1 |
| [`__ARM_FEATURE_SVE2p1`](#sve2) | SVE version 2.1 (FEAT_SVE2p1)
| [`__ARM_FEATURE_SVE2p2`](#sve2) | SVE version 2.2 (FEAT_SVE2p2)
| [`__ARM_FEATURE_SVE2p3`](#sve2) | SVE version 2.3 (FEAT_SVE2p3)
| [`__ARM_FEATURE_SYSREG128`](#bit-system-registers) | Support for 128-bit system registers (FEAT_SYSREG128) | 1 |
| [`__ARM_FEATURE_UNALIGNED`](#unaligned-access-supported-in-hardware) | Hardware support for unaligned access | 1 |
| [`__ARM_FP`](#hardware-floating-point) | Hardware floating-point | 1 |
@@ -9594,6 +9648,15 @@ BFloat16 floating-point adjust exponent vectors.

### SVE2 floating-point matrix multiply-accumulate instructions.

#### BFMMLA, FMMLA (non-widening)

16-bit floating-point matrix multiply-accumulate.
```c
// Only if __ARM_FEATURE_SVE_B16MM
// Variant also available for _f16 if (__ARM_FEATURE_SVE2p2 && __ARM_FEATURE_F16MM)
svbfloat16_t svmmla[_bf16](svbfloat16_t zda, svbfloat16_t zn, svbfloat16_t zm);
```

#### FMMLA (widening, FP8 to FP16)

Modal 8-bit floating-point matrix multiply-accumulate to half-precision.
@@ -9955,6 +10018,23 @@ Lookup table read with 4-bit indices.
svint16_t svluti4_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
```

### SVE2.3 lookup table

The intrinsics in this section are defined by the header file
[`<arm_sve.h>`](#arm_sve.h) when `__ARM_FEATURE_SVE2p3` is defined to 1.

#### LUTI6

Lookup table read with 6-bit indices (8-bit).

Use of this intrinsic if `svcntb() * 8 < 256` results in undefined behaviour.

```c
// Variants are also available for: _u8 _mf8
svint8_t svluti6[_s8](svint8x2_t table, svuint8_t indices);
```
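
The vector-length precondition above can be sketched as a small check. This is an illustration only: `luti6_b8_vl_ok` is a hypothetical helper, and its `vl_bytes` parameter stands in for the value that `svcntb()` would return, so that the sketch compiles without SVE support.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: vl_bytes stands in for the value svcntb() would
   return. The 8-bit LUTI6 intrinsic above has defined behaviour only
   when the vector length is at least 256 bits. */
static bool luti6_b8_vl_ok(uint64_t vl_bytes) {
    return vl_bytes * 8 >= 256;
}
```

A caller might evaluate such a check once and select a fallback path for 128-bit implementations.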


### SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions

The specification for SVE2 Multi-vector AES and 128-bit polynomial multiply long instructions is in
@@ -13548,6 +13628,7 @@ Multi-vector saturating rounding shift right narrow and interleave

``` c
// Variants are also available for _u16[_u32_x2]
// and _s8[_s16_x2] _u8[_u16_x2] if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
```
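
As an illustration of "saturating rounding shift right narrow and interleave", the following scalar model shows one plausible reading for the `_s16[_s32_x2]` form. It is a sketch under assumptions, not the architectural definition: the function names are hypothetical, `imm` is assumed to lie in the range 1 to 16, and the results of the two source vectors are assumed to alternate element by element.

```c
#include <stddef.h>
#include <stdint.h>

/* Floor shift right that avoids the implementation-defined behaviour
   of >> on negative values. */
static int64_t floor_shr(int64_t v, unsigned sh) {
    if (v >= 0) return v >> sh;
    return -(((-v) + (((int64_t)1 << sh) - 1)) >> sh);
}

/* Rounding shift right by imm (assumed 1..16), then saturate to int16_t. */
static int16_t qrshrn_s32_to_s16(int32_t x, unsigned imm) {
    int64_t r = floor_shr((int64_t)x + ((int64_t)1 << (imm - 1)), imm);
    if (r > INT16_MAX) return INT16_MAX;
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}

/* Hypothetical scalar model: zn0 and zn1 each hold n elements and out
   holds 2*n elements. Assumption: results from zn0 land in even
   positions and results from zn1 in odd positions. */
static void qrshrn_model(const int32_t *zn0, const int32_t *zn1,
                         size_t n, unsigned imm, int16_t *out) {
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = qrshrn_s32_to_s16(zn0[i], imm);
        out[2 * i + 1] = qrshrn_s32_to_s16(zn1[i], imm);
    }
}
```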

@@ -13556,6 +13637,7 @@ Multi-vector saturating rounding shift right narrow and interleave
Multi-vector saturating rounding shift right unsigned narrow and interleave

``` c
// Variant for _u8[_u16_x2] is available if __ARM_FEATURE_SVE2p3 || __ARM_FEATURE_SME2p3
svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
```

@@ -13857,6 +13939,128 @@ Scalar index of first/last true predicate element (predicated).

```

### SVE2.3 and SME2.3 instruction intrinsics

The specifications for SVE2.3 and SME2.3 are in
[**Alpha** state](#current-status-and-anticipated-changes) and might change or be
extended in the future.

The functions in this section are defined by either the header file
[`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
when `__ARM_FEATURE_SVE2p3` or `__ARM_FEATURE_SME2p3` is defined, respectively.

#### ADDQP

Add pairwise within quadword vector segments.

``` c
// Variants are also available for _s16, _s32 and _s64
// [Review comment, Contributor] We should also have unsigned variants here.
svint8_t svaddqp[_s8](svint8_t zn, svint8_t zm);
```

#### ADDSUBP

Add subtract pairwise.

``` c
// Variants are also available for _s16, _s32 and _s64
// [Review comment, Contributor] We should also have unsigned variants here?
svint8_t svaddsubp[_s8](svint8_t zn, svint8_t zm);
```

#### LUTI6

Lookup table read with 6-bit indices (16-bit).

Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.

``` c
// Variants are also available for _u16_x2 _f16_x2
svint16_t svluti6_lane[_s16_x2](svint16x2_t table, svuint8_t indices, uint64_t imm_idx);
```

#### FCVTZSN, FCVTZUN

Floating-point narrowing convert to interleaved integer, rounding toward zero.

``` c
// Variants are also available for
// s16[_f32_x2], s32[_f64_x2],
// u8[_f16_x2], u16[_f32_x2], u32[_f64_x2]
svint8_t svcvt_s8[_f16_x2](svfloat16x2_t zn);
```

#### SABAL, UABAL

Two-way absolute difference sum and accumulate long.

``` c
// Variants are also available for
// s32[_s16], s64[_s32],
// u16[_u8], u32[_u16], u64[_u32]
svint16_t svaba_s16[_s8](svint16_t zda, svint8_t zn, svint8_t zm);
```
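
A scalar model may help to picture "two-way absolute difference sum and accumulate long" for the `_s16[_s8]` form. This is a sketch of one plausible reading, not the architectural definition: the function name is hypothetical, and it assumes that each 16-bit accumulator lane receives the sum of the absolute differences of the two adjacent 8-bit element pairs, with wrapping accumulation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model. n is the number of 16-bit accumulator
   lanes; zn and zm each hold 2*n elements.
   Assumption: the accumulation wraps modulo 2^16. */
static void sabal2_s16_s8_model(int16_t *zda, const int8_t *zn,
                                const int8_t *zm, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t d0 = (int32_t)zn[2 * i] - zm[2 * i];
        int32_t d1 = (int32_t)zn[2 * i + 1] - zm[2 * i + 1];
        if (d0 < 0) d0 = -d0;
        if (d1 < 0) d1 = -d1;
        /* Add in unsigned arithmetic so that overflow is well defined;
           the final conversion assumes two's complement int16_t. */
        uint16_t acc = (uint16_t)((uint32_t)(uint16_t)zda[i] + (uint32_t)(d0 + d1));
        zda[i] = (int16_t)acc;
    }
}
```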

#### SCVTF, SCVTFLT, UCVTF, UCVTFLT

Integer convert to floating-point (top and bottom).

``` c
// Variants are also available for
// f32[_s16], f64[_s32],
// f16[_u8], f32[_u16], f64[_u32]

svfloat16_t svcvtt_f16[_s8](svint8_t zn);
svfloat16_t svcvtb_f16[_s8](svint8_t zn);
```

#### SDOT, UDOT (vectors)

> Review comment (Contributor): There should already be a section for the _s32_s16 etc variant of this added with FEAT_SVE2p1 somewhere else in this doc, should this be merged into that?


Integer dot-product (2-way, vectors).

> Review comment (Contributor): "2-way" vs "two-way" are used inconsistently, pick one? Personal preference for the latter, but fine with either.


``` c
// Variants are also available for _u16_u8
svint16_t svdot[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm);
```
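
The two-way dot product can be pictured with a scalar model. This is a sketch under assumptions rather than the architectural definition: the function name is hypothetical, and it assumes that each 16-bit accumulator lane adds the products of the two corresponding adjacent 8-bit element pairs, with wrapping accumulation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model. n is the number of 16-bit accumulator
   lanes; zn and zm each hold 2*n elements.
   Assumption: the accumulation wraps modulo 2^16. */
static void sdot2_s16_s8_model(int16_t *zda, const int8_t *zn,
                               const int8_t *zm, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)zn[2 * i] * zm[2 * i]
                    + (int32_t)zn[2 * i + 1] * zm[2 * i + 1];
        /* Add in unsigned arithmetic so that overflow is well defined;
           the final conversion assumes two's complement int16_t. */
        uint16_t acc = (uint16_t)((uint32_t)(uint16_t)zda[i] + (uint32_t)sum);
        zda[i] = (int16_t)acc;
    }
}
```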

#### SDOT, UDOT (indexed)

Integer dot product by indexed element (two-way).

``` c
// Variants are also available for _u16_u8
svint16_t svdot_lane[_s16_s8](svint16_t zda, svint8_t zn, svint8_t zm,
                              uint64_t imm_idx);
```

#### SQSHRN, UQSHRN

Multi-vector saturating shift right narrow and interleave.

``` c
// Variants are also available for
// _u16[_u32_x2]
// [Review comment, Contributor] Rework this so that signed/unsigned are on their own lines instead of grouping by 8/16, then we can fix the alignment here?
// _s8[_s16_x2] _u8[_u16_x2]
svint16_t svqshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm);
```

#### SQSHRUN

Signed saturating shift right narrow by immediate to interleaved unsigned integer.

``` c
// Variant for _u8[_s16_x2] is also available.
svuint16_t svqshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm);
```

#### SUBP

Subtract pairwise.

``` c
// Variants are also available for _s16, _s32 and _s64
// [Review comment, Contributor] This should have unsigned variants?
svint8_t svsubp[_s8](svbool_t pg, svint8_t zn, svint8_t zm);
```

### SME2 maximum and minimum absolute value

The intrinsics in this section are defined by the header file
@@ -14581,6 +14785,36 @@ non-overloaded names to indicate which vector argument is a vector register pair
__arm_streaming __arm_inout("za");
```

### SME2.3 lookup table

The intrinsics in this section are defined by the header file
[`<arm_sme.h>`](#arm_sme.h) when `__ARM_FEATURE_SME2p3` is defined to 1.

#### LUTI6

Lookup table read with 6-bit indices (16-bit).

Use of this intrinsic if `svcntb() * 8 < 512` results in undefined behaviour.

```c
// Variants are also available for: _u16, _f16 and _bf16
svint16x4_t svluti6_lane_s16_x4[_s16_x2](svint16x2_t table, svuint8x2_t indices, uint64_t imm_idx);
```

Lookup table read with 6-bit indices (four registers, 8-bit).

``` c
// Variants are also available for: _u8 _mf8
svint8x4_t svluti6_zt_s8_x4(uint64_t zt0, svuint8x3_t zn) __arm_streaming __arm_in("zt0");
```

Lookup table read with 6-bit indices (table, single, 8-bit).

``` c
// Variants are also available for: _u8 _mf8
svint8_t svluti6_zt_s8(uint64_t zt0, svuint8_t zn) __arm_streaming __arm_in("zt0");
```

# M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [[MVE-spec]](#MVE-spec) instructions provide packed Single