feat(be-infra): endian-aware SIMD loaders across 5 backends #81
Merged
Strategy C+: each helper takes a raw byte pointer to LE- or BE-encoded
data and returns a SIMD vector in host-native byte order, ready for
native integer math. The host-native conversion is monomorphized at
compile time via cfg(target_endian = ...):
- load_le_* is a no-op on LE targets, byte-swaps on BE targets
- load_be_* byte-swaps on LE targets, no-op on BE targets
- load_endian_*::<const BE: bool> generic dispatcher; the unused
branch is dead-code-eliminated per monomorphization
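The dispatch pattern above can be sketched in scalar form (a minimal sketch; the real helpers return SIMD vectors, and these function names only mirror the naming scheme, not the PR's exact API):

```rust
/// No-op on LE targets, byte-swap on BE targets.
#[inline]
fn load_le_u16(bytes: [u8; 2]) -> u16 {
    u16::from_le_bytes(bytes)
}

/// Byte-swap on LE targets, no-op on BE targets.
#[inline]
fn load_be_u16(bytes: [u8; 2]) -> u16 {
    u16::from_be_bytes(bytes)
}

/// Generic dispatcher: the branch not selected by `BE` is
/// dead-code-eliminated per monomorphization.
#[inline]
fn load_endian_u16<const BE: bool>(bytes: [u8; 2]) -> u16 {
    if BE { load_be_u16(bytes) } else { load_le_u16(bytes) }
}

fn main() {
    // 0x0102 stored LE = bytes [0x02, 0x01]; stored BE = [0x01, 0x02].
    assert_eq!(load_endian_u16::<false>([0x02, 0x01]), 0x0102);
    assert_eq!(load_endian_u16::<true>([0x01, 0x02]), 0x0102);
    println!("ok");
}
```

Both monomorphizations return a host-native value on any target, which is the contract the SIMD helpers implement vector-wide.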
Per-backend helper inventory:
- neon/endian.rs: load_le/be_u16x8, load_le/be_u32x4 + dispatchers (6 fns)
- x86_sse41/endian.rs: load_le/be_u16x8, load_le/be_u32x4 + dispatchers (6 fns)
- x86_avx2/endian.rs: load_le/be_u16x16, load_le/be_u32x8 + dispatchers (6 fns)
- x86_avx512/endian.rs: load_le/be_u16x32, load_le/be_u32x16 + dispatchers (6 fns)
- wasm_simd128/endian.rs: load_le/be_u16x8, load_le/be_u32x4 + dispatchers (6 fns)
Byte-swap implementation:
NEON: vrev16q_u8 / vrev32q_u8 (via vreinterpretq round-trip)
SSE4.1: _mm_shuffle_epi8 with compile-time BYTESWAP_MASK_U16/U32 const
AVX2: _mm256_shuffle_epi8 with lane-replicated 256-bit masks
AVX-512: _mm512_shuffle_epi8 with lane-replicated 512-bit masks
WASM: u8x16_swizzle with i8x16(...) shuffle index constants
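The shuffle-based swaps all encode the same per-lane byte-reversal indices. A portable sketch of what the masks mean (the mask values are the standard u16/u32 reversal pattern — an assumption, since the PR's actual constants are not shown here):

```rust
// Index i of the mask names which source byte lands in output byte i.
const BYTESWAP_MASK_U16: [u8; 16] =
    [1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14];
const BYTESWAP_MASK_U32: [u8; 16] =
    [3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12];

// Scalar model of a pshufb/u8x16_swizzle-style byte shuffle.
fn swizzle(src: [u8; 16], mask: [u8; 16]) -> [u8; 16] {
    core::array::from_fn(|i| src[mask[i] as usize])
}

fn main() {
    let src: [u8; 16] = core::array::from_fn(|i| i as u8);
    // u16 lanes: adjacent byte pairs exchange places.
    assert_eq!(&swizzle(src, BYTESWAP_MASK_U16)[..4], &[1, 0, 3, 2]);
    // u32 lanes: each 4-byte group is fully reversed.
    assert_eq!(&swizzle(src, BYTESWAP_MASK_U32)[..4], &[3, 2, 1, 0]);
    println!("masks ok");
}
```

The wider AVX2/AVX-512 masks simply replicate these 16-byte patterns per 128-bit lane, since `_mm256_shuffle_epi8` / `_mm512_shuffle_epi8` shuffle within lanes.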
Each module carries #[allow(dead_code)] — the helpers are prerequisite
infrastructure for the 8 tier rollout PRs (Phase 2); tier kernels will
call load_endian_u16x*::<BE> from their own <const BE: bool> contexts.
Test coverage (per backend, 8 tests each, 40 total):
- LE loader on LE host: no-op verified (u16 + u32)
- BE loader on LE host: swap verified (u16 + u32)
- LE loader on BE host: swap verified (cfg-gated, runs on s390x QEMU)
- BE loader on BE host: no-op verified (cfg-gated)
- Generic dispatcher consistency: ::<false> == load_le, ::<true> == load_be
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Test-only refactor — extractor helpers in each backend's
tests/endian.rs returned `std::vec::Vec<T>`, allocating heap memory
on every assertion. SIMD lane counts are fixed at compile time, so
`[T; N]` stack arrays match the use case exactly and avoid the
allocation. AVX-512's expected-value builder also moves from
`(0..32).map(...).collect::<Vec<_>>()` to
`core::array::from_fn::<_, 32, _>(...)`.
Renames `*_to_vec` → `*_to_arr` for consistency. All 8 helpers
across the 5 backends are updated; assertion sites continue to use
`assert_eq!(got, [...])`, which works identically for arrays and
already worked for `Vec<T>` against array literals.
Verified:
- cargo test --lib row::arch::neon::tests::endian (aarch64): 6/6 OK
- cargo build --target x86_64-apple-darwin --tests: clean
- cargo build --target wasm32-unknown-unknown --tests
(RUSTFLAGS=-C target-feature=+simd128): clean
- cargo fmt --check: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex review of #81 caught four endian test files that imported
`super::*` and never used a single name from it. The repository's CI
sets RUSTFLAGS=-Dwarnings, so on the x86 test jobs (sse4.1, avx2,
avx512) and wasm32 the unused-import warnings would have failed
compilation before any SIMD coverage ran. Each module already has
`use crate::row::arch::<backend>::endian::*;` for the helpers under
test, which is the only import actually needed.
Verified:
- cargo build --target x86_64-apple-darwin --tests: 0 warnings
- cargo build --target wasm32-unknown-unknown --tests
  (RUSTFLAGS=-C target-feature=+simd128): only pre-existing warnings
  in wasm_simd128/tests/yuva.rs (unrelated to this PR)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Phase 1 of the big-endian support rollout: introduces endian-aware SIMD load helpers for u16/u32 across the NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends (plus per-backend tests), without modifying any row kernels yet.
Changes:
- Added per-backend `load_{le,be,endian}_u16xN` and `load_{le,be,endian}_u32xN` SIMD loader helpers with compile-time endian selection.
- Exposed new `endian` helper modules from each backend and wired new test modules into backend test suites.
- Added backend-specific unit tests validating swap/no-op behavior and generic dispatcher routing.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/row/arch/x86_sse41/tests/mod.rs | Registers the new SSE4.1 endian test module. |
| src/row/arch/x86_sse41/tests/endian.rs | Adds SSE4.1 endian loader tests for u16/u32 and generic dispatcher routing. |
| src/row/arch/x86_sse41/mod.rs | Exposes the new SSE4.1 endian helper module. |
| src/row/arch/x86_sse41/endian.rs | Implements SSE4.1 endian-aware u16x8/u32x4 SIMD loaders and const-generic dispatchers. |
| src/row/arch/x86_avx512/tests/mod.rs | Registers the new AVX-512 endian test module. |
| src/row/arch/x86_avx512/tests/endian.rs | Adds AVX-512 endian loader tests for u16/u32 and generic dispatcher routing. |
| src/row/arch/x86_avx512/mod.rs | Exposes the new AVX-512 endian helper module. |
| src/row/arch/x86_avx512/endian.rs | Implements AVX-512 endian-aware u16x32/u32x16 SIMD loaders and const-generic dispatchers. |
| src/row/arch/x86_avx2/tests/mod.rs | Registers the new AVX2 endian test module. |
| src/row/arch/x86_avx2/tests/endian.rs | Adds AVX2 endian loader tests for u16/u32 and generic dispatcher routing. |
| src/row/arch/x86_avx2/mod.rs | Exposes the new AVX2 endian helper module. |
| src/row/arch/x86_avx2/endian.rs | Implements AVX2 endian-aware u16x16/u32x8 SIMD loaders and const-generic dispatchers. |
| src/row/arch/wasm_simd128/tests/mod.rs | Registers the new wasm-simd128 endian test module. |
| src/row/arch/wasm_simd128/tests/endian.rs | Adds wasm-simd128 endian loader tests for u16/u32 and generic dispatcher routing. |
| src/row/arch/wasm_simd128/mod.rs | Exposes the new wasm-simd128 endian helper module. |
| src/row/arch/wasm_simd128/endian.rs | Implements wasm-simd128 endian-aware u16x8/u32x4 SIMD loaders and const-generic dispatchers. |
| src/row/arch/neon/tests/mod.rs | Registers the new NEON endian test module. |
| src/row/arch/neon/tests/endian.rs | Adds NEON endian loader tests for u16/u32 and generic dispatcher routing. |
| src/row/arch/neon/mod.rs | Exposes the new NEON endian helper module. |
| src/row/arch/neon/endian.rs | Implements NEON endian-aware u16x8/u32x4 SIMD loaders and const-generic dispatchers. |
Comment on lines +21 to +27

```rust
/// On a LE host, `load_le_u16x8` must NOT swap bytes.
#[test]
#[cfg(target_endian = "little")]
fn sse41_load_le_u16x8_noop_on_le_host() {
    if !std::arch::is_x86_feature_detected!("sse4.1") {
        return;
    }
```
Comment on lines +21 to +26

```rust
#[test]
#[cfg(target_endian = "little")]
fn avx2_load_le_u16x16_noop_on_le_host() {
    if !std::arch::is_x86_feature_detected!("avx2") {
        return;
    }
```
Comment on lines +21 to +26

```rust
#[test]
#[cfg(target_endian = "little")]
fn avx512_load_le_u16x32_noop_on_le_host() {
    if !std::arch::is_x86_feature_detected!("avx512bw") {
        return;
    }
```
Comment on lines +19 to +24

```rust
/// On a LE host, `load_le_u16x8` must NOT swap bytes — the in-memory LE
/// layout already matches host-native order.
#[test]
#[cfg(target_endian = "little")]
fn neon_load_le_u16x8_noop_on_le_host() {
    // 0x0102 stored LE = bytes [0x02, 0x01]; host reads as 0x0102.
```
Five Copilot review comments fixed:
1-4. Add `#[cfg_attr(miri, ignore = "...")]` to every #[test] in the
four arch-specific endian tests (NEON, SSE4.1, AVX2, AVX-512).
Matches the convention applied to every other SIMD-intrinsic
test in `row/arch/{neon,x86_*}/tests/` (see `legacy_rgb.rs`,
`planar_gbr_high_bit.rs`, etc.). CI's miri job currently
passes for these tests (the simple swizzle/byte-swap intrinsics
are Miri-supported), but the project convention is to ignore
all SIMD-intrinsic tests under Miri as a defensive measure
against future intrinsic additions and Miri regressions.
wasm_simd128 is skipped: Copilot did not flag it, there is no
precedent of miri ignores on existing wasm tests, and the wasm
test job already builds with `-C target-feature=+simd128` and
without `-Dwarnings`.
5. Comment in `x86_avx512/tests/endian.rs:33` claimed the
constructed u16 values span `0x0102..0x4142`. With the
`((i+1) as u8).wrapping_add(1)` low-byte / `(i+1) as u8`
high-byte construction, `i=31` produces `(0x20, 0x21)` →
`0x2021`, not `0x4142`. Corrected to `0x0102..0x2021`.
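The corrected range can be checked by reconstructing the fixture values from the description above (a sketch, not the test's actual code):

```rust
fn main() {
    // High byte (i+1), low byte ((i+1)+1), per the test's construction.
    let vals: [u16; 32] = core::array::from_fn(|i| {
        let hi = (i + 1) as u8;
        let lo = hi.wrapping_add(1);
        (u16::from(hi) << 8) | u16::from(lo)
    });
    assert_eq!(vals[0], 0x0102);
    assert_eq!(vals[31], 0x2021); // i=31 → (0x20, 0x21), not 0x4142
    println!("range ok");
}
```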
Verified:
- cargo test --lib endian: 6/6 NEON tests OK
- cargo fmt --check: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 7, 2026
Codex adversarial review of #82 caught a high-severity bug: the scalar
`if BE { x.swap_bytes() } else { x }` pattern swaps bytes
unconditionally whenever BE is true, regardless of target endianness —
but a swap is correct only when the data byte order differs from the
host CPU's byte order. That diverges from the SIMD
`load_endian_u16x*::<BE>` helpers from #81, which are
target-endian-aware. Mismatched semantics between scalar and SIMD
paths means scalar tails (and luma kernels, which are scalar-only)
would corrupt rows on a big-endian host (s390x), in BOTH BE=true and
BE=false cases.
The fix replaces every scalar load with the standard `u16::from_be` /
`u16::from_le` pair, which expands exactly to the SIMD helper
semantics: each is a no-op when the data byte order matches the host,
and a `swap_bytes()` when they differ.

    if BE { u16::from_be(r[x]) } else { u16::from_le(r[x]) }

26 call sites across the planar_gbr_high_bit kernels (g, b, r, a)
updated. Test helper `byte_swap_vec` left as-is; it intentionally
synthesizes BE-encoded buffers from LE inputs for parity tests on
LE-host CI (a future follow-up should make the test helper
target-endian aware too, when Phase 3 s390x QEMU coverage lands).
Verified: 2177 tests pass; cargo fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
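The corrected scalar contract can be verified on any host (a sketch mirroring the commit's inline snippet; `load_endian` is an illustrative wrapper, not the crate's API):

```rust
// from_le is the identity on LE hosts and a swap on BE hosts;
// from_be is symmetric — so both branches are host-correct.
fn load_endian<const BE: bool>(raw: u16) -> u16 {
    if BE { u16::from_be(raw) } else { u16::from_le(raw) }
}

fn main() {
    // 0x0102 encoded LE in memory, read back as a host-native u16:
    let le_raw = u16::from_ne_bytes([0x02, 0x01]);
    assert_eq!(load_endian::<false>(le_raw), 0x0102);
    // 0x0102 encoded BE in memory:
    let be_raw = u16::from_ne_bytes([0x01, 0x02]);
    assert_eq!(load_endian::<true>(be_raw), 0x0102);
    println!("scalar loads ok");
}
```

Both assertions hold whether the program runs on an LE or a BE host, which is exactly why the unconditional `swap_bytes()` variant could not be correct on both.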
uqio added a commit that referenced this pull request May 7, 2026
The scalar `load_f32::<BE>` / `load_f16::<BE>` helpers used an
unconditional `swap_bytes()` regardless of host endianness. The
corresponding SIMD `load_endian_u32x4::<BE>` / `load_endian_u16x8::<BE>`
helpers (added in PR #81 be-infra) are target-endian aware via
`cfg(target_endian = ...)`, so SIMD and scalar disagreed on big-endian
hosts. Tail loops dispatch to the scalar fallback, so any width whose
tail is non-zero on s390x corrupted the row.
Why s390x corrupts with the old code: when reading a `&[f32]`
reinterpreted from raw bytes, the host CPU reads the four bytes in
host-native order. On LE hosts that matches LE-on-disk; on BE hosts it
matches BE-on-disk. An unconditional swap therefore:
- LE host + BE data: correct (swap turns BE bytes into native LE) —
  the case the original code targeted.
- BE host + LE data: correct (swap turns LE bytes into native BE).
- BE host + BE data: WRONG (host-native is already BE, swap inverts it).
- LE host + LE data: handled by `BE = false` no-op — fine.
The fix routes both branches through `u32::from_be` / `u32::from_le`
(and `u16::from_be` / `u16::from_le` for f16):

    BE branch: `f32::from_bits(u32::from_be(raw.to_bits()))`
    LE branch: `f32::from_bits(u32::from_le(raw.to_bits()))`

`u32::from_le` is a no-op on LE hosts and a byte-swap on BE hosts;
symmetric for `from_be`. This makes both `<BE>` monomorphizations
correct on every target endianness and matches the contract the SIMD
endian helpers already implement. f32 / f16 paths use
`from_bits(u{32,16}::from_be(raw.to_bits()))` so the result is
host-native f32 / `half::f16` regardless of the source encoding.
The test helpers (`be_encode` in `planar_gbr_float.rs`,
`be_encode_f16` in `planar_gbr_f16.rs`) intentionally use
unconditional `swap_bytes` to synthesise BE-on-disk fixtures from LE
input on an LE host. They are not load helpers and remain unchanged.
No SIMD code paths needed changes — the per-arch `load_endian_*`
helpers already use `cfg(target_endian = ...)`. Tail loops still call
the scalar helpers, which are now correct.
Verified:
- `cargo test --target aarch64-apple-darwin --lib`: 2176 passed
- `cargo build --target x86_64-apple-darwin --tests`: 0 warnings
- `RUSTFLAGS="-C target-feature=+simd128" cargo build --target wasm32-unknown-unknown --tests`: clean
- `cargo build --no-default-features`: clean
- `cargo fmt --check`: clean
- `cargo clippy --all-targets --all-features -- -D warnings`: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
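A sketch of the fixed f32 branch (the signature is an illustrative simplification — the crate's helper may read from raw row memory rather than take an `f32` argument):

```rust
// Decode an f32 whose in-memory encoding is BE (BE = true) or LE
// (BE = false) into a host-native f32, per the commit's fix.
fn load_f32<const BE: bool>(raw: f32) -> f32 {
    if BE {
        f32::from_bits(u32::from_be(raw.to_bits()))
    } else {
        f32::from_bits(u32::from_le(raw.to_bits()))
    }
}

fn main() {
    let bits = 1.5f32.to_bits();
    // Synthesize "LE-on-disk, read back in host-native order":
    let le_raw = f32::from_bits(u32::from_ne_bytes(bits.to_le_bytes()));
    assert_eq!(load_f32::<false>(le_raw), 1.5);
    // Same for BE-on-disk:
    let be_raw = f32::from_bits(u32::from_ne_bytes(bits.to_be_bytes()));
    assert_eq!(load_f32::<true>(be_raw), 1.5);
    println!("f32 loads ok");
}
```

Both assertions hold on LE and BE hosts alike, which is the property the unconditional-swap version lacked.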
uqio added a commit that referenced this pull request May 7, 2026
Codex adversarial review of #82 caught a high-severity bug in the
Strategy A+ alpha-only fast-path helpers. `copy_alpha_plane_u16_to_u8`
and `copy_alpha_plane_u16` mask + shift raw u16 values without endian
awareness. On a big-endian host (s390x) processing LE-encoded Gbrap
input, the alpha plane is byte-reversed when both `with_rgb` and
`with_rgba` outputs are requested — the same class of bug we already
fixed for the direct `gbra_to_rgba_*` kernels in 26e5077.
The direct kernels are correct because they were threaded through
`<const BITS: u32, const BE: bool>` in PR #82 + 26e5077. The α-extract
helpers were left at `<const BITS: u32>` with raw `alpha[n]` reads and
silently drift to incorrect output on a BE host.
Fix: thread `<const BE: bool>` through both scalar α-extract helpers
and apply the same `u16::from_be` / `u16::from_le` pattern as the
direct-kernel scalar fix:

    let raw = if BE { u16::from_be(alpha[n]) } else { u16::from_le(alpha[n]) };
    rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8; // u16-to-u8 variant

Each conversion compiles to a no-op when the data byte order matches
the host CPU and a byte-swap otherwise, mirroring the SIMD
`load_endian_u16x*::<BE>` semantics from #81. Scalar tails and SIMD
hot paths now stay byte-for-byte equivalent on every host for
BE = false (the case currently exercised).
The dispatcher (`row::dispatch::alpha_extract`) gained the matching
`<const BE: bool>` parameter. When `BE = true` it routes directly to
scalar — the SIMD α-extract backends use raw native-u16 loads
(`vld1q_u16` / `_mm_loadu_si128` / `v128_load64_zero`) and have no
byte-swap path, so feeding them BE-encoded input would re-introduce
the same corruption. Per the spec ("Don't touch SIMD α-extract paths
... codex didn't flag those"), the SIMD kernels keep their existing
LE-oriented loads. Phase 4 will plumb `<const BE: bool>` through SIMD
if/when a real BE-input sinker hot-path lands.
All sinker call sites pass `<BITS, false>` for now (LE-only sinkers
today; matches the `false` already passed to the sibling
`gbr_to_rgb_u16_high_bit_row::<BITS, false>` calls). Eight call sites
updated:
- sinker/mixed/planar_gbr_high_bit.rs (Tier 10b, 2 sites)
- sinker/mixed/yuva_4_4_4.rs (Tier 9, 2 sites)
- sinker/mixed/yuva_4_2_2.rs (Tier 9, 2 sites)
- sinker/mixed/yuva_4_2_0.rs (Tier 9, 2 sites)
Each call site has an inline `// BE = false: ...` comment naming
Phase 4 as the follow-up that will plumb a real `<const BE: bool>`
from the row type.
The 8/16-bit variants `copy_alpha_plane_u8`,
`copy_alpha_packed_u8x4_at_3`, `copy_alpha_packed_u16x4_to_u8_at_0`,
`copy_alpha_packed_u16x4_at_0`, and `copy_alpha_ya_*` are unchanged:
the 8-bit ones have no endianness; the AYUV64 / Rgba64 / Bgra64 / Ya16
variants take packed sources whose endianness is already a property of
the source's row-type wrapper rather than this helper. The f32 helpers
(`copy_alpha_plane_f32*`) are left untouched — they belong to Tier 10
float (Gbrapf32 / Gbrpf16), out of scope for this PR. They will be
addressed in a separate PR when Phase 4 rolls up through the float
sinkers.
Tests added:
- `copy_alpha_plane_u16_to_u8_be_parity_with_swapped_buffer` — builds
  a host-side `swap_bytes` of the LE fixture, calls the helper with
  `<10, false>` on the LE buffer and `<10, true>` on the BE-encoded
  buffer, asserts identical output. Locks down the BE-flag round-trip
  on every host.
- `copy_alpha_plane_u16_be_parity_with_swapped_buffer` — same pattern
  for the u16-output variant.
Existing scalar tests retargeted to `<BITS, false>` (LE) to preserve
current behavior. SIMD parity tests in
`row/arch/{neon,x86_*,wasm_simd128}/alpha_extract.rs` retargeted the
scalar reference call to `<BITS, false>` — the SIMD helpers do
host-native loads, which matches scalar BE = false on LE hosts.
Verification:
- cargo test --target aarch64-apple-darwin --lib → 2179 passed
- cargo test --target x86_64-apple-darwin --lib → 2873 passed
- cargo build --target x86_64-apple-darwin --tests → 0 warnings
- RUSTFLAGS='-C target-feature=+simd128' cargo build --target wasm32-unknown-unknown --tests → only pre-existing unused-import warnings
- cargo build --no-default-features → ok
- cargo fmt --check → clean
- cargo clippy --all-targets --all-features -- -D warnings → clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
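The fixed helper body can be sketched end-to-end (the mask/shift derivation from BITS and the signature are assumptions; the load-and-store lines follow the commit's inline snippet):

```rust
// Extract a BITS-wide alpha sample per pixel, endian-aware, and
// store its top 8 bits into the RGBA alpha channel.
fn copy_alpha_u16_to_u8<const BITS: u32, const BE: bool>(
    alpha: &[u16],
    rgba_out: &mut [u8],
) {
    let mask = ((1u32 << BITS) - 1) as u16; // assumed derivation
    let shift = BITS - 8;                   // assumed derivation
    for n in 0..alpha.len() {
        let raw = if BE { u16::from_be(alpha[n]) } else { u16::from_le(alpha[n]) };
        rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8;
    }
}

fn main() {
    // One 10-bit max-alpha sample, LE-encoded, read host-native:
    let alpha = [u16::from_ne_bytes(0x03FFu16.to_le_bytes())];
    let mut out = [0u8; 4];
    copy_alpha_u16_to_u8::<10, false>(&alpha, &mut out);
    assert_eq!(out[3], 0xFF);
    println!("alpha ok");
}
```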
uqio added a commit that referenced this pull request May 7, 2026
Codex flagged a high-severity scalar BE bug in tier 10b: the
inline `if BE { x.swap_bytes() } else { x }` pattern is wrong
on big-endian hosts because `swap_bytes()` is unconditional —
it swaps even when the data already matches the host's byte
order. The matching SIMD `load_endian_*::<BE>` helpers from
PR #81 are target-endian aware (cfg-gated reverses; no-op when
source order matches host order), so the buggy scalar paths
diverge on s390x, corrupting both BE-input and LE-input rows
when run through scalar tails or the (always-scalar) luma
kernels.
Audit of tier 4 scalar code confirms tier 4 was implemented
from the start using the helper functions `load_endian_u16::<BE>`
and `load_endian_u32::<BE>` declared in `src/row/scalar/mod.rs`,
which build a fresh `[u8; N]` from the source pointer and decode
via `u16::from_be_bytes` / `u16::from_le_bytes` (and the u32
pair). Those byte-array decoders are target-endian aware: each
is a no-op when the data byte order matches the host CPU and a
hardware byte-swap when they differ — the same semantics as
`u16::from_be` / `u16::from_le` and the SIMD `load_endian_*`
helpers. No `if BE { x.swap_bytes() } else { x }` pattern exists
in tier 4 production scalar code (`src/row/scalar/{v210,y2xx,y216}.rs`),
so no scalar production fix is needed for s390x correctness.
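The tier 4 decoder shape described above can be sketched as follows (the pointer-based body is an assumption; only the helper names and the `from_*_bytes` decoding are taken from the commit message):

```rust
/// Build a fresh [u8; 2] from the source pointer and decode it,
/// per the target-endian aware contract: no-op when the data byte
/// order matches the host, hardware byte-swap when it differs.
///
/// # Safety
/// `src` must be valid for reading 2 bytes.
#[inline]
unsafe fn load_endian_u16<const BE: bool>(src: *const u8) -> u16 {
    let bytes = [*src, *src.add(1)];
    if BE { u16::from_be_bytes(bytes) } else { u16::from_le_bytes(bytes) }
}

fn main() {
    let le = [0x02u8, 0x01]; // 0x0102 LE-encoded
    let be = [0x01u8, 0x02]; // 0x0102 BE-encoded
    unsafe {
        assert_eq!(load_endian_u16::<false>(le.as_ptr()), 0x0102);
        assert_eq!(load_endian_u16::<true>(be.as_ptr()), 0x0102);
    }
    println!("tier4 ok");
}
```

Because the decode goes through `from_be_bytes` / `from_le_bytes`, no `if BE { x.swap_bytes() } else { x }` pattern can appear, which is why tier 4 needed no production fix.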
To prevent a future regression that introduces the buggy
pattern (a real risk now that the codex finding is on file
across tier 5/8/10b/10-float/11), this commit upgrades the
doc-comments on `load_endian_u16<BE>` and `load_endian_u32<BE>`
to:
- Spell out the **target-endian aware** contract (no swap on
matching host order, swap on differing order).
- Cite the codex finding and reference the tier 10b fix commit
message for the full motivation.
- Mark the inline-`swap_bytes` pattern as the "naive alternative"
that the helpers exist specifically to avoid.
Test helpers `to_be_u16` (`src/row/scalar/y2xx.rs`,
`src/row/scalar/y216.rs`) and `pack_v210_word_be`
(`src/row/scalar/v210.rs`) are intentionally left unchanged —
they synthesize BE-encoded fixtures from LE inputs on the
LE-host CI, mirroring the tier 5/8/10b convention; a future
phase 3 s390x QEMU run will revisit them.
Verified:
- cargo test --target aarch64-apple-darwin --lib (2171 passed, 0 failed)
- cargo build --target x86_64-apple-darwin --tests (clean, 0 warnings)
- RUSTFLAGS='-C target-feature=+simd128' cargo build --target wasm32-unknown-unknown --tests
- cargo build --no-default-features
- cargo fmt --check
- cargo clippy --all-targets --all-features -- -D warnings
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

Phase 1 of Big-Endian (BE) support rollout. Strategy C+ —
`<const BE: bool = false>` generic loaders that monomorphize at
compile time via `cfg(target_endian)`. Self-contained infra; no row
kernels touched yet (Phase 2 — 8 tier PRs to follow).

Per-backend inventory (6 fns each, 30 total):
- `load_le_u16xN` / `load_be_u16xN` / `load_endian_u16xN::<const BE: bool>`
- `load_le_u32xN` / `load_be_u32xN` / `load_endian_u32xN::<const BE: bool>`
- NEON: `vrev16q_u8` / `vrev32q_u8` via `vreinterpretq` round-trip
- SSE4.1: `_mm_shuffle_epi8` + `BYTESWAP_MASK_*` const
- AVX2: `_mm256_shuffle_epi8`, lane-replicated 256-bit masks
- AVX-512: `_mm512_shuffle_epi8`, lane-replicated 512-bit masks
- WASM: `u8x16_swizzle` with `i8x16(...)` index const

Compile-time monomorphization:
- On LE targets: `load_le_*` is a no-op load, `load_be_*` swaps
- On BE targets: `load_le_*` swaps, `load_be_*` is a no-op load
- `load_endian_*::<false>` → `load_le_*`; `::<true>` → `load_be_*`.
  The unused branch is dead-code-eliminated.

Test coverage

40 tests across 5 backends — per backend (8 tests):
- Generic dispatcher consistency: `::<false> == load_le`, `::<true> == load_be`

Scope

Each module carries `#[allow(dead_code)]`; helpers become live as
tier PRs land.

Test plan
- `cargo test --target aarch64-apple-darwin` (NEON path)
- `cargo test --target x86_64-apple-darwin` (SSE4.1 / AVX2 / AVX-512 paths via runtime detection)
- `cargo test --target wasm32-unknown-unknown` (wasm-simd128 — gated behind `simd128`)

🤖 Generated with Claude Code