feat(be-tier9): BE support for Rgbf32 / Rgbf16 row kernels#83

Merged
uqio merged 10 commits into main from feat/be-tier9
May 8, 2026

Conversation

@uqio
Collaborator

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 9 BE rollout. Stacked on #81 (BE infra). Adds <const BE: bool> to all Rgbf32 / Rgbf16 packed-RGB float row kernels across all 6 backends + dispatcher.

Implementation:

  • Scalar: to_bits().swap_bytes() for both f32 and f16 elements; LE f32 fast path uses copy_from_slice
  • NEON: load_f32x4::<BE> via load_endian_u32x4; widen_f16x4::<BE> via load_endian_u16x8; BE-only deinterleave path for vld3q_f32-inaccessible layouts
  • SSE4.1 / AVX2 / AVX-512: _mm*_castsi*_ps(load_endian_u32xN::<BE>(...)); F16C widening via _mm{,256,512}_cvtph_ps with endian-aware u16 loads
  • wasm-simd128: load_f32x4::<BE> via load_endian_u32x4; scalar f16 byte-swap before widening (no native f16 SIMD on wasm)
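As a sketch of the const-generic plumbing the bullets describe (hypothetical kernel name, not the crate's actual API; the decode shown here is the unconditional `swap_bytes` form from this summary):

```rust
// Hypothetical const-generic row kernel: `BE` is resolved at compile
// time, so each instantiation monomorphizes with no runtime branch.
fn rgb_row_kernel<const BE: bool>(src: &[u32], dst: &mut [u32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        // Unconditional swap for BE-encoded input, as in this summary.
        *d = if BE { s.swap_bytes() } else { *s };
    }
}

fn main() {
    let src = [0x0102_0304_u32];
    let mut le = [0u32; 1];
    let mut be = [0u32; 1];
    rgb_row_kernel::<false>(&src, &mut le); // existing callers pass `::<false>`
    rgb_row_kernel::<true>(&src, &mut be);
    assert_eq!(le[0], 0x0102_0304);
    assert_eq!(be[0], 0x0403_0201);
}
```

A later commit in this PR refines the scalar decode from `swap_bytes` to `from_be` / `from_le` for BE-host correctness.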

Test results: 2862 tests pass in total (33 new BE parity tests). `cargo clippy` and `cargo fmt` are clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --target wasm32-unknown-unknown --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

@uqio uqio force-pushed the feat/be-tier9 branch from c99a0a6 to d604016 on May 7, 2026 12:26
uqio added a commit that referenced this pull request May 7, 2026
`widen_f16x4_sse<BE=true>` was calling `load_endian_u16x8` (which reads 16
bytes via `_mm_loadu_si128`) but the kernel only guarantees 8 readable
bytes per call (4 × f16). The third widen call per loop iteration reads
[lane*2+16, lane*2+32) while the buffer ends at lane*2+24 when
`lane+12 == total_lanes` — an 8-byte tail-overread that ASan caught on
PR #83's CI sanitizer job.

Add `load_endian_u16x4` (8-byte load via `_mm_loadl_epi64` + low-half
byte-swap; upper half zeroed). The fix is correct because `_mm_cvtph_ps`
only reads the low 64 bits (4 × f16) of its `__m128i` operand, so the
zeroed upper half is harmless. AVX2 (`widen_f16x8_avx`) and AVX-512
(`widen_f16x16_avx512`) need a full 16/32-byte u16 region per call so
they keep using `load_endian_u16x8` / `load_endian_u16x16`.

Verified locally with `RUSTFLAGS=-Zsanitizer=address cargo +nightly test
--target x86_64-apple-darwin` (2862 tests pass).
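The overread arithmetic in the message above can be double-checked with plain integer math (constants illustrative, mirroring the message's bounds):

```rust
fn main() {
    // Loop base `lane` at the last full 12-lane chunk: lane + 12 == total_lanes.
    let lane = 0_usize;
    let total_lanes = lane + 12;
    let buf_end = total_lanes * 2; // row ends here, in bytes (2 bytes per f16)

    // The third widen call starts at lane + 8; a 16-byte `_mm_loadu_si128`
    // load covers [lane*2 + 16, lane*2 + 32).
    let load_start = (lane + 8) * 2;
    let load_end = load_start + 16;
    assert_eq!(load_end - buf_end, 8); // the 8-byte tail overread ASan caught

    // The fixed 8-byte `_mm_loadl_epi64` load ends exactly at the buffer end.
    assert_eq!(load_start + 8, buf_end);
}
```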
Base automatically changed from feat/be-infra to main May 7, 2026 12:37
@uqio uqio force-pushed the feat/be-tier9 branch 2 times, most recently from e83608b to bcc366b on May 7, 2026 13:10
@al8n al8n requested a review from Copilot May 7, 2026 23:38
uqio and others added 3 commits May 8, 2026 11:38
Add `<const BE: bool>` to all Rgbf32 and Rgbf16 row kernels across
scalar, NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends, plus
both dispatchers. BE parity tests added to every backend; existing
callers (sinkers, scalar tests, arch tests) updated to `<false>`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier 9 packed-float-RGB scalar BE conversion used unconditional
`x.swap_bytes()`, which always swaps regardless of host endianness. On
big-endian hosts (powerpc64, s390x) the source bytes are already in
host-native order, so an extra swap corrupts every BE row. The SIMD
`load_endian_*::<BE>` helpers shipped with feat/be-infra are already
target-endian aware (no-op on a matching host), so the scalar and
per-arch tail paths produced wrong output relative to the SIMD body
on a hypothetical s390x runner.

Replaced every `bits.swap_bytes()` / `to_bits().swap_bytes()` site in
the source files with `u32::from_be` / `u32::from_le` (for f32) or
`u16::from_be` / `u16::from_le` (for f16):

- `if BE { x.swap_bytes() } else { x }`
  → `if BE { u32::from_be(x) } else { u32::from_le(x) }`
- `f32::from_bits(raw.to_bits().swap_bytes())` (BE-only)
  → `f32::from_bits(if BE { u32::from_be(raw.to_bits()) }
       else { u32::from_le(raw.to_bits()) })`

`from_be` / `from_le` is a no-op when the encoded byte order matches
the host, a byte-swap when they differ — exactly mirroring the SIMD
helper semantics so LE and BE hosts now produce bit-identical output.

Special note for the f32 / f16 pass-through kernels: previously the
`else` branch fell back to `copy_from_slice`, which is a byte-level
copy. On a BE host that copies LE-encoded bytes into f32 / f16 lanes
verbatim, leaving the destination in non-host-native order — the
docstring claims "output is always host-native". The fix routes both
branches through `from_bits(from_be/from_le(to_bits()))`, which is a
no-op on LE host (correct, byte order matches) and a swap on BE host
(correct, since the data is LE-encoded).
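A minimal sketch of that decode pattern (hypothetical free function, not the crate's API); the fixture construction below is host-endian independent, so the assertions hold on both LE and BE hosts:

```rust
// Decode one f32 lane from the requested byte order to host-native,
// mirroring the `from_be` / `from_le` pattern described above.
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}

fn main() {
    let x = 1.5_f32;
    // Synthesize BE- and LE-encoded lanes by reinterpreting the encoded
    // bytes in native order.
    let be_lane = f32::from_bits(u32::from_ne_bytes(x.to_bits().to_be_bytes()));
    let le_lane = f32::from_bits(u32::from_ne_bytes(x.to_bits().to_le_bytes()));
    // Both decode back to the host-native value on any host.
    assert_eq!(decode_f32::<true>(be_lane), x);
    assert_eq!(decode_f32::<false>(le_lane), x);
}
```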

Source-file call sites fixed:

- u32 (f32 → bits, target-endian decoded): 7 — scalar `load_f32`,
  scalar `rgbf32_to_rgb_f32_row`, neon / x86_sse41 / x86_avx2 /
  x86_avx512 / wasm_simd128 `rgbf32_to_rgb_f32_row` BE tails.
- u16 (f16 → bits, target-endian decoded): 12 — scalar `load_f16`,
  scalar `rgbf16_to_rgb_f32_row`, scalar `rgbf16_to_rgb_f16_row`,
  neon `widen_f16_tail`, x86_sse41 `load_f16_scalar`, x86_avx2 /
  x86_avx512 `rgbf16_to_rgb_f32_row` f16 widen tails, wasm_simd128
  five f16 widen lanes (`rgbf16_to_rgb_row`, `rgbf16_to_rgba_row`,
  `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
- f32 special case: covered by the u32 sites (scalar
  `rgbf32_to_rgb_f32_row` and the per-arch BE tails go through
  `f32::from_bits(u32::from_be/le(to_bits()))`).
- f16 special case: covered by the u16 sites (scalar
  `rgbf16_to_rgb_f16_row` and `rgbf16_to_rgb_f32_row` go through
  `half::f16::from_bits(u16::from_be/le(to_bits()))`).

Test helpers (`be_rgbf32` / `be_rgbf16` in `tests/packed_rgb_float.rs`
across all arch backends) intentionally still use `swap_bytes()`
because they synthesize a BE-encoded buffer from an LE host input —
the unconditional swap is correct there and, per instructions, remains
unchanged. The NEON `widen_f16_tail` helper additionally became
`<const BE: bool>` (was previously calling `to_f32()` directly on
host-native bits, producing garbage when fed BE-encoded f16 — the
test `neon_rgbf16_to_rgb_f32_be_matches_le` failed at widths where
the 4-lane SIMD body left a non-zero tail).

Verified: 2170 lib tests pass on `aarch64-apple-darwin`;
`cargo build --target x86_64-apple-darwin --tests` clean;
`RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests` clean (warnings pre-existing);
`cargo build --no-default-features` clean; `cargo fmt --check` clean;
`cargo clippy --all-targets --all-features -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR is Phase 2 of the Tier 9 Big-Endian rollout, adding a <const BE: bool> endianness parameter to packed-float RGB row kernels (Rgbf32 / Rgbf16) and updating callers/tests accordingly so kernels can decode BE-encoded input streams.

Changes:

  • Updated Tier 9 packed float scalar kernels to decode f32/f16 element bit-patterns based on <const BE: bool>.
  • Updated MixedSinker Tier 9 sinker implementations to call the new const-generic row dispatchers.
  • Updated scalar tests to call the new const-generic kernel APIs.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

Files reviewed:

  • src/sinker/mixed/packed_rgb_float.rs: Updates Rgbf32 sinker to call const-generic row dispatchers with BE specified.
  • src/sinker/mixed/packed_rgb_f16.rs: Updates Rgbf16 sinker to call const-generic row dispatchers with BE specified.
  • src/row/scalar/tests.rs: Updates scalar Tier 9 tests to call const-generic kernels.
  • src/row/scalar/packed_rgb_float.rs: Adds endian-aware scalar loads and propagates <const BE: bool> across Tier 9 scalar kernels.


Comment on lines 209 to 213
if let Some(buf) = rgb_f32.as_deref_mut() {
    let f32_start = one_plane_start * 3;
    let f32_end = one_plane_end * 3;
-   rgbf32_to_rgb_f32_row(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
+   rgbf32_to_rgb_f32_row::<false>(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
}
Comment on lines 234 to 245
if let Some(buf) = rgb_f16.as_deref_mut() {
    let f16_start = one_plane_start * 3;
    let f16_end = one_plane_end * 3;
-   rgbf16_to_rgb_f16_row(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
+   rgbf16_to_rgb_f16_row::<false>(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
}

// Lossless f32 widen — also independent of integer conversion paths.
if let Some(buf) = rgb_f32.as_deref_mut() {
    let f32_start = one_plane_start * 3;
    let f32_end = one_plane_end * 3;
-   rgbf16_to_rgb_f32_row(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
+   rgbf16_to_rgb_f32_row::<false>(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
}
Comment thread src/row/scalar/packed_rgb_float.rs Outdated
Comment on lines +103 to +104
/// When `BE = true` the input `f32` values are encoded big-endian
/// (bytes swapped relative to the host's native little-endian layout).
Comment on lines +213 to 235
pub(crate) fn rgbf32_to_rgb_f32_row<const BE: bool>(
    rgb_in: &[f32],
    rgb_out: &mut [f32],
    width: usize,
) {
    debug_assert!(rgb_in.len() >= width * 3, "rgbf32 row too short");
    debug_assert!(rgb_out.len() >= width * 3, "rgb_f32_out row too short");
-   rgb_out[..width * 3].copy_from_slice(&rgb_in[..width * 3]);
+   // Decode each source f32 from `BE` byte order to host-native.
+   // `u32::from_be` / `u32::from_le` is target-endian aware: a no-op
+   // when encoded byte order matches the host, a byte-swap when they
+   // differ. Output is always host-native f32 on every target.
+   for (dst, src) in rgb_out[..width * 3]
+       .iter_mut()
+       .zip(rgb_in[..width * 3].iter())
+   {
+       let bits = src.to_bits();
+       *dst = f32::from_bits(if BE {
+           u32::from_be(bits)
+       } else {
+           u32::from_le(bits)
+       });
+   }
}
Comment thread src/row/scalar/packed_rgb_float.rs Outdated
Comment on lines +403 to +417
// Decode each source f16 from `BE` byte order to host-native, mirror
// of `rgbf32_to_rgb_f32_row`. `u16::from_be` / `u16::from_le` is
// target-endian aware: no-op when encoded byte order matches the
// host, swap when they differ. Output is always host-native f16.
for (dst, src) in rgb_out[..width * 3]
    .iter_mut()
    .zip(rgb_in[..width * 3].iter())
{
    let bits = src.to_bits();
    *dst = half::f16::from_bits(if BE {
        u16::from_be(bits)
    } else {
        u16::from_le(bits)
    });
}
Comment on lines 233 to 238
// Lossless f16 pass-through — emit first (independent of all other paths).
if let Some(buf) = rgb_f16.as_deref_mut() {
    let f16_start = one_plane_start * 3;
    let f16_end = one_plane_end * 3;
-   rgbf16_to_rgb_f16_row(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
+   rgbf16_to_rgb_f16_row::<false>(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
}
Comment on lines +396 to 418
pub(crate) fn rgbf16_to_rgb_f16_row<const BE: bool>(
    rgb_in: &[half::f16],
    rgb_out: &mut [half::f16],
    width: usize,
) {
    debug_assert!(rgb_in.len() >= width * 3, "rgbf16 row too short");
    debug_assert!(rgb_out.len() >= width * 3, "rgb_f16_out row too short");
-   rgb_out[..width * 3].copy_from_slice(&rgb_in[..width * 3]);
+   // Decode each source f16 from `BE` byte order to host-native, mirror
+   // of `rgbf32_to_rgb_f32_row`. `u16::from_be` / `u16::from_le` is
+   // target-endian aware: no-op when encoded byte order matches the
+   // host, swap when they differ. Output is always host-native f16.
+   for (dst, src) in rgb_out[..width * 3]
+       .iter_mut()
+       .zip(rgb_in[..width * 3].iter())
+   {
+       let bits = src.to_bits();
+       *dst = half::f16::from_bits(if BE {
+           u16::from_be(bits)
+       } else {
+           u16::from_le(bits)
+       });
+   }
}
uqio and others added 7 commits May 8, 2026 12:01
Codex review of PR #83 found two NEON-specific kernel correctness bugs in
the Tier 9 Rgbf16 BE path. Both are fixed here together.

Finding 1 — NEON f16 widening over-reads past row.

`widen_f16x4::<BE=true>` was calling `load_endian_u16x8` (16-byte load via
`vld1q_u16`) but the kernel only guarantees 8 readable bytes per call (4 ×
f16). The third widen call per 12-lane chunk reads bytes
[(lane+8)*2 .. (lane+16)*2) while the row ends at total_lanes*2 when
lane+12 == total_lanes — an 8-byte tail-overread that ASan/Miri catch on
guarded pages. Mirrors the SSE4.1 fix in 5967967.

The fix adds `load_endian_u16x4` to `src/row/arch/neon/endian.rs` (8-byte
load via `vld1_u16` + `vrev16_u8` byte-swap when needed) and uses it in
both the `BE=true` and `BE=false` arms of `widen_f16x4`. The downstream
`vcvt_f32_f16` already takes `uint16x4_t` so no further plumbing is
needed.

Finding 2 — f16→f32 widen-then-convert paths treat host-native f32 as
LE-encoded.

After widening f16 → f32 via `vcvt_f32_f16` (NEON), `_mm_cvtph_ps` /
`_mm256_cvtph_ps` / `_mm512_cvtph_ps` (x86), or scalar `to_f32()` (wasm),
the stack buffer carries host-native f32 values. The kernels then called
`rgbf32_to_*::<false>` to convert the buffer. With BE-aware kernel
semantics (BE=false means LE-encoded input, NOT host-native), the f32
loaders inside `rgbf32_to_*` would byte-swap the already-decoded host-
native buffer on a BE host — corrupting it.

The fix introduces a per-backend
  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");
and routes the post-widen conversion as `rgbf32_to_*::<HOST_NATIVE_BE>`.
The resulting decode is a no-op on both LE and BE hosts:
  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Applied to all 5 backends (NEON / SSE4.1 / AVX2 / AVX-512 / wasm-simd128)
across 20 call sites total. Each `rgbf32_to_rgb_row::<false>` /
`rgbf32_to_rgba_row::<false>` / `rgbf32_to_rgb_u16_row::<false>` /
`rgbf32_to_rgba_u16_row::<false>` after a SIMD widen now uses
`HOST_NATIVE_BE`. The lossless f16→f32 paths (`rgbf16_to_rgb_f32_row`)
write directly to the f32 output without a downstream convert and are
already correct.
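The routing pattern can be sketched in a few lines (illustrative names, not the crate's code): after a SIMD f16→f32 widen the stack buffer holds host-native f32, so the downstream f32 kernel must be told "host-native", never "LE-encoded".

```rust
// Per-backend constant from the fix described above.
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

// Stand-in for the `rgbf32_to_*` decode boundary (hypothetical helper).
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}

fn main() {
    let widened = 2.25_f32; // e.g. the result of `vcvt_f32_f16` / `_mm_cvtph_ps`
    // Routing through HOST_NATIVE_BE is an identity on any host…
    assert_eq!(decode_f32::<HOST_NATIVE_BE>(widened), widened);
    // …whereas `::<false>` would byte-swap the value on a BE host.
}
```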

Audit notes (other backends):

  • SSE4.1: only Finding 2 applied here; Finding 1 was fixed in 5967967.
  • AVX2: `widen_f16x8_avx` correctly loads 16 bytes (8 × f16 = 16) so no
    Finding 1; Finding 2 fix applied.
  • AVX-512: `widen_f16x16_avx512` correctly loads 32 bytes (16 × f16 =
    32) so no Finding 1; Finding 2 fix applied.
  • wasm-simd128: scalar widen, no SIMD u16 load to over-read; Finding 2
    fix applied (path was explicitly documented "call LE downstream",
    making the wasm32-LE-only assumption visible — fix is endian-agnostic
    so it survives any future BE wasm target).

Tests:

Width set for all NEON Rgbf16 BE-parity tests extended to include `5`
(was `[1, 4, 7, 16, 33, 1920, 1921]`) to cover the
"lane+12 == total_lanes via 1-pixel scalar tail" boundary. Added a
dedicated `neon_rgbf16_be_tail_overread_widths_4_5_16_33` regression test
that calls each kernel at exactly the over-read-prone widths through
exact-sized allocations. Verified locally:

  • cargo test --target aarch64-apple-darwin --lib → 2200 pass
  • cargo test --target x86_64-apple-darwin --lib → 2915 pass
  • cargo test --no-default-features --lib → 35 pass
  • RUSTFLAGS=-Zsanitizer=address cargo +nightly test
    --target aarch64-apple-darwin --lib row::arch::neon::tests::packed_rgb_float
    -Zbuild-std → 23 pass
  • RUSTFLAGS=-Zsanitizer=address cargo +nightly test
    --target x86_64-apple-darwin --lib rgbf16 -Zbuild-std → 60 pass
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings (new)
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → 0 warnings (new)
  • cargo build --no-default-features → ok
  • cargo fmt --check → clean
  • cargo clippy --all-targets --all-features (both targets) -D warnings
    → clean

Out of scope:

  • Finding 3 (sinker hardcodes ::<false>) — Phase 4 deferred per PR
    scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex 2nd-pass review of PR #83 found a third high-severity bug at the
**sinker** layer (one level above the SIMD-internal HOST_NATIVE_BE fix
that landed in c3a6478): the Rgbf32 and Rgbf16 `MixedSinker`
`PixelSink::process` impls hardcoded `::<false>` on every row dispatcher
call. With the new BE-aware kernel semantics, `BE = false` means
"decode LE-encoded input" (`u32::from_le` / `u16::from_le` / SIMD
`load_endian_*` LE arms). But `Rgbf32Frame` exposes a host-native
`&[f32]` row and `Rgbf16Frame` exposes a host-native `&[half::f16]` row
— the public API contract is "caller passes host-native floats". On a
BE host, `::<false>` would byte-swap the already-decoded host-native
values inside the loaders, corrupting the lossless `with_rgb_f32` /
`with_rgb_f16` pass-throughs **and** every downstream u8/u16/luma/HSV
output that flows through the same row.

Direct backend `::<true>` BE-parity tests (the body of c3a6478 and
prior commits) don't catch this because they bypass the sinker entirely
— they hand the kernel BE-encoded bytes and assert against the LE-
encoded counterpart, exercising only the kernel decode boundary, not
the sinker-to-kernel routing.

The fix is the **sinker-layer** complement of the SIMD-backend-internal
HOST_NATIVE_BE introduced in c3a6478:

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

Defined at module scope in both
`src/sinker/mixed/packed_rgb_float.rs` and
`src/sinker/mixed/packed_rgb_f16.rs`. Every hardcoded `::<false>` in
each sinker's `process` is replaced with `::<HOST_NATIVE_BE>`. Same
truth table as the SIMD fix, different layer:

  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Distinction from Phase 4 (out of scope here):

  This is **host-native correctness** — the contract that `Rgbf32Frame`
  / `Rgbf16Frame` exposes already-decoded floats. It is **NOT** the
  Phase 4 BE-source-frame work, which would let the Frame type itself
  carry an encoding tag (LE-encoded bytes vs BE-encoded bytes vs host-
  native) and thread that through the walker / row / sinker stack. The
  Yuva / Gbrap / packed-RGB-u16 sinkers that are still on `::<false>`
  fall under Phase 4 because their Frames hold `&[u16]` plane buffers
  whose interpretation depends on whether the caller passed an LE-
  encoded or BE-encoded byte stream — that's a Frame-API design
  question, not a host-native routing bug. This commit touches only
  the float sinkers whose Frame types unambiguously specify host-
  native element semantics.

Call sites changed:

  • src/sinker/mixed/packed_rgb_float.rs — 6 sites (rgbf32_to_rgb_row,
    rgbf32_to_rgba_row ×2, rgbf32_to_rgb_u16_row, rgbf32_to_rgba_u16_row,
    rgbf32_to_rgb_f32_row).
  • src/sinker/mixed/packed_rgb_f16.rs — 7 sites (rgbf16_to_rgb_row,
    rgbf16_to_rgba_row ×2, rgbf16_to_rgb_u16_row, rgbf16_to_rgba_u16_row,
    rgbf16_to_rgb_f16_row, rgbf16_to_rgb_f32_row).

Out of scope (Phase 4 territory, not touched here):

  • Frame types
  • Walker types
  • Other sinkers (Yuva, Gbrap, mono1bit u16, etc.) — their Frames
    carry plane bytes whose encoding semantics need explicit Phase 4
    plumbing.
  • Row dispatchers themselves (already BE-aware via const generic).

Tests:

Added 4 sinker-level regression tests (one kernel-equivalence test +
one public-API contract test for each of Rgbf32 / Rgbf16):

  • `rgbf32_kernel_host_native_be_matches_false_on_le_host` and
    `rgbf16_kernel_host_native_be_matches_false_on_le_host` — call
    each `rgbf32_to_*` / `rgbf16_to_*` dispatcher with both
    `BE = false` and `BE = HOST_NATIVE_BE` (= `cfg!(target_endian = "big")`),
    asserting outputs are byte-equal on the active host. On a LE host
    both are no-op so this documents the routing equivalence; on a BE
    host the same equivalence holds for the **fixed** sinker but would
    fail for the broken one. Width 33 covers SIMD main loop + scalar
    tail across every backend.
  • `rgbf32_sinker_host_native_contract_lossless_passthrough` and
    `rgbf16_sinker_host_native_contract_lossless_passthrough` — feed
    `Rgbf32Frame` / `Rgbf16Frame` through the public sinker API and
    assert `with_rgb_f32` / `with_rgb_f16` round-trips host-native
    input bit-exact. Pairs with the kernel-level test to cover both
    the dispatch boundary and the public sinker boundary.

Comment in each docstring notes that full BE-host coverage requires
QEMU s390x (Phase 3) — these tests document the contract on LE and
would catch the bug on BE.

Verified locally:

  • cargo test --target aarch64-apple-darwin --lib → 2204 pass
    (was 2200; +4 new tests).
  • cargo test --target x86_64-apple-darwin --lib → 2919 pass
    (was 2915; +4 new tests).
  • cargo test --no-default-features --lib → 35 pass.
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings.
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → only 3 pre-existing
    `unused imports` warnings in unrelated `wasm_simd128/tests/*`
    files, not introduced by this change (confirmed by stashing the
    diff and rebuilding on c3a6478).
  • cargo build --no-default-features → ok.
  • cargo fmt --check → clean.
  • cargo clippy --all-targets --all-features
    --target {aarch64,x86_64}-apple-darwin -- -D warnings → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tcher-equivalence tests on LE host

Codex 3rd-pass review of PR #83 surfaced three endian-correctness findings
around `rgbf32_to_rgb_f32_row::<BE>` and its dispatcher-equivalence tests:

Finding 1 (high) — NEON `BE = false` branch used a raw `vld1q_f32` /
`vst1q_f32` copy. That preserves on-disk byte order: it's correct when the
input encoding matches host-native (the historical assumption: LE-encoded
input on a LE host) but corrupts the lossless f32 output on a BE host where
the request `BE = false` should mean "decode LE-encoded input to host-native"
— the kernel must byte-swap, not pass through. Audit revealed identical
defects in the SSE4.1, AVX2, AVX-512, and wasm-simd128 backends.

Findings 2 + 3 (medium) — the Rgbf32 / Rgbf16 sinker dispatcher-equivalence
tests asserted `::<false>` ≡ `::<HOST_NATIVE_BE>` while feeding host-native
fixtures. On LE hosts both calls are byte-for-byte identical (the test's
intent), but on BE hosts `::<false>` decodes the host-native fixture as if it
were LE-encoded (byte-swap) while `::<HOST_NATIVE_BE> == ::<true>` decodes as
BE (no swap), so the outputs diverge by design — the equivalence claim is
specifically about the LE host-routing pattern.

Fix:

* Replace the raw-copy fast path in every `rgbf32_to_rgb_f32_row::<BE>`
  backend (NEON, SSE4.1, AVX2, AVX-512, wasm-simd128) with a host-endian
  gate: `if BE == HOST_NATIVE_BE { raw copy } else { endian-aware load }`.
  When the requested encoding matches host-native the bytes can be copied
  verbatim (perf-equivalent to the old fast path on the only shipping
  target — LE); otherwise the kernel falls through to the existing
  `load_f32x{4,8,16}::<BE>` slow path which byte-swaps via the `_endian_*`
  loaders. Tail loop now uses the endian-aware
  `if BE { from_be } else { from_le }` decode (matches scalar reference at
  `src/row/scalar/packed_rgb_float.rs:213`).
* Add five new BE-target regression tests
  (`{neon,sse41,avx2,avx512,wasm}_rgbf32_to_rgb_f32_row_le_input_decodes_correctly_on_any_host`).
  Each constructs an LE-encoded f32 byte fixture (host-native bits passed
  through `f32::from_bits(u32::from_le(_))`) and feeds it through `::<false>`,
  asserting the output matches the original host-native expected values. On
  LE hosts this is a vacuous identity check; on BE hosts (full QEMU s390x
  coverage is Phase 3) it would have caught the original bug.
* Gate the two sinker dispatcher-equivalence tests
  (`rgbf32_kernel_host_native_be_matches_false_on_le_host` and
  `rgbf16_kernel_host_native_be_matches_false_on_le_host`) on
  `#[cfg(target_endian = "little")]`. BE-host correctness of the routing
  change is verified instead by `*_sinker_host_native_contract_lossless_passthrough`
  (sinker public-API contract) and the row-kernel BE parity tests.
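The host-endian gate from the first fix bullet can be sketched as follows (hypothetical function name and plain `swap_bytes` slow path standing in for the SIMD `load_f32x{4,8,16}::<BE>` loaders):

```rust
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

// Sketch of the gated pass-through: verbatim copy only when the requested
// encoding matches host-native; otherwise decode lane-by-lane.
fn rgbf32_passthrough<const BE: bool>(src: &[f32], dst: &mut [f32]) {
    if BE == HOST_NATIVE_BE {
        // Requested encoding matches host-native: bytes copy verbatim.
        dst.copy_from_slice(src);
    } else {
        // Mismatch: each lane needs a byte swap to become host-native.
        for (d, s) in dst.iter_mut().zip(src) {
            *d = f32::from_bits(s.to_bits().swap_bytes());
        }
    }
}

fn main() {
    let src = [1.0_f32, -2.5];
    let mut out = [0.0_f32; 2];
    // A host-native request always takes the verbatim-copy path, on any host.
    if HOST_NATIVE_BE {
        rgbf32_passthrough::<true>(&src, &mut out);
    } else {
        rgbf32_passthrough::<false>(&src, &mut out);
    }
    assert_eq!(out, src);
}
```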

Audit results — every backend with a `rgbf32_to_rgb_f32_row::<BE>` kernel
that used a raw passthrough on `BE = false` had the same defect:

  src/row/arch/neon/packed_rgb_float.rs:414      — fixed
  src/row/arch/x86_sse41/packed_rgb_float.rs:322 — fixed
  src/row/arch/x86_avx2/packed_rgb_float.rs:332  — fixed
  src/row/arch/x86_avx512/packed_rgb_float.rs:303 — fixed
  src/row/arch/wasm_simd128/packed_rgb_float.rs:287 — fixed

Five call sites changed (all five vector backends), five new regression
tests, two existing sinker dispatcher-equivalence tests cfg-gated on
`target_endian = "little"`. All `cargo test` / `cargo build` /
`cargo fmt --check` / `cargo clippy --all-targets --all-features
-- -D warnings` checks pass on aarch64-apple-darwin and x86_64-apple-darwin
(LE hosts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up)

Codex 4th-pass review of PR #83 found that the previous `f1161d7` audit, which
fixed the `BE = false` raw-load fast path in `rgbf32_to_rgb_f32_row` per
backend, missed three other kernels with the same broken pattern: NEON's
`rgbf32_to_rgba_row` / `rgbf32_to_rgba_u16_row` (`vld3q_f32` deinterleave
hardcoded for the LE-host quadrant) and the AVX2 / AVX-512 `widen_f16x*`
helpers used by every Rgbf16 SIMD entry point (raw `_mm_loadu_si128` /
`_mm256_loadu_si256` for `BE = false`). On a big-endian AArch64 or x86 host
with LE-encoded input, each of these reads host-native (BE) bytes from an
LE buffer and mis-decodes the f32 / f16 lanes downstream — same defect class
as the original f32 pass-through bug, just in different kernels.

Fix:

* NEON `rgbf32_to_rgba_row` / `rgbf32_to_rgba_u16_row`: replace the hardcoded
  `if BE { endian-aware } else { vld3q_f32 }` deinterleave gate with the
  `BE == HOST_NATIVE_BE` host-endian gate (same pattern f1161d7 established
  for `rgbf32_to_rgb_f32_row`). Fast path uses `vld3q_f32` when on-disk
  encoding matches host-native; otherwise falls through to the existing
  endian-aware `load_f32x4::<BE>` slow path with manual deinterleave. Two
  call sites changed.
* AVX2 `widen_f16x8_avx`: drop the `if BE { load_endian_u16x8::<BE> } else
  { _mm_loadu_si128 }` conditional in favor of unconditionally routing
  through `load_endian_u16x8::<BE>`. The endian-aware loader monomorphizes
  to a no-op `_mm_loadu_si128` when `BE` matches host-native and to a
  byte-swap shuffle otherwise — correct on both LE and BE hosts. One call
  site changed; transitively fixes 5 entry points (`rgbf16_to_rgb_row`,
  `rgbf16_to_rgba_row`, `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
* AVX-512 `widen_f16x16_avx512`: same fix using `load_endian_u16x16::<BE>`.
  One call site changed; transitively fixes the same 5 entry points.
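A portable scalar analogue of the `load_endian_*` idea the AVX2 / AVX-512 bullets rely on (illustrative only; the real helpers are SIMD intrinsics over 8 / 16 lanes):

```rust
// Decode one u16 lane from the requested on-disk byte order. The `BE`
// branch is resolved at monomorphization time, mirroring how the SIMD
// loader compiles to a plain load when `BE` matches host-native and to
// a byte-swap shuffle otherwise.
fn load_endian_u16<const BE: bool>(bytes: [u8; 2]) -> u16 {
    if BE {
        u16::from_be_bytes(bytes)
    } else {
        u16::from_le_bytes(bytes)
    }
}

fn main() {
    assert_eq!(load_endian_u16::<true>([0x12, 0x34]), 0x1234);
    assert_eq!(load_endian_u16::<false>([0x12, 0x34]), 0x3412);
}
```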

Audit results — only kernels with a `<const BE: bool>` parameter that gated a
raw-load fast path on the bare `BE` flag had this defect:

  src/row/arch/neon/packed_rgb_float.rs:162   rgbf32_to_rgba_row     — fixed
  src/row/arch/neon/packed_rgb_float.rs:341   rgbf32_to_rgba_u16_row — fixed
  src/row/arch/x86_avx2/packed_rgb_float.rs:393  widen_f16x8_avx     — fixed
  src/row/arch/x86_avx512/packed_rgb_float.rs:367 widen_f16x16_avx512 — fixed

Backends checked clean: SSE4.1 (RGBA paths already use `load_f32x4::<BE>`,
f16 widen already uses `load_endian_u16x4::<BE>`); WASM SIMD128 (RGBA paths
use `load_f32x4::<BE>`, f16 widen is scalar with explicit endian decode);
NEON `widen_f16x4` (already uses `load_endian_u16x4::<BE>`).

Regression tests — twelve new LE-decode tests using the established
`*_le_input_decodes_correctly_on_any_host` pattern:

  neon_rgbf32_to_rgba_row_le_input_decodes_correctly_on_any_host
  neon_rgbf32_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgba_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_f32_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgba_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_u16_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_f32_row_le_input_decodes_correctly_on_any_host

Each constructs an LE-encoded fixture (host-native bits passed through
`from_bits(u_n::from_le(_))`) and feeds it through `::<false>`, asserting
the SIMD output matches the scalar reference (which is endian-correct via
`load_f32::<BE>` / `load_f16::<BE>`). On LE hosts these are vacuous identity
checks; on BE hosts (full QEMU s390x coverage is Phase 3) each would have
caught the original bug.
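The fixture-and-decode round trip behind these tests can be sketched in a few stdlib-only lines. Here `decode_le_f32` is a hypothetical stand-in for a row kernel instantiated with `::<false>`; the real tests assert SIMD-vs-scalar parity rather than raw values:

```rust
/// Hypothetical stand-in for a `BE = false` row kernel: per lane,
/// reinterpret the stored bits as little-endian, then rebuild a
/// host-native f32.
fn decode_le_f32(encoded: &[f32]) -> Vec<f32> {
    encoded
        .iter()
        .map(|v| f32::from_bits(u32::from_le(v.to_bits())))
        .collect()
}

fn main() {
    let intended = [1.0_f32, -2.5, 0.125];
    // Build an LE-encoded fixture: pass host-native bits through
    // `u32::from_le`, so the in-memory bytes are little-endian on any host.
    let fixture: Vec<f32> = intended
        .iter()
        .map(|v| f32::from_bits(u32::from_le(v.to_bits())))
        .collect();
    // Decoding the LE fixture through the LE path recovers the intended
    // values on every host; on an LE host both `from_le` calls are the
    // identity, which is exactly why the check is vacuous there.
    assert_eq!(decode_le_f32(&fixture), intended);
}
```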

Aarch64 LE-host test count: 2205 → 2207 (+2 NEON RGBA tests). x86_64 LE-host
test count: 2922 → 2932 (+10 = 5 AVX2 + 5 AVX-512). All `cargo test` /
`cargo build` (LE x86 + aarch64 + WASM) / `cargo build --no-default-features`
/ `cargo fmt --check` / `cargo clippy --all-targets --all-features
-- -D warnings` checks pass on aarch64-apple-darwin and x86_64-apple-darwin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…little")

Codex 5th-pass review of PR #83 found a medium finding in
src/row/scalar/tests.rs: the six Rgbf16 scalar row kernel parity /
widen / copy tests at lines 658-743 use host-native fixtures
(half::f16::from_f32 / Vec<f32>) and call the kernels with `::<false>`
(LE-encoded path). On a big-endian host (e.g. miri-sb-s390x), the
kernel's `u16::from_le` / `u32::from_le` byte-swap correctly
reinterprets the host-native fixture as if it were LE-encoded —
producing different bits than the test asserts. Same class as the
PR #82 alpha_extract / planar_gbr_high_bit fix in 8f2e329.

Fix: add `#[cfg(target_endian = "little")]` alongside `#[test]` on the
six tests, plus on the shared `rgbf16_test_inputs` helper to avoid an
unused-fn warning on BE hosts. Add a section-header comment block
explaining why these tests fail on BE, why the kernel itself remains
correct (locked down by the dedicated BE-parity tests in the
per-backend `tests/packed_rgb_float.rs` modules that build LE-encoded
fixtures via `f32::from_bits(u32::from_le(_))` /
`half::f16::from_bits(u16::from_le(_))`), and why byte-symmetric value
tests are intentionally NOT gated.

Audit:
  - The only `tests.rs` file under `src/row/scalar/` is the file in
    this commit; no other dedicated test files in that directory.
  - Inline `mod tests` blocks in other scalar source files were
    audited in PR #82 (`8f2e329`); no new occurrences of the
    host-native + `<false>` pattern landed in PR #83.
  - YUV planar high-bit-depth tests (e.g. yuv420p10_*) use host-native
    u16 too, but their kernels read u16 directly without `from_le`
    — the byte-format-agnostic contract is documented in
    `src/row/scalar/yuv_planar_high_bit.rs`. Those tests are correctly
    NOT gated.

Tests gated: 6 (rgbf16_scalar_{rgb,rgba,rgb_u16,rgba_u16}_matches_widen_then_rgbf32,
rgbf16_scalar_rgb_f32_matches_element_wise_widen,
rgbf16_scalar_rgb_f16_is_copy) + the rgbf16_test_inputs helper.

LE-host test count unchanged (gates are no-ops):
  Before: 2207 passed
  After:  2207 passed (cargo test --target aarch64-apple-darwin --lib)

cargo fmt --check, cargo clippy --all-targets --all-features
-- -D warnings, cargo build --target x86_64-apple-darwin --tests,
RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests, cargo build --no-default-features
all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n = "little") (audit follow-up)

Codex 6th-pass review of PR #83 found a medium finding in
src/row/arch/neon/tests/packed_rgb_float.rs (lines 32-242): the eleven
Rgbf32 / Rgbf16 SIMD-vs-scalar parity tests use host-native fixtures
(`pseudo_random_rgbf32` / `pseudo_random_rgbf16`) and call the kernels
with `::<false>` (LE-encoded path). On a big-endian host (e.g.
aarch64-be-linux-gnu, miri-sb-s390x), the kernel's `u32::from_le` /
`u16::from_le` byte-swap correctly reinterprets the host-native fixture
as if it were LE-encoded — producing different bits than the test
asserts. Same class as the PR #82 alpha_extract / planar_gbr_high_bit
fix in 8f2e329 and the PR #83 5th-pass scalar gate in 56342c0.

For the SIMD-vs-scalar parity assertions (`assert_eq!(out_scalar,
out_simd)`), parity holds vacuously on BE because both paths apply
the same `from_le` byte-swap to the host-native fixture and produce
the same (corrupted) decoded f32/f16. For the two `lossless`
host-native equality assertions (`assert_eq!(out_neon,
input[..w * 3])` for `rgbf32_to_rgb_f32_row` and
`rgbf16_to_rgb_f16_row`), the assertion fails outright on BE since
the kernel decodes through `load_f32x4::<false>` / scalar `from_le`
to produce a byte-swapped (relative to host-native) result.

The kernel itself is correct on BE; this is purely a fixture-vs-
kernel byte-order mismatch. NEON BE-host correctness is locked down
separately by the dedicated BE-parity tests in this same module
(which build LE-encoded fixtures via `byte_swap` helpers and assert
`<true>`/`<false>` parity on every host) and by the LE-decode
regression tests added in commits c3a6478, dcf40a3, f1161d7,
63fdf8f. Those tests are intentionally NOT gated.

Fix: add `#[cfg(target_endian = "little")]` alongside `#[test]` on
the eleven NEON parity tests, plus a section-header comment block
explaining why these tests fail on BE, why the kernel itself remains
correct, and why byte-swap-helper / LE-decode regression tests are
intentionally NOT gated. The shared `pseudo_random_rgbf32` /
`pseudo_random_rgbf16` helpers are NOT gated because they're also
used by the BE-parity / LE-decode tests that compile on every host.

Audit of other backend test files (`packed_rgb_float.rs` under
x86_sse41, x86_avx2, x86_avx512, wasm_simd128):

  - SSE4.1: 12 tests with same pattern (1 MXCSR regression + 5

    Rgbf32 + 6 Rgbf16). Gated for structural consistency. These tests
    already compile only on `target_arch = "x86_64"`, which always
    implies `target_endian = "little"`, so the gate is functionally a
    no-op on every supported configuration — but it documents the
    assumption and matches the audit pattern.
  - AVX2: 11 tests (5 Rgbf32 + 6 Rgbf16). Same rationale.
  - AVX-512: 11 tests (5 Rgbf32 + 6 Rgbf16). Same rationale.
  - wasm_simd128: 11 tests (5 Rgbf32 + 6 Rgbf16).
    `target_arch = "wasm32"` is LE by spec; gate added for
    consistency / future-proofing against hypothetical BE wasm.

Total tests gated: 56 (11 NEON + 12 SSE4.1 + 11 AVX2 + 11 AVX-512 +
11 wasm_simd128).

LE-host test count unchanged (gates are no-ops):
  Before: 2207 passed
  After:  2207 passed (cargo test --target aarch64-apple-darwin --lib)

cargo fmt --check, cargo clippy --all-targets --all-features
-- -D warnings, cargo build --target x86_64-apple-darwin --tests
(0 warnings), RUSTFLAGS="-C target-feature=+simd128" cargo build
--target wasm32-unknown-unknown --tests (3 pre-existing warnings,
not from this commit), cargo build --no-default-features, and
cargo check --target s390x-unknown-linux-gnu --lib (BE-host smoke
check) all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Copilot findings addressed (others were already fixed in dcf40a3
/ 56342c0 / 4340b15 — Copilot reviewed pre-fix state):

1. **Doc rewording** at `rgbf32_to_rgb_row` — was "bytes swapped
   relative to the host's native little-endian layout" which is
   misleading on BE hosts. Reframed in terms of the input buffer's
   encoded byte order vs the host CPU's native order.

2. **Pass-through perf fast path** for `rgbf32_to_rgb_f32_row` —
   added `BE == HOST_NATIVE_BE` branch that becomes a single
   `copy_from_slice` (memcpy) when the encoded byte order matches the
   host. Restores the pre-BE-aware "lossless pass-through" perf
   characteristic. The const-generic dead branch is eliminated per
   monomorphization, so the slow byte-swap path is only emitted for
   `BE != HOST_NATIVE_BE` callers.

3. **Pass-through perf fast path** for `rgbf16_to_rgb_f16_row` —
   mirror of (2) for half-precision input.
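The shape of both fast paths can be sketched as follows; `passthrough_row` and its flat `&[f32]` signature are hypothetical simplifications of the real kernels:

```rust
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

/// Sketch of the pass-through fast path: when the input's encoded byte
/// order matches the host's native order, the whole row is a memcpy;
/// otherwise every element is rebuilt from swapped bits. The branch is
/// on a const generic, so each monomorphization keeps only one arm.
fn passthrough_row<const BE: bool>(input: &[f32], output: &mut [f32]) {
    if BE == HOST_NATIVE_BE {
        // Encoded order == host order: lossless pass-through.
        output.copy_from_slice(input);
    } else {
        // Cross-endian: byte-swap each element's bit pattern.
        for (o, i) in output.iter_mut().zip(input) {
            *o = f32::from_bits(i.to_bits().swap_bytes());
        }
    }
}

fn main() {
    let row = [1.5_f32, -0.25, 3.0];
    let mut out = [0.0_f32; 3];
    // On any host, `BE == HOST_NATIVE_BE` selects the memcpy arm.
    passthrough_row::<{ HOST_NATIVE_BE }>(&row, &mut out);
    assert_eq!(out, row);
}
```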

Verified:
  - cargo test --target aarch64-apple-darwin --lib: 2207 pass
  - cargo build --target x86_64-apple-darwin --tests: 0 warnings
  - cargo fmt --check: clean
  - cargo clippy --all-targets --all-features -D warnings: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio merged commit ea21141 into main, May 8, 2026
43 checks passed
uqio deleted the feat/be-tier9 branch, May 8, 2026 04:04
uqio added a commit that referenced this pull request May 8, 2026
Same as the NEON dedup in the rebase amend (`6d3c8b5`): PR #83's `c3a6478`
upstreamed `load_le_u16x4` / `load_be_u16x4` / `load_endian_u16x4` to the
SSE4.1 endian module. Tier 10 float's earlier addition of these helpers is
now redundant — caused E0428 duplicate-definition errors on x86 cross-builds
(test ubuntu/windows/macos, build ubuntu/macos/windows, sanitizer, sde-avx512
all failed in CI run 25536150449).

Verified:
  - cargo test --target aarch64-apple-darwin --lib: 2231 pass
  - cargo build --target x86_64-apple-darwin --tests: 0 errors, 0 warnings
  - cargo build --target wasm32-unknown-unknown --tests: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…IVE_BE routing

Codex review of PR #84 surfaced 3 high-severity correctness bugs in the BE
support added for the Tier 10 float planar GBR family. All 3 are direct
analogs of the PR #83 fixes (`dcf40a3` sinker routing + `c3a6478` dispatch
f16-widen routing) — the Tier 10 work missed both patterns.

Finding 1 — dispatch f16-widen double-byte-swap. The dispatcher's f16 → f32
widen-then-convert fallback paths called `g[offset + i].to_f32()` BEFORE any
endian-aware load, then routed the host-native f32 scratch into the f32
kernel with the source `BE`. On a BE-source-on-LE-host (or LE-source-on-BE-
host) this double-swaps: the f16 read interprets BE-encoded bytes as host-
native f16 (wrong), then the f32 kernel's `from_be` swaps the wrong f32
again. Fix: introduce a `widen_f16_be_to_host_f32::<BE>` helper that
normalizes f16 bits via `from_be` / `from_le` BEFORE `to_f32`, producing
host-native f32 for any source `BE` × host endian; route the downstream
`gbrpf32_to_*` chain via `HOST_NATIVE_BE = cfg!(target_endian = "big")`.
Applies to all 12 widen-fallback paths in `src/row/dispatch/planar_gbr_float.rs`
(4 main scalar fallbacks + 8 SIMD-arm scalar widens for NEON-no-fp16,
AVX-512-no-F16C, AVX2-no-F16C, SSE4.1-no-F16C across `gbrpf16_to_rgb_row`
and `gbrpf16_to_rgba_row`).
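The normalization at the core of the fix can be sketched on raw f16 bit patterns with the stdlib alone. `normalize_f16_bits` is a hypothetical reduction of `widen_f16_be_to_host_f32`, which additionally widens the normalized bits via `half::f16::from_bits(..).to_f32()`:

```rust
/// Bring stored f16 bits into host-native order BEFORE any float
/// interpretation, for either source endianness. The double-swap bug
/// arose from interpreting the bytes as an f16 before this step.
fn normalize_f16_bits<const BE: bool>(stored: u16) -> u16 {
    if BE {
        u16::from_be(stored) // BE-encoded source
    } else {
        u16::from_le(stored) // LE-encoded source
    }
}

fn main() {
    let host_native: u16 = 0x3C00; // f16 bit pattern of 1.0

    // Encode the value both ways, then normalize with the matching flag.
    let le_encoded = host_native.to_le();
    let be_encoded = host_native.to_be();

    // Each round-trips to host-native bits on ANY host × source-BE combo.
    assert_eq!(normalize_f16_bits::<false>(le_encoded), host_native);
    assert_eq!(normalize_f16_bits::<true>(be_encoded), host_native);
}
```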

Finding 2 — Gbrpf32 / Gbrapf32 sinker hardcodes `::<false>`. `Gbrpf32Frame`
exposes `&[f32]` rows in host-native layout; the sinker hardcoded `::<false>`
on every `gbrpf32_to_*` / `gbrapf32_to_*` call, so on BE hosts the kernel's
`from_le` loaders byte-swap an already-decoded host-native f32, corrupting
every output path. Fix: add module-scope `const HOST_NATIVE_BE: bool =
cfg!(target_endian = "big")` mirroring PR #83 `dcf40a3`'s `Rgbf32` sinker
pattern, replace all 22 `::<false>` call sites in
`src/sinker/mixed/planar_gbr_float.rs` with `::<HOST_NATIVE_BE>`. Strategy A+
α-plane consistency: `copy_alpha_plane_f32_to_u8` is endian-agnostic
(no `from_be` / `from_le` byte-load), so routing the RGB chain via
`HOST_NATIVE_BE` eliminates the prior LE-RGB + host-α mix-mode.

Finding 3 — Gbrpf16 / Gbrapf16 sinker hardcodes `::<false>`. Same pattern
as Finding 2 but for the f16 sinker, covering both the lossless f16 paths
(`gbrpf16_to_rgb_f16_row`, `gbrapf16_to_rgba_f16_row`, etc.) and the
widened f32 chain (`gbrpf32_to_*` after `widen_f16_to_f32`). After
`widen_f16_to_f32` consumes host-native `&[half::f16]` and emits host-native
f32, the downstream `gbrpf32_to_*` must use `HOST_NATIVE_BE`. Fix: add
`HOST_NATIVE_BE` const in `src/sinker/mixed/planar_gbr_f16.rs` mirroring
PR #83's `Rgbf16` sinker, replace all 22 `::<false>` call sites with
`::<HOST_NATIVE_BE>`. The Gbrapf16 standalone-RGBA + Strategy A+ paths
widen the f16 α plane to host-native f32 then call
`copy_alpha_plane_f32_to_u8` (endian-agnostic), so α consistency is
preserved by the RGB-chain fix.

Tests added (6 new, all on the Tier 10 sinker side):
- `gbrpf32_kernel_host_native_be_matches_false_on_le_host` — kernel-level
  routing equivalence at tail widths 5 / 7 / 33 across u8 RGB / u8 RGBA /
  u16 RGB / u16 RGBA / f32 lossless (LE-host gated; BE-host correctness
  is verified by row-kernel BE parity tests + the contract tests below).
- `gbrpf16_kernel_host_native_be_matches_false_on_le_host` — same coverage
  for the f16 dispatcher with both `use_simd = false` (scalar widen-
  fallback) and `use_simd = true` (SIMD widen path).
- `gbrpf32_sinker_host_native_contract_lossless_passthrough` — host-native
  f32 → `with_rgb_f32` bit-exact pass-through on every host (HDR, NaN, Inf).
- `gbrpf16_sinker_host_native_contract_lossless_passthrough` — host-native
  f16 → `with_rgb_f16` bit-exact pass-through on every host.
- `gbrapf16_sinker_host_native_contract_lossless_passthrough_with_alpha` —
  α plane round-trips bit-exact alongside RGB (Strategy A+ consistency).
- `gbrapf32_sinker_host_native_contract_lossless_passthrough_with_alpha` —
  same for f32 sources with α.

Audit summary across all 5 row backends for the f16 widening pattern:

| Backend         | Dispatch fallback | SIMD scalar tail (in row kernels) |
|-----------------|-------------------|------------------------------------|
| scalar          | n/a               | n/a                                |
| NEON            | FIXED             | unchanged (per "DO NOT" directive) |
| SSE4.1 + F16C   | FIXED             | unchanged (per "DO NOT" directive) |
| AVX2 + F16C     | FIXED             | unchanged (per "DO NOT" directive) |
| AVX-512 + F16C  | FIXED             | unchanged (per "DO NOT" directive) |
| wasm-simd128    | FIXED             | unchanged (per "DO NOT" directive) |

The SIMD scalar tails (`g[x + i].to_f32()` followed by
`scalar::gbrpf32_to_*::<BE>`) inside `arch::*::gbrpf16_*` row kernels share
the same bug pattern as the dispatch fallback. Per the explicit "DO NOT
touch row kernels themselves" directive in the codex review, those tails
are NOT modified in this commit; they are tracked as a separate follow-up.
The dispatch fallback fix here is sufficient for `use_simd = false` callers
and for backends where the SIMD widen kernel is not selected at runtime.
LE-host correctness for `use_simd = true` paths is preserved (since
`HOST_NATIVE_BE = false ≡ false`); BE-host correctness for the SIMD tails
will be addressed in a follow-up cleanup PR.

Verification:
- cargo test --target aarch64-apple-darwin --lib → 2237 passed
- cargo test --target x86_64-apple-darwin --lib → 2968 passed
- cargo build --target x86_64-apple-darwin --tests → clean
- RUSTFLAGS=-Ctarget-feature=+simd128 cargo build --target wasm32-unknown-unknown --tests → clean
- cargo build --no-default-features → clean
- cargo fmt --check → clean
- cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…dian.rs

Copilot review caught two stale comments in `src/row/arch/{neon,x86_sse41}/endian.rs`
that referenced "PR #83's be-tier9 branch" as the origin of the u16x4
loaders. The references made sense as transient dedup markers during
the rebase, but post-merge they're stale — the helpers are now just
normal members of each module. The previous comments above the u16x4
loaders already explain why both u16x8 and u16x4 variants exist
(tail-safe 8-byte loads for f16 widen kernels), so removing the dedup
note doesn't lose information.

Copilot's third finding (about `widen_f16_be_to_host_f32` doc claiming
shared use with dispatch when dispatch had its own copy) was already
resolved by 6e946a6 which dedup'd them — dispatch now imports the
scalar `pub(crate)` helper.

No code changes; comment-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…s-through fast path

Codex 2nd-pass review of PR #85 found two issues — one high-severity
correctness bug at the **sinker** layer (same class as PR #83 dcf40a3
and PR #84 8627280) plus a low-severity perf regression in the scalar
lossless pass-through.

**Finding 1 [high] — Grayf32 sinker hardcoded LE for host-native f32.**

The `MixedSinker<Grayf32>` `PixelSink::process` impl hardcoded `::<false>`
on every `grayf32_to_*_row` dispatcher call. With the new BE-aware
kernel semantics, `BE = false` means "decode LE-encoded input" via
`u32::from_le`. But `Grayf32Frame` exposes a host-native `&[f32]`
plane — the API contract is "caller passes host-native floats". On a
BE host, `::<false>` would byte-swap the already-decoded host-native
f32 inside the loaders, corrupting every Grayf32 output path
(including the lossless `with_luma_f32` and `with_rgb_f32`
pass-throughs **and** every downstream u8/u16/luma/HSV output).

Direct backend `::<true>` BE-parity tests (in `src/row/arch/*/tests/`)
don't catch this because they bypass the sinker entirely — they hand
the kernel BE-encoded bytes and assert against the LE-encoded
counterpart, exercising only the kernel decode boundary, not the
sinker-to-kernel routing.

The fix is the **sinker-layer** complement of the SIMD-backend-internal
HOST_NATIVE_BE introduced in PR #83's `c3a6478`, mirroring exactly the
fixes applied in PR #83 `dcf40a3` (`packed_rgb_float.rs`) and PR #84
`8627280` (`packed_rgb_f16.rs` / `planar_gbr_float.rs`):

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

Defined at module scope in `src/sinker/mixed/gray.rs`. Every hardcoded
`::<false>` in the Grayf32 `process` impl is replaced with
`::<HOST_NATIVE_BE>`. Same truth table as the previous fixes:

  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Out of scope (different layer):

  The Gray8 / GrayN / Gray16 / Ya8 / Ya16 sinkers in this same module
  take their `BE` from the source `Frame`'s flag (their Frames carry
  plane bytes whose encoding semantics are caller-specified — Phase 4
  territory) and must keep their existing `<BE>` routing. This commit
  only touches the Grayf32 sinker, whose Frame type unambiguously
  specifies host-native f32 element semantics.

Call sites changed in `src/sinker/mixed/gray.rs` (Grayf32 sinker only):

  • `grayf32_to_luma_f32_row`  (lossless luma pass-through)
  • `grayf32_to_rgb_f32_row`   (lossless RGB replicate)
  • `grayf32_to_luma_row`      (luma u8)
  • `grayf32_to_luma_u16_row`  (luma u16)
  • `grayf32_to_rgba_u16_row`  (standalone u16 RGBA fast path)
  • `grayf32_to_rgb_u16_row`   (u16 RGB)
  • `grayf32_to_rgba_row` ×2   (standalone u8 RGBA + HSV-with-RGBA)
  • `grayf32_to_hsv_row`       (standalone HSV fast path)
  • `grayf32_to_rgb_row`       (combined u8 RGB / HSV path)

10 dispatch-line replacements across 9 distinct dispatcher names.

**Finding 2 [low] — `grayf32_to_luma_f32_row` lossless pass-through perf.**

`grayf32_to_luma_f32_row` always looped and rebuilt each sample via
`to_bits` + `from_le`/`from_be`, even when the encoded byte order
matches the host-native (i.e. the no-swap case). Mirroring PR #83's
`b915754` `rgbf32_to_rgb_f32_row` fast path:

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");
  if BE == HOST_NATIVE_BE {
    luma_f32_out[..width].copy_from_slice(&plane[..width]);
    return;
  }

The const-generic dead branch is eliminated per monomorphization, so
the slow byte-swap path is only emitted for `BE != HOST_NATIVE_BE`
callers. Restores the pre-BE-aware "lossless pass-through" perf
characteristic for the dominant LE-host + LE-encoded routing.

Not applied to `grayf32_to_rgb_f32_row` in this commit because that
kernel performs a broadcast (Y → R=G=B; output is 3× input, so no
`copy_from_slice` is possible). Doc rewording on
`grayf32_to_luma_f32_row` consistent with PR #83's `b915754` rewording
(input encoded byte order vs host CPU's native order, instead of "bytes
swapped relative to the host's native little-endian layout").

**Tests:**

Added 2 new sinker-level regression tests in
`src/sinker/mixed/gray.rs` (mirroring the PR #83 `dcf40a3` test pair):

  • `grayf32_kernel_host_native_be_matches_false_on_le_host` — calls
    each `grayf32_to_*` dispatcher (luma_f32, rgb_f32, luma u8/u16, RGB
    u8/u16, RGBA u8/u16, HSV) with both `BE = false` and
    `BE = HOST_NATIVE_BE` (= `cfg!(target_endian = "big")`), asserting
    outputs are byte-equal on the active host. LE-host-only (gated on
    `target_endian = "little"`); on a BE host the equality `::<false>`
    ≡ `::<HOST_NATIVE_BE>` is _false_ by design — the `::<true>` arm
    decodes the host-native fixture as if it were BE-encoded (no swap),
    while `::<false>` would byte-swap. Width 33 covers SIMD main loop
    + scalar tail across every backend. Doubles as a smoke test for
    the new `luma_f32` `copy_from_slice` fast path.

  • `grayf32_sinker_host_native_contract_lossless_passthrough` — feeds
    `Grayf32Frame` through the public sinker API and asserts both
    `with_luma_f32` and `with_rgb_f32` round-trip host-native input
    bit-exact. Pairs with the kernel-level test to cover both the
    dispatch boundary and the public sinker boundary. Runs on every
    host (LE and BE).

Both tests miri-ignored because the SIMD-dispatched row kernels use
intrinsics unsupported by Miri (consistent with existing tests).

**Verified locally:**

  • cargo test --target aarch64-apple-darwin --lib → 2271 pass
    (was 2269; +2 new tests).
  • cargo test --target x86_64-apple-darwin --lib → 3020 pass.
  • cargo test --no-default-features --lib → 35 pass.
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings.
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → ok.
  • cargo build --no-default-features → ok.
  • cargo fmt --check → clean.
  • cargo clippy --all-targets --all-features -- -D warnings → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…ttle") + add f32 BE parity test

CI miri-sb (run 25544222638) failed on powerpc64 / s390x / i686 /
aarch64-apple-darwin / aarch64-unknown-linux-gnu sanitizer targets:
the LE-fixture scalar tests in src/row/scalar/gray.rs (e.g.
gray10_full_range_pass_through, gray10_limited_range_midpoint,
gray16_limited_range_white, gray16_to_luma_u16_identity, gray9_*,
gray12/14_limited_range_white, ...) construct fixtures as host-native
`Vec<u16>` literals and call the kernels with `::<false>` (LE-encoded
contract). On a big-endian host, host-native u16 bits do NOT lay out
little-endian, so the kernel's `u16::from_le` swap correctly
reinterprets the fixture as if it were an LE-encoded payload —
producing a different (corrupted) value than the test asserts.

Same class as PR #82's `8f2e329` (alpha_extract / planar_gbr_high_bit)
and PR #83's `56342c0` (Rgbf16 scalar widen / parity / copy tests).

Fix: gate the 24 affected tests on `#[cfg(target_endian = "little")]`
and add a section-header comment explaining the rationale (mirroring
PR #82 / #83). Tests intentionally NOT gated:
  - gray8_* tests (u8 source, no endian dependency)
  - byte-symmetric value tests (`0xFFFF`, `u16::MAX`):
    `gray16_to_rgb_u16_limited_range_over_white_clamps`,
    `gray_n_to_luma_u16_10bit_masks`
  - dedicated `*_be_parity_*` tests built via `swap_bytes()` —
    those lock down BE-host kernel correctness directly and pass on
    every host.
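The byte-symmetry exemption is easy to verify mechanically (stdlib sketch):

```rust
fn main() {
    // Byte-symmetric values decode identically under either byte order,
    // so tests built from them pass ungated on LE and BE hosts alike.
    assert_eq!(0xFFFF_u16.swap_bytes(), 0xFFFF);
    assert_eq!(u16::MAX.swap_bytes(), u16::MAX);

    // A non-symmetric fixture, by contrast, changes under a swap — such
    // values are the ones gated on `target_endian = "little"`.
    assert_ne!(0x03FF_u16.swap_bytes(), 0x03FF);
}
```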

Audit: full sweep of `src/row/scalar/gray.rs` `mod tests` block; no
shared helper functions to gate (every fixture is constructed inline).

Part 2 (closes Copilot review PR #85 finding 3): adds
`grayf32_to_luma_f32_row_be_le_parity_lossless` in
`src/row/scalar/grayf32.rs`. The existing `grayf32_be_parity_*` suite
covers integer-output paths but never exercises the lossless
`grayf32_to_luma_f32_row::<true>` (BE-encoded f32 → host-native f32)
fast/slow paths. The new test builds a BE-encoded mirror via the same
`f32::from_bits(v.to_bits().swap_bytes())` construction the suite
already uses for `f32_to_be_bytes`, runs both BE and LE kernels, and
asserts bitwise equality of their outputs (NaN-safe via `f32::to_bits`).
Path coverage by host:
  - LE host: LE = memcpy fast path; BE = slow swap path.
  - BE host: LE = slow swap path;   BE = memcpy fast path.
This exercises the `BE == HOST_NATIVE_BE` gate from both directions.

Copilot findings 1 + 2 REJECTED: those suggested replacing
`if BE == HOST_NATIVE_BE { copy_from_slice }` with `if !BE { ... }`.
That simplification is incorrect on BE hosts:
  - BE host, BE=false: `!BE = true` → would copy_from_slice the
    LE-encoded buffer as-is, propagating swapped bytes into output.
  - The current `BE == HOST_NATIVE_BE` gate handles all four
    host×data quadrants correctly.
Kernel code unchanged (codex 5th-pass approved the existing gate).
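The rejection can be checked across all four quadrants with a small simulation. This is a sketch: `host_be` stands in for `cfg!(target_endian = "big")` so both hosts can be exercised in one run, and `needs_swap` is a hypothetical helper:

```rust
/// Does this (host, data) combination require a byte swap?
fn needs_swap(host_be: bool, data_be: bool) -> bool {
    host_be != data_be
}

fn main() {
    for (host_be, data_be) in
        [(false, false), (false, true), (true, false), (true, true)]
    {
        // Correct gate: fast-path copy only when the encoded byte order
        // matches the host's native order.
        let correct_fast_path = data_be == host_be;
        assert_eq!(correct_fast_path, !needs_swap(host_be, data_be));

        // The suggested `!BE` gate copies whenever the data is LE,
        // regardless of host — wrong in the (BE host, LE data) quadrant.
        let suggested_fast_path = !data_be;
        if host_be && !data_be {
            assert_ne!(suggested_fast_path, !needs_swap(host_be, data_be));
        }
    }
}
```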

Verified:
  cargo test --target aarch64-apple-darwin --lib    → 2271 passed
    (previous 2270 + 1 new f32 BE parity test)
  cargo build --target x86_64-apple-darwin --tests  → 0 warnings
  RUSTFLAGS=-Ctarget-feature=+simd128 cargo build
    --target wasm32-unknown-unknown --tests          → ok
  cargo build --no-default-features                  → ok
  cargo fmt --check                                   → clean
  cargo clippy --all-targets --all-features
    -- -D warnings                                    → clean
  cargo check --target s390x-unknown-linux-gnu --lib → ok
    (BE-host smoke check; cross-toolchain link errors at --tests
    stage are expected and acceptable per project convention)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
Affected types: Rgbf32Frame, Rgbf16Frame, Gbrpf32Frame, Gbrapf32Frame,
Gbrpf16Frame, Gbrapf16Frame.

Each doc-comment now explicitly states the LE-encoded byte contract,
mirroring the established Grayf32Frame doc pattern: the &[f32] / &[f16]
plane is the LE-encoded byte layout reinterpreted as f32 / f16, matching
FFmpeg's canonical *LE pixel-format suffixes (AV_PIX_FMT_GBRPF32LE etc).
Adds the bytemuck::cast_slice + linesize-division-by-element-size
instruction so callers wiring up FFmpeg buffers don't have to guess.
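A stdlib-only sketch of the documented byte contract (the real wiring reinterprets the FFmpeg buffer in place with `bytemuck::cast_slice` instead of copying; `decode_le_f32_row` is hypothetical):

```rust
/// Decode one row of an LE-encoded f32 plane into host-native floats —
/// the byte layout the *LE pixel formats (e.g. AV_PIX_FMT_GBRPF32LE)
/// specify for the &[f32] planes.
fn decode_le_f32_row(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // Two LE-encoded f32 samples: 1.0 (0x3F800000) and -2.0 (0xC0000000).
    let plane: [u8; 8] = [0x00, 0x00, 0x80, 0x3F, 0x00, 0x00, 0x00, 0xC0];
    assert_eq!(decode_le_f32_row(&plane), vec![1.0_f32, -2.0]);

    // FFmpeg linesize is in bytes; the element count per row is
    // linesize / size_of::<f32>() once the row is viewed as &[f32].
    let linesize = 8usize;
    assert_eq!(linesize / std::mem::size_of::<f32>(), 2);
}
```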

Codex 3rd-pass review of PR #85 caught that PR #83 + PR #84 introduced
sinker routings that assumed the planes are already host-native f32 / f16
— a wrong reading of the *LE contract that would corrupt every output
path on a BE host. Pinning the doc-level contract here so future readers
don't repeat the mistake; sinker revert lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
Mirrors PR #85's Grayf32 sinker revert (52f8191) for the other six float
frame types affected by PR #83 (dcf40a3) and PR #84 (8627280):
Rgbf32, Rgbf16, Gbrpf32, Gbrapf32, Gbrpf16, Gbrapf16.

The earlier PRs introduced sinker-layer `HOST_NATIVE_BE` consts that
made the row-kernel `BE` parameter a no-op byte-swap on whichever host
the build targeted. That assumed the Frame's plane bytes are already
host-native f32/f16 — which contradicts the LE-encoded byte contract
clarified in the previous commit. On a BE host the routing skipped the
required `u32::from_le` / `u16::from_le` swap and would have corrupted
every output path.

Reverted call sites (all `::<HOST_NATIVE_BE>` -> `::<false>`):
  - src/sinker/mixed/packed_rgb_float.rs:    6 sites (Rgbf32)
  - src/sinker/mixed/packed_rgb_f16.rs:      7 sites (Rgbf16)
  - src/sinker/mixed/planar_gbr_float.rs:   22 sites (Gbrpf32 + Gbrapf32)
  - src/sinker/mixed/planar_gbr_f16.rs:     22 sites (Gbrpf16 + Gbrapf16)

Removes the four `const HOST_NATIVE_BE: bool = cfg!(target_endian = "big")`
definitions and their explanatory doc-comments. Rewrites the
`widen_f16_to_f32` doc in planar_gbr_f16.rs to accurately describe the
LE-encoded contract (helper is correct on LE host; cross-endian widen
should use the bit-normalising `widen_f16_be_to_host_f32` pattern).

Untouched: `BE == HOST_NATIVE_BE` `copy_from_slice` fast paths in scalar
row kernels (`rgbf32_to_rgb_f32_row`, `gbrpf32_to_rgb_f32_row`, etc.) —
generic in `BE`, correct under either contract. Per-arch SIMD kernels
and `widen_f16_be_to_host_f32` left alone — different layer.

Verification (aarch64-apple-darwin):

  cargo test --target aarch64-apple-darwin --lib   # 2246 ok

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…oded regressions

Removes the host-native-contract tests that PR #83 + PR #84 added — they
were typed against an incorrect host-native-f32/f16 reading of the Frame
contract and would mask the BE-host corruption introduced by the now-
reverted `::<HOST_NATIVE_BE>` sinker routings:

  - rgbf32_kernel_host_native_be_matches_false_on_le_host       (deleted)
  - rgbf32_sinker_host_native_contract_lossless_passthrough     (deleted)
  - rgbf16_kernel_host_native_be_matches_false_on_le_host       (deleted)
  - rgbf16_sinker_host_native_contract_lossless_passthrough     (deleted)
  - gbrpf32_kernel_host_native_be_matches_false_on_le_host      (deleted)
  - gbrpf32_sinker_host_native_contract_lossless_passthrough    (deleted)
  - gbrpf16_kernel_host_native_be_matches_false_on_le_host      (deleted)
  - gbrpf16_sinker_host_native_contract_lossless_passthrough    (deleted)
  - gbrapf32_sinker_host_native_contract_lossless_passthrough_with_alpha  (deleted)
  - gbrapf16_sinker_host_native_contract_lossless_passthrough_with_alpha  (deleted)

Adds LE-encoded byte contract regressions following PR #85's `52f8191`
pattern (`grayf32_sinker_le_encoded_frame_decodes_correctly`):

  - rgbf32_sinker_le_encoded_frame_decodes_correctly    (new)
  - rgbf16_sinker_le_encoded_frame_decodes_correctly    (new)
  - gbrpf32_sinker_le_encoded_frame_decodes_correctly   (new)
  - gbrapf32_sinker_le_encoded_frame_decodes_correctly  (new)
  - gbrpf16_sinker_le_encoded_frame_decodes_correctly   (new)
  - gbrapf16_sinker_le_encoded_frame_decodes_correctly  (new)

Each test constructs an explicitly LE-encoded plane via
`f32::from_bits(intended.to_bits().to_le())` (or the f16 analogue),
builds the Frame, runs the sinker's lossless pass-through, and asserts
the output equals the host-native intended values. Vacuous on LE hosts
(where `to_le` is identity) but on a BE host any regression that drops
the `::<false>` kernel routing would fail fast.

Also picks up two trivial pre-existing trailing-blank-line cleanups in
`src/row/arch/{neon,x86_sse41}/endian.rs` produced by `cargo fmt` while
running it across the modified files.

Verification (aarch64-apple-darwin, x86_64-apple-darwin, wasm32):

  cargo test --target aarch64-apple-darwin --lib                  # 2242 ok
  cargo build --target x86_64-apple-darwin --tests                # ok
  RUSTFLAGS="-C target-feature=+simd128" \
    cargo build --target wasm32-unknown-unknown --tests           # ok
  cargo build --no-default-features                               # ok
  cargo fmt --check                                               # ok
  cargo clippy --all-targets --all-features -- -D warnings        # ok

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
al8n pushed a commit that referenced this pull request May 8, 2026