feat(be-tier9): BE support for Rgbf32 / Rgbf16 row kernels#83

Merged
uqio merged 10 commits into main from feat/be-tier9
May 8, 2026

Conversation

@uqio
Collaborator

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 9 BE rollout. Stacked on #81 (BE infra). Adds <const BE: bool> to all Rgbf32 / Rgbf16 packed-RGB float row kernels across all 6 backends + dispatcher.

Implementation:

  • Scalar: to_bits().swap_bytes() for both f32 and f16 elements; LE f32 fast path uses copy_from_slice
  • NEON: load_f32x4::<BE> via load_endian_u32x4; widen_f16x4::<BE> via load_endian_u16x8; BE-only deinterleave path for vld3q_f32-inaccessible layouts
  • SSE4.1 / AVX2 / AVX-512: _mm*_castsi*_ps(load_endian_u32xN::<BE>(...)); F16C widening via _mm{,256,512}_cvtph_ps with endian-aware u16 loads
  • wasm-simd128: load_f32x4::<BE> via load_endian_u32x4; scalar f16 byte-swap before widening (no native f16 SIMD on wasm)
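As a sketch of the const-generic plumbing the bullets describe (hypothetical kernel name, not the crate's actual API; the decode shown here is the unconditional `swap_bytes` form from this summary):

```rust
// Hypothetical const-generic row kernel: `BE` is resolved at compile
// time, so each instantiation monomorphizes with no runtime branch.
fn rgb_row_kernel<const BE: bool>(src: &[u32], dst: &mut [u32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        // Unconditional swap for BE-encoded input, as in this summary.
        *d = if BE { s.swap_bytes() } else { *s };
    }
}

fn main() {
    let src = [0x0102_0304_u32];
    let mut le = [0u32; 1];
    let mut be = [0u32; 1];
    rgb_row_kernel::<false>(&src, &mut le); // existing callers pass `::<false>`
    rgb_row_kernel::<true>(&src, &mut be);
    assert_eq!(le[0], 0x0102_0304);
    assert_eq!(be[0], 0x0403_0201);
}
```

A later commit in this PR refines the scalar decode from `swap_bytes` to `from_be` / `from_le` for BE-host correctness.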

Test results: 2862 tests pass in total (33 new BE parity tests). `cargo clippy` and `cargo fmt` are clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --target wasm32-unknown-unknown --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

@uqio uqio force-pushed the feat/be-tier9 branch from c99a0a6 to d604016 on May 7, 2026 12:26
uqio added a commit that referenced this pull request May 7, 2026
`widen_f16x4_sse<BE=true>` was calling `load_endian_u16x8` (which reads 16
bytes via `_mm_loadu_si128`) but the kernel only guarantees 8 readable
bytes per call (4 × f16). The third widen call per loop iteration reads
[lane*2+16, lane*2+32) while the buffer ends at lane*2+24 when
`lane+12 == total_lanes` — an 8-byte tail-overread that ASan caught on
PR #83's CI sanitizer job.

Add `load_endian_u16x4` (8-byte load via `_mm_loadl_epi64` + low-half
byte-swap; upper half zeroed). The fix is correct because `_mm_cvtph_ps`
only reads the low 64 bits (4 × f16) of its `__m128i` operand, so the
zeroed upper half is harmless. AVX2 (`widen_f16x8_avx`) and AVX-512
(`widen_f16x16_avx512`) need a full 16/32-byte u16 region per call so
they keep using `load_endian_u16x8` / `load_endian_u16x16`.

Verified locally with `RUSTFLAGS=-Zsanitizer=address cargo +nightly test
--target x86_64-apple-darwin` (2862 tests pass).
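The overread arithmetic in the message above can be double-checked with plain integer math (constants illustrative, mirroring the message's bounds):

```rust
fn main() {
    // Loop base `lane` at the last full 12-lane chunk: lane + 12 == total_lanes.
    let lane = 0_usize;
    let total_lanes = lane + 12;
    let buf_end = total_lanes * 2; // row ends here, in bytes (2 bytes per f16)

    // The third widen call starts at lane + 8; a 16-byte `_mm_loadu_si128`
    // load covers [lane*2 + 16, lane*2 + 32).
    let load_start = (lane + 8) * 2;
    let load_end = load_start + 16;
    assert_eq!(load_end - buf_end, 8); // the 8-byte tail overread ASan caught

    // The fixed 8-byte `_mm_loadl_epi64` load ends exactly at the buffer end.
    assert_eq!(load_start + 8, buf_end);
}
```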
Base automatically changed from feat/be-infra to main May 7, 2026 12:37
@uqio uqio force-pushed the feat/be-tier9 branch 2 times, most recently from e83608b to bcc366b on May 7, 2026 13:10
@al8n al8n requested a review from Copilot May 7, 2026 23:38
uqio and others added 3 commits May 8, 2026 11:38
Add `<const BE: bool>` to all Rgbf32 and Rgbf16 row kernels across
scalar, NEON, SSE4.1, AVX2, AVX-512, and wasm-simd128 backends, plus
both dispatchers. BE parity tests added to every backend; existing
callers (sinkers, scalar tests, arch tests) updated to `<false>`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tier 9 packed-float-RGB scalar BE conversion used unconditional
`x.swap_bytes()`, which always swaps regardless of host endianness. On
big-endian hosts (powerpc64, s390x) the source bytes are already in
host-native order, so an extra swap corrupts every BE row. The SIMD
`load_endian_*::<BE>` helpers shipped with feat/be-infra are already
target-endian aware (no-op on a matching host), so the scalar and
per-arch tail paths produced wrong output relative to the SIMD body
on a hypothetical s390x runner.

Replaced every `bits.swap_bytes()` / `to_bits().swap_bytes()` site in
the source files with `u32::from_be` / `u32::from_le` (for f32) or
`u16::from_be` / `u16::from_le` (for f16):

- `if BE { x.swap_bytes() } else { x }`
  → `if BE { u32::from_be(x) } else { u32::from_le(x) }`
- `f32::from_bits(raw.to_bits().swap_bytes())` (BE-only)
  → `f32::from_bits(if BE { u32::from_be(raw.to_bits()) }
       else { u32::from_le(raw.to_bits()) })`

`from_be` / `from_le` is a no-op when the encoded byte order matches
the host, a byte-swap when they differ — exactly mirroring the SIMD
helper semantics so LE and BE hosts now produce bit-identical output.

Special note for the f32 / f16 pass-through kernels: previously the
`else` branch fell back to `copy_from_slice`, which is a byte-level
copy. On a BE host that copies LE-encoded bytes into f32 / f16 lanes
verbatim, leaving the destination in non-host-native order — the
docstring claims "output is always host-native". The fix routes both
branches through `from_bits(from_be/from_le(to_bits()))`, which is a
no-op on LE host (correct, byte order matches) and a swap on BE host
(correct, since the data is LE-encoded).
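A minimal sketch of that decode pattern (hypothetical free function, not the crate's API); the fixture construction below is host-endian independent, so the assertions hold on both LE and BE hosts:

```rust
// Decode one f32 lane from the requested byte order to host-native,
// mirroring the `from_be` / `from_le` pattern described above.
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}

fn main() {
    let x = 1.5_f32;
    // Synthesize BE- and LE-encoded lanes by reinterpreting the encoded
    // bytes in native order.
    let be_lane = f32::from_bits(u32::from_ne_bytes(x.to_bits().to_be_bytes()));
    let le_lane = f32::from_bits(u32::from_ne_bytes(x.to_bits().to_le_bytes()));
    // Both decode back to the host-native value on any host.
    assert_eq!(decode_f32::<true>(be_lane), x);
    assert_eq!(decode_f32::<false>(le_lane), x);
}
```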

Source-file call sites fixed:

- u32 (f32 → bits, target-endian decoded): 7 — scalar `load_f32`,
  scalar `rgbf32_to_rgb_f32_row`, neon / x86_sse41 / x86_avx2 /
  x86_avx512 / wasm_simd128 `rgbf32_to_rgb_f32_row` BE tails.
- u16 (f16 → bits, target-endian decoded): 12 — scalar `load_f16`,
  scalar `rgbf16_to_rgb_f32_row`, scalar `rgbf16_to_rgb_f16_row`,
  neon `widen_f16_tail`, x86_sse41 `load_f16_scalar`, x86_avx2 /
  x86_avx512 `rgbf16_to_rgb_f32_row` f16 widen tails, wasm_simd128
  five f16 widen lanes (`rgbf16_to_rgb_row`, `rgbf16_to_rgba_row`,
  `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
- f32 special case: covered by the u32 sites (scalar
  `rgbf32_to_rgb_f32_row` and the per-arch BE tails go through
  `f32::from_bits(u32::from_be/le(to_bits()))`).
- f16 special case: covered by the u16 sites (scalar
  `rgbf16_to_rgb_f16_row` and `rgbf16_to_rgb_f32_row` go through
  `half::f16::from_bits(u16::from_be/le(to_bits()))`).

Test helpers (`be_rgbf32` / `be_rgbf16` in `tests/packed_rgb_float.rs`
across all arch backends) intentionally still use `swap_bytes()`
because they synthesize a BE-encoded buffer from an LE host input —
the unconditional swap is correct there and, per instructions, remains
unchanged. The NEON `widen_f16_tail` helper additionally became
`<const BE: bool>` (was previously calling `to_f32()` directly on
host-native bits, producing garbage when fed BE-encoded f16 — the
test `neon_rgbf16_to_rgb_f32_be_matches_le` failed at widths where
the 4-lane SIMD body left a non-zero tail).

Verified: 2170 lib tests pass on `aarch64-apple-darwin`;
`cargo build --target x86_64-apple-darwin --tests` clean;
`RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests` clean (warnings pre-existing);
`cargo build --no-default-features` clean; `cargo fmt --check` clean;
`cargo clippy --all-targets --all-features -- -D warnings` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR is Phase 2 of the Tier 9 Big-Endian rollout, adding a <const BE: bool> endianness parameter to packed-float RGB row kernels (Rgbf32 / Rgbf16) and updating callers/tests accordingly so kernels can decode BE-encoded input streams.

Changes:

  • Updated Tier 9 packed float scalar kernels to decode f32/f16 element bit-patterns based on <const BE: bool>.
  • Updated MixedSinker Tier 9 sinker implementations to call the new const-generic row dispatchers.
  • Updated scalar tests to call the new const-generic kernel APIs.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 11 comments.

Files reviewed:

  • src/sinker/mixed/packed_rgb_float.rs: Updates Rgbf32 sinker to call const-generic row dispatchers with BE specified.
  • src/sinker/mixed/packed_rgb_f16.rs: Updates Rgbf16 sinker to call const-generic row dispatchers with BE specified.
  • src/row/scalar/tests.rs: Updates scalar Tier 9 tests to call const-generic kernels.
  • src/row/scalar/packed_rgb_float.rs: Adds endian-aware scalar loads and propagates <const BE: bool> across Tier 9 scalar kernels.


Comment on lines 209 to 213
if let Some(buf) = rgb_f32.as_deref_mut() {
    let f32_start = one_plane_start * 3;
    let f32_end = one_plane_end * 3;
-   rgbf32_to_rgb_f32_row(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
+   rgbf32_to_rgb_f32_row::<false>(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
}
Comment on lines 234 to 245
if let Some(buf) = rgb_f16.as_deref_mut() {
    let f16_start = one_plane_start * 3;
    let f16_end = one_plane_end * 3;
-   rgbf16_to_rgb_f16_row(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
+   rgbf16_to_rgb_f16_row::<false>(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
}

// Lossless f32 widen — also independent of integer conversion paths.
if let Some(buf) = rgb_f32.as_deref_mut() {
    let f32_start = one_plane_start * 3;
    let f32_end = one_plane_end * 3;
-   rgbf16_to_rgb_f32_row(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
+   rgbf16_to_rgb_f32_row::<false>(rgb_in, &mut buf[f32_start..f32_end], w, use_simd);
}
Comment thread src/row/scalar/packed_rgb_float.rs Outdated
Comment on lines +103 to +104
/// When `BE = true` the input `f32` values are encoded big-endian
/// (bytes swapped relative to the host's native little-endian layout).
Comment on lines +213 to 235
pub(crate) fn rgbf32_to_rgb_f32_row<const BE: bool>(
    rgb_in: &[f32],
    rgb_out: &mut [f32],
    width: usize,
) {
    debug_assert!(rgb_in.len() >= width * 3, "rgbf32 row too short");
    debug_assert!(rgb_out.len() >= width * 3, "rgb_f32_out row too short");
-   rgb_out[..width * 3].copy_from_slice(&rgb_in[..width * 3]);
+   // Decode each source f32 from `BE` byte order to host-native.
+   // `u32::from_be` / `u32::from_le` is target-endian aware: a no-op
+   // when encoded byte order matches the host, a byte-swap when they
+   // differ. Output is always host-native f32 on every target.
+   for (dst, src) in rgb_out[..width * 3]
+       .iter_mut()
+       .zip(rgb_in[..width * 3].iter())
+   {
+       let bits = src.to_bits();
+       *dst = f32::from_bits(if BE {
+           u32::from_be(bits)
+       } else {
+           u32::from_le(bits)
+       });
+   }
}
Comment thread src/row/scalar/packed_rgb_float.rs Outdated
Comment on lines +403 to +417
// Decode each source f16 from `BE` byte order to host-native, mirror
// of `rgbf32_to_rgb_f32_row`. `u16::from_be` / `u16::from_le` is
// target-endian aware: no-op when encoded byte order matches the
// host, swap when they differ. Output is always host-native f16.
for (dst, src) in rgb_out[..width * 3]
    .iter_mut()
    .zip(rgb_in[..width * 3].iter())
{
    let bits = src.to_bits();
    *dst = half::f16::from_bits(if BE {
        u16::from_be(bits)
    } else {
        u16::from_le(bits)
    });
}
Comment on lines 233 to 238
// Lossless f16 pass-through — emit first (independent of all other paths).
if let Some(buf) = rgb_f16.as_deref_mut() {
    let f16_start = one_plane_start * 3;
    let f16_end = one_plane_end * 3;
-   rgbf16_to_rgb_f16_row(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
+   rgbf16_to_rgb_f16_row::<false>(rgb_in, &mut buf[f16_start..f16_end], w, use_simd);
}
Comment on lines +396 to 418
pub(crate) fn rgbf16_to_rgb_f16_row<const BE: bool>(
    rgb_in: &[half::f16],
    rgb_out: &mut [half::f16],
    width: usize,
) {
    debug_assert!(rgb_in.len() >= width * 3, "rgbf16 row too short");
    debug_assert!(rgb_out.len() >= width * 3, "rgb_f16_out row too short");
-   rgb_out[..width * 3].copy_from_slice(&rgb_in[..width * 3]);
+   // Decode each source f16 from `BE` byte order to host-native, mirror
+   // of `rgbf32_to_rgb_f32_row`. `u16::from_be` / `u16::from_le` is
+   // target-endian aware: no-op when encoded byte order matches the
+   // host, swap when they differ. Output is always host-native f16.
+   for (dst, src) in rgb_out[..width * 3]
+       .iter_mut()
+       .zip(rgb_in[..width * 3].iter())
+   {
+       let bits = src.to_bits();
+       *dst = half::f16::from_bits(if BE {
+           u16::from_be(bits)
+       } else {
+           u16::from_le(bits)
+       });
+   }
}
uqio and others added 7 commits May 8, 2026 12:01
Codex review of PR #83 found two NEON-specific kernel correctness bugs in
the Tier 9 Rgbf16 BE path. Both are fixed here together.

Finding 1 — NEON f16 widening over-reads past row.

`widen_f16x4::<BE=true>` was calling `load_endian_u16x8` (16-byte load via
`vld1q_u16`) but the kernel only guarantees 8 readable bytes per call (4 ×
f16). The third widen call per 12-lane chunk reads bytes
[(lane+8)*2 .. (lane+16)*2) while the row ends at total_lanes*2 when
lane+12 == total_lanes — an 8-byte tail-overread that ASan/Miri catch on
guarded pages. Mirrors the SSE4.1 fix in 5967967.

The fix adds `load_endian_u16x4` to `src/row/arch/neon/endian.rs` (8-byte
load via `vld1_u16` + `vrev16_u8` byte-swap when needed) and uses it in
both the `BE=true` and `BE=false` arms of `widen_f16x4`. The downstream
`vcvt_f32_f16` already takes `uint16x4_t` so no further plumbing is
needed.

Finding 2 — f16→f32 widen-then-convert paths treat host-native f32 as
LE-encoded.

After widening f16 → f32 via `vcvt_f32_f16` (NEON), `_mm_cvtph_ps` /
`_mm256_cvtph_ps` / `_mm512_cvtph_ps` (x86), or scalar `to_f32()` (wasm),
the stack buffer carries host-native f32 values. The kernels then called
`rgbf32_to_*::<false>` to convert the buffer. With BE-aware kernel
semantics (BE=false means LE-encoded input, NOT host-native), the f32
loaders inside `rgbf32_to_*` would byte-swap the already-decoded host-
native buffer on a BE host — corrupting it.

The fix introduces a per-backend
  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");
and routes the post-widen conversion as `rgbf32_to_*::<HOST_NATIVE_BE>`.
The resulting decode is a no-op on both LE and BE hosts:
  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Applied to all 5 backends (NEON / SSE4.1 / AVX2 / AVX-512 / wasm-simd128)
across 20 call sites total. Each `rgbf32_to_rgb_row::<false>` /
`rgbf32_to_rgba_row::<false>` / `rgbf32_to_rgb_u16_row::<false>` /
`rgbf32_to_rgba_u16_row::<false>` after a SIMD widen now uses
`HOST_NATIVE_BE`. The lossless f16→f32 paths (`rgbf16_to_rgb_f32_row`)
write directly to the f32 output without a downstream convert and are
already correct.
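The routing pattern can be sketched in a few lines (illustrative names, not the crate's code): after a SIMD f16→f32 widen the stack buffer holds host-native f32, so the downstream f32 kernel must be told "host-native", never "LE-encoded".

```rust
// Per-backend constant from the fix described above.
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

// Stand-in for the `rgbf32_to_*` decode boundary (hypothetical helper).
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    f32::from_bits(if BE { u32::from_be(bits) } else { u32::from_le(bits) })
}

fn main() {
    let widened = 2.25_f32; // e.g. the result of `vcvt_f32_f16` / `_mm_cvtph_ps`
    // Routing through HOST_NATIVE_BE is an identity on any host…
    assert_eq!(decode_f32::<HOST_NATIVE_BE>(widened), widened);
    // …whereas `::<false>` would byte-swap the value on a BE host.
}
```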

Audit notes (other backends):

  • SSE4.1: only Finding 2 applied here; Finding 1 was fixed in 5967967.
  • AVX2: `widen_f16x8_avx` correctly loads 16 bytes (8 × f16 = 16) so no
    Finding 1; Finding 2 fix applied.
  • AVX-512: `widen_f16x16_avx512` correctly loads 32 bytes (16 × f16 =
    32) so no Finding 1; Finding 2 fix applied.
  • wasm-simd128: scalar widen, no SIMD u16 load to over-read; Finding 2
    fix applied (path was explicitly documented "call LE downstream",
    making the wasm32-LE-only assumption visible — fix is endian-agnostic
    so it survives any future BE wasm target).

Tests:

Width set for all NEON Rgbf16 BE-parity tests extended to include `5`
(was `[1, 4, 7, 16, 33, 1920, 1921]`) to cover the
"lane+12 == total_lanes via 1-pixel scalar tail" boundary. Added a
dedicated `neon_rgbf16_be_tail_overread_widths_4_5_16_33` regression test
that calls each kernel at exactly the over-read-prone widths through
exact-sized allocations. Verified locally:

  • cargo test --target aarch64-apple-darwin --lib → 2200 pass
  • cargo test --target x86_64-apple-darwin --lib → 2915 pass
  • cargo test --no-default-features --lib → 35 pass
  • RUSTFLAGS=-Zsanitizer=address cargo +nightly test
    --target aarch64-apple-darwin --lib row::arch::neon::tests::packed_rgb_float
    -Zbuild-std → 23 pass
  • RUSTFLAGS=-Zsanitizer=address cargo +nightly test
    --target x86_64-apple-darwin --lib rgbf16 -Zbuild-std → 60 pass
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings (new)
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → 0 warnings (new)
  • cargo build --no-default-features → ok
  • cargo fmt --check → clean
  • cargo clippy --all-targets --all-features (both targets) -D warnings
    → clean

Out of scope:

  • Finding 3 (sinker hardcodes ::<false>) — Phase 4 deferred per PR
    scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex 2nd-pass review of PR #83 found a third high-severity bug at the
**sinker** layer (one level above the SIMD-internal HOST_NATIVE_BE fix
that landed in c3a6478): the Rgbf32 and Rgbf16 `MixedSinker`
`PixelSink::process` impls hardcoded `::<false>` on every row dispatcher
call. With the new BE-aware kernel semantics, `BE = false` means
"decode LE-encoded input" (`u32::from_le` / `u16::from_le` / SIMD
`load_endian_*` LE arms). But `Rgbf32Frame` exposes a host-native
`&[f32]` row and `Rgbf16Frame` exposes a host-native `&[half::f16]` row
— the public API contract is "caller passes host-native floats". On a
BE host, `::<false>` would byte-swap the already-decoded host-native
values inside the loaders, corrupting the lossless `with_rgb_f32` /
`with_rgb_f16` pass-throughs **and** every downstream u8/u16/luma/HSV
output that flows through the same row.

Direct backend `::<true>` BE-parity tests (the body of c3a6478 and
prior commits) don't catch this because they bypass the sinker entirely
— they hand the kernel BE-encoded bytes and assert against the LE-
encoded counterpart, exercising only the kernel decode boundary, not
the sinker-to-kernel routing.

The fix is the **sinker-layer** complement of the SIMD-backend-internal
HOST_NATIVE_BE introduced in c3a6478:

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

Defined at module scope in both
`src/sinker/mixed/packed_rgb_float.rs` and
`src/sinker/mixed/packed_rgb_f16.rs`. Every hardcoded `::<false>` in
each sinker's `process` is replaced with `::<HOST_NATIVE_BE>`. Same
truth table as the SIMD fix, different layer:

  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Distinction from Phase 4 (out of scope here):

  This is **host-native correctness** — the contract that `Rgbf32Frame`
  / `Rgbf16Frame` exposes already-decoded floats. It is **NOT** the
  Phase 4 BE-source-frame work, which would let the Frame type itself
  carry an encoding tag (LE-encoded bytes vs BE-encoded bytes vs host-
  native) and thread that through the walker / row / sinker stack. The
  Yuva / Gbrap / packed-RGB-u16 sinkers that are still on `::<false>`
  fall under Phase 4 because their Frames hold `&[u16]` plane buffers
  whose interpretation depends on whether the caller passed an LE-
  encoded or BE-encoded byte stream — that's a Frame-API design
  question, not a host-native routing bug. This commit touches only
  the float sinkers whose Frame types unambiguously specify host-
  native element semantics.

Call sites changed:

  • src/sinker/mixed/packed_rgb_float.rs — 6 sites (rgbf32_to_rgb_row,
    rgbf32_to_rgba_row ×2, rgbf32_to_rgb_u16_row, rgbf32_to_rgba_u16_row,
    rgbf32_to_rgb_f32_row).
  • src/sinker/mixed/packed_rgb_f16.rs — 7 sites (rgbf16_to_rgb_row,
    rgbf16_to_rgba_row ×2, rgbf16_to_rgb_u16_row, rgbf16_to_rgba_u16_row,
    rgbf16_to_rgb_f16_row, rgbf16_to_rgb_f32_row).

Out of scope (Phase 4 territory, not touched here):

  • Frame types
  • Walker types
  • Other sinkers (Yuva, Gbrap, mono1bit u16, etc.) — their Frames
    carry plane bytes whose encoding semantics need explicit Phase 4
    plumbing.
  • Row dispatchers themselves (already BE-aware via const generic).

Tests:

Added 4 sinker-level regression tests (one kernel-equivalence test +
one public-API contract test for each of Rgbf32 / Rgbf16):

  • `rgbf32_kernel_host_native_be_matches_false_on_le_host` and
    `rgbf16_kernel_host_native_be_matches_false_on_le_host` — call
    each `rgbf32_to_*` / `rgbf16_to_*` dispatcher with both
    `BE = false` and `BE = HOST_NATIVE_BE` (= `cfg!(target_endian = "big")`),
    asserting outputs are byte-equal on the active host. On a LE host
    both are no-op so this documents the routing equivalence; on a BE
    host the same equivalence holds for the **fixed** sinker but would
    fail for the broken one. Width 33 covers SIMD main loop + scalar
    tail across every backend.
  • `rgbf32_sinker_host_native_contract_lossless_passthrough` and
    `rgbf16_sinker_host_native_contract_lossless_passthrough` — feed
    `Rgbf32Frame` / `Rgbf16Frame` through the public sinker API and
    assert `with_rgb_f32` / `with_rgb_f16` round-trips host-native
    input bit-exact. Pairs with the kernel-level test to cover both
    the dispatch boundary and the public sinker boundary.

Comment in each docstring notes that full BE-host coverage requires
QEMU s390x (Phase 3) — these tests document the contract on LE and
would catch the bug on BE.

Verified locally:

  • cargo test --target aarch64-apple-darwin --lib → 2204 pass
    (was 2200; +4 new tests).
  • cargo test --target x86_64-apple-darwin --lib → 2919 pass
    (was 2915; +4 new tests).
  • cargo test --no-default-features --lib → 35 pass.
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings.
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → only 3 pre-existing
    `unused imports` warnings in unrelated `wasm_simd128/tests/*`
    files, not introduced by this change (confirmed by stashing the
    diff and rebuilding on c3a6478).
  • cargo build --no-default-features → ok.
  • cargo fmt --check → clean.
  • cargo clippy --all-targets --all-features
    --target {aarch64,x86_64}-apple-darwin -- -D warnings → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tcher-equivalence tests on LE host

Codex 3rd-pass review of PR #83 surfaced three endian-correctness findings
around `rgbf32_to_rgb_f32_row::<BE>` and its dispatcher-equivalence tests:

Finding 1 (high) — NEON `BE = false` branch used a raw `vld1q_f32` /
`vst1q_f32` copy. That preserves on-disk byte order: it's correct when the
input encoding matches host-native (the historical assumption: LE-encoded
input on a LE host) but corrupts the lossless f32 output on a BE host where
the request `BE = false` should mean "decode LE-encoded input to host-native"
— the kernel must byte-swap, not pass through. Audit revealed identical
defects in the SSE4.1, AVX2, AVX-512, and wasm-simd128 backends.

Findings 2 + 3 (medium) — the Rgbf32 / Rgbf16 sinker dispatcher-equivalence
tests asserted `::<false>` ≡ `::<HOST_NATIVE_BE>` while feeding host-native
fixtures. On LE hosts both calls are byte-for-byte identical (the test's
intent), but on BE hosts `::<false>` decodes the host-native fixture as if it
were LE-encoded (byte-swap) while `::<HOST_NATIVE_BE> == ::<true>` decodes as
BE (no swap), so the outputs diverge by design — the equivalence claim is
specifically about the LE host-routing pattern.

Fix:

* Replace the raw-copy fast path in every `rgbf32_to_rgb_f32_row::<BE>`
  backend (NEON, SSE4.1, AVX2, AVX-512, wasm-simd128) with a host-endian
  gate: `if BE == HOST_NATIVE_BE { raw copy } else { endian-aware load }`.
  When the requested encoding matches host-native the bytes can be copied
  verbatim (perf-equivalent to the old fast path on the only shipping
  target — LE); otherwise the kernel falls through to the existing
  `load_f32x{4,8,16}::<BE>` slow path which byte-swaps via the `_endian_*`
  loaders. Tail loop now uses the endian-aware
  `if BE { from_be } else { from_le }` decode (matches scalar reference at
  `src/row/scalar/packed_rgb_float.rs:213`).
* Add five new BE-target regression tests
  (`{neon,sse41,avx2,avx512,wasm}_rgbf32_to_rgb_f32_row_le_input_decodes_correctly_on_any_host`).
  Each constructs an LE-encoded f32 byte fixture (host-native bits passed
  through `f32::from_bits(u32::from_le(_))`) and feeds it through `::<false>`,
  asserting the output matches the original host-native expected values. On
  LE hosts this is a vacuous identity check; on BE hosts (full QEMU s390x
  coverage is Phase 3) it would have caught the original bug.
* Gate the two sinker dispatcher-equivalence tests
  (`rgbf32_kernel_host_native_be_matches_false_on_le_host` and
  `rgbf16_kernel_host_native_be_matches_false_on_le_host`) on
  `#[cfg(target_endian = "little")]`. BE-host correctness of the routing
  change is verified instead by `*_sinker_host_native_contract_lossless_passthrough`
  (sinker public-API contract) and the row-kernel BE parity tests.
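The host-endian gate from the first fix bullet can be sketched as follows (hypothetical function name and plain `swap_bytes` slow path standing in for the SIMD `load_f32x{4,8,16}::<BE>` loaders):

```rust
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

// Sketch of the gated pass-through: verbatim copy only when the requested
// encoding matches host-native; otherwise decode lane-by-lane.
fn rgbf32_passthrough<const BE: bool>(src: &[f32], dst: &mut [f32]) {
    if BE == HOST_NATIVE_BE {
        // Requested encoding matches host-native: bytes copy verbatim.
        dst.copy_from_slice(src);
    } else {
        // Mismatch: each lane needs a byte swap to become host-native.
        for (d, s) in dst.iter_mut().zip(src) {
            *d = f32::from_bits(s.to_bits().swap_bytes());
        }
    }
}

fn main() {
    let src = [1.0_f32, -2.5];
    let mut out = [0.0_f32; 2];
    // A host-native request always takes the verbatim-copy path, on any host.
    if HOST_NATIVE_BE {
        rgbf32_passthrough::<true>(&src, &mut out);
    } else {
        rgbf32_passthrough::<false>(&src, &mut out);
    }
    assert_eq!(out, src);
}
```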

Audit results — every backend with a `rgbf32_to_rgb_f32_row::<BE>` kernel
that used a raw passthrough on `BE = false` had the same defect:

  src/row/arch/neon/packed_rgb_float.rs:414      — fixed
  src/row/arch/x86_sse41/packed_rgb_float.rs:322 — fixed
  src/row/arch/x86_avx2/packed_rgb_float.rs:332  — fixed
  src/row/arch/x86_avx512/packed_rgb_float.rs:303 — fixed
  src/row/arch/wasm_simd128/packed_rgb_float.rs:287 — fixed

Five call sites changed (all five vector backends), five new regression
tests, two existing sinker dispatcher-equivalence tests cfg-gated on
`target_endian = "little"`. All `cargo test` / `cargo build` /
`cargo fmt --check` / `cargo clippy --all-targets --all-features
-- -D warnings` checks pass on aarch64-apple-darwin and x86_64-apple-darwin
(LE hosts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up)

Codex 4th-pass review of PR #83 found that the previous `f1161d7` audit, which
fixed the `BE = false` raw-load fast path in `rgbf32_to_rgb_f32_row` per
backend, missed three other kernels with the same broken pattern: NEON's
`rgbf32_to_rgba_row` / `rgbf32_to_rgba_u16_row` (`vld3q_f32` deinterleave
hardcoded for the LE-host quadrant) and the AVX2 / AVX-512 `widen_f16x*`
helpers used by every Rgbf16 SIMD entry point (raw `_mm_loadu_si128` /
`_mm256_loadu_si256` for `BE = false`). On a big-endian AArch64 or x86 host
with LE-encoded input, each of these reads host-native (BE) bytes from an
LE buffer and mis-decodes the f32 / f16 lanes downstream — same defect class
as the original f32 pass-through bug, just in different kernels.

Fix:

* NEON `rgbf32_to_rgba_row` / `rgbf32_to_rgba_u16_row`: replace the hardcoded
  `if BE { endian-aware } else { vld3q_f32 }` deinterleave gate with the
  `BE == HOST_NATIVE_BE` host-endian gate (same pattern f1161d7 established
  for `rgbf32_to_rgb_f32_row`). Fast path uses `vld3q_f32` when on-disk
  encoding matches host-native; otherwise falls through to the existing
  endian-aware `load_f32x4::<BE>` slow path with manual deinterleave. Two
  call sites changed.
* AVX2 `widen_f16x8_avx`: drop the `if BE { load_endian_u16x8::<BE> } else
  { _mm_loadu_si128 }` conditional in favor of unconditionally routing
  through `load_endian_u16x8::<BE>`. The endian-aware loader monomorphizes
  to a no-op `_mm_loadu_si128` when `BE` matches host-native and to a
  byte-swap shuffle otherwise — correct on both LE and BE hosts. One call
  site changed; transitively fixes 5 entry points (`rgbf16_to_rgb_row`,
  `rgbf16_to_rgba_row`, `rgbf16_to_rgb_u16_row`, `rgbf16_to_rgba_u16_row`,
  `rgbf16_to_rgb_f32_row`).
* AVX-512 `widen_f16x16_avx512`: same fix using `load_endian_u16x16::<BE>`.
  One call site changed; transitively fixes the same 5 entry points.
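A portable scalar analogue of the `load_endian_*` idea the AVX2 / AVX-512 bullets rely on (illustrative only; the real helpers are SIMD intrinsics over 8 / 16 lanes):

```rust
// Decode one u16 lane from the requested on-disk byte order. The `BE`
// branch is resolved at monomorphization time, mirroring how the SIMD
// loader compiles to a plain load when `BE` matches host-native and to
// a byte-swap shuffle otherwise.
fn load_endian_u16<const BE: bool>(bytes: [u8; 2]) -> u16 {
    if BE {
        u16::from_be_bytes(bytes)
    } else {
        u16::from_le_bytes(bytes)
    }
}

fn main() {
    assert_eq!(load_endian_u16::<true>([0x12, 0x34]), 0x1234);
    assert_eq!(load_endian_u16::<false>([0x12, 0x34]), 0x3412);
}
```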

Audit results — only kernels with a `<const BE: bool>` parameter that gated a
raw-load fast path on the bare `BE` flag had this defect:

  src/row/arch/neon/packed_rgb_float.rs:162   rgbf32_to_rgba_row     — fixed
  src/row/arch/neon/packed_rgb_float.rs:341   rgbf32_to_rgba_u16_row — fixed
  src/row/arch/x86_avx2/packed_rgb_float.rs:393  widen_f16x8_avx     — fixed
  src/row/arch/x86_avx512/packed_rgb_float.rs:367 widen_f16x16_avx512 — fixed

Backends checked clean: SSE4.1 (RGBA paths already use `load_f32x4::<BE>`,
f16 widen already uses `load_endian_u16x4::<BE>`); WASM SIMD128 (RGBA paths
use `load_f32x4::<BE>`, f16 widen is scalar with explicit endian decode);
NEON `widen_f16x4` (already uses `load_endian_u16x4::<BE>`).

Regression tests — twelve new LE-decode tests using the established
`*_le_input_decodes_correctly_on_any_host` pattern:

  neon_rgbf32_to_rgba_row_le_input_decodes_correctly_on_any_host
  neon_rgbf32_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgba_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx2_rgbf16_to_rgb_f32_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgba_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_u16_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgba_u16_row_le_input_decodes_correctly_on_any_host
  avx512_rgbf16_to_rgb_f32_row_le_input_decodes_correctly_on_any_host

Each constructs an LE-encoded fixture (host-native bits passed through
`from_bits(u_n::from_le(_))`) and feeds it through `::<false>`, asserting
the SIMD output matches the scalar reference (which is endian-correct via
`load_f32::<BE>` / `load_f16::<BE>`). On LE hosts these are vacuous identity
checks; on BE hosts (full QEMU s390x coverage is Phase 3) each would have
caught the original bug.
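The fixture-and-decode round trip behind these tests can be sketched in a few stdlib-only lines. Here `decode_le_f32` is a hypothetical stand-in for a row kernel instantiated with `::<false>`; the real tests assert SIMD-vs-scalar parity rather than raw values:

```rust
/// Hypothetical stand-in for a `BE = false` row kernel: per lane,
/// reinterpret the stored bits as little-endian, then rebuild a
/// host-native f32.
fn decode_le_f32(encoded: &[f32]) -> Vec<f32> {
    encoded
        .iter()
        .map(|v| f32::from_bits(u32::from_le(v.to_bits())))
        .collect()
}

fn main() {
    let intended = [1.0_f32, -2.5, 0.125];
    // Build an LE-encoded fixture: pass host-native bits through
    // `u32::from_le`, so the in-memory bytes are little-endian on any host.
    let fixture: Vec<f32> = intended
        .iter()
        .map(|v| f32::from_bits(u32::from_le(v.to_bits())))
        .collect();
    // Decoding the LE fixture through the LE path recovers the intended
    // values on every host; on an LE host both `from_le` calls are the
    // identity, which is exactly why the check is vacuous there.
    assert_eq!(decode_le_f32(&fixture), intended);
}
```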

Aarch64 LE-host test count: 2205 → 2207 (+2 NEON RGBA tests). x86_64 LE-host
test count: 2922 → 2932 (+10 = 5 AVX2 + 5 AVX-512). All `cargo test` /
`cargo build` (LE x86 + aarch64 + WASM) / `cargo build --no-default-features`
/ `cargo fmt --check` / `cargo clippy --all-targets --all-features
-- -D warnings` checks pass on aarch64-apple-darwin and x86_64-apple-darwin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…little")

Codex 5th-pass review of PR #83 found a medium finding in
src/row/scalar/tests.rs: the six Rgbf16 scalar row kernel parity /
widen / copy tests at lines 658-743 use host-native fixtures
(half::f16::from_f32 / Vec<f32>) and call the kernels with `::<false>`
(LE-encoded path). On a big-endian host (e.g. miri-sb-s390x), the
kernel's `u16::from_le` / `u32::from_le` byte-swap correctly
reinterprets the host-native fixture as if it were LE-encoded —
producing different bits than the test asserts. Same class as the
PR #82 alpha_extract / planar_gbr_high_bit fix in 8f2e329.

Fix: add `#[cfg(target_endian = "little")]` alongside `#[test]` on the
six tests, plus on the shared `rgbf16_test_inputs` helper to avoid an
unused-fn warning on BE hosts. Add a section-header comment block
explaining why these tests fail on BE, why the kernel itself remains
correct (locked down by the dedicated BE-parity tests in the
per-backend `tests/packed_rgb_float.rs` modules that build LE-encoded
fixtures via `f32::from_bits(u32::from_le(_))` /
`half::f16::from_bits(u16::from_le(_))`), and why byte-symmetric value
tests are intentionally NOT gated.

Audit:
  - The only `tests.rs` file under `src/row/scalar/` is the file in
    this commit; no other dedicated test files in that directory.
  - Inline `mod tests` blocks in other scalar source files were
    audited in PR #82 (`8f2e329`); no new occurrences of the
    host-native + `<false>` pattern landed in PR #83.
  - YUV planar high-bit-depth tests (e.g. yuv420p10_*) use host-native
    u16 too, but their kernels read u16 directly without `from_le`
    — the byte-format-agnostic contract is documented in
    `src/row/scalar/yuv_planar_high_bit.rs`. Those tests are correctly
    NOT gated.

Tests gated: 6 (rgbf16_scalar_{rgb,rgba,rgb_u16,rgba_u16}_matches_widen_then_rgbf32,
rgbf16_scalar_rgb_f32_matches_element_wise_widen,
rgbf16_scalar_rgb_f16_is_copy) + the rgbf16_test_inputs helper.

LE-host test count unchanged (gates are no-ops):
  Before: 2207 passed
  After:  2207 passed (cargo test --target aarch64-apple-darwin --lib)

cargo fmt --check, cargo clippy --all-targets --all-features
-- -D warnings, cargo build --target x86_64-apple-darwin --tests,
RUSTFLAGS="-C target-feature=+simd128" cargo build --target
wasm32-unknown-unknown --tests, cargo build --no-default-features
all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n = "little") (audit follow-up)

Codex 6th-pass review of PR #83 found a medium finding in
src/row/arch/neon/tests/packed_rgb_float.rs (lines 32-242): the eleven
Rgbf32 / Rgbf16 SIMD-vs-scalar parity tests use host-native fixtures
(`pseudo_random_rgbf32` / `pseudo_random_rgbf16`) and call the kernels
with `::<false>` (LE-encoded path). On a big-endian host (e.g.
aarch64-be-linux-gnu, miri-sb-s390x), the kernel's `u32::from_le` /
`u16::from_le` byte-swap correctly reinterprets the host-native fixture
as if it were LE-encoded — producing different bits than the test
asserts. Same class as the PR #82 alpha_extract / planar_gbr_high_bit
fix in 8f2e329 and the PR #83 5th-pass scalar gate in 56342c0.

For the SIMD-vs-scalar parity assertions (`assert_eq!(out_scalar,
out_simd)`), parity holds vacuously on BE because both paths apply
the same `from_le` byte-swap to the host-native fixture and produce
the same (corrupted) decoded f32/f16. For the two `lossless`
host-native equality assertions (`assert_eq!(out_neon,
input[..w * 3])` for `rgbf32_to_rgb_f32_row` and
`rgbf16_to_rgb_f16_row`), the assertion fails outright on BE since
the kernel decodes through `load_f32x4::<false>` / scalar `from_le`
to produce a byte-swapped (relative to host-native) result.

The kernel itself is correct on BE; this is purely a fixture-vs-
kernel byte-order mismatch. NEON BE-host correctness is locked down
separately by the dedicated BE-parity tests in this same module
(which build LE-encoded fixtures via `byte_swap` helpers and assert
`<true>`/`<false>` parity on every host) and by the LE-decode
regression tests added in commits c3a6478, dcf40a3, f1161d7,
63fdf8f. Those tests are intentionally NOT gated.

Fix: add `#[cfg(target_endian = "little")]` alongside `#[test]` on
the eleven NEON parity tests, plus a section-header comment block
explaining why these tests fail on BE, why the kernel itself remains
correct, and why byte-swap-helper / LE-decode regression tests are
intentionally NOT gated. The shared `pseudo_random_rgbf32` /
`pseudo_random_rgbf16` helpers are NOT gated because they're also
used by the BE-parity / LE-decode tests that compile on every host.

Audit of other backend test files (`packed_rgb_float.rs` under
x86_sse41, x86_avx2, x86_avx512, wasm_simd128):

  - SSE4.1: 12 tests with same pattern (1 MXCSR regression + 5

    Rgbf32 + 6 Rgbf16). Gated for structural consistency. These tests
    already compile only on `target_arch = "x86_64"`, which always
    implies `target_endian = "little"`, so the gate is functionally a
    no-op on every supported configuration — but it documents the
    assumption and matches the audit pattern.
  - AVX2: 11 tests (5 Rgbf32 + 6 Rgbf16). Same rationale.
  - AVX-512: 11 tests (5 Rgbf32 + 6 Rgbf16). Same rationale.
  - wasm_simd128: 11 tests (5 Rgbf32 + 6 Rgbf16).
    `target_arch = "wasm32"` is LE by spec; gate added for
    consistency / future-proofing against hypothetical BE wasm.

Total tests gated: 56 (11 NEON + 12 SSE4.1 + 11 AVX2 + 11 AVX-512 +
11 wasm_simd128).

LE-host test count unchanged (gates are no-ops):
  Before: 2207 passed
  After:  2207 passed (cargo test --target aarch64-apple-darwin --lib)

cargo fmt --check, cargo clippy --all-targets --all-features
-- -D warnings, cargo build --target x86_64-apple-darwin --tests
(0 warnings), RUSTFLAGS="-C target-feature=+simd128" cargo build
--target wasm32-unknown-unknown --tests (3 pre-existing warnings,
not from this commit), cargo build --no-default-features, and
cargo check --target s390x-unknown-linux-gnu --lib (BE-host smoke
check) all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three Copilot findings addressed (others were already fixed in dcf40a3
/ 56342c0 / 4340b15 — Copilot reviewed pre-fix state):

1. **Doc rewording** at `rgbf32_to_rgb_row` — was "bytes swapped
   relative to the host's native little-endian layout" which is
   misleading on BE hosts. Reframed in terms of the input buffer's
   encoded byte order vs the host CPU's native order.

2. **Pass-through perf fast path** for `rgbf32_to_rgb_f32_row` —
   added `BE == HOST_NATIVE_BE` branch that becomes a single
   `copy_from_slice` (memcpy) when the encoded byte order matches the
   host. Restores the pre-BE-aware "lossless pass-through" perf
   characteristic. The const-generic dead branch is eliminated per
   monomorphization, so the slow byte-swap path is only emitted for
   `BE != HOST_NATIVE_BE` callers.

3. **Pass-through perf fast path** for `rgbf16_to_rgb_f16_row` —
   mirror of (2) for half-precision input.
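The shape of both fast paths can be sketched as follows; `passthrough_row` and its flat `&[f32]` signature are hypothetical simplifications of the real kernels:

```rust
const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

/// Sketch of the pass-through fast path: when the input's encoded byte
/// order matches the host's native order, the whole row is a memcpy;
/// otherwise every element is rebuilt from swapped bits. The branch is
/// on a const generic, so each monomorphization keeps only one arm.
fn passthrough_row<const BE: bool>(input: &[f32], output: &mut [f32]) {
    if BE == HOST_NATIVE_BE {
        // Encoded order == host order: lossless pass-through.
        output.copy_from_slice(input);
    } else {
        // Cross-endian: byte-swap each element's bit pattern.
        for (o, i) in output.iter_mut().zip(input) {
            *o = f32::from_bits(i.to_bits().swap_bytes());
        }
    }
}

fn main() {
    let row = [1.5_f32, -0.25, 3.0];
    let mut out = [0.0_f32; 3];
    // On any host, `BE == HOST_NATIVE_BE` selects the memcpy arm.
    passthrough_row::<{ HOST_NATIVE_BE }>(&row, &mut out);
    assert_eq!(out, row);
}
```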

Verified:
  - cargo test --target aarch64-apple-darwin --lib: 2207 pass
  - cargo build --target x86_64-apple-darwin --tests: 0 warnings
  - cargo fmt --check: clean
  - cargo clippy --all-targets --all-features -D warnings: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio merged commit ea21141 into main, May 8, 2026
43 checks passed
uqio deleted the feat/be-tier9 branch, May 8, 2026 04:04
uqio added a commit that referenced this pull request May 8, 2026
Same as the NEON dedup in the rebase amend (`6d3c8b5`): PR #83's `c3a6478`
upstreamed `load_le_u16x4` / `load_be_u16x4` / `load_endian_u16x4` to the
SSE4.1 endian module. Tier 10 float's earlier addition of these helpers is
now redundant — caused E0428 duplicate-definition errors on x86 cross-builds
(test ubuntu/windows/macos, build ubuntu/macos/windows, sanitizer, sde-avx512
all failed in CI run 25536150449).

Verified:
  - cargo test --target aarch64-apple-darwin --lib: 2231 pass
  - cargo build --target x86_64-apple-darwin --tests: 0 errors, 0 warnings
  - cargo build --target wasm32-unknown-unknown --tests: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…IVE_BE routing

Codex review of PR #84 surfaced 3 high-severity correctness bugs in the BE
support added for the Tier 10 float planar GBR family. All 3 are direct
analogs of the PR #83 fixes (`dcf40a3` sinker routing + `c3a6478` dispatch
f16-widen routing) — the Tier 10 work missed both patterns.

Finding 1 — dispatch f16-widen double-byte-swap. The dispatcher's f16 → f32
widen-then-convert fallback paths called `g[offset + i].to_f32()` BEFORE any
endian-aware load, then routed the host-native f32 scratch into the f32
kernel with the source `BE`. On a BE-source-on-LE-host (or LE-source-on-BE-
host) this double-swaps: the f16 read interprets BE-encoded bytes as host-
native f16 (wrong), then the f32 kernel's `from_be` swaps the wrong f32
again. Fix: introduce a `widen_f16_be_to_host_f32::<BE>` helper that
normalizes f16 bits via `from_be` / `from_le` BEFORE `to_f32`, producing
host-native f32 for any source `BE` × host endian; route the downstream
`gbrpf32_to_*` chain via `HOST_NATIVE_BE = cfg!(target_endian = "big")`.
Applies to all 12 widen-fallback paths in `src/row/dispatch/planar_gbr_float.rs`
(4 main scalar fallbacks + 8 SIMD-arm scalar widens for NEON-no-fp16,
AVX-512-no-F16C, AVX2-no-F16C, SSE4.1-no-F16C across `gbrpf16_to_rgb_row`
and `gbrpf16_to_rgba_row`).
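The normalization at the core of the fix can be sketched on raw f16 bit patterns with the stdlib alone. `normalize_f16_bits` is a hypothetical reduction of `widen_f16_be_to_host_f32`, which additionally widens the normalized bits via `half::f16::from_bits(..).to_f32()`:

```rust
/// Bring stored f16 bits into host-native order BEFORE any float
/// interpretation, for either source endianness. The double-swap bug
/// arose from interpreting the bytes as an f16 before this step.
fn normalize_f16_bits<const BE: bool>(stored: u16) -> u16 {
    if BE {
        u16::from_be(stored) // BE-encoded source
    } else {
        u16::from_le(stored) // LE-encoded source
    }
}

fn main() {
    let host_native: u16 = 0x3C00; // f16 bit pattern of 1.0

    // Encode the value both ways, then normalize with the matching flag.
    let le_encoded = host_native.to_le();
    let be_encoded = host_native.to_be();

    // Each round-trips to host-native bits on ANY host × source-BE combo.
    assert_eq!(normalize_f16_bits::<false>(le_encoded), host_native);
    assert_eq!(normalize_f16_bits::<true>(be_encoded), host_native);
}
```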

Finding 2 — Gbrpf32 / Gbrapf32 sinker hardcodes `::<false>`. `Gbrpf32Frame`
exposes `&[f32]` rows in host-native layout; the sinker hardcoded `::<false>`
on every `gbrpf32_to_*` / `gbrapf32_to_*` call, so on BE hosts the kernel's
`from_le` loaders byte-swap an already-decoded host-native f32, corrupting
every output path. Fix: add module-scope `const HOST_NATIVE_BE: bool =
cfg!(target_endian = "big")` mirroring PR #83 `dcf40a3`'s `Rgbf32` sinker
pattern, replace all 22 `::<false>` call sites in
`src/sinker/mixed/planar_gbr_float.rs` with `::<HOST_NATIVE_BE>`. Strategy A+
α-plane consistency: `copy_alpha_plane_f32_to_u8` is endian-agnostic
(no `from_be` / `from_le` byte-load), so routing the RGB chain via
`HOST_NATIVE_BE` eliminates the prior LE-RGB + host-α mix-mode.

Finding 3 — Gbrpf16 / Gbrapf16 sinker hardcodes `::<false>`. Same pattern
as Finding 2 but for the f16 sinker, covering both the lossless f16 paths
(`gbrpf16_to_rgb_f16_row`, `gbrapf16_to_rgba_f16_row`, etc.) and the
widened f32 chain (`gbrpf32_to_*` after `widen_f16_to_f32`). After
`widen_f16_to_f32` consumes host-native `&[half::f16]` and emits host-native
f32, the downstream `gbrpf32_to_*` must use `HOST_NATIVE_BE`. Fix: add
`HOST_NATIVE_BE` const in `src/sinker/mixed/planar_gbr_f16.rs` mirroring
PR #83's `Rgbf16` sinker, replace all 22 `::<false>` call sites with
`::<HOST_NATIVE_BE>`. The Gbrapf16 standalone-RGBA + Strategy A+ paths
widen the f16 α plane to host-native f32 then call
`copy_alpha_plane_f32_to_u8` (endian-agnostic), so α consistency is
preserved by the RGB-chain fix.

Tests added (6 new, all on the Tier 10 sinker side):
- `gbrpf32_kernel_host_native_be_matches_false_on_le_host` — kernel-level
  routing equivalence at tail widths 5 / 7 / 33 across u8 RGB / u8 RGBA /
  u16 RGB / u16 RGBA / f32 lossless (LE-host gated; BE-host correctness
  is verified by row-kernel BE parity tests + the contract tests below).
- `gbrpf16_kernel_host_native_be_matches_false_on_le_host` — same coverage
  for the f16 dispatcher with both `use_simd = false` (scalar widen-
  fallback) and `use_simd = true` (SIMD widen path).
- `gbrpf32_sinker_host_native_contract_lossless_passthrough` — host-native
  f32 → `with_rgb_f32` bit-exact pass-through on every host (HDR, NaN, Inf).
- `gbrpf16_sinker_host_native_contract_lossless_passthrough` — host-native
  f16 → `with_rgb_f16` bit-exact pass-through on every host.
- `gbrapf16_sinker_host_native_contract_lossless_passthrough_with_alpha` —
  α plane round-trips bit-exact alongside RGB (Strategy A+ consistency).
- `gbrapf32_sinker_host_native_contract_lossless_passthrough_with_alpha` —
  same for f32 sources with α.

Audit summary across all 5 row backends for the f16 widening pattern:

| Backend         | Dispatch fallback | SIMD scalar tail (in row kernels) |
|-----------------|-------------------|------------------------------------|
| scalar          | n/a               | n/a                                |
| NEON            | FIXED             | unchanged (per "DO NOT" directive) |
| SSE4.1 + F16C   | FIXED             | unchanged (per "DO NOT" directive) |
| AVX2 + F16C     | FIXED             | unchanged (per "DO NOT" directive) |
| AVX-512 + F16C  | FIXED             | unchanged (per "DO NOT" directive) |
| wasm-simd128    | FIXED             | unchanged (per "DO NOT" directive) |

The SIMD scalar tails (`g[x + i].to_f32()` followed by
`scalar::gbrpf32_to_*::<BE>`) inside `arch::*::gbrpf16_*` row kernels share
the same bug pattern as the dispatch fallback. Per the explicit "DO NOT
touch row kernels themselves" directive in the codex review, those tails
are NOT modified in this commit; they are tracked as a separate follow-up.
The dispatch fallback fix here is sufficient for `use_simd = false` callers
and for backends where the SIMD widen kernel is not selected at runtime.
LE-host correctness for `use_simd = true` paths is preserved (since
`HOST_NATIVE_BE = false ≡ false`); BE-host correctness for the SIMD tails
will be addressed in a follow-up cleanup PR.

Verification:
- cargo test --target aarch64-apple-darwin --lib → 2237 passed
- cargo test --target x86_64-apple-darwin --lib → 2968 passed
- cargo build --target x86_64-apple-darwin --tests → clean
- RUSTFLAGS=-Ctarget-feature=+simd128 cargo build --target wasm32-unknown-unknown --tests → clean
- cargo build --no-default-features → clean
- cargo fmt --check → clean
- cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…dian.rs

Copilot review caught two stale comments in `src/row/arch/{neon,x86_sse41}/endian.rs`
that referenced "PR #83's be-tier9 branch" as the origin of the u16x4
loaders. The references made sense as transient dedup markers during
the rebase, but post-merge they're stale — the helpers are now just
normal members of each module. The previous comments above the u16x4
loaders already explain why both u16x8 and u16x4 variants exist
(tail-safe 8-byte loads for f16 widen kernels), so removing the dedup
note doesn't lose information.

Copilot's third finding (about `widen_f16_be_to_host_f32` doc claiming
shared use with dispatch when dispatch had its own copy) was already
resolved by 6e946a6 which dedup'd them — dispatch now imports the
scalar `pub(crate)` helper.

No code changes; comment-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…s-through fast path

Codex 2nd-pass review of PR #85 found two issues — one high-severity
correctness bug at the **sinker** layer (same class as PR #83 dcf40a3
and PR #84 8627280) plus a low-severity perf regression in the scalar
lossless pass-through.

**Finding 1 [high] — Grayf32 sinker hardcoded LE for host-native f32.**

The `MixedSinker<Grayf32>` `PixelSink::process` impl hardcoded `::<false>`
on every `grayf32_to_*_row` dispatcher call. With the new BE-aware
kernel semantics, `BE = false` means "decode LE-encoded input" via
`u32::from_le`. But `Grayf32Frame` exposes a host-native `&[f32]`
plane — the API contract is "caller passes host-native floats". On a
BE host, `::<false>` would byte-swap the already-decoded host-native
f32 inside the loaders, corrupting every Grayf32 output path
(including the lossless `with_luma_f32` and `with_rgb_f32`
pass-throughs **and** every downstream u8/u16/luma/HSV output).

Direct backend `::<true>` BE-parity tests (in `src/row/arch/*/tests/`)
don't catch this because they bypass the sinker entirely — they hand
the kernel BE-encoded bytes and assert against the LE-encoded
counterpart, exercising only the kernel decode boundary, not the
sinker-to-kernel routing.

The fix is the **sinker-layer** complement of the SIMD-backend-internal
HOST_NATIVE_BE introduced in PR #83's `c3a6478`, mirroring exactly the
fixes applied in PR #83 `dcf40a3` (`packed_rgb_float.rs`) and PR #84
`8627280` (`packed_rgb_f16.rs` / `planar_gbr_float.rs`):

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");

Defined at module scope in `src/sinker/mixed/gray.rs`. Every hardcoded
`::<false>` in the Grayf32 `process` impl is replaced with
`::<HOST_NATIVE_BE>`. Same truth table as the previous fixes:

  • LE host: HOST_NATIVE_BE = false → `from_le` (no-op on LE) → correct.
  • BE host: HOST_NATIVE_BE = true  → `from_be` (no-op on BE) → correct.

Out of scope (different layer):

  The Gray8 / GrayN / Gray16 / Ya8 / Ya16 sinkers in this same module
  take their `BE` from the source `Frame`'s flag (their Frames carry
  plane bytes whose encoding semantics are caller-specified — Phase 4
  territory) and must keep their existing `<BE>` routing. This commit
  only touches the Grayf32 sinker, whose Frame type unambiguously
  specifies host-native f32 element semantics.

Call sites changed in `src/sinker/mixed/gray.rs` (Grayf32 sinker only):

  • `grayf32_to_luma_f32_row`  (lossless luma pass-through)
  • `grayf32_to_rgb_f32_row`   (lossless RGB replicate)
  • `grayf32_to_luma_row`      (luma u8)
  • `grayf32_to_luma_u16_row`  (luma u16)
  • `grayf32_to_rgba_u16_row`  (standalone u16 RGBA fast path)
  • `grayf32_to_rgb_u16_row`   (u16 RGB)
  • `grayf32_to_rgba_row` ×2   (standalone u8 RGBA + HSV-with-RGBA)
  • `grayf32_to_hsv_row`       (standalone HSV fast path)
  • `grayf32_to_rgb_row`       (combined u8 RGB / HSV path)

10 dispatch-line replacements across 9 distinct dispatcher names.

**Finding 2 [low] — `grayf32_to_luma_f32_row` lossless pass-through perf.**

`grayf32_to_luma_f32_row` always looped and rebuilt each sample via
`to_bits` + `from_le`/`from_be`, even when the encoded byte order
matches the host-native (i.e. the no-swap case). Mirroring PR #83's
`b915754` `rgbf32_to_rgb_f32_row` fast path:

  const HOST_NATIVE_BE: bool = cfg!(target_endian = "big");
  if BE == HOST_NATIVE_BE {
    luma_f32_out[..width].copy_from_slice(&plane[..width]);
    return;
  }

The const-generic dead branch is eliminated per monomorphization, so
the slow byte-swap path is only emitted for `BE != HOST_NATIVE_BE`
callers. Restores the pre-BE-aware "lossless pass-through" perf
characteristic for the dominant LE-host + LE-encoded routing.

Not applied to `grayf32_to_rgb_f32_row` in this commit because that
kernel performs a broadcast (Y → R=G=B; output is 3× input, so no
`copy_from_slice` is possible). Doc rewording on
`grayf32_to_luma_f32_row` consistent with PR #83's `b915754` rewording
(input encoded byte order vs host CPU's native order, instead of "bytes
swapped relative to the host's native little-endian layout").

**Tests:**

Added 2 new sinker-level regression tests in
`src/sinker/mixed/gray.rs` (mirroring the PR #83 `dcf40a3` test pair):

  • `grayf32_kernel_host_native_be_matches_false_on_le_host` — calls
    each `grayf32_to_*` dispatcher (luma_f32, rgb_f32, luma u8/u16, RGB
    u8/u16, RGBA u8/u16, HSV) with both `BE = false` and
    `BE = HOST_NATIVE_BE` (= `cfg!(target_endian = "big")`), asserting
    outputs are byte-equal on the active host. LE-host-only (gated on
    `target_endian = "little"`); on a BE host the equality `::<false>`
    ≡ `::<HOST_NATIVE_BE>` is _false_ by design — the `::<true>` arm
    decodes the host-native fixture as if it were BE-encoded (no swap),
    while `::<false>` would byte-swap. Width 33 covers SIMD main loop
    + scalar tail across every backend. Doubles as a smoke test for
    the new `luma_f32` `copy_from_slice` fast path.

  • `grayf32_sinker_host_native_contract_lossless_passthrough` — feeds
    `Grayf32Frame` through the public sinker API and asserts both
    `with_luma_f32` and `with_rgb_f32` round-trip host-native input
    bit-exact. Pairs with the kernel-level test to cover both the
    dispatch boundary and the public sinker boundary. Runs on every
    host (LE and BE).

Both tests miri-ignored because the SIMD-dispatched row kernels use
intrinsics unsupported by Miri (consistent with existing tests).

**Verified locally:**

  • cargo test --target aarch64-apple-darwin --lib → 2271 pass
    (was 2269; +2 new tests).
  • cargo test --target x86_64-apple-darwin --lib → 3020 pass.
  • cargo test --no-default-features --lib → 35 pass.
  • cargo build --target x86_64-apple-darwin --tests → 0 warnings.
  • RUSTFLAGS="-C target-feature=+simd128" cargo build
    --target wasm32-unknown-unknown --tests → ok.
  • cargo build --no-default-features → ok.
  • cargo fmt --check → clean.
  • cargo clippy --all-targets --all-features -- -D warnings → clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…ttle") + add f32 BE parity test

CI miri-sb (run 25544222638) failed on powerpc64 / s390x / i686 /
aarch64-apple-darwin / aarch64-unknown-linux-gnu sanitizer targets:
the LE-fixture scalar tests in src/row/scalar/gray.rs (e.g.
gray10_full_range_pass_through, gray10_limited_range_midpoint,
gray16_limited_range_white, gray16_to_luma_u16_identity, gray9_*,
gray12/14_limited_range_white, ...) construct fixtures as host-native
`Vec<u16>` literals and call the kernels with `::<false>` (LE-encoded
contract). On a big-endian host, host-native u16 bits do NOT lay out
little-endian, so the kernel's `u16::from_le` swap correctly
reinterprets the fixture as if it were an LE-encoded payload —
producing a different (corrupted) value than the test asserts.

Same class as PR #82's `8f2e329` (alpha_extract / planar_gbr_high_bit)
and PR #83's `56342c0` (Rgbf16 scalar widen / parity / copy tests).

Fix: gate the 24 affected tests on `#[cfg(target_endian = "little")]`
and add a section-header comment explaining the rationale (mirroring
PR #82 / #83). Tests intentionally NOT gated:
  - gray8_* tests (u8 source, no endian dependency)
  - byte-symmetric value tests (`0xFFFF`, `u16::MAX`):
    `gray16_to_rgb_u16_limited_range_over_white_clamps`,
    `gray_n_to_luma_u16_10bit_masks`
  - dedicated `*_be_parity_*` tests built via `swap_bytes()` —
    those lock down BE-host kernel correctness directly and pass on
    every host.
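The byte-symmetry exemption is easy to verify mechanically (stdlib sketch):

```rust
fn main() {
    // Byte-symmetric values decode identically under either byte order,
    // so tests built from them pass ungated on LE and BE hosts alike.
    assert_eq!(0xFFFF_u16.swap_bytes(), 0xFFFF);
    assert_eq!(u16::MAX.swap_bytes(), u16::MAX);

    // A non-symmetric fixture, by contrast, changes under a swap — such
    // values are the ones gated on `target_endian = "little"`.
    assert_ne!(0x03FF_u16.swap_bytes(), 0x03FF);
}
```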

Audit: full sweep of `src/row/scalar/gray.rs` `mod tests` block; no
shared helper functions to gate (every fixture is constructed inline).

Part 2 (closes Copilot review PR #85 finding 3): adds
`grayf32_to_luma_f32_row_be_le_parity_lossless` in
`src/row/scalar/grayf32.rs`. The existing `grayf32_be_parity_*` suite
covers integer-output paths but never exercises the lossless
`grayf32_to_luma_f32_row::<true>` (BE-encoded f32 → host-native f32)
fast/slow paths. The new test builds a BE-encoded mirror via the same
`f32::from_bits(v.to_bits().swap_bytes())` construction the suite
already uses for `f32_to_be_bytes`, runs both BE and LE kernels, and
asserts bitwise equality of their outputs (NaN-safe via `f32::to_bits`).
Path coverage by host:
  - LE host: LE = memcpy fast path; BE = slow swap path.
  - BE host: LE = slow swap path;   BE = memcpy fast path.
This exercises the `BE == HOST_NATIVE_BE` gate from both directions.

Copilot findings 1 + 2 REJECTED: those suggested replacing
`if BE == HOST_NATIVE_BE { copy_from_slice }` with `if !BE { ... }`.
That simplification is incorrect on BE hosts:
  - BE host, BE=false: `!BE = true` → would copy_from_slice the
    LE-encoded buffer as-is, propagating swapped bytes into output.
  - The current `BE == HOST_NATIVE_BE` gate handles all four
    host×data quadrants correctly.
Kernel code unchanged (codex 5th-pass approved the existing gate).
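The rejection can be checked across all four quadrants with a small simulation. This is a sketch: `host_be` stands in for `cfg!(target_endian = "big")` so both hosts can be exercised in one run, and `needs_swap` is a hypothetical helper:

```rust
/// Does this (host, data) combination require a byte swap?
fn needs_swap(host_be: bool, data_be: bool) -> bool {
    host_be != data_be
}

fn main() {
    for (host_be, data_be) in
        [(false, false), (false, true), (true, false), (true, true)]
    {
        // Correct gate: fast-path copy only when the encoded byte order
        // matches the host's native order.
        let correct_fast_path = data_be == host_be;
        assert_eq!(correct_fast_path, !needs_swap(host_be, data_be));

        // The suggested `!BE` gate copies whenever the data is LE,
        // regardless of host — wrong in the (BE host, LE data) quadrant.
        let suggested_fast_path = !data_be;
        if host_be && !data_be {
            assert_ne!(suggested_fast_path, !needs_swap(host_be, data_be));
        }
    }
}
```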

Verified:
  cargo test --target aarch64-apple-darwin --lib    → 2271 passed
    (previous 2270 + 1 new f32 BE parity test)
  cargo build --target x86_64-apple-darwin --tests  → 0 warnings
  RUSTFLAGS=-Ctarget-feature=+simd128 cargo build
    --target wasm32-unknown-unknown --tests          → ok
  cargo build --no-default-features                  → ok
  cargo fmt --check                                   → clean
  cargo clippy --all-targets --all-features
    -- -D warnings                                    → clean
  cargo check --target s390x-unknown-linux-gnu --lib → ok
    (BE-host smoke check; cross-toolchain link errors at --tests
    stage are expected and acceptable per project convention)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
Affected types: Rgbf32Frame, Rgbf16Frame, Gbrpf32Frame, Gbrapf32Frame,
Gbrpf16Frame, Gbrapf16Frame.

Each doc-comment now explicitly states the LE-encoded byte contract,
mirroring the established Grayf32Frame doc pattern: the &[f32] / &[f16]
plane is the LE-encoded byte layout reinterpreted as f32 / f16, matching
FFmpeg's canonical *LE pixel-format suffixes (AV_PIX_FMT_GBRPF32LE etc).
Adds the bytemuck::cast_slice + linesize-division-by-element-size
instruction so callers wiring up FFmpeg buffers don't have to guess.
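A stdlib-only sketch of the documented byte contract (the real wiring reinterprets the FFmpeg buffer in place with `bytemuck::cast_slice` instead of copying; `decode_le_f32_row` is hypothetical):

```rust
/// Decode one row of an LE-encoded f32 plane into host-native floats —
/// the byte layout the *LE pixel formats (e.g. AV_PIX_FMT_GBRPF32LE)
/// specify for the &[f32] planes.
fn decode_le_f32_row(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    // Two LE-encoded f32 samples: 1.0 (0x3F800000) and -2.0 (0xC0000000).
    let plane: [u8; 8] = [0x00, 0x00, 0x80, 0x3F, 0x00, 0x00, 0x00, 0xC0];
    assert_eq!(decode_le_f32_row(&plane), vec![1.0_f32, -2.0]);

    // FFmpeg linesize is in bytes; the element count per row is
    // linesize / size_of::<f32>() once the row is viewed as &[f32].
    let linesize = 8usize;
    assert_eq!(linesize / std::mem::size_of::<f32>(), 2);
}
```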

Codex 3rd-pass review of PR #85 caught that PR #83 + PR #84 introduced
sinker routings that assumed the planes are already host-native f32 / f16
— a wrong reading of the *LE contract that would corrupt every output
path on a BE host. Pinning the doc-level contract here so future readers
don't repeat the mistake; sinker revert lands in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
Mirrors PR #85's Grayf32 sinker revert (52f8191) for the other six float
frame types affected by PR #83 (dcf40a3) and PR #84 (8627280):
Rgbf32, Rgbf16, Gbrpf32, Gbrapf32, Gbrpf16, Gbrapf16.

The earlier PRs introduced sinker-layer `HOST_NATIVE_BE` consts that
made the row-kernel `BE` parameter a no-op byte-swap on whichever host
the build targeted. That assumed the Frame's plane bytes are already
host-native f32/f16 — which contradicts the LE-encoded byte contract
clarified in the previous commit. On a BE host the routing skipped the
required `u32::from_le` / `u16::from_le` swap and would have corrupted
every output path.

Reverted call sites (all `::<HOST_NATIVE_BE>` -> `::<false>`):
  - src/sinker/mixed/packed_rgb_float.rs:    6 sites (Rgbf32)
  - src/sinker/mixed/packed_rgb_f16.rs:      7 sites (Rgbf16)
  - src/sinker/mixed/planar_gbr_float.rs:   22 sites (Gbrpf32 + Gbrapf32)
  - src/sinker/mixed/planar_gbr_f16.rs:     22 sites (Gbrpf16 + Gbrapf16)

Removes the four `const HOST_NATIVE_BE: bool = cfg!(target_endian = "big")`
definitions and their explanatory doc-comments. Rewrites the
`widen_f16_to_f32` doc in planar_gbr_f16.rs to accurately describe the
LE-encoded contract (helper is correct on LE host; cross-endian widen
should use the bit-normalising `widen_f16_be_to_host_f32` pattern).

Untouched: `BE == HOST_NATIVE_BE` `copy_from_slice` fast paths in scalar
row kernels (`rgbf32_to_rgb_f32_row`, `gbrpf32_to_rgb_f32_row`, etc.) —
generic in `BE`, correct under either contract. Per-arch SIMD kernels
and `widen_f16_be_to_host_f32` left alone — different layer.

Verification (aarch64-apple-darwin):

  cargo test --target aarch64-apple-darwin --lib   # 2246 ok

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio added a commit that referenced this pull request May 8, 2026
…oded regressions

Removes the host-native-contract tests that PR #83 + PR #84 added — they
were typed against an incorrect host-native-f32/f16 reading of the Frame
contract and would mask the BE-host corruption introduced by the now-
reverted `::<HOST_NATIVE_BE>` sinker routings:

  - rgbf32_kernel_host_native_be_matches_false_on_le_host       (deleted)
  - rgbf32_sinker_host_native_contract_lossless_passthrough     (deleted)
  - rgbf16_kernel_host_native_be_matches_false_on_le_host       (deleted)
  - rgbf16_sinker_host_native_contract_lossless_passthrough     (deleted)
  - gbrpf32_kernel_host_native_be_matches_false_on_le_host      (deleted)
  - gbrpf32_sinker_host_native_contract_lossless_passthrough    (deleted)
  - gbrpf16_kernel_host_native_be_matches_false_on_le_host      (deleted)
  - gbrpf16_sinker_host_native_contract_lossless_passthrough    (deleted)
  - gbrapf32_sinker_host_native_contract_lossless_passthrough_with_alpha  (deleted)
  - gbrapf16_sinker_host_native_contract_lossless_passthrough_with_alpha  (deleted)

Adds LE-encoded byte contract regressions following PR #85's `52f8191`
pattern (`grayf32_sinker_le_encoded_frame_decodes_correctly`):

  - rgbf32_sinker_le_encoded_frame_decodes_correctly    (new)
  - rgbf16_sinker_le_encoded_frame_decodes_correctly    (new)
  - gbrpf32_sinker_le_encoded_frame_decodes_correctly   (new)
  - gbrapf32_sinker_le_encoded_frame_decodes_correctly  (new)
  - gbrpf16_sinker_le_encoded_frame_decodes_correctly   (new)
  - gbrapf16_sinker_le_encoded_frame_decodes_correctly  (new)

Each test constructs an explicitly LE-encoded plane via
`f32::from_bits(intended.to_bits().to_le())` (or the f16 analogue),
builds the Frame, runs the sinker's lossless pass-through, and asserts
the output equals the host-native intended values. Vacuous on LE hosts
(where `to_le` is identity) but on a BE host any regression that drops
the `::<false>` kernel routing would fail fast.

Also picks up two trivial pre-existing trailing-blank-line cleanups in
`src/row/arch/{neon,x86_sse41}/endian.rs` produced by `cargo fmt` while
running it across the modified files.

Verification (aarch64-apple-darwin, x86_64-apple-darwin, wasm32):

  cargo test --target aarch64-apple-darwin --lib                  # 2242 ok
  cargo build --target x86_64-apple-darwin --tests                # ok
  RUSTFLAGS="-C target-feature=+simd128" \
    cargo build --target wasm32-unknown-unknown --tests           # ok
  cargo build --no-default-features                               # ok
  cargo fmt --check                                               # ok
  cargo clippy --all-targets --all-features -- -D warnings        # ok

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
al8n pushed a commit that referenced this pull request May 8, 2026