
feat(be-tier10b): BE support for Gbrp/Gbrap 9-16-bit row kernels #82

Merged
uqio merged 7 commits into main from feat/be-tier10b
May 7, 2026

Conversation

@uqio (Collaborator)

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 10b BE rollout. Stacked on top of #81 (BE infra). Adds <const BE: bool> generic to all Gbrp/Gbrap high-bit (9/10/12/14/16-bit) row kernels across all 6 backends + dispatcher layer.

Implementation:

  • Scalar reads: if BE { x.swap_bytes() } else { x }
  • SIMD loads: load_endian_u16x8::<BE>(ptr) (and 16/32 variants) from crate::row::arch::{backend}::endian::*
  • Unused branch is const-folded at monomorphization — zero runtime cost
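A distilled sketch of the shape being added (the function name here is hypothetical, not a real crate symbol; the scalar read shown uses the `u16::from_be`/`u16::from_le` semantics this PR converges on in later commits). The unused branch disappears at monomorphization, so the `BE = false` instantiation carries no runtime cost:

```rust
// Hypothetical distilled row-kernel sample read. `gbr_high_bit_sample` is
// illustrative only; real kernels live in the crate's row modules. The
// `if BE` branch is const-folded away when the function is monomorphized.
fn gbr_high_bit_sample<const BITS: u32, const BE: bool>(raw: u16) -> u16 {
    // Swap only when the stored byte order differs from the host's.
    let x = if BE { u16::from_be(raw) } else { u16::from_le(raw) };
    // Mask off dirty bits above the declared bit depth.
    x & ((1u16 << BITS) - 1)
}

fn main() {
    // A 10-bit sample stored big-endian, read with BE = true:
    assert_eq!(gbr_high_bit_sample::<10, true>(0x0123u16.to_be()), 0x0123);
    // The same sample stored little-endian, read with BE = false:
    assert_eq!(gbr_high_bit_sample::<10, false>(0x0123u16.to_le()), 0x0123);
    println!("ok");
}
```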

Scope:

  • ✅ Row kernels + dispatchers
  • ✅ 72 new BE parity tests across all 6 test modules
  • ⏸ Sinker: minimal compile fix only (false hardcoded at dispatch call sites — no BE plumbing yet)
  • ⏸ Frame/Walker BE wiring deferred to Phase 4 follow-up

Test results: 2177 tests pass total.

Stacking

Base: feat/be-infra (#81). Once #81 merges, this PR will rebase cleanly onto main.

Test plan

  • cargo test --target aarch64-apple-darwin (NEON + scalar)
  • cargo build --target x86_64-apple-darwin --tests (SSE4.1 / AVX2 / AVX-512 cross-compile)
  • cargo build --target wasm32-unknown-unknown --tests (wasm-simd128 cross-compile)
  • s390x QEMU job (Phase 3, deferred)

🤖 Generated with Claude Code

@uqio uqio force-pushed the feat/be-tier10b branch from a80fbf1 to 3297d01 on May 7, 2026 12:31
Base automatically changed from feat/be-infra to main May 7, 2026 12:37
Add <const BE: bool> const-generic to all Gbrp/Gbrap high-bit row
kernels (scalar, NEON, SSE4.1, AVX2, AVX512, wasm-simd128) and their
dispatchers.  Sinker gets a minimal compile fix (hardcoded false).
Adds BE parity tests in all six test modules (72 new test cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio force-pushed the feat/be-tier10b branch from 3297d01 to 8c6b6dc on May 7, 2026 12:39
uqio and others added 4 commits May 8, 2026 00:48
Codex adversarial review of #82 caught a high-severity bug: the
scalar `if BE { x.swap_bytes() } else { x }` pattern unconditionally
swaps bytes whenever `BE = true`, regardless of target endianness.
The correct behavior is a no-op whenever the data byte order already
matches the host CPU's byte order.

That diverges from the SIMD `load_endian_u16x*::<BE>` helpers from
#81, which are target-endian-aware (a swap is needed only when the
data byte order differs from the host).  Mismatched semantics
between scalar and SIMD paths means scalar tails (and luma kernels,
which are scalar-only) would corrupt rows on a big-endian host
(s390x), in BOTH BE=true and BE=false cases.

The fix replaces every scalar load with the standard `u16::from_be`
/ `u16::from_le` pair, which expand exactly to the SIMD helper
semantics: each is a no-op when the data byte order matches the
host, and a `swap_bytes()` when they differ.

  if BE { u16::from_be(r[x]) } else { u16::from_le(r[x]) }

26 call sites across the planar_gbr_high_bit kernels (g, b, r, a)
updated. Test helper `byte_swap_vec` left as-is; it intentionally
synthesizes BE-encoded buffers from LE inputs for parity tests on
LE-host CI (a future follow-up should make the test helper
target-endian aware too, when Phase 3 s390x QEMU coverage lands).
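A minimal sketch of the parity check this enables (`byte_swap_vec` and `load` here are simplified stand-ins for the crate's test helper and kernel read, written so the assertion holds on any host):

```rust
// Stand-in for the crate's `byte_swap_vec` test helper: synthesize a
// BE-encoded buffer from an LE-encoded one by swapping every u16.
fn byte_swap_vec(src: &[u16]) -> Vec<u16> {
    src.iter().map(|x| x.swap_bytes()).collect()
}

// Stand-in for the fixed scalar read.
fn load<const BE: bool>(raw: u16) -> u16 {
    if BE { u16::from_be(raw) } else { u16::from_le(raw) }
}

fn main() {
    let le: Vec<u16> = vec![0x0123u16.to_le(), 0x03FFu16.to_le(), 0x0001u16.to_le()];
    let be = byte_swap_vec(&le);
    // BE = true over the swapped buffer must match BE = false over the original.
    for (l, b) in le.iter().zip(be.iter()) {
        assert_eq!(load::<false>(*l), load::<true>(*b));
    }
    println!("parity ok");
}
```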

Verified: 2177 tests pass; cargo fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex adversarial review of #82 caught a high-severity bug in the
Strategy A+ alpha-only fast-path helpers. `copy_alpha_plane_u16_to_u8`
and `copy_alpha_plane_u16` mask + shift raw u16 values without endian
awareness. On a big-endian host (s390x) processing LE-encoded Gbrap
input, the alpha plane is byte-reversed when both `with_rgb` and
`with_rgba` outputs are requested — the same class of bug we already
fixed for the direct `gbra_to_rgba_*` kernels in 26e5077.

The direct kernels are correct because they were threaded through
`<const BITS: u32, const BE: bool>` in PR #82 + 26e5077. The α-extract
helpers were left at `<const BITS: u32>` with raw `alpha[n]` reads and
would silently produce incorrect output on a BE host.

Fix: thread `<const BE: bool>` through both scalar α-extract helpers
and apply the same `u16::from_be` / `u16::from_le` pattern as the
direct-kernel scalar fix:

  let raw = if BE { u16::from_be(alpha[n]) } else { u16::from_le(alpha[n]) };
  rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8;  // u16-to-u8 variant

Each conversion compiles to a no-op when the data byte order matches
the host CPU and a byte-swap otherwise, mirroring the SIMD
`load_endian_u16x*::<BE>` semantics from #81. Scalar tails and SIMD
hot paths now stay byte-for-byte equivalent on every host for
BE = false (the case currently exercised).
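A distilled form of the fixed helper (the real function is `copy_alpha_plane_u16_to_u8`; the signature and loop here are simplified for illustration):

```rust
// Simplified sketch of the endian-aware scalar α-extract: read each u16
// alpha sample with the declared byte order, mask to BITS, then shift
// down to u8 range. Buffer layout (RGBA interleaved, α at offset 3) is
// an assumption of this sketch.
fn copy_alpha_u16_to_u8<const BITS: u32, const BE: bool>(alpha: &[u16], rgba_out: &mut [u8]) {
    let mask = (1u16 << BITS) - 1;
    let shift = BITS - 8; // depth conversion down to u8
    for (n, &a) in alpha.iter().enumerate() {
        let raw = if BE { u16::from_be(a) } else { u16::from_le(a) };
        rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8;
    }
}

fn main() {
    let alpha = [0x03FFu16.to_le(), 0x0200u16.to_le()];
    let mut out = [0u8; 8];
    copy_alpha_u16_to_u8::<10, false>(&alpha, &mut out);
    assert_eq!(out[3], 0xFF); // 0x3FF >> 2
    assert_eq!(out[7], 0x80); // 0x200 >> 2
    println!("ok");
}
```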

The dispatcher (`row::dispatch::alpha_extract`) gained the matching
`<const BE: bool>` parameter. When `BE = true` it routes directly to
scalar — the SIMD α-extract backends use raw native-u16 loads
(`vld1q_u16` / `_mm_loadu_si128` / `v128_load64_zero`) and have no
byte-swap path, so feeding them BE-encoded input would re-introduce
the same corruption. Per the spec ("Don't touch SIMD α-extract paths
... codex didn't flag those"), the SIMD kernels keep their existing
LE-oriented loads. Phase 4 will plumb `<const BE: bool>` through SIMD
if/when a real BE-input sinker hot-path lands.

All sinker call sites pass `<BITS, false>` for now (LE-only sinkers
today; matches the `false` already passed to the sibling
`gbr_to_rgb_u16_high_bit_row::<BITS, false>` calls). Eight call sites
updated:

  - sinker/mixed/planar_gbr_high_bit.rs (Tier 10b, 2 sites)
  - sinker/mixed/yuva_4_4_4.rs           (Tier 9, 2 sites)
  - sinker/mixed/yuva_4_2_2.rs           (Tier 9, 2 sites)
  - sinker/mixed/yuva_4_2_0.rs           (Tier 9, 2 sites)

Each call site has an inline `// BE = false: ...` comment naming Phase 4
as the follow-up that will plumb a real `<const BE: bool>` from the
row type.

The 8/16-bit variants `copy_alpha_plane_u8`,
`copy_alpha_packed_u8x4_at_3`, `copy_alpha_packed_u16x4_to_u8_at_0`,
`copy_alpha_packed_u16x4_at_0`, and `copy_alpha_ya_*` are unchanged:
the 8-bit ones have no endianness; the AYUV64 / Rgba64 / Bgra64 / Ya16
variants take packed sources whose endianness is already a property of
the source's row-type wrapper rather than this helper.

The f32 helpers (`copy_alpha_plane_f32*`) are left untouched — they
belong to Tier 10 float (Gbrapf32 / Gbrpf16), out of scope for this
PR. They will be addressed in a separate PR when Phase 4 rolls up
through the float sinkers.

Tests added:
  - `copy_alpha_plane_u16_to_u8_be_parity_with_swapped_buffer` —
    builds a host-side `swap_bytes` of the LE fixture, calls the
    helper with `<10, false>` on the LE buffer and `<10, true>` on
    the BE-encoded buffer, asserts identical output. Locks down the
    BE-flag round-trip on every host.
  - `copy_alpha_plane_u16_be_parity_with_swapped_buffer` — same
    pattern for the u16-output variant.

Existing scalar tests retargeted to `<BITS, false>` (LE) to preserve
current behavior. SIMD parity tests in
`row/arch/{neon,x86_*,wasm_simd128}/alpha_extract.rs` retargeted the
scalar reference call to `<BITS, false>` — the SIMD helpers do
host-native loads, which matches scalar BE = false on LE hosts.

Verification:
  - cargo test --target aarch64-apple-darwin --lib → 2179 passed
  - cargo test --target x86_64-apple-darwin   --lib → 2873 passed
  - cargo build --target x86_64-apple-darwin --tests → 0 warnings
  - RUSTFLAGS=+simd128 cargo build --target wasm32-unknown-unknown
      --tests → only pre-existing unused-import warnings
  - cargo build --no-default-features → ok
  - cargo fmt --check → clean
  - cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…from data endian

Codex 3rd-pass review of PR #82 caught two issues that survived the prior
two rounds. This commit fixes both.

Finding 1 [high]: LE-Strategy-A+ alpha SIMD path corrupts on BE hosts.
The previous routing in `row::dispatch::alpha_extract::{copy_alpha_plane_u16_to_u8,
copy_alpha_plane_u16}` used `if !use_simd || BE { scalar } else { SIMD }`.
That correctly routed BE-encoded input to scalar but silently broke the
mirror case: an LE-encoded Gbrap source on a BE host (`BE = false`,
`target_endian = "big"`) would still take the SIMD path, which uses raw
host-native u16 loads (`vld1q_u16` / `_mm_loadu_si128` /
`v128_load64_zero`). Those reads byte-swap the LE bytes on a BE host —
silently corrupting the α plane.

The fix replaces the narrow `BE` check with a real "do data and host
disagree?" check:

  let need_swap = BE != cfg!(target_endian = "big");
  if need_swap || !use_simd {
    // scalar — `u16::from_le` / `u16::from_be` handles the swap.
  } else {
    // SIMD — host-native loads are correct because data byte order
    // already matches the host CPU.
  }

Truth table:
  - LE data, LE host: need_swap = false != false = false  → SIMD ok
    (host-native LE u16 reads match LE encoding).
  - LE data, BE host: need_swap = false != true  = true   → scalar
    (scalar uses `u16::from_le`, swaps on BE host as needed).
  - BE data, LE host: need_swap = true  != false = true   → scalar
    (scalar uses `u16::from_be`, swaps on LE host as needed).
  - BE data, BE host: need_swap = true  != true  = false  → SIMD ok
    (host-native BE u16 reads match BE encoding).
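The truth table can be checked mechanically. In this sketch `host_big` stands in for `cfg!(target_endian = "big")`, made a runtime argument so all four quadrants run on one machine:

```rust
// The "do data and host disagree?" predicate from the dispatcher fix,
// with the host endianness as an explicit parameter for testability.
fn need_swap(be_data: bool, host_big: bool) -> bool {
    be_data != host_big
}

fn main() {
    assert_eq!(need_swap(false, false), false); // LE data, LE host -> SIMD ok
    assert_eq!(need_swap(false, true),  true);  // LE data, BE host -> scalar
    assert_eq!(need_swap(true,  false), true);  // BE data, LE host -> scalar
    assert_eq!(need_swap(true,  true),  false); // BE data, BE host -> SIMD ok
    println!("truth table holds");
}
```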

Both u16 alpha-plane dispatchers (`copy_alpha_plane_u16_to_u8` and
`copy_alpha_plane_u16`) get the same fix. Doc comments updated with the
truth table for future readers. SIMD α-extract internals are untouched
— per the spec, they remain native-host-only by design; Phase 4 will
plumb `<const BE: bool>` through SIMD if a real BE-input sinker
hot-path lands. Sinker call sites are unchanged (they continue to pass
`BE = false`).

Finding 2 [medium]: Native-depth Strategy A+ alpha scatter had no test
coverage. The existing Strategy A+ integration tests for Gbrap10/12/14/16
in `src/sinker/mixed/tests/planar_gbr_high_bit.rs` only covered the u8
alpha-scatter path (`with_rgb` + `with_rgba`, which routes through
`copy_alpha_plane_u16_to_u8`). The native-depth combo path
`with_rgb_u16` + `with_rgba_u16` calls `copy_alpha_plane_u16` and was
unexercised — a regression there would not have been caught.

Fix: added `test_gbrap_strategy_a_plus_u16!` macro mirroring the
existing u8 macro, with one instance per bit depth (10, 12, 14, 16):
  - Build a Gbrap source with full-range u16 G/B/R/α plane values
    (using `pseudo_random_u16_low_n_bits` with `bits=16` so the upper
    bits beyond BITS are dirty — exercises the `(1 << BITS) - 1` mask
    in both the direct kernel and α-extract paths).
  - Run standalone: attach only `with_rgba_u16`, drives the direct
    4-channel `gbra_to_rgba_u16_high_bit_row` kernel.
  - Run combo: attach both `with_rgb_u16` AND `with_rgba_u16`, drives
    the Strategy A+ path (`gbr_to_rgb_u16_high_bit_row` →
    `expand_rgb_u16_to_rgba_u16_row` → `copy_alpha_plane_u16`).
  - Assert byte-exact equality between the two RGBA u16 buffers.

This mirrors the existing `test_gbrap_strategy_a_plus!` macro pattern
exactly.

Verification:
  - cargo test --target aarch64-apple-darwin --lib → 2183 passed (+4)
  - cargo test --target x86_64-apple-darwin   --lib → 2877 passed (+4)
  - cargo build --target x86_64-apple-darwin --tests → no new warnings
  - RUSTFLAGS=+simd128 cargo build --target wasm32-unknown-unknown
      --tests → only pre-existing unused-import warnings
  - cargo build --no-default-features → ok
  - cargo fmt --check → clean
  - cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iple-width tests

Codex 4th-pass review of PR #82 found a remaining ship-blocker: the prior
dispatcher routing in `src/row/dispatch/alpha_extract.rs`
(`need_swap = BE != cfg!(target_endian = "big")`) admitted SIMD on
BE-host/BE-data. The vector body's host-native u16 loads are correct in
that quadrant, but every alpha SIMD backend hardcodes its scalar tail to
`scalar::<BITS, false>` (NEON `src/row/arch/neon/alpha_extract.rs:249,295`,
SSE/AVX2/AVX-512/wasm mirror this). On a BE host with BE data and a
non-multiple width, the LE-only scalar tail then runs `u16::from_le` over
already-native samples, byte-swapping them before mask/shift — at BITS=10
sample `0x0123` becomes `0x2301 & 0x03ff = 0x0301`. Silent α corruption.
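The corruption arithmetic can be reproduced on any host by simulating the erroneous swap explicitly (on the actual BE host it is `u16::from_le` that performs it):

```rust
// Reproduce the tail-corruption numbers from the commit message: a
// BE-host-native sample run through an unwanted byte swap, then masked
// at BITS = 10. The explicit swap_bytes() stands in for what the
// hardcoded `from_le` tail would do on a BE host.
fn main() {
    let native_sample: u16 = 0x0123;                  // already correct on the BE host
    let wrongly_swapped = native_sample.swap_bytes(); // the unwanted from_le swap
    let mask: u16 = (1 << 10) - 1;                    // BITS = 10 -> 0x03FF
    assert_eq!(wrongly_swapped, 0x2301);
    assert_eq!(wrongly_swapped & mask, 0x0301);       // silent alpha corruption
    println!("ok");
}
```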

The existing Strategy A+ tests in
`src/sinker/mixed/tests/planar_gbr_high_bit.rs` use width 32, which is a
multiple of every backend's SIMD block, so the tail path was never
exercised.

Fix (option B from codex's recommendation — simpler than threading BE
through the SIMD helpers):

  let safe_for_simd = !BE && cfg!(target_endian = "little");
  if !safe_for_simd || !use_simd {
    // scalar — handles all (host_endian, BE) combinations correctly
  } else {
    // SIMD — only LE host + LE data
  }

Applied to both `copy_alpha_plane_u16_to_u8` and `copy_alpha_plane_u16`.

Truth table (`safe_for_simd = !BE && target_endian == "little"`):
- LE data, LE host: `!false && true  = true`  → SIMD (correct; tail
  `from_le` is a no-op)
- LE data, BE host: `!false && false = false` → scalar (correct;
  uses `from_le`)
- BE data, LE host: `!true  && true  = false` → scalar (correct;
  uses `from_be`)
- BE data, BE host: `!true  && false = false` → scalar (correct;
  uses `from_be`. SIMD vector body would be correct but the tail
  hardcodes BE=false and would corrupt non-multiple widths via
  `from_le` on already-native samples — until SIMD helpers are made
  const-generic over BE in a future Phase, scalar covers this rare
  quadrant correctly.)
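As with the earlier `need_swap` check, the predicate can be verified for all four quadrants; `host_little` stands in for `cfg!(target_endian = "little")` so every row runs on one machine:

```rust
// The final dispatcher predicate: SIMD is admitted only in the
// LE-data / LE-host quadrant; every other combination takes scalar.
fn safe_for_simd(be_data: bool, host_little: bool) -> bool {
    !be_data && host_little
}

fn main() {
    assert!( safe_for_simd(false, true));  // LE data, LE host -> SIMD
    assert!(!safe_for_simd(false, false)); // LE data, BE host -> scalar
    assert!(!safe_for_simd(true,  true));  // BE data, LE host -> scalar
    assert!(!safe_for_simd(true,  false)); // BE data, BE host -> scalar
    println!("ok");
}
```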

Trade-off: BE-host/BE-data callers pay the scalar cost. Acceptable —
this is a rare quadrant; eventual Phase 4 work can thread BE into the
SIMD helpers if a real BE-input hot path lands.

Doc comments in both dispatchers updated to reflect the new contract.

New tests at non-multiple width 31 in
`src/sinker/mixed/tests/planar_gbr_high_bit.rs` exercise the SIMD tail
path on supported (LE) hosts:
- `gbrap10/12/14/16_strategy_a_plus_u16_matches_standalone_w31` — covers
  `copy_alpha_plane_u16` (no depth conv, u16→u16 RGBA path).
- `gbrap10_strategy_a_plus_matches_standalone_w31` — covers
  `copy_alpha_plane_u16_to_u8` (depth-conv `>> (BITS - 8)` path).

The two existing Strategy A+ macros (`test_gbrap_strategy_a_plus`,
`test_gbrap_strategy_a_plus_u16`) gained a 5-arg form taking width;
existing 4-arg callers default to width 32 unchanged. Test count moves
from 2183 to 2188 (+5).

Out of scope for this commit: touching the SIMD α-extract helpers
themselves (option A would require threading BE through 5 backends —
deferred until a real BE-input hot path needs it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Phase 2 of the BE rollout for Tier 10b: adds a <const BE: bool> const-generic to all high-bit (9–16b) planar GBR/GBRA row kernels across scalar + all SIMD backends, wires the dispatcher layer to pass BE, and adds extensive BE parity coverage. Sinker plumbing is intentionally deferred (call sites hardcode BE=false for now).

Changes:

  • Add BE const-generic to planar GBR/GBRA high-bit row kernels in dispatch + scalar + SIMD backends, using endian-aware SIMD loaders where available.
  • Update alpha-extract scalar + dispatch logic to be endian-aware, and restrict SIMD alpha-extract to the LE-host/LE-data quadrant (scalar elsewhere).
  • Add/extend tests for BE parity and Strategy A+ path parity (including non-multiple widths to exercise SIMD tails).

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.

File Description
src/sinker/mixed/yuva_4_4_4.rs Hardcodes BE=false when calling updated alpha-extract helpers (sinker BE plumbing deferred).
src/sinker/mixed/yuva_4_2_2.rs Same as above for 4:2:2 sinker path.
src/sinker/mixed/yuva_4_2_0.rs Same as above for 4:2:0 sinker path.
src/sinker/mixed/tests/planar_gbr_high_bit.rs Expands Strategy A+ tests (adds u16 path parity + width-31 tail coverage).
src/sinker/mixed/planar_gbr_high_bit.rs Updates sinker calls into planar GBR/GBRA kernels to pass BE=false.
src/row/scalar/alpha_extract.rs Adds endian-aware scalar alpha extraction via u16::from_{le,be} and updates tests.
src/row/mod.rs Updates overflow tests to new <BITS, BE> signatures.
src/row/dispatch/planar_gbr_high_bit.rs Adds BE const-generic to public dispatchers and forwards into backend/scalar implementations.
src/row/dispatch/alpha_extract.rs Adds BE const-generic and routes SIMD only when safe (LE host + LE data).
src/row/arch/x86_sse41/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware SIMD u16 loads.
src/row/arch/x86_sse41/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/x86_avx512/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds extensive BE parity tests.
src/row/arch/x86_avx512/planar_gbr_high_bit.rs Adds BE const-generic and uses 512/128-bit endian-aware SIMD loads.
src/row/arch/x86_avx512/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/x86_avx2/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds extensive BE parity tests.
src/row/arch/x86_avx2/planar_gbr_high_bit.rs Adds BE const-generic and uses 256/128-bit endian-aware SIMD loads.
src/row/arch/x86_avx2/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/wasm_simd128/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds BE parity tests.
src/row/arch/wasm_simd128/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware SIMD loads.
src/row/arch/wasm_simd128/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/neon/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds BE parity tests.
src/row/arch/neon/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware NEON loads.
src/row/arch/neon/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.


Comment thread src/row/scalar/alpha_extract.rs Outdated
uqio and others added 2 commits May 8, 2026 11:29
… "little")

The miri-sb-s390x CI job (BE host) on PR #82 was failing 36 scalar
tests in src/row/scalar/{alpha_extract,planar_gbr_high_bit}.rs. The
tests use host-native u16 literals (e.g. `vec![0x3FFu16, 0x1FF]`,
`[100u16; 1]`) as if they were on-disk LE encodings, then call kernels
with `<BITS, BE = false>` (LE path).

On a BE host (s390x), host-native u16 storage does NOT lay bytes out
little-endian, so the kernel's `u16::from_le` byte-swap correctly
reinterprets the host-native value and produces a different logical
value than the literal — making the assertion fail. The kernel itself
is correct; this is purely a test fixture-vs-kernel byte-order
mismatch on BE hosts.

The kernel's BE-host scalar correctness is locked down by the
dedicated `*_be_parity_*` / `*_be_parity_with_swapped_buffer` tests in
the same files. Those tests build BE-encoded fixtures via
`byte_swap_vec` / `swap_bytes` from LE inputs and assert that
`<BITS, true>` on the swapped buffer matches `<BITS, false>` on the
original buffer — byte-for-byte identical output on every host. They
are intentionally NOT gated.

Tests with byte-symmetric literals only (`0u16`, `u16::MAX`) are also
NOT gated — `from_le` is a no-op on those bit patterns regardless of
host endianness, so the assertions pass on BE without modification.
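The byte-symmetry claim is easy to confirm: for bit patterns whose two bytes are equal, every endian conversion is the identity on any host:

```rust
// 0x0000 and 0xFFFF are byte-symmetric: swapping their bytes yields the
// same value, so from_le/from_be are no-ops for them on every host.
fn main() {
    for x in [0u16, u16::MAX] {
        assert_eq!(x.swap_bytes(), x);
        assert_eq!(u16::from_le(x), x);
        assert_eq!(u16::from_be(x), x);
    }
    println!("ok");
}
```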

Gated tests:
  - alpha_extract.rs: 4 tests
    (copy_alpha_plane_u16{,_to_u8}_*)
  - planar_gbr_high_bit.rs: 32 tests
    (rgb_high_bit_*, rgb_u16_high_bit_*, rgba_opaque_*,
     gbra_rgba_*, gbr_to_rgb*_masks_*, gbra_to_rgba_*_masks_*,
     luma_u16_high_bit_*)

Test counts (aarch64-apple-darwin lib):
  Before: 2188 passed
  After:  2188 passed (no change on LE host — gates are no-ops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot review on PR #82 caught: doc comment said `BITS` is "9, 10, 12,
or 14" but the runtime `assert!(BITS >= 8 && BITS <= 16)` allows the
full [8, 16] range, and real call sites pass `BITS = 16` (Yuva420p16le,
Gbrap16, etc.).  Updated to reflect actual behavior + enumerate the
formats that consume this helper (Yuva*p9/10/12/14/16 + Gbrap10/12/14/16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio merged commit a21f775 into main May 7, 2026
43 checks passed
@uqio uqio deleted the feat/be-tier10b branch May 7, 2026 23:37