
feat(be-tier10b): BE support for Gbrp/Gbrap 9-16-bit row kernels #82

Merged
uqio merged 7 commits into main from feat/be-tier10b
May 7, 2026

Conversation

@uqio (Collaborator)

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 10b BE rollout. Stacked on top of #81 (BE infra). Adds <const BE: bool> generic to all Gbrp/Gbrap high-bit (9/10/12/14/16-bit) row kernels across all 6 backends + dispatcher layer.

Implementation:

  • Scalar reads: if BE { x.swap_bytes() } else { x }
  • SIMD loads: load_endian_u16x8::<BE>(ptr) (and 16/32 variants) from crate::row::arch::{backend}::endian::*
  • Unused branch is const-folded at monomorphization — zero runtime cost
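A distilled sketch of the shape being added (the function name here is hypothetical, not a real crate symbol; the scalar read shown uses the `u16::from_be`/`u16::from_le` semantics this PR converges on in later commits). The unused branch disappears at monomorphization, so the `BE = false` instantiation carries no runtime cost:

```rust
// Hypothetical distilled row-kernel sample read. `gbr_high_bit_sample` is
// illustrative only; real kernels live in the crate's row modules. The
// `if BE` branch is const-folded away when the function is monomorphized.
fn gbr_high_bit_sample<const BITS: u32, const BE: bool>(raw: u16) -> u16 {
    // Swap only when the stored byte order differs from the host's.
    let x = if BE { u16::from_be(raw) } else { u16::from_le(raw) };
    // Mask off dirty bits above the declared bit depth.
    x & ((1u16 << BITS) - 1)
}

fn main() {
    // A 10-bit sample stored big-endian, read with BE = true:
    assert_eq!(gbr_high_bit_sample::<10, true>(0x0123u16.to_be()), 0x0123);
    // The same sample stored little-endian, read with BE = false:
    assert_eq!(gbr_high_bit_sample::<10, false>(0x0123u16.to_le()), 0x0123);
    println!("ok");
}
```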

Scope:

  • ✅ Row kernels + dispatchers
  • ✅ 72 new BE parity tests across all 6 test modules
  • ⏸ Sinker: minimal compile fix only (false hardcoded at dispatch call sites — no BE plumbing yet)
  • ⏸ Frame/Walker BE wiring deferred to Phase 4 follow-up

Test results: 2177 tests pass total.

Stacking

Base: feat/be-infra (#81). Once #81 merges, this PR will rebase cleanly onto main.

Test plan

  • cargo test --target aarch64-apple-darwin (NEON + scalar)
  • cargo build --target x86_64-apple-darwin --tests (SSE4.1 / AVX2 / AVX-512 cross-compile)
  • cargo build --target wasm32-unknown-unknown --tests (wasm-simd128 cross-compile)
  • s390x QEMU job (Phase 3, deferred)

🤖 Generated with Claude Code

@uqio uqio force-pushed the feat/be-tier10b branch from a80fbf1 to 3297d01 on May 7, 2026 12:31
Base automatically changed from feat/be-infra to main May 7, 2026 12:37
Add <const BE: bool> const-generic to all Gbrp/Gbrap high-bit row
kernels (scalar, NEON, SSE4.1, AVX2, AVX512, wasm-simd128) and their
dispatchers.  Sinker gets a minimal compile fix (hardcoded false).
Adds BE parity tests in all six test modules (72 new test cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio force-pushed the feat/be-tier10b branch from 3297d01 to 8c6b6dc on May 7, 2026 12:39
uqio and others added 4 commits May 8, 2026 00:48
Codex adversarial review of #82 caught a high-severity bug: the
scalar `if BE { x.swap_bytes() } else { x }` pattern unconditionally
swaps bytes whenever `BE = true`, regardless of target endianness.
The correct behavior is a no-op whenever the data byte order already
matches the host CPU's byte order.

That diverges from the SIMD `load_endian_u16x*::<BE>` helpers from
#81, which are target-endian-aware (a swap is needed only when the
data byte order differs from the host).  Mismatched semantics
between scalar and SIMD paths means scalar tails (and luma kernels,
which are scalar-only) would corrupt rows on a big-endian host
(s390x), in BOTH BE=true and BE=false cases.

The fix replaces every scalar load with the standard `u16::from_be`
/ `u16::from_le` pair, which expand exactly to the SIMD helper
semantics: each is a no-op when the data byte order matches the
host, and a `swap_bytes()` when they differ.

  if BE { u16::from_be(r[x]) } else { u16::from_le(r[x]) }

26 call sites across the planar_gbr_high_bit kernels (g, b, r, a)
updated. Test helper `byte_swap_vec` left as-is; it intentionally
synthesizes BE-encoded buffers from LE inputs for parity tests on
LE-host CI (a future follow-up should make the test helper
target-endian aware too, when Phase 3 s390x QEMU coverage lands).
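A minimal sketch of the parity check this enables (`byte_swap_vec` and `load` here are simplified stand-ins for the crate's test helper and kernel read, written so the assertion holds on any host):

```rust
// Stand-in for the crate's `byte_swap_vec` test helper: synthesize a
// BE-encoded buffer from an LE-encoded one by swapping every u16.
fn byte_swap_vec(src: &[u16]) -> Vec<u16> {
    src.iter().map(|x| x.swap_bytes()).collect()
}

// Stand-in for the fixed scalar read.
fn load<const BE: bool>(raw: u16) -> u16 {
    if BE { u16::from_be(raw) } else { u16::from_le(raw) }
}

fn main() {
    let le: Vec<u16> = vec![0x0123u16.to_le(), 0x03FFu16.to_le(), 0x0001u16.to_le()];
    let be = byte_swap_vec(&le);
    // BE = true over the swapped buffer must match BE = false over the original.
    for (l, b) in le.iter().zip(be.iter()) {
        assert_eq!(load::<false>(*l), load::<true>(*b));
    }
    println!("parity ok");
}
```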

Verified: 2177 tests pass; cargo fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex adversarial review of #82 caught a high-severity bug in the
Strategy A+ alpha-only fast-path helpers. `copy_alpha_plane_u16_to_u8`
and `copy_alpha_plane_u16` mask + shift raw u16 values without endian
awareness. On a big-endian host (s390x) processing LE-encoded Gbrap
input, the alpha plane is byte-reversed when both `with_rgb` and
`with_rgba` outputs are requested — the same class of bug we already
fixed for the direct `gbra_to_rgba_*` kernels in 26e5077.

The direct kernels are correct because they were threaded through
`<const BITS: u32, const BE: bool>` in PR #82 + 26e5077. The α-extract
helpers were left at `<const BITS: u32>` with raw `alpha[n]` reads and
would silently produce incorrect output on a BE host.

Fix: thread `<const BE: bool>` through both scalar α-extract helpers
and apply the same `u16::from_be` / `u16::from_le` pattern as the
direct-kernel scalar fix:

  let raw = if BE { u16::from_be(alpha[n]) } else { u16::from_le(alpha[n]) };
  rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8;  // u16-to-u8 variant

Each conversion compiles to a no-op when the data byte order matches
the host CPU and a byte-swap otherwise, mirroring the SIMD
`load_endian_u16x*::<BE>` semantics from #81. Scalar tails and SIMD
hot paths now stay byte-for-byte equivalent on every host for
BE = false (the case currently exercised).
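A distilled form of the fixed helper (the real function is `copy_alpha_plane_u16_to_u8`; the signature and loop here are simplified for illustration):

```rust
// Simplified sketch of the endian-aware scalar α-extract: read each u16
// alpha sample with the declared byte order, mask to BITS, then shift
// down to u8 range. Buffer layout (RGBA interleaved, α at offset 3) is
// an assumption of this sketch.
fn copy_alpha_u16_to_u8<const BITS: u32, const BE: bool>(alpha: &[u16], rgba_out: &mut [u8]) {
    let mask = (1u16 << BITS) - 1;
    let shift = BITS - 8; // depth conversion down to u8
    for (n, &a) in alpha.iter().enumerate() {
        let raw = if BE { u16::from_be(a) } else { u16::from_le(a) };
        rgba_out[n * 4 + 3] = ((raw & mask) >> shift) as u8;
    }
}

fn main() {
    let alpha = [0x03FFu16.to_le(), 0x0200u16.to_le()];
    let mut out = [0u8; 8];
    copy_alpha_u16_to_u8::<10, false>(&alpha, &mut out);
    assert_eq!(out[3], 0xFF); // 0x3FF >> 2
    assert_eq!(out[7], 0x80); // 0x200 >> 2
    println!("ok");
}
```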

The dispatcher (`row::dispatch::alpha_extract`) gained the matching
`<const BE: bool>` parameter. When `BE = true` it routes directly to
scalar — the SIMD α-extract backends use raw native-u16 loads
(`vld1q_u16` / `_mm_loadu_si128` / `v128_load64_zero`) and have no
byte-swap path, so feeding them BE-encoded input would re-introduce
the same corruption. Per the spec ("Don't touch SIMD α-extract paths
... codex didn't flag those"), the SIMD kernels keep their existing
LE-oriented loads. Phase 4 will plumb `<const BE: bool>` through SIMD
if/when a real BE-input sinker hot-path lands.

All sinker call sites pass `<BITS, false>` for now (LE-only sinkers
today; matches the `false` already passed to the sibling
`gbr_to_rgb_u16_high_bit_row::<BITS, false>` calls). Eight call sites
updated:

  - sinker/mixed/planar_gbr_high_bit.rs (Tier 10b, 2 sites)
  - sinker/mixed/yuva_4_4_4.rs           (Tier 9, 2 sites)
  - sinker/mixed/yuva_4_2_2.rs           (Tier 9, 2 sites)
  - sinker/mixed/yuva_4_2_0.rs           (Tier 9, 2 sites)

Each call site has an inline `// BE = false: ...` comment naming Phase 4
as the follow-up that will plumb a real `<const BE: bool>` from the
row type.

The 8/16-bit variants `copy_alpha_plane_u8`,
`copy_alpha_packed_u8x4_at_3`, `copy_alpha_packed_u16x4_to_u8_at_0`,
`copy_alpha_packed_u16x4_at_0`, and `copy_alpha_ya_*` are unchanged:
the 8-bit ones have no endianness; the AYUV64 / Rgba64 / Bgra64 / Ya16
variants take packed sources whose endianness is already a property of
the source's row-type wrapper rather than this helper.

The f32 helpers (`copy_alpha_plane_f32*`) are left untouched — they
belong to Tier 10 float (Gbrapf32 / Gbrpf16), out of scope for this
PR. They will be addressed in a separate PR when Phase 4 rolls up
through the float sinkers.

Tests added:
  - `copy_alpha_plane_u16_to_u8_be_parity_with_swapped_buffer` —
    builds a host-side `swap_bytes` of the LE fixture, calls the
    helper with `<10, false>` on the LE buffer and `<10, true>` on
    the BE-encoded buffer, asserts identical output. Locks down the
    BE-flag round-trip on every host.
  - `copy_alpha_plane_u16_be_parity_with_swapped_buffer` — same
    pattern for the u16-output variant.

Existing scalar tests retargeted to `<BITS, false>` (LE) to preserve
current behavior. SIMD parity tests in
`row/arch/{neon,x86_*,wasm_simd128}/alpha_extract.rs` retargeted the
scalar reference call to `<BITS, false>` — the SIMD helpers do
host-native loads, which matches scalar BE = false on LE hosts.

Verification:
  - cargo test --target aarch64-apple-darwin --lib → 2179 passed
  - cargo test --target x86_64-apple-darwin   --lib → 2873 passed
  - cargo build --target x86_64-apple-darwin --tests → 0 warnings
  - RUSTFLAGS=+simd128 cargo build --target wasm32-unknown-unknown
      --tests → only pre-existing unused-import warnings
  - cargo build --no-default-features → ok
  - cargo fmt --check → clean
  - cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…from data endian

Codex 3rd-pass review of PR #82 caught two issues that survived the prior
two rounds. This commit fixes both.

Finding 1 [high]: LE-Strategy-A+ alpha SIMD path corrupts on BE hosts.
The previous routing in `row::dispatch::alpha_extract::{copy_alpha_plane_u16_to_u8,
copy_alpha_plane_u16}` used `if !use_simd || BE { scalar } else { SIMD }`.
That correctly routed BE-encoded input to scalar but silently broke the
mirror case: an LE-encoded Gbrap source on a BE host (`BE = false`,
`target_endian = "big"`) would still take the SIMD path, which uses raw
host-native u16 loads (`vld1q_u16` / `_mm_loadu_si128` /
`v128_load64_zero`). Those reads byte-swap the LE bytes on a BE host —
silently corrupting the α plane.

The fix replaces the narrow `BE` check with a real "do data and host
disagree?" check:

  let need_swap = BE != cfg!(target_endian = "big");
  if need_swap || !use_simd {
    // scalar — `u16::from_le` / `u16::from_be` handles the swap.
  } else {
    // SIMD — host-native loads are correct because data byte order
    // already matches the host CPU.
  }

Truth table:
  - LE data, LE host: need_swap = false != false = false  → SIMD ok
    (host-native LE u16 reads match LE encoding).
  - LE data, BE host: need_swap = false != true  = true   → scalar
    (scalar uses `u16::from_le`, swaps on BE host as needed).
  - BE data, LE host: need_swap = true  != false = true   → scalar
    (scalar uses `u16::from_be`, swaps on LE host as needed).
  - BE data, BE host: need_swap = true  != true  = false  → SIMD ok
    (host-native BE u16 reads match BE encoding).
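The truth table can be checked mechanically. In this sketch `host_big` stands in for `cfg!(target_endian = "big")`, made a runtime argument so all four quadrants run on one machine:

```rust
// The "do data and host disagree?" predicate from the dispatcher fix,
// with the host endianness as an explicit parameter for testability.
fn need_swap(be_data: bool, host_big: bool) -> bool {
    be_data != host_big
}

fn main() {
    assert_eq!(need_swap(false, false), false); // LE data, LE host -> SIMD ok
    assert_eq!(need_swap(false, true),  true);  // LE data, BE host -> scalar
    assert_eq!(need_swap(true,  false), true);  // BE data, LE host -> scalar
    assert_eq!(need_swap(true,  true),  false); // BE data, BE host -> SIMD ok
    println!("truth table holds");
}
```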

Both u16 alpha-plane dispatchers (`copy_alpha_plane_u16_to_u8` and
`copy_alpha_plane_u16`) get the same fix. Doc comments updated with the
truth table for future readers. SIMD α-extract internals are untouched
— per the spec, they remain native-host-only by design; Phase 4 will
plumb `<const BE: bool>` through SIMD if a real BE-input sinker
hot-path lands. Sinker call sites are unchanged (they continue to pass
`BE = false`).

Finding 2 [medium]: Native-depth Strategy A+ alpha scatter had no test
coverage. The existing Strategy A+ integration tests for Gbrap10/12/14/16
in `src/sinker/mixed/tests/planar_gbr_high_bit.rs` only covered the u8
alpha-scatter path (`with_rgb` + `with_rgba`, which routes through
`copy_alpha_plane_u16_to_u8`). The native-depth combo path
`with_rgb_u16` + `with_rgba_u16` calls `copy_alpha_plane_u16` and was
unexercised — a regression there would not have been caught.

Fix: added `test_gbrap_strategy_a_plus_u16!` macro mirroring the
existing u8 macro, with one instance per bit depth (10, 12, 14, 16):
  - Build a Gbrap source with full-range u16 G/B/R/α plane values
    (using `pseudo_random_u16_low_n_bits` with `bits=16` so the upper
    bits beyond BITS are dirty — exercises the `(1 << BITS) - 1` mask
    in both the direct kernel and α-extract paths).
  - Run standalone: attach only `with_rgba_u16`, drives the direct
    4-channel `gbra_to_rgba_u16_high_bit_row` kernel.
  - Run combo: attach both `with_rgb_u16` AND `with_rgba_u16`, drives
    the Strategy A+ path (`gbr_to_rgb_u16_high_bit_row` →
    `expand_rgb_u16_to_rgba_u16_row` → `copy_alpha_plane_u16`).
  - Assert byte-exact equality between the two RGBA u16 buffers.

This mirrors the existing `test_gbrap_strategy_a_plus!` macro pattern
exactly.

Verification:
  - cargo test --target aarch64-apple-darwin --lib → 2183 passed (+4)
  - cargo test --target x86_64-apple-darwin   --lib → 2877 passed (+4)
  - cargo build --target x86_64-apple-darwin --tests → no new warnings
  - RUSTFLAGS=+simd128 cargo build --target wasm32-unknown-unknown
      --tests → only pre-existing unused-import warnings
  - cargo build --no-default-features → ok
  - cargo fmt --check → clean
  - cargo clippy --all-targets --all-features -- -D warnings → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iple-width tests

Codex 4th-pass review of PR #82 found a remaining ship-blocker: the prior
dispatcher routing in `src/row/dispatch/alpha_extract.rs`
(`need_swap = BE != cfg!(target_endian = "big")`) admitted SIMD on
BE-host/BE-data. The vector body's host-native u16 loads are correct in
that quadrant, but every alpha SIMD backend hardcodes its scalar tail to
`scalar::<BITS, false>` (NEON `src/row/arch/neon/alpha_extract.rs:249,295`,
SSE/AVX2/AVX-512/wasm mirror this). On a BE host with BE data and a
non-multiple width, the LE-only scalar tail then runs `u16::from_le` over
already-native samples, byte-swapping them before mask/shift — at BITS=10
sample `0x0123` becomes `0x2301 & 0x03ff = 0x0301`. Silent α corruption.
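The corruption arithmetic can be reproduced on any host by simulating the erroneous swap explicitly (on the actual BE host it is `u16::from_le` that performs it):

```rust
// Reproduce the tail-corruption numbers from the commit message: a
// BE-host-native sample run through an unwanted byte swap, then masked
// at BITS = 10. The explicit swap_bytes() stands in for what the
// hardcoded `from_le` tail would do on a BE host.
fn main() {
    let native_sample: u16 = 0x0123;                  // already correct on the BE host
    let wrongly_swapped = native_sample.swap_bytes(); // the unwanted from_le swap
    let mask: u16 = (1 << 10) - 1;                    // BITS = 10 -> 0x03FF
    assert_eq!(wrongly_swapped, 0x2301);
    assert_eq!(wrongly_swapped & mask, 0x0301);       // silent alpha corruption
    println!("ok");
}
```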

The existing Strategy A+ tests in
`src/sinker/mixed/tests/planar_gbr_high_bit.rs` use width 32, which is a
multiple of every backend's SIMD block, so the tail path was never
exercised.

Fix (option B from codex's recommendation — simpler than threading BE
through the SIMD helpers):

  let safe_for_simd = !BE && cfg!(target_endian = "little");
  if !safe_for_simd || !use_simd {
    // scalar — handles all (host_endian, BE) combinations correctly
  } else {
    // SIMD — only LE host + LE data
  }

Applied to both `copy_alpha_plane_u16_to_u8` and `copy_alpha_plane_u16`.

Truth table (`safe_for_simd = !BE && target_endian == "little"`):
- LE data, LE host: `!false && true  = true`  → SIMD (correct; tail
  `from_le` is a no-op)
- LE data, BE host: `!false && false = false` → scalar (correct;
  uses `from_le`)
- BE data, LE host: `!true  && true  = false` → scalar (correct;
  uses `from_be`)
- BE data, BE host: `!true  && false = false` → scalar (correct;
  uses `from_be`. SIMD vector body would be correct but the tail
  hardcodes BE=false and would corrupt non-multiple widths via
  `from_le` on already-native samples — until SIMD helpers are made
  const-generic over BE in a future Phase, scalar covers this rare
  quadrant correctly.)
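As with the earlier `need_swap` check, the predicate can be verified for all four quadrants; `host_little` stands in for `cfg!(target_endian = "little")` so every row runs on one machine:

```rust
// The final dispatcher predicate: SIMD is admitted only in the
// LE-data / LE-host quadrant; every other combination takes scalar.
fn safe_for_simd(be_data: bool, host_little: bool) -> bool {
    !be_data && host_little
}

fn main() {
    assert!( safe_for_simd(false, true));  // LE data, LE host -> SIMD
    assert!(!safe_for_simd(false, false)); // LE data, BE host -> scalar
    assert!(!safe_for_simd(true,  true));  // BE data, LE host -> scalar
    assert!(!safe_for_simd(true,  false)); // BE data, BE host -> scalar
    println!("ok");
}
```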

Trade-off: BE-host/BE-data callers pay the scalar cost. Acceptable —
this is a rare quadrant; eventual Phase 4 work can thread BE into the
SIMD helpers if a real BE-input hot path lands.

Doc comments in both dispatchers updated to reflect the new contract.

New tests at non-multiple width 31 in
`src/sinker/mixed/tests/planar_gbr_high_bit.rs` exercise the SIMD tail
path on supported (LE) hosts:
- `gbrap10/12/14/16_strategy_a_plus_u16_matches_standalone_w31` — covers
  `copy_alpha_plane_u16` (no depth conv, u16→u16 RGBA path).
- `gbrap10_strategy_a_plus_matches_standalone_w31` — covers
  `copy_alpha_plane_u16_to_u8` (depth-conv `>> (BITS - 8)` path).

The two existing Strategy A+ macros (`test_gbrap_strategy_a_plus`,
`test_gbrap_strategy_a_plus_u16`) gained a 5-arg form taking width;
existing 4-arg callers default to width 32 unchanged. Test count moves
from 2183 to 2188 (+5).

Out of scope for this commit: touching the SIMD α-extract helpers
themselves (option A would require threading BE through 5 backends —
deferred until a real BE-input hot path needs it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Phase 2 of the BE rollout for Tier 10b: adds a <const BE: bool> const-generic to all high-bit (9–16b) planar GBR/GBRA row kernels across scalar + all SIMD backends, wires the dispatcher layer to pass BE, and adds extensive BE parity coverage. Sinker plumbing is intentionally deferred (call sites hardcode BE=false for now).

Changes:

  • Add BE const-generic to planar GBR/GBRA high-bit row kernels in dispatch + scalar + SIMD backends, using endian-aware SIMD loaders where available.
  • Update alpha-extract scalar + dispatch logic to be endian-aware, and restrict SIMD alpha-extract to the LE-host/LE-data quadrant (scalar elsewhere).
  • Add/extend tests for BE parity and Strategy A+ path parity (including non-multiple widths to exercise SIMD tails).

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.

File Description
src/sinker/mixed/yuva_4_4_4.rs Hardcodes BE=false when calling updated alpha-extract helpers (sinker BE plumbing deferred).
src/sinker/mixed/yuva_4_2_2.rs Same as above for 4:2:2 sinker path.
src/sinker/mixed/yuva_4_2_0.rs Same as above for 4:2:0 sinker path.
src/sinker/mixed/tests/planar_gbr_high_bit.rs Expands Strategy A+ tests (adds u16 path parity + width-31 tail coverage).
src/sinker/mixed/planar_gbr_high_bit.rs Updates sinker calls into planar GBR/GBRA kernels to pass BE=false.
src/row/scalar/alpha_extract.rs Adds endian-aware scalar alpha extraction via u16::from_{le,be} and updates tests.
src/row/mod.rs Updates overflow tests to new <BITS, BE> signatures.
src/row/dispatch/planar_gbr_high_bit.rs Adds BE const-generic to public dispatchers and forwards into backend/scalar implementations.
src/row/dispatch/alpha_extract.rs Adds BE const-generic and routes SIMD only when safe (LE host + LE data).
src/row/arch/x86_sse41/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware SIMD u16 loads.
src/row/arch/x86_sse41/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/x86_avx512/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds extensive BE parity tests.
src/row/arch/x86_avx512/planar_gbr_high_bit.rs Adds BE const-generic and uses 512/128-bit endian-aware SIMD loads.
src/row/arch/x86_avx512/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/x86_avx2/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds extensive BE parity tests.
src/row/arch/x86_avx2/planar_gbr_high_bit.rs Adds BE const-generic and uses 256/128-bit endian-aware SIMD loads.
src/row/arch/x86_avx2/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/wasm_simd128/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds BE parity tests.
src/row/arch/wasm_simd128/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware SIMD loads.
src/row/arch/wasm_simd128/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.
src/row/arch/neon/tests/planar_gbr_high_bit.rs Updates existing tests for new signature + adds BE parity tests.
src/row/arch/neon/planar_gbr_high_bit.rs Adds BE const-generic and uses endian-aware NEON loads.
src/row/arch/neon/alpha_extract.rs Documents/locks scalar tail to BE=false for SIMD helper; updates scalar reference in tests.


Comment thread src/row/scalar/alpha_extract.rs Outdated
uqio and others added 2 commits May 8, 2026 11:29
… "little")

The miri-sb-s390x CI job (BE host) on PR #82 was failing 36 scalar
tests in src/row/scalar/{alpha_extract,planar_gbr_high_bit}.rs. The
tests use host-native u16 literals (e.g. `vec![0x3FFu16, 0x1FF]`,
`[100u16; 1]`) as if they were on-disk LE encodings, then call kernels
with `<BITS, BE = false>` (LE path).

On a BE host (s390x), host-native u16 storage does NOT lay bytes out
little-endian, so the kernel's `u16::from_le` byte-swap correctly
reinterprets the host-native value and produces a different logical
value than the literal — making the assertion fail. The kernel itself
is correct; this is purely a test fixture-vs-kernel byte-order
mismatch on BE hosts.

The kernel's BE-host scalar correctness is locked down by the
dedicated `*_be_parity_*` / `*_be_parity_with_swapped_buffer` tests in
the same files. Those tests build BE-encoded fixtures via
`byte_swap_vec` / `swap_bytes` from LE inputs and assert that
`<BITS, true>` on the swapped buffer matches `<BITS, false>` on the
original buffer — byte-for-byte identical output on every host. They
are intentionally NOT gated.

Tests with byte-symmetric literals only (`0u16`, `u16::MAX`) are also
NOT gated — `from_le` is a no-op on those bit patterns regardless of
host endianness, so the assertions pass on BE without modification.
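The byte-symmetry claim is easy to confirm: for bit patterns whose two bytes are equal, every endian conversion is the identity on any host:

```rust
// 0x0000 and 0xFFFF are byte-symmetric: swapping their bytes yields the
// same value, so from_le/from_be are no-ops for them on every host.
fn main() {
    for x in [0u16, u16::MAX] {
        assert_eq!(x.swap_bytes(), x);
        assert_eq!(u16::from_le(x), x);
        assert_eq!(u16::from_be(x), x);
    }
    println!("ok");
}
```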

Gated tests:
  - alpha_extract.rs: 4 tests
    (copy_alpha_plane_u16{,_to_u8}_*)
  - planar_gbr_high_bit.rs: 32 tests
    (rgb_high_bit_*, rgb_u16_high_bit_*, rgba_opaque_*,
     gbra_rgba_*, gbr_to_rgb*_masks_*, gbra_to_rgba_*_masks_*,
     luma_u16_high_bit_*)

Test counts (aarch64-apple-darwin lib):
  Before: 2188 passed
  After:  2188 passed (no change on LE host — gates are no-ops)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot review on PR #82 caught: doc comment said `BITS` is "9, 10, 12,
or 14" but the runtime `assert!(BITS >= 8 && BITS <= 16)` allows the
full [8, 16] range, and real call sites pass `BITS = 16` (Yuva420p16le,
Gbrap16, etc.).  Updated to reflect actual behavior + enumerate the
formats that consume this helper (Yuva*p9/10/12/14/16 + Gbrap10/12/14/16).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio merged commit a21f775 into main May 7, 2026
43 checks passed
@uqio uqio deleted the feat/be-tier10b branch May 7, 2026 23:37