
feat(be-tier11): BE support for Gray9-16 / Ya16 / Grayf32 row kernels#85

Open
uqio wants to merge 2 commits into main from feat/be-tier11



uqio commented May 7, 2026

Summary

Phase 2 — Tier 11 BE rollout. Stacked on #81. Adds <const BE: bool> to all Gray9-16, Ya16, and Grayf32 row kernels across all 6 backends + dispatcher.

Implementation:

  • Scalar: swap_bytes() per element (Gray9/10/12/14/16, Ya16); f32::from_bits(raw.to_bits().swap_bytes()) for Grayf32
  • NEON: u16 loads via load_endian_u16x8::<BE>; f32 loads via vreinterpretq_f32_u32(load_endian_u32x4::<BE>(...)). Ya16 NEON uses an if BE { return scalar; } guard: vld2q_u16 has no endian-aware variant, and adding one would require invasive deinterleave restructuring, so BE Ya16 falls through to scalar (rare path)
  • SSE4.1 / AVX2 / AVX-512: _mm*_castsi*_ps(load_endian_u32xN::<BE>(...)) for f32; load_endian_u16xN::<BE> for u16
  • wasm-simd128: load_endian_u16x8::<BE> / load_endian_u32x4::<BE>
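The Ya16 guard described in the NEON bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the function names `ya16_luma_scalar` and `ya16_luma_fast` are hypothetical, the "fast" path is a plain loop standing in for the `vld2q_u16` deinterleave, and the scalar conversion uses the `from_be` / `from_le` form adopted later in this PR.

```rust
/// Hypothetical scalar fallback: extracts luma from interleaved
/// [Y, A, Y, A, ...] samples whose bytes were loaded verbatim from the wire.
fn ya16_luma_scalar<const BE: bool>(src: &[u16], luma: &mut [u16]) {
    for (px, out) in src.chunks_exact(2).zip(luma.iter_mut()) {
        // Convert the wire-order sample to a host-native value.
        *out = if BE { u16::from_be(px[0]) } else { u16::from_le(px[0]) };
    }
}

/// Hypothetical stand-in for a vectorized backend. A real NEON kernel
/// would deinterleave with `vld2q_u16`; a strided loop plays that role here.
fn ya16_luma_fast<const BE: bool>(src: &[u16], luma: &mut [u16]) {
    if BE {
        // No endian-aware deinterleaving load is assumed to exist,
        // so big-endian input takes the scalar path.
        return ya16_luma_scalar::<BE>(src, luma);
    }
    // Little-endian-only path: take every other sample (the Y channel).
    for (out, y) in luma.iter_mut().zip(src.iter().step_by(2)) {
        *out = u16::from_le(*y);
    }
}
```

Because `BE` is a const generic, the guard is resolved at compile time: the `BE = false` instantiation carries no branch, and the `BE = true` instantiation reduces to the scalar call.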

Test results: 2176 tests pass. cargo build (all-features, no-default-features, x86_64-apple-darwin), clippy, fmt all clean.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --target aarch64-apple-darwin
  • cargo build --target x86_64-apple-darwin --tests
  • cargo build --no-default-features (alloc-only)
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

uqio force-pushed the feat/be-tier11 branch from 0f18fe2 to 544656a on May 7, 2026 12:27
Base automatically changed from feat/be-infra to main May 7, 2026 12:37
uqio and others added 2 commits May 8, 2026 00:51
Add `<const BE: bool>` to all scalar, NEON, SSE4.1, AVX2, AVX512, and
wasm-simd128 kernels for Gray9/10/12/14/16, Ya16, and Grayf32 formats.
Dispatchers and sinker callers thread BE through; sinker hardcodes
`false` pending future Frame-level BE plumbing. BE parity tests added
to every SIMD backend (luma path) plus scalar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The scalar BE branches in Gray9-16, Ya16, and Grayf32 row kernels used
unconditional `swap_bytes()` regardless of host endianness. On a BE host
(s390x), the LE branch passed input through as-is and the BE branch
swapped — both inverted from what the SIMD `load_endian_*::<BE>` helpers
do, so every Tier 11 BE row was corrupted on BE hosts.

Fix: replace `if BE { x.swap_bytes() } else { x }` with target-endian
aware conversions:
- u16 (gray9-16, ya16): `if BE { u16::from_be(x) } else { u16::from_le(x) }`
- f32 (grayf32): `if BE { f32::from_bits(u32::from_be(x.to_bits())) }
  else { f32::from_bits(u32::from_le(x.to_bits())) }`

`u16::from_le` and `u32::from_le` compile to no-ops on LE hosts, so the
LE-host fast path keeps the same machine code. On BE hosts both halves
of the branch now correctly produce a host-native value, matching the
SIMD helpers.
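
As a minimal sketch of the conversions named above, the two helpers below wrap the `from_be` / `from_le` pattern for u16 and f32 samples. The names `decode_u16` and `decode_f32` are hypothetical and only illustrate the shape of the fix; the actual call sites in gray.rs, ya16.rs, and grayf32.rs may inline the expression instead.

```rust
/// Hypothetical helper: `raw` holds two bytes loaded verbatim from a buffer
/// that is big-endian when `BE` is true and little-endian otherwise.
/// Returns the host-native value on both LE and BE hosts.
#[inline(always)]
fn decode_u16<const BE: bool>(raw: u16) -> u16 {
    if BE { u16::from_be(raw) } else { u16::from_le(raw) }
}

/// Hypothetical helper: same idea for f32, routed through the bit pattern
/// so the byte reordering happens on an integer, not on a float.
#[inline(always)]
fn decode_f32<const BE: bool>(raw: f32) -> f32 {
    let bits = raw.to_bits();
    let native = if BE { u32::from_be(bits) } else { u32::from_le(bits) };
    f32::from_bits(native)
}
```

Unlike an unconditional `swap_bytes()`, each `from_*` call swaps only when the wire order differs from the host order, which is what makes both branches correct on s390x as well as on LE hosts.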

Test helpers that synthesize BE buffers from LE input
(`v.swap_bytes()` / `v.to_bits().swap_bytes()` fixtures) are intentionally
left untouched — they encode the wire format and remain a one-direction
LE-host-side simulation.
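
A parity test built on such a fixture might look like the sketch below. All names (`gray16_row`, `synth_be_from_le`, `be_matches_le`) are hypothetical, and the row kernel is a simplified stand-in that only converts byte order; the real kernels do more work per sample.

```rust
/// Hypothetical scalar row kernel: converts wire-order samples to
/// host-native values, using the endian-aware form from the fix.
fn gray16_row<const BE: bool>(src: &[u16], dst: &mut [u16]) {
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = if BE { u16::from_be(*s) } else { u16::from_le(*s) };
    }
}

/// Fixture in the style described above: byte-swap LE-encoded samples
/// to obtain the same samples in big-endian wire order.
fn synth_be_from_le(le: &[u16]) -> Vec<u16> {
    le.iter().map(|v| v.swap_bytes()).collect()
}

/// Parity check: decoding the LE buffer with BE = false and the
/// synthesized BE buffer with BE = true should give identical rows.
fn be_matches_le(le_input: &[u16]) -> bool {
    let be_input = synth_be_from_le(le_input);
    let mut out_le = vec![0u16; le_input.len()];
    let mut out_be = vec![0u16; le_input.len()];
    gray16_row::<false>(le_input, &mut out_le);
    gray16_row::<true>(&be_input, &mut out_be);
    out_le == out_be
}
```

The fixture keeps `swap_bytes()` on purpose: it describes how the two wire encodings relate to each other, which does not depend on the host, whereas the kernel's conversion does.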

Call sites fixed:
- u16 (gray.rs + ya16.rs): 23 production sites
- f32 (grayf32.rs): 9 production sites

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
uqio force-pushed the feat/be-tier11 branch from 544656a to ebefb6f on May 7, 2026 13:01