feat(be-tier5): BE support for V410 / XV36 / AYUV64 row kernels#86

Open
uqio wants to merge 2 commits into main from feat/be-tier5

@uqio uqio commented May 7, 2026

Summary

Phase 2 — Tier 5 BE rollout. Stacked on #81. Adds <const BE: bool> to all V410, XV36, AYUV64 row kernels across all 6 backends.

Note on dispatcher pattern (variant from other tiers):
Dispatcher functions take a runtime be_input: bool parameter and branch internally to <true> vs <false> const-generic kernels. This avoids forcing the (downstream) Sinker / Frame layer to also be <const BE: bool>-generic, while preserving full monomorphization at the kernel level. Kernels themselves still use <const BE: bool> for dead-code elimination.

The runtime branch is one predictable conditional per row, picked once before the per-pixel inner loop.
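A minimal sketch of the dispatcher pattern described above, assuming a simplified 16-bit row kernel. The names `load_row_u16` and `load_row_u16_dispatch` are hypothetical and do not come from this PR; only the `<const BE: bool>` / `be_input: bool` split mirrors the description.

```rust
/// Hypothetical scalar kernel: decodes 16-bit samples from a byte row.
/// `BE` is a compile-time constant, so each instantiation keeps only one
/// of the two decode branches after monomorphization.
fn load_row_u16<const BE: bool>(src: &[u8], dst: &mut [u16]) {
    for (chunk, out) in src.chunks_exact(2).zip(dst.iter_mut()) {
        let bytes = [chunk[0], chunk[1]];
        *out = if BE {
            u16::from_be_bytes(bytes)
        } else {
            u16::from_le_bytes(bytes)
        };
    }
}

/// Hypothetical dispatcher: takes the byte order as a runtime flag and
/// branches once per row, outside the per-pixel loop.
fn load_row_u16_dispatch(src: &[u8], dst: &mut [u16], be_input: bool) {
    if be_input {
        load_row_u16::<true>(src, dst)
    } else {
        load_row_u16::<false>(src, dst)
    }
}
```

Callers that do not yet track byte order can pass `false` to the dispatcher and stay non-generic, which is the trade-off the note above describes.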

Implementation:

  • Scalar: per-element swap_bytes() for u16/u32 reads (V410 / XV36 use 32-bit packed words; AYUV64 uses 16-bit channels)
  • SIMD: all 5 backends are parameterized with <const BE: bool> and use load_endian_u32xN::<BE> / load_endian_u16xN::<BE> from the #81 infra (feat(be-infra): endian-aware SIMD loaders across 5 backends)
  • BE parity tests added at the dispatcher level: v410_be_and_le_dispatchers_agree, xv36_be_and_le_dispatchers_agree, ayuv64_be_and_le_dispatchers_agree
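The shape of a dispatcher-level parity check might look like the sketch below. Everything here is illustrative: `unpack_row_impl`, `unpack_row`, and `be_and_le_agree` are hypothetical names, and the 10-bit field positions are an assumption for the example rather than the layout used by the PR's kernels. The fixtures are built with `to_le_bytes` / `to_be_bytes` so the check does not depend on host endianness.

```rust
/// Hypothetical row kernel: unpacks 32-bit packed words into three 10-bit
/// fields. The bit positions are an illustrative assumption.
fn unpack_row_impl<const BE: bool>(src: &[u8], dst: &mut Vec<[u16; 3]>) {
    for chunk in src.chunks_exact(4) {
        let bytes = [chunk[0], chunk[1], chunk[2], chunk[3]];
        let w = if BE {
            u32::from_be_bytes(bytes)
        } else {
            u32::from_le_bytes(bytes)
        };
        dst.push([
            ((w >> 2) & 0x3FF) as u16,
            ((w >> 12) & 0x3FF) as u16,
            ((w >> 22) & 0x3FF) as u16,
        ]);
    }
}

/// Hypothetical dispatcher with a runtime byte-order flag.
fn unpack_row(src: &[u8], dst: &mut Vec<[u16; 3]>, be_input: bool) {
    if be_input {
        unpack_row_impl::<true>(src, dst)
    } else {
        unpack_row_impl::<false>(src, dst)
    }
}

/// Encodes the same logical words in both byte orders and checks that the
/// two dispatcher paths decode them to identical samples.
fn be_and_le_agree(words: &[u32]) -> bool {
    let le_bytes: Vec<u8> = words.iter().flat_map(|w| w.to_le_bytes()).collect();
    let be_bytes: Vec<u8> = words.iter().flat_map(|w| w.to_be_bytes()).collect();
    let (mut from_le, mut from_be) = (Vec::new(), Vec::new());
    unpack_row(&le_bytes, &mut from_le, false);
    unpack_row(&be_bytes, &mut from_be, true);
    from_le == from_be
}
```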

Test results: 2165 tests pass.

Stacking

Base: feat/be-infra (#81). Will rebase onto main once #81 merges.

Test plan

  • cargo test --features std
  • cargo build --target x86_64-apple-darwin --tests
  • s390x QEMU (Phase 3)

🤖 Generated with Claude Code

Base automatically changed from feat/be-infra to main May 7, 2026 12:37
@uqio uqio force-pushed the feat/be-tier5 branch from 98619a0 to 5a3b632 Compare May 7, 2026 12:46
uqio and others added 2 commits May 8, 2026 00:50
Add `<const BE: bool>` to all row kernels (scalar + 5 SIMD backends)
for the three Tier-5 packed YUV 4:4:4 formats. Dispatch functions gain
a `be_input: bool` parameter that branches to the correct monomorphization
at runtime; sinker callers forward `false` (LE default) until the sinker
layer grows its own BE flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Codex flagged a high-severity correctness bug in the tier 5 scalar BE
paths. The pattern `if BE { x.swap_bytes() } else { x }` ignores host
endianness: it always byte-swaps BE input, even when the host is also BE.
On a BE host (e.g. s390x), the input is already in host order, so the
swap inverts it, producing garbled samples. The matching SIMD
`load_endian_*` helpers are target-endian aware (cfg-gated reverses),
so the scalar/SIMD paths disagree on BE hosts.

Fix: replace `x.swap_bytes()` with `T::from_be(x)` / `T::from_le(x)`,
which are target-endian aware (no-op when source byte order already
matches the host). All six tier 5 scalar call sites are updated:

- src/row/scalar/v410.rs (4 sites): u32 packed-pixel and luma kernels
- src/row/scalar/xv36.rs  (1 site): `load_xv36_u16` helper
- src/row/scalar/ayuv64.rs (1 site): `load_ayuv64_u16` helper
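To make the difference concrete, here is a minimal sketch of the two patterns side by side. The function names are hypothetical; `swap_bytes`, `from_be`, and `from_le` are the standard integer methods named in the fix. `x` stands for a value read from memory in host order.

```rust
/// Pre-fix pattern: swaps whenever the input is BE, regardless of host.
/// Correct on LE hosts only.
fn load_u16_unconditional<const BE: bool>(x: u16) -> u16 {
    if BE { x.swap_bytes() } else { x }
}

/// Post-fix pattern: `from_be` / `from_le` are no-ops when the source byte
/// order already matches the host, and swap otherwise.
fn load_u16_host_aware<const BE: bool>(x: u16) -> u16 {
    if BE { u16::from_be(x) } else { u16::from_le(x) }
}
```

On an LE host both functions return the same values. On a BE host only the host-aware version decodes the bytes `[0x12, 0x34]` as `0x1234` when `BE` is true.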

Test helpers in `mod tests` blocks (`p_le.map(|v| v.swap_bytes())`,
`le_word.swap_bytes()`) are intentionally left unchanged: they
synthesize BE-encoded fixtures from LE inputs on the LE host where
CI runs, so an unconditional swap is the correct behaviour there.

Verified:
- cargo test --target aarch64-apple-darwin --lib (2159 passed, 0 failed)
- cargo build --target x86_64-apple-darwin --tests (0 warnings)
- RUSTFLAGS='-C target-feature=+simd128' cargo build --target wasm32-unknown-unknown --tests
- cargo build --no-default-features
- cargo fmt --check
- cargo clippy --all-targets --all-features -- -D warnings

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio force-pushed the feat/be-tier5 branch from 5a3b632 to 05c3608 Compare May 7, 2026 13:03