Ship 8b-2c: Yuva420p family u16 RGBA SIMD across all 5 backends #37
Merged
Adds native-depth u16 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 for the high-bit YUVA 4:2:0 family: Yuva420p9 / Yuva420p10 (BITS-generic) and Yuva420p16 (16-bit). Wires the 3 u16 RGBA dispatchers in src/row/mod.rs that landed as scalar-only stubs in PR #35 (Ship 8b-2a), completing the Yuva420p source-side family across u8 RGBA (8b-2b, PR #36) and u16 RGBA (this PR).

Note: 8-bit Yuva420p has no u16 RGBA path. Its u8 alpha source doesn't widen meaningfully into a u16 alpha output, and the public API doesn't expose it.

## Changes

- **5 SIMD backends**: each gains a third const-generic `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or `<ALPHA>` for 16-bit) u16 RGBA templates across 2 kernel families:
  - high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA, ALPHA_SRC>`
  - 16-bit: `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA, ALPHA_SRC>`

  When `ALPHA_SRC = true`:
  - **High-bit (Yuva420p9/10)**: alpha is loaded, AND-masked with `bits_mask::<BITS>()` (same hardening as Y/U/V), and stored at native bit depth. No shift is needed since both source and output are at BITS.
  - **16-bit (Yuva420p16)**: alpha is loaded directly as full-range u16, with no mask and no shift.

  Existing no-alpha / opaque-alpha wrappers stay backward-compatible by passing `ALPHA_SRC = false, None`. AVX-512 16-bit's `write_rgba_u16_32` helper broadcasts a single 128-bit alpha lane, so the `ALPHA_SRC = true` branch inlines four `write_rgba_u16_8` calls with per-quarter alpha extraction instead.
- **3 u16 RGBA dispatchers wired** in `src/row/mod.rs` (`yuva420p9_to_rgba_u16_row`, `yuva420p10_to_rgba_u16_row`, `yuva420p16_to_rgba_u16_row`): replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the Yuva444p10 u16 dispatchers' patterns from PR #34.
- **Per-backend u16 RGBA equivalence tests**: 25 new `#[test]` functions across the 5 backend test modules (5 each on NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test early-returns when `is_x86_feature_detected!` reports the feature missing, to satisfy CI sanitizer / Miri / non-feature-flagged runners. Pseudo-random alpha flushes out lane-order corruption that solid alpha would mask.
- Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template: source alpha requires RGBA output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
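The `ALPHA_SRC` handling described in this commit can be illustrated with a minimal scalar sketch. It is not the crate's SIMD code: `write_alpha_u16` is a hypothetical name, the body of `bits_mask` is an assumption about what the helper of that name computes, and the opaque-alpha value (all `BITS` bits set) is likewise assumed. The real 16-bit path is a separate template without the mask; here `BITS = 16` simply makes the mask a no-op.

```rust
/// Mask of the low `BITS` bits. Assumed body for the `bits_mask::<BITS>()`
/// helper named in the commit message; not copied from the crate.
fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical scalar stand-in for the alpha lane of a u16 RGBA kernel.
/// `out` holds interleaved R, G, B, A elements when `ALPHA` is true.
fn write_alpha_u16<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Source alpha only makes sense when the output has an alpha channel.
    const { assert!(!ALPHA_SRC || ALPHA) };
    if !ALPHA {
        return; // RGB output: nothing to write.
    }
    let mask = bits_mask::<BITS>();
    for (i, px) in out.chunks_exact_mut(4).enumerate() {
        px[3] = if ALPHA_SRC {
            // Load and mask; no shift because source and output share BITS.
            a_src.expect("ALPHA_SRC = true requires an alpha plane")[i] & mask
        } else {
            // Assumed opaque value at native depth.
            mask
        };
    }
}
```

Instantiating `write_alpha_u16::<10, false, true>` fails to compile, which is the point of the inline `const` assertion: the invalid combination is rejected before any kernel runs.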
`src/row/mod.rs` had grown to 7276 lines, dominating the entire row crate-private surface. Split the public dispatchers into 7 sibling files under `src/row/dispatch/`, grouped by source-format family for readability:

- `dispatch/yuv420.rs` (~2700 lines): yuv_420 (8-bit) + yuv420p9/10/12/14/16 + p010/p012/p016; RGB + RGBA
- `dispatch/yuv444.rs` (~1330 lines): yuv_444 (8-bit) + yuv444p9/10/12/14/16 (BITS-generic helpers + per-bit-depth wrappers); RGB + RGBA
- `dispatch/nv.rs` (~630 lines): NV12 / NV21 / NV24 / NV42; RGB + RGBA
- `dispatch/pn.rs` (~800 lines): P410 / P412 / P416 (semi-planar 4:4:4); RGB + RGBA
- `dispatch/yuva.rs` (~845 lines): Yuva444p10 + the Yuva420p family (8-bit + 9 / 10 / 16-bit); RGBA + u16 RGBA
- `dispatch/rgb_ops.rs` (~170 lines): rgb_to_hsv_row, bgr_to_rgb_row, rgb_to_bgr_row
- `dispatch/bayer.rs` (~160 lines): Bayer dispatchers

`mod.rs` keeps:

- Module-level doc + `pub(crate) mod arch / scalar`
- `mod dispatch;` + `pub use dispatch::*::*` re-exports (the public API at `crate::row::*` is unchanged)
- Shared dispatcher helpers (`rgb_row_bytes`, `rgba_row_bytes`, `rgb_row_elems`, `rgba_row_elems`, `uv_full_row_elems`, `assert_color_transform_well_formed`, `MAX_FUSED_TRANSFORM_ABS`), bumped from `fn` (private) to `pub(crate)` so dispatch submodules can call them.
- Runtime CPU feature detection (`neon_available`, `avx2_available`, `sse41_available`, `avx512_available`, `simd128_available`), also bumped to `pub(crate)`.
- Inline tests (`mod overflow_tests`, `mod bayer_dispatcher_tests`).

`mod.rs` shrinks from 7276 lines to 770 lines. The dispatcher function bodies were extracted byte-for-byte via `sed -n`, with no semantic changes. The only edits were swapping `fn` → `pub(crate) fn` on shared helpers, adding per-file `use crate::row::*` imports for `scalar`, `arch`, helpers, and the CPU-detection helpers, plus the `pub use dispatch::*::*` re-exports in `mod.rs`.

Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and wasm32-unknown-unknown:

- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
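A compact, single-file sketch of the layout this commit describes: a private `dispatch` module, a glob re-export that keeps the function reachable at `row::*`, and a helper raised to `pub(crate)` so the submodule can call it. Inline modules stand in for the separate files, and the function bodies are illustrative assumptions rather than the crate's code.

```rust
mod row {
    // Stands in for `src/row/dispatch/` (one inline module per file).
    mod dispatch {
        pub(super) mod rgb_ops {
            // Reach the shared helper kept in the parent `row` module.
            use super::super::rgb_row_bytes;

            /// Illustrative body: swap the B and R channels of each pixel.
            pub fn bgr_to_rgb_row(bgr: &[u8], rgb: &mut [u8], width: usize) {
                let n = rgb_row_bytes(width);
                assert!(bgr.len() >= n && rgb.len() >= n);
                let pixels = rgb[..n].chunks_exact_mut(3).zip(bgr[..n].chunks_exact(3));
                for (dst, src) in pixels {
                    dst[0] = src[2];
                    dst[1] = src[1];
                    dst[2] = src[0];
                }
            }
        }
    }

    // Re-export so callers keep using `row::bgr_to_rgb_row`.
    pub use dispatch::rgb_ops::*;

    // Was a private `fn`; `pub(crate)` lets dispatch submodules call it.
    pub(crate) fn rgb_row_bytes(width: usize) -> usize {
        width.checked_mul(3).expect("row byte count overflows usize")
    }
}
```

Callers outside `row` never name `dispatch`, so moving functions between dispatch files does not change the paths they import.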
… directories
`src/row/dispatch/yuv420.rs` (2698 lines) and `yuv444.rs` (1333 lines)
were the two largest files left after the previous split. Split each
into a subdirectory with one file per source format:
```
src/row/dispatch/yuv420/
mod.rs (re-exports + module decls, 31 lines)
yuv_420.rs (8-bit YUV 4:2:0 RGB / RGBA, 222 lines)
yuv420p9.rs (4 variants, 360 lines)
yuv420p10.rs (4 variants, 367 lines)
yuv420p12.rs (4 variants, 343 lines)
yuv420p14.rs (4 variants, 332 lines)
yuv420p16.rs (4 variants, 291 lines)
p010.rs (P010 4:2:0 semi-planar, 312 lines)
p012.rs (P012, 296 lines)
p016.rs (P016, 279 lines)
src/row/dispatch/yuv444/
mod.rs (re-exports + pub(crate) BITS-generic helpers
`yuv_444p_n_to_rgb_row` /
`yuv_444p_n_to_rgb_u16_row` shared by 9/10/12/14
wrappers, 197 lines)
yuv_444.rs (8-bit YUV 4:4:4 RGB / RGBA, 159 lines)
yuv444p9.rs (thin RGB wrappers + full RGBA dispatchers,
209 lines)
yuv444p10.rs (193 lines)
yuv444p12.rs (192 lines)
yuv444p14.rs (192 lines)
yuv444p16.rs (full dispatchers — BITS-generic template
pinned to {9,10,12,14}, so 16-bit gets its own,
304 lines)
```
No semantic changes — function bodies were extracted byte-for-byte
via `sed -n` from the prior single-file modules. The only edits
were:
- Per-file `use` lines trimmed to what each file actually needs
(e.g. 8-bit dispatchers don't import `rgb_row_elems` /
`rgba_row_elems`; the BITS-generic helper file in yuv444 doesn't
need `rgba_row_*`).
- `yuv444/yuv444p9.rs` through `yuv444p14.rs` add `use super::{yuv_444p_n_to_rgb_row,
yuv_444p_n_to_rgb_u16_row};` so the thin wrappers reach the
helpers in the sibling `yuv444/mod.rs`.
- Parent `dispatch/mod.rs` is unchanged — the existing
`pub(super) mod yuv420; pub(super) mod yuv444;` declarations
resolve to the new `yuv420/mod.rs` / `yuv444/mod.rs` files.
The maximum file size in `src/row/dispatch/` is now 845 lines
(`yuva.rs`); after dropping yuv420.rs/yuv444.rs the largest YUV
files are 367 / 304 lines.
Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and
wasm32-unknown-unknown:
- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
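The thin-wrapper arrangement in `yuv444/` can be sketched as follows. This assumes a much simpler helper than the real `yuv_444p_n_to_rgb_row`: `luma_n_to_u8_row` and `luma10_to_u8_row` are hypothetical names, and the body only narrows luma samples to 8 bits instead of doing a colour conversion. The structure is what matters: a BITS-generic helper in `mod.rs`, pinned to {9, 10, 12, 14}, and a per-depth file that imports it with `use super::...`.

```rust
mod yuv444 {
    /// Hypothetical BITS-generic helper living in `yuv444/mod.rs`.
    pub(crate) fn luma_n_to_u8_row<const BITS: u32>(y: &[u16], out: &mut [u8]) {
        // Pin the template to the depths it supports; 16-bit gets its own path.
        const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };
        let mask = ((1u32 << BITS) - 1) as u16;
        for (o, &v) in out.iter_mut().zip(y) {
            // Mask stray high bits, then drop the low (BITS - 8) bits.
            *o = ((v & mask) >> (BITS - 8)) as u8;
        }
    }

    // Stands in for `yuv444/yuv444p10.rs`.
    pub(super) mod yuv444p10 {
        use super::luma_n_to_u8_row;

        /// Thin wrapper: fixes BITS = 10 and forwards to the shared helper.
        pub fn luma10_to_u8_row(y: &[u16], out: &mut [u8]) {
            luma_n_to_u8_row::<10>(y, out);
        }
    }

    pub use yuv444p10::*;
}
```

Because the wrapper only pins the const parameter, each per-depth file stays short and the shared logic lives in one place.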
The per-format split landed `use crate::row::arch;` (folded into the
`row::{arch, ...}` import group) in every dispatch sub-file. On
targets without a per-arch SIMD backend — i686, powerpc64, riscv64,
s390x, etc. — the `cfg_select!` body falls through to the scalar
path, every `arch::*` reference is gated out, and clippy's
`-D warnings` flag promotes the resulting `unused_imports` to a hard
error. CI fails: `miri-tb-i686`, `miri-sb-powerpc64`,
`cross (i686-linux-android)`.
Fix: lift `arch` out of the bundled `row::{...}` import block in
each dispatch file and re-import it under
`#[cfg(any(target_arch = "aarch64", target_arch = "x86_64",
target_arch = "wasm32"))]`. This three-target gate matches the set of
architectures that have a SIMD backend in `crate::row::arch::*`. Tested via
`RUSTFLAGS=-Dwarnings cargo check --target i686-unknown-linux-gnu
--lib` (now clean) plus the host aarch64 / x86_64-freebsd / wasm32
suites still passing 629 tests.
Touches every dispatch file that imports `arch`: bayer.rs is
intentionally untouched (the Bayer dispatchers are still
scalar-only and never reference `arch::*`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
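A minimal sketch of the gated import, assuming a toy `double_row` kernel in place of the real dispatchers and a plain `#[cfg]` block in place of `cfg_select!`. In this sketch the whole `arch` module is gated, which is a simplification; the commit only states that the `arch::*` references are gated out on other targets.

```rust
mod row {
    pub mod scalar {
        /// Toy scalar kernel (illustrative only).
        pub fn double_row(src: &[u16], dst: &mut [u16]) {
            for (d, &s) in dst.iter_mut().zip(src) {
                *d = s.wrapping_mul(2);
            }
        }
    }

    // Only the three architectures with a SIMD backend get this module.
    #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
    pub mod arch {
        /// Stand-in for a SIMD kernel; same result as the scalar path.
        pub fn double_row(src: &[u16], dst: &mut [u16]) {
            for (d, &s) in dst.iter_mut().zip(src) {
                *d = s.wrapping_mul(2);
            }
        }
    }

    pub mod dispatch {
        // Gate the import with the same predicate as its uses, so targets
        // without a backend see neither the `use` nor an unused-import lint.
        #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
        use super::arch;
        use super::scalar;

        pub fn double_row(src: &[u16], dst: &mut [u16], use_simd: bool) {
            #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
            {
                if use_simd {
                    arch::double_row(src, dst);
                    return;
                }
            }
            // Keeps `use_simd` "used" on targets where the block above vanishes.
            let _ = use_simd;
            scalar::double_row(src, dst);
        }
    }
}
```

Keeping the predicate on the import identical to the predicate on the call sites is what prevents the `unused_imports` warning from reappearing when a new target is added to CI.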
Summary
Changes
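5 SIMD backends: each gains a third const-generic `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or `<ALPHA>` for 16-bit) u16 RGBA templates across 2 kernel families:
- high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA, ALPHA_SRC>`
- 16-bit: `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA, ALPHA_SRC>`
When `ALPHA_SRC = true`:
- High-bit (Yuva420p9/10): alpha is loaded, AND-masked with `bits_mask::<BITS>()` (same hardening as Y/U/V), and stored at native bit depth, with no shift since both source and output are at BITS.
- 16-bit (Yuva420p16): alpha is loaded directly as full-range u16, with no mask and no shift.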
Existing no-alpha / opaque-alpha wrappers stay backward-compat by passing `ALPHA_SRC = false, None`. AVX-512 16-bit's `write_rgba_u16_32` helper broadcasts a single 128-bit alpha lane, so the ALPHA_SRC = true branch inlines four `write_rgba_u16_8` calls with per-quarter alpha extraction instead.
3 u16 RGBA dispatchers wired in `src/row/mod.rs` (`yuva420p9_to_rgba_u16_row`, `yuva420p10_to_rgba_u16_row`, `yuva420p16_to_rgba_u16_row`): replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the Yuva444p10 u16 dispatchers' patterns from PR #34 (Ship 8b-1c: Yuva444p10 u16 RGBA SIMD across all 5 backends).
Per-backend u16 RGBA equivalence tests — 25 new `#[test]` functions across the 5 backend test modules (5 each on NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test early-returns on `is_x86_feature_detected!` to satisfy CI sanitizer / Miri / non-feature-flagged runners. Pseudo-random alpha flushes lane-order corruption that solid alpha would mask.
Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template — source alpha requires RGBA output.
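The shape of such an equivalence test can be sketched as below. Everything here is a hedged stand-in: `alpha_chunked` imitates a SIMD kernel with an 8-wide loop plus scalar tail, the LCG constants are a common textbook choice rather than the crate's generator, and a skipped run is reported as success. On x86 the sketch returns early when SSE4.1 is not detected, mirroring the early-return described above.

```rust
/// Pseudo-random alpha plane from a small LCG (assumed generator).
fn pseudo_random_alpha(seed: u32, len: usize, mask: u16) -> Vec<u16> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            ((state >> 16) as u16) & mask
        })
        .collect()
}

/// Reference path: one pixel at a time.
fn alpha_scalar(a: &[u16], rgba: &mut [u16], mask: u16) {
    for (px, &av) in rgba.chunks_exact_mut(4).zip(a) {
        px[3] = av & mask;
    }
}

/// Stand-in for a SIMD path: 8 pixels per step, then a scalar tail.
fn alpha_chunked(a: &[u16], rgba: &mut [u16], mask: u16) {
    let mut x = 0;
    while x + 8 <= a.len() {
        for lane in 0..8 {
            rgba[(x + lane) * 4 + 3] = a[x + lane] & mask;
        }
        x += 8;
    }
    for i in x..a.len() {
        rgba[i * 4 + 3] = a[i] & mask;
    }
}

/// Returns true when both paths agree, or when the run is skipped.
fn alpha_paths_agree(width: usize) -> bool {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if !std::arch::is_x86_feature_detected!("sse4.1") {
            return true; // Feature missing on this runner: skip.
        }
    }
    let mask: u16 = 0x03FF; // 10-bit samples
    let a = pseudo_random_alpha(0x1234_5678, width, mask);
    let mut expected = vec![0u16; width * 4];
    let mut actual = vec![0u16; width * 4];
    alpha_scalar(&a, &mut expected, mask);
    alpha_chunked(&a, &mut actual, mask);
    expected == actual
}
```

A width that is not a multiple of the chunk size (for example 37) exercises the tail, and varying alpha values per pixel mean a lane permutation bug would change the output, which a constant alpha plane would hide.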
Test plan
Closes Yuva420p source-side family
With this PR merged, the Yuva420p source-side family (Yuva420p / Yuva420p9 / Yuva420p10 / Yuva420p16) has full SIMD coverage for both u8 RGBA (8b-2b) and native-depth u16 RGBA (8b-2c). Next Ship 8b shipping increment is TBD per the Ship 8b plan.
🤖 Generated with Claude Code