Ship 8b-2c: Yuva420p family u16 RGBA SIMD across all 5 backends by al8n · Pull Request #37 · Findit-AI/colconv

al8n · 2026-04-27T23:38:56Z

Summary

- Adds native-depth u16 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 for the high-bit YUVA 4:2:0 family: Yuva420p9 / Yuva420p10 (BITS-generic) and Yuva420p16 (16-bit).
- Wires the 3 u16 RGBA dispatchers in `src/row/mod.rs` that landed as scalar-only stubs in PR #35 (Ship 8b-2a: Yuva420p family scalar prep). This completes the Yuva420p source-side family across u8 RGBA (PR #36, Ship 8b-2b) and u16 RGBA (this PR).
- 8-bit Yuva420p has no u16 RGBA path: its `u8` alpha source doesn't widen meaningfully into a u16 alpha output, and the public API doesn't expose it.

Changes

- 5 SIMD backends: each gains a third const generic, `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or `<ALPHA>` for 16-bit) u16 RGBA templates across 2 kernel families:
  - high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA, ALPHA_SRC>`
  - 16-bit: `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA, ALPHA_SRC>`

  When `ALPHA_SRC = true`:
  - High-bit (Yuva420p9/10): alpha is loaded, AND-masked with `bits_mask::<BITS>()` (same hardening as Y/U/V), and stored at native bit depth. No shift is needed since both source and output are at BITS.
  - 16-bit (Yuva420p16): alpha is loaded directly as full-range u16, with no mask and no shift.

  Existing no-alpha / opaque-alpha wrappers stay backward-compatible by passing `ALPHA_SRC = false, None`. AVX-512 16-bit's `write_rgba_u16_32` helper broadcasts a single 128-bit alpha lane, so the `ALPHA_SRC = true` branch inlines four `write_rgba_u16_8` calls with per-quarter alpha extraction instead.
- 3 u16 RGBA dispatchers wired in `src/row/mod.rs` (`yuva420p9_to_rgba_u16_row`, `yuva420p10_to_rgba_u16_row`, `yuva420p16_to_rgba_u16_row`): the prior `let _ = use_simd` stubs are replaced with the standard `cfg_select!` per-arch route block, mirroring the Yuva444p10 u16 dispatchers' patterns from PR #34 (Ship 8b-1c).
- Per-backend u16 RGBA equivalence tests: 25 new `#[test]` functions across the 5 backend test modules (5 each on NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test returns early when `is_x86_feature_detected!` reports the required feature as missing, to satisfy CI sanitizer / Miri / non-feature-flagged runners. Pseudo-random alpha flushes out lane-order corruption that solid alpha would mask.
- Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template: source alpha requires RGBA output.
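
To make the `ALPHA_SRC` contract concrete, here is a minimal scalar sketch of the alpha handling only. It is not the PR's kernel code: `pack_row_u16`, its signature, the local `bits_mask`, and the choice of the mask value as the opaque alpha are all assumptions for illustration, and the YUV to RGB math is omitted.

```rust
/// Hypothetical stand-in for the crate's `bits_mask`: the low `BITS` bits set.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Scalar model of the alpha handling only (no YUV -> RGB conversion).
/// `rgb` holds already-converted pixels; `out` receives RGB or RGBA elements.
fn pack_row_u16<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[[u16; 3]],
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Source alpha only makes sense when the output has an alpha channel.
    // Instantiating <_, false, true> fails at compile time.
    const { assert!(!ALPHA_SRC || ALPHA) };

    let stride = if ALPHA { 4 } else { 3 };
    let mask = bits_mask::<BITS>();
    for (i, px) in rgb.iter().enumerate() {
        let o = &mut out[i * stride..(i + 1) * stride];
        o[..3].copy_from_slice(px);
        if ALPHA {
            o[3] = if ALPHA_SRC {
                // Masked like Y/U/V, then stored at native depth (no shift).
                a_src.expect("ALPHA_SRC requires an alpha plane")[i] & mask
            } else {
                // Assumed opaque value: all BITS bits set.
                mask
            };
        }
    }
}
```

With `BITS = 16` the mask is `0xFFFF`, so the AND is a no-op, which matches the "no mask, no shift" description for the 16-bit family.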

Test plan

`cargo check --lib --tests` (aarch64) — clean
`cargo test --lib` (aarch64) — 629 passed (+5 new NEON u16 tests)
`RUSTFLAGS=-Dwarnings cargo clippy --lib --tests` (aarch64) — clean
`cargo check --target x86_64-unknown-freebsd --lib --tests` — clean
`RUSTFLAGS=-Dwarnings cargo clippy --target x86_64-unknown-freebsd --lib --tests` — clean
`cargo check --target wasm32-unknown-unknown --lib --tests` — clean
`RUSTFLAGS=-Dwarnings cargo clippy --target wasm32-unknown-unknown --lib --tests` — clean
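
For context on the equivalence tests counted above, the sketch below illustrates why varied alpha matters. Everything in it is hypothetical (the helper names, the LCG constants, and the deliberately buggy "lane-swapped" stand-in); it models the idea, not the PR's actual tests or kernels.

```rust
/// Deterministic pseudo-random u16 samples limited to `bits` (simple LCG).
fn pseudo_random_plane(len: usize, bits: u32, mut seed: u32) -> Vec<u16> {
    let mask = ((1u32 << bits) - 1) as u16;
    (0..len)
        .map(|_| {
            seed = seed.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            ((seed >> 16) as u16) & mask
        })
        .collect()
}

/// Reference: write alpha into slot 3 of each RGBA pixel, in order.
fn store_alpha_reference(alpha: &[u16], rgba: &mut [u16]) {
    for (px, &a) in rgba.chunks_exact_mut(4).zip(alpha) {
        px[3] = a;
    }
}

/// Deliberately buggy stand-in for a vector kernel: it reverses the alpha
/// samples within each 8-pixel block, mimicking a lane-order mistake.
fn store_alpha_lane_swapped(alpha: &[u16], rgba: &mut [u16]) {
    for (block, a_block) in rgba.chunks_mut(32).zip(alpha.chunks(8)) {
        for (px, &a) in block.chunks_exact_mut(4).zip(a_block.iter().rev()) {
            px[3] = a;
        }
    }
}

/// True when both kernels produce identical output for `alpha`.
fn kernels_agree(alpha: &[u16]) -> bool {
    let mut want = vec![0u16; alpha.len() * 4];
    let mut got = want.clone();
    store_alpha_reference(alpha, &mut want);
    store_alpha_lane_swapped(alpha, &mut got);
    want == got
}
```

A solid alpha plane makes `kernels_agree` return `true` despite the bug, because every lane holds the same value. Any plane whose samples differ within a block, such as a ramp or the output of `pseudo_random_plane`, exposes the reordering.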

Closes Yuva420p source-side family

With this PR merged, the Yuva420p source-side family (Yuva420p / Yuva420p9 / Yuva420p10 / Yuva420p16) has full SIMD coverage for both u8 RGBA (8b-2b) and native-depth u16 RGBA (8b-2c). Next Ship 8b shipping increment is TBD per the Ship 8b plan.

🤖 Generated with Claude Code


`src/row/mod.rs` had grown to 7276 lines, dominating the entire row crate-private surface. Split the public dispatchers into 7 sibling files under `src/row/dispatch/`, grouped by source-format family for readability:

- `dispatch/yuv420.rs` (~2700 lines): yuv_420 (8-bit) + yuv420p9/10/12/14/16 + p010/p012/p016; RGB + RGBA
- `dispatch/yuv444.rs` (~1330 lines): yuv_444 (8-bit) + yuv444p9/10/12/14/16 (BITS-generic helpers + per-bit-depth wrappers); RGB + RGBA
- `dispatch/nv.rs` (~630 lines): NV12 / NV21 / NV24 / NV42; RGB + RGBA
- `dispatch/pn.rs` (~800 lines): P410 / P412 / P416 (semi-planar 4:4:4); RGB + RGBA
- `dispatch/yuva.rs` (~845 lines): Yuva444p10 + the Yuva420p family (8-bit + 9 / 10 / 16-bit); RGBA + u16 RGBA
- `dispatch/rgb_ops.rs` (~170 lines): rgb_to_hsv_row, bgr_to_rgb_row, rgb_to_bgr_row
- `dispatch/bayer.rs` (~160 lines): Bayer dispatchers

`mod.rs` keeps:

- Module-level doc + `pub(crate) mod arch / scalar`
- `mod dispatch;` + `pub use dispatch::*::*` re-exports (the public API at `crate::row::*` is unchanged)
- Shared dispatcher helpers (`rgb_row_bytes`, `rgba_row_bytes`, `rgb_row_elems`, `rgba_row_elems`, `uv_full_row_elems`, `assert_color_transform_well_formed`, `MAX_FUSED_TRANSFORM_ABS`), bumped from `fn` (private) to `pub(crate)` so dispatch submodules can call them.
- Runtime CPU feature detection (`neon_available`, `avx2_available`, `sse41_available`, `avx512_available`, `simd128_available`), also bumped to `pub(crate)`.
- Inline tests (`mod overflow_tests`, `mod bayer_dispatcher_tests`).

`mod.rs` shrinks from 7276 lines to 770 lines. The dispatcher function bodies were extracted byte-for-byte via `sed -n`, with no semantic changes. The only edits were swapping `fn` → `pub(crate) fn` on shared helpers, adding per-file `use crate::row::*` imports for `scalar`, `arch`, helpers, and the CPU-detection helpers, plus the `pub use dispatch::*::*` re-exports in `mod.rs`.

Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and wasm32-unknown-unknown:

- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
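
The layout described in this commit can be sketched with inline modules. This is an assumption-laden illustration, not the crate's source: the dispatcher body is a placeholder, and since `pub use dispatch::*::*` reads as shorthand rather than literal Rust syntax, the sketch spells the re-export out as one glob per submodule.

```rust
mod row {
    // Shared helper, raised from private to pub(crate) so that the
    // dispatch submodules can call it.
    pub(crate) fn rgba_row_elems(width: usize) -> usize {
        width.checked_mul(4).expect("row length overflow")
    }

    mod dispatch {
        pub(super) mod yuva {
            // Relative path to the helper in `row`; a file in the real crate
            // would more likely write `use crate::row::rgba_row_elems;`.
            use super::super::rgba_row_elems;

            /// Placeholder dispatcher: only checks the output length.
            pub fn yuva420p10_to_rgba_u16_row(width: usize, out: &mut [u16]) {
                assert!(out.len() >= rgba_row_elems(width));
            }
        }
    }

    // Explicit form of the re-export: callers keep using `row::<fn>`.
    pub use dispatch::yuva::*;
}
```

Callers still reach the dispatcher as `row::yuva420p10_to_rgba_u16_row`, which is the property the commit message describes as the public API being unchanged.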

… directories

`src/row/dispatch/yuv420.rs` (2698 lines) and `yuv444.rs` (1333 lines) were the two largest files left after the previous split. Split each into a subdirectory with one file per source format:

```
src/row/dispatch/yuv420/
  mod.rs         (re-exports + module decls, 31 lines)
  yuv_420.rs     (8-bit YUV 4:2:0 RGB / RGBA, 222 lines)
  yuv420p9.rs    (4 variants, 360 lines)
  yuv420p10.rs   (4 variants, 367 lines)
  yuv420p12.rs   (4 variants, 343 lines)
  yuv420p14.rs   (4 variants, 332 lines)
  yuv420p16.rs   (4 variants, 291 lines)
  p010.rs        (P010 4:2:0 semi-planar, 312 lines)
  p012.rs        (P012, 296 lines)
  p016.rs        (P016, 279 lines)
src/row/dispatch/yuv444/
  mod.rs         (re-exports + pub(crate) BITS-generic helpers
                  `yuv_444p_n_to_rgb_row` / `yuv_444p_n_to_rgb_u16_row`
                  shared by 9/10/12/14 wrappers, 197 lines)
  yuv_444.rs     (8-bit YUV 4:4:4 RGB / RGBA, 159 lines)
  yuv444p9.rs    (thin RGB wrappers + full RGBA dispatchers, 209 lines)
  yuv444p10.rs   (193 lines)
  yuv444p12.rs   (192 lines)
  yuv444p14.rs   (192 lines)
  yuv444p16.rs   (full dispatchers; BITS-generic template pinned to
                  {9,10,12,14}, so 16-bit gets its own, 304 lines)
```

No semantic changes: function bodies were extracted byte-for-byte via `sed -n` from the prior single-file modules. The only edits were:

- Per-file `use` lines trimmed to what each file actually needs (e.g. 8-bit dispatchers don't import `rgb_row_elems` / `rgba_row_elems`; the BITS-generic helper file in yuv444 doesn't need `rgba_row_*`).
- `yuv444/yuv444p9.rs` through `yuv444p14.rs` add `use super::{yuv_444p_n_to_rgb_row, yuv_444p_n_to_rgb_u16_row};` so the thin wrappers reach the helpers in the parent `yuv444/mod.rs`.
- Parent `dispatch/mod.rs` is unchanged: the existing `pub(super) mod yuv420; pub(super) mod yuv444;` declarations resolve to the new `yuv420/mod.rs` / `yuv444/mod.rs` files.

The maximum file size in `src/row/dispatch/` is now 845 lines (`yuva.rs`); after dropping yuv420.rs/yuv444.rs the largest YUV files are 367 / 304 lines.

Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and wasm32-unknown-unknown:

- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
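
The "thin wrapper reaches a helper in the parent module" arrangement can be sketched as follows. The helper name is taken from the commit message, but its signature and body here are hypothetical placeholders (a gray fill instead of real YUV to RGB math), and expressing the {9,10,12,14} pinning as a const assert is an assumption about how it might be enforced.

```rust
mod yuv444 {
    /// Stand-in for a BITS-generic helper kept in `yuv444/mod.rs`.
    pub(crate) fn yuv_444p_n_to_rgb_u16_row<const BITS: u32>(y: &[u16], out: &mut [u16]) {
        // Assumed enforcement of the pinning to {9, 10, 12, 14}.
        const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };
        let mask = ((1u32 << BITS) - 1) as u16;
        // Placeholder body: replicate masked luma into R, G, and B.
        for (px, &v) in out.chunks_exact_mut(3).zip(y) {
            px.fill(v & mask);
        }
    }

    pub(super) mod yuv444p10 {
        // The per-format file reaches the helper in its parent module.
        use super::yuv_444p_n_to_rgb_u16_row;

        /// Thin per-bit-depth wrapper: fixes BITS and forwards.
        pub fn yuv444p10_to_rgb_u16_row(y: &[u16], out: &mut [u16]) {
            yuv_444p_n_to_rgb_u16_row::<10>(y, out);
        }
    }

    // Re-export so callers do not see the per-format submodule.
    pub use yuv444p10::*;
}
```

Under this sketch a 16-bit wrapper could not reuse the helper, since `BITS = 16` would fail the const assert; that is consistent with `yuv444p16.rs` carrying full dispatchers of its own.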

The per-format split landed `use crate::row::arch;` (folded into the `row::{arch, ...}` import group) in every dispatch sub-file. On targets without a per-arch SIMD backend (i686, powerpc64, riscv64, s390x, etc.) the `cfg_select!` body falls through to the scalar path, every `arch::*` reference is gated out, and building with `-D warnings` promotes the resulting `unused_imports` lint to a hard error. Failing CI jobs: `miri-tb-i686`, `miri-sb-powerpc64`, `cross (i686-linux-android)`.

Fix: lift `arch` out of the bundled `row::{...}` import block in each dispatch file and re-import it under `#[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]`. The three-target gate matches the set of targets that have a SIMD backend in `crate::row::arch::*`.

Tested via `RUSTFLAGS=-Dwarnings cargo check --target i686-unknown-linux-gnu --lib` (now clean), plus the host aarch64 / x86_64-freebsd / wasm32 suites still passing 629 tests.

Touches every dispatch file that imports `arch`. bayer.rs is intentionally untouched (the Bayer dispatchers are still scalar-only and never reference `arch::*`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
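
A minimal sketch of the gated import, assuming stand-in `arch` / `scalar` modules and a toy `double_row` kernel (none of which are from the crate). It uses plain `#[cfg]` attributes in place of `cfg_select!` so it builds on stable toolchains.

```rust
mod arch {
    // Stand-in for `crate::row::arch`: only populated on SIMD-capable targets.
    #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
    pub fn double_row(row: &mut [u16]) {
        for v in row.iter_mut() {
            *v = v.wrapping_mul(2);
        }
    }
}

mod scalar {
    pub fn double_row(row: &mut [u16]) {
        for v in row.iter_mut() {
            *v = v.wrapping_mul(2);
        }
    }
}

mod dispatch {
    use super::scalar;

    // Gated import: on other targets nothing below references `arch`, so an
    // ungated `use` would trip `unused_imports` under `-D warnings`.
    #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
    use super::arch;

    pub fn double_row(row: &mut [u16], use_simd: bool) {
        #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
        {
            if use_simd {
                arch::double_row(row);
                return;
            }
        }
        // Scalar fallback; the only path on targets without a backend.
        let _ = use_simd;
        scalar::double_row(row);
    }
}
```

Keeping the `cfg` predicate on the import identical to the one guarding the call site is what prevents the two from drifting apart on a new target.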

Copilot AI review requested due to automatic review settings April 27, 2026 23:38

uqio and others added 5 commits April 28, 2026 12:05

finish scalar impl for yuv420p

c1a0731

finish scalar impl for yuv420p

56be621

uqio merged commit 6724a83 into main Apr 28, 2026
43 checks passed

uqio deleted the feat/ship8b-2c-yuva420p-family-u16-simd branch April 29, 2026 09:34

uqio merged 6 commits into main from feat/ship8b-2c-yuva420p-family-u16-simd
