Ship 8b-2c: Yuva420p family u16 RGBA SIMD across all 5 backends #37

Merged: uqio merged 6 commits into main from feat/ship8b-2c-yuva420p-family-u16-simd on Apr 28, 2026

@al8n commented Apr 27, 2026

Summary

Changes

  • 5 SIMD backends: each gains a third const generic, `ALPHA_SRC: bool`, on top of the existing `<BITS, ALPHA>` (or `<ALPHA>` for 16-bit) u16 RGBA templates across 2 kernel families:

    • high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA, ALPHA_SRC>`
    • 16-bit: `yuv_420p16_to_rgb_or_rgba_u16_row<ALPHA, ALPHA_SRC>`

    When `ALPHA_SRC = true`:

    • High-bit (Yuva420p9/10): alpha is loaded, AND-masked with `bits_mask::<BITS>()` (the same hardening applied to Y/U/V), and stored at native bit depth. No shift is needed because source and output are both at BITS.
    • 16-bit (Yuva420p16): alpha is loaded directly as full-range u16 — no mask, no shift.

    Existing no-alpha / opaque-alpha wrappers stay backward-compatible by passing `ALPHA_SRC = false` and `None` for the alpha source. AVX-512 16-bit's `write_rgba_u16_32` helper broadcasts a single 128-bit alpha lane, so the `ALPHA_SRC = true` branch inlines four `write_rgba_u16_8` calls with per-quarter alpha extraction instead.

  • 3 u16 RGBA dispatchers wired in `src/row/mod.rs` (`yuva420p9_to_rgba_u16_row`, `yuva420p10_to_rgba_u16_row`, `yuva420p16_to_rgba_u16_row`): the prior `let _ = use_simd` stubs are replaced with the standard `cfg_select!` per-arch route block, mirroring the Yuva444p10 u16 dispatchers from PR #34 (Ship 8b-1c: Yuva444p10 u16 RGBA SIMD across all 5 backends).

  • Per-backend u16 RGBA equivalence tests: 25 new `#[test]` functions across the 5 backend test modules (5 each on NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test early-returns when `is_x86_feature_detected!` reports the required feature as missing, which keeps CI sanitizer / Miri / non-feature-flagged runners passing. Pseudo-random alpha flushes out lane-order corruption that solid alpha would mask.

  • Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template — source alpha requires RGBA output.
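
The alpha rules above can be illustrated with a scalar sketch. This is not the PR's SIMD code: `write_row_u16` is a hypothetical stand-in, the body of `bits_mask` is an assumption about what the real helper computes, and the opaque-alpha value of `(1 << BITS) - 1` is likewise assumed rather than taken from the source.

```rust
/// Assumed shape of `bits_mask::<BITS>()`: the low `BITS` bits set.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical scalar stand-in for the alpha stage of a u16 row writer.
/// `rgb` holds packed R, G, B samples; `out` receives 3 or 4 elements per pixel.
fn write_row_u16<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: &[u16],
    a_src: Option<&[u16]>,
    out: &mut [u16],
) {
    // Source alpha requires RGBA output; a bad combination fails to compile.
    const { assert!(!ALPHA_SRC || ALPHA) };

    let stride = if ALPHA { 4 } else { 3 };
    let pixels = rgb.chunks_exact(3).zip(out.chunks_exact_mut(stride));
    for (i, (px, dst)) in pixels.enumerate() {
        dst[..3].copy_from_slice(px);
        if ALPHA {
            dst[3] = if ALPHA_SRC {
                // Load, then AND-mask to BITS; no shift, since source and
                // output share the same bit depth.
                let a = a_src.expect("ALPHA_SRC = true requires an alpha plane")[i];
                a & bits_mask::<BITS>()
            } else {
                // Assumed opaque value: the maximum at the native depth.
                bits_mask::<BITS>()
            };
        }
    }
}
```

Calling `write_row_u16::<10, true, true>(&rgb, Some(&alpha), &mut out)` exercises the source-alpha path, while `write_row_u16::<10, true, false>(&rgb, None, &mut out)` matches how the existing opaque-alpha wrappers are described. Instantiating `<10, false, true>` is rejected at compile time by the `const` assertion.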

Test plan

  • `cargo check --lib --tests` (aarch64) — clean
  • `cargo test --lib` (aarch64) — 629 passed (+5 new NEON u16 tests)
  • `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests` (aarch64) — clean
  • `cargo check --target x86_64-unknown-freebsd --lib --tests` — clean
  • `RUSTFLAGS=-Dwarnings cargo clippy --target x86_64-unknown-freebsd --lib --tests` — clean
  • `cargo check --target wasm32-unknown-unknown --lib --tests` — clean
  • `RUSTFLAGS=-Dwarnings cargo clippy --target wasm32-unknown-unknown --lib --tests` — clean
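
The reasoning behind the equivalence tests listed under Changes (early return when the CPU feature is missing, and non-constant alpha) can be sketched as below. Every name here is hypothetical, the "kernel" is a plain scalar function rather than real AVX2 code, and the alpha pattern is a simple multiplicative hash standing in for whatever generator the actual tests use.

```rust
/// Deterministic, non-constant alpha confined to the low `BITS` bits
/// (multiplicative hash of the index; a stand-in for pseudo-random data).
fn varied_alpha<const BITS: u32>(len: usize) -> Vec<u16> {
    (0..len as u32)
        .map(|i| (i.wrapping_mul(2_654_435_761) >> (32 - BITS)) as u16)
        .collect()
}

/// Correct behaviour: alpha passes through in order.
fn alpha_passthrough(a: &[u16]) -> Vec<u16> {
    a.to_vec()
}

/// Deliberately broken kernel: swaps each adjacent pair of alpha lanes.
fn alpha_lane_swapped(a: &[u16]) -> Vec<u16> {
    let mut out = a.to_vec();
    for pair in out.chunks_exact_mut(2) {
        pair.swap(0, 1);
    }
    out
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Shape of one test body: `None` means "skipped because the feature is
/// missing"; `Some(ok)` reports whether the kernel matched the reference.
fn equivalence_case(kernel: fn(&[u16]) -> Vec<u16>) -> Option<bool> {
    if !avx2_available() {
        return None; // a real #[test] would simply `return` here
    }
    let alpha = varied_alpha::<10>(16);
    Some(kernel(&alpha) == alpha_passthrough(&alpha))
}
```

With solid alpha the lane swap is invisible, since every lane holds the same value; with varied alpha the first pair already differs, so the comparison fails as intended.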

Closes Yuva420p source-side family

With this PR merged, the Yuva420p source-side family (Yuva420p / Yuva420p9 / Yuva420p10 / Yuva420p16) has full SIMD coverage for both u8 RGBA (8b-2b) and native-depth u16 RGBA (8b-2c). The next Ship 8b increment is TBD per the Ship 8b plan.

🤖 Generated with Claude Code

Adds native-depth u16 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512
/ wasm simd128 for the high-bit YUVA 4:2:0 family — Yuva420p9 /
Yuva420p10 (BITS-generic) and Yuva420p16 (16-bit). Wires the 3 u16
RGBA dispatchers in src/row/mod.rs that landed as scalar-only stubs
in PR #35 (Ship 8b-2a), completing the Yuva420p source-side family
across u8 RGBA (8b-2b, PR #36) and u16 RGBA (this PR).

Note: 8-bit Yuva420p has no u16 RGBA path — its u8 alpha source
doesn't widen meaningfully into a u16 alpha output, and the public
API doesn't expose it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 23:38
uqio and others added 5 commits April 28, 2026 12:05
`src/row/mod.rs` had grown to 7276 lines and dominated the row
module's crate-private surface. Split the public dispatchers into 7
sibling files under `src/row/dispatch/`, grouped by source-format
family for readability:

- `dispatch/yuv420.rs` (~2700 lines): yuv_420 (8-bit) +
  yuv420p9/10/12/14/16 + p010/p012/p016 — RGB + RGBA
- `dispatch/yuv444.rs` (~1330 lines): yuv_444 (8-bit) +
  yuv444p9/10/12/14/16 (BITS-generic helpers + per-bit-depth
  wrappers) — RGB + RGBA
- `dispatch/nv.rs` (~630 lines): NV12 / NV21 / NV24 / NV42 —
  RGB + RGBA
- `dispatch/pn.rs` (~800 lines): P410 / P412 / P416 (semi-planar
  4:4:4) — RGB + RGBA
- `dispatch/yuva.rs` (~845 lines): Yuva444p10 + the Yuva420p
  family (8-bit + 9 / 10 / 16-bit) — RGBA + u16 RGBA
- `dispatch/rgb_ops.rs` (~170 lines): rgb_to_hsv_row,
  bgr_to_rgb_row, rgb_to_bgr_row
- `dispatch/bayer.rs` (~160 lines): Bayer dispatchers

`mod.rs` keeps:
- Module-level doc + `pub(crate) mod arch / scalar`
- `mod dispatch;` + `pub use dispatch::*::*` re-exports (the public
  API at `crate::row::*` is unchanged)
- Shared dispatcher helpers (`rgb_row_bytes`, `rgba_row_bytes`,
  `rgb_row_elems`, `rgba_row_elems`, `uv_full_row_elems`,
  `assert_color_transform_well_formed`, `MAX_FUSED_TRANSFORM_ABS`)
  — bumped from `fn` (private) to `pub(crate)` so dispatch
  submodules can call them.
- Runtime CPU feature detection (`neon_available`, `avx2_available`,
  `sse41_available`, `avx512_available`, `simd128_available`) — also
  bumped to `pub(crate)`.
- Inline tests (`mod overflow_tests`, `mod bayer_dispatcher_tests`).

`mod.rs` shrinks from 7276 lines to 770 lines.

The dispatcher function bodies were extracted byte-for-byte via
`sed -n` — no semantic changes. The only edits were swapping
`fn` → `pub(crate) fn` on shared helpers, adding per-file
`use crate::row::*` imports for `scalar`, `arch`, helpers, and the
CPU-detection helpers, plus the `pub use dispatch::*::*` re-exports
in `mod.rs`.
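
A minimal sketch of the resulting layout, written as inline modules so it compiles on its own. The helper body and the `fill_opaque_alpha_u16_row` dispatcher are hypothetical; only the visibility pattern (a `pub(crate)` helper in the parent, `pub` dispatchers in a submodule, glob re-export from the parent) reflects what the commit describes.

```rust
mod row {
    /// Shared helper, widened from private `fn` to `pub(crate)` so the
    /// dispatch submodules can call it. The body is an assumption.
    pub(crate) fn rgba_row_elems(width: usize) -> usize {
        width * 4
    }

    mod dispatch {
        pub(super) mod yuva {
            // Each dispatch file imports the helpers it needs from `row`.
            use super::super::rgba_row_elems;

            /// Hypothetical dispatcher: sets alpha to opaque for `width` pixels.
            pub fn fill_opaque_alpha_u16_row(rgba_out: &mut [u16], width: usize) {
                let n = rgba_row_elems(width);
                assert!(rgba_out.len() >= n, "rgba_out too short");
                for px in rgba_out[..n].chunks_exact_mut(4) {
                    px[3] = u16::MAX;
                }
            }
        }
    }

    // Re-export so callers keep using `row::fill_opaque_alpha_u16_row`;
    // the path is unchanged by the split.
    pub use dispatch::yuva::*;
}
```

Callers never name `dispatch` directly, which is why the split leaves the API at `crate::row::*` untouched.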

Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and
wasm32-unknown-unknown:
- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… directories

`src/row/dispatch/yuv420.rs` (2698 lines) and `yuv444.rs` (1333 lines)
were the two largest files left after the previous split. Split each
into a subdirectory with one file per source format:

```
src/row/dispatch/yuv420/
  mod.rs           (re-exports + module decls, 31 lines)
  yuv_420.rs       (8-bit YUV 4:2:0 RGB / RGBA, 222 lines)
  yuv420p9.rs      (4 variants, 360 lines)
  yuv420p10.rs     (4 variants, 367 lines)
  yuv420p12.rs     (4 variants, 343 lines)
  yuv420p14.rs     (4 variants, 332 lines)
  yuv420p16.rs     (4 variants, 291 lines)
  p010.rs          (P010 4:2:0 semi-planar, 312 lines)
  p012.rs          (P012, 296 lines)
  p016.rs          (P016, 279 lines)

src/row/dispatch/yuv444/
  mod.rs           (re-exports + pub(crate) BITS-generic helpers
                    `yuv_444p_n_to_rgb_row` /
                    `yuv_444p_n_to_rgb_u16_row` shared by 9/10/12/14
                    wrappers, 197 lines)
  yuv_444.rs       (8-bit YUV 4:4:4 RGB / RGBA, 159 lines)
  yuv444p9.rs      (thin RGB wrappers + full RGBA dispatchers,
                    209 lines)
  yuv444p10.rs     (193 lines)
  yuv444p12.rs     (192 lines)
  yuv444p14.rs     (192 lines)
  yuv444p16.rs     (full dispatchers — BITS-generic template
                    pinned to {9,10,12,14}, so 16-bit gets its own,
                    304 lines)
```

No semantic changes — function bodies were extracted byte-for-byte
via `sed -n` from the prior single-file modules. The only edits
were:

- Per-file `use` lines trimmed to what each file actually needs
  (e.g. 8-bit dispatchers don't import `rgb_row_elems` /
  `rgba_row_elems`; the BITS-generic helper file in yuv444 doesn't
  need `rgba_row_*`).
- `yuv444/yuv444p9.rs` through `yuv444p14.rs` add
  `use super::{yuv_444p_n_to_rgb_row, yuv_444p_n_to_rgb_u16_row};`
  so the thin wrappers reach the helpers defined in
  `yuv444/mod.rs`.
- Parent `dispatch/mod.rs` is unchanged — the existing
  `pub(super) mod yuv420; pub(super) mod yuv444;` declarations
  resolve to the new `yuv420/mod.rs` / `yuv444/mod.rs` files.
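
The thin-wrapper arrangement can be sketched as follows, again with inline modules. `mask_row_n` and the `mask_row_p10` / `mask_row_p12` wrappers are hypothetical names with made-up bodies; the point is only the `use super::...` import and the BITS-generic helper pinned to {9, 10, 12, 14}.

```rust
mod yuv444 {
    /// Hypothetical BITS-generic helper living in the parent module,
    /// in the role `yuv444/mod.rs` plays for the real helpers.
    pub(crate) fn mask_row_n<const BITS: u32>(src: &[u16], out: &mut [u16]) {
        // Pinned to the supported depths; 16-bit would need its own path.
        const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };
        let mask = ((1u32 << BITS) - 1) as u16;
        for (dst, &s) in out.iter_mut().zip(src) {
            *dst = s & mask;
        }
    }

    mod yuv444p10 {
        use super::mask_row_n; // reach the helper in the parent module

        /// Thin per-depth wrapper: fixes BITS and forwards.
        pub fn mask_row_p10(src: &[u16], out: &mut [u16]) {
            mask_row_n::<10>(src, out)
        }
    }

    mod yuv444p12 {
        use super::mask_row_n;

        pub fn mask_row_p12(src: &[u16], out: &mut [u16]) {
            mask_row_n::<12>(src, out)
        }
    }

    pub use yuv444p10::*;
    pub use yuv444p12::*;
}
```

Because each wrapper only fixes `BITS`, the per-depth files stay short while the shared logic lives in one place.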

The maximum file size in `src/row/dispatch/` is now 845 lines
(`yuva.rs`); after dropping yuv420.rs/yuv444.rs the largest YUV
files are 367 / 304 lines.

Verified across aarch64-apple-darwin, x86_64-unknown-freebsd, and
wasm32-unknown-unknown:
- `cargo check --lib --tests`: clean
- `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean
- `cargo test --lib` (host): 629 passed (same as before)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-format split landed `use crate::row::arch;` (folded into the
`row::{arch, ...}` import group) in every dispatch sub-file. On
targets without a per-arch SIMD backend — i686, powerpc64, riscv64,
s390x, etc. — the `cfg_select!` body falls through to the scalar
path, every `arch::*` reference is gated out, and clippy's
`-D warnings` flag promotes the resulting `unused_imports` to a hard
error. CI fails: `miri-tb-i686`, `miri-sb-powerpc64`,
`cross (i686-linux-android)`.

Fix: lift `arch` out of the bundled `row::{...}` import block in
each dispatch file and re-import it under
`#[cfg(any(target_arch = "aarch64", target_arch = "x86_64",
target_arch = "wasm32"))]`. This three-target gate matches the set of
targets that have a SIMD backend in `crate::row::arch::*`. Tested via
`RUSTFLAGS=-Dwarnings cargo check --target i686-unknown-linux-gnu
--lib` (now clean); the host aarch64 / x86_64-freebsd / wasm32
suites still pass 629 tests.
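
A reduced sketch of the gated import, assuming a hypothetical `copy_row` kernel in place of the real dispatchers and plain `#[cfg]` attributes instead of the crate's `cfg_select!` block. On a target outside the three-arch set, the `arch` import and every `arch::*` use disappear together, so no `unused_imports` warning is produced.

```rust
mod row {
    pub(crate) mod scalar {
        pub(crate) fn copy_row(src: &[u16], dst: &mut [u16]) {
            dst[..src.len()].copy_from_slice(src);
        }
    }

    pub(crate) mod arch {
        // Stand-in for a per-arch SIMD kernel; only exists on SIMD targets.
        #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
        pub(crate) fn copy_row(src: &[u16], dst: &mut [u16]) {
            dst[..src.len()].copy_from_slice(src);
        }
    }

    pub mod dispatch {
        // Gated separately from the other imports, using the same
        // condition as the code that references `arch`.
        #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
        use super::arch;
        use super::scalar;

        pub fn copy_row(src: &[u16], dst: &mut [u16], use_simd: bool) {
            #[cfg(any(target_arch = "aarch64", target_arch = "x86_64", target_arch = "wasm32"))]
            {
                if use_simd {
                    arch::copy_row(src, dst);
                    return;
                }
            }
            // On other targets the flag has no effect.
            let _ = use_simd;
            scalar::copy_row(src, dst);
        }
    }
}
```

Keeping the import's `cfg` condition identical to the one guarding its uses is what prevents the two from drifting apart on unusual targets.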

Touches every dispatch file that imports `arch`. `bayer.rs` is
intentionally untouched (the Bayer dispatchers are still
scalar-only and never reference `arch::*`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@uqio uqio merged commit 6724a83 into main Apr 28, 2026
43 checks passed
@uqio uqio deleted the feat/ship8b-2c-yuva420p-family-u16-simd branch April 29, 2026 09:34