vello_cpu: Optimize alpha fills for solid colors#1617

Draft
LaurenzV wants to merge 1 commit into main from laurenz/improve_solid

@LaurenzV LaurenzV commented May 3, 2026

Based on #1615.

Until now, alpha fills had only a single code path, which takes the alpha of the source color into account to calculate the final alpha and then composites. When the color is fully opaque, however, this can be simplified considerably: the only source of alpha is the alpha mask (from anti-aliasing), so we can skip the operation that multiplies the mask with the alpha of the source color.
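
To make the simplification concrete, here is a minimal scalar sketch of the two paths. It is not the SIMD code from this PR: the function names, the premultiplied `[u8; 4]` pixel layout and the exact-rounding `div_255` helper are all assumptions made for illustration, and the real kernels may round differently.

```rust
/// Rounded division by 255 (illustrative helper, not vello_cpu's own routine).
fn div_255(x: u32) -> u32 {
    (x + 127) / 255
}

/// General path: the coverage mask is first scaled by the source alpha.
/// `src` and `bg` are assumed to be premultiplied RGBA.
fn composite_general(src: [u8; 4], bg: [u8; 4], mask: u8) -> [u8; 4] {
    // Extra multiply: effective alpha = source alpha * mask.
    let eff_a = div_255(src[3] as u32 * mask as u32);
    let inv = 255 - eff_a;
    let mut out = [0u8; 4];
    for i in 0..4 {
        let s = div_255(src[i] as u32 * mask as u32);
        let b = div_255(bg[i] as u32 * inv);
        out[i] = (s + b).min(255) as u8;
    }
    out
}

/// Opaque path: with a source alpha of 255 the effective alpha is just the
/// mask, so the multiply above disappears.
fn composite_opaque(src: [u8; 4], bg: [u8; 4], mask: u8) -> [u8; 4] {
    let inv = 255 - mask as u32;
    let mut out = [0u8; 4];
    for i in 0..4 {
        let s = div_255(src[i] as u32 * mask as u32);
        let b = div_255(bg[i] as u32 * inv);
        out[i] = (s + b).min(255) as u8;
    }
    out
}
```

Under these assumptions both functions return identical pixels whenever `src[3] == 255`, because `div_255(255 * mask)` is exactly `mask`.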

As can be seen in the benchmarks, this makes a pretty huge difference for solid colors (compare each opaque version with its corresponding transparent version):

fine/strip/solid_opaque_single_u8_neon
                        time:   [10.403 ns 10.469 ns 10.529 ns]
                        change: [-2.1594% -0.8205% +0.2495%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low severe
  7 (7.00%) low mild

fine/strip/solid_transparent_single_u8_neon
                        time:   [12.117 ns 12.162 ns 12.206 ns]
                        change: [+0.7448% +1.6490% +2.6538%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild

fine/strip/solid_opaque_short_u8_neon
                        time:   [12.419 ns 12.503 ns 12.593 ns]
                        change: [-6.0021% -3.9354% -2.0884%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

fine/strip/solid_transparent_short_u8_neon
                        time:   [13.106 ns 13.379 ns 13.655 ns]
                        change: [-1.9953% +0.1353% +2.3733%] (p = 0.90 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

fine/strip/solid_opaque_medium_u8_neon
                        time:   [14.023 ns 14.196 ns 14.416 ns]
                        change: [-1.5650% -0.5159% +0.5592%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe

fine/strip/solid_transparent_medium_u8_neon
                        time:   [21.052 ns 21.088 ns 21.129 ns]
                        change: [+0.2684% +0.6005% +0.9341%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

fine/strip/solid_opaque_long_u8_neon
                        time:   [45.880 ns 45.948 ns 46.022 ns]
                        change: [-3.2477% -2.4263% -1.8232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

fine/strip/solid_transparent_long_u8_neon
                        time:   [73.942 ns 74.149 ns 74.375 ns]
                        change: [+1.1164% +1.5625% +1.9901%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
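
For the opaque-versus-transparent comparison, the gap can be worked out from the middle value of each `time:` interval above. The sketch below is only a convenience calculation; the helper names are made up for illustration, and the figures are copied from the output above.

```rust
/// Fraction of time saved by the opaque path relative to the transparent one.
fn saving(opaque_ns: f64, transparent_ns: f64) -> f64 {
    1.0 - opaque_ns / transparent_ns
}

/// (case, opaque ns, transparent ns): middle `time:` values from the runs above.
const MEDIANS: [(&str, f64, f64); 4] = [
    ("single", 10.469, 12.162),
    ("short", 12.503, 13.379),
    ("medium", 14.196, 21.088),
    ("long", 45.948, 74.149),
];

/// Relative saving per benchmark case, in the same order as `MEDIANS`.
fn savings_table() -> Vec<(&'static str, f64)> {
    MEDIANS
        .iter()
        .map(|&(name, opaque, transparent)| (name, saving(opaque, transparent)))
        .collect()
}
```

With these figures the opaque path takes roughly 14% (single), 7% (short), 33% (medium) and 38% (long) less time than the transparent one.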

@LaurenzV LaurenzV requested a review from grebmeg May 3, 2026 21:17
Base automatically changed from laurenz/micro to main May 7, 2026 08:39
@LaurenzV LaurenzV force-pushed the laurenz/improve_solid branch from a3ab756 to bf81c0b Compare May 10, 2026 10:15

@grebmeg grebmeg left a comment

I’m on an M1 Max and seeing quite different bench numbers (I’m comparing the function before and after on the same git branch). Overall, there are clear improvements for solid fills, except for a major regression in solid_opaque_single and smaller regressions for transparent fills. Given the large improvements I’m seeing for fills overall, I think this PR still makes sense, but I’d be interested to know whether there’s anything we can do about the solid_opaque_single regression.

I’m also wondering why the numbers between you and me differ so significantly.

fine/strip/solid_opaque_single_u8_neon
                        time:   [11.448 ns 11.477 ns 11.505 ns]
                        change: [+11.153% +13.394% +15.441%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

fine/strip/solid_transparent_single_u8_neon
                        time:   [9.5715 ns 9.6390 ns 9.7107 ns]
                        change: [-9.4751% -7.6928% -5.9355%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

fine/strip/solid_opaque_short_u8_neon
                        time:   [12.004 ns 12.204 ns 12.504 ns]
                        change: [-16.935% -14.583% -12.037%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe

fine/strip/solid_transparent_short_u8_neon
                        time:   [14.323 ns 14.414 ns 14.518 ns]
                        change: [+1.1711% +2.5106% +3.7685%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

fine/strip/solid_opaque_medium_u8_neon
                        time:   [16.474 ns 16.509 ns 16.548 ns]
                        change: [-30.802% -30.539% -30.276%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

fine/strip/solid_transparent_medium_u8_neon
                        time:   [23.544 ns 23.632 ns 23.737 ns]
                        change: [-0.4460% -0.1135% +0.1970%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  9 (9.00%) high mild
  4 (4.00%) high severe

fine/strip/solid_opaque_long_u8_neon
                        time:   [50.509 ns 50.677 ns 50.874 ns]
                        change: [-37.554% -37.344% -37.146%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe

fine/strip/solid_transparent_long_u8_neon
                        time:   [80.458 ns 80.643 ns 80.849 ns]
                        change: [-1.6947% -1.2280% -0.7446%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

Comment on lines +504 to +513
if src[3] == 255 {
    for (next_bg, next_mask) in dest.chunks_exact_mut(32).zip(alphas) {
        alpha_composite_opaque_inner(s, next_bg, &next_mask, src_c, one);
    }
} else {
    let src_a = u8x32::splat(s, src[3]);

    for (next_bg, next_mask) in dest.chunks_exact_mut(32).zip(alphas) {
        alpha_composite_inner(s, next_bg, &next_mask, src_c, src_a, one);
    }

The solid_opaque_single bench consistently regresses ~15-18% on neon. With Tile::WIDTH = 4, the loop runs exactly once, so perhaps the branch overhead plus the codegen difference from the new function outweighs the saved multiply? It might be worth investigating the generated assembly. It might also be worth lifting the opaque dispatch up to Fine::fill (where we already check color[3] == T::Numeric::ONE) rather than branching inside the SIMD closure.
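
A rough scalar sketch of what lifting the dispatch could look like: the caller checks opacity once and hands off to one of two branch-free loops. Everything here is hypothetical (the `fill`, `fill_opaque`, `fill_translucent` and `blend` names, the one-mask-byte-per-pixel layout and the rounding), so it only illustrates the structure, not the actual Fine::fill or SIMD code.

```rust
/// Illustrative premultiplied blend of one channel: the source is weighted by
/// the coverage mask, the background by the inverse of the effective alpha.
fn blend(src_c: u8, bg: u8, mask: u32, eff_a: u32) -> u8 {
    let s = (src_c as u32 * mask + 127) / 255;
    let b = (bg as u32 * (255 - eff_a) + 127) / 255;
    (s + b).min(255) as u8
}

/// Opaque kernel: the effective alpha is the mask itself, no per-pixel branch.
fn fill_opaque(dest: &mut [u8], masks: &[u8], src: [u8; 4]) {
    for (px, &m) in dest.chunks_exact_mut(4).zip(masks) {
        for (d, &c) in px.iter_mut().zip(&src) {
            *d = blend(c, *d, m as u32, m as u32);
        }
    }
}

/// Translucent kernel: the mask is scaled by the source alpha for every pixel.
fn fill_translucent(dest: &mut [u8], masks: &[u8], src: [u8; 4]) {
    let src_a = src[3] as u32;
    for (px, &m) in dest.chunks_exact_mut(4).zip(masks) {
        let eff_a = (src_a * m as u32 + 127) / 255;
        for (d, &c) in px.iter_mut().zip(&src) {
            *d = blend(c, *d, m as u32, eff_a);
        }
    }
}

/// Hypothetical entry point: opacity is checked once here, so each kernel
/// body stays free of the opaque/translucent branch.
pub fn fill(dest: &mut [u8], masks: &[u8], src: [u8; 4]) {
    if src[3] == 255 {
        fill_opaque(dest, masks, src);
    } else {
        fill_translucent(dest, masks, src);
    }
}
```

Whether this actually helps the single-tile case would still need to be confirmed from the generated assembly and the benchmarks.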

}

#[inline(always)]
fn alpha_composite_opaque_inner<S: Simd>(

Could you please add a small doc comment here?

LaurenzV commented May 11, 2026

That's unfortunate, will try to look into it. Kind of strange though because the new path is objectively less work.

> I’m also wondering why the numbers between you and me differ so significantly.

Well, wouldn't be the first time that we are seeing very different numbers on M1 and M4. 😄

@laurenz-canva laurenz-canva force-pushed the laurenz/improve_solid branch from bf81c0b to f1192c7 Compare May 13, 2026 06:12
@LaurenzV LaurenzV marked this pull request as draft May 13, 2026 07:03
@laurenz-canva laurenz-canva force-pushed the laurenz/improve_solid branch from f1192c7 to 8a38862 Compare May 13, 2026 07:56
@laurenz-canva laurenz-canva force-pushed the laurenz/improve_solid branch from 8a38862 to c855dba Compare May 13, 2026 08:36