vello_cpu: Optimize alpha fills for solid colors #1617
grebmeg
left a comment
I’m on an M1 Max and seeing quite different bench numbers (I’m comparing the function before and after on the same git branch). Overall, there are clear improvements for solid fills, except for a major regression in solid_opaque_single and smaller regressions for transparent fills. Given the large improvements I’m seeing for fills overall, I think this PR still makes sense, but I’d be interested to know whether there’s anything we can do to address the solid_opaque_single regression.
I’m also wondering why the numbers between you and me differ so significantly.
fine/strip/solid_opaque_single_u8_neon
time: [11.448 ns 11.477 ns 11.505 ns]
change: [+11.153% +13.394% +15.441%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
fine/strip/solid_transparent_single_u8_neon
time: [9.5715 ns 9.6390 ns 9.7107 ns]
change: [-9.4751% -7.6928% -5.9355%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
6 (6.00%) high mild
1 (1.00%) high severe
fine/strip/solid_opaque_short_u8_neon
time: [12.004 ns 12.204 ns 12.504 ns]
change: [-16.935% -14.583% -12.037%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high severe
fine/strip/solid_transparent_short_u8_neon
time: [14.323 ns 14.414 ns 14.518 ns]
change: [+1.1711% +2.5106% +3.7685%] (p = 0.00 < 0.05)
Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
6 (6.00%) high mild
4 (4.00%) high severe
fine/strip/solid_opaque_medium_u8_neon
time: [16.474 ns 16.509 ns 16.548 ns]
change: [-30.802% -30.539% -30.276%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) high mild
1 (1.00%) high severe
fine/strip/solid_transparent_medium_u8_neon
time: [23.544 ns 23.632 ns 23.737 ns]
change: [-0.4460% -0.1135% +0.1970%] (p = 0.50 > 0.05)
No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) low mild
9 (9.00%) high mild
4 (4.00%) high severe
fine/strip/solid_opaque_long_u8_neon
time: [50.509 ns 50.677 ns 50.874 ns]
change: [-37.554% -37.344% -37.146%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
10 (10.00%) high mild
4 (4.00%) high severe
fine/strip/solid_transparent_long_u8_neon
time: [80.458 ns 80.643 ns 80.849 ns]
change: [-1.6947% -1.2280% -0.7446%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
if src[3] == 255 {
    for (next_bg, next_mask) in dest.chunks_exact_mut(32).zip(alphas) {
        alpha_composite_opaque_inner(s, next_bg, &next_mask, src_c, one);
    }
} else {
    let src_a = u8x32::splat(s, src[3]);

    for (next_bg, next_mask) in dest.chunks_exact_mut(32).zip(alphas) {
        alpha_composite_inner(s, next_bg, &next_mask, src_c, src_a, one);
    }
The solid_opaque_single bench consistently regresses ~15-18% on neon. With Tile::WIDTH = 4, the loop runs exactly once, so perhaps the branch overhead plus the codegen difference from the new function outweighs the saved multiply? It may be worth investigating the generated assembly. It might also be worth lifting the opaque dispatch up to Fine::fill (where we already check color[3] == T::Numeric::ONE) rather than branching inside the SIMD closure.
}

#[inline(always)]
fn alpha_composite_opaque_inner<S: Simd>(
Could you please add a small doc comment here?
That's unfortunate, will try to look into it. Kind of strange though because the new path is objectively less work.
Well, wouldn't be the first time that we are seeing very different numbers on M1 and M4. 😄
Based on #1615.
Up until now, when rendering alpha fills we only had a single code path, which factors the alpha of the source color into the final alpha before compositing. However, when the color is fully opaque, this can be simplified a lot, since the only source of alpha is the alpha mask (from anti-aliasing). Therefore, we can save ourselves the operation that multiplies the mask with the alpha of the source color.
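To make the saved work concrete, here is a minimal scalar sketch of the two paths. It is not the SIMD code from this PR: the names `mul_255`, `composite` and `composite_opaque` are hypothetical, the color is assumed to be premultiplied RGBA8, and exact rounded division by 255 stands in for whatever approximation the real kernels use.

```rust
/// Multiply two u8 values interpreted as fractions of 255, with rounding.
/// (Hypothetical helper; the real kernels may use a cheaper approximation.)
fn mul_255(a: u8, b: u8) -> u8 {
    ((a as u16 * b as u16 + 127) / 255) as u8
}

/// General path: the effective alpha is the product of the source alpha
/// and the anti-aliasing mask, which costs one extra multiply per pixel.
fn composite(bg: [u8; 4], src: [u8; 4], mask: u8) -> [u8; 4] {
    let inv_alpha = 255 - mul_255(src[3], mask);
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = mul_255(bg[i], inv_alpha).saturating_add(mul_255(src[i], mask));
    }
    out
}

/// Opaque path: with src[3] == 255 the effective alpha is just the mask,
/// so the multiply with the source alpha can be dropped.
fn composite_opaque(bg: [u8; 4], src: [u8; 4], mask: u8) -> [u8; 4] {
    let inv_alpha = 255 - mask;
    let mut out = [0u8; 4];
    for i in 0..4 {
        out[i] = mul_255(bg[i], inv_alpha).saturating_add(mul_255(src[i], mask));
    }
    out
}
```

Under these assumptions both functions return the same result whenever `src[3] == 255`, so the opaque path is a pure strength reduction rather than a change in output.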
As can be seen in the benchmarks, this makes a pretty huge difference for solid colors (compare each opaque version with its corresponding transparent version):