sparse_strips: Add ColorMatrix filter#1623
Conversation
a20a91a to
e72325b
Compare
|
This was done with the assistance of Codex (GPT 5.5, xhigh). It did some of the boring stuff but also wrote benchmarks, ran them under a profiler, disassembled code to verify certain things were happening, and carried out a number of experiments to see if some stuff would result in appreciable performance improvements or not. |
e72325b to
b01cf5f
Compare
|
After talking to @raphlinus, Codex and I did another optimization that was amended into the 3rd commit in this series to avoid a lot of the un-premultiply / re-premultiply work with compatible matrices. The fast path applies matrices directly in premultiplied space when they:
That covers Benchmark using the same 20s 1024x1024 SEPIA release harness:
|
b01cf5f to
fc26e75
Compare
|
Some of this code didn't really belong in Vello, so I've gone and created a new crate in the I've got a version of this set of changes that builds on that instead, and I'll update this PR with it later today. |
Hey @waywardmonkeys, sorry for the slow review. Now that I’ve got review capacity back, I’m ready to take a look. Before I start, just wanted to confirm whether this PR is ready for review or if you’re still planning to make more updates to it? |
|
@grebmeg I think most of this apart from the color math is useful to take a look at and make sure there's no objections to it overall. The color math here is fine, but really belongs elsewhere, hence linebender/color#221. Unfortunately, Tom brought up some good points over there and maybe some people who care about this topic should have a meeting to discuss things. If you'd like to set up such a meeting with me, please feel free to do so. I won't need much notice about having it, as long as I'm available. |
|
FYI: @tomcur (See above) |
There was a problem hiding this comment.
I think this PR is in pretty good shape overall. It might be worth adding the optimizations I mentioned here so we can avoid the additional memory overhead for each filter.
There are also a few other things we haven’t wired up yet, but hopefully we’ll be able to tackle those over time as well.
Thanks for introducing the color filter support! 🎉
| let t3 = textureLoad(filter_data, vec2((texel_offset + 3u) % w, (texel_offset + 3u) / w), 0); | ||
| let t4 = textureLoad(filter_data, vec2((texel_offset + 4u) % w, (texel_offset + 4u) / w), 0); | ||
| let t5 = textureLoad(filter_data, vec2((texel_offset + 5u) % w, (texel_offset + 5u) / w), 0); |
There was a problem hiding this comment.
With adding three more texture loads, I think it’s time to address this comment and introduce a variable-stride index for filter data. This would be similar to how sparse strips are handled in encoded_paints, where paint_idxs stores the texel offset for where the current filter starts in the texture. Would it be possible to do this, if not in this PR, then at least as a follow-up?
| if is_premul_compatible_matrix(matrix) { | ||
| // For RGB-only, alpha-preserving matrices, applying the RGB matrix to | ||
| // premultiplied channels is equivalent to unpremultiply -> matrix -> | ||
| // premultiply. This avoids the per-pixel alpha division. | ||
| for pixel in pixmap.data_mut() { | ||
| let transformed = apply_premul_color_matrix_to_pixel(*pixel, matrix); | ||
| may_have_transparency |= transformed.a != 255; | ||
| *pixel = transformed; | ||
| } | ||
| } else { | ||
| for pixel in pixmap.data_mut() { | ||
| let transformed = apply_color_matrix_to_pixel(*pixel, matrix); | ||
| may_have_transparency |= transformed.a != 255; | ||
| *pixel = transformed; | ||
| } | ||
| } |
There was a problem hiding this comment.
Color matrix is a great candidate for SIMD here 😅 We haven’t wired filters up with SIMD yet, but it’d be nice to do at some point since it could bring significant performance improvements. Just sharing my thoughts, it might be worth adding a TODO somewhere so we can revisit this later.
There was a problem hiding this comment.
Actually, I just realized we also haven’t wired up the faster non-spatial filter path at all, things like color matrix, opacity, flood, basically filters that only require per-pixel work.
I don’t think that blocks this PR since we already have the same situation with Flood, but I wanted to highlight it in case you’re aiming for significantly better performance. Right now those filters still go through the spatial filter path, which allocates a pixmap and does some additional work that could be avoided entirely. This could likely be handled much more efficiently here in vello_cpu fine.
| ) | ||
| } | ||
|
|
||
| fn is_premul_compatible_matrix(matrix: &[f32; 20]) -> bool { |
There was a problem hiding this comment.
Would it be worth inlining this and the functions above as well?
| #[vello_test(skip_multithreaded, width = 120, height = 80, hybrid_tolerance = 1)] | ||
| fn filter_color_matrix_full_sepia(ctx: &mut impl Renderer) { | ||
| let filter = Filter::from_primitive(FilterPrimitive::ColorMatrix { | ||
| matrix: matrices::SEPIA, |
There was a problem hiding this comment.
Could you add at least one snapshot test for an alpha-affecting matrix (e.g. matrices::ALPHA_TO_BLACK) or a matrix with non-zero offsets (the 5th column in a row)?
| fn is_premul_compatible_matrix(matrix: &[f32; 20]) -> bool { | ||
| matrix[3] == 0.0 | ||
| && matrix[4] == 0.0 | ||
| && matrix[8] == 0.0 | ||
| && matrix[9] == 0.0 | ||
| && matrix[13] == 0.0 | ||
| && matrix[14] == 0.0 | ||
| && matrix[15] == 0.0 | ||
| && matrix[16] == 0.0 | ||
| && matrix[17] == 0.0 | ||
| && matrix[18] == 1.0 | ||
| && matrix[19] == 0.0 | ||
| } |
There was a problem hiding this comment.
| fn is_premul_compatible_matrix(matrix: &[f32; 20]) -> bool { | |
| matrix[3] == 0.0 | |
| && matrix[4] == 0.0 | |
| && matrix[8] == 0.0 | |
| && matrix[9] == 0.0 | |
| && matrix[13] == 0.0 | |
| && matrix[14] == 0.0 | |
| && matrix[15] == 0.0 | |
| && matrix[16] == 0.0 | |
| && matrix[17] == 0.0 | |
| && matrix[18] == 1.0 | |
| && matrix[19] == 0.0 | |
| } | |
| fn is_premul_compatible_matrix(matrix: &[f32; 20]) -> bool { | |
| let row_r = &matrix[0..5]; | |
| let row_g = &matrix[5..10]; | |
| let row_b = &matrix[10..15]; | |
| let row_a = &matrix[15..20]; | |
| // RGB rows must not depend on alpha (col 3) or carry a constant offset (col 4): | |
| // applying the matrix directly to premultiplied channels then yields the same | |
| // result as unpremultiply -> matrix -> premultiply. | |
| let rgb_alpha_independent = row_r[3] == 0.0 | |
| && row_r[4] == 0.0 | |
| && row_g[3] == 0.0 | |
| && row_g[4] == 0.0 | |
| && row_b[3] == 0.0 | |
| && row_b[4] == 0.0; | |
| // And alpha must be preserved exactly. | |
| let alpha_preserved = row_a[0] == 0.0 | |
| && row_a[1] == 0.0 | |
| && row_a[2] == 0.0 | |
| && row_a[3] == 1.0 | |
| && row_a[4] == 0.0; | |
| rgb_alpha_independent && alpha_preserved | |
| } |
| return color; | ||
| } | ||
|
|
||
| fn color_matrix_row(data: GpuFilterData, base: u32, color: vec4<f32>) -> f32 { |
I have a feeling it might be useful to start the discussion on Zulip first, so we can all better explore the problem space together? I think this is really important, especially since in the future we’ll likely need to consider not just sRGB. I’m not even sure how much of the discussion so far has touched on that. I also honestly don’t know how actively the |
This adds
ColorMatrixfilter support to the sparse-strip renderers.may_have_transparencyduring the color-matrix pass/ 255.0normalization via reciprocal multiplyPerformance Notes
I profiled a temporary release-mode harness applying
matrices::SEPIArepeatedly over a 1024x1024 pixmap for 20 seconds.After the reciprocal multiply change:
3610full-pixmap passes3233full-pixmap passes