vello_hybrid: Faster bilinear image rendering#1493
Conversation
| /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`], | ||
| /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place. | ||
| pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) { | ||
| let state = self.save_current_state(); |
There was a problem hiding this comment.
Would it be possible to avoid saving/storing state here and instead allow the user to control that? It would be useful in cases where many draw_image calls happen in sequence.
There was a problem hiding this comment.
I will change it so that we only store the paint transform and paint. I think this is not something that should be upon the user to do though, since I think it's unexpected that draw_image would actually change the render state. Let me know if that works for you.
| /// | ||
| /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`], | ||
| /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place. | ||
| pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) { |
There was a problem hiding this comment.
What about keeping the existing image rendering API as it is today, where you use set_paint_transform, set_paint, and fill_rect?
And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag? Then, in encode_current_paint, custom could be set as needed before passing it into encode_into. That flag would have no effect for vello_cpu, so we could remove hybrid_ref and use the same tests for all our renderers.
There was a problem hiding this comment.
And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag?
I think the problem with this is that it should be possible to still use the old image rendering mode (if you want to draw normal image patterns), even if you add a flag for that. I guess you could make the flag configurable with a set/unset method, but I'm not sure how much better this is either. 😄 I think it makes sense to add this mode to vello_cpu as well, though, so that we have feature parity. But not sure if it should happen in this PR.
| WebGl2RenderingContext::TEXTURE_MAX_LEVEL, | ||
| 0, | ||
| ); | ||
| texture |
There was a problem hiding this comment.
Have you had a chance to compare performance in webgl?
There was a problem hiding this comment.
It doesn't seem to be as straight-forward to get reliable measurements in WebGL, but I'm seeing similar improvements (33ms vs 22ms).
webgl.mp4
|
@grebmeg If I try this fast path for EDIT: See below, I've implemented this now. |
a3cfd9d to
f013939
Compare
|
@grebmeg I've split this into two commits now, the second commit also adds support for a fast-path for nearest-neighbor sampling. If we don't want that, I can revert the last commit. |
33c5240 to
5e4f4dc
Compare
| if let Some(EncodedPaint::Image(img)) = self.encoded_paints.get_mut(paint_idx) { | ||
| img.set_use_gpu_fast_path(); | ||
| } |
There was a problem hiding this comment.
Sorry - just skimming through this PR to understand how it works from an API perspective.
Is this an invariant? Shouldn't this always be Some(EncodedPaint::Image(img))? If so, we should .unwrap rather than suffer a silent bug
There was a problem hiding this comment.
Yes, it should always be. Can change.
| /// | ||
| /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`], | ||
| /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place. | ||
| pub fn draw_image(&mut self, image: ImageSource, rect: &Rect, bilinear: bool) { |
There was a problem hiding this comment.
For my information, why didn't we decide to apply this fast path in fill_rect where bilinear is passed in as a RenderSettings or set_* configuration?
There was a problem hiding this comment.
The API can still change, but the problem with passing it to RenderSettings is that it means you cannot draw normal pattern images with extend repeat/reflect/pad anymore. We could add a setter for this so that it can change at runtime, but personally it just seems worse to me from an API perspective. I think a dedicated draw_image method expresses the intent clearer.
|
Using texture filtering hardware I think is the right choice, but I have a few overall concerns with this. I question the value of creating a fast branch, when we really always want to be fast. Leveraging the hardware should be the norm and not the exception that requires a flag or special configuration.
Are there any other reasons for not doing extend/repeat/reflect that I've overlooked here? Because my impression is that it's viable and that native texture filtering shoudn't be a branch, it should be the replacement for the manual bilinear approach that relies on 4 For the bicubic case, this can also be done more efficiently with bilinear texture filtering. In this case, this GPU Gems article covers the approach. This is something where we may have more reservations about changing the quality, but I'd argue in favor of being high quality and fast across a range of hardware is better place to be. There's a lot of room to reduce the number of registers needed for the uber shader in the texture filtering code just by leveraging the existing sampling hardware. |
I agree, but I think in our case, we would internally always use this new API, because I from what I understood we don't have a need for Pad/Reflect/Repeat? From my understanding, we just need to have the ability to draw simple rectangular images, instead of needing to be able to fill arbitrary shapes with image patterns (which is the case where the Pad/Repeat/Reflect API starts to really become relevant). So basically, the existing functionality stays in-place simply because the current imaging model of Vello needs to have support for it (for example, SVG requires this), but in our case, we would always only use this new API (which does mean we need to add this to vello_cpu as well, to have consistency across the two).
I'm not sure I agree that it's an acceptable trade-off. Yes, if you are rendering things like larger images it probably wouldn't be that noticeable, but if you are using the API to render "real" patterns (for example grid lines or other cell patterns, which are often small in size), then I think this could cause some real artifacts, especially when scaling up a lot. For bicubic, this would be even worse because we can sample even further outside of the border. I guess we would need to test but I don't think other renderers like Skia will have artifacts there, so I would be a bit hesitant to go down this route. It would also make cross-testing against vello_cpu much harder, since our current tests trigger many such edge cases.
If we want to have the behaviour of clamping instead of transparent padding we could probably add that, but obviously we'd need to do the clamping as well in the fragment shader, while right now, we can always just use the raw coordinates. Might not be a big deal, but as we've seen, on low-tier devices, every operation counts. 😦
Yes, the approach in this PR could also be used to accelerate bicubic filtering. But as mentioned, if we try to use this for pad/reflect/repeat as well, this would make artifacts even worse since we will sample even more incorrect pixels at the edge. :( But I agree it might be worth trying what the performance gains would be (on low-tier devices) if we always use the native sampling path. |
I may be overlooking something, but I can only really see an issue for this when it comes to "repeat" which has a discontinuity. If I understand correctly, then "pad" is just clamping at the texel center for the edge texel, and "reflect" samples to the center of the edge texel and then samples inward towards the texture. If we want "repeat" to work correctly with texture samplers, it would be possible to solve it ideally in the padding of texture in the atlas, or in worst case, a branch at the discontinuity in the shader. I believe it's entirely possible to do this without artifacts, as long as we are precise with texel coordinates.
I think that's part of my concern with creating a branch. How can this work for vello_cpu? isn't the correct api for vello_cpu the one that already exists?
I'm not particularly concerned about the cost of clamp/fract/abs/floor ops required to get texture filtering behaving consistently with the current approach, since they are very efficiently implemented in hardware. Switching from textureLoad to textureSample is a huge win in terms of prefetching and reduces a lot of ALU and registers required to do image painting. I would be quite surprised if we could measure a noticeable difference in performance with keeping the extend_mode function in place. I decided to investigate implementing my suggestions, and I haven't shifted my view from being concerned with having a fast branch as part of the API. I think this can still be simplified to using We can handle |
48477ab to
9d6886e
Compare
9d6886e to
ea8d6a7
Compare
ea8d6a7 to
1b6129a
Compare
taj-p
left a comment
There was a problem hiding this comment.
I've started investigating this issue and comparing it to #1517. My preference is to not add a new API but I'm still learning the nuances of both approaches.
I suspect the fill_rect approach can be made to be correct with some additional work. But, the question is whether it can be as performant as this implementation - I'm not sure.
| if let Some(EncodedPaint::Image(img)) = self.encoded_paints.borrow_mut().get_mut(paint_idx) | ||
| { | ||
| img.set_use_gpu_fast_path(); | ||
| } |
There was a problem hiding this comment.
This is an invariant right? This must always be Some. In lieu of amending the implementation such that this if isn't required, I think we should .unwrap instead of silently failing (if for whatever reason get_mut returns None).
There was a problem hiding this comment.
Ah yes, sorry you already remarked that, but I haven't fixed it yet. 😅
To maybe add another argument in favor of a new API, it seems like at least Skia also has a dedicated |
1b6129a to
83194cb
Compare
|
Hey @LaurenzV and @harley-canva ! cc @grebmeg I performed an investigation on this PR since it seems that we're a bit unsure about direction. I think the findings are valuable and will help direct efforts about where we're spending the time. See #1547. I'm wondering whether someone can validate that investigation because the data seems to suggest that the bottleneck isn't the extend_mode calculation, but:
When you put those conclusions together, if we're introducing this new API for performance only, it may be unnecessary. The above requires some review and discussion. |
Now that we have the ability to add image padding, we could add a new
draw_imagemethod that uses the GPU-native bilinear sampler instead of our custom bilinear interpolation method. From my very rough experiment with a single scene of 1k images, this seems to give a pretty good speed boost.A part of the speed boost also comes from the fact that we need to do less work in the fragment shader itself.
Zoomed out
zoomed_out.mp4
We roughly get the following frame times for each mode:
Low: 16.5ms
Medium (manual bilinear): 21.5ms
Medium (GPU-native bilinear): 12.2ms
High: 39ms
Zoomed in
zoomed_in.mp4
Low: 27ms
Medium (manual bilinear): 35ms
Medium (GPU-native bilinear): 19.2ms
High: 60ms