Skip to content

vello_hybrid: Faster bilinear image rendering#1493

Open
laurenz-canva wants to merge 1 commit into
mainfrom
laurenz/poc_native_bilinear
Open

vello_hybrid: Faster bilinear image rendering#1493
laurenz-canva wants to merge 1 commit into
mainfrom
laurenz/poc_native_bilinear

Conversation

@laurenz-canva
Copy link
Copy Markdown
Contributor

@laurenz-canva laurenz-canva commented Mar 5, 2026

Now that we have the ability to add image padding, we could add a new draw_image method that uses the GPU-native bilinear sampler instead of our custom bilinear interpolation method. From my very rough experiment with a single scene of 1k images, this seems to give a pretty good speed boost.

A part of the speed boost also comes from the fact that we need to do less work in the fragment shader itself.

Zoomed out

zoomed_out.mp4

We roughly get the following frame times for each mode:
Low: 16.5ms
Medium (manual bilinear): 21.5ms
Medium (GPU-native bilinear): 12.2ms
High: 39ms

Zoomed in

zoomed_in.mp4

Low: 27ms
Medium (manual bilinear): 35ms
Medium (GPU-native bilinear): 19.2ms
High: 60ms

@laurenz-canva laurenz-canva changed the title vello_hybrid: Faster bilinear image rendering (POC) vello_hybrid: Faster bilinear image rendering Mar 6, 2026
@laurenz-canva laurenz-canva requested a review from grebmeg March 6, 2026 08:40
@laurenz-canva laurenz-canva marked this pull request as ready for review March 6, 2026 08:40
Comment thread sparse_strips/vello_sparse_shaders/shaders/render_strips.wgsl Outdated
Comment thread sparse_strips/vello_hybrid/src/scene.rs Outdated
/// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
/// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) {
let state = self.save_current_state();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to avoid saving/storing state here and instead allow the user to control that? It would be useful in cases where many draw_image calls happen in sequence.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it so that we only store the paint transform and paint. I think this is not something that should be upon the user to do though, since I think it's unexpected that draw_image would actually change the render state. Let me know if that works for you.

Comment thread sparse_strips/vello_hybrid/src/render/wgpu.rs Outdated
Comment thread sparse_strips/vello_hybrid/src/render/wgpu.rs Outdated
Comment thread sparse_strips/vello_hybrid/src/scene.rs Outdated
///
/// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
/// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about keeping the existing image rendering API as it is today, where you use set_paint_transform, set_paint, and fill_rect?

And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag? Then, in encode_current_paint, custom could be set as needed before passing it into encode_into. That flag would have no effect for vello_cpu, so we could remove hybrid_ref and use the same tests for all our renderers.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag?

I think the problem with this is that it should be possible to still use the old image rendering mode (if you want to draw normal image patterns), even if you add a flag for that. I guess you could make the flag configurable with a set/unset method, but I'm not sure how much better this is either. 😄 I think it makes sense to add this mode to vello_cpu as well, though, so that we have feature parity. But not sure if it should happen in this PR.

WebGl2RenderingContext::TEXTURE_MAX_LEVEL,
0,
);
texture
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you had a chance to compare performance in webgl?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will investigate that.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to be as straight-forward to get reliable measurements in WebGL, but I'm seeing similar improvements (33ms vs 22ms).

webgl.mp4

@LaurenzV
Copy link
Copy Markdown
Collaborator

LaurenzV commented Mar 9, 2026

@grebmeg If I try this fast path for ImageQuality::Low I'm also seeing a drop (from about 19ms to 15ms), so it might be worth doing if we need it for glyph caching. But as mentioned, we probably need to figure out a better API story first 😬

EDIT: See below, I've implemented this now.

@laurenz-canva laurenz-canva force-pushed the laurenz/poc_native_bilinear branch from a3cfd9d to f013939 Compare March 9, 2026 10:18
@LaurenzV
Copy link
Copy Markdown
Collaborator

LaurenzV commented Mar 9, 2026

@grebmeg I've split this into two commits now, the second commit also adds support for a fast-path for nearest-neighbor sampling. If we don't want that, I can revert the last commit.

@laurenz-canva laurenz-canva force-pushed the laurenz/poc_native_bilinear branch 2 times, most recently from 33c5240 to 5e4f4dc Compare March 9, 2026 11:58
Comment thread sparse_strips/vello_hybrid/src/scene.rs Outdated
Comment on lines +501 to +503
if let Some(EncodedPaint::Image(img)) = self.encoded_paints.get_mut(paint_idx) {
img.set_use_gpu_fast_path();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - just skimming through this PR to understand how it works from an API perspective.

Is this an invariant? Shouldn't this always be Some(EncodedPaint::Image(img))? If so, we should .unwrap rather than suffer a silent bug

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should always be. Can change.

///
/// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
/// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
pub fn draw_image(&mut self, image: ImageSource, rect: &Rect, bilinear: bool) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my information, why didn't we decide to apply this fast path in fill_rect where bilinear is passed in as a RenderSettings or set_* configuration?

Copy link
Copy Markdown
Collaborator

@LaurenzV LaurenzV Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API can still change, but the problem with passing it to RenderSettings is that it means you cannot draw normal pattern images with extend repeat/reflect/pad anymore. We could add a setter for this so that it can change at runtime, but personally it just seems worse to me from an API perspective. I think a dedicated draw_image method expresses the intent clearer.

@nicoburns nicoburns added the C-hybrid Applies to the vello_hybrid crate label Mar 11, 2026
@grebmeg grebmeg self-requested a review March 16, 2026 05:52
@harley-canva
Copy link
Copy Markdown

Using texture filtering hardware I think is the right choice, but I have a few overall concerns with this. I question the value of creating a fast branch, when we really always want to be fast. Leveraging the hardware should be the norm and not the exception that requires a flag or special configuration.

clamp/fract/abs/floor should be all we need to get extend/repeat/reflect working with texture filtering. There will be discontinuities at the edge texel, but I think that's an acceptable trade-off. The key is to make sure we sample from the center of the texels at the edge of the texture, not at the edge of the texel, to avoid bilinear filtering with transparent padding.

Are there any other reasons for not doing extend/repeat/reflect that I've overlooked here? Because my impression is that it's viable and that native texture filtering shoudn't be a branch, it should be the replacement for the manual bilinear approach that relies on 4 textureLoad calls.

For the bicubic case, this can also be done more efficiently with bilinear texture filtering. In this case, this GPU Gems article covers the approach. This is something where we may have more reservations about changing the quality, but I'd argue in favor of being high quality and fast across a range of hardware is better place to be. There's a lot of room to reduce the number of registers needed for the uber shader in the texture filtering code just by leveraging the existing sampling hardware.

@LaurenzV
Copy link
Copy Markdown
Collaborator

I question the value of creating a fast branch, when we really always want to be fast. Leveraging the hardware should be the norm and not the exception that requires a flag or special configuration.

I agree, but I think in our case, we would internally always use this new API, because I from what I understood we don't have a need for Pad/Reflect/Repeat? From my understanding, we just need to have the ability to draw simple rectangular images, instead of needing to be able to fill arbitrary shapes with image patterns (which is the case where the Pad/Repeat/Reflect API starts to really become relevant).

So basically, the existing functionality stays in-place simply because the current imaging model of Vello needs to have support for it (for example, SVG requires this), but in our case, we would always only use this new API (which does mean we need to add this to vello_cpu as well, to have consistency across the two).

clamp/fract/abs/floor should be all we need to get extend/repeat/reflect working with texture filtering. There will be discontinuities at the edge texel, but I think that's an acceptable trade-off.

I'm not sure I agree that it's an acceptable trade-off. Yes, if you are rendering things like larger images it probably wouldn't be that noticeable, but if you are using the API to render "real" patterns (for example grid lines or other cell patterns, which are often small in size), then I think this could cause some real artifacts, especially when scaling up a lot. For bicubic, this would be even worse because we can sample even further outside of the border. I guess we would need to test but I don't think other renderers like Skia will have artifacts there, so I would be a bit hesitant to go down this route. It would also make cross-testing against vello_cpu much harder, since our current tests trigger many such edge cases.

The key is to make sure we sample from the center of the texels at the edge of the texture, not at the edge of the texel, to avoid bilinear filtering with transparent padding.

If we want to have the behaviour of clamping instead of transparent padding we could probably add that, but obviously we'd need to do the clamping as well in the fragment shader, while right now, we can always just use the raw coordinates. Might not be a big deal, but as we've seen, on low-tier devices, every operation counts. 😦

For the bicubic case, this can also be done more efficiently with bilinear texture filtering.

Yes, the approach in this PR could also be used to accelerate bicubic filtering. But as mentioned, if we try to use this for pad/reflect/repeat as well, this would make artifacts even worse since we will sample even more incorrect pixels at the edge. :(

But I agree it might be worth trying what the performance gains would be (on low-tier devices) if we always use the native sampling path.

@harley-canva
Copy link
Copy Markdown

harley-canva commented Mar 19, 2026

if you are using the API to render "real" patterns (for example grid lines or other cell patterns, which are often small in size), then I think this could cause some real artifacts, especially when scaling up a lot

I may be overlooking something, but I can only really see an issue for this when it comes to "repeat" which has a discontinuity. If I understand correctly, then "pad" is just clamping at the texel center for the edge texel, and "reflect" samples to the center of the edge texel and then samples inward towards the texture. If we want "repeat" to work correctly with texture samplers, it would be possible to solve it ideally in the padding of texture in the atlas, or in worst case, a branch at the discontinuity in the shader. I believe it's entirely possible to do this without artifacts, as long as we are precise with texel coordinates.

in our case, we would always only use this new API (which does mean we need to add this to vello_cpu as well, to have consistency across the two).

I think that's part of my concern with creating a branch. How can this work for vello_cpu? isn't the correct api for vello_cpu the one that already exists?

Might not be a big deal, but as we've seen, on low-tier devices, every operation counts.

I'm not particularly concerned about the cost of clamp/fract/abs/floor ops required to get texture filtering behaving consistently with the current approach, since they are very efficiently implemented in hardware. Switching from textureLoad to textureSample is a huge win in terms of prefetching and reduces a lot of ALU and registers required to do image painting. I would be quite surprised if we could measure a noticeable difference in performance with keeping the extend_mode function in place.


I decided to investigate implementing my suggestions, and I realize my mistake with regards to "reflect", it does have sampling issues at the boundaries that I overlooked. EDIT: I got it working #1517

I haven't shifted my view from being concerned with having a fast branch as part of the API. I think this can still be simplified to using textureSample for Pad + Reflect, which can produce correct results, with a smaller set of changes to the shader and webgl/wgpu code and no change to the API.

We can handle Repeat by handling it in the atlas, to be both performant and correct, although that's obviously a larger change. If we did that, we can then remove the branch, and minimize the number of registers needed for bilinear sampling.

@laurenz-canva laurenz-canva force-pushed the laurenz/poc_native_bilinear branch 3 times, most recently from 48477ab to 9d6886e Compare March 24, 2026 14:43
@LaurenzV LaurenzV force-pushed the laurenz/poc_native_bilinear branch from 9d6886e to ea8d6a7 Compare March 25, 2026 07:35
@laurenz-canva laurenz-canva force-pushed the laurenz/poc_native_bilinear branch from ea8d6a7 to 1b6129a Compare March 26, 2026 10:52
Copy link
Copy Markdown
Contributor

@taj-p taj-p left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started investigating this issue and comparing it to #1517. My preference is to not add a new API but I'm still learning the nuances of both approaches.

I suspect the fill_rect approach can be made to be correct with some additional work. But, the question is whether it can be as performant as this implementation - I'm not sure.

Comment on lines +530 to +533
if let Some(EncodedPaint::Image(img)) = self.encoded_paints.borrow_mut().get_mut(paint_idx)
{
img.set_use_gpu_fast_path();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an invariant right? This must always be Some. In lieu of amending the implementation such that this if isn't required, I think we should .unwrap instead of silently failing (if for whatever reason get_mut returns None).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, sorry you already remarked that, but I haven't fixed it yet. 😅

@LaurenzV
Copy link
Copy Markdown
Collaborator

LaurenzV commented Mar 26, 2026

My preference is to not add a new API but I'm still learning the nuances of both approaches.

To maybe add another argument in favor of a new API, it seems like at least Skia also has a dedicated drawImage method instead of requiring the user to call drawRect with an image paint. So we wouldn't be alone in having a dedicated API for that. 😄 Though I admittedly haven't looked into how those are handled in the shader.

@taj-p
Copy link
Copy Markdown
Contributor

taj-p commented Mar 28, 2026

Hey @LaurenzV and @harley-canva ! cc @grebmeg

I performed an investigation on this PR since it seems that we're a bit unsure about direction. I think the findings are valuable and will help direct efforts about where we're spending the time. See #1547.

I'm wondering whether someone can validate that investigation because the data seems to suggest that the bottleneck isn't the extend_mode calculation, but:

  • textureDimension querying;
  • image quality branching; and
  • per-pixel frag calculations (that should be moved to the vertex and passed as memory).

When you put those conclusions together, if we're introducing this new API for performance only, it may be unnecessary. The above requires some review and discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-hybrid Applies to the vello_hybrid crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants