vello_hybrid: Faster bilinear image rendering by laurenz-canva · Pull Request #1493 · linebender/vello

laurenz-canva · 2026-03-05T14:09:45Z

Now that we have the ability to add image padding, we could add a new draw_image method that uses the GPU-native bilinear sampler instead of our custom bilinear interpolation method. From my very rough experiment with a single scene of 1k images, this seems to give a pretty good speed boost.

A part of the speed boost also comes from the fact that we need to do less work in the fragment shader itself.

Zoomed out

zoomed_out.mp4

We roughly get the following frame times for each mode:
Low: 16.5ms
Medium (manual bilinear): 21.5ms
Medium (GPU-native bilinear): 12.2ms
High: 39ms

Zoomed in

zoomed_in.mp4

Low: 27ms
Medium (manual bilinear): 35ms
Medium (GPU-native bilinear): 19.2ms
High: 60ms

grebmeg · 2026-03-09T00:33:59Z

+    /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
+    /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
+    pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) {
+        let state = self.save_current_state();


Would it be possible to avoid saving/storing state here and instead allow the user to control that? It would be useful in cases where many draw_image calls happen in sequence.

I will change it so that we only store the paint transform and paint. I think this is not something that should be upon the user to do though, since I think it's unexpected that draw_image would actually change the render state. Let me know if that works for you.

grebmeg · 2026-03-09T02:00:49Z

+    ///
+    /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
+    /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
+    pub fn draw_image(&mut self, image: ImageSource, rect: &Rect) {


What about keeping the existing image rendering API as it is today, where you use set_paint_transform, set_paint, and fill_rect?

And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag? Then, in encode_current_paint, custom could be set as needed before passing it into encode_into. That flag would have no effect for vello_cpu, so we could remove hybrid_ref and use the same tests for all our renderers.

And then maybe requiring users to opt in to the new, faster path, for example by setting a native sampling flag?

I think the problem with this is that it should be possible to still use the old image rendering mode (if you want to draw normal image patterns), even if you add a flag for that. I guess you could make the flag configurable with a set/unset method, but I'm not sure how much better this is either. 😄 I think it makes sense to add this mode to vello_cpu as well, though, so that we have feature parity. But not sure if it should happen in this PR.

grebmeg · 2026-03-09T02:17:08Z

+        WebGl2RenderingContext::TEXTURE_MAX_LEVEL,
+        0,
+    );
+    texture


Have you had a chance to compare performance in webgl?

Will investigate that.

It doesn't seem to be as straight-forward to get reliable measurements in WebGL, but I'm seeing similar improvements (33ms vs 22ms).

webgl.mp4

LaurenzV · 2026-03-09T08:46:00Z

@grebmeg If I try this fast path for ImageQuality::Low I'm also seeing a drop (from about 19ms to 15ms), so it might be worth doing if we need it for glyph caching. But as mentioned, we probably need to figure out a better API story first 😬

EDIT: See below, I've implemented this now.

LaurenzV · 2026-03-09T10:20:54Z

@grebmeg I've split this into two commits now, the second commit also adds support for a fast-path for nearest-neighbor sampling. If we don't want that, I can revert the last commit.

taj-p · 2026-03-11T18:11:06Z

+        if let Some(EncodedPaint::Image(img)) = self.encoded_paints.get_mut(paint_idx) {
+            img.set_use_gpu_fast_path();
+        }


Sorry - just skimming through this PR to understand how it works from an API perspective.

Is this an invariant? Shouldn't this always be Some(EncodedPaint::Image(img))? If so, we should .unwrap rather than suffer a silent bug

Yes, it should always be. Can change.

taj-p · 2026-03-11T18:27:14Z

+    ///
+    /// This method will respect the current affine transformation in place (set via [`Scene::set_transform`],
+    /// but is not affected by the current paint transform (set via [`Scene::set_paint_transform`] in place.
+    pub fn draw_image(&mut self, image: ImageSource, rect: &Rect, bilinear: bool) {


For my information, why didn't we decide to apply this fast path in fill_rect where bilinear is passed in as a RenderSettings or set_* configuration?

The API can still change, but the problem with passing it to RenderSettings is that it means you cannot draw normal pattern images with extend repeat/reflect/pad anymore. We could add a setter for this so that it can change at runtime, but personally it just seems worse to me from an API perspective. I think a dedicated draw_image method expresses the intent clearer.

harley-canva · 2026-03-19T09:29:27Z

Using texture filtering hardware I think is the right choice, but I have a few overall concerns with this. I question the value of creating a fast branch, when we really always want to be fast. Leveraging the hardware should be the norm and not the exception that requires a flag or special configuration.

clamp/fract/abs/floor should be all we need to get extend/repeat/reflect working with texture filtering. There will be discontinuities at the edge texel, but I think that's an acceptable trade-off. The key is to make sure we sample from the center of the texels at the edge of the texture, not at the edge of the texel, to avoid bilinear filtering with transparent padding.

Are there any other reasons for not doing extend/repeat/reflect that I've overlooked here? Because my impression is that it's viable and that native texture filtering shoudn't be a branch, it should be the replacement for the manual bilinear approach that relies on 4 textureLoad calls.

For the bicubic case, this can also be done more efficiently with bilinear texture filtering. In this case, this GPU Gems article covers the approach. This is something where we may have more reservations about changing the quality, but I'd argue in favor of being high quality and fast across a range of hardware is better place to be. There's a lot of room to reduce the number of registers needed for the uber shader in the texture filtering code just by leveraging the existing sampling hardware.

LaurenzV · 2026-03-19T10:08:48Z

I question the value of creating a fast branch, when we really always want to be fast. Leveraging the hardware should be the norm and not the exception that requires a flag or special configuration.

I agree, but I think in our case, we would internally always use this new API, because I from what I understood we don't have a need for Pad/Reflect/Repeat? From my understanding, we just need to have the ability to draw simple rectangular images, instead of needing to be able to fill arbitrary shapes with image patterns (which is the case where the Pad/Repeat/Reflect API starts to really become relevant).

So basically, the existing functionality stays in-place simply because the current imaging model of Vello needs to have support for it (for example, SVG requires this), but in our case, we would always only use this new API (which does mean we need to add this to vello_cpu as well, to have consistency across the two).

clamp/fract/abs/floor should be all we need to get extend/repeat/reflect working with texture filtering. There will be discontinuities at the edge texel, but I think that's an acceptable trade-off.

I'm not sure I agree that it's an acceptable trade-off. Yes, if you are rendering things like larger images it probably wouldn't be that noticeable, but if you are using the API to render "real" patterns (for example grid lines or other cell patterns, which are often small in size), then I think this could cause some real artifacts, especially when scaling up a lot. For bicubic, this would be even worse because we can sample even further outside of the border. I guess we would need to test but I don't think other renderers like Skia will have artifacts there, so I would be a bit hesitant to go down this route. It would also make cross-testing against vello_cpu much harder, since our current tests trigger many such edge cases.

The key is to make sure we sample from the center of the texels at the edge of the texture, not at the edge of the texel, to avoid bilinear filtering with transparent padding.

If we want to have the behaviour of clamping instead of transparent padding we could probably add that, but obviously we'd need to do the clamping as well in the fragment shader, while right now, we can always just use the raw coordinates. Might not be a big deal, but as we've seen, on low-tier devices, every operation counts. 😦

For the bicubic case, this can also be done more efficiently with bilinear texture filtering.

Yes, the approach in this PR could also be used to accelerate bicubic filtering. But as mentioned, if we try to use this for pad/reflect/repeat as well, this would make artifacts even worse since we will sample even more incorrect pixels at the edge. :(

But I agree it might be worth trying what the performance gains would be (on low-tier devices) if we always use the native sampling path.

harley-canva · 2026-03-19T22:30:02Z

if you are using the API to render "real" patterns (for example grid lines or other cell patterns, which are often small in size), then I think this could cause some real artifacts, especially when scaling up a lot

I may be overlooking something, but I can only really see an issue for this when it comes to "repeat" which has a discontinuity. If I understand correctly, then "pad" is just clamping at the texel center for the edge texel, and "reflect" samples to the center of the edge texel and then samples inward towards the texture. If we want "repeat" to work correctly with texture samplers, it would be possible to solve it ideally in the padding of texture in the atlas, or in worst case, a branch at the discontinuity in the shader. I believe it's entirely possible to do this without artifacts, as long as we are precise with texel coordinates.

in our case, we would always only use this new API (which does mean we need to add this to vello_cpu as well, to have consistency across the two).

I think that's part of my concern with creating a branch. How can this work for vello_cpu? isn't the correct api for vello_cpu the one that already exists?

Might not be a big deal, but as we've seen, on low-tier devices, every operation counts.

I'm not particularly concerned about the cost of clamp/fract/abs/floor ops required to get texture filtering behaving consistently with the current approach, since they are very efficiently implemented in hardware. Switching from textureLoad to textureSample is a huge win in terms of prefetching and reduces a lot of ALU and registers required to do image painting. I would be quite surprised if we could measure a noticeable difference in performance with keeping the extend_mode function in place.

I decided to investigate implementing my suggestions, and ~~I realize my mistake with regards to "reflect", it does have sampling issues at the boundaries that I overlooked~~. EDIT: I got it working #1517

I haven't shifted my view from being concerned with having a fast branch as part of the API. I think this can still be simplified to using textureSample for Pad + Reflect, which can produce correct results, with a smaller set of changes to the shader and webgl/wgpu code and no change to the API.

We can handle Repeat by handling it in the atlas, to be both performant and correct, although that's obviously a larger change. If we did that, we can then remove the branch, and minimize the number of registers needed for bilinear sampling.

taj-p

I've started investigating this issue and comparing it to #1517. My preference is to not add a new API but I'm still learning the nuances of both approaches.

I suspect the fill_rect approach can be made to be correct with some additional work. But, the question is whether it can be as performant as this implementation - I'm not sure.

taj-p · 2026-03-26T19:46:27Z

+        if let Some(EncodedPaint::Image(img)) = self.encoded_paints.borrow_mut().get_mut(paint_idx)
+        {
+            img.set_use_gpu_fast_path();
+        }


This is an invariant right? This must always be Some. In lieu of amending the implementation such that this if isn't required, I think we should .unwrap instead of silently failing (if for whatever reason get_mut returns None).

Ah yes, sorry you already remarked that, but I haven't fixed it yet. 😅

LaurenzV · 2026-03-26T21:06:31Z

My preference is to not add a new API but I'm still learning the nuances of both approaches.

To maybe add another argument in favor of a new API, it seems like at least Skia also has a dedicated drawImage method instead of requiring the user to call drawRect with an image paint. So we wouldn't be alone in having a dedicated API for that. 😄 Though I admittedly haven't looked into how those are handled in the shader.

taj-p · 2026-03-28T20:30:27Z

Hey @LaurenzV and @harley-canva ! cc @grebmeg

I performed an investigation on this PR since it seems that we're a bit unsure about direction. I think the findings are valuable and will help direct efforts about where we're spending the time. See #1547.

I'm wondering whether someone can validate that investigation because the data seems to suggest that the bottleneck isn't the extend_mode calculation, but:

textureDimension querying;
image quality branching; and
per-pixel frag calculations (that should be moved to the vertex and passed as memory).

When you put those conclusions together, if we're introducing this new API for performance only, it may be unnecessary. The above requires some review and discussion.

laurenz-canva changed the title ~~vello_hybrid: Faster bilinear image rendering (POC)~~ vello_hybrid: Faster bilinear image rendering Mar 6, 2026

laurenz-canva requested a review from grebmeg March 6, 2026 08:40

laurenz-canva marked this pull request as ready for review March 6, 2026 08:40

grebmeg reviewed Mar 9, 2026

View reviewed changes

laurenz-canva force-pushed the laurenz/poc_native_bilinear branch from a3cfd9d to f013939 Compare March 9, 2026 10:18

laurenz-canva force-pushed the laurenz/poc_native_bilinear branch 2 times, most recently from 33c5240 to 5e4f4dc Compare March 9, 2026 11:58

taj-p mentioned this pull request Mar 10, 2026

vello_hybrid: Avoid unnecessarily allocating Uint32Arrays #1498

Merged

taj-p reviewed Mar 11, 2026

View reviewed changes

nicoburns added the C-hybrid Applies to the vello_hybrid crate label Mar 11, 2026

grebmeg self-requested a review March 16, 2026 05:52

grebmeg mentioned this pull request Mar 20, 2026

bilinear texture sampling with textureSample instead of textureLoad #1517

Draft

laurenz-canva force-pushed the laurenz/poc_native_bilinear branch 3 times, most recently from 48477ab to 9d6886e Compare March 24, 2026 14:43

LaurenzV force-pushed the laurenz/poc_native_bilinear branch from 9d6886e to ea8d6a7 Compare March 25, 2026 07:35

laurenz-canva force-pushed the laurenz/poc_native_bilinear branch from ea8d6a7 to 1b6129a Compare March 26, 2026 10:52

taj-p reviewed Mar 26, 2026

View reviewed changes

Add support for faster NN and bilinear sampling

83194cb

LaurenzV force-pushed the laurenz/poc_native_bilinear branch from 1b6129a to 83194cb Compare March 27, 2026 13:13

taj-p mentioned this pull request Mar 28, 2026

[Investigation PR]: Improving performance of image sampling in Vello Hybrid #1547

Draft

Conversation

laurenz-canva commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Zoomed out

Zoomed in

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LaurenzV commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LaurenzV commented Mar 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LaurenzV Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harley-canva commented Mar 19, 2026

Uh oh!

LaurenzV commented Mar 19, 2026

Uh oh!

harley-canva commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taj-p left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

LaurenzV commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taj-p commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

laurenz-canva commented Mar 5, 2026 •

edited

Loading

LaurenzV commented Mar 9, 2026 •

edited

Loading

LaurenzV Mar 11, 2026 •

edited

Loading

harley-canva commented Mar 19, 2026 •

edited

Loading

LaurenzV commented Mar 26, 2026 •

edited

Loading

taj-p commented Mar 28, 2026 •

edited

Loading