Performance improvements in `blurs` and `overlay` by masterpiga · Pull Request #20826 · darktable-org/darktable

masterpiga · 2026-04-17T14:50:56Z

This PR brings significant performance improvements for blurs and overlay modules, significantly alleviating the heavy lag spikes observed when manipulating complex compositions at high sensor resolutions. It revamps the bounding constraints of the internal overlay structures, accelerates HQ parameter modifications, refines the OpenCL implementation of blurs and ports overlay to native OpenCL.

Co-authored with Claude, who did most of the heavy lifting of the first pass, and Gemini who fixed a couple of bugs that Claude was clueless about.

Structural Changes

Dynamic Bounding via Cairo Cache

In master, initializing the internal pipeline for an overlay image unconditionally evaluates the full uncropped sensor (e.g. 24 MP) via dt_dev_image(), resulting in a massive bottleneck.
overlay.c now bounds the generated pipeline to the visible limits of the viewport parameters (req_w and req_h). It anchors the upper ceiling of the requested dimensions and leverages cairo_scale to map to smaller interacting viewports avoiding dt_dev_image() reruns entirely.

OpenCL Porting for Overlay

The master pipeline relied exclusively on CPU rendering for overlay. This branch implements an OpenCL path, directly offloading evaluation sequences to the GPU.

HQ and Standard Blurs Optimizations

The blurs algorithms and GPU routines have been refined, delivering very noticeable acceleration scaling particularly during High-Quality processing operations.

Observed improvements in interactive usage

These have been computed on an M4 Pro with OpenCL enabled. Methodology:

Create a composite including two heavily processed images, both including blurs
Enable/disable both overlay and blurs repeatedly
Move sliders around (overlay scale and rotation, motion blur direction, curvature and offset)
Repeat with HQ processing on/off

1. Processing setup

Fast Processing Setup

Master (blurs GPU + overlay CPU): ~70.2 seconds total freeze (~34.7s + ~35.5s)
Branch (blurs GPU + overlay GPU): ~3.35 seconds total freeze (~1.35s + ~2.00s)
Gain: ~21x faster

2. Interaction: Updating Blur Parameters

Fast Mode Edits

[full] pipeline (GPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)
[preview] pipeline (CPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)

High-Quality (HQ) Mode Edits

[full] pipeline (GPU): Master ~7.00s vs Branch ~0.65s (~10.7x faster!)
[preview] pipeline (CPU): Master ~0.06s vs Branch ~0.03s (~2x faster)

3. Interaction: Updating Overlay Parameters

Translating, rotating, scaling, or adjusting thresholds relies entirely on cairo_scale caching transformations.

Fast Mode Edits

[full] pipeline: Master (CPU) ~0.07s vs Branch (GPU) ~0.07s (Similar throughput)
[preview] pipeline: Master (CPU) ~0.06s vs Branch (CPU) ~0.13s (Fractionally heavier due to Cairo sub-sampling map)

High-Quality (HQ) Mode Edits

[full] pipeline: Master (CPU) ~0.60s vs Branch (GPU) ~0.38s (~1.5x faster via OpenCL pathing)
[preview] pipeline: Master (CPU) ~0.03s vs Branch (CPU) ~0.07s

Detail list of changes

`blurs.cl`

convolve — kernel buffer changed from read_only image2d_t kern to __global const float *kern (GPU can L2-cache it; avoids image format overhead). Added kernel_width param for direct flat indexing.
convolve_sparse (new) — takes offsets_x[], offsets_y[], values[] arrays. Only iterates over non-zero kernel entries. For motion blur this is ~1–5% of total entries at large radii → 20–100× fewer texture reads.
restore_alpha (new) — composes RGB from blurred buffer + alpha from original, used after dt_gaussian_blur_cl.

`blurs.c`

#include "common/gaussian.h" + "develop/tiling.h": enables IIR blur + tiling API
IOP_FLAGS_ALLOW_TILING in flags(): pipeline can tile the image
Gaussian CPU process(): replaced O(r²) spatial kernel with dt_gaussian_blur_4c — O(n) IIR, 30×+ faster at r=64
Lens/motion CPU process(): sparse list of non-zero entries + correct roi_in/roi_out offset handling for tiled operation
Gaussian OpenCL process_cl(): dt_gaussian_blur_cl + restore_alpha kernel to preserve pipeline mask
Lens/motion OpenCL process_cl(): sparse convolve_sparse kernel (fallback to dense convolve on OOM)
init_global / cleanup_global in OpenCL: register convolve_sparse and restore_alpha kernels

`overlay.c:264-335`

_setup_overlay now takes req_w/req_h. The cache check at overlay.c:386 invalidates if the stored size doesn't match the current pipe output dimensions. For a 24 MP image at 25% preview scale, dt_dev_image now renders ~1.5 MP instead of 24 MP — ~16× less work per interactive repaint.

`overlay.c:686-730`, `overlay.cl`

New process_cl(): Cairo still runs on CPU, but the per-pixel alpha blend runs on GPU via the overlay_blend kernel. Also registered as program 41 in programs.conf.

`overlay.c:354-645`

_get_overlay_argb() wraps *pbuf directly in a Cairo surface with no malloc+memcpy. overlay_threadsafe is held through cairo_paint() (cache path) to keep the buffer stable, then released before the expensive blend step so other pipes aren't blocked.

TurboGit

I have tested the regression tests against this.

For blurs all ok. I have just 4 more pixels diff between CPU & GPU and I'll update the baseline count.

For overlay, there is some high diff on CPU. Note that the diff between CPU & GPU is ok.

Test 0160-overlay
      Image mire1.cr2
      CPU & GPU version differ by 302 pixels
      CPU & GPU large difference > 300
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 1.47144
      Avg dE                   : 0.00005
      Std dE                   : 0.00547
      ----------------------------------
      Pixels below avg + 0 std : 99.99 %
      Pixels below avg + 1 std : 99.99 %
      Pixels below avg + 3 std : 99.99 %
      Pixels below avg + 6 std : 99.99 %
      Pixels below avg + 9 std : 99.99 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 7.46740
      Avg dE                   : 0.13420
      Std dE                   : 0.35203
      ----------------------------------
      Pixels below avg + 0 std : 81.84 %
      Pixels below avg + 1 std : 87.91 %
      Pixels below avg + 3 std : 97.53 %
      Pixels below avg + 6 std : 99.66 %
      Pixels below avg + 9 std : 99.94 %
      ----------------------------------
      Pixels above tolerance   : 0.31 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (509247 pixels changed)

Test 0161-overlay-modules-before-after
      Image mire1.cr2
      CPU & GPU version differ by 8595 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.11817
      Avg dE                   : 0.00158
      Std dE                   : 0.02934
      ----------------------------------
      Pixels below avg + 0 std : 99.66 %
      Pixels below avg + 1 std : 99.66 %
      Pixels below avg + 3 std : 99.67 %
      Pixels below avg + 6 std : 99.69 %
      Pixels below avg + 9 std : 99.71 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 12.42073
      Avg dE                   : 0.20106
      Std dE                   : 0.51328
      ----------------------------------
      Pixels below avg + 0 std : 78.74 %
      Pixels below avg + 1 std : 89.59 %
      Pixels below avg + 3 std : 97.67 %
      Pixels below avg + 6 std : 99.62 %
      Pixels below avg + 9 std : 99.93 %
      ----------------------------------
      Pixels above tolerance   : 1.18 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (551249 pixels changed)

The diff for 0160 (seems more saturated tones are affected most):

The diff for 0161 (seems more saturated tones are affected most):

Visually looking at expected and new output I don't see difference myself. But I'd like to check with you if you have an idea about the diff on saturated tones? I don't see this looking at the code myself.

TurboGit · 2026-04-18T17:15:58Z

 }

+void tiling_callback(dt_iop_module_t *self, dt_dev_pixelpipe_iop_t *piece,
+                     const dt_iop_roi_t *roi_in, const dt_iop_roi_t *roi_out,


style: 1 parameter per line

TurboGit · 2026-04-18T17:18:02Z

+    for(size_t k = 0; k < npix; k++)
+      out[k * 4 + 3] = in[k * 4 + 3];
+
+    return;


I don't for readability/maintainability like pattern like:

if(C) { A; return; } B;

To me it should be:

if(C) { A; } else { B; }

So when possible here, avoid an early return.

TurboGit · 2026-04-18T17:18:35Z

  const int radius = MAX(roundf(p->radius / scale), 2);
+
+  // ── Gaussian fast path: IIR separable filter ──
+  if(p->type == DT_BLUR_GAUSSIAN)


Likewise, pattern if/else please.

TurboGit · 2026-04-18T17:21:30Z

+kernel void convolve(read_only image2d_t in,
+                     __global const float *kern,
+                     write_only image2d_t out,
+                     const int width, const int height,


style: 1 parameter per line

TurboGit · 2026-04-18T17:21:43Z

+                             __global const int *offsets_y,
+                             __global const float *values,
+                             write_only image2d_t out,
+                             const int width, const int height,


style: likewise

TurboGit · 2026-04-18T17:21:56Z

+kernel void restore_alpha(read_only image2d_t original,
+                          read_only image2d_t blurred,
+                          write_only image2d_t out,
+                          const int width, const int height)


style: likewise

About the diffs, if i understand the code correctly we only process a part of the overlay. I dont think that is ok. Many modules might take data from other parts. No?

ralfbrown · 2026-04-19T12:46:26Z

I've only taken a brief glance at the blurs CPU code. Looks like you beat me to replacing the Gaussian with the existing dt_gaussian_* functions (runtime independent of blur radius) and implementing a sparse matrix for the others (faster but still quadratic in radius). I was also thinking about whether the motion blur could be rewritten to directly follow the motion line, which would make its runtime linear in the blur radius regardless of orientation.

BTW, at least the Gaussian blur can now have its maximum radius substantially increased....

jenshannoschwalm · 2026-05-05T06:07:00Z

Dynamic Bounding via Cairo Cache

* In `master`, initializing the internal pipeline for an overlay image unconditionally evaluates the full uncropped sensor (e.g. 24 MP) via `dt_dev_image()`, resulting in a massive bottleneck.

* `overlay.c` now bounds the generated pipeline to the visible limits of the viewport parameters (`req_w` and `req_h`). It anchors the upper ceiling of the requested dimensions and leverages `cairo_scale` to map to smaller interacting viewports avoiding `dt_dev_image()` reruns entirely.

@masterpiga i checked this again and am sure, this is not a good idea at all. Of course it's a bottleneck but i would say for very good reasons. Many modules (examples would be opposed highlights, hazeremoval ...) require full input data (probably the AI or you were not aware of that :-) for correct output so the cpu vs expected diffs are just a first hint there is something wrong with this approach. No?

I would suggest to split this PR (or remove the overlay part) for safe testing - the blur improvements are impressive.

masterpiga added 2 commits April 17, 2026 15:13

Efficiency improvements in blur.

913764c

Efficiency improvements in overlay.

4eca26f

TurboGit requested changes Apr 18, 2026

View reviewed changes

TurboGit added this to the 5.6 milestone Apr 18, 2026

TurboGit added the scope: performance doing everything the same but faster label Apr 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements in `blurs` and `overlay`#20826

Performance improvements in `blurs` and `overlay`#20826
masterpiga wants to merge 2 commits intodarktable-org:masterfrom
masterpiga:blurs

masterpiga commented Apr 17, 2026 •

edited

Loading

Uh oh!

TurboGit left a comment

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

TurboGit Apr 18, 2026

Uh oh!

jenshannoschwalm Apr 18, 2026

Uh oh!

ralfbrown commented Apr 19, 2026

Uh oh!

jenshannoschwalm commented May 5, 2026

Dynamic Bounding via Cairo Cache

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

masterpiga commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Structural Changes

Dynamic Bounding via Cairo Cache

OpenCL Porting for Overlay

HQ and Standard Blurs Optimizations

Observed improvements in interactive usage

1. Processing setup

2. Interaction: Updating Blur Parameters

3. Interaction: Updating Overlay Parameters

Detail list of changes

blurs.cl

blurs.c

overlay.c:264-335

overlay.c:686-730, overlay.cl

overlay.c:354-645

Uh oh!

TurboGit left a comment

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

TurboGit Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

jenshannoschwalm Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

ralfbrown commented Apr 19, 2026

Uh oh!

jenshannoschwalm commented May 5, 2026

Dynamic Bounding via Cairo Cache

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

masterpiga commented Apr 17, 2026 •

edited

Loading

`blurs.cl`

`blurs.c`

`overlay.c:264-335`

`overlay.c:686-730`, `overlay.cl`

`overlay.c:354-645`