
Performance improvements in blurs and overlay #20826

Open
masterpiga wants to merge 2 commits into darktable-org:master from masterpiga:blurs

Conversation

@masterpiga
Contributor

@masterpiga masterpiga commented Apr 17, 2026

This PR brings significant performance improvements to the blurs and overlay modules, alleviating the heavy lag spikes observed when manipulating complex compositions at high sensor resolutions. It reworks how the internal overlay pipeline is bounded, speeds up HQ parameter changes, refines the OpenCL implementation of blurs, and ports overlay to native OpenCL.

Co-authored with Claude, who did most of the heavy lifting of the first pass, and Gemini who fixed a couple of bugs that Claude was clueless about.

Structural Changes

Dynamic Bounding via Cairo Cache

  • In master, initializing the internal pipeline for an overlay image unconditionally evaluates the full uncropped sensor (e.g. 24 MP) via dt_dev_image(), resulting in a massive bottleneck.
  • overlay.c now bounds the generated pipeline to the requested viewport size (req_w and req_h). It treats the requested dimensions as an upper bound on the render size and uses cairo_scale to map the result onto smaller interactive viewports, avoiding dt_dev_image() reruns entirely.
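The bounding described above can be sketched roughly as follows. This is only an illustration under assumptions: the type `size2d_t` and the helpers `bound_render_size` and `viewport_scale` are hypothetical names, not taken from overlay.c. Only `req_w`, `req_h` and the use of `cairo_scale` come from the description.

```c
#include <assert.h>

/* Hypothetical helper type, not from darktable. */
typedef struct
{
  int width;
  int height;
} size2d_t;

static int min_int(int a, int b)
{
  return a < b ? a : b;
}

/* Cap the render at the requested size, never above the full image size. */
static size2d_t bound_render_size(int req_w, int req_h, int full_w, int full_h)
{
  assert(req_w > 0 && req_h > 0 && full_w > 0 && full_h > 0);
  size2d_t s;
  s.width = min_int(req_w, full_w);
  s.height = min_int(req_h, full_h);
  return s;
}

/* Uniform factor a caller could pass to cairo_scale() to fit an existing
 * render into a (possibly smaller) viewport while keeping its aspect ratio. */
static double viewport_scale(size2d_t render, int view_w, int view_h)
{
  const double sx = (double)view_w / render.width;
  const double sy = (double)view_h / render.height;
  return sx < sy ? sx : sy;
}
```

With a 6000×4000 source and a 1500×1000 request, the render is bounded to 1500×1000, and a later 750×500 viewport would reuse it with a scale of 0.5 instead of triggering another render.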

OpenCL Porting for Overlay

  • The master pipeline relied exclusively on CPU rendering for overlay. This branch adds an OpenCL path that offloads the per-pixel blend to the GPU (Cairo rendering itself still runs on the CPU).

HQ and Standard Blurs Optimizations

  • The blur algorithms and GPU routines have been reworked; the speedup is most noticeable in High-Quality (HQ) processing.

Observed improvements in interactive usage

These were measured on an M4 Pro with OpenCL enabled. Methodology:

  • Create a composite including two heavily processed images, both including blurs
  • Enable/disable both overlay and blurs repeatedly
  • Move sliders around (overlay scale and rotation, motion blur direction, curvature and offset)
  • Repeat with HQ processing on/off

1. Processing setup

Fast Processing Setup

  • Master (blurs GPU + overlay CPU): ~70.2 seconds total freeze (~34.7s + ~35.5s)
  • Branch (blurs GPU + overlay GPU): ~3.35 seconds total freeze (~1.35s + ~2.00s)
  • Gain: ~21x faster

2. Interaction: Updating Blur Parameters

Fast Mode Edits

  • [full] pipeline (GPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)
  • [preview] pipeline (CPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)

High-Quality (HQ) Mode Edits

  • [full] pipeline (GPU): Master ~7.00s vs Branch ~0.65s (~10.8x faster)
  • [preview] pipeline (CPU): Master ~0.06s vs Branch ~0.03s (~2x faster)

3. Interaction: Updating Overlay Parameters

Translating, rotating, scaling, or adjusting thresholds relies entirely on the cached overlay and cairo_scale transformations.

Fast Mode Edits

  • [full] pipeline: Master (CPU) ~0.07s vs Branch (GPU) ~0.07s (Similar throughput)
  • [preview] pipeline: Master (CPU) ~0.06s vs Branch (CPU) ~0.13s (~2.2x slower due to Cairo sub-sampling map)

High-Quality (HQ) Mode Edits

  • [full] pipeline: Master (CPU) ~0.60s vs Branch (GPU) ~0.38s (~1.6x faster via the OpenCL path)
  • [preview] pipeline: Master (CPU) ~0.03s vs Branch (CPU) ~0.07s

Detailed list of changes

blurs.cl

  • convolve — kernel buffer changed from read_only image2d_t kern to __global const float *kern (GPU can L2-cache it; avoids image format overhead). Added kernel_width param for direct flat indexing.
  • convolve_sparse (new) — takes offsets_x[], offsets_y[], values[] arrays. Only iterates over non-zero kernel entries. For motion blur this is ~1–5% of total entries at large radii → 20–100× fewer texture reads.
  • restore_alpha (new) — composes RGB from blurred buffer + alpha from original, used after dt_gaussian_blur_cl.
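To make the sparse idea concrete, here is a minimal CPU sketch in plain C. It is not the OpenCL kernel from this PR: it works on a single-channel float buffer, clamps at the borders, and uses `x + offset` indexing, all of which are assumptions made for the illustration. Only the `offsets_x`, `offsets_y`, `values` and `kernel_width` names mirror the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Collect the non-zero taps of a dense kernel_width x kernel_width kernel.
 * The output arrays must hold kernel_width * kernel_width entries.
 * Returns the number of taps written. */
static size_t build_sparse(const float *kern, int kernel_width,
                           int *offsets_x, int *offsets_y, float *values)
{
  assert(kernel_width > 0 && (kernel_width % 2) == 1);
  const int r = kernel_width / 2;
  size_t n = 0;
  for(int j = 0; j < kernel_width; j++)
    for(int i = 0; i < kernel_width; i++)
    {
      const float v = kern[(size_t)j * kernel_width + i];
      if(v != 0.0f)
      {
        offsets_x[n] = i - r; /* offset relative to the kernel centre */
        offsets_y[n] = j - r;
        values[n] = v;
        n++;
      }
    }
  return n;
}

static int clamp_int(int v, int lo, int hi)
{
  return v < lo ? lo : (v > hi ? hi : v);
}

/* Apply only the listed taps; cost per pixel is O(count), not O(width^2).
 * Border handling (clamping) and sign convention are assumptions. */
static void convolve_sparse(const float *in, float *out,
                            int width, int height,
                            const int *offsets_x, const int *offsets_y,
                            const float *values, size_t count)
{
  for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
    {
      float acc = 0.0f;
      for(size_t k = 0; k < count; k++)
      {
        const int sx = clamp_int(x + offsets_x[k], 0, width - 1);
        const int sy = clamp_int(y + offsets_y[k], 0, height - 1);
        acc += values[k] * in[(size_t)sy * width + sx];
      }
      out[(size_t)y * width + x] = acc;
    }
}
```

For a thin motion-blur line inside a large square kernel, `count` stays close to the line length, which is where the quoted reduction in reads comes from.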

blurs.c

  • #include "common/gaussian.h" + "develop/tiling.h": enables IIR blur + tiling API
  • IOP_FLAGS_ALLOW_TILING in flags(): pipeline can tile the image
  • Gaussian CPU process(): replaced O(r²) spatial kernel with dt_gaussian_blur_4c — O(n) IIR, 30×+ faster at r=64
  • Lens/motion CPU process(): sparse list of non-zero entries + correct roi_in/roi_out offset handling for tiled operation
  • Gaussian OpenCL process_cl(): dt_gaussian_blur_cl + restore_alpha kernel to preserve pipeline mask
  • Lens/motion OpenCL process_cl(): sparse convolve_sparse kernel (fallback to dense convolve on OOM)
  • init_global / cleanup_global in OpenCL: register convolve_sparse and restore_alpha kernels
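The alpha-restoring step mentioned for the Gaussian path can be illustrated with a small CPU stand-in. This is a sketch under assumptions: it takes interleaved RGBA float buffers rather than the `image2d_t` objects the OpenCL kernel uses, and the C signature is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Take RGB from the blurred buffer and the fourth channel from the original,
 * so that a blur applied to all four channels does not smear the channel
 * that carries the pipeline mask. Buffers are interleaved RGBA floats. */
static void restore_alpha(const float *original, const float *blurred,
                          float *out, size_t npix)
{
  assert(original && blurred && out);
  for(size_t k = 0; k < npix; k++)
  {
    out[k * 4 + 0] = blurred[k * 4 + 0];
    out[k * 4 + 1] = blurred[k * 4 + 1];
    out[k * 4 + 2] = blurred[k * 4 + 2];
    out[k * 4 + 3] = original[k * 4 + 3];
  }
}
```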

overlay.c:264-335

_setup_overlay now takes req_w/req_h. The cache check at overlay.c:386 invalidates if the stored size doesn't match the current pipe output dimensions. For a 24 MP image at 25% preview scale, dt_dev_image now renders ~1.5 MP instead of 24 MP — ~16× less work per interactive repaint.
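The size-based cache check and the arithmetic behind the 16× figure can be sketched as below. The struct `overlay_cache_t` and both function names are hypothetical; the actual fields in overlay.c are not shown in this description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cache record: the size the overlay was last rendered at. */
typedef struct
{
  bool valid;
  int width;
  int height;
} overlay_cache_t;

/* The cached render is reusable only if it matches the current pipe output. */
static bool overlay_cache_usable(const overlay_cache_t *c, int pipe_w, int pipe_h)
{
  assert(c != 0);
  return c->valid && c->width == pipe_w && c->height == pipe_h;
}

/* Pixel count scales with the square of the linear scale factor. */
static double scaled_megapixels(double full_mp, double linear_scale)
{
  return full_mp * linear_scale * linear_scale;
}
```

At a 25% linear scale, a 24 MP image becomes 24 × 0.25² = 1.5 MP, and 24 / 1.5 = 16, which matches the factor quoted above.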

overlay.c:686-730, overlay.cl

New process_cl(): Cairo still runs on CPU, but the per-pixel alpha blend runs on GPU via the overlay_blend kernel. Also registered as program 41 in programs.conf.
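As a rough picture of what a per-pixel blend of a Cairo buffer over pipeline data involves, here is a CPU sketch in C. It is an assumption-laden illustration, not the `overlay_blend` kernel: it relies only on the documented layout of Cairo's `CAIRO_FORMAT_ARGB32` (premultiplied alpha, alpha in the top 8 bits of a native-endian 32-bit value) and ignores color-space conversion and any module parameters. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Composite one premultiplied ARGB32 overlay pixel over one float RGBA
 * pipeline pixel ("over" operator for premultiplied sources). The fourth
 * channel of the pipeline pixel is passed through unchanged. */
static void blend_premultiplied(uint32_t argb, const float base[4], float out[4])
{
  const float a = (float)((argb >> 24) & 0xffu) / 255.0f;
  const float r = (float)((argb >> 16) & 0xffu) / 255.0f;
  const float g = (float)((argb >> 8) & 0xffu) / 255.0f;
  const float b = (float)(argb & 0xffu) / 255.0f;

  /* Premultiplied source: the colour is already scaled by alpha. */
  out[0] = r + (1.0f - a) * base[0];
  out[1] = g + (1.0f - a) * base[1];
  out[2] = b + (1.0f - a) * base[2];
  out[3] = base[3];
}
```

Each output pixel depends only on the two input pixels at the same position, which is why this step maps naturally onto one GPU work item per pixel.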

overlay.c:354-645

_get_overlay_argb() wraps *pbuf directly in a Cairo surface with no malloc+memcpy. overlay_threadsafe is held through cairo_paint() (cache path) to keep the buffer stable, then released before the expensive blend step so other pipes aren't blocked.

Member

@TurboGit TurboGit left a comment

I have tested the regression tests against this.

For blurs all ok. I have just 4 more pixels diff between CPU & GPU and I'll update the baseline count.

For overlay, there is some high diff on CPU. Note that the diff between CPU & GPU is ok.

Test 0160-overlay
      Image mire1.cr2
      CPU & GPU version differ by 302 pixels
      CPU & GPU large difference > 300
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 1.47144
      Avg dE                   : 0.00005
      Std dE                   : 0.00547
      ----------------------------------
      Pixels below avg + 0 std : 99.99 %
      Pixels below avg + 1 std : 99.99 %
      Pixels below avg + 3 std : 99.99 %
      Pixels below avg + 6 std : 99.99 %
      Pixels below avg + 9 std : 99.99 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 7.46740
      Avg dE                   : 0.13420
      Std dE                   : 0.35203
      ----------------------------------
      Pixels below avg + 0 std : 81.84 %
      Pixels below avg + 1 std : 87.91 %
      Pixels below avg + 3 std : 97.53 %
      Pixels below avg + 6 std : 99.66 %
      Pixels below avg + 9 std : 99.94 %
      ----------------------------------
      Pixels above tolerance   : 0.31 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (509247 pixels changed)

Test 0161-overlay-modules-before-after
      Image mire1.cr2
      CPU & GPU version differ by 8595 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 2.11817
      Avg dE                   : 0.00158
      Std dE                   : 0.02934
      ----------------------------------
      Pixels below avg + 0 std : 99.66 %
      Pixels below avg + 1 std : 99.66 %
      Pixels below avg + 3 std : 99.67 %
      Pixels below avg + 6 std : 99.69 %
      Pixels below avg + 9 std : 99.71 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 12.42073
      Avg dE                   : 0.20106
      Std dE                   : 0.51328
      ----------------------------------
      Pixels below avg + 0 std : 78.74 %
      Pixels below avg + 1 std : 89.59 %
      Pixels below avg + 3 std : 97.67 %
      Pixels below avg + 6 std : 99.62 %
      Pixels below avg + 9 std : 99.93 %
      ----------------------------------
      Pixels above tolerance   : 1.18 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (551249 pixels changed)

The diff for 0160 (seems more saturated tones are affected most):

Image

The diff for 0161 (seems more saturated tones are affected most):

Image

Visually comparing the expected and new output, I don't see a difference myself. But I'd like to check with you whether you have an idea about the diff on saturated tones. I don't see the cause looking at the code myself.

Comment thread src/iop/blurs.c
}

void tiling_callback(dt_iop_module_t *self, dt_dev_pixelpipe_iop_t *piece,
const dt_iop_roi_t *roi_in, const dt_iop_roi_t *roi_out,
Member

style: 1 parameter per line

Comment thread src/iop/blurs.c
for(size_t k = 0; k < npix; k++)
out[k * 4 + 3] = in[k * 4 + 3];

return;
Member

For readability/maintainability, I don't like patterns like:

if(C)
{
   A;
   return;
}

B;

To me it should be:

if(C)
{
   A;
}
else
{
   B;
}

So when possible here, avoid an early return.

Comment thread src/iop/blurs.c
const int radius = MAX(roundf(p->radius / scale), 2);

// ── Gaussian fast path: IIR separable filter ──
if(p->type == DT_BLUR_GAUSSIAN)
Member

Likewise, pattern if/else please.

Comment thread data/kernels/blurs.cl
kernel void convolve(read_only image2d_t in,
__global const float *kern,
write_only image2d_t out,
const int width, const int height,
Member

style: 1 parameter per line

Comment thread data/kernels/blurs.cl
__global const int *offsets_y,
__global const float *values,
write_only image2d_t out,
const int width, const int height,
Member

style: likewise

Comment thread data/kernels/blurs.cl
kernel void restore_alpha(read_only image2d_t original,
read_only image2d_t blurred,
write_only image2d_t out,
const int width, const int height)
Member

style: likewise

Collaborator

About the diffs: if I understand the code correctly, we only process a part of the overlay. I don't think that is OK. Many modules might take data from other parts. No?

@TurboGit TurboGit added this to the 5.6 milestone Apr 18, 2026
@TurboGit TurboGit added the scope: performance doing everything the same but faster label Apr 18, 2026
@ralfbrown
Collaborator

I've only taken a brief glance at the blurs CPU code. Looks like you beat me to replacing the Gaussian with the existing dt_gaussian_* functions (runtime independent of blur radius) and implementing a sparse matrix for the others (faster but still quadratic in radius). I was also thinking about whether the motion blur could be rewritten to directly follow the motion line, which would make its runtime linear in the blur radius regardless of orientation.
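The line-following idea could look roughly like the sketch below. It is purely illustrative and makes several assumptions that are not from this PR: a straight motion line, equal weights for every sample, nearest-neighbor sampling, clamped borders, and a single-channel buffer. It ignores the module's curvature and offset parameters, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static int clamp_index(int v, int lo, int hi)
{
  return v < lo ? lo : (v > hi ? hi : v);
}

/* Round to nearest integer without needing libm. */
static int round_to_int(float v)
{
  return (int)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Average 2 * radius + 1 samples taken along the unit direction (dx, dy).
 * Work per pixel grows linearly with radius, whatever the orientation. */
static void motion_blur_line(const float *in, float *out,
                             int width, int height,
                             int radius, float dx, float dy)
{
  assert(radius >= 0 && width > 0 && height > 0);
  const float taps = (float)(2 * radius + 1);
  for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
    {
      float acc = 0.0f;
      for(int t = -radius; t <= radius; t++)
      {
        const int sx = clamp_index(round_to_int((float)x + (float)t * dx), 0, width - 1);
        const int sy = clamp_index(round_to_int((float)y + (float)t * dy), 0, height - 1);
        acc += in[(size_t)sy * width + sx];
      }
      out[(size_t)y * width + x] = acc / taps;
    }
}
```

A sparse kernel covering the same line inside a square window would list about as many taps, so the gain from this form would mainly come from not having to build and store the kernel at all.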

BTW, at least the Gaussian blur can now have its maximum radius substantially increased....

@jenshannoschwalm
Collaborator

Dynamic Bounding via Cairo Cache

* In `master`, initializing the internal pipeline for an overlay image unconditionally evaluates the full uncropped sensor (e.g. 24 MP) via `dt_dev_image()`, resulting in a massive bottleneck.

* `overlay.c` now bounds the generated pipeline to the visible limits of the viewport parameters (`req_w` and `req_h`). It anchors the upper ceiling of the requested dimensions and leverages `cairo_scale` to map to smaller interacting viewports avoiding `dt_dev_image()` reruns entirely.

@masterpiga I checked this again and am sure this is not a good idea at all. Of course it's a bottleneck, but I would say for very good reasons. Many modules (examples would be opposed highlights, hazeremoval ...) require full input data for correct output (probably the AI or you were not aware of that :-), so the CPU vs. expected diffs are just a first hint that there is something wrong with this approach. No?

I would suggest splitting this PR (or removing the overlay part) for safe testing; the blur improvements are impressive.
