Performance improvements in blurs and overlay#20826
Performance improvements in blurs and overlay#20826masterpiga wants to merge 2 commits intodarktable-org:masterfrom
blurs and overlay#20826Conversation
TurboGit
left a comment
There was a problem hiding this comment.
I have tested the regression tests against this.
For blurs all ok. I have just 4 more pixels diff between CPU & GPU and I'll update the baseline count.
For overlay, there is some high diff on CPU. Note that the diff between CPU & GPU is ok.
Test 0160-overlay
Image mire1.cr2
CPU & GPU version differ by 302 pixels
CPU & GPU large difference > 300
CPU vs. GPU report :
----------------------------------
Max dE : 1.47144
Avg dE : 0.00005
Std dE : 0.00547
----------------------------------
Pixels below avg + 0 std : 99.99 %
Pixels below avg + 1 std : 99.99 %
Pixels below avg + 3 std : 99.99 %
Pixels below avg + 6 std : 99.99 %
Pixels below avg + 9 std : 99.99 %
----------------------------------
Pixels above tolerance : 0.00 %
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 7.46740
Avg dE : 0.13420
Std dE : 0.35203
----------------------------------
Pixels below avg + 0 std : 81.84 %
Pixels below avg + 1 std : 87.91 %
Pixels below avg + 3 std : 97.53 %
Pixels below avg + 6 std : 99.66 %
Pixels below avg + 9 std : 99.94 %
----------------------------------
Pixels above tolerance : 0.31 %
FAILS: image visually changed
see diff.png for visual difference
(509247 pixels changed)
Test 0161-overlay-modules-before-after
Image mire1.cr2
CPU & GPU version differ by 8595 pixels
CPU vs. GPU report :
----------------------------------
Max dE : 2.11817
Avg dE : 0.00158
Std dE : 0.02934
----------------------------------
Pixels below avg + 0 std : 99.66 %
Pixels below avg + 1 std : 99.66 %
Pixels below avg + 3 std : 99.67 %
Pixels below avg + 6 std : 99.69 %
Pixels below avg + 9 std : 99.71 %
----------------------------------
Pixels above tolerance : 0.00 %
Expected CPU vs. current CPU report :
----------------------------------
Max dE : 12.42073
Avg dE : 0.20106
Std dE : 0.51328
----------------------------------
Pixels below avg + 0 std : 78.74 %
Pixels below avg + 1 std : 89.59 %
Pixels below avg + 3 std : 97.67 %
Pixels below avg + 6 std : 99.62 %
Pixels below avg + 9 std : 99.93 %
----------------------------------
Pixels above tolerance : 1.18 %
FAILS: image visually changed
see diff.png for visual difference
(551249 pixels changed)
The diff for 0160 (seems more saturated tones are affected most):
The diff for 0161 (seems more saturated tones are affected most):
Visually looking at expected and new output I don't see difference myself. But I'd like to check with you if you have an idea about the diff on saturated tones? I don't see this looking at the code myself.
| } | ||
|
|
||
| void tiling_callback(dt_iop_module_t *self, dt_dev_pixelpipe_iop_t *piece, | ||
| const dt_iop_roi_t *roi_in, const dt_iop_roi_t *roi_out, |
| for(size_t k = 0; k < npix; k++) | ||
| out[k * 4 + 3] = in[k * 4 + 3]; | ||
|
|
||
| return; |
There was a problem hiding this comment.
I don't for readability/maintainability like pattern like:
if(C)
{
A;
return;
}
B;
To me it should be:
if(C)
{
A;
}
else
{
B;
}
So when possible here, avoid an early return.
| const int radius = MAX(roundf(p->radius / scale), 2); | ||
|
|
||
| // ── Gaussian fast path: IIR separable filter ── | ||
| if(p->type == DT_BLUR_GAUSSIAN) |
There was a problem hiding this comment.
Likewise, pattern if/else please.
| kernel void convolve(read_only image2d_t in, | ||
| __global const float *kern, | ||
| write_only image2d_t out, | ||
| const int width, const int height, |
| __global const int *offsets_y, | ||
| __global const float *values, | ||
| write_only image2d_t out, | ||
| const int width, const int height, |
| kernel void restore_alpha(read_only image2d_t original, | ||
| read_only image2d_t blurred, | ||
| write_only image2d_t out, | ||
| const int width, const int height) |
There was a problem hiding this comment.
About the diffs, if i understand the code correctly we only process a part of the overlay. I dont think that is ok. Many modules might take data from other parts. No?
|
I've only taken a brief glance at the blurs CPU code. Looks like you beat me to replacing the Gaussian with the existing dt_gaussian_* functions (runtime independent of blur radius) and implementing a sparse matrix for the others (faster but still quadratic in radius). I was also thinking about whether the motion blur could be rewritten to directly follow the motion line, which would make its runtime linear in the blur radius regardless of orientation. BTW, at least the Gaussian blur can now have its maximum radius substantially increased.... |
@masterpiga i checked this again and am sure, this is not a good idea at all. Of course it's a bottleneck but i would say for very good reasons. Many modules (examples would be opposed highlights, hazeremoval ...) require full input data (probably the AI or you were not aware of that :-) for correct output so the cpu vs expected diffs are just a first hint there is something wrong with this approach. No? I would suggest to split this PR (or remove the overlay part) for safe testing - the blur improvements are impressive. |
This PR brings significant performance improvements for
blursandoverlaymodules, significantly alleviating the heavy lag spikes observed when manipulating complex compositions at high sensor resolutions. It revamps the bounding constraints of the internal overlay structures, accelerates HQ parameter modifications, refines the OpenCL implementation ofblursand portsoverlayto native OpenCL.Co-authored with Claude, who did most of the heavy lifting of the first pass, and Gemini who fixed a couple of bugs that Claude was clueless about.
Structural Changes
Dynamic Bounding via Cairo Cache
master, initializing the internal pipeline for an overlay image unconditionally evaluates the full uncropped sensor (e.g. 24 MP) viadt_dev_image(), resulting in a massive bottleneck.overlay.cnow bounds the generated pipeline to the visible limits of the viewport parameters (req_wandreq_h). It anchors the upper ceiling of the requested dimensions and leveragescairo_scaleto map to smaller interacting viewports avoidingdt_dev_image()reruns entirely.OpenCL Porting for Overlay
masterpipeline relied exclusively on CPU rendering foroverlay. This branch implements an OpenCL path, directly offloading evaluation sequences to the GPU.HQ and Standard Blurs Optimizations
blursalgorithms and GPU routines have been refined, delivering very noticeable acceleration scaling particularly during High-Quality processing operations.Observed improvements in interactive usage
These have been computed on an M4 Pro with OpenCL enabled. Methodology:
overlayandblursrepeatedly1. Processing setup
Fast Processing Setup
2. Interaction: Updating Blur Parameters
Fast Mode Edits
[full]pipeline (GPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)[preview]pipeline (CPU): Master ~0.06s vs Branch ~0.04s (~1.5x faster)High-Quality (HQ) Mode Edits
[full]pipeline (GPU): Master ~7.00s vs Branch ~0.65s (~10.7x faster!)[preview]pipeline (CPU): Master ~0.06s vs Branch ~0.03s (~2x faster)3. Interaction: Updating Overlay Parameters
Translating, rotating, scaling, or adjusting thresholds relies entirely on
cairo_scalecaching transformations.Fast Mode Edits
[full]pipeline: Master (CPU) ~0.07s vs Branch (GPU) ~0.07s (Similar throughput)[preview]pipeline: Master (CPU) ~0.06s vs Branch (CPU) ~0.13s (Fractionally heavier due to Cairo sub-sampling map)High-Quality (HQ) Mode Edits
[full]pipeline: Master (CPU) ~0.60s vs Branch (GPU) ~0.38s (~1.5x faster via OpenCL pathing)[preview]pipeline: Master (CPU) ~0.03s vs Branch (CPU) ~0.07sDetail list of changes
blurs.clconvolve— kernel buffer changed fromread_only image2d_t kernto__global const float *kern(GPU can L2-cache it; avoids image format overhead). Addedkernel_widthparam for direct flat indexing.convolve_sparse(new) — takesoffsets_x[],offsets_y[],values[]arrays. Only iterates over non-zero kernel entries. For motion blur this is ~1–5% of total entries at large radii → 20–100× fewer texture reads.restore_alpha(new) — composes RGB from blurred buffer + alpha from original, used afterdt_gaussian_blur_cl.blurs.c#include "common/gaussian.h"+"develop/tiling.h": enables IIR blur + tiling APIIOP_FLAGS_ALLOW_TILINGinflags(): pipeline can tile the imageprocess(): replaced O(r²) spatial kernel with dt_gaussian_blur_4c — O(n) IIR, 30×+ faster at r=64process(): sparse list of non-zero entries + correct roi_in/roi_out offset handling for tiled operationprocess_cl(): dt_gaussian_blur_cl + restore_alpha kernel to preserve pipeline maskprocess_cl(): sparse convolve_sparse kernel (fallback to dense convolve on OOM)init_global/cleanup_globalin OpenCL: register convolve_sparse and restore_alpha kernelsoverlay.c:264-335_setup_overlaynow takesreq_w/req_h. The cache check at overlay.c:386 invalidates if the stored size doesn't match the current pipe output dimensions. For a 24 MP image at 25% preview scale,dt_dev_imagenow renders ~1.5 MP instead of 24 MP — ~16× less work per interactive repaint.overlay.c:686-730,overlay.clNew
process_cl(): Cairo still runs on CPU, but the per-pixel alpha blend runs on GPU via theoverlay_blendkernel. Also registered as program 41 inprograms.conf.overlay.c:354-645_get_overlay_argb()wraps*pbufdirectly in a Cairo surface with nomalloc+memcpy.overlay_threadsafeis held throughcairo_paint()(cache path) to keep the buffer stable, then released before the expensive blend step so other pipes aren't blocked.