Metal support for macOS #20817
Forgot to mention that this is for Apple Silicon Macs only. Intel Macs don't have an integrated GPU, so we would have to check for different external GPUs, which would make all of this much more complicated. The last GitHub runner image for
unfortunately not, since that's needed for old Intel Macs (even those will only be supported with MacPorts-based builds if GitHub retires the Intel runners ...)
Btw, it compiles fine (after xcodebuild -downloadComponent MetalToolchain) and runs fine.
Yes, with Xcode 26 Apple decided to no longer include the Metal toolchain; it needs to be downloaded separately.
Had a first quick look at pixelpipe_hb changes, looks pretty safe.
Right, and the code in
I hope so. All the memory handling is checked in
Again, I hope so. I have nearly no knowledge of the pixelpipe handling, which is why I needed Claude's help for that part. On Linux and Windows these changes should ideally have no effect. So on macOS we would start with just the exposure module and give it a field test.
Forgive me for the ignorant question. Does this mean that eventually each module will need three distinct implementations? CPU, OpenCL and Metal?
That's the conclusion, yes.
Ouch! I am no expert in GPU coding, but I know a thing or two about software engineering and code maintenance, and this does not seem like a very sustainable approach, especially considering that OpenCL, IIUC, is not exactly the platform of the future. I am sure you have already considered the alternatives, but what about deprecating OpenCL instead and writing GPU code in GLSL/HLSL targeting Vulkan? On Windows and Linux it would run natively on Vulkan. On macOS it would be piped through MoltenVK, which maps Vulkan API calls to Metal API calls in real time. This would condense the GPU path into a single, modern, heavily supported graphics API. It seems a more "future-proof" approach than the one suggested here. It would require rewriting all OpenCL code into Vulkan compute shaders, but this would be a one-off effort that can probably be by and large automated with LLMs.
Me too :) For macOS we are just at the beginning of this route, so it would be ok to stop here. |
An even more radical (but superior) alternative would be switching to Halide. It is an open-source DSL specifically designed for high-performance image processing and is used heavily by Adobe, Google, and Instagram, among others. In Halide you write the algorithm once. You then write a schedule (how to compute it: loop unrolling, threading, GPU utilization) separately. The Halide compiler takes a single algorithm definition and compiles it natively to x86 (CPU), ARM, OpenCL, CUDA, and Apple Metal. You would literally write only one algorithm, and Halide would generate the optimized C++ and GPU kernel code for all targets. Only one code path, instead of 3 (or 2).
That all sounds good, but it would require a huge redesign of the whole pipeline. The approach here is an addition to the already existing pipeline and can now be migrated step by step to the other modules. Everything keeps working as it is. And converting an OpenCL kernel source to Metal is also an easy task for an LLM.
Yes, absolutely. It's the maintenance and code bloat that scare me. And since my understanding is that darktable should eventually move on from OpenCL, it would make sense to do the effort only once.
Honestly, I don't think we can replace all OpenCL code with Metal variants in any manageable time, so I somewhat doubt this PR is a good plan. About replacing all current code using Halide: not with me working on that :-) And that leaves out the question of supporting legacy code, simply a nightmare. Currently I don't think that OpenCL is in bad shape; the 1.2 version is not a big thing for now. There are currently just a few workarounds.
If transitioning away from OpenCL is not something that will happen in the short/medium term, then another alternative would be using an open-source toolchain like clspv (OpenCL C to SPIR-V) combined with SPIRV-Cross (SPIR-V to Metal Shading Language). Devs would still write and maintain the GPU and OpenCL kernels, so no change there. The build system would compile them to SPIR-V and then transpile SPIR-V to Metal. Mac users would still get native Metal performance without darktable developers having to learn or maintain MSL.
No need to, and that is my intention. Then the OpenCL kernel of the next module can be converted to Metal, along with the corresponding process_metal() function. Step by step, module by module... That's how I imagine the progression.
Yes, and nice. BUT it is no standard, nothing we can be sure of.
Those are very valid points; maintaining 2 different GPU code paths is insane for such a small group of core developers, given that we do not even have people in that group who are experienced with the new GPU code path (Metal). Instead of writing another GPU code path in a proprietary GPU framework for only one target platform (macOS), I'd rather propose to start the migration now to a well-established GPU programming framework (GLSL/HLSL shaders targeting Vulkan) which is going to work for all our target platforms in the long run. We could keep this code path only for macOS for the time being and slowly retire our OpenCL implementations, step by step, also for the other platforms.
I also agree about the maintenance nightmare this would introduce. At some point we may want to find a common framework that can handle a single source and multiple targets: CPU, OpenCL, Metal, Vulkan... I have also heard about Khronos (I just know nothing about it).
It would help to use Metal variants just for the most performance-hungry modules first; that's where the effort gives the most return.
Ok, it was worth a try, but what happens here? The attempt to implement Metal support in small work units is being talked down by counter-proposals that require a complete redesign of the pixelpipe processing. Who will take on that task? A human developer? Or will the future of darktable lie in the hands of AI coding agents? Before this discussion drifts off course like the ones on pixls.us, we should probably close this PR.
I think the refactoring of the pixelpipe code is worth the effort anyway because it makes that complicated code clearer and more readable.
Maybe do the refactoring first in a separate PR and decide on Metal integration based on that after full testing? Honestly, I don't know how performant the Mac OpenCL code is, so checking that on some critical code might be worthwhile. There is one point we have to remember: if we mix Metal and OpenCL code, that would break passing a cl image/buffer from one module to the next (the output image is used as input by the next module). We would have to "convert", and that would certainly cost some performance. So calling a Metal module would only be beneficial if the algorithm's performance gain is larger than the loss from conversions.
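The trade-off above can be sketched as a simple cost comparison. This is an illustrative model only; the function name and timing parameters are hypothetical and not part of darktable:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch: dispatching a module to Metal only pays off if the
   kernel speedup outweighs the cost of converting the OpenCL image/buffer
   at the module boundaries. All names are hypothetical, not darktable API. */
static bool metal_worthwhile(double t_opencl_ms,      /* module runtime on the OpenCL path */
                             double t_metal_ms,       /* module runtime on the Metal path */
                             double t_convert_in_ms,  /* cl buffer -> Metal conversion */
                             double t_convert_out_ms) /* Metal -> cl buffer conversion */
{
  /* Metal wins only if kernel time plus both conversions beats OpenCL. */
  return t_metal_ms + t_convert_in_ms + t_convert_out_ms < t_opencl_ms;
}
```

For example, a module that runs in 40 ms on Metal versus 100 ms on OpenCL is still a win with 2 x 10 ms of conversion overhead, while a 95 ms Metal runtime is not.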
That sounds good, but it cannot be done by me. I have very little knowledge of the pixelpipe logic.
The main reason for this PR is the fear that Apple will one day completely abandon OpenCL. That would kick darktable off macOS.
Fully agreed, that's why I asked (especially you) for help and support here.
Ok, there is no time pressure :-) I will prepare the refactoring PR (possibly including the mask-cache thing) as a first step.
If Metal is available, working, and enabled for a module, use it. Otherwise process the usual way (OpenCL, CPU).
The macOS OpenCL 1.2 implementation isn't exactly known for its performance. But simply comparing Geekbench numbers won't be helpful; we need a quite performance-hungry module for a darktable benchmark.
To have a more performance-hungry module to test, I have now converted the diffuse or sharpen module and made some comparisons with OpenCL. First, I created two presets:
Both give unusable results of course, just for performance measuring.
local contrast 50, OpenCL
local contrast 50, Metal
bloom 20, OpenCL
bloom 20, Metal
@MStraeten: Can you please test on your system?
local contrast, normal preset with iterations set to 50: OpenCL speed is quite a bit better on my system
Can you please check again? The latest commit gives ~50% improvement on my system.
So no change on your system? Did you rebuild with the latest commit? Exporting the image with OpenCL I get: with Metal:
After git reset --hard and pulling again: same results.
I had an in-depth look at pixelpipe_hb.c and that looks ok and safe with existing code, so no breakage is expected. Any refactoring seems pretty difficult atm. There seem to be some relevant design restrictions.
I am sure there is a lot of work still to be done :-)
In the end that would mean falling back to the CPU if we assume that OpenCL will not be available on Macs some day.
Took an image, reset the history and applied only the diffuse local contrast normal preset with 50 iterations.
I am quite unsure whether we really should continue this PR. All the things already said above about maintaining 3 processing routes will become a nightmare. What if someone changes an algorithm in one of the modules? We would not only have to remember to maintain the CPU and OpenCL paths; there would be the Metal code as well to keep in mind.
Plus colorspace handling.
@MStraeten: Can you please try again? Now with tiling. Just to get things working and see if all this stuff is worth the effort. After that we can decide on how to proceed.
Similar numbers, unfortunately: log.txt. Btw: the suggestion 'change to threadW=32, threadH=8' makes it even worse.
I don't know Metal, but with OpenCL the workgroup dimensions have a big impact, even on kernels without locals. Aligning to more data horizontally is a common trick for locality.
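As a small illustration of how workgroup shape interacts with dispatch, here is the usual round-up of the global work size to a multiple of the workgroup size; the helper name is generic, not taken from darktable:

```c
#include <assert.h>
#include <stddef.h>

/* Round a global work size up to the next multiple of the workgroup size,
   as needed when enqueuing an NDRange with an explicit local size.
   A wide, flat shape such as 32x8 reads more consecutive pixels per row,
   which can help memory locality on image kernels. */
static size_t round_up(size_t value, size_t multiple)
{
  return ((value + multiple - 1) / multiple) * multiple;
}
```

With a 4023x3001 image and a 32x8 workgroup, the dispatch would cover 4032x3008 work items, and the kernel guards against the out-of-bounds remainder.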
@zisoft @jenshannoschwalm @MStraeten I understand that adding Metal support to the pipeline would introduce a significant amount of maintenance work. Still, I wonder whether the effort could be reduced by focusing primarily on the modules that appear in the 'quick access panel', and on the 'color equalizer' in particular, which has a major performance impact. When it is enabled, the exposure slider lags by up to three seconds on an M2, whereas with the equalizer disabled the slider responds almost instantly. Other modules, such as color balance, tone equalizer or local contrast, don't cause nearly the same slowdown.
@zisoft @MStraeten can you confirm the bad performance? @haecksenwerk I am not sure about what exactly you observe. We need at least a log with
@haecksenwerk: please open a new issue for that
Never mind. I probably jumped the gun with my comment, but I was excited to see this PR and the possibility of some improvements coming to the macOS version. Given that you mentioned potentially prioritizing modules that could be tackled first, I just cranked out my view as a darktable user.
Since exposure is quite early in the pipe, on change the whole pipeline up to the color equalizer needs to be reprocessed, so not really surprising. But without a log and xmp it's hard to check the root cause.
Here are logs. You might try to find out which changes were made with color equalizer on/off. In my opinion background processes on the system have more impact ;) I don't see issues with exposure; overall processing time just differs based on the subsequent modules in the pipe...
After removing both ./cache/darktable and ./config/darktable, the heavy latency I was experiencing is gone. Sorry for stirring things up unnecessarily. |
As you might know, Apple deprecated OpenCL on macOS several years ago (in macOS 10.14).
It is recommended to transition to Metal (see further reading).
This is work I began almost 2 years ago, and after a long pause I finally got it to this working stage (with the help of Claude for the complicated part in pixelpipe_hb.c). This PR implements the following on macOS:
Metal kernel sources in ./data/metal, installed to <install_dir>/share/darktable/metal
-d metal for logging
At runtime we have the following processing logic:
For an iop module, check if the module has a process_metal() function and the corresponding Metal kernel. If yes, use it. If not, try OpenCL, with fallback to CPU. I have started with probably the simplest kernel, the exposure module.
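That path selection can be sketched roughly like this; the types and names below are simplified placeholders, not the actual pixelpipe code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified sketch of the per-module path selection: Metal first if the
   module provides a process_metal() and its kernel, then OpenCL, then CPU.
   Types and names are placeholders, not darktable's real structures. */
typedef enum { PATH_METAL, PATH_OPENCL, PATH_CPU } process_path_t;

typedef struct
{
  bool has_metal;  /* process_metal() and a compiled Metal kernel exist */
  bool has_opencl; /* process_cl() exists and an OpenCL device is usable */
} module_caps_t;

static process_path_t choose_path(const module_caps_t *m, bool metal_available)
{
  if(metal_available && m->has_metal) return PATH_METAL; /* macOS, Metal enabled */
  if(m->has_opencl) return PATH_OPENCL;                  /* existing GPU path */
  return PATH_CPU;                                       /* always-present fallback */
}
```

The CPU path is the unconditional fallback, which matches the idea that Linux and Windows behavior stays unchanged.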
So with this PR merged we would get the basic things working; other modules could then be added step by step.
Since darktable is Linux first, we don't have to take care of compatibility with old OpenCL versions (OpenCL on macOS is stuck at version 1.2).
In fact, once all kernels are transferred to Metal we can stop using OpenCL on macOS altogether.
This is still a draft and I need some review here.
To avoid code duplication, two helper functions are created in src/develop/pixelpipe_hb.c: _pixelpipe_pre_process() and _pixelpipe_post_process(). @jenshannoschwalm: May I ask you to check if everything is correct here?
The macOS part is guarded by #if defined __APPLE__ so there should hopefully be no impact on Linux and Windows. If this PR gets merged I would continue with the other iop modules.