Improve memory access pattern and histogram clipping for CLAHE #81

Merged
axtimwalde merged 11 commits into axtimwalde:master from minnerbe:refactor/clahe
May 13, 2026

Conversation

minnerbe commented Jun 15, 2025

Contributor

While working on #80, I saw some potential for performance improvements in the "original" CLAHE method and made the following changes:

  • The memory access pattern for updating the sliding-window histogram was column-wise on a row-major image, and the bin index was recomputed for every update. I introduced a column-major array of precomputed bin indices, which speeds up the histogram updates. This trades slightly higher memory consumption for faster run time.
  • The histogram clipping method made many passes over the histogram data to redistribute excess values. I implemented a variant of the "look-ahead method" described in the paper "Resource efficient real-time processing of Contrast Limited Adaptive Histogram Equalization". This seems to reduce the number of passes over the data to about 3 per histogram.
  • The number of bins used internally was larger than the number of bins specified. E.g., for 256 user-specified bins (the default), 257 bins were used internally, which aligns badly with the value range of uint8 that is used internally.
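To illustrate the first point, here is a minimal sketch of what a column-major array of precomputed bin indices could look like. It is not the code from this PR: the class `BinIndexSketch`, its method names, and the mapping `value * bins / 256` are assumptions made for illustration only. The idea is that each pixel is binned once, and the result is stored so that all rows of one image column are contiguous in memory.

```java
// Hypothetical sketch, not the PR's implementation.
final class BinIndexSketch {

    /** Maps an 8-bit intensity (0..255) to a bin in 0..bins-1 (assumed mapping). */
    static int binOf(int value, int bins) {
        return value * bins / 256;
    }

    /**
     * Bins every pixel of a row-major uint8 image once and stores the result
     * column-major: entry [x * height + y] holds the bin of pixel (x, y).
     */
    static int[] precomputeColumnMajor(byte[] pixels, int width, int height, int bins) {
        int[] binIndices = new int[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int value = pixels[y * width + x] & 0xff; // read byte as unsigned
                binIndices[x * height + y] = binOf(value, bins);
            }
        }
        return binIndices;
    }

    /** Adds rows yMin..yMax (inclusive) of column x to the histogram. */
    static void addColumn(int[] hist, int[] binIndices, int height, int x, int yMin, int yMax) {
        int offset = x * height;
        // Consecutive y values are adjacent in memory, so this loop is a linear scan.
        for (int y = yMin; y <= yMax; y++) {
            hist[binIndices[offset + y]]++;
        }
    }
}
```

With this layout, adding or removing a window column touches one contiguous slice of an `int` array, at the cost of one extra `int` per pixel.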

The changes introduce slight differences in the actual output values, but I verified that they stay within 1 intensity unit in a uint8 image. The following table shows the run times before and after the changes (best of 3 runs). The first column gives the image size (width x height) followed by the blockRadius.

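| Image size, blockRadius | Before | After | Speedup |
|---|---|---|---|
| 100x100, 7 | 8 ms | 5 ms | 1.6 |
| 1000x1000, 7 | 661 ms | 235 ms | 2.8 |
| 1000x1000, 127 | 1329 ms | 661 ms | 2.0 |
| 5000x5000, 127 | 36113 ms | 14292 ms | 2.5 |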

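The check mentioned above, that outputs before and after the change differ by at most 1 intensity unit, could be sketched roughly as follows. The class `MaxDifferenceSketch` and its method are hypothetical and not taken from the PR or its tests. The sketch assumes both results are available as `byte[]` arrays of equal length holding uint8 values.

```java
// Hypothetical sketch, not the test added in this PR.
final class MaxDifferenceSketch {

    /** Returns the largest absolute per-pixel difference between two uint8 images. */
    static int maxAbsDifference(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("images must have the same number of pixels");
        }
        int max = 0;
        for (int i = 0; i < a.length; i++) {
            // Java bytes are signed, so mask to get the unsigned value 0..255.
            int diff = Math.abs((a[i] & 0xff) - (b[i] & 0xff));
            if (diff > max) {
                max = diff;
            }
        }
        return max;
    }
}
```

A regression test along these lines would then assert that `maxAbsDifference(before, after) <= 1`.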
In principle, the improved histogram clipping also affects the "fast" option. Since the histogram is not computed that often in this case, I saw only improvements of about 20% for small values of blockRadius and large images (i.e., many histograms to compute), but no significant speedup for the other cases.

That being said, the "fast" option is still orders of magnitude faster and should definitely be the go-to method unless it's verifiably insufficient for the use case.

Let me know what you think @axtimwalde!

minnerbe commented May 9, 2026

Contributor Author

To be on the safe side, I added a test (as a BeanShell script). @axtimwalde, does this PR look good to you? I'm also happy to port all existing tests to JUnit (which adds a dependency).

@axtimwalde

Owner

Looks great. JUnit tests would be wonderful. Thanks!

@minnerbe

Contributor Author

Done. I also ported the ringbuffer test, since it was the only one that didn't rely heavily on manual exploration and a GUI to test things.

@axtimwalde axtimwalde merged commit 0ed8a9d into axtimwalde:master May 13, 2026
1 check passed
@axtimwalde

Owner

Thanks!
