Optimizations for wide histogram building #12158
Open
siqi-he wants to merge 12 commits into dmlc:master from
siqi-he commented on Apr 13, 2026
Comment on lines +502 to +503
constexpr double kMinDensityForTiling = 0.5;
bool bin_sorted = !BuildingManager::kAnyMissing || gmat.RowsSortedByBin();
The local buffer is flushed in full after every column block. When the data is very sparse, only a few bins in each block are actually hit, so a full sweep of the buffer would slow things down. The 0.5 threshold is a rough heuristic, based on the idea that denser data tends to benefit more from tiling.
For the tiled kernel to work, the entries within a row must be in ascending bin order. This appears to hold for the standard SparsePage, but it is not guaranteed for CSRArrayAdapter, which accepts user-provided CSR data whose column indices may be unsorted. The bin_sorted guard is added to avoid silent failures in that case.
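The two conditions described in this comment can be illustrated with a minimal standalone sketch. It is not the code in this PR: the `BinnedCsr` struct and the `IsRowSortedByBin`, `Density`, and `UseTiledKernel` helpers are hypothetical names introduced for illustration, and the choices of strict ascending order and a `>=` comparison against `kMinDensityForTiling` are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CSR view of binned data (not an XGBoost type):
// row_ptr has n_rows + 1 entries, bin_idx holds the global bin index
// of every stored (non-missing) entry.
struct BinnedCsr {
  std::vector<std::size_t> row_ptr;
  std::vector<std::uint32_t> bin_idx;
  std::size_t n_cols;
};

// True when the bin indices inside every row are strictly ascending.
// Strictness is an assumption: each feature contributes at most one bin.
bool IsRowSortedByBin(BinnedCsr const& m) {
  assert(!m.row_ptr.empty());
  for (std::size_t r = 0; r + 1 < m.row_ptr.size(); ++r) {
    for (std::size_t i = m.row_ptr[r] + 1; i < m.row_ptr[r + 1]; ++i) {
      if (m.bin_idx[i - 1] >= m.bin_idx[i]) return false;
    }
  }
  return true;
}

// Fraction of (row, column) cells that hold a stored entry.
double Density(BinnedCsr const& m) {
  std::size_t n_rows = m.row_ptr.size() - 1;
  if (n_rows == 0 || m.n_cols == 0) return 0.0;
  return static_cast<double>(m.bin_idx.size()) /
         (static_cast<double>(n_rows) * static_cast<double>(m.n_cols));
}

// Threshold value taken from the diff above; how it is compared is assumed.
constexpr double kMinDensityForTiling = 0.5;

// Use the tiled kernel only when the data is dense enough for the full
// buffer flush to pay off and the per-row ordering precondition holds.
bool UseTiledKernel(BinnedCsr const& m) {
  return Density(m) >= kMinDensityForTiling && IsRowSortedByBin(m);
}
```

A linear scan like `IsRowSortedByBin` touches every stored entry once, so in practice such a check would presumably be computed once per matrix and cached rather than repeated on every histogram build.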
Optimizations for wide histogram building using column block tiling
Motivation
For wide datasets (e.g. >500 features), the per-thread histogram buffer in CPU hist tree building exceeds L2 cache. Each row scatters gradient updates across the full buffer, causing heavy cache misses. The existing ColsWiseBuildHistKernel mitigates this for dense data by iterating column-by-column, but suffers from poor gradient-pair reuse (it reloads gpair for every column). There is no mitigation for the sparse (any-missing) row-wise path.
Changes
Add column-block tiling with a thread-local buffer to both histogram kernels. Instead of scattering into the full histogram, each thread accumulates into a small buffer covering ~32 columns' worth of bins (~128 KB, fits in L2), then flushes it to the full histogram. This localizes writes and amortizes gradient-pair loads across multiple columns per row.
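As a rough illustration of the idea, the following single-threaded sketch tiles a dense, row-major bin matrix by column block. It is not the kernel from this PR: `GradPair`, `BuildHistTiled`, and the fixed `bins_per_col` layout are hypothetical simplifications, and the sparse row-wise path is not shown. The `static_assert` reproduces the ~128 KB figure under the assumption of 256 bins per column and a 16-byte (two `double`) gradient pair.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical gradient pair: two doubles, 16 bytes.
struct GradPair {
  double grad;
  double hess;
};

// Assumed sizing: 32 columns x 256 bins x 16 bytes = 128 KB local buffer.
constexpr std::size_t kBlockCols = 32;
constexpr std::size_t kBinsPerCol = 256;
static_assert(kBlockCols * kBinsPerCol * sizeof(GradPair) == 128 * 1024,
              "local buffer is expected to be 128 KB");

// bins[r * n_cols + c] is the local bin of feature c in row r, and every
// feature is assumed to own exactly bins_per_col consecutive histogram slots.
void BuildHistTiled(std::vector<std::uint16_t> const& bins,
                    std::vector<GradPair> const& gpair,
                    std::size_t n_cols, std::size_t bins_per_col,
                    std::size_t block_cols, std::vector<GradPair>* hist) {
  assert(block_cols > 0 && n_cols > 0);
  std::size_t n_rows = gpair.size();
  assert(bins.size() == n_rows * n_cols);
  assert(hist->size() == n_cols * bins_per_col);

  // Small buffer that stays cache-resident while one block is processed.
  std::vector<GradPair> local(block_cols * bins_per_col);

  for (std::size_t c0 = 0; c0 < n_cols; c0 += block_cols) {
    std::size_t c1 = std::min(c0 + block_cols, n_cols);
    std::fill(local.begin(), local.end(), GradPair{});

    for (std::size_t r = 0; r < n_rows; ++r) {
      GradPair g = gpair[r];  // loaded once per row, reused for the block
      for (std::size_t c = c0; c < c1; ++c) {
        std::size_t slot = (c - c0) * bins_per_col + bins[r * n_cols + c];
        local[slot].grad += g.grad;
        local[slot].hess += g.hess;
      }
    }

    // Flush the whole block into the full histogram in one linear sweep.
    std::size_t base = c0 * bins_per_col;
    for (std::size_t i = 0; i < (c1 - c0) * bins_per_col; ++i) {
      (*hist)[base + i].grad += local[i].grad;
      (*hist)[base + i].hess += local[i].hess;
    }
  }
}
```

The flush loop sweeps every slot of the block regardless of how many were updated, which is why a full flush is less attractive when only a few bins per block are hit.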
Benchmark methodology
All benchmarks use tree_method='hist', max_depth=8, max_bin=256, 100 rounds, and 3 repeats (average of runs 2-3). CPUs are pinned via taskset. The benchmarks were run on an AWS EC2 c6i.32xlarge instance, using all physical cores (i.e. nthread=64).
Datasets include real (Epsilon, Bosch, Santander) and synthetic (HIGGS with PolynomialFeatures expansion, various sparsity levels via injected NaN at a fixed seed). Sparse datasets force the row-wise kernel path (IsDense()=false). Predictions were verified to be identical (measured by np.allclose) between master (b2f15e6) and the tiling branch across all datasets at 100 rounds.
Results
full benchmark script