
follow up on leaf similarity implementation #12157

Open

ZhuYizhou2333 wants to merge 4 commits into dmlc:master from ZhuYizhou2333:zyz/pr-11926-followup

Conversation

@ZhuYizhou2333

Summary

This PR is a follow-up to pr-11926. It makes Booster.compute_leaf_similarity() more complete and easier to review.

What changed:

  • changed the default behavior to weight_type="uniform"
  • replaced the Python-side pairwise loop with a sparse-matrix-based implementation
  • added a minimal C API for leaf-similarity tree weights so gain/cover no longer depend on trees_to_dataframe()
  • added compatibility coverage for supported tree modes
  • added stable user-facing errors for unsupported modes such as gblinear and multi_output_tree
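The sparse-matrix path can be sketched as follows. This is a minimal sketch, not the PR's actual code: it operates on the leaf-index matrix that `booster.predict(..., pred_leaf=True)` returns, and the names `leaf_similarity` / `tree_weights` are illustrative, not part of the API.

```python
import numpy as np
import scipy.sparse as sp

def leaf_similarity(leaves, tree_weights=None):
    """Weighted leaf co-occurrence similarity (illustrative sketch).

    `leaves` is an (n_samples, n_trees) array of leaf indices, e.g. from
    booster.predict(DMatrix(X), pred_leaf=True). With uniform weights the
    result is the fraction of trees in which two samples share a leaf.
    """
    leaves = np.atleast_2d(np.asarray(leaves, dtype=np.int64))
    n, t = leaves.shape
    w = np.ones(t) if tree_weights is None else np.asarray(tree_weights, dtype=float)
    # Give every (tree, leaf) pair its own sparse column so a single
    # matrix product replaces the O(n^2 * t) pairwise Python loop.
    width = int(leaves.max()) + 1
    cols = (np.arange(t) * width + leaves).ravel()
    rows = np.repeat(np.arange(n), t)
    data = np.tile(np.sqrt(w), n)
    Z = sp.csr_matrix((data, (rows, cols)), shape=(n, t * width))
    # S[i, j] = sum_t w_t * 1[leaf_t(i) == leaf_t(j)] / sum_t w_t
    return np.asarray((Z @ Z.T).todense()) / w.sum()
```

For example, with two equally weighted trees, two samples that co-occur in exactly one tree's leaf get similarity 0.5, and self-similarity is always 1.0.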

Why this changed:

  • the original implementation relied on trees_to_dataframe(), which is not scalable and does not work for all tree configurations
  • the current implementation keeps uniform on a more general path and moves gain/cover weight extraction into a minimal core API
  • this makes the feature faster, easier to reason about, and more explicit about supported vs unsupported modes

Validation used for the code in this PR:

  • source .venv/bin/activate && PYTHONPATH=python-package python -m pytest -q tests/python/test_leaf_similarity.py -rA
  • Result: 25 passed

Experiment Report

The experiment generator used for this evaluation is intentionally kept out of the PR diff so the upstream change stays focused on implementation and compatibility coverage.

PR-11926 Follow-up: Leaf Similarity Experiment Report

Summary

  • Random seed: 2026

  • Compared weight types: uniform, gain, cover

  • The experiments support keeping weight_type configurable. No single option dominates across all datasets.

  • If a single default must be chosen for practical use, the choice should be justified by the target workload rather than by a claim of universal superiority.

  • The current implementation is strong enough to merge as a reusable similarity primitive, but the report does not support claiming that it is categorically better than Random Forest proximity.

Key Findings

  • There is no single winner among uniform/gain/cover; the current win counts are: uniform: 3, gain: 3, cover: 1.
  • Against Random Forest proximity, leaf similarity performs better on: wine, classification_binary; RF performs better on: moons, circles, anisotropic_blobs, classification_multiclass, friedman1_one_output_per_tree.
  • Unsupported modes are detected and reported explicitly instead of being mixed into normal experiment results: friedman1_multi_output_tree.

Why This Happens

  • uniform behaves like a direct proximity count. It is often more stable on binary tabular tasks because it does not amplify a small number of high-gain trees.
  • gain emphasizes trees that contributed more during training, so it tends to highlight stronger discriminative or regression signal on multiclass and multi-target regression tasks.
  • cover emphasizes high-coverage regions. It can be useful on datasets such as moons, where smoother local connectivity matters, but it is not uniformly best.
  • Random Forest proximity remains a strong baseline, especially on 2D manifold-style datasets. The main value of leaf similarity is that it reuses the existing XGBoost model instead of requiring a separate RF model.
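The weight semantics above can be written down explicitly. For trees t = 1..T with leaf assignment L_t(·) and per-tree weight w_t, the similarity between samples i and j is:

```latex
S_w(i, j) = \frac{\sum_{t=1}^{T} w_t \, \mathbf{1}\!\left[L_t(i) = L_t(j)\right]}{\sum_{t=1}^{T} w_t}
```

uniform sets w_t = 1, while gain and cover set w_t to the tree's aggregated gain or cover (the exact aggregation is an implementation detail of the core API, not pinned down here). For non-negative weights, self-similarity is 1 and all values lie in [0, 1].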

Recommendations

  • If the goal is to reuse an existing XGBoost model for sample retrieval without training a separate Random Forest, this API is already practically useful.
  • On binary tabular tasks, uniform is the best first option. It is usually more stable and closest to the classical proximity intuition.
  • On multiclass and one_output_per_tree multi-target regression tasks, gain is the best first option because it better highlights trees with stronger signal.
  • cover should not be the default, but it is worth trying on datasets where local connectivity and smooth regions matter.

Current Boundaries

  • multi_output_tree is still unsupported and should continue to fail with a stable user-facing error.
  • The current visual report covers 2D structure, tabular classification, and multi-target regression, but it does not yet include standalone visualization for dart or num_parallel_tree > 1.
  • This report is best treated as evidence for feature usefulness and weight-type behavior, not as a final large-scale benchmark paper.

Local 2D Explanations

This section is intentionally focused on 2D datasets. Its purpose is to answer the concrete question: “for a specific pair of samples A and B, do these methods consider them similar?” rather than only showing global block structure.

How to read the figures:

  • Scatter plot: shows the original 2D sample distribution. Colors indicate the ground-truth class and provide spatial context for the local explanations below.
  • Anchor neighborhood plot:
    Three anchors are chosen from the evaluation split, then the top-k neighbors of each anchor are shown for rf/uniform/gain/cover.
    Rows correspond to anchor types and columns correspond to similarity methods; each panel still uses the original 2D feature space as its x/y coordinate system.
    The star marker is the anchor, colored points with black edges are the top-k neighbors selected by the method, and gray edges connect the anchor to those neighbors.
    prototype means the sample closest to the centroid of the dominant class, boundary means the sample whose nearest opposite-class neighbor is closest, and fringe means the sample farthest from the dominant-class centroid.
    This figure is meant to show how each method behaves in the class core, near the class boundary, and near the class fringe.
  • A-B pair plot:
    Four representative point pairs are constructed from the evaluation split: nearest same-class, farthest same-class, nearest different-class, and farthest different-class.
    A is marked in red, B is marked as a cyan square, and a black segment connects them.
    This figure turns similarity from an abstract matrix value into a specific pair of samples that can be discussed directly.
  • Representative pair score table:
    Lists the rf/uniform/gain/cover scores for the same A-B pairs shown above.
    This is the most direct artifact for answering how similar a particular pair is under each method.
  • Auxiliary matrix plot:
    The full similarity matrix is still included, but only as a secondary view for global block structure. It is no longer the main interpretability figure.
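Given any of the similarity matrices discussed above, picking an anchor's top-k neighbors is a few lines of numpy. This is an illustrative sketch; `top_k_neighbors` is not part of the PR.

```python
import numpy as np

def top_k_neighbors(S, anchor, k):
    """Indices of the k samples most similar to `anchor`, excluding itself."""
    scores = np.asarray(S, dtype=float)[anchor].copy()
    scores[anchor] = -np.inf          # never return the anchor itself
    order = np.argsort(scores)[::-1]  # descending similarity
    return order[:k]
```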

moons

Figures: moons_scatter, moons_anchor_neighbors, moons_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 14 (1) | 43 (1) | 0.010 | 0.925 | 1.000 | 1.000 | 1.000 |
| same_far | same class, farthest pair | 95 (1) | 102 (1) | 2.306 | 0.025 | 0.050 | 0.000 | 0.008 |
| diff_near | different class, nearest pair | 24 (0) | 58 (1) | 0.136 | 0.875 | 0.887 | 0.999 | 0.981 |
| diff_far | different class, farthest pair | 95 (1) | 121 (0) | 3.229 | 0.000 | 0.050 | 0.000 | 0.008 |

Auxiliary matrix view:

Figure: moons_similarity_matrices

circles

Figures: circles_scatter, circles_anchor_neighbors, circles_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 13 (1) | 126 (1) | 0.002 | 1.000 | 0.937 | 0.999 | 0.986 |
| same_far | same class, farthest pair | 45 (0) | 135 (0) | 2.206 | 0.000 | 0.000 | 0.000 | 0.000 |
| diff_near | different class, nearest pair | 89 (1) | 119 (0) | 0.234 | 0.688 | 0.925 | 0.999 | 0.985 |
| diff_far | different class, farthest pair | 96 (0) | 154 (1) | 1.738 | 0.000 | 0.100 | 0.002 | 0.019 |

Auxiliary matrix view:

Figure: circles_similarity_matrices

anisotropic_blobs

Figures: anisotropic_blobs_scatter, anisotropic_blobs_anchor_neighbors, anisotropic_blobs_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 17 (2) | 73 (2) | 0.002 | 1.000 | 1.000 | 1.000 | 1.000 |
| same_far | same class, farthest pair | 34 (2) | 66 (2) | 6.648 | 0.000 | 0.000 | 0.000 | 0.000 |
| diff_near | different class, nearest pair | 91 (1) | 155 (2) | 0.038 | 0.900 | 0.958 | 0.815 | 0.913 |
| diff_far | different class, farthest pair | 66 (2) | 69 (0) | 8.134 | 0.000 | 0.000 | 0.000 | 0.000 |

Auxiliary matrix view:

Figure: anisotropic_blobs_similarity_matrices

Global Overviews

How to read the overview figures:

  • Weight-type overview: places the primary metric for each dataset side by side to compare rf/uniform/gain/cover at the dataset level.
  • Performance overview: shows runtime curves across different sample sizes and tree counts so the relative cost of each weight type is visible.

Figures: weight_type_overview, performance_overview

Dataset Results

moons

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: cover

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.947500 | 0.886166 |
| gain | ok | 0.941250 | 0.893552 |
| cover | ok | 0.955000 | 0.903496 |
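The primary metric can be computed roughly as follows. This is an assumed definition, since the exact one lives in the experiment generator that was kept out of the diff: the mean fraction of each sample's top-k most-similar neighbors that share its label.

```python
import numpy as np

def top_k_same_class_precision(S, labels, k):
    """Mean fraction of each sample's top-k neighbors sharing its label."""
    labels = np.asarray(labels)
    S = np.array(S, dtype=float)
    np.fill_diagonal(S, -np.inf)                 # exclude self-matches
    topk = np.argsort(S, axis=1)[:, ::-1][:, :k]  # top-k columns per row
    same = labels[topk] == labels[:, None]
    return float(same.mean())
```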

circles

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.976875 | 0.815651 |
| gain | ok | 0.978750 | 0.936447 |
| cover | ok | 0.978750 | 0.957870 |

anisotropic_blobs

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.842500 | 0.820018 |
| gain | ok | 0.837500 | 0.814215 |
| cover | ok | 0.837500 | 0.850356 |

wine

  • Task type: classification
  • Training samples: 119
  • Evaluation samples: 59
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.932203 | 0.852871 |
| gain | ok | 0.808475 | 0.590623 |
| cover | ok | 0.844068 | 0.681282 |

classification_binary

  • Task type: classification
  • Training samples: 1340
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.816875 | 0.751876 |
| gain | ok | 0.760625 | 0.695932 |
| cover | ok | 0.795000 | 0.759709 |

classification_multiclass

  • Task type: classification
  • Training samples: 1005
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.723750 | 0.715012 |
| gain | ok | 0.747500 | 0.750536 |
| cover | ok | 0.744375 | 0.734683 |

friedman1_one_output_per_tree

  • Task type: multi_target_regression
  • Training samples: 804
  • Evaluation samples: 160
  • Primary metric: top_k_target_distance
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 3.628646 | 0.479473 |
| gain | ok | 3.299045 | 0.714496 |
| cover | ok | 3.622466 | 0.479542 |

friedman1_multi_output_tree

  • Task type: unsupported_multi_output_tree
  • Training samples: 804
  • Evaluation samples: 160
  • No valid leaf similarity result is available for this dataset.

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | error | n/a | n/a |
| gain | error | n/a | n/a |
| cover | error | n/a | n/a |

Performance Results

classification_binary

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 1.314 | 1.172 | 1.130 |
| 600 | 100 | 1.995 | 1.935 | 1.658 |
| 1000 | 50 | 1.312 | 1.213 | 1.153 |
| 1000 | 100 | 1.770 | 1.666 | 1.660 |

classification_multiclass

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 2.241 | 2.328 | 2.197 |
| 600 | 100 | 3.949 | 3.816 | 3.557 |
| 1000 | 50 | 2.363 | 2.173 | 2.267 |
| 1000 | 100 | 3.922 | 3.634 | 3.877 |

friedman1_one_output_per_tree

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 1.943 | 1.677 | 1.699 |
| 600 | 100 | 3.010 | 2.991 | 2.948 |
| 1000 | 50 | 1.863 | 1.644 | 1.576 |
| 1000 | 100 | 3.253 | 3.024 | 3.114 |
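Millisecond timings like the ones above can be collected with a small perf_counter harness. This is a sketch of the measurement approach only; `time_ms` and its defaults are illustrative, not the generator's code.

```python
import time
import numpy as np

def time_ms(fn, repeats=5):
    """Median wall-clock time of fn() in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    # median is less noisy than the mean for short runs
    return float(np.median(samples))
```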

mfdel and others added 4 commits January 14, 2026 21:21
Compute similarity between observations based on leaf node co-occurrence
across trees. Similar to Random Forest proximity matrices.

- Two weight types: 'gain' (default) and 'cover'
- Returns similarity matrix with values in [0, 1]
- Self-similarity is 1.0

Closes dmlc#11919
@ZhuYizhou2333 ZhuYizhou2333 marked this pull request as ready for review April 12, 2026 10:58