Follow up on leaf similarity implementation #12157
Open
ZhuYizhou2333 wants to merge 4 commits into dmlc:master from
Conversation
Compute similarity between observations based on leaf node co-occurrence across trees, similar to Random Forest proximity matrices.

- Two weight types: 'gain' (default) and 'cover'
- Returns a similarity matrix with values in [0, 1]
- Self-similarity is 1.0

Closes dmlc#11919
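The co-occurrence idea can be sketched with plain NumPy. This is an illustrative sketch only, not the PR's implementation: the leaf index array stands in for what `Booster.predict(dmatrix, pred_leaf=True)` returns, and the per-tree weight vector is a stand-in for the gain/cover statistics the PR extracts from the core API; the PR's exact per-leaf weighting and normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for Booster.predict(dmatrix, pred_leaf=True):
# one leaf id per sample per tree, shape (n_samples, n_trees).
leaves = rng.integers(0, 8, size=(50, 30))

# Co-occurrence indicator per tree: True where two samples share a leaf.
co = leaves[:, None, :] == leaves[None, :, :]

# 'uniform'-style similarity: the fraction of trees in which two samples
# land in the same leaf (a Random Forest-style proximity count).
sim_uniform = co.mean(axis=2)

# 'gain'/'cover'-style similarity: weight each tree's vote, then normalize
# so self-similarity stays 1.0. (Random stand-in weights; the PR derives
# them from per-tree gain or cover statistics.)
w = rng.random(30) + 0.1
sim_weighted = (co * w).sum(axis=2) / w.sum()

for sim in (sim_uniform, sim_weighted):
    assert np.allclose(np.diag(sim), 1.0)   # self-similarity is 1.0
    assert sim.min() >= 0.0 and sim.max() <= 1.0
```

Both matrices are symmetric with values in [0, 1] and a unit diagonal, matching the contract described above.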
Summary
This PR is a follow-up on top of pr-11926 and makes Booster.compute_leaf_similarity() more complete and easier to review.

What changed:

- Added weight_type="uniform"
- gain/cover no longer depend on trees_to_dataframe()
- Explicit handling for gblinear and multi_output_tree

Why this changed:

- The previous approach relied on trees_to_dataframe(), which is not scalable and does not work for all tree configurations
- This PR puts uniform on a more general path and moves gain/cover weight extraction into a minimal core API

Validation used for the code in this PR:

source .venv/bin/activate && PYTHONPATH=python-package python -m pytest -q tests/python/test_leaf_similarity.py -rA

25 passed

Experiment Report
The experiment generator used for this evaluation is intentionally kept out of the PR diff so the upstream change stays focused on implementation and compatibility coverage.
PR-11926 Follow-up: Leaf Similarity Experiment Report
Summary
Random seed: 2026

Compared weight types: uniform, gain, cover

The experiments support keeping weight_type configurable. No single option dominates across all datasets. If a single default must be chosen for practical use, the choice should be justified by the target workload rather than by a claim of universal superiority.
The current implementation is strong enough to merge as a reusable similarity primitive, but the report does not support claiming that it is categorically better than Random Forest proximity.
Key Findings
No weight type dominates across uniform/gain/cover; the current win counts are: uniform: 3, gain: 3, cover: 1.

Why This Happens
- uniform behaves like a direct proximity count. It is often more stable on binary tabular tasks because it does not amplify a small number of high-gain trees.
- gain emphasizes trees that contributed more during training, so it tends to highlight stronger discriminative or regression signal on multiclass and multi-target regression tasks.
- cover emphasizes high-coverage regions. It can be useful on datasets such as moons, where smoother local connectivity matters, but it is not uniformly best.

Recommendations
- uniform is the best first option: it is usually more stable and closest to the classical proximity intuition.
- For one_output_per_tree multi-target regression tasks, gain is the best first option because it better highlights trees with stronger signal.
- cover should not be the default, but it is worth trying on datasets where local connectivity and smooth regions matter.

Current Boundaries
- multi_output_tree is still unsupported and should continue to fail with a stable user-facing error.
- The implementation has not yet been validated with dart or num_parallel_tree > 1.

Local 2D Explanations
This section is intentionally focused on 2D datasets. Its purpose is to answer the concrete question: “for a specific pair of samples A and B, do these methods consider them similar?” rather than only showing global block structure.
How to read the figures:
- Scatter plot: shows the original 2D sample distribution. Colors indicate the ground-truth class and provide spatial context for the local explanations below.
- Anchor neighborhood plot: three anchors are chosen from the evaluation split, then the top-k neighbors of each anchor are shown for rf/uniform/gain/cover. Rows correspond to anchor types and columns correspond to similarity methods; each panel still uses the original 2D feature space as its x/y coordinate system. The star marker is the anchor, colored points with black edges are the top-k neighbors selected by the method, and gray edges connect the anchor to those neighbors. prototype means the sample closest to the centroid of the dominant class, boundary means the sample whose nearest opposite-class neighbor is closest, and fringe means the sample farthest from the dominant-class centroid. This figure is meant to show how each method behaves in the class core, near the class boundary, and near the class fringe.
- A-B pair plot: four representative point pairs are constructed from the evaluation split: nearest same-class, farthest same-class, nearest different-class, and farthest different-class. A is marked in red, B is marked as a cyan square, and a black segment connects them. This figure turns similarity from an abstract matrix value into a specific pair of samples that can be discussed directly.
- Representative pair score table: lists the rf/uniform/gain/cover scores for the same A-B pairs shown above. This is the most direct artifact for answering how similar a particular pair is under each method.
- Auxiliary matrix plot: the full similarity matrix is still included, but only as a secondary view for global block structure. It is no longer the main interpretability figure.
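The three anchor types can be selected with a few lines of NumPy. The selection rules follow the definitions given above (closest to the dominant-class centroid, closest opposite-class neighbor, farthest from the centroid); the 2D data and the choice of dominant class are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2026)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)            # illustrative 2D labels

cls = 1                                   # treat class 1 as the dominant class
idx = np.flatnonzero(y == cls)
other = np.flatnonzero(y != cls)

centroid = X[idx].mean(axis=0)
d_centroid = np.linalg.norm(X[idx] - centroid, axis=1)

prototype = idx[d_centroid.argmin()]      # closest to the dominant-class centroid
fringe = idx[d_centroid.argmax()]         # farthest from the dominant-class centroid

# boundary: the sample whose nearest opposite-class neighbor is closest
d_opp = np.linalg.norm(X[idx][:, None, :] - X[other][None, :, :], axis=2).min(axis=1)
boundary = idx[d_opp.argmin()]
```

All three anchors are drawn from the dominant class; only the selection criterion differs.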
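The four representative A-B pairs can likewise be constructed from a pairwise distance matrix. This is a sketch on illustrative data; the `extreme_pair` helper is hypothetical, not part of the experiment generator.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 2))
y = (X[:, 1] > 0).astype(int)

# Pairwise Euclidean distances and same/different-class pair masks.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
same = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
diff = y[:, None] != y[None, :]

def extreme_pair(mask, mode):
    # Hypothetical helper: the (A, B) pair with min/max distance among masked pairs.
    d = np.where(mask, D, np.nan)
    flat = np.nanargmin(d) if mode == "min" else np.nanargmax(d)
    return np.unravel_index(flat, D.shape)

pairs = {
    "nearest_same_class": extreme_pair(same, "min"),
    "farthest_same_class": extreme_pair(same, "max"),
    "nearest_different_class": extreme_pair(diff, "min"),
    "farthest_different_class": extreme_pair(diff, "max"),
}
```

Each entry is an (A, B) index pair that can then be scored under rf/uniform/gain/cover and plotted in the original feature space.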
moons
Representative pair scores:
Auxiliary matrix view:
circles
Representative pair scores:
Auxiliary matrix view:
anisotropic_blobs
Representative pair scores:
Auxiliary matrix view:
Global Overviews
How to read the overview figures:
- Weight-type overview: places the primary metric for each dataset side by side to compare rf/uniform/gain/cover at the dataset level.
- Performance overview: shows runtime curves across different sample sizes and tree counts so the relative cost of each weight type is visible.

Dataset Results
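The report does not spell out how top_k_same_class_precision, the primary metric for the classification datasets, is computed. A plausible reconstruction is the average fraction of each sample's top-k most-similar neighbors (excluding the sample itself) that share its class:

```python
import numpy as np

def top_k_same_class_precision(sim, y, k=10):
    """Average fraction of each sample's k most-similar neighbors
    (excluding itself) that share the sample's class. Hypothetical
    reconstruction of the metric named in the report."""
    s = sim.astype(float).copy()
    np.fill_diagonal(s, -np.inf)            # never count a sample as its own neighbor
    topk = np.argsort(-s, axis=1)[:, :k]    # indices of the k most similar samples
    return float((y[topk] == y[:, None]).mean())

# Sanity check: a perfectly block-structured similarity gives precision 1.0.
y = np.array([0] * 5 + [1] * 5)
sim = (y[:, None] == y[None, :]).astype(float)
assert top_k_same_class_precision(sim, y, k=4) == 1.0
```

Under this reading, higher values mean the similarity ranks same-class samples closer, which is the property the per-dataset winners below are compared on.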
| Dataset | Task | Samples | Trees | Primary metric | Best weight type |
| --- | --- | --- | --- | --- | --- |
| moons | classification | 402 | 160 | top_k_same_class_precision | cover |
| circles | classification | 402 | 160 | top_k_same_class_precision | gain |
| anisotropic_blobs | classification | 402 | 160 | top_k_same_class_precision | uniform |
| wine | classification | 119 | 59 | top_k_same_class_precision | uniform |
| classification_binary | classification | 1340 | 160 | top_k_same_class_precision | uniform |
| classification_multiclass | classification | 1005 | 160 | top_k_same_class_precision | gain |
| friedman1_one_output_per_tree | multi_target_regression | 804 | 160 | top_k_target_distance | gain |
| friedman1_multi_output_tree | unsupported_multi_output_tree | 804 | 160 | — | — |

Performance Results
classification_binary
classification_multiclass
friedman1_one_output_per_tree