
follow up on leaf similarity implementation #12157

Open

ZhuYizhou2333 wants to merge 4 commits into dmlc:master from ZhuYizhou2333:zyz/pr-11926-followup

Conversation

@ZhuYizhou2333

Summary

This PR is a follow-up to pr-11926. It makes Booster.compute_leaf_similarity() more complete and easier to review.

What changed:

  • changed the default behavior to weight_type="uniform"
  • replaced the Python-side pairwise loop with a sparse-matrix-based implementation
  • added a minimal C API for leaf-similarity tree weights so gain/cover no longer depend on trees_to_dataframe()
  • added compatibility coverage for supported tree modes
  • added stable user-facing errors for unsupported modes such as gblinear and multi_output_tree
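The sparse-matrix path can be sketched as follows. This is a minimal sketch, not the PR's actual code: it operates on the leaf-index matrix that `booster.predict(..., pred_leaf=True)` returns, and the names `leaf_similarity` / `tree_weights` are illustrative, not part of the API.

```python
import numpy as np
import scipy.sparse as sp

def leaf_similarity(leaves, tree_weights=None):
    """Weighted leaf co-occurrence similarity (illustrative sketch).

    `leaves` is an (n_samples, n_trees) array of leaf indices, e.g. from
    booster.predict(DMatrix(X), pred_leaf=True). With uniform weights the
    result is the fraction of trees in which two samples share a leaf.
    """
    leaves = np.atleast_2d(np.asarray(leaves, dtype=np.int64))
    n, t = leaves.shape
    w = np.ones(t) if tree_weights is None else np.asarray(tree_weights, dtype=float)
    # Give every (tree, leaf) pair its own sparse column so a single
    # matrix product replaces the O(n^2 * t) pairwise Python loop.
    width = int(leaves.max()) + 1
    cols = (np.arange(t) * width + leaves).ravel()
    rows = np.repeat(np.arange(n), t)
    data = np.tile(np.sqrt(w), n)
    Z = sp.csr_matrix((data, (rows, cols)), shape=(n, t * width))
    # S[i, j] = sum_t w_t * 1[leaf_t(i) == leaf_t(j)] / sum_t w_t
    return np.asarray((Z @ Z.T).todense()) / w.sum()
```

For example, with two equally weighted trees, two samples that co-occur in exactly one tree's leaf get similarity 0.5, and self-similarity is always 1.0.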

Why this changed:

  • the original implementation relied on trees_to_dataframe(), which is not scalable and does not work for all tree configurations
  • the current implementation keeps uniform on a more general path and moves gain/cover weight extraction into a minimal core API
  • this makes the feature faster, easier to reason about, and more explicit about supported vs unsupported modes

Validation used for the code in this PR:

  • source .venv/bin/activate && PYTHONPATH=python-package python -m pytest -q tests/python/test_leaf_similarity.py -rA
  • Result: 25 passed

Experiment Report

The experiment generator used for this evaluation is intentionally kept out of the PR diff so the upstream change stays focused on implementation and compatibility coverage.

PR-11926 Follow-up: Leaf Similarity Experiment Report

Summary

  • Random seed: 2026

  • Compared weight types: uniform, gain, cover

  • The experiments support keeping weight_type configurable. No single option dominates across all datasets.

  • If a single default must be chosen for practical use, the choice should be justified by the target workload rather than by a claim of universal superiority.

  • The current implementation is strong enough to merge as a reusable similarity primitive, but the report does not support claiming that it is categorically better than Random Forest proximity.

Key Findings

  • There is no single winner among uniform/gain/cover; the current win counts are: uniform: 3, gain: 3, cover: 1.
  • Against Random Forest proximity, leaf similarity performs better on: wine, classification_binary; RF performs better on: moons, circles, anisotropic_blobs, classification_multiclass, friedman1_one_output_per_tree.
  • Unsupported modes are detected and reported explicitly instead of being mixed into normal experiment results: friedman1_multi_output_tree.

Why This Happens

  • uniform behaves like a direct proximity count. It is often more stable on binary tabular tasks because it does not amplify a small number of high-gain trees.
  • gain emphasizes trees that contributed more during training, so it tends to highlight stronger discriminative or regression signal on multiclass and multi-target regression tasks.
  • cover emphasizes high-coverage regions. It can be useful on datasets such as moons, where smoother local connectivity matters, but it is not uniformly best.
  • Random Forest proximity remains a strong baseline, especially on 2D manifold-style datasets. The main value of leaf similarity is that it reuses the existing XGBoost model instead of requiring a separate RF model.
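The weight semantics above can be written down explicitly. For trees t = 1..T with leaf assignment L_t(·) and per-tree weight w_t, the similarity between samples i and j is:

```latex
S_w(i, j) = \frac{\sum_{t=1}^{T} w_t \, \mathbf{1}\!\left[L_t(i) = L_t(j)\right]}{\sum_{t=1}^{T} w_t}
```

uniform sets w_t = 1, while gain and cover set w_t to the tree's aggregated gain or cover (the exact aggregation is an implementation detail of the core API, not pinned down here). For non-negative weights, self-similarity is 1 and all values lie in [0, 1].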

Recommendations

  • If the goal is to reuse an existing XGBoost model for sample retrieval without training a separate Random Forest, this API is already practically useful.
  • On binary tabular tasks, uniform is the best first option. It is usually more stable and closest to the classical proximity intuition.
  • On multiclass and one_output_per_tree multi-target regression tasks, gain is the best first option because it better highlights trees with stronger signal.
  • cover should not be the default, but it is worth trying on datasets where local connectivity and smooth regions matter.

Current Boundaries

  • multi_output_tree is still unsupported and should continue to fail with a stable user-facing error.
  • The current visual report covers 2D structure, tabular classification, and multi-target regression, but it does not yet include standalone visualization for dart or num_parallel_tree > 1.
  • This report is best treated as evidence for feature usefulness and weight-type behavior, not as a final large-scale benchmark paper.

Local 2D Explanations

This section is intentionally focused on 2D datasets. Its purpose is to answer the concrete question: “for a specific pair of samples A and B, do these methods consider them similar?” rather than only showing global block structure.

How to read the figures:

  • Scatter plot: shows the original 2D sample distribution. Colors indicate the ground-truth class and provide spatial context for the local explanations below.
  • Anchor neighborhood plot:
    Three anchors are chosen from the evaluation split, then the top-k neighbors of each anchor are shown for rf/uniform/gain/cover.
    Rows correspond to anchor types and columns correspond to similarity methods; each panel still uses the original 2D feature space as its x/y coordinate system.
    The star marker is the anchor, colored points with black edges are the top-k neighbors selected by the method, and gray edges connect the anchor to those neighbors.
    prototype means the sample closest to the centroid of the dominant class, boundary means the sample whose nearest opposite-class neighbor is closest, and fringe means the sample farthest from the dominant-class centroid.
    This figure is meant to show how each method behaves in the class core, near the class boundary, and near the class fringe.
  • A-B pair plot:
    Four representative point pairs are constructed from the evaluation split: nearest same-class, farthest same-class, nearest different-class, and farthest different-class.
    A is marked in red, B is marked as a cyan square, and a black segment connects them.
    This figure turns similarity from an abstract matrix value into a specific pair of samples that can be discussed directly.
  • Representative pair score table:
    Lists the rf/uniform/gain/cover scores for the same A-B pairs shown above.
    This is the most direct artifact for answering how similar a particular pair is under each method.
  • Auxiliary matrix plot:
    The full similarity matrix is still included, but only as a secondary view for global block structure. It is no longer the main interpretability figure.
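Given any of the similarity matrices discussed above, picking an anchor's top-k neighbors is a few lines of numpy. This is an illustrative sketch; `top_k_neighbors` is not part of the PR.

```python
import numpy as np

def top_k_neighbors(S, anchor, k):
    """Indices of the k samples most similar to `anchor`, excluding itself."""
    scores = np.asarray(S, dtype=float)[anchor].copy()
    scores[anchor] = -np.inf          # never return the anchor itself
    order = np.argsort(scores)[::-1]  # descending similarity
    return order[:k]
```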

moons

Figures: moons_scatter, moons_anchor_neighbors, moons_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 14 (1) | 43 (1) | 0.010 | 0.925 | 1.000 | 1.000 | 1.000 |
| same_far | same class, farthest pair | 95 (1) | 102 (1) | 2.306 | 0.025 | 0.050 | 0.000 | 0.008 |
| diff_near | different class, nearest pair | 24 (0) | 58 (1) | 0.136 | 0.875 | 0.887 | 0.999 | 0.981 |
| diff_far | different class, farthest pair | 95 (1) | 121 (0) | 3.229 | 0.000 | 0.050 | 0.000 | 0.008 |

Auxiliary matrix view:

Figure: moons_similarity_matrices

circles

Figures: circles_scatter, circles_anchor_neighbors, circles_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 13 (1) | 126 (1) | 0.002 | 1.000 | 0.937 | 0.999 | 0.986 |
| same_far | same class, farthest pair | 45 (0) | 135 (0) | 2.206 | 0.000 | 0.000 | 0.000 | 0.000 |
| diff_near | different class, nearest pair | 89 (1) | 119 (0) | 0.234 | 0.688 | 0.925 | 0.999 | 0.985 |
| diff_far | different class, farthest pair | 96 (0) | 154 (1) | 1.738 | 0.000 | 0.100 | 0.002 | 0.019 |

Auxiliary matrix view:

Figure: circles_similarity_matrices

anisotropic_blobs

Figures: anisotropic_blobs_scatter, anisotropic_blobs_anchor_neighbors, anisotropic_blobs_pair_examples

Representative pair scores:

| pair | description | A (label) | B (label) | euclidean | rf | uniform | gain | cover |
|---|---|---|---|---|---|---|---|---|
| same_near | same class, nearest pair | 17 (2) | 73 (2) | 0.002 | 1.000 | 1.000 | 1.000 | 1.000 |
| same_far | same class, farthest pair | 34 (2) | 66 (2) | 6.648 | 0.000 | 0.000 | 0.000 | 0.000 |
| diff_near | different class, nearest pair | 91 (1) | 155 (2) | 0.038 | 0.900 | 0.958 | 0.815 | 0.913 |
| diff_far | different class, farthest pair | 66 (2) | 69 (0) | 8.134 | 0.000 | 0.000 | 0.000 | 0.000 |

Auxiliary matrix view:

Figure: anisotropic_blobs_similarity_matrices

Global Overviews

How to read the overview figures:

  • Weight-type overview: places the primary metric for each dataset side by side to compare rf/uniform/gain/cover at the dataset level.
  • Performance overview: shows runtime curves across different sample sizes and tree counts so the relative cost of each weight type is visible.

Figures: weight_type_overview, performance_overview

Dataset Results

moons

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: cover

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.947500 | 0.886166 |
| gain | ok | 0.941250 | 0.893552 |
| cover | ok | 0.955000 | 0.903496 |
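The primary metric can be computed roughly as follows. This is an assumed definition, since the exact one lives in the experiment generator that was kept out of the diff: the mean fraction of each sample's top-k most-similar neighbors that share its label.

```python
import numpy as np

def top_k_same_class_precision(S, labels, k):
    """Mean fraction of each sample's top-k neighbors sharing its label."""
    labels = np.asarray(labels)
    S = np.array(S, dtype=float)
    np.fill_diagonal(S, -np.inf)                 # exclude self-matches
    topk = np.argsort(S, axis=1)[:, ::-1][:, :k]  # top-k columns per row
    same = labels[topk] == labels[:, None]
    return float(same.mean())
```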

circles

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.976875 | 0.815651 |
| gain | ok | 0.978750 | 0.936447 |
| cover | ok | 0.978750 | 0.957870 |

anisotropic_blobs

  • Task type: classification
  • Training samples: 402
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.842500 | 0.820018 |
| gain | ok | 0.837500 | 0.814215 |
| cover | ok | 0.837500 | 0.850356 |

wine

  • Task type: classification
  • Training samples: 119
  • Evaluation samples: 59
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.932203 | 0.852871 |
| gain | ok | 0.808475 | 0.590623 |
| cover | ok | 0.844068 | 0.681282 |

classification_binary

  • Task type: classification
  • Training samples: 1340
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: uniform

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.816875 | 0.751876 |
| gain | ok | 0.760625 | 0.695932 |
| cover | ok | 0.795000 | 0.759709 |

classification_multiclass

  • Task type: classification
  • Training samples: 1005
  • Evaluation samples: 160
  • Primary metric: top_k_same_class_precision
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 0.723750 | 0.715012 |
| gain | ok | 0.747500 | 0.750536 |
| cover | ok | 0.744375 | 0.734683 |

friedman1_one_output_per_tree

  • Task type: multi_target_regression
  • Training samples: 804
  • Evaluation samples: 160
  • Primary metric: top_k_target_distance
  • Best weight in this run: gain

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | ok | 3.628646 | 0.479473 |
| gain | ok | 3.299045 | 0.714496 |
| cover | ok | 3.622466 | 0.479542 |

friedman1_multi_output_tree

  • Task type: unsupported_multi_output_tree
  • Training samples: 804
  • Evaluation samples: 160
  • No valid leaf similarity result is available for this dataset.

| weight | status | primary metric | correlation with RF matrix |
|---|---|---|---|
| uniform | error | n/a | n/a |
| gain | error | n/a | n/a |
| cover | error | n/a | n/a |

Performance Results

classification_binary

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 1.314 | 1.172 | 1.130 |
| 600 | 100 | 1.995 | 1.935 | 1.658 |
| 1000 | 50 | 1.312 | 1.213 | 1.153 |
| 1000 | 100 | 1.770 | 1.666 | 1.660 |

classification_multiclass

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 2.241 | 2.328 | 2.197 |
| 600 | 100 | 3.949 | 3.816 | 3.557 |
| 1000 | 50 | 2.363 | 2.173 | 2.267 |
| 1000 | 100 | 3.922 | 3.634 | 3.877 |

friedman1_one_output_per_tree

| samples | trees | uniform (ms) | gain (ms) | cover (ms) |
|---|---|---|---|---|
| 600 | 50 | 1.943 | 1.677 | 1.699 |
| 600 | 100 | 3.010 | 2.991 | 2.948 |
| 1000 | 50 | 1.863 | 1.644 | 1.576 |
| 1000 | 100 | 3.253 | 3.024 | 3.114 |
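Millisecond timings like the ones above can be collected with a small perf_counter harness. This is a sketch of the measurement approach only; `time_ms` and its defaults are illustrative, not the generator's code.

```python
import time
import numpy as np

def time_ms(fn, repeats=5):
    """Median wall-clock time of fn() in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    # median is less noisy than the mean for short runs
    return float(np.median(samples))
```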

mfdel and others added 4 commits January 14, 2026 21:21
Compute similarity between observations based on leaf node co-occurrence
across trees. Similar to Random Forest proximity matrices.

- Two weight types: 'gain' (default) and 'cover'
- Returns similarity matrix with values in [0, 1]
- Self-similarity is 1.0

Closes dmlc#11919
@ZhuYizhou2333 ZhuYizhou2333 marked this pull request as ready for review April 12, 2026 10:58