feat: Add HDBSCAN clustering method #90

Closed
ChenChihYuan wants to merge 3 commits into wassimj:main from ChenChihYuan:HDBSCAN_feature

Conversation

@ChenChihYuan

Add HDBSCAN Clustering Method

Summary

This PR adds Cluster.HDBSCAN(), a pure numpy/scipy implementation of the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm, to topologicpy's Cluster class. It also adds demonstration and comparison cells to the Unsupervised_Learning.ipynb notebook.

Motivation

DBSCAN requires users to specify an epsilon (neighborhood radius) parameter, which can be difficult to tune, especially for datasets whose clusters vary in density. HDBSCAN removes this requirement: it builds a hierarchy of density-based clusters and automatically extracts the most stable ones. This makes it a more robust, general-purpose density-based clustering method.

Implementation Details

Algorithm (Cluster.HDBSCAN()):

  1. Core distances: for each point, compute the distance to its minSamples-th nearest neighbor
  2. Mutual reachability matrix: build a distance matrix where d_mreach(a,b) = max(core(a), core(b), d(a,b))
  3. Minimum spanning tree: construct via Prim's algorithm on the mutual reachability graph
  4. Condensed cluster tree: walk the single-linkage dendrogram top-down, splitting only when both children have ≥ minClusterSize points
  5. Cluster extraction: select the most stable clusters using either the Excess of Mass (eom) or the leaf method
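
Steps 1 to 3 can be sketched in a few lines of numpy/scipy. This is a minimal illustration, not the code from this PR: the function names (core_distances, mutual_reachability, prim_mst) are hypothetical, the sketch uses dense O(n²) matrices, and it assumes that the "minSamples-th nearest neighbor" excludes the point itself.

```python
import numpy as np
from scipy.spatial.distance import cdist


def core_distances(points, min_samples):
    """Step 1: distance from each point to its min_samples-th nearest neighbor."""
    dists = cdist(points, points)
    # After sorting each row, column 0 is the point itself (distance 0), so
    # column `min_samples` is the min_samples-th neighbor excluding the point.
    # Assumption: requires min_samples < number of points.
    return np.sort(dists, axis=1)[:, min_samples], dists


def mutual_reachability(dists, core):
    """Step 2: d_mreach(a, b) = max(core(a), core(b), d(a, b))."""
    mreach = np.maximum(dists, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mreach, 0.0)
    return mreach


def prim_mst(weights):
    """Step 3: Prim's algorithm on a dense, symmetric weight matrix.

    Returns a list of (parent, child, weight) edges, n - 1 in total.
    """
    n = weights.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = weights[0].astype(float)   # cheapest known edge from the tree to each vertex
    parent = np.zeros(n, dtype=int)   # tree endpoint of that cheapest edge
    edges = []
    for _ in range(n - 1):
        # Pick the cheapest vertex that is not yet in the tree.
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        # Relax edges leaving the newly added vertex.
        closer = (weights[j] < best) & ~in_tree
        best[closer] = weights[j][closer]
        parent[closer] = j
    return edges


# Two tight groups of three points, far apart from each other.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
core, dists = core_distances(pts, min_samples=2)
mst_edges = prim_mst(mutual_reachability(dists, core))
```

Sorting the resulting edges by weight gives the merge order of the single-linkage dendrogram that step 4 walks; in this toy data the single long edge is the one joining the two groups.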

API (follows existing Cluster.DBSCAN() conventions):

Cluster.HDBSCAN(
    topologies,                    # List of topologicpy Topology objects
    selectors=None,                # Optional selector points
    keys=["x", "y", "z"],          # Dictionary keys for feature extraction
    minClusterSize=5,              # Minimum cluster size
    minSamples=None,               # Core distance neighbor count (defaults to minClusterSize)
    clusterSelectionMethod="eom"   # "eom" or "leaf"
)
# Returns: (list_of_clusters, noise_cluster_or_None)

Dependencies: Uses only numpy and scipy (both already imported in Cluster.py). No new dependencies.

Notebook additions (12 cells in Unsupervised_Learning.ipynb):

  • HDBSCAN on spiral data with visualization
  • HDBSCAN on gallery floor plan with visualization
  • DBSCAN vs HDBSCAN comparison table

Testing

  • All 14 existing pytest tests pass
  • Verified on synthetic Gaussian clusters (correctly identifies 3 clusters)
  • Verified on elongated and varying-density clusters
  • Verified on spiral and gallery floor plan datasets in the notebook

Jim Chen (Woven by Toyota, Inc.)/Jim Chen added 2 commits March 25, 2026 23:13
Implement Cluster.HDBSCAN() - a pure numpy/scipy implementation of the
Hierarchical Density-Based Spatial Clustering of Applications with Noise
algorithm. Unlike DBSCAN, HDBSCAN does not require an epsilon parameter,
making it more robust for datasets with clusters of varying density.

Algorithm steps:
- Compute core distances for each point
- Build mutual reachability distance matrix
- Construct minimum spanning tree (Prim's algorithm)
- Build condensed cluster tree via top-down dendrogram walk
- Extract stable clusters using Excess of Mass (eom) or leaf method

Also adds HDBSCAN demonstration cells to the Unsupervised_Learning
notebook with spiral data and gallery floor plan examples, plus a
DBSCAN vs HDBSCAN comparison table.
…d add allowSingleCluster parameter

- Fixed EOM bottom-up processing to use reverse sort (leaves before parents)
- Added allowSingleCluster parameter (default False) matching the standard
  HDBSCAN library behavior: prevents trivial single-cluster results by not
  allowing the root cluster to dominate its children in EOM selection
- Updated spiral demo note text to reflect corrected behavior
- HDBSCAN now correctly finds multiple clusters on spiral and gallery data
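
The bottom-up EOM pass and the root handling described in this commit can be sketched as follows. This is a hypothetical illustration rather than the PR's code: the condensed tree is assumed to be given as two plain dicts (children and stability), cluster ids are assumed to be assigned top-down so that a reverse sort visits leaves before parents, and the stability values are taken as already computed.

```python
def select_eom(children, stability, root, allow_single_cluster=False):
    """Return the set of cluster ids chosen by Excess of Mass selection.

    children:  dict mapping cluster id -> list of child ids (empty for leaves)
    stability: dict mapping cluster id -> stability score
    """
    subtree_score = dict(stability)  # best total stability found in each subtree
    selected = set()
    # Assumption: every child id is larger than its parent's id, so iterating
    # in reverse sorted order processes leaves before their parents.
    for node in sorted(children, reverse=True):
        kids = children[node]
        is_root = node == root
        if not kids:
            # Leaves start out selected; a childless root counts only if allowed.
            if not is_root or allow_single_cluster:
                selected.add(node)
            continue
        child_total = sum(subtree_score[k] for k in kids)
        if child_total > stability[node] or (is_root and not allow_single_cluster):
            # The children win (or the root is barred from winning):
            # keep their selection and pass their combined score upward.
            subtree_score[node] = child_total
        else:
            # The parent is more stable: select it and drop all descendants.
            selected.add(node)
            stack = list(kids)
            while stack:
                d = stack.pop()
                selected.discard(d)
                stack.extend(children[d])
    return selected


# Toy tree: root 0 splits into 1 and 2; cluster 1 splits into 3 and 4.
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
stability = {0: 10.0, 1: 3.0, 2: 2.0, 3: 0.5, 4: 0.5}
default_pick = select_eom(children, stability, root=0)
single_pick = select_eom(children, stability, root=0, allow_single_cluster=True)
```

In the toy tree the root has the highest stability, so with allow_single_cluster=True it absorbs everything, while the default keeps clusters 1 and 2. That is the trivial single-cluster outcome the parameter is meant to guard against.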
@ChenChihYuan ChenChihYuan marked this pull request as draft March 25, 2026 16:02
@wassimj wassimj marked this pull request as ready for review April 8, 2026 06:08
@wassimj
Owner

wassimj commented Apr 8, 2026

Hi @ChenChihYuan. Can you please re-send this with just the changes to Cluster.py, without the Jupyter notebook? Thanks!

@wassimj wassimj closed this Apr 8, 2026
@ChenChihYuan
Author

Yes, I will do that. Sorry for the bother! DBSCAN for AEC usage looks pretty great, so I thought about using HDBSCAN.


2 participants