feat: DBSCAN cluster coloring for embedding landscape #341
Run DBSCAN on t-SNE/UMAP projected coordinates server-side to identify spatial clusters, then color them in the web UI like a political map.

Server:
- Add _compute_clusters() with auto-tuned eps (40th percentile k-NN)
- Add _name_clusters() using TF-IDF scoring of concept labels
- Emit cluster_id per point and cluster stats/names in projection response

Client:
- Add "By Cluster" color mode with 3 palettes (Bold, Warm→Cool, Earth)
- Sortable legend (by name, count, or palette order) with cluster toggles
- Highlight/dim clusters by clicking legend entries
- Poll job status after regeneration so UI refreshes reliably
- Move info panel from left-click to right-click context menu
- Left click reserved for pan/rotate only

Screenshots and docs updated.
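The "auto-tuned eps (40th percentile k-NN)" heuristic can be illustrated with a minimal standard-library sketch. This is not the PR's `_compute_clusters()`; the function name `estimate_eps`, the default `k=5`, and the brute-force distance loop are assumptions made for illustration. The `floor` argument mirrors the eps=0 guard mentioned later in this thread.

```python
import math


def estimate_eps(points, k=5, percentile=40.0, floor=1e-6):
    """Pick a DBSCAN eps as a percentile of each point's k-th nearest-neighbor distance.

    Hypothetical helper: name, defaults, and brute-force search are assumptions.
    """
    n = len(points)
    if n < 2:
        return floor
    k = min(k, n - 1)  # a point has at most n - 1 neighbors

    # Distance from every point to its k-th nearest neighbor (O(n^2) brute force).
    kth_distances = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth_distances.append(dists[k - 1])
    kth_distances.sort()

    # Linearly interpolated percentile over the sorted k-NN distances.
    rank = (percentile / 100.0) * (n - 1)
    lo = math.floor(rank)
    hi = min(lo + 1, n - 1)
    value = kth_distances[lo] + (rank - lo) * (kth_distances[hi] - kth_distances[lo])

    # Floor keeps eps strictly positive when all points coincide.
    return max(value, floor)
```

The returned value would then be passed as `eps` to a DBSCAN implementation such as scikit-learn's `DBSCAN(eps=..., min_samples=...)`. A low percentile keeps eps near the typical spacing inside dense regions, so sparse outliers tend to be left as noise rather than merged into clusters.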
Code Review -- PR #341: DBSCAN Cluster Coloring

Scope: 447 additions, 61 deletions across 7 source files (standard tier). Backend clustering + naming logic in Python, frontend cluster visualization in React/TypeScript, documentation and screenshots.

The overall feature is well-structured: server-side clustering keeps the frontend thin, TF-IDF naming is a smart approach, and the auto-tuned eps via k-NN percentile is a solid heuristic. Good work separating

Important

1. Job polling reimplements
Location:
The apiClient already has
Suggestion: Replace the manual loop with:

    if (result.status === 'queued' && result.job_id) {
      const finalJob = await apiClient.pollJobUntilComplete(result.job_id, {
        intervalMs: 1000,
      });
      if (finalJob.status === 'failed') {
        setError('Projection job failed');
        return;
      }
      if (finalJob.status === 'cancelled') {
        setError('Projection job was cancelled');
        return;
      }
    }

Note that

2. Location:
When there is exactly one cluster, every term has

    idf = math.log(num_clusters / doc_freq[w])  # math.log(1/1) = 0.0

All terms score

3. Location:
The

    onSelectPoint: (point: EmbeddingPoint | null, screenPos?: { x: number; y: number }) => void;

But after this PR,

Minor

4. Cluster legend JSX is ~120 lines of inline rendering -- consider extraction
Location:
Not blocking, but this file will keep growing as features are added.

5. Python type annotations:
Location:
6. Location:

    data_range = float(np.max(np.ptp(projection, axis=0)))
    data_range = float(np.max(np.max(projection, axis=0) - np.min(projection, axis=0)))

Nit

7. These are stdlib imports. Convention in this codebase (and PEP 8) is top-of-file imports. Lazy imports are appropriate for heavy optional dependencies (like the sklearn guard pattern used above), but

8. Stop words list could be a module-level constant
What looks good
Testing gap
There are no tests for
These are pure-function methods on numpy arrays -- easy to unit test without database fixtures.
- Replace manual job polling with apiClient.pollJobUntilComplete()
- Fix single-cluster TF-IDF scoring (frequency-only when num_clusters <= 1)
- Remove dead screenPos parameter from onSelectPoint
- Extract ClusterLegend into its own component (from ~140 inline lines)
- Replace deprecated np.ptp with explicit max-min
- Move inline imports (math, Counter) to module top level
- Use consistent str keys for cluster_sizes and cluster_names dicts
- Fix eps=0 edge case when all points are identical (floor at 1e-6)
- Add 11 unit tests for _compute_clusters and _name_clusters
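The single-cluster TF-IDF fix can be sketched as follows. This is an illustrative stand-in, not the PR's `_name_clusters()`: the function name `name_clusters`, the `top_n` parameter, the `STOP_WORDS` contents, the whitespace tokenizer, and the " / " joiner are all assumptions. It uses the `idf = math.log(num_clusters / doc_freq[w])` form quoted in the review and falls back to raw term frequency when `num_clusters <= 1`, where every idf would otherwise be 0.

```python
import math
from collections import Counter

# Illustrative stop-word list only; the PR's actual list is not shown in this thread.
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def name_clusters(labels_by_cluster, top_n=2):
    """Name each cluster by its highest-scoring label terms (hypothetical sketch)."""
    # Per-cluster term counts, keyed by str cluster id for consistent dict keys.
    term_counts = {}
    for cid, labels in labels_by_cluster.items():
        words = [
            w
            for label in labels
            for w in label.lower().split()
            if w not in STOP_WORDS
        ]
        term_counts[str(cid)] = Counter(words)

    num_clusters = len(term_counts)

    # Document frequency: in how many clusters each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    names = {}
    for cid, counts in term_counts.items():

        def score(w):
            tf = counts[w]
            if num_clusters <= 1:
                return tf  # frequency-only: idf would be log(1/1) = 0 for every term
            return tf * math.log(num_clusters / doc_freq[w])

        # Highest score first; ties broken alphabetically for deterministic names.
        ranked = sorted(counts, key=lambda w: (-score(w), w))
        names[cid] = " / ".join(ranked[:top_n])
    return names
```

With two or more clusters, a term shared by every cluster scores 0 and drops to the bottom of the ranking, so names favor terms that distinguish a cluster from its neighbors.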
Summary
Test plan
tsc -b passes with no errors