Changes to the clustering algorithm to support out-of-sample predictions#15
Changes to the clustering algorithm to support out-of-sample predictions#15fholstege wants to merge 7 commits into
Conversation
|
|
||
| def fit(self, X, y): | ||
|
|
||
| def fit(self, X, y, random_state=None): |
There was a problem hiding this comment.
I don't see random_state being used anywhere inside fit. Why is it added as an argument?
There was a problem hiding this comment.
As discussed 24-02:
Randomness is already implemented in abstract method both in _kmeans.py and _kmodes.py. So, no need to add it in the fit function.
|
|
||
|
|
||
| # Fit the centroids | ||
| self.centroids_ = self.calc_centroids(X, self.labels_) |
There was a problem hiding this comment.
I wouldn't add centroids_ field to BiasAwareHierarchicalClustering as it can be used to implement clustering algorithms that do not use centroids (e.g. DBSCAN).
There was a problem hiding this comment.
As discussed 24-02:
We don't want to return centroids as part of the _bach.py because it's an abstract class. Best way forward is to implement as part of BiasAwareHierarchicalKMeans and BiasAwareHierarchicalKModes classes. This allows us/others to implement other clustering methods that don't use centroids in the future.
There was a problem hiding this comment.
To be discussed:
Is predict function using calculated centroids the same as HBAC results?
|
|
||
| @abstractmethod | ||
| def _split(self, X): | ||
| def _split(self, X, random_state=None): |
There was a problem hiding this comment.
random_state also seems redundant over here.
| """ | ||
| pass | ||
|
|
||
| def binary_chi_square_test(self, m, labels, k, bonf_correct): |
There was a problem hiding this comment.
self is not used anywhere inside the function, so this should be moved out, probably into utils.
There was a problem hiding this comment.
Implement in new post_processing.py in subfolder utils
| return p_clust, diff_clust, p1, p0 | ||
|
|
||
|
|
||
| def t_test(self, m, labels, k, bonf_correct, alternative='two-sided'): |
There was a problem hiding this comment.
Implement in new post_processing.py in subfolder utils
| return p_clust, diff_clust, m1.mean(), m0.mean() | ||
|
|
||
|
|
||
| def calc_ratio_within_between(self, m, labels): |
| return self.kmeans.fit_predict(X) | ||
|
|
||
|
|
||
| def calc_centroids(self, X, labels): |
There was a problem hiding this comment.
This should be private (i.e. prefixed with an underscore _), if present in the class at all.
| centroids = np.zeros((X.shape[1], len(np.unique(labels)))) | ||
|
|
||
| # iterate over the labels | ||
| for i, label in enumerate(np.unique(labels)): |
There was a problem hiding this comment.
np.unique(labels) should be equivalent to self.n_clusters_
Also, no need to use enumerate you can use just for i in range(self.n_clusters_) or for label in range(self.n_clusters_).
| return self.kmodes.fit_predict(X) | ||
|
|
||
|
|
||
| def calc_centroids(self, X, labels): |
There was a problem hiding this comment.
Same here, I would make it private.
Also includes some code related to the paper. This should be ignored/rejected.