Hello,
I would like to implement the k-means clustering algorithm for okapi and I would really appreciate your input regarding some implementation choices.
The algorithm partitions N data points (observations) into k clusters.
The standard algorithm is iterative and works as follows.
Input: data points of the form { pointID, coordinatesVector }
Output: { pointID, centerID } pairs and the coordinates of each of the k cluster centers. Cluster centers need not belong to the input points.
Initialization: randomly choose k points from the input as the initial centers
In each iteration:
- each data point is assigned to the cluster center closest to it in Euclidean distance
- new cluster centers are recomputed as the arithmetic mean of their assigned points
Convergence is reached when the positions of the cluster centers do not change.
In Giraph, each data point will correspond to a vertex that executes the assignment step. The positions of the k centers can be stored in an aggregator and updated by the Master.
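To make the loop concrete, here is a minimal plain-Java sketch of the algorithm described above. It has no Giraph dependency and all names (`assignStep`, `run`, etc.) are illustrative only: `assignStep` is the per-point work each vertex would do in a superstep, and the mean recomputation stands in for what the Master would do from the aggregated sums. Centers are initialized to the first k points rather than randomly, so the example is deterministic.

```java
import java.util.Arrays;

// Plain-Java sketch of the standard k-means loop; not Giraph code.
public final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assignment step: index of the closest center for one data point.
    static int assignStep(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Iterate until no center moves; returns the final centers.
    static double[][] run(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone(); // deterministic initialization
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // These sums/counts play the role of the centers aggregator.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = assignStep(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) sums[c][i] += p[i];
            }
            // "Master" step: new center = mean of assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep an empty cluster's center
                for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
                if (!Arrays.equals(sums[c], centers[c])) moved = true;
                centers[c] = sums[c];
            }
            if (!moved) break; // convergence: no center changed position
        }
        return centers;
    }
}
```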
The GPS paper describes a different version of the algorithm, where
- cluster centers are randomly chosen in each iteration
- distance from the centers is calculated based on shortest paths (edge weights)
- convergence is reached when the edge cut is less than some threshold
Facebook seems to have followed a similar approach.
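For the edge-cut convergence criterion mentioned above, a small helper along these lines would suffice (a sketch with hypothetical names; edges are represented as { source, target, weight } triples): sum the weights of edges whose endpoints land in different clusters, and stop once that total falls below a threshold.

```java
// Hypothetical sketch of the edge-cut convergence test for the
// GPS-style variant. An edge is "cut" when its endpoints are assigned
// to different clusters; convergence means the total cut weight has
// dropped below a chosen threshold.
public final class EdgeCut {

    // Total weight of edges crossing cluster boundaries.
    static double edgeCut(double[][] edges, int[] cluster) {
        double cut = 0.0;
        for (double[] e : edges) {
            int u = (int) e[0];
            int v = (int) e[1];
            if (cluster[u] != cluster[v]) {
                cut += e[2]; // edge crosses a cluster boundary
            }
        }
        return cut;
    }

    static boolean converged(double[][] edges, int[] cluster, double threshold) {
        return edgeCut(edges, cluster) < threshold;
    }
}
```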
In my view, it's better to go with the standard algorithm first; once we have a stable implementation, we can extend it to support edge cut as a convergence criterion and/or other variations.
Let me know your thoughts!