Hello,
I would like to implement the k-means clustering algorithm for okapi and I would really appreciate your input regarding some implementation choices.
The algorithm partitions N data points (observations) into k clusters.
The standard algorithm is iterative and works as follows.
Input: data points of the form { pointID, coordinatesVector }
Output: { pointID, centerID } pairs and the coordinates of each of the k cluster centers. Cluster centers need not belong to the input points.
Initialization: randomly choose k points from the input as the initial centers
In each iteration:
- each data point is assigned to the cluster center closest to it in Euclidean distance
- new cluster centers are recomputed as the arithmetic mean of their assigned points
Convergence is reached when the positions of the cluster centers do not change.
In Giraph, each data point will correspond to a vertex that executes the assignment step. The positions of the k centers can be stored in an aggregator and updated by the Master.
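To make the loop concrete, here is a minimal plain-Java sketch of the algorithm described above. It has no Giraph dependency and all names (`assignStep`, `run`, etc.) are illustrative only: `assignStep` is the per-point work each vertex would do in a superstep, and the mean recomputation stands in for what the Master would do from the aggregated sums. Centers are initialized to the first k points rather than randomly, so the example is deterministic.

```java
import java.util.Arrays;

// Plain-Java sketch of the standard k-means loop; not Giraph code.
public final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assignment step: index of the closest center for one data point.
    static int assignStep(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(point, centers[c]) < squaredDistance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Iterate until no center moves; returns the final centers.
    static double[][] run(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone(); // deterministic initialization
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // These sums/counts play the role of the centers aggregator.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = assignStep(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) sums[c][i] += p[i];
            }
            // "Master" step: new center = mean of assigned points.
            boolean moved = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep an empty cluster's center
                for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
                if (!Arrays.equals(sums[c], centers[c])) moved = true;
                centers[c] = sums[c];
            }
            if (!moved) break; // convergence: no center changed position
        }
        return centers;
    }
}
```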
The GPS paper describes a different version of the algorithm, where
- cluster centers are randomly chosen in each iteration
- distance from the centers is calculated based on shortest paths (edge weights)
- convergence is reached when the edge cut is less than some threshold
Facebook seems to have followed a similar approach.
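For the edge-cut convergence criterion mentioned above, a small helper along these lines would suffice (a sketch with hypothetical names; edges are represented as { source, target, weight } triples): sum the weights of edges whose endpoints land in different clusters, and stop once that total falls below a threshold.

```java
// Hypothetical sketch of the edge-cut convergence test for the
// GPS-style variant. An edge is "cut" when its endpoints are assigned
// to different clusters; convergence means the total cut weight has
// dropped below a chosen threshold.
public final class EdgeCut {

    // Total weight of edges crossing cluster boundaries.
    static double edgeCut(double[][] edges, int[] cluster) {
        double cut = 0.0;
        for (double[] e : edges) {
            int u = (int) e[0];
            int v = (int) e[1];
            if (cluster[u] != cluster[v]) {
                cut += e[2]; // edge crosses a cluster boundary
            }
        }
        return cut;
    }

    static boolean converged(double[][] edges, int[] cluster, double threshold) {
        return edgeCut(edges, cluster) < threshold;
    }
}
```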
In my view, it's better to go with the standard algorithm first; once we have a stable implementation, we can extend it to support edge cut as a convergence criterion and/or other variations.
Let me know your thoughts!