
k-means clustering implementation #5

@vasia

Hello,

I would like to implement the k-means clustering algorithm for okapi and I would really appreciate your input regarding some implementation choices.

The algorithm partitions N data points (observations) into k clusters.
The standard algorithm is iterative and works as follows.

Input: data points of the form { pointID, coordinatesVector }
Output: { pointID, centerID } & coordinates of each of the k cluster centers. Cluster centers do not have to belong to the input points.
Initialization: randomly choose k points from input

In each iteration:

  1. each data point is assigned to the cluster center closest to it, measured by Euclidean distance
  2. new cluster centers are recomputed as the arithmetic mean of the points assigned to them

Convergence is reached when the positions of the cluster centers no longer change.
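The loop above can be sketched in plain Java as a minimal in-memory version (this is illustrative only, not Giraph code; all names are mine):

```java
import java.util.Arrays;
import java.util.Random;

public class KMeans {
    // Standard k-means on points[n][dim]; returns the cluster id of each point.
    // Assumes n >= k so that k distinct initial centers can be sampled.
    static int[] cluster(double[][] points, int k, Random rnd) {
        int n = points.length, dim = points[0].length;
        // Initialization: randomly choose k distinct input points as centers
        double[][] centers = new double[k][];
        int[] chosen = rnd.ints(0, n).distinct().limit(k).toArray();
        for (int c = 0; c < k; c++) centers[c] = points[chosen[c]].clone();

        int[] assignment = new int[n];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Step 1: assign each point to the closest center (Euclidean distance)
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = squaredDistance(points[i], centers[0]);
                for (int c = 1; c < k; c++) {
                    double d = squaredDistance(points[i], centers[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[i] = best;
            }
            // Step 2: recompute each center as the mean of its assigned points
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += points[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // empty cluster keeps its old center
                for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
                if (!Arrays.equals(sums[c], centers[c])) { centers[c] = sums[c]; changed = true; }
            }
        }
        return assignment;
    }

    // Squared Euclidean distance; the square root is not needed for comparisons.
    static double squaredDistance(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```

Comparing squared distances avoids the square root, since the nearest center is the same either way.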

In Giraph, each data point will correspond to a vertex and execute step 1. The positions of the k centers can be stored in an aggregator and updated by the Master.
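The aggregator part of that mapping boils down to each vertex contributing its coordinates to a per-cluster (sum, count) pair, with the master dividing between supersteps. A hedged sketch of just that piece, in plain Java rather than the actual Giraph aggregator API (all names are mine):

```java
public class CenterAggregator {
    final double[][] sums;  // per-cluster coordinate sums
    final long[] counts;    // per-cluster point counts

    CenterAggregator(int k, int dim) {
        sums = new double[k][dim];
        counts = new long[k];
    }

    // What each vertex would contribute in its compute() call,
    // after step 1 has picked its closest cluster.
    void aggregate(int clusterId, double[] coords) {
        counts[clusterId]++;
        for (int d = 0; d < coords.length; d++) sums[clusterId][d] += coords[d];
    }

    // What the Master would do between supersteps: turn sums into new centers.
    double[][] newCenters(double[][] oldCenters) {
        double[][] centers = new double[sums.length][];
        for (int c = 0; c < sums.length; c++) {
            if (counts[c] == 0) { centers[c] = oldCenters[c].clone(); continue; }
            centers[c] = new double[sums[c].length];
            for (int d = 0; d < sums[c].length; d++) centers[c][d] = sums[c][d] / counts[c];
        }
        return centers;
    }
}
```

Sums and counts commute and associate, so they aggregate correctly regardless of the order in which vertices are processed, which is exactly what an aggregator needs.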

The GPS paper describes a different version of the algorithm, where

  1. cluster centers are randomly chosen in each iteration
  2. distance from the centers is calculated based on shortest paths (edge weights)
  3. convergence is reached when the edge cut is less than some threshold
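For reference, the edge-cut check in step 3 just counts edges whose endpoints land in different clusters; a minimal sketch (illustrative names, not the GPS code):

```java
public class EdgeCut {
    // Fraction of edges whose endpoints fall in different clusters.
    // edges[i] = {source, target}; assignment maps vertex id -> cluster id.
    static double edgeCutRatio(int[][] edges, int[] assignment) {
        if (edges.length == 0) return 0.0;
        int cut = 0;
        for (int[] e : edges) {
            if (assignment[e[0]] != assignment[e[1]]) cut++;
        }
        return (double) cut / edges.length;
    }
}
```

The iteration would then stop once this ratio drops below the chosen threshold.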

Facebook seems to have followed a similar implementation.

In my view, it's better to go with the standard implementation first and, once it is stable, perhaps extend it to support the edge cut as a convergence criterion and/or other variations.
Let me know your thoughts!
