by Erica Brisigotti, Ekaterina Chueva, Sofia Pacheco Garcia, Nadillia Sahputra
This work is the final project of the Management and Analysis of Physics Dataset (mod. B) class from the Physics of Data Master's Degree, held at the University of Padova during Academic Year 2022-2023 by Professor J. Pazzini.
The goals of the project are to:
- build a Spark cluster of 3 Virtual Machines on CloudVeneto, the OpenStack-based cloud provided by the University of Padova
- implement 3 variations of the K-means algorithm (the standard “naive” one, K-means++, and K-means||)
- compare the 3 variations based on their performance on the chosen dataset, to ultimately find the best algorithm.
K-means is an unsupervised machine learning algorithm which aims to partition the observations into k clusters, assigning each observation to the cluster with the nearest centroid.
The dataset chosen for the project is the KDD Cup 1999 (kddcup99) dataset, an artificial dataset representing “bad” and “good” internet connections (intrusions/attacks and normal connections). Each observation consists of a series of attributes, some numeric and some non-numeric.
We first build a Spark cluster of 3 virtual machines: we specify that one machine is both a master and worker, and the remaining two are workers. We then install all the components necessary to run Python, Jupyter Notebooks, and PySpark on said VMs.
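As a sketch, a standalone cluster like this can be brought up with Spark's bundled scripts (the `master-vm` hostname and the `SPARK_HOME` path below are placeholders, not the actual values used on CloudVeneto):

```shell
# on the master VM (which also runs a worker)
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://master-vm:7077

# on each of the two worker-only VMs
$SPARK_HOME/sbin/start-worker.sh spark://master-vm:7077
```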
We import a subset of the data from the Python scikit-learn library: our dataset consists of 42 columns/attributes, around 500000 rows/observations, and the corresponding targets/classes. We take a look at the distribution of the targets/classes:
K-means algorithms perform best when the classes are of comparable size. Therefore, we decide to keep only the classes with the most data (>10000 occurrences). We are left with 3 classes.
We further modify the dataset by keeping only the numeric columns/attributes. If we had kept all columns/attributes, we would have used a mixed metric (defined as the sum of the Euclidean and discrete distances): in that case, the non-numeric columns/attributes would give larger contributions than the numeric ones, leading to an imbalanced estimate of the centroid and cluster.
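A minimal sketch of this filtering step, run locally with NumPy on a synthetic label array in place of the real dataset (the function name and threshold default are illustrative, not the project's actual code):

```python
import numpy as np

def filter_major_classes(X, y, min_count=10_000):
    """Keep only the rows whose class occurs at least min_count times."""
    labels, counts = np.unique(y, return_counts=True)
    major = labels[counts >= min_count]       # classes with enough data
    mask = np.isin(y, major)                  # rows belonging to those classes
    return X[mask], y[mask]
```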
The following section focuses on the implementation of the 3 variations of the k-means algorithm: k-means||, k-means++, and naive k-means. Each algorithm consists of an initialization (specific to that variation) followed by Lloyd's algorithm, which alternately assigns each datum to the nearest centroid and recomputes each centroid as the mean of its cluster.
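Lloyd's iteration, shared by all three variations, can be sketched in plain NumPy as follows (in the project the same two steps, assignment and centroid update, are distributed with PySpark; this standalone version is only illustrative):

```python
import numpy as np

def lloyd(X, centroids, n_iter=10):
    """Alternate the assignment and update steps of Lloyd's algorithm."""
    for _ in range(n_iter):
        # assignment: index of the nearest centroid for every point
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update: each centroid becomes the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return centroids, labels
```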
The idea behind the K-means|| initialization is to find a good compromise between the random initialization of the naive approach and the k-means++ initialization, which can be thought of as the two ends of a spectrum. From the naive approach, we would like to keep the ability to select multiple points at a time and keep the number of iterations small. From k-means++, we want to keep the non-uniform distribution from which the points are randomly extracted.
The desired sweet spot is found by tuning the two main parameters of the initialization:
- $l$, the oversampling factor, which scales the non-uniform distribution from which new centroids are extracted: the distribution is proportional to the squared distance, so that we are more likely to pick new centroids far away from the ones we already have
- the number of iterations of the initialization
Because of the fixed number of iterations, this algorithm doesn't accept exactly the first k centroids that it finds, as the other variations do. It accepts, in expectation, $l$ points per iteration, so the initialization ends with more than $k$ candidate centroids; these candidates are then re-clustered down to exactly $k$ centroids, with each candidate weighted by the number of data points closest to it.
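A local NumPy sketch of this oversampling scheme (the real implementation distributes the distance computation with PySpark; the function name and the handling of the parameters $l$ and $r$ here are illustrative assumptions):

```python
import numpy as np

def kmeans_parallel_init(X, l, r, rng):
    """Oversampling phase of k-means||: r rounds, ~l candidates per round."""
    C = X[rng.integers(len(X))][None, :]          # one uniform starting point
    for _ in range(r):
        # squared distance from every point to its nearest current candidate
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample each point independently with probability ~ l * d2 / total cost
        p = np.minimum(1.0, l * d2 / d2.sum())
        C = np.vstack([C, X[rng.random(len(X)) < p]])
    return C   # > k candidates, to be weighted and re-clustered down to k
```

The candidate set returned here would then be reduced to the final $k$ centroids, e.g. by running a weighted k-means++ pass over the candidates.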
After this initialization we ended up with k=3 centroids which already have a “good start” and in theory require fewer Lloyd's iterations (we verify this in practice as well). For k-means|| we also found the optimal parameters by analyzing the trend of the cost against the number of iterations.
The idea of the k-means++ algorithm is that we choose the first centroid uniformly at random, then choose each subsequent centroid from the remaining data points with probability proportional to its squared distance from the nearest centroid already chosen.
This algorithm extracts one centroid per iteration, so the initialization requires $k$ sequential passes over the data.
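Sketched locally in NumPy (illustrative only; already-chosen points have zero distance, hence zero probability of being drawn again):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: k sequential, squared-distance-weighted draws."""
    C = X[rng.integers(len(X))][None, :]          # first centroid: uniform
    for _ in range(k - 1):
        # squared distance to the nearest centroid chosen so far
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # draw the next centroid with probability proportional to d2
        idx = rng.choice(len(X), p=d2 / d2.sum())
        C = np.vstack([C, X[idx]])
    return C
```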
The results are shown in the picture below:
We see that k-means++ is the worst algorithm in terms of cost minimization. We can then compare naive k-means and k-means||: both algorithms reach the minimum cost, but k-means|| takes more time in the initialization. Although naive k-means may look best, in reality we cannot rely on it, since its result strongly depends on the fully random initialization. On the contrary, although we pay a cost in initialization time with k-means||, it is more reliable.
We experimented with the number of partitions and obtained the following results:
We concluded that for k-means|| the best number of partitions is 12, which equals the number of workers multiplied by the number of cores available on each worker.


