Generalizes to clusters of different shapes and sizes.
Bad
Sensitive to outliers.
Choosing the value of k manually is difficult.
Results depend on the initial centroid positions.
Scales poorly as the number of dimensions increases.
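A small illustration of the outlier point may help. The sketch below assumes each centroid is the coordinate-wise mean of its cluster, as in standard k-means. The helper name centroid and the sample points are made up for illustration.

```python
from statistics import mean

def centroid(points):
    """Return the coordinate-wise mean of a list of 2-D points."""
    xs, ys = zip(*points)
    return (mean(xs), mean(ys))

# Four tightly grouped points (illustrative data, not from the slides)
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (50.0, 50.0)

clean_centroid = centroid(cluster)
skewed_centroid = centroid(cluster + [outlier])

print(clean_centroid)   # centroid of the four grouped points
print(skewed_centroid)  # centroid after adding the single outlier
```

A single distant point moves the centroid from the middle of the group to a position far outside it, which is why outliers can distort the resulting clusters.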
What is Clustering? (Supervised vs Unsupervised Learning)
Clustering can be defined as grouping unlabelled data into sets of similar points.
The Elbow Method
Within Cluster Sum of Squares (WCSS):
$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster 1}} \mathrm{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster 2}} \mathrm{distance}(P_i, C_2)^2 + \dots$$
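As a rough sketch, the WCSS sum above could be computed as follows. The code assumes that distance means Euclidean distance; the function names, the sample points, and the hand-picked cluster assignments are illustrative and not part of the original notes.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between point p and centroid c."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        squared_distance(p, centroids[label])
        for p, label in zip(points, labels)
    )

# Illustrative data: two visually separate groups of two points each
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]

# k = 1: every point assigned to one centroid at the overall mean (5, 5)
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 5.0)])

# k = 2: one centroid per group, each at the mean of its group
wcss_k2 = wcss(points, [0, 0, 1, 1], [(1.0, 2.0), (9.0, 8.0)])

print(wcss_k1, wcss_k2)
```

In the elbow method, WCSS is computed for a range of k values and plotted against k. The value of k where the curve stops dropping sharply is a reasonable candidate.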
K-Means++
K-Means++ Initialization Algorithm:
Step 1: Choose the first centroid at random from the data points
Step 2: For each remaining data point, compute the distance (D) to the nearest of the already selected centroids
Step 3: Choose the next centroid from the remaining data points using weighted random selection, weighted by D²
Step 4: Repeat Steps 2 and 3 until all k centroids have been selected
Step 5: Proceed with standard k-means clustering
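Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, assuming Euclidean distance and at least k distinct data points. The function name kmeans_pp_init and the seed parameter are hypothetical, chosen for this example. Step 5 (the standard k-means iterations) is not shown.

```python
import random

def squared_distance(p, c):
    """Squared Euclidean distance between point p and centroid c."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids from points using D^2-weighted sampling.

    Assumes points contains at least k distinct points.
    """
    rng = random.Random(seed)
    # Step 1: first centroid chosen uniformly at random
    centroids = [rng.choice(points)]
    # Step 4: repeat Steps 2 and 3 until k centroids are selected
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest centroid
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        # Step 3: weighted random selection; already chosen points have
        # weight 0, so they cannot be selected again
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

# Illustrative data: three loose groups of two points each
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 9.0), (2.0, 8.5)]
initial = kmeans_pp_init(points, k=3, seed=0)
print(initial)
```

Because the weights grow with the squared distance, points far from every existing centroid are the most likely to be picked next, which tends to spread the initial centroids across the data.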