Daily Dose of Data Science
Posts
A (Highly) Important Point to Consider Before You Use KMeans Next Time

A (Highly) Important Point to Consider Before You Use KMeans Next Time

Estimating the success rate of KMeans.

March 31, 2023 • Reading Time: 3 minutes

The most important yet often overlooked step of KMeans is its centroid initialization. Here's something to consider before you use it next time.

KMeans selects the initial centroids randomly. As a result, it fails to converge at times. This requires us to repeat clustering several times with different initialization.

Yet, repeated clustering may not guarantee that you will soon end up with the correct clusters. This is especially true when you have many centroids to begin with.

Instead, KMeans++ takes a smarter approach to initialize centroids.

The first centroid is selected randomly. But the next centroid is chosen based on the distance from the first centroid.

In other words, a point that is away from the first centroid is more likely to be selected as an initial centroid. This way, all the initial centroids are likely to lie in different clusters already, and the algorithm may converge faster and more accurately.

The impact is evident from the bar plots shown below. They depict the frequency of the number of misplaced centroids obtained (analyzed manually) after training 50 different models with KMeans and KMeans++.

On the given dataset, out of the 50 models, KMeans only produced zero misplaced centroids once, which is a success rate of just 2%.

In contrast, KMeans++ never produced any misplaced centroids.

Luckily, if you are using sklearn, you don’t need to worry about the initialization step. This is because sklearn, by default, resorts to the KMeans++ approach.

However, if you have a custom implementation, do give it a thought.

👉 Tell me you liked this post by leaving a heart react 🤍.

👉 If you love reading this newsletter, feel free to share it with friends!

Hey there!

I hope you may have noticed that I have made these daily newsletter issues a bit more detailed. What is your feedback about this change?

Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.

Reply

or to participate.