The Limitation of KMeans Which Is Often Overlooked by Many

...and here's how to address it.

The KMeans algorithm has a major limitation.

To recall, KMeans is trained as follows:

  • Initialize centroids

  • Find the nearest centroid for each point

  • Reassign centroids

  • Repeat until convergence

But the second step:

  • involves a brute-force and exhaustive search

  • finds the distance of every point from every centroid

As a result, this implementation:

  • isn't optimized

  • takes plenty of time to train and predict

This is especially challenging with large datasets.

To speed up KMeans, use Faiss by Facebook AI Research.

Faiss:

  • provides a much faster nearest-neighbor search.

  • finds an approximate best solution

  • uses "Inverted Index", which is an optimized data structure to store and index the data point

This makes performing clustering extremely efficient.

As shown above, Faiss is roughly 20x as fast as running the ideal KMeans algorithm from Sklearn.

Get started with Faiss: GitHub.

πŸ‘‰ Over to you: What are some other limitations of the KMeans algorithm?

πŸ‘‰ Tell the world what makes this newsletter special for you by leaving a review here :)

πŸ‘‰ If you liked this post, don’t forget to leave a like ❀️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.

πŸ‘‰ If you love reading this newsletter, feel free to share it with friends!

πŸ‘‰ Sponsor the Daily Dose of Data Science Newsletter. More info here: Sponsorship details.

Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.

Reply

or to participate.