Daily Dose of Data Science
Posts
The Limitation of KMeans Which Is Often Overlooked by Many

The Limitation of KMeans Which Is Often Overlooked by Many

...and here's how to address it.

June 10, 2023 • Reading Time: 3 minutes

The KMeans algorithm has a major limitation.

To recall, KMeans is trained as follows:

Initialize centroids
Find the nearest centroid for each point
Reassign centroids
Repeat until convergence

But the second step:

involves a brute-force and exhaustive search
finds the distance of every point from every centroid

As a result, this implementation:

isn't optimized
takes plenty of time to train and predict

This is especially challenging with large datasets.

To speed up KMeans, use Faiss by Facebook AI Research.

Faiss:

provides a much faster nearest-neighbor search.
finds an approximate best solution
uses "Inverted Index", which is an optimized data structure to store and index the data point

This makes performing clustering extremely efficient.

As shown above, Faiss is roughly 20x as fast as running the ideal KMeans algorithm from Sklearn.

Get started with Faiss: GitHub.

👉 Over to you: What are some other limitations of the KMeans algorithm?

👉 Tell the world what makes this newsletter special for you by leaving a review here :)

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights. The button is located towards the bottom of this email.

👉 If you love reading this newsletter, feel free to share it with friends!

👉 Sponsor the Daily Dose of Data Science Newsletter. More info here: Sponsorship details.

Find the code for my tips here: GitHub.

I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.

Reply

or to participate.