ANN-driven KMeans with Faiss

20x speedup over sklearn.

June 28, 2024 • Reading Time: 5 minutes

Speed up KMeans with Faiss

KMeans is trained as follows:

Step 1) Initialize centroids
Step 2) Find the nearest centroid for each point
Step 3) Recompute each centroid as the mean of the points assigned to it
Step 4) Repeat until convergence
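
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not sklearn's or Faiss's actual code: the function name kmeans_lloyd, the random initialization, and the stopping rule (stop when assignments no longer change) are assumptions made for the example.

```python
import numpy as np

def kmeans_lloyd(x, k, n_iter=100, seed=0):
    """Toy KMeans (Lloyd's algorithm) with a brute-force assignment step."""
    x = np.asarray(x, dtype=np.float64)
    rng = np.random.default_rng(seed)

    # Step 1: initialize centroids with k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.full(len(x), -1)

    for _ in range(n_iter):
        # Step 2: squared distance from every point to every centroid, shape (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)

        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Step 3: recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)

    return centroids, labels
```

Note that the distance matrix in Step 2 has one entry per (point, centroid) pair, so it grows quickly as the dataset and the number of clusters grow.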

But in the standard implementation, Step 2 is a run-time bottleneck because it relies on a brute-force, exhaustive search.

In other words, it computes the distance from every data point to every centroid.

As a result, the cost of each iteration grows with (number of points) × (number of centroids) × (number of dimensions), so both training and prediction take plenty of time.

This is especially challenging with large datasets.

To speed up KMeans, one of the implementations I usually prefer, especially on large datasets, is Faiss by Facebook AI Research.

To elaborate further, Faiss provides much faster nearest-neighbor search by using approximate nearest-neighbor (ANN) algorithms.

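Faiss offers an “Inverted Index,” an optimized data structure that stores and indexes the data points by partition, so a query only needs to scan the few partitions closest to it instead of the whole dataset.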
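
To make the inverted-index idea concrete, here is a toy NumPy sketch. It is not Faiss's implementation: the function names (build_inverted_index, ivf_search), the coarse_centroids argument, and the nprobe parameter as used here are assumptions for illustration only. The idea is to group points by their nearest coarse centroid, then search only the few groups closest to the query.

```python
import numpy as np

def build_inverted_index(points, coarse_centroids):
    """Return one array of point ids per cell (cell = nearest coarse centroid)."""
    d2 = ((points[:, None, :] - coarse_centroids[None, :, :]) ** 2).sum(axis=2)
    cell_of_point = d2.argmin(axis=1)
    return [np.flatnonzero(cell_of_point == c) for c in range(len(coarse_centroids))]

def ivf_search(query, points, coarse_centroids, inverted_lists, nprobe=2):
    """Return the id of the (approximate) nearest point to `query`, or -1 if none."""
    # Rank the cells by the distance from the query to each coarse centroid.
    cell_d2 = ((coarse_centroids - query) ** 2).sum(axis=1)
    probed = np.argsort(cell_d2)[:nprobe]

    # Only points stored in the probed cells are compared against the query.
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    if candidates.size == 0:
        return -1
    cand_d2 = ((points[candidates] - query) ** 2).sum(axis=1)
    return int(candidates[cand_d2.argmin()])
```

With a small nprobe, only a fraction of the points is scanned, which is where the speedup comes from, at the cost of occasionally missing the true nearest neighbor. Probing every cell makes the search exact again.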

We covered indexing techniques in the vector databases article here: A Beginner-friendly and Comprehensive Deep Dive on Vector Databases

This makes performing clustering extremely efficient, especially on large datasets, which is also evident from the image below:

As shown above, on a dataset of 500k data points (1024 dimensions), Faiss is roughly 20x faster than KMeans from Sklearn, which is an insane speedup.

What’s more, Faiss can also run on a GPU, which can further reduce your clustering run time.
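
For reference, here is a minimal sketch of clustering with Faiss's Python Kmeans helper. It assumes the faiss-cpu (or faiss-gpu) package is installed. The wrapper name faiss_kmeans and its defaults are hypothetical choices for this example, and this is not the benchmark code behind the numbers above.

```python
import numpy as np

def faiss_kmeans(x, k, niter=20):
    """Cluster `x` into k groups with faiss.Kmeans; returns (centroids, labels)."""
    import faiss  # imported here so this module still loads where faiss is missing

    # Faiss expects a contiguous float32 array of shape (n_points, n_dims).
    x = np.ascontiguousarray(x, dtype=np.float32)

    kmeans = faiss.Kmeans(x.shape[1], k, niter=niter, verbose=False)
    kmeans.train(x)

    # Assign each point to its nearest centroid using the index built during training.
    _, labels = kmeans.index.search(x, 1)
    return kmeans.centroids, labels.ravel()
```

On a machine with a GPU build of Faiss, passing gpu=True to the faiss.Kmeans constructor should move the computation to the GPU.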

👉 Get started with Faiss here: GitHub.

👉 Over to you: What are some other limitations of the KMeans algorithm?
