Approximate Nearest Neighbor Search Using Inverted File Index

Speed up traditional kNN.

The AI consulting market is about to grow by 8X – from $6.9 billion today to $54.7 billion in 2032.

But how does an AI enthusiast become an AI consultant?

How well you answer that question makes the difference between just “having AI ideas” and being handsomely compensated for contributing to an organization’s AI transformation.

Thankfully, you don’t have to go it alone—our friends at Innovating with AI just welcomed 200 new students into The AI Consultancy Project, their new program that trains you to build a business as an AI consultant.

Some of the highlights current students are excited about:

  • The tools and frameworks to find clients and deliver top-notch services.

  • A 6-month plan to build a 6-figure AI consulting business.

  • Students getting their first AI client in as little as 3 days.

And as a Daily Dose of Data Science reader, you have a chance to get early access to the next enrollment cycle.

Thanks to Innovating with AI for sponsoring today’s issue.

One of the biggest issues with nearest neighbor search using kNN is that it performs an exhaustive search.

In other words, the query data point must be matched across all data points to find the nearest neighbor(s).

This is highly inefficient, especially when we have many data points and a near-real-time response is necessary.
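
To make the cost of exhaustive search concrete, here is a minimal brute-force kNN sketch in NumPy. The dataset size, dimensionality, and variable names are purely illustrative:

```python
import numpy as np

def knn_exhaustive(query, data, k=1):
    """Brute-force kNN: compare the query against every point in the dataset."""
    # Euclidean distance from the query to all N points -> O(N * D) work.
    dists = np.linalg.norm(data - query, axis=1)
    # Indices of the k closest points.
    return np.argsort(dists)[:k]

# Illustrative data: 100k points in 128 dimensions.
data = np.random.rand(100_000, 128).astype(np.float32)
query = np.random.rand(128).astype(np.float32)
print(knn_exhaustive(query, data, k=5))
```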

That is why approximate nearest neighbor search algorithms are becoming increasingly popular.

The core idea is to narrow down the search space using indexing techniques, thereby improving the overall run-time performance.

The Inverted File Index (IVF) is possibly one of the simplest and most intuitive techniques here, and one you can start using immediately.

Here’s how it works:

Given a set of data points in a high-dimensional space, the idea is to organize them into different partitions, typically using clustering algorithms like k-means.

As a result, each partition has a corresponding centroid, and every data point gets associated with only one partition corresponding to its nearest centroid.

Also, every centroid maintains information about all the data points that belong to its partition.

Indexing done!
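
Here's a rough sketch of that indexing step, assuming scikit-learn's KMeans for the clustering and reusing the `data` array from the earlier snippet (the partition count and names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_ivf_index(data, n_partitions=100):
    """Cluster the dataset and build one inverted list per centroid."""
    # Partition the data with k-means -> one centroid per partition.
    kmeans = KMeans(n_clusters=n_partitions, n_init=1, random_state=0).fit(data)
    centroids = kmeans.cluster_centers_
    # Inverted lists: partition id -> indices of the points assigned to it.
    inverted_lists = {p: [] for p in range(n_partitions)}
    for idx, label in enumerate(kmeans.labels_):
        inverted_lists[label].append(idx)
    return centroids, inverted_lists

# `data` is the same (N, D) array used in the brute-force sketch above.
centroids, inverted_lists = build_ivf_index(data, n_partitions=100)
```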

Here’s how we search.

When searching for the nearest neighbor(s) to the query data point, instead of searching across the entire dataset, we first find the centroid closest to the query.

Once we find the nearest centroid, we search for the nearest neighbor only among the data points that belong to that partition.
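
A minimal sketch of the search step, assuming the `centroids` and `inverted_lists` built in the previous snippet:

```python
import numpy as np

def ivf_search(query, data, centroids, inverted_lists, k=1):
    """Approximate search: scan only the partition nearest to the query."""
    # Step 1: find the closest centroid -> O(K * D).
    nearest = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
    # Step 2: exhaustive search, but only inside that partition -> O(N * D / K).
    candidates = np.asarray(inverted_lists[nearest])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(ivf_search(query, data, centroids, inverted_lists, k=5))
```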

Let’s see how the run-time complexity stands in comparison to traditional kNN.

Consider the following:

  • There are N data points

  • Each data point is D dimensional

  • We create K partitions.

  • Lastly, for simplicity, let’s assume that each partition gets an equal number of data points.

In kNN, the query data point is matched to all N data points, which makes the time complexity → O(ND).

In IVF, however, there are two steps:

  1. Match to all centroids → O(KD).

  2. Find the nearest neighbor in the nearest partition → O(ND/K).

The final time complexity comes out to be O(KD + ND/K), which is significantly lower than that of kNN.

To get some perspective, assume we have 10M data points. The search complexity of kNN will be proportional to 10M.

But with IVF, say we divide the data into 100 partitions; each partition then gets roughly 100k data points.

Thus, the time complexity comes out to be proportional to 100 + 100k = 100,100, making the search nearly 100 times faster.
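
A quick back-of-the-envelope check of that arithmetic:

```python
N, K = 10_000_000, 100   # data points and partitions from the example above

knn_comparisons = N              # exhaustive search scans every point
ivf_comparisons = K + N // K     # all centroids + one partition

print(ivf_comparisons)                    # 100100
print(knn_comparisons / ivf_comparisons)  # ~99.9, i.e., nearly 100x fewer comparisons
```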

Of course, it is essential to note that if some data points are actually close to the query point but happen to fall in a neighboring partition, we will miss them during the nearest neighbor search.

But this accuracy tradeoff is something we willingly accept for better run-time performance, which is precisely why these techniques are called “approximate nearest neighbor search.”
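
One simple way to quantify how often this happens is to measure recall against the exact brute-force result. Here is a rough illustration (not a benchmark), reusing the helpers and index sketched earlier:

```python
# Fraction of queries where IVF returns the same top-1 result as exact kNN.
queries = np.random.rand(200, 128).astype(np.float32)

hits = 0
for q in queries:
    exact = knn_exhaustive(q, data, k=1)[0]
    approx = ivf_search(q, data, centroids, inverted_lists, k=1)[0]
    hits += int(exact == approx)

print(f"Recall@1: {hits / len(queries):.2%}")
```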

In our deep dive on vector databases (not paywalled), we discussed four such techniques as part of an entirely beginner-friendly and thorough introduction to the topic.

Check it out here if you haven’t already: A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

Moreover, a few weeks back, we discussed two powerful ways to supercharge kNN models in this newsletter. Read it here: Why Traditional kNN is Not Suited for Imbalanced Datasets.

For those wanting to develop “Industry ML” expertise:

We have discussed several other topics (with implementations) in the past that align with “industry ML.”

Here are some of them:

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?

  • Drive revenue?

  • Can you scale ML models?

  • Predict trends before they happen?

All these resources will help you cultivate those key skills.
