Approximate Nearest Neighbor Search Using Inverted File Index

Speed up traditional kNN.

One of the biggest issues with nearest neighbor search using kNN is that it performs an exhaustive search.

In other words, the query data point must be matched across all data points to find the nearest neighbor(s).

This is highly inefficient, especially when we have many data points and a near-real-time response is necessary.

That is why approximate nearest neighbor search algorithms are becoming increasingly popular.

The core idea is to narrow down the search space using indexing techniques, thereby improving the overall run-time performance.

The Inverted File Index (IVF) is possibly the simplest and most intuitive of these techniques, and one you can start using right away.

Here’s how it works:

Given a set of data points in a high-dimensional space, the idea is to organize them into different partitions, typically using clustering algorithms like k-means.

As a result, each partition has a corresponding centroid, and every data point is assigned to exactly one partition: the one whose centroid is nearest to it.

Also, every centroid maintains information about all the data points that belong to its partition.

Indexing done!
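To make the indexing step concrete, here is a minimal sketch in Python. It assumes scikit-learn's KMeans for the clustering step and a small random dataset; the variable names (`data`, `inverted_index`, etc.) are purely illustrative, not tied to any specific library.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset: N points in a D-dimensional space
rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 64))          # N = 10,000, D = 64

# Partition the data with k-means
K = 100
kmeans = KMeans(n_clusters=K, random_state=42).fit(data)
centroids = kmeans.cluster_centers_           # shape (K, D)
labels = kmeans.labels_                       # nearest centroid for each point

# Inverted file: centroid id -> indices of the points in its partition
inverted_index = {k: np.where(labels == k)[0] for k in range(K)}
```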

Here’s how we search.

When searching for the nearest neighbor(s) to the query data point, instead of searching across the entire dataset, we first find the closest centroid to the query:

Once we have the nearest centroid, we search for the nearest neighbor only among the data points that belong to that partition, as sketched below.
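Continuing the indexing sketch above, a search routine could look like the following. It reuses the `data`, `centroids`, and `inverted_index` variables from that snippet; `top_k` is an illustrative parameter.

```python
def ivf_search(query, top_k=5):
    # Step 1: compare the query with all K centroids -> O(K·D)
    nearest_partition = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))

    # Step 2: exhaustive search, but only inside that partition -> O(N·D/K)
    candidate_ids = inverted_index[nearest_partition]
    dists = np.linalg.norm(data[candidate_ids] - query, axis=1)
    order = np.argsort(dists)[:top_k]
    return candidate_ids[order], dists[order]

neighbor_ids, neighbor_dists = ivf_search(rng.normal(size=64))
```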

Let’s see how the run-time complexity stands in comparison to traditional kNN.

Consider the following:

  • There are N data points

  • Each data point is D dimensional

  • We create K partitions.

  • Lastly, for simplicity, let’s assume that each partition gets an equal number of data points.

In kNN, the query data point is matched to all N data points, which makes the time complexity → O(ND).

In IVF, however, there are two steps:

  1. Match to all centroids → O(KD).

  2. Find the nearest neighbor in the nearest partition → O(ND/K).

The final time complexity comes out to be O(KD + ND/K), which is significantly lower than that of kNN.

To get some perspective, assume we have 10M data points. The search complexity of kNN will be proportional to 10M.

But with IVF, say we divide the data into 100 partitions, so each partition gets roughly 100k data points.

Thus, the search cost comes out to be proportional to 100 + 100,000 = 100,100, which is nearly 100 times lower.
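Here is a quick back-of-the-envelope check of this arithmetic (a rough sketch that ignores constant factors and the dimensionality D, which cancels out in the ratio):

```python
N = 10_000_000          # total data points
K = 100                 # partitions / centroids

knn_cost = N            # exhaustive kNN: compare with every point
ivf_cost = K + N // K   # IVF: K centroid comparisons + one partition scan

print(ivf_cost)                     # 100100
print(round(knn_cost / ivf_cost))   # ~100x fewer comparisons
```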

Of course, it is essential to note that if some data points are actually close to the query data point but happen to fall in a neighboring partition, we will miss them during the nearest neighbor search, as shown below:

But this accuracy tradeoff is something we willingly accept for better run-time performance, which is precisely why these techniques are called “approximate nearest neighbor search.”

In a recent deep dive (not paywalled), we discussed four such techniques, along with an entirely beginner-friendly and thorough introduction to vector databases.

Check it out here if you haven’t already: A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.

Moreover, around 6 weeks back, we discussed two powerful ways to supercharge kNN models in this newsletter. Read it here: Two Simple Yet Immensely Powerful Techniques to Supercharge kNN Models.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!

