Use SQL "NOT IN" With Caution

An unexpected error that can take hours to debug.

In my experience, many unexpected errors in libraries/tools/languages can be attributed to the presence of missing values.

For instance, consider a 1D NumPy array with NaN values.

When we aggregate this NumPy array to calculate, say, its sum, we get the following output:

Strange, right?

Although an output of 6 may have made more sense here, NumPy produces a NaN value instead.

Anyway, this is not the topic of today’s post, but I hope you get the point.

A similar silent mistake can be found in SQL as well, specifically in the usage of “NOT IN” clause, which many SQL users are not aware of.

Let’s understand!

Consider we have the following tables (students and names), in our database:

The task is to select records from students table where first_name is not in the names table.

One way to do this is by using the NOT IN clause, as demonstrated below:

This provides the expected results as well.

Everything’s good so far!

Now, say our names table had a NULL value:

If we run the above query again, we get no records this time:

On a side note, had we used the IN clause to select rows from the students table where first_name was in the names table, we notice that it works as expected:

That’s strange, isn’t it?

Why do we get no records with NOT IN?

The reason we get no records when we use NOT IN but the IN clause works as expected has to do with how these two clauses operate internally.

For simplicity, consider we are currently checking the first record (where first_name is “John”) from the students table:

The WHERE clause needs a boolean value to determine whether a record must be filtered or not.

When we use “IN”, this boolean value is evaluated using the OR operator as follows:

If any condition is TRUE, the row gets filtered.

However, when we use “NOT IN”, the boolean value is evaluated using the AND operator as follows:

For the above expression to be TRUE, all individual conditions MUST be TRUE.

But the (JOHN != None) condition produces a conflict because, typically, this condition results in an UNKNOWN value.

Thus, the entire expression evaluates to UNKNOWN — producing no records.

This happens for every record in the students table. As a result, the query results in no records:

What to do instead?

There are many ways to avoid this.

As our task is to select records from students table where first_name is not in the names table, we can:

  • Filter out the NULL values in the sub-query:

  • Use Anti Joins: Anti join returns only those rows from the left table where no match is found in the right table. This is precisely what we need in our case and it is implemented below:

At times, such mistakes can take some serious time to debug if you are not aware of them beforehand.

If you wish to experiment with what we discussed about the NOT IN clause today, download this Jupyter Notebook: SQL NOT IN Notebook.

👉 Over to you: Can you propose another solution for the problem above?

Are you overwhelmed with the amount of information in ML/DS?

Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.

For instance:

Join below to unlock all full articles:

SPONSOR US

Get your product in front of 85,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email to ensure your product reaches this influential audience.

Reply

or to participate.