Daily Dose of Data Science
Posts
A Silent Mistake That Many SQL Users Commit and Take Hours to Debug

A Silent Mistake That Many SQL Users Commit and Take Hours to Debug

...and 2 ways to avoid it.

January 05, 2024 • Reading Time: 7 minutes

In my experience, many unexpected errors in libraries/tools/languages can be attributed to the presence of missing values.

For instance, consider a 1D NumPy array with NaN values.

When we aggregate this NumPy array to calculate, say, its sum, we get the following output:

Strange, right?

Although an output of 6 may have made more sense here, NumPy produces a NaN value instead.

Anyway, this is not the topic of today’s post, but I hope you get the point.

A similar silent mistake can be found in SQL as well, specifically in the usage of “NOT IN” clause, which many SQL users are not aware of.

Let’s understand!

Consider we have the following tables (students and names), in our database:

The task is to select records from students table where first_name is not in the names table.

One way to do this is by using the NOT IN clause, as demonstrated below:

This provides the expected results as well.

Everything’s good so far!

Now, say our names table had a NULL value:

If we run the above query again, we get no records this time:

On a side note, had we used the IN clause to select rows from the students table where first_name was in the names table, we notice that it works as expected:

That’s strange, isn’t it?

Why do we get no records with NOT IN?

The reason we get no records when we use NOT IN but the IN clause works as expected has to do with how these two clauses operate internally.

For simplicity, consider we are currently checking the first record (where first_name is “John”) from the students table:

The WHERE clause needs a boolean value to determine whether a record must be filtered or not.

When we use “IN”, this boolean value is evaluated using the OR operator as follows:

If any condition is TRUE, the row gets filtered.

However, when we use “NOT IN”, the boolean value is evaluated using the AND operator as follows:

For the above expression to be TRUE, all individual conditions MUST be TRUE.

But the (JOHN != None) condition produces a conflict because, typically, this condition results in an UNKNOWN value.

Thus, the entire expression evaluates to UNKNOWN — producing no records.

This happens for every record in the students table. As a result, the query results in no records:

What to do instead?

There are many ways to avoid this.

As our task is to select records from students table where first_name is not in the names table, we can:

Filter out the NULL values in the sub-query:

Use Anti Joins: Anti join returns only those rows from the left table where no match is found in the right table. This is precisely what we need in our case and it is implemented below:

At times, such mistakes can take some serious time to debug if you are not aware of them beforehand.

Hope that helped!

If you wish to experiment with what we discussed about the NOT IN clause today, download this Jupyter Notebook: NOT IN Clause Notebook.

👉 Over to you: Can you propose another solution for the problem above?

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!

Latest full articles

If you’re not a full subscriber, here’s what you missed last month:

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

👉 Tell the world what makes this newsletter special for you by leaving a review here :)

👉 If you love reading this newsletter, feel free to share it with friends!

Reply

or to participate.