Double Descent vs. Bias-Variance Trade-off

A counterintuitive phenomenon observed when training ML models.

It is well known that as the number of model parameters increases, a model typically overfits the training data more and more.

For instance, consider fitting a polynomial regression model to a small dummy dataset.

In case you don’t know, a polynomial regression model of degree m fits y = w0 + w1·x + w2·x² + … + wm·x^m to the data.
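
Here’s a minimal sketch of how such a fit might look in code, assuming a synthetic noisy dataset (the data, seed, and degree here are illustrative, not the exact setup behind the plots):

```python
import numpy as np

# Illustrative dummy dataset: noisy samples from a smooth curve
rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-1, 1, 20))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Fit a degree-m polynomial: y ≈ w0 + w1·x + ... + wm·x^m
m = 3
coeffs = np.polyfit(x, y, deg=m)  # least-squares polynomial fit
y_hat = np.polyval(coeffs, x)     # predictions on the training points

train_mse = np.mean((y - y_hat) ** 2)
print(f"degree={m}, train MSE={train_mse:.4f}")
```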

It is expected that as we increase the degree (m) and train the polynomial regression model:

  • The training loss will get closer and closer to zero.

  • The test (or validation) loss will first decrease and then increase.

This is because a higher-degree polynomial can more easily contort its fit through every training data point, which makes sense.

In fact, this is also evident from the following loss plot:

But notice what happens when we continue to increase the degree (m):

That’s strange, right?

Why does the test loss increase up to a certain point but then decrease again?

This was not expected, was it?

Well… what you are seeing is called the “double descent” phenomenon, which is quite commonly observed in many ML models, especially deep learning models.

It shows that, counterintuitively, increasing model complexity beyond the interpolation point (where the model has enough capacity to fit the training data perfectly) can improve generalization performance.
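
If you want to poke at this yourself, here’s a minimal sketch that sweeps the degree past the interpolation threshold (m + 1 > number of training points), using a Vandermonde design matrix and np.linalg.lstsq, which returns the minimum-norm solution once the system is underdetermined. The data is synthetic and illustrative; whether (and how sharply) the second descent shows up depends on the basis, the data, and the noise; random or orthogonal features tend to exhibit it more reliably than raw monomials:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 15, 200

# Illustrative data: noisy samples from a smooth target function
def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1, 1, n_train)
y_tr = target(x_tr) + rng.normal(0, 0.1, n_train)
x_te = rng.uniform(-1, 1, n_test)
y_te = target(x_te)

def design(x, m):
    # Vandermonde matrix with columns 1, x, x^2, ..., x^m
    return np.vander(x, m + 1, increasing=True)

for m in [1, 3, 5, 10, 14, 20, 50, 100]:
    X_tr, X_te = design(x_tr, m), design(x_te, m)
    # lstsq returns the minimum-norm least-squares solution, which
    # matters once m + 1 > n_train (past the interpolation threshold)
    w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    tr_mse = np.mean((X_tr @ w - y_tr) ** 2)
    te_mse = np.mean((X_te @ w - y_te) ** 2)
    print(f"m={m:3d}  train MSE={tr_mse:.2e}  test MSE={te_mse:.2e}")
```

In sketches like this, the test error typically peaks near the interpolation threshold (m = 14 here) and can come back down as m grows further, since the minimum-norm solution tends to get smoother again.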

In fact, this whole idea is deeply rooted in why LLMs, despite being massive (billions or even trillions of parameters), can still generalize pretty well.

And it’s hard to accept because this phenomenon directly challenges the traditional bias-variance trade-off we learn in any introductory ML class:
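
For reference, that trade-off decomposes the expected test error as: expected error = bias² + variance + irreducible noise. The classical story is that bias falls and variance rises as model complexity grows, so the test loss should trace a single U-shape, not descend a second time.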

Put another way, very large models, even with more parameters than training data points, can still generalize well.

To the best of my knowledge, this is still an open question, and it isn’t entirely clear why neural networks exhibit this behavior.

There are, however, some theories around regularization, such as this one:

It could be that training applies some form of implicit regularization that lets the model effectively use only as many parameters as it needs to generalize.
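
One concrete way to see what “implicit regularization” could mean: for an underdetermined linear system, infinitely many weight vectors interpolate the training data perfectly, yet the pseudoinverse (and, for linear least squares, gradient descent initialized at zero) picks the one with the smallest L2 norm. A minimal sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Underdetermined system: 5 training points, 20 features, so
# infinitely many weight vectors interpolate the data exactly
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

# The pseudoinverse selects the minimum-L2-norm interpolating solution
w_min = np.linalg.pinv(X) @ y

# Shift w_min by a null-space direction of X to get another
# interpolating solution with a strictly larger norm
e0 = np.eye(20)[0]
null_dir = e0 - np.linalg.pinv(X) @ (X @ e0)  # e0 projected onto null(X)
w_other = w_min + null_dir

print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))  # True True
print(np.linalg.norm(w_min) < np.linalg.norm(w_other))          # True
```

If something analogous happens inside large neural networks, the extra capacity would not necessarily be “used” in a way that hurts generalization.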

But to be honest, nothing is clear yet.

👉 Over to you: I would love to hear what you think about this phenomenon and its possible causes.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!

Latest full articles

If you’re not a full subscriber, here’s what you missed last month:

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

👉 Tell the world what makes this newsletter special for you by leaving a review here :)

👉 If you love reading this newsletter, feel free to share it with friends!
