Double Descent in ML
A counterintuitive phenomenon while training ML models.
Announcement
CoAgents (by CopilotKit) is going into public Beta today. The team is giving our readers a limited-time offer to promote the release and a chance to win:
$500 in CopilotKit cloud credits + $250 in OpenAI credits.
CopilotKit merchandise (Shirts, Hoodies, Stickers, etc.).
Dinner with the founders (for those based in San Francisco).
Steps:
P.S. This message was NOT sponsored! It’s a genuine shout-out to a powerful product. I have talked about CopilotKit several times in this newsletter before and appreciate what the team is building.
Let’s get to today’s post now.
Double Descent vs. Bias-Variance Trade-off
It is well-known that as the number of model parameters increases, we typically overfit the data more and more.
For instance, consider a polynomial regression model trained on the dummy dataset below:
In case you don’t know, this is called a polynomial regression model:
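For reference, here is a standard way to write it (the exact symbols in the original figure may differ):

$$\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_m x^m$$

The degree (m) controls the model’s capacity: more terms, more parameters.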
As we increase the degree (m) and train the polynomial regression model, it is expected that:
The training loss will get closer and closer to zero.
The test (or validation) loss will first decrease and then grow larger and larger.
This is because, with a higher degree, the model will find it easier to contort its regression fit through each training data point, which makes sense.
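Here is a minimal sketch of that experiment, assuming a made-up noisy 1-D dataset and scikit-learn (not the exact data or code behind the plots in this post):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical dummy dataset: a noisy sine wave (stand-in for the one in the post)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep the polynomial degree and track train/test loss
for m in range(1, 16):
    model = make_pipeline(PolynomialFeatures(degree=m), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={m:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training MSE keeps shrinking as the degree grows, while the test MSE typically bottoms out around the true complexity of the data and then climbs, which is the classic U-shape.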
In fact, this is also evident from the following loss plot:
But notice what happens when we continue to increase the degree (m):
That’s strange, right?
Why does the test loss increase up to a certain point and then start decreasing again?
This was not expected, was it?
Well…what you are seeing is called the “double descent phenomenon,” which is quite commonly observed in many ML models, especially deep learning models.
It shows that, counterintuitively, increasing the model complexity beyond the interpolation threshold (the point where the model has enough capacity to fit the training data almost perfectly) can improve generalization performance.
In fact, this idea is deeply tied to why LLMs, despite being massive (billions or even trillions of parameters), can still generalize pretty well.
And it’s hard to accept because this phenomenon directly challenges the traditional bias-variance trade-off we learn in any introductory ML class:
Put another way, training very large models, even with more parameters than training data points, can still generalize well.
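One common way to see this numerically is linear regression on random features solved with the minimum-norm (pseudoinverse) solution, so the parameter count can exceed the number of training points. A rough sketch, with all data and feature choices made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 1-D regression problem (hypothetical data)
n_train = 20
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, 500)
target = lambda x: np.sin(2 * np.pi * x)
y_train = target(x_train) + rng.normal(scale=0.1, size=n_train)
y_test = target(x_test)

# Random Fourier-style features; p controls the number of parameters
freqs = rng.normal(scale=4.0, size=300)
phases = rng.uniform(0, 2 * np.pi, 300)

def features(x, p):
    return np.cos(np.outer(x, freqs[:p]) + phases[:p])

for p in [5, 10, 15, 20, 25, 40, 80, 160, 300]:
    Phi_train, Phi_test = features(x_train, p), features(x_test, p)
    # Minimum-norm least-squares fit; well-defined even when p > n_train
    w = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ w - y_test) ** 2)
    print(f"params={p:3d}  test MSE={test_mse:.3f}")
```

With this kind of setup, the test error typically spikes near the interpolation threshold (p ≈ n_train = 20) and then falls again as p keeps growing, though the exact curve depends on the data and the random seed.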
To the best of my knowledge, this is still an open question, and it isn’t entirely clear why neural networks exhibit this behavior.
There are some theories, however, mostly centered on regularization. For instance:
It could be that training applies some form of implicit regularization, which lets the model effectively use only as many parameters as it needs to generalize well.
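One concrete version of this idea, at least in the simple linear setting (this is not a claim about deep networks): among all parameter vectors that fit the training data exactly, gradient descent initialized at zero converges to the one with the smallest norm,

$$\hat{w} = \arg\min_{w} \|w\|_2 \quad \text{subject to} \quad \Phi w = y,$$

which acts like a penalty on model complexity even though no explicit regularizer was ever added.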
But to be honest, nothing is clear yet.
👉 Over to you: I would love to hear what you think about this phenomenon and its possible causes.
For those wanting to develop “Industry ML” expertise:
We have discussed several other topics (with implementations) in the past that align with “industry ML.”
Here are some of them:
Quantization: Optimize ML Models to Run Them on Tiny Hardware
Conformal Predictions: Build Confidence in Your ML Model’s Predictions
5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
Model Compression: A Critical Step Towards Efficient Machine Learning
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
All these resources will help you cultivate those key skills.