A Common Misconception About Model Reproducibility

...And here's what reproducibility truly means.

In yesterday’s issue, we discussed a unique perspective on understanding the internal workings of a neural network.

To recall, introducing a dummy 2D layer right before the output layer helped us conclude that all a neural network is trying to do is transform the data into a linearly separable form before reaching the output layer.

If you are new here or wish to recall, you can read yesterday’s issue after reading today’s email: A Unique Perspective on What Hidden Layers and Activation Functions Do.

Today, we utilize this idea and discuss something extremely important about ML model reproducibility.

Let’s begin!

Imagine you trained an ML model, say a neural network.

Its training accuracy is 95%, and its test accuracy is 92%.

You trained the model again and got the same performance.

Will you call this a reproducible experiment?

Think for a second before you read further.

Contrary to common belief, this is not the true definition of reproducibility.

To understand this better, consider the illustration below. Here, we fed the same input data to neural networks with the same architecture but different random initializations. Next, we visualized the learned transformation using a 2D dummy layer:

It is clear that all models separate the data pretty well and give 100% accuracy.

Yet, if you look closely at each model, you will notice that each generates varying data transformations (or decision boundaries).
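If you want to recreate this comparison yourself, here is a minimal sketch of the setup (assuming PyTorch and scikit-learn's make_moons data; the exact architecture and dataset in the illustration may differ):

```python
# Train the SAME architecture on the SAME data with two different seeds
# and collect the activations of a 2-unit "dummy" layer placed right
# before the output layer, so they can be plotted and compared.

import torch
import torch.nn as nn
from sklearn.datasets import make_moons


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(2, 16), nn.ReLU(),
            nn.Linear(16, 2),          # 2D dummy layer for visualization
        )
        self.head = nn.Linear(2, 2)    # output layer

    def forward(self, x):
        z = self.backbone(x)           # 2D representation of the input
        return self.head(z), z


# Fixed dataset → only the model's randomization differs between runs
X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y)


def train_with_seed(seed):
    torch.manual_seed(seed)            # different seed → different initialization
    model = Net()
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    for _ in range(500):
        logits, _ = model(X)
        loss = nn.functional.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, z = model(X)                # 2D activations to visualize
    return z


z_a, z_b = train_with_seed(0), train_with_seed(1)
# Plotting z_a and z_b typically shows very different transformations,
# even though both models classify the data (almost) perfectly.
```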

Now will you call this reproducible?

No, right?

It is important to remember that reproducibility is NEVER measured in terms of performance metrics.

Instead, reproducibility is ensured when all sources of randomness, along with the model config, code, data, hyperparameters, etc., are tracked/logged.

This is because, as we saw above, two models with the same architecture but different random initializations can still perform equally well.

But that does not make your experiment reproducible.

And that is why it is also recommended to set seeds for all random number generators.

Once everything is tracked and the seeds are fixed, reproducibility follows.
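For reference, here is a minimal sketch of a seeding utility for a typical Python/PyTorch workflow (which libraries you actually need to seed depends on your stack):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness in one place."""
    random.seed(seed)                     # Python's built-in RNG
    np.random.seed(seed)                  # NumPy RNG
    torch.manual_seed(seed)               # PyTorch CPU RNG
    torch.cuda.manual_seed_all(seed)      # PyTorch GPU RNGs (all devices)
    # Note: PYTHONHASHSEED only takes full effect if set before Python starts
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```

Log the seed along with the rest of the config so the run can be recreated exactly.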

Yet, in my experience, most ML projects lack a dedicated experiment management/tracking system.

As the name suggests, this helps us track:

  • Model configuration → critical for reproducibility.

  • Model performance → critical for comparing different models.

…across all experiments.

Most data scientists and machine learning engineers develop entire models in Jupyter notebooks without having any well-defined and automated reproducibility and performance tracking protocols.

They rely heavily on inefficient, manual tracking systems such as Sheets and Docs, which quickly become difficult to manage.

MLflow stands out as a valuable tool here, offering experiment tracking, model packaging, and a model registry for ML pipelines.

It can run locally for an individual or on a remote tracking server for a large ML engineering team, and it integrates seamlessly with various cloud services.
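As a rough sketch, logging the config and metrics of the model we discussed earlier with MLflow's tracking API could look like this (the hyperparameters and metric values are placeholders, not real results):

```python
import mlflow

mlflow.set_experiment("reproducibility-demo")

with mlflow.start_run():
    # Model configuration → critical for reproducibility
    mlflow.log_params({
        "seed": 42,
        "hidden_units": 16,
        "learning_rate": 0.01,
        "epochs": 500,
    })
    # Model performance → critical for comparing different models
    mlflow.log_metric("train_accuracy", 0.95)
    mlflow.log_metric("test_accuracy", 0.92)
```

Every run then shows up in the MLflow UI, so comparing experiments no longer depends on manually maintained Sheets or Docs.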

👉 Over to you: How do you make your ML projects reproducible?

  • 1 Referral: Unlock 450+ practice questions on NumPy, Pandas, and SQL.

  • 2 Referrals: Get access to advanced Python OOP deep dive.

  • 3 Referrals: Get access to the PySpark deep dive for big-data mastery.

Get your unique referral link:

Are you overwhelmed with the amount of information in ML/DS?

Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.


Join below to unlock all full articles:

SPONSOR US

Get your product in front of 79,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email.
