You Cannot Build Reliable Data Projects Until You Learn Data Version Control

The underappreciated, yet critical, skill that most data scientists overlook.

GitHub is not ideal for ML projects.

See, the problem is that Git is best suited for versioning a codebase, which is primarily composed of lightweight files.

And in collaborative software projects, GitHub lets us maintain a single source of truth for local Git repositories.

But ML projects are not solely driven by code, are they?

Instead, they also involve large data files, and these datasets can vary vastly across experiments.

To ensure proper reproducibility and experiment traceability, it becomes necessary to version datasets as well.

But versioning gigabytes of data is practically impossible with GitHub because it imposes an upper limit on the size of the files we can push to its remote repositories (files larger than 100 MiB are blocked).

So what is the solution here?

Data version control.

The core idea is to integrate another version control system with Git, one used specifically for large files.
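To make that idea concrete, here is a minimal Python sketch of one way such a pairing can work. It is only an illustration under stated assumptions, not the API of any real tool: the helper names `track` and `restore`, the `.ptr.json` pointer suffix, and the local cache folder are all hypothetical. The large file is stored outside Git under its content hash, and only a tiny pointer file would be committed to Git.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Hypothetical helper: store the data in a cache, write a small pointer file.

    The pointer file is the only thing that would be committed to Git.
    """
    digest = file_digest(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    meta = {
        "path": data_file.name,
        "sha256": digest,
        "size": data_file.stat().st_size,
    }
    pointer.write_text(json.dumps(meta, indent=2))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Hypothetical helper: recreate the data file that a pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target
```

With this layout, checking out an older commit brings back an older pointer file, and calling `restore` on it would fetch the matching dataset version from the cache. In a real setup the cache would typically live on remote storage rather than in a local folder.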

To help you build and deliver large data projects, this is precisely the topic we are covering in this week’s ML deep dive: You Cannot Build Large Data Projects Until You Learn Data Version Control!

As we transition to real-world data projects, we will certainly need data version control.

Many data science teams actively use it.

The idea behind data version control struck me as extremely compelling and clever when I first learned about it a few years back.

Learning and utilizing this skill has been extremely helpful to me in building reliable large ML models.

Thus, learning about data version control will be immensely valuable if you envision doing the same.

To help you truly get a grip on this skill, today’s article covers everything from scratch and also includes a full exercise.

I am sure you will learn something new today :)

Thanks for reading!
