No Data Scientist Should Ever Overlook Distributed Computing Skills

Don't stop at Pandas and Sklearn.

While the tabular data space is mostly dominated by Pandas and Sklearn, they offer little benefit once datasets grow beyond a few gigabytes.

This is because these tools are primarily designed for single-node processing, i.e., they operate under the assumption that the entire dataset can fit into the memory (RAM) of a single machine.

But this is rarely true for large datasets (say, >100 GB), unless you keep expanding the RAM, which quickly becomes impractical.

A more practical solution is distributed computing: a paradigm that spreads the data and the computation across many smaller machines.

Many distributed computing frameworks have come and gone, but nothing comes close to Spark. It remains one of the best technologies for quickly and efficiently analyzing, processing, and training models on big datasets.
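To make "distributed" concrete, here is a minimal PySpark sketch. The file path and app name are hypothetical placeholders, not a prescription:

```python
# A minimal PySpark sketch (the S3 path below is a hypothetical placeholder).
from pyspark.sql import SparkSession

# The SparkSession is the entry point to Spark. On a real cluster, the
# builder would point at a cluster manager instead of running locally.
spark = SparkSession.builder.appName("intro-sketch").getOrCreate()

# Unlike pandas.read_csv, this does NOT pull the file into one machine's RAM.
# Spark splits it into partitions that executors process in parallel.
df = spark.read.csv("s3://my-bucket/events.csv", header=True, inferSchema=True)

# Transformations are lazy; nothing runs until an action like count().
print(df.count())
```

The key mental shift from Pandas: `df` here is a recipe for a distributed computation, not an in-memory object.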

That is why most data science roles at big tech demand proficiency in Spark. It’s that important.

To help you cultivate this critical skill, I have covered PySpark in the most recent deep dive: Don't Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.

Fun fact: this deep dive is the longest article I have written in my couple of years as a creator (a 102-minute read, to be precise).

So if you are a complete beginner and have never used Spark before, that's okay. The article covers everything:

  • We start with the theoretical background: the history of big data tools like Hadoop, why they fell short, what Spark is, how it works, what Spark RDDs are, how Spark outperformed other tools, and more.

  • For practical understanding, we use Databricks to learn PySpark DataFrame syntax (a taste of which is sketched after this list). Again, if you don't know how to use Databricks, that's okay. The article covers everything.

  • Lastly, we use Spark's MLlib library and learn how to train machine learning models with PySpark. We also discuss PySpark pipelines, a popular utility that simplifies building, tuning, and deploying machine learning workflows (see the second sketch below).
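As a small preview of the DataFrame syntax the deep dive covers, here is a sketch. The DataFrame `df` and its columns (`country`, `amount`) are made up for illustration:

```python
# A sketch of typical PySpark DataFrame syntax. The DataFrame `df` and its
# columns ("country", "amount") are hypothetical, for illustration only.
from pyspark.sql import functions as F

result = (
    df.filter(F.col("amount") > 100)             # row filter, like a boolean mask in Pandas
      .groupBy("country")                        # groups are shuffled across the cluster
      .agg(F.avg("amount").alias("avg_amount"))  # aggregate within each group
      .orderBy(F.desc("avg_amount"))
)
result.show(5)  # an action: this is what actually triggers the computation
```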
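And here is a minimal MLlib pipeline sketch, assuming a training DataFrame `train_df` with hypothetical feature columns `f1`, `f2` and a binary `label` column:

```python
# A minimal MLlib Pipeline sketch. `train_df` and its columns are assumptions.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# MLlib models expect the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# A Pipeline chains the stages so fit() runs them in order, on the cluster.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)

# transform() appends prediction columns to a distributed DataFrame.
model.transform(train_df).select("label", "prediction").show(5)
```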

I am confident that adding Spark to your skill set will prove extremely valuable in your data science career.

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!
