Daily Dose of Data Science

The Most Common Misconception Pandas Users Have About Apply() Method

Avoid the apply() method whenever you can.

January 06, 2024 • Reading Time: 5 minutes

The apply() method in Pandas is the most common way to apply a function along an axis of a DataFrame or to the values of a Series.

In my experience, when using apply(), most Pandas users believe that it is a vectorized method.

In other words, they believe that apply() operates efficiently and performs element-wise operations like other vectorized operations in Pandas.

But this is NOT true.

Contrary to this common belief, every Pandas user MUST know that Pandas’ apply() method is NOT vectorized.

Instead, it’s essentially a glorified Python for-loop: it calls your function once per row, column, or element, and offers none of the vectorization-based optimization that one might expect.

As a result, the code runs at native Python speed, i.e., slowly.
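To see the loop for yourself, here is a minimal sketch. The DataFrame, its columns a and b, and the call_log list are made up for illustration. It records every call pandas makes to the function and compares the result with an explicit Python loop over the rows.

```python
import pandas as pd

# Toy data; the column names "a" and "b" are assumptions for this sketch.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

call_log = []  # row labels, appended each time pandas calls add_row

def add_row(row):
    call_log.append(row.name)  # row.name is the index label of the current row
    return row["a"] + row["b"]

via_apply = df.apply(add_row, axis=1)

# The same computation written as an explicit Python loop over the rows.
via_loop = pd.Series(
    [row["a"] + row["b"] for _, row in df.iterrows()],
    index=df.index,
)

print("function calls:", len(call_log), "for", len(df), "rows")
print("same result as the loop:", via_apply.equals(via_loop))
```

The function is invoked separately for each row, which is exactly the per-row overhead a vectorized operation avoids.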

What are the solutions?

One solution is to eliminate the apply() method by using a vectorized approach instead.
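As a rough sketch of what that can look like (the columns a and b and the "big"/"small" labels are hypothetical), a row-wise sum becomes plain column arithmetic, and simple if/else logic can often be moved into np.where:

```python
import numpy as np
import pandas as pd

# Toy data; column names are assumptions for this sketch.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Row-wise apply(): the lambda runs once per row.
slow_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized: a single operation over whole columns.
fast_sum = df["a"] + df["b"]

# Conditional logic is a common reason to reach for apply() ...
slow_label = df.apply(
    lambda row: "big" if row["a"] * row["b"] > 50 else "small", axis=1
)

# ... but np.where evaluates the same condition column-wise.
fast_label = pd.Series(
    np.where(df["a"] * df["b"] > 50, "big", "small"), index=df.index
)

print(fast_sum.tolist())
print(fast_label.tolist())
```

Both versions produce the same values; the vectorized ones simply push the loop down into compiled code.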

But I understand that, at times, coming up with a vectorized approach is difficult.

Another solution that I find handy is to parallelize the apply() method by using third-party optimized libraries instead.

The image below compares the run-time of Pandas apply() with four alternatives that support parallelization:

It is evident that Pandas’ apply() is not the optimal way to apply a function. In fact, it’s the slowest of all five.

There are a couple of reasons for this:

Pandas runs apply() on a single CPU core. Therefore, it gets no benefit from parallelization.
Pandas’ apply() method is not vectorized. Therefore, it gets no benefit from vectorization either.

To be fair, the four external libraries shown in the visual above are not vectorized either, but they do leverage parallelization.

That is how we get to see a massive run-time improvement when we use them.
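The general idea can be sketched with nothing but the standard library. This is not how any of the four libraries is implemented; it is only an illustration of the principle: split the DataFrame into chunks, run an ordinary apply() on each chunk in a separate process, and stitch the pieces back together. The helper names split_frame, apply_chunk, and parallel_apply are made up for this sketch.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def add_row(row):
    return row["a"] + row["b"]

def apply_chunk(chunk):
    # Runs inside a worker process: still a plain, non-vectorized apply().
    return chunk.apply(add_row, axis=1)

def split_frame(df, n_chunks):
    # Cut the frame into n_chunks contiguous slices of near-equal size.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

def parallel_apply(df, n_workers=4):
    # One chunk per worker; pool.map preserves chunk order,
    # so the concatenated result lines up with the original index.
    chunks = split_frame(df, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(apply_chunk, chunks))
    return pd.concat(parts)
```

When you try it, call parallel_apply(df) from under an if __name__ == "__main__": guard, since worker processes may re-import the script. Keep in mind that starting processes and shipping chunks between them has a cost, so this only pays off when the DataFrame is large enough or the function is expensive enough.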

Please note that even though Mapply is the fastest in this comparison, that does not mean it will always be the fastest. Consider benchmarking on your own dataset first.
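A benchmark does not need to be elaborate. Here is a minimal sketch, assuming a toy DataFrame with columns a and b and a hypothetical best_time helper; swap in your own data and function, and add one entry to candidates for each library you want to compare, following that project's README.

```python
import time

import pandas as pd

def add_row(row):
    return row["a"] + row["b"]

def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` runs of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Toy data; replace with a representative sample of your own dataset.
df = pd.DataFrame({"a": range(20_000), "b": range(20_000)})

# Each candidate is a zero-argument callable that performs the full computation.
candidates = {
    "apply": lambda: df.apply(add_row, axis=1),
    "vectorized": lambda: df["a"] + df["b"],
}

results = {name: best_time(fn) for name, fn in candidates.items()}

# Print the candidates from fastest to slowest.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} s")
```

Taking the minimum over a few repeats reduces noise from other processes on the machine. The absolute numbers will differ from one machine and dataset to another, which is the whole reason to measure.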

Moreover, I know that the add_row() function I demonstrated in the image above can be easily vectorized. I picked this particular example just for the sake of simplicity.

As a parting note, remember that your first attempt should ALWAYS be to write vectorized operations.

Consider these third-party libraries only when you see no scope to write vectorized code, and you see no other option but to use apply().

Get started with these libraries here:

Swifter: https://github.com/jmcarpenter2/swifter
Pandarallel: https://github.com/nalepae/pandarallel
Parallel Pandas: https://pypi.org/project/parallel-pandas/
Mapply: https://pypi.org/project/mapply/

👉 Over to you: What other techniques do you commonly use to optimize Pandas’ operations?

Thanks for reading!
