Avoid This Costly Mistake When Indexing A DataFrame
Row-then-column is not the same as Column-then-row.
When indexing a dataframe, choosing whether to select a column first or slice a row first is pretty important from a run-time perspective.
In my benchmark, selecting the column first is over 15 times faster than slicing the row first. Why?
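Here is a minimal sketch of how such a comparison could be timed. The DataFrame shape, the column names (`col_0` to `col_19`), and the two helper functions are assumptions made for illustration, not the exact benchmark behind the number above, and the measured ratio will vary with data size, dtypes, and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: the shape and column names are illustrative assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((100_000, 20)),
    columns=[f"col_{i}" for i in range(20)],
)


def row_then_column():
    # Builds a Series for the whole first row, then picks one value from it.
    return df.iloc[0]["col_5"]


def column_then_row():
    # Selects one column (already a Series), then picks one value from it.
    return df["col_5"].iloc[0]


n_calls = 1_000
row_first_time = timeit.timeit(row_then_column, number=n_calls)
col_first_time = timeit.timeit(column_then_row, number=n_calls)

print(f"row then column: {row_first_time:.4f} s for {n_calls} calls")
print(f"column then row: {col_first_time:.4f} s for {n_calls} calls")
```

Both functions return the same value; only the order of the two indexing steps differs.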
As I have discussed before, a Pandas DataFrame is a column-major data structure. Thus, consecutive elements in a column are stored next to each other in memory.
As processors are efficient with contiguous blocks of memory, accessing a column is much faster than accessing a row (read more about this in one of my previous posts here).
But when you slice a row first, each row is retrieved by accessing non-contiguous blocks of memory, thereby making it slow.
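The memory-layout argument can be illustrated with a plain NumPy array stored in column-major (Fortran) order. This is only an analogy: it does not inspect Pandas internals, whose actual storage depends on dtypes and on how the DataFrame was built. The array `a` is a made-up example.

```python
import numpy as np

# A small 3x4 array stored in column-major (Fortran) order.
a = np.asfortranarray(np.arange(12, dtype=np.int64).reshape(3, 4))

column = a[:, 0]  # all rows of the first column
row = a[0, :]     # all columns of the first row

# Strides give the number of bytes to step between consecutive elements.
print(column.strides)  # (8,)  -> neighbours sit side by side in memory
print(row.strides)     # (24,) -> each step jumps over a whole column

print(column.flags["C_CONTIGUOUS"])  # True
print(row.flags["C_CONTIGUOUS"])     # False
```

In this layout, walking down a column touches adjacent memory, while walking along a row jumps by the length of a column at every step.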
Also, once all the elements of a row are gathered, Pandas converts them to a Series, which is another overhead.
We can verify this conversion below:
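A minimal sketch of that check, using a small made-up DataFrame (the column names `A`, `B`, `C` and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with three columns of different dtypes.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [4.0, 5.0, 6.0],
    "C": ["x", "y", "z"],
})

row = df.iloc[0]   # row first: values from every column are gathered
column = df["A"]   # column first: the column is returned directly

print(type(row))     # <class 'pandas.core.series.Series'>
print(type(column))  # <class 'pandas.core.series.Series'>

# The row mixes int, float, and string values, so the new Series
# falls back to the generic object dtype.
print(row.dtype)
# The column keeps its original dtype.
print(column.dtype)
```

Both results are Series, but the row Series had to be assembled from values of differently typed columns, whereas the column Series keeps the dtype it already had.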
Instead, when you select a column first, elements are retrieved by accessing contiguous blocks of memory, which is way faster. Also, a column is inherently a Pandas Series. Thus, there is no conversion overhead involved like above.
Overall, by accessing the column first, we avoid the non-contiguous memory access that happens when we access the row first.
This makes selecting the column first faster than slicing a row first in indexing operations.
If you are confused about what selecting, indexing, slicing, and filtering mean, here’s what you should read next:
Find the code for my tips here: GitHub.