The Biggest Limitation of Pearson Correlation Which Many Overlook
...And what to use instead, along with some genuine advice on summary statistics.
Pearson correlation is commonly used to determine the association between two continuous variables.
Many frameworks (in Pandas, for instance) have it as their default correlation metric.
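For instance, pandas' DataFrame.corr() computes Pearson correlation unless you explicitly request another method. A quick sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 3, 5, 8, 12]})

# Pearson is the default method in pandas
print(df.corr())                    # same as df.corr(method="pearson")

# Spearman (or Kendall) must be requested explicitly
print(df.corr(method="spearman"))
```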
Yet, unknown to many, Pearson correlation:
Only measures the strength of a linear relationship.
Underestimates an association that is non-linear yet monotonic.
Spearman correlation is a better alternative. It assesses monotonicity, which can be linear as well as non-linear.
This is evident from the illustration below:
Pearson and Spearman correlations are the same on linear data.
But Pearson correlation underestimates a non-linear association.
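Here's a minimal sketch of this effect with scipy, using an exponential curve as the (assumed) non-linear but monotonic example:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(1, 10, 100)

# Linear relationship: both metrics agree
y_linear = 2 * x + 1
print(pearsonr(x, y_linear)[0])    # 1.0
print(spearmanr(x, y_linear)[0])   # 1.0

# Monotonic but non-linear: Pearson drops, Spearman stays at 1
y_exp = np.exp(x)
print(pearsonr(x, y_exp)[0])       # noticeably below 1
print(spearmanr(x, y_exp)[0])      # 1.0
```

Because Spearman works on ranks, any strictly increasing transformation of y leaves it unchanged.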
Spearman correlation is also useful when data is ranked or ordinal. If you want to learn more about this, we covered it in this issue: The Limitation of Pearson Correlation While Using It With Ordinal Categorical Data.
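Since Spearman only needs ranks, it applies directly to ordinal encodings. A small sketch with hypothetical survey data (the encoding, where higher means more satisfied, is an assumption):

```python
import pandas as pd

# Hypothetical ordinal ratings: 1 = very unsatisfied ... 5 = very satisfied
satisfaction = pd.Series([1, 2, 2, 3, 4, 5])
monthly_spend = pd.Series([10, 25, 20, 40, 55, 90])

# Spearman compares ranks, so the ordinal scale is handled naturally
print(satisfaction.corr(monthly_spend, method="spearman"))
```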
Also, before I end, remember to always be cautious when drawing conclusions from summary statistics.
While analyzing data, many people are tempted to draw conclusions based solely on its statistics. Yet the actual data might be telling a totally different story.
This is also evident from the image below:
All nine datasets have approximately zero correlation between the two variables. However, the summary statistic (Pearson correlation, in this case) gives no clue about what's inside the data because it is always zero.
In fact, this is not just about Pearson correlation but applies to all summary statistics. The idea is that whenever you generate any summary statistic, you lose essential information.
Thus, the importance of looking at the data cannot be stressed enough. It saves us from the wrong conclusions we might otherwise draw by looking at the statistics alone.
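To make this concrete, here is a small sketch of two very different datasets that share the same near-zero Pearson correlation (the symmetric parabola is an assumed example):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)

# Dataset 1: pure noise, no relationship at all
y_noise = rng.normal(size=x.size)

# Dataset 2: a perfect, but symmetric, non-linear relationship
y_parabola = x ** 2

print(pearsonr(x, y_noise)[0])     # ~0
print(pearsonr(x, y_parabola)[0])  # ~0, despite the obvious structure

# The statistic alone can't tell these apart; plotting can.
```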
I have written more posts on this topic in the past, and I would highly encourage you to read them next.
👉 Over to you: What are some other alternatives that address Pearson’s limitations?
👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.
The button is located towards the bottom of this email.
Thanks for reading!
Latest full articles
If you’re not a full subscriber, here’s what you missed last month:
DBSCAN++: The Faster and Scalable Alternative to DBSCAN Clustering
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
You Cannot Build Large Data Projects Until You Learn Data Version Control!
Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.
Deploy, Version Control, and Manage ML Models Right From Your Jupyter Notebook with Modelbit
Gaussian Mixture Models (GMMs): The Flexible Twin of KMeans.
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
👉 If you love reading this newsletter, feel free to share it with friends!