Even Two Outliers Can Distort Your Data Analysis

...And here's how to avoid drawing misleading conclusions.

In partnership with Sourcery

~92% of developers do code reviews manually. Moreover, only 8% of developers are fully confident in doing code reviews.

To make things even worse, ~80% of programmers wait more than one day for a review on their pull request.

Code review is a tedious job, and most developers still do it the same way they did 5 years ago.

Sourcery is making manual code reviews obsolete with AI.

With Sourcery, every pull request instantly gets a human-like expert review with general feedback, in-line comments, and relevant suggestions.

What used to take over a day now takes only a few seconds.

Sourcery handles 600,000+ requests every month (reviews, refactoring, writing tests, documenting code, etc.), and I have been using it for over 1.5 years now.

Thanks to Sourcery for partnering with me today!

Let’s get to today’s post now.

How outliers distort correlation and regression fit

Many data scientists rely solely on the correlation matrix to study the association between variables.

What often goes unnoticed is that the correlation coefficient can be driven heavily by a handful of outliers.

This is evident from the image below.

Adding just two outliers drastically changed the correlation coefficient and the regression fit.
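To make this concrete, here is a minimal sketch in Python with NumPy. The data is synthetic, and the two outlier coordinates and the helper name `correlation_and_slope` are assumptions chosen for illustration; they are not the data behind the image.

```python
import numpy as np


def correlation_and_slope(x, y):
    """Return Pearson's r and the least-squares slope of y regressed on x."""
    r = np.corrcoef(x, y)[0, 1]
    slope, _intercept = np.polyfit(x, y, deg=1)
    return r, slope


# Synthetic data (an assumption): a clean linear trend plus mild noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

# Two hypothetical outliers: small x, very large y.
x_out = np.append(x, [0.5, 1.0])
y_out = np.append(y, [60.0, 65.0])

r_clean, slope_clean = correlation_and_slope(x, y)
r_out, slope_out = correlation_and_slope(x_out, y_out)

print(f"without outliers:  r = {r_clean:.2f}, slope = {slope_clean:.2f}")
print(f"with two outliers: r = {r_out:.2f}, slope = {slope_out:.2f}")
```

On data like this, two extra points out of 52 are enough to pull the correlation far below its original value and flatten the fitted line, even though the other 50 points are unchanged.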

This is why plotting the data is so important.

A plot can save you from the wrong conclusions you might draw by looking at summary statistics alone.

One thing I often do when using a correlation matrix is create a pair plot as well (shown below).

This lets me check whether the scatter plot of two variables and their correlation coefficient agree with each other.

We covered 8 more pitfalls and cautionary measures in this article: 8 Fatal (Yet Non-obvious) Pitfalls and Cautionary Measures in Data Science.

👉 Over to you: What is the difference between a regression coefficient and a correlation coefficient?

Are you overwhelmed with the amount of information in ML/DS?

Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.


SPONSOR US

Get your product in front of 79,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email.
