11 Ways to Determine Data Normality

A guide to plotting, statistical, and distance-based methods.

Many ML models assume (or work better with) normally distributed data.

For instance:

  • Linear regression assumes residuals are normally distributed.

  • At times, transforming the data to a normal distribution can be beneficial.

  • Linear discriminant analysis (LDA) is derived under the assumption that the data within each class is normally distributed.

Thus, knowing how to test for normality is crucial for data scientists.

The visual below depicts the 11 essential ways to test normality.

Let’s understand these today.

#1) Plotting Methods (self-explanatory)

  • Histogram

  • QQ Plot (We covered it recently here: QQ Plot)

  • KDE Plot

  • Violin Plot
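
Here's a minimal sketch that produces all four plots on one figure (the sample here is a hypothetical stand-in; replace data with your own 1-D array):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# hypothetical sample; replace `data` with your own 1-D array
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: the shape should resemble a bell curve
axes[0, 0].hist(data, bins=30)
axes[0, 0].set_title("Histogram")

# QQ plot: points should fall close to the reference line
stats.probplot(data, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title("QQ Plot")

# KDE plot: a smooth estimate of the density
sns.kdeplot(x=data, ax=axes[1, 0])
axes[1, 0].set_title("KDE Plot")

# Violin plot: a box plot combined with a KDE
sns.violinplot(x=data, ax=axes[1, 1])
axes[1, 1].set_title("Violin Plot")

plt.tight_layout()
plt.show()
```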

While plotting is often a good first check, it is subjective and prone to misinterpretation.

Thus, we must know reliable quantitative measures as well.

#2) Statistical Methods:

  • Shapiro-Wilk test:

    • Computes a statistic from the correlation between the observed data and the values expected under a normal distribution.

    • The p-value indicates the likelihood of observing such a correlation if the data were normally distributed.

    • A high p-value suggests the data is consistent with a normal distribution.

    • Get started: Scipy Docs (a combined code sketch for all four tests follows this list).

  • KS test:

    • The test statistic is the maximum difference between the cumulative distribution function (CDF) of the observed data and that of a normal distribution.

    • A high p-value suggests the data is consistent with a normal distribution.

    • Get started: Scipy Docs.

  • Anderson-Darling test:

    • Measures the differences between the observed data and the expected values under a normal distribution.

    • Emphasizes the differences in the tails of the distribution.

    • This makes it particularly effective at detecting deviations in the extreme values.

    • Get started: Scipy Docs.

  • Lilliefors test:

    • It is a modification of the KS test.

    • The KS test is appropriate in situations where the parameters of the reference distribution are known.

    • If the parameters must be estimated from the data (as is usually the case), the Lilliefors test is recommended.

    • Get started: Statsmodels Docs.
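
To make the four tests concrete, here is a minimal sketch that runs all of them on one sample, using the SciPy and Statsmodels functions linked above (the data here is a hypothetical stand-in for your own array). At the usual 5% level, a p-value above 0.05 means we cannot reject normality:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# hypothetical sample; replace with your own data
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=500)

# Shapiro-Wilk
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk:     statistic={stat:.4f}, p-value={p:.4f}")

# KS test against a fully specified N(0, 1); we standardize the sample
# for illustration. Estimating the parameters from the data like this
# biases the p-value, which is exactly what Lilliefors corrects for.
stat, p = stats.kstest((data - data.mean()) / data.std(), "norm")
print(f"KS:               statistic={stat:.4f}, p-value={p:.4f}")

# Anderson-Darling: no p-value; compare the statistic to critical values
result = stats.anderson(data, dist="norm")
print(f"Anderson-Darling: statistic={result.statistic:.4f}")
print(f"  critical values {result.critical_values} "
      f"at levels {result.significance_level}%")

# Lilliefors: KS variant for parameters estimated from the data
stat, p = lilliefors(data, dist="norm")
print(f"Lilliefors:       statistic={stat:.4f}, p-value={p:.4f}")
```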

#3) Distance Measures

Distance measures are another reliable and more intuitive way to test normality.

But they can be a bit tricky to use.

See, the problem is that a single distance value needs more context for interpretability.

For instance, if the distance between two distributions is 5, is this large or small?

We need more context.

I prefer using these measures as follows:

  • Find the distance between the observed distribution and multiple reference distributions.

  • Select the reference distribution with the minimum distance to the observed distribution.

Here are a few common and useful distance measures (a code sketch combining all three follows the list):

  • Bhattacharyya distance:

    • Measures the overlap between two distributions.

    • This “overlap” is often interpreted as closeness between two distributions.

    • Choose the distribution that has the least Bhattacharyya distance to the observed distribution.

    • We covered it in detail here: Bhattacharyya Distance.

  • Hellinger distance:

    • It is used in much the same way as the Bhattacharyya distance.

    • The difference is that the Bhattacharyya distance does not satisfy the triangle inequality, while the Hellinger distance does.

  • KL Divergence:

    • It is not a true distance metric (it is not symmetric), but it can be used in this case.

    • Measures the information lost when one distribution is used to approximate another.

    • The more information is lost, the higher the KL divergence.

    • Choose the distribution that has the least KL divergence from the observed distribution.

    • KL divergence is used as a loss function in the t-SNE algorithm. If you want to learn how, we discussed it here: t-SNE article.
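
To make the min-distance strategy concrete, here is a minimal sketch that discretizes the observed data into a histogram and compares it against a few fitted reference candidates (the candidate set of normal, Laplace, and uniform is purely illustrative):

```python
import numpy as np
from scipy import stats

# hypothetical observed sample; replace with your own data
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=2000)

# discretize the observed data into bin probabilities
density, edges = np.histogram(data, bins=50, density=True)
width = edges[1] - edges[0]
centers = (edges[:-1] + edges[1:]) / 2
p = density * width  # sums to 1

# candidate reference distributions, fitted to the sample
candidates = {
    "normal":  stats.norm(data.mean(), data.std()),
    "laplace": stats.laplace(np.median(data),
                             np.abs(data - np.median(data)).mean()),
    "uniform": stats.uniform(data.min(), data.max() - data.min()),
}

eps = 1e-12  # guards against log(0) and division by zero

for name, dist in candidates.items():
    q = dist.pdf(centers) * width
    q = q / q.sum()  # normalize to proper bin probabilities

    bc = np.sum(np.sqrt(p * q))              # Bhattacharyya coefficient (overlap)
    bhattacharyya = -np.log(bc + eps)        # Bhattacharyya distance
    hellinger = np.sqrt(max(0.0, 1.0 - bc))  # Hellinger distance
    kl = np.sum(p * np.log((p + eps) / (q + eps)))  # KL(observed || candidate)

    print(f"{name:8s} Bhattacharyya={bhattacharyya:.4f}  "
          f"Hellinger={hellinger:.4f}  KL={kl:.4f}")
```

Since the sample here is drawn from a normal distribution, the normal candidate should come out with the smallest values under all three measures.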

👉 Over to you: What other common methods have I missed?

Thanks for reading!

Extended piece #1

Despite rigorously testing an ML model locally (on validation and test sets), it could be a terrible idea to instantly replace the previous model with the new model.

A more reliable strategy is to test the model in production (yes, on real-world incoming data).

While this might sound risky, ML teams do it all the time, and it isn’t that complicated.

Extended piece #2

Businesses have more data than ever before.

Traditional single-node model training just doesn’t work because one cannot wait months to train a model.

Distributed (or multi-GPU) training is one of the most essential ways to address this.

In fact, if you look at job descriptions for Applied ML or ML engineer roles on LinkedIn, most of them demand skills like the ability to train models on large datasets.

We covered some core technicalities behind multi-GPU training, how it works under the hood, and implementation details here: A Beginner-friendly Guide to Multi-GPU Model Training.

Are you preparing for ML/DS interviews or want to upskill at your current job?

Every week, I publish in-depth ML dives. The topics align with the practical skills that typical ML/DS roles demand.

Join below to unlock all full articles:

👉 If you love reading this newsletter, share it with friends!

👉 Tell the world what makes this newsletter special for you by leaving a review here :)
