- Daily Dose of Data Science
- Posts
- KMeans vs. Gaussian Mixture Models
KMeans vs. Gaussian Mixture Models
Addressing the major limitation of KMeans.
I like to think of Gaussian Mixture Models as a more generalized version of KMeans.
While we have covered all the necessary conceptual and practical details here: Gaussian Mixture Models…
…let me tell you some of the widely known limitations of KMeans that you might not be aware of.
Limitations of KMeans
To begin:
It can only produce globular clusters. For instance, as shown below, even if the data has non-circular clusters, it still produces round clusters.
It performs a hard assignment. There are no probabilistic estimates of each data point belonging to each cluster.
It only relies on distance-based measures to assign data points to clusters.
To understand better, consider two clusters in 2D — A and B. Cluster A has a higher spread than B.
Now consider a line that is mid-way between centroids of A and B.
Although A has a higher spread, even if a point is slightly right to the midline, it will get assigned to cluster B.
Ideally, however, cluster A should have had a larger area of influence.
Gaussian Mixture Models
These limitations often make KMeans a non-ideal choice for clustering.
Gaussian Mixture Models are often a superior algorithm in this respect.
As the name suggests, they can cluster a dataset that has a mixture of many Gaussian distributions.
They can be thought of as a more flexible twin of KMeans.
The primary difference is that:
KMeans learns centroids.
Gaussian mixture models learn a distribution.
For instance, in 2 dimensions:
KMeans can only create circular clusters
GMM can create oval-shaped clusters.
This is illustrated in the animation below:
The effectiveness of GMMs over KMeans is evident from the image below.
KMeans just relies on distance and ignores the distribution of each cluster.
GMM learns the distribution and produces better clustering.
If you want to get into more details, I covered them in-depth here: Gaussian Mixture Models.
It covers:
What is the motivation and intuition behind GMMs?
The end-to-end mathematical formulation of GMMs.
How to use Expectation-Maximization to model data using GMMs?
Coding a GMM from scratch (no sklearn).
Comparing results of GMMs with KMeans.
How to determine the optimal number of clusters for GMMs?
Some practical use cases of GMMs.
Takeaways.
👉 Over to you: What are some other shortcomings of KMeans?
Thanks for reading!
Whenever you are ready, here’s one more way I can help you:
Every week, I publish 1-2 in-depth deep dives (typically 20+ mins long). Here are some of the latest ones that you will surely like:
[FREE] A Beginner-friendly and Comprehensive Deep Dive on Vector Databases.
A Detailed and Beginner-Friendly Introduction to PyTorch Lightning: The Supercharged PyTorch
You Are Probably Building Inconsistent Classification Models Without Even Realizing
Why Sklearn’s Logistic Regression Has no Learning Rate Hyperparameter?
PyTorch Models Are Not Deployment-Friendly! Supercharge Them With TorchScript.
Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
You Cannot Build Large Data Projects Until You Learn Data Version Control!
To receive all full articles and support the Daily Dose of Data Science, consider subscribing:
👉 If you love reading this newsletter, feel free to share it with friends!
👉 Tell the world what makes this newsletter special for you by leaving a review here :)
Reply