KMeans vs. Gaussian Mixture Models

Addressing the major limitation of KMeans.

I like to think of Gaussian Mixture Models as a generalization of KMeans.

While we have covered all the necessary conceptual and practical details here: Gaussian Mixture Models

…let me walk you through some limitations of KMeans that you might not be aware of.

Limitations of KMeans

To begin:

  • It can only produce globular (round) clusters. For instance, as shown below, even when the data forms non-circular clusters, KMeans still partitions it into round ones.

  • It performs a hard assignment. There are no probabilistic estimates of each data point belonging to each cluster.

  • It only relies on distance-based measures to assign data points to clusters.

    • To understand better, consider two clusters in 2D, A and B, where cluster A has a higher spread than B.

    • Now consider the line that lies mid-way between the centroids of A and B.

    • Although A has a higher spread, a point even slightly to the right of this midline will get assigned to cluster B.

    • Ideally, however, cluster A should have had a larger area of influence. The sketch below makes this concrete.
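Here is a minimal sketch of that exact scenario (the synthetic data and the test point are my own assumptions, purely illustrative): KMeans assigns the point to the tighter cluster B based on distance alone, while a GMM's soft assignment accounts for cluster A's larger spread.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical 2D data: cluster A is wide, cluster B is tight.
rng = np.random.default_rng(0)
A = rng.normal(loc=[0.0, 0.0], scale=3.0, size=(500, 2))
B = rng.normal(loc=[10.0, 0.0], scale=0.5, size=(500, 2))
X = np.vstack([A, B])

# A point slightly to the right of the midline between the two centers.
point = np.array([[5.5, 0.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(kmeans.predict(point))     # hard label, decided by distance alone
print(gmm.predict_proba(point))  # probabilities reflecting each cluster's spread
```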

Gaussian Mixture Models

These limitations often make KMeans a non-ideal choice for clustering.

Gaussian Mixture Models are often a superior algorithm in these respects.

As the name suggests, they model the dataset as a mixture of several Gaussian distributions, one per cluster.
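Concretely, this is the standard mixture density (a textbook formulation, stated here for completeness): each of the K clusters contributes one Gaussian, weighted by a mixing proportion:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$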

They can be thought of as a more flexible twin of KMeans.

The primary difference is that:

  • KMeans learns centroids.

  • Gaussian Mixture Models learn a distribution for each cluster: a mean, a covariance, and a mixing weight (as the sketch below shows).
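To see what each model actually learns, here is a short sketch (with synthetic data of my own, purely illustrative): after fitting, KMeans exposes only centroids, while the GMM exposes a full distribution per component.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Two hypothetical, well-separated 2D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(300, 2)),
               rng.normal([6.0, 6.0], 1.0, size=(300, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # 2 centroids: that is all KMeans knows
print(gmm.means_)               # per-component mean...
print(gmm.covariances_)         # ...plus a full 2x2 covariance matrix each...
print(gmm.weights_)             # ...and a mixing weight (they sum to 1)
```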

For instance, in 2 dimensions:

  • KMeans can only create circular clusters.

  • GMM can create oval-shaped clusters.

This is illustrated in the animation below:

The effectiveness of GMMs over KMeans is evident from the image below.

  • KMeans just relies on distance and ignores the distribution of each cluster.

  • GMM learns the distribution and produces better clustering.
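You can reproduce this behavior yourself. Below is a hypothetical setup (using sklearn's make_blobs, not the data from the image above): stretching round blobs with a linear transform makes the clusters oval, and the GMM typically recovers the true labels far more accurately than KMeans.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Generate round blobs, then stretch them so the clusters become oval.
X, y = make_blobs(n_samples=600, centers=3, random_state=170)
X = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Agreement with the true labels (1.0 = perfect recovery).
print("KMeans ARI:", adjusted_rand_score(y, km_labels))
print("GMM ARI:   ", adjusted_rand_score(y, gmm_labels))
```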

If you want to get into more detail, I covered GMMs in depth here: Gaussian Mixture Models.

It covers:

  • What is the motivation and intuition behind GMMs?

  • The end-to-end mathematical formulation of GMMs.

  • How to use Expectation-Maximization to model data using GMMs?

  • Coding a GMM from scratch (no sklearn).

  • Comparing results of GMMs with KMeans.

  • How to determine the optimal number of clusters for GMMs?

  • Some practical use cases of GMMs.

  • Takeaways.

👉 Over to you: What are some other shortcomings of KMeans?

Thanks for reading!

Whenever you are ready, here’s one more way I can help you:

Every week, I publish 1-2 in-depth deep dives (typically 20+ mins long). Here are some of the latest ones that you will surely like:

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

👉 If you love reading this newsletter, feel free to share it with friends!

👉 Tell the world what makes this newsletter special for you by leaving a review here :)
