t-SNE vs. SNE — What's the difference?

Addressing the crowding problem.

The t-SNE algorithm is an improved version of the SNE algorithm, both used for dimensionality reduction.

While I assume you know the ins and outs of the SNE and t-SNE algorithms (if you don’t, you can read this detailed article I published, where we also implemented it from scratch)…

…the core idea of SNE (not t-SNE) is the following:

  • Step 1) For every point (i) in the given high-dimensional data, convert the high-dimensional Euclidean distances to all other points (j) into conditional Gaussian probabilities.

    • For instance, consider the marked red point in the dataset on the left below.

    • Converting the Euclidean distances to all other points into Gaussian probabilities (the distribution on the right above) shows that the other red points have a higher probability of being its neighbors than the remaining points do.

  • Step 2) For every data point xᵢ, randomly initialize its counterpart yᵢ in 2-dimensional space. These will be our projections.

  • Step 3) Just like we defined conditional probabilities in the high-dimensional space in Step 1, we define conditional probabilities in the low-dimensional space, again using a Gaussian distribution.

  • Step 4) Now, every data point (i) has a high-dimensional probability distribution and a corresponding low-dimensional distribution:

    • The objective is to match these two probability distributions.

    • Thus, we can make the positions of counterpart yᵢ’s learnable such that this difference is minimized.

    • Using KL divergence as a loss function helps us achieve this. It measures how much information is lost when we use the low-dimensional distribution Q to approximate the high-dimensional distribution P.

    • Ideally, we want the loss to reach its minimum value of zero, which happens when P = Q.

The model can be trained using gradient descent, and it works pretty well.
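
To make these steps concrete, here is a minimal NumPy sketch of plain SNE. It is a simplification rather than the full algorithm: every point gets the same fixed Gaussian bandwidth instead of a perplexity-tuned σᵢ, tricks like momentum and early exaggeration are skipped, and the constant factor of 2 in the gradient is folded into the learning rate.

```python
import numpy as np

def conditional_probs(Z, sigma):
    """Row-normalized Gaussian affinities: p_{j|i} for every point i."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(affinities, 0.0)  # a point is never its own neighbor
    return affinities / affinities.sum(axis=1, keepdims=True)

def sne(X, n_iter=500, lr=10.0, seed=0):
    """Plain SNE: Gaussian similarities in both spaces, KL loss, gradient descent."""
    rng = np.random.default_rng(seed)
    P = conditional_probs(X, sigma=1.0)               # Step 1: high-dimensional P
    Y = rng.normal(scale=1e-2, size=(len(X), 2))      # Step 2: random 2-D counterparts

    for _ in range(n_iter):
        Q = conditional_probs(Y, sigma=np.sqrt(0.5))  # Step 3: low-dimensional Q
        # Step 4: gradient of sum_i KL(P_i || Q_i) with respect to each y_i
        PQ = (P - Q) + (P - Q).T
        grad = np.array([(PQ[i][:, None] * (Y[i] - Y)).sum(axis=0) for i in range(len(X))])
        Y -= lr * grad
    return Y
```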

For instance, the following image depicts a 2-dimensional visualization produced by the SNE algorithm on 256-dimensional handwritten digits:

SNE produces good clusters.

What’s even more astonishing is that properties like orientation, skew, and stroke thickness vary smoothly across the space within each cluster. This is depicted below:

Nonetheless, it has some limitations, which the t-SNE algorithm addresses.

Notice that the clusters produced by SNE are not well separated.

Here, it is fair to assume that the original data clusters, the ones in the 256-dimensional space, were most likely well separated. Thus:

  • All zeros must have been together but well separated from other digits.

  • All ones must have been together but well separated from other digits.

  • And so on.

Yet, in the low-dimensional map, SNE still packs the clusters tightly together. This is called the “crowding problem.”

To address this problem, t-SNE, which stands for t-distributed Stochastic Neighbor Embedding, was proposed.

Here’s the difference.

Recall that in SNE, we used a Gaussian distribution to define the low-dimensional conditional probabilities. But this does not produce well-separated clusters.

One solution is to use a different probability distribution in the low-dimensional space, one that, for distant points, produces the same value of the conditional probability a Gaussian would, but at a larger Euclidean distance.

Let me simplify that a bit.

Compare the following two distributions:

Notice that the Gaussian falls to a given low probability density at a smaller distance, while the t-distribution reaches the same density only at a larger distance.
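
To put rough numbers on this, take the unnormalized kernels exp(−d²/2) for the Gaussian and 1/(1 + d²) for the t-distribution as stand-ins for the two curves: both fall to a density value of 0.1, but the Gaussian reaches it at d ≈ 2.15, while the t-distribution only reaches it at d = 3.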

This is precisely what we intend to achieve.

We need a heavier-tailed distribution so that we can still minimize the difference between the two probability distributions, but at a larger distance in the low-dimensional space.

The Student t-distribution is a perfect fit for it.
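
In code, the switch amounts to replacing the Gaussian kernel in the low-dimensional similarities of the sketch above with a Student t kernel with one degree of freedom (t-SNE also moves from conditional to joint probabilities, a detail glossed over here). A minimal sketch:

```python
import numpy as np

def tsne_low_dim_probs(Y):
    """Low-dimensional joint probabilities q_{ij} under a Student t with 1 d.o.f."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)  # heavy-tailed kernel: (1 + d^2)^{-1} instead of exp(-d^2)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()        # normalized over all pairs, giving one joint distribution Q
```

Because (1 + d²)⁻¹ decays far more slowly than exp(−d²), moderately dissimilar points can sit farther apart in the 2-dimensional map while still matching the high-dimensional probabilities, and that extra room is what pulls the clusters apart.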

The following image depicts the difference this change brings:

As shown above:

  • SNE produces closely packed clusters.

  • t-SNE produces well-separated clusters.

And that’s why t-distribution is used in t-SNE.

That said, besides producing well-separated clusters, using the Student t-distribution has many more advantages.

For instance, it is computationally much faster to evaluate the density of a point under a Student t-distribution than under a Gaussian.
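
The reason is that, with one degree of freedom, the unnormalized Student t density is simply 1/(1 + d²), an addition and a division, whereas the Gaussian kernel needs an exponential for every pairwise distance:

```python
import numpy as np

d2 = np.random.rand(1_000_000)     # a batch of squared pairwise distances
gaussian  = np.exp(-d2)            # Gaussian kernel: one exp per distance
student_t = 1.0 / (1.0 + d2)       # t kernel (1 d.o.f.): add and divide only
```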

We went into much more detail in a beginner-friendly manner in the full article: Formulating and Implementing the t-SNE Algorithm From Scratch.

Similar to this, we also formulated the PCA algorithm from scratch here: Formulating the Principal Component Analysis (PCA) Algorithm From Scratch.

👉 Over to you: Can you identify a bottleneck in the t-SNE algorithm?

Are you overwhelmed with the amount of information in ML/DS?

Every week, I publish no-fluff deep dives on topics that truly matter to your skills for ML/DS roles.

Join below to unlock all full articles:

SPONSOR US

Get your product in front of 77,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email.
