One-Hot Encoding Introduces a Serious Problem in the Dataset

Hint: This is NOT about sparse data representation.

One-hot encoding is probably one of the most common ways to encode categorical variables.

However, unknown to many, when we one-hot encode categorical data, we introduce perfect multicollinearity into the dataset.

Multicollinearity arises when two (or more) features are highly correlated, or when a group of features can collectively predict another feature.

In our case, the sum of the one-hot encoded features is always exactly 1 for every row. This creates perfect multicollinearity, which can be problematic for models that don’t perform well under such conditions.
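To see this concretely, here’s a minimal sketch using pandas (the city column is made up purely for illustration):

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"city": ["Paris", "London", "Tokyo", "Paris"]})

# Standard one-hot encoding: one binary column per category
encoded = pd.get_dummies(df["city"], dtype=int)
print(encoded)

# Every row sums to exactly 1, so any one column is a perfect
# linear function of the remaining columns
print(encoded.sum(axis=1).tolist())  # [1, 1, 1, 1]
```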

This problem is often called the Dummy Variable Trap.

Take linear regression, for instance. Multicollinearity is bad there because:

  • Our data effectively carries a redundant feature: any one encoded column is fully determined by the rest.

  • Regression coefficient estimates become unstable and unreliable in the presence of multicollinearity, as the quick sketch after this list shows.
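Here’s a small numpy sketch of that breakdown (with a made-up 4-sample, 3-category design): the intercept column equals the sum of the one-hot columns, so the design matrix is rank-deficient and the usual least-squares coefficients are not uniquely determined.

```python
import numpy as np

# Hypothetical one-hot encoded design: 4 samples, 3 categories
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])

# Prepend the intercept column, as linear regression does
X = np.hstack([np.ones((4, 1)), one_hot])

# 4 columns but only rank 3: the intercept is the sum of the
# one-hot columns, so X'X is singular
print(np.linalg.matrix_rank(X))   # 3
print(np.linalg.det(X.T @ X))     # ~0 (singular)
```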

That said, the solution is pretty simple.

Drop any arbitrary feature from the one-hot encoded features.

This instantly mitigates multicollinearity and breaks the linear relationship that existed before, as depicted below:

The above way of categorical data encoding is also known as dummy encoding, and it helps us eliminate the perfect multicollinearity introduced by one-hot encoding.
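In code, dummy encoding is one flag away. Here’s a minimal pandas sketch (same made-up city column as before); scikit-learn users can get the equivalent behavior with OneHotEncoder(drop="first"):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "London", "Tokyo", "Paris"]})

# Dummy encoding: drop one (arbitrary) category so the remaining
# columns no longer sum to a constant
dummies = pd.get_dummies(df["city"], drop_first=True, dtype=int)
print(dummies)  # 'London' is dropped and becomes the all-zeros baseline
```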

The visual below neatly summarizes this entire post:

Further reading

If you want to learn about more categorical data encoding techniques, here’s the visual from a recent post in this newsletter: 7 Must-know Techniques for Encoding Categorical Features.

What’s more, did you know that L2 regularization is also a great remedy for multicollinearity? We covered this topic in another issue here: L2 Regularization is Much More Magical Than Most People Think.
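As a rough illustration (the tiny dataset below is entirely made up), Ridge regression from scikit-learn happily fits the full one-hot design, because the L2 penalty makes the otherwise singular problem well-posed:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Fully one-hot encoded design, perfectly collinear with the intercept
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
y = np.array([3.0, 5.0, 7.0, 3.1])

# The L2 penalty shrinks the coefficients and yields a unique,
# stable solution despite the collinearity
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)
```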

👉 Over to you: What are some other problems with one-hot encoding?

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!

Latest full articles

If you’re not a full subscriber, here’s what you missed last month:

To receive all full articles and support the Daily Dose of Data Science, consider subscribing:

👉 Tell the world what makes this newsletter special for you by leaving a review here :)

👉 If you love reading this newsletter, feel free to share it with friends!
