A Critical Feature Engineering Direction That Many ML Models Forget to Explore
Understanding cyclical feature engineering.
In typical machine learning datasets, we mostly find features that progress from one value to another. For instance:
Numerical features like age, income, transaction amount, etc.
Categorical features like t-shirt size, income groups, age groups, etc.
However, there is one more type of feature, which, in most cases, deserves special feature engineering effort but is often overlooked.
These are cyclical features, i.e., features with a recurring pattern (or cycle).
Unlike other features that progress continuously (or have no inherent order), cyclical features exhibit periodic behavior and repeat after a specific interval.
For instance, the hour-of-the-day, the day-of-the-week, and the month-of-the-year are all common examples of cyclical features.
Talking specifically about, say, the hour-of-the-day, its value ranges from 0 to 23.
If we DON’T consider this as a cyclical feature and don’t utilize appropriate feature engineering techniques, we will lose some really critical information.
To understand better, consider this:
Realistically speaking, the values “23” and “0” must be close to each other in our “ideal” feature representation of the hour-of-the-day.
Moreover, the distance between “0” and “1” must be the same as the distance between “23” and “0”.
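However, the standard linear representation does not fulfill these properties.
On a linear scale, “23” and “0” are 23 units apart, while “0” and “1” are only 1 unit apart. So the value “23” ends up far from “0”, and the equal-distance property isn’t satisfied either.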
Now, think about it for a second.
Intuitively speaking, don’t you think this feature deserves special feature engineering, i.e., one that preserves the inherent natural property?
I am sure you do!
Let’s understand how we typically do it.
Cyclical feature encoding
One of the most common techniques to encode such a feature is using trigonometric functions, specifically, sine and cosine.
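These are helpful because sine and cosine are periodic, bounded, and defined for all real values. Using both together matters: sine alone maps two different hours (e.g., 1 and 11) to the same value, whereas the (sine, cosine) pair identifies each hour uniquely.
Of course, other trigonometric functions are periodic too, but they are undefined for some values. For example, tan(π/2) is undefined.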
For instance, consider representing the linear hour-of-the-day feature as a cyclical feature:
The central angle (2π) represents 24 hours, so each hour h corresponds to the angle 2πh/24.
Thus, the linear feature values can be easily converted into two cyclical features: sin(2πh/24) and cos(2πh/24).
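A minimal sketch of this conversion, using only Python's standard library. The function name `encode_hour` is an illustrative choice, not something from the original post:

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / 24  # the full circle (2π) spans 24 hours
    return math.sin(angle), math.cos(angle)

# Print the encoded pair for a few sample hours.
for h in (0, 6, 12, 18, 23):
    s, c = encode_hour(h)
    print(f"hour={h:2d}  sin={s:+.3f}  cos={c:+.3f}")
```

The same formula applies element-wise to a whole column if you work with NumPy arrays or pandas Series instead of single values.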
The benefit of doing this is how neatly the engineered feature satisfies the properties we discussed earlier:
In the sine-cosine representation, the distance between “23” and “0” is the same as the distance between “0” and “1”.
The standard linear representation of the hour-of-the-day feature, however, violates this property.
Strictly speaking, this is less a loss of information than an underutilization of it: the cyclical structure is there, but the model cannot easily benefit from it.
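The distance property can be checked numerically. The sketch below assumes Euclidean distance between the encoded (sin, cos) points; the helper names are hypothetical:

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b):
    """Euclidean distance between two hours after sine/cosine encoding."""
    return math.dist(encode_hour(a), encode_hour(b))

# Cyclical encoding: the two distances match (both about 0.261).
print(encoded_distance(23, 0))
print(encoded_distance(0, 1))

# Linear representation: the same pairs are 23 and 1 units apart.
print(abs(23 - 0), abs(0 - 1))
```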
Had it been the day-of-the-week instead, the central angle (2π) would have represented 7 days.
This makes intuitive sense as well.
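Generalizing to any period only requires swapping 24 for the length of the cycle. A sketch, assuming the convention Monday=0 through Sunday=6 (the function name `cyclical_encode` is illustrative):

```python
import math

def cyclical_encode(value, period):
    """Return the (sin, cos) pair for a value that repeats every `period` units."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Assumed convention: Monday=0, ..., Sunday=6.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
encoded = {day: cyclical_encode(i, period=7) for i, day in enumerate(days)}

# Sunday and Monday are as close as any other pair of consecutive days.
print(math.dist(encoded["Sun"], encoded["Mon"]))
print(math.dist(encoded["Mon"], encoded["Tue"]))
```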
The same idea can be extended to all sorts of cyclical features you may find in your dataset:
Wind direction, if represented categorically, will go in this order: N, NE, E, SE, S, SW, W, NW, and then back to N.
Phases of the moon, like new moon, first quarter, full moon, and last quarter, can be represented as categories with a cyclical order.
Seasons, such as spring, summer, fall, and winter, are categorical features with a cyclical pattern as they repeat annually.
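For categorical features like these, one possible approach is to list the categories in their cyclical order, use each category's position as the value, and use the number of categories as the period. This is a sketch under that assumption, with the helper `encode_category` made up for illustration:

```python
import math

# Categories listed in their cyclical order, as in the wind-direction example.
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def encode_category(label, ordered_labels):
    """Encode a cyclically ordered category as a (sin, cos) pair."""
    position = ordered_labels.index(label)
    angle = 2 * math.pi * position / len(ordered_labels)
    return math.sin(angle), math.cos(angle)

wind = {d: encode_category(d, DIRECTIONS) for d in DIRECTIONS}

# NW wraps around to sit next to N, at the same distance as N to NE.
print(math.dist(wind["NW"], wind["N"]), math.dist(wind["N"], wind["NE"]))

# S lands diametrically opposite N, at distance 2 on the unit circle.
print(math.dist(wind["N"], wind["S"]))
```

The same pattern would apply to moon phases or seasons by replacing the ordered list.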
The point is that as you inspect the dataset features, you will intuitively know which features are cyclical and which are not.
With a cyclical encoding, the model will typically find it easier to pick up the periodic pattern in these features and use it when modeling the dataset.
👉 Over to you: What are some other ways to handle such features?