Boost Sklearn Model Training and Inference by Doing (Almost) Nothing

...along with some lesser-known advice on vectorized operations.

The best thing about sklearn is that it provides a standard and elegant API across all of its ML model implementations:

  • Train the model using model.fit().

  • Predict the output using model.predict().

  • Compute the accuracy using model.score().

  • and more.
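For instance, here's a minimal illustration of that API on a toy dataset:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # A small toy dataset just to illustrate the standard API
    X, y = make_classification(n_samples=1_000, n_features=5, random_state=42)

    model = LogisticRegression()
    model.fit(X, y)               # train
    preds = model.predict(X)      # predict
    accuracy = model.score(X, y)  # accuracy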

However, the problem with training a model this way is that the entire dataset must be available in memory to train the model.

But what if the dataset is too large to load in memory?

These are called out-of-memory datasets, and training on such datasets can be:

  • time-consuming, or,

  • impossible due to memory constraints.

One thing that I don’t find many sklearn users utilizing in such situations is its incremental learning API.

As the name suggests, it lets the model learn incrementally from a mini-batch of instances.

This avoids memory issues since only a few instances are loaded into memory at a time.

What’s more, by loading and training on a few instances at a time, we can speed up the training of sklearn models.

Why?

Usually, when we use the model.fit(X, y) method to train a model in sklearn, the training process is vectorized, but over the entire dataset at once.

While vectorization provides magical run-time improvements when we have a bunch of data, the performance often degrades after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Let’s test it.

Consider a classification dataset occupying ~4 GB of local storage space:

  • 20 Million training instances

  • 5 features

  • 2 classes

We train an SGDClassifier model using sklearn on the entire dataset as follows:
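The exact snippet isn't reproduced here, but the full-dataset training looks roughly like this (the target column name is an assumption for illustration):

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    # Load the entire ~4 GB dataset into memory at once
    df = pd.read_csv("large_dataset.csv")
    X, y = df.drop(columns=["target"]), df["target"]

    # A single fit() call over all 20 million rows
    model_full = SGDClassifier(random_state=42)
    model_full.fit(X, y)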

The above training takes about 251 seconds.

Next, let’s train the same model using the partial_fit() API of sklearn. This is implemented below:

  • We load data from the CSV file large_dataset.csv in chunks.

  • After loading a specific chunk, we invoke the partial_fit() API on the SGDClassifier model.
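A rough sketch of this chunked training loop (the chunk size and target column name are illustrative):

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    model_chunk = SGDClassifier(random_state=42)
    all_classes = [0, 1]  # partial_fit() needs every class label up front

    # Stream the CSV in chunks instead of loading it all at once
    for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
        X, y = chunk.drop(columns=["target"]), chunk["target"]
        model_chunk.partial_fit(X, y, classes=all_classes)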

The training time is reduced by ~8 times, which is massive.

This validates what we discussed earlier:

While vectorization provides magical run-time improvements when we have a bunch of data, it is observed that the performance may degrade after a certain point.

Thus, by loading fewer training instances at a time into memory and applying vectorization, we can get a better training run-time.

Here, we must also compare the model coefficients and prediction accuracy of the two models.

The following visual depicts the comparison between model_full and model_chunk:

Both models have similar coefficients and similar performance.
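If you want to reproduce such a comparison, a quick check could look like this (assuming a held-out test set X_test, y_test):

    # Coefficients learned by each model
    print(model_full.coef_, model_full.intercept_)
    print(model_chunk.coef_, model_chunk.intercept_)

    # Accuracy of each model on the same held-out test set
    print(model_full.score(X_test, y_test))
    print(model_chunk.score(X_test, y_test))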

Here, it is also worth noting that not all sklearn estimators implement the partial_fit() API.

Here’s the list of models that do:

Of course, the list is short. Yet, it is surely worth exploring to see if you can benefit from it.
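In fact, you can generate this list programmatically yourself; here's one quick sketch:

    from sklearn.utils import all_estimators

    # Every sklearn estimator class that exposes a partial_fit() method
    incremental = [name for name, cls in all_estimators()
                   if hasattr(cls, "partial_fit")]
    print(incremental)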

Having said that, once we have trained our sklearn model (either on a small dataset or large), we may want to deploy it.

However, it is important to note that sklearn models are backed by NumPy computations, which can only run on a single CPU core.

Thus, in a deployment scenario, this can lead to suboptimal run-time performance.

Nonetheless, it is possible to compile many sklearn models to tensor operations, which can be loaded on a GPU to gain immense speedups.

This is evident from the image below:

  • The compiled model runs:

    • Over twice as fast on a CPU.

    • ~40 times faster on a GPU, which is huge.

  • All models have the same accuracy — indicating no loss of information during compilation.
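The compilation library isn't shown in this post, but one tool that performs this kind of sklearn-to-tensor conversion is Hummingbird (hummingbird-ml); here's a sketch under that assumption:

    from hummingbird.ml import convert

    # Compile the trained sklearn model to PyTorch tensor operations
    tensor_model = convert(model_chunk, "pytorch")

    # Move the compiled model to a GPU (if available) and run inference there
    tensor_model.to("cuda")
    preds = tensor_model.predict(X_test)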

If you want to learn more about this critical model compilation technique, we discussed it in detail here: Sklearn Models are Not Deployment Friendly! Supercharge Them With Tensor Computations.

👉 Over to you: What are some other ways to boost sklearn models?

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.

The button is located towards the bottom of this email.

Thanks for reading!

