Daily Dose of Data Science
Sourcery: The AI Pair Programmer That Every Python Programmer Must Have

Build reliable, documented and tested projects with Sourcery.

November 27, 2023 • Reading Time: 8 minutes

In my opinion, machine learning deserves the rigor of any software engineering field.

Data pipelines must always be reusable, modular, scalable, testable, maintainable, and well-documented.

But of course, creating an executable, reproducible, error-free, and organized project is hard, especially for people in data science.

Dedicated software engineers do this day in and day out, but it does not come as naturally to people in the data world.

To address this, one of the coolest and most powerful utilities I added a few months back to my tooling stack is Sourcery.

It’s an AI pair programmer that has helped me automate many redundant and tedious tasks in my data pipelines:

Writing better code (or refactoring)
Reviewing code
Writing test cases
Improving code
Writing docstrings

Let’s look at what it does and how you can use it in your projects.

To begin, there are various ways to use Sourcery.

This post focuses primarily on its VS Code extension. Download it here.

To get started, open the command line, and install Sourcery.

Next, create an account here to use Sourcery’s pair programmer.
Once you do, go back to the command line and connect your local machine to Sourcery:

Done!

Now we can use the pair programmer for the tasks listed above: refactoring, code review, writing test cases, code understanding, docstrings, etc.

For this demonstration, we have a small data pipeline:

It loads a CSV file.
Next, it preprocesses the data.
Finally, it trains the model.

As depicted above, Sourcery adds buttons above every function definition in VS Code:

Explain code
Generate tests
Generate docstring
Ask Sourcery

Looking at the code above, there are plenty of issues. To name a few:

What if the file does not exist?
- There are no checks for that.
- If the file does not exist, the preprocessing and modeling steps will fail.
What if the path exists, but the CSV file is empty?
What about the data separator?
In the train_test_split() call, random_state and test_size have not been specified.
And, of course, there are no docstrings.

Let’s try to correct this with Sourcery!

First, we add a docstring:

For the load_data() function, it generates the following, which is indeed correct.

Next, let’s generate some tests for this function.

The generated tests perfectly align with what we discussed above:

What if the file does not exist → Sourcery generates a test for that, which helps us identify a potential source of error.
What if the file has no records → Sourcery generates a test for that, which helps us identify another potential source of error.

Since the current version of the load_data() function does not check for these cases, it will fail on such inputs.

But this feedback from Sourcery lets us identify a source of error, which we can then fix in the code.
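The two cases can be sketched as plain checks. The snippet below is a minimal illustration, not Sourcery's actual output, which is not reproduced here. `load_data_naive` is a hypothetical stand-in for the unvalidated loader, and it uses the standard-library `csv` module instead of pandas so that it runs without extra dependencies. The helper names `raises`, `check_missing_file` and `check_empty_file` are also assumptions.

```python
import csv
import os
import tempfile


def load_data_naive(path):
    """Hypothetical stand-in for the original loader: no validation at all."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def raises(exc_type, func, *args):
    """Return True if func(*args) raises exc_type, False if it returns normally."""
    try:
        func(*args)
    except exc_type:
        return True
    return False


def check_missing_file():
    """A missing file should surface as FileNotFoundError."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "missing.csv")
        return raises(FileNotFoundError, load_data_naive, missing)


def check_empty_file():
    """A file with a header but no records should surface as ValueError."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "empty.csv")
        with open(path, "w", newline="") as f:
            f.write("feature,label\n")  # header only, zero records
        return raises(ValueError, load_data_naive, path)
```

With this naive loader, `check_missing_file()` returns True only because `open()` happens to raise, while `check_empty_file()` returns False. The empty file slips through silently and would only cause trouble later in preprocessing or training.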

In fact, how about we ask Sourcery to do it?

To do so, we select the corresponding function, head over to the chat window, and provide the following prompt:

Modify the function so that:
- if the input file does not exist, raise a file not found error
- if the input file exists but has no records, raise a value error

It gives the following output, which is indeed correct:
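Since the generated code is not reproduced here, the following is a rough sketch of what a function satisfying that prompt could look like. It is an assumption-laden illustration rather than Sourcery's output: it reads the file with the standard-library `csv` module instead of pandas, and the `delimiter` parameter is a hypothetical addition that addresses the separator question raised earlier.

```python
import csv
import os


def load_data(path, delimiter=","):
    """Load a CSV file into a list of row dictionaries.

    Args:
        path: Location of the CSV file.
        delimiter: Field separator (hypothetical parameter, defaults to a comma).

    Raises:
        FileNotFoundError: If no file exists at `path`.
        ValueError: If the file exists but contains no records.
    """
    # Fail early with a clear message instead of deep inside the pipeline.
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No such file: {path}")

    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter=delimiter))

    # A header-only or zero-byte file yields no rows.
    if not rows:
        raise ValueError(f"File has no records: {path}")
    return rows
```

Raising at load time keeps the error close to its cause, so the preprocessing and modeling steps never receive an empty dataset.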

Finally, let’s see if it can identify the missing parameters in the train_test_split() call: random_state and test_size.

Similar to the above demo, we select the train_model() function. Next, we ask Sourcery to simplify it.

Here is the output:

As depicted above, Sourcery flags the missing parameters, which is great.

Additionally, it points out that RandomForestClassifier() is used with its defaults and suggests explicitly specifying the number of trees in the model.

Lastly, it also provides us with improved code.
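To see why pinning these parameters matters, here is a small dependency-free sketch of a seeded shuffle-and-split. It is not scikit-learn's implementation and not the code Sourcery produced. `split_rows` is a hypothetical, simplified stand-in whose `test_size` and `random_state` arguments only mirror the names of the real train_test_split() parameters.

```python
import random


def split_rows(rows, test_size=0.2, random_state=None):
    """Shuffle rows and split them into (train, test) lists.

    Simplified, hypothetical stand-in for scikit-learn's train_test_split.
    """
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction between 0 and 1")

    shuffled = list(rows)  # copy so the caller's data is left untouched
    # A fixed seed gives the same shuffle on every run; None gives a fresh one.
    random.Random(random_state).shuffle(shuffled)

    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Same seed, same split: results can be compared across runs.
train_a, test_a = split_rows(range(10), test_size=0.2, random_state=42)
train_b, test_b = split_rows(range(10), test_size=0.2, random_state=42)
```

With the seed left at None, each run may evaluate the model on a different subset, so a change in accuracy cannot be attributed to a code change alone. Spelling out the test fraction likewise documents the intended split instead of relying on a library default.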

Isn’t that cool?

Two more handy features of Sourcery that I actively use are:

It gives an on-demand review of the changes made to the code.
You can set up the GitHub action so that Sourcery reviews every GitHub PR. It provides feedback much as a human reviewer would:

Of course, because Sourcery is AI-driven, it can make mistakes here and there. This applies to all AI tools in general, not just Sourcery.

However, because it is built specifically for Python programmers, I have found it pretty reliable and effective.

Also, Sourcery is not limited to just data science tasks. You can use it for almost any Python-related project. And it supports many other languages as well.

Get started here: Sourcery Docs.

👉 Over to you: Do you use an AI pair programmer? If yes, which one?

Thanks for reading!
