Building an All-in-One Audio Analysis App Using AssemblyAI

Transcript, sentiment analysis, speaker labels, topics, Q&A, all in one place.

A couple of months back, we learned about AssemblyAI, an AI transcription platform that provides state-of-the-art AI models for any task related to speech & audio understanding.

Most of you showed interest in learning to build an end-to-end voice application in that newsletter’s poll.

So today, let me walk you through building an all-encompassing audio analysis application using AssemblyAI API that will take an audio file as input and:

1. Transcribe the audio
2. Perform sentiment analysis on the audio
3. Summarize the audio
4. Identify named entities mentioned in the audio
5. Extract broad ideas from the audio
6. Let the user interact with the audio in natural language.

By the end of this newsletter, we will have built the following app:

As mentioned above, we will use the AssemblyAI API to transcribe the audio file and Streamlit to build the web application in Python.

Let’s begin!

App Workflow

Before building the app, let me highlight its overall workflow.

  • The user will upload the audio file.

  • This audio will be sent for transcription to AssemblyAI. In the API call, we will enable sentiment analysis, language identification, summarization, etc.

  • We will use the transcript returned by AssemblyAI to display the outputs.

  • We shall create a text field in the app, which the user can use to chat with the audio transcript using natural language. This will be facilitated by AssemblyAI’s LeMUR.

Let’s implement it!

Audio analysis app using AssemblyAI

You can download the code for this project here: AssemblyAI audio analysis app.

To get started, you must get an AssemblyAI API key from here. They provide $50 worth of free transcription credits, which is sufficient for this demo.

Next, install the AssemblyAI Python package as follows:
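
Something like this should do it (Streamlit is included here as well, since we’ll use it to build the app):

```bash
pip install assemblyai streamlit
```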

For the audio file, I am using Andrew Ng’s podcast with Lex Fridman (the audio file is available in the code I provided above).

At its core, transcribing a file with AssemblyAI just takes a few lines of code:
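
Here’s a minimal sketch of what that looks like (the API key and the audio file path are placeholders; the line numbers in the breakdown below refer to this snippet):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

audio_file = "./andrew_ng_lex_fridman.mp3"

config = aai.TranscriptionConfig(speaker_labels=True,
                                 iab_categories=True,
                                 sentiment_analysis=True,
                                 summarization=True,
                                 language_detection=True)
transcriber = aai.Transcriber(config=config)

transcript = transcriber.transcribe(audio_file)
```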

  • Line 1 → Import the package

  • Line 3 → Set the API key

  • Line 5 → Specify the file path (this can be a remote location if you prefer that).

  • Line 7-12 → Instantiate the AssemblyAI config and transcriber objects.

  • Line 14 → Send the file for transcription.

Also, since we want to generate more insights in addition to the transcript in this demo, we must specify them in the TranscriptionConfig object, as shown in the code above. Here’s a description of these arguments:

  • speaker_labels=True identifies the speakers in the audio.

  • iab_categories=True extracts broad topics from the audio.

  • sentiment_analysis=True predicts the sentiment of every sentence.

  • summarization=True generates a summary of the transcript.

  • language_detection=True detects the language in the audio.

Let’s build the application now!

We start by importing Streamlit and AssemblyAI and setting the AssemblyAI API key:
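
For simplicity, the key is hardcoded in this sketch; in practice, you’d likely load it from an environment variable:

```python
import streamlit as st
import assemblyai as aai

# Set the AssemblyAI API key
aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"
```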

Next, we define a generate_results method, which will accept the transcript object returned by AssemblyAI and display the transcript and other insights inside visual elements from Streamlit:
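
All the display logic in this section lives inside that one function; here’s a rough skeleton, with each piece filled in below:

```python
def generate_results(transcript):
    # Display the summary, sentence-level transcript, speaker labels,
    # sentiments, and topics returned by AssemblyAI
    # (each part is shown step by step in the snippets that follow)
    ...
```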

First, we display the transcription summary using the summary attribute of the transcript object:
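
A minimal sketch, assuming the snippet sits inside generate_results:

```python
# Inside generate_results(transcript)
with st.expander("Summary"):
    st.write(transcript.summary)
```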

Just for more context, this is known as an expander in Streamlit:

Next, we display the sentence-level transcription using the get_sentences() method of the transcript object, yet again, in an expander:
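
Again, a sketch; get_sentences() returns sentence objects whose text we render one by one:

```python
# Inside generate_results(transcript)
with st.expander("Transcription"):
    for sentence in transcript.get_sentences():
        st.markdown(sentence.text)
```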

We can print the speaker labels along with the transcript as follows:
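
Speaker labels are available through the utterances attribute of the transcript (since we set speaker_labels=True); a sketch:

```python
# Inside generate_results(transcript)
with st.expander("Speaker labels"):
    for utterance in transcript.utterances:
        st.markdown(f"**Speaker {utterance.speaker}:** {utterance.text}")
```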

Determining the sentiment of each sentence is quite simple as well. Similar to how we displayed speaker labels, we can display sentiment labels sentence-by-sentence as follows:
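
Here’s a sketch that walks over transcript.sentiment_analysis and picks a Streamlit box per sentiment:

```python
# Inside generate_results(transcript)
with st.expander("Sentiment analysis"):
    for result in transcript.sentiment_analysis:
        sentiment = result.sentiment.value  # "POSITIVE", "NEUTRAL", or "NEGATIVE"
        if sentiment == "POSITIVE":
            st.success(result.text)
        elif sentiment == "NEGATIVE":
            st.error(result.text)
        else:
            st.warning(result.text)
```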

Here, we have encoded the sentiment using the warning, success, and error boxes provided by Streamlit.

Specifying iab_categories=True in the config object enables topic detection. Here’s how we display the detected topics:
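
The detected topics live in transcript.iab_categories.summary, which maps each topic to a relevance score; a sketch:

```python
# Inside generate_results(transcript)
with st.expander("Topics discussed"):
    for topic, relevance in transcript.iab_categories.summary.items():
        st.markdown(f"- {topic} ({relevance:.0%} relevance)")
```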

With that, we have defined our generate_results method.

Next, let’s define the main() method, wherein we will use a file uploader widget from Streamlit, invoke the AssemblyAI API, and then use the generate_results method to display the results.
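
Here’s a sketch; the widget labels and the temporary-file handling are just one way to wire this together:

```python
import os
import tempfile

def main():
    st.title("Audio Analysis App")

    uploaded_file = st.file_uploader("Upload an audio file",
                                     type=["mp3", "wav", "m4a"])

    if uploaded_file is not None:
        # Persist the upload to a temporary file so we can pass a path to AssemblyAI
        suffix = os.path.splitext(uploaded_file.name)[1]
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(uploaded_file.read())
            audio_path = tmp.name

        # Same config as before: enable all the insights we want
        config = aai.TranscriptionConfig(speaker_labels=True,
                                         iab_categories=True,
                                         sentiment_analysis=True,
                                         summarization=True,
                                         language_detection=True)
        transcriber = aai.Transcriber(config=config)

        with st.spinner("Transcribing and analyzing the audio..."):
            transcript = transcriber.transcribe(audio_path)

        generate_results(transcript)


if __name__ == "__main__":
    main()
```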

Almost done!

Finally, we shall also make use of AssemblyAI’s LeMUR, which is a framework that allows us to build LLM apps on speech data in a few lines of code.

Note that you need to upgrade your account to access LeMUR.

It lets us interact with the audio in natural language to:

  • Extract some specific insights from the audio.

  • Do Q&A on the audio.

  • Get any other details you wish to extract, etc.

This is implemented below:
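
Assuming this block sits at the end of main(), right after generate_results(transcript), here’s a sketch (the line numbers in the breakdown below refer to this snippet):

```python
st.subheader("Chat with the audio")

question = st.text_input("Ask a question about the audio:")

if question:
    prompt = f"Based on the transcript, answer this question: {question}"

    # Send the prompt and the transcript to LeMUR
    result = transcript.lemur.task(prompt)

    # Display LeMUR's response in the app
    answer = result.response
    st.markdown("**LeMUR's answer:**")
    st.write(answer)
```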

  • Line 3 → We declare a text field.

  • Line 6 → We define a prompt.

  • Line 9 → We use the transcript object returned earlier to prompt LeMUR.

  • Line 12-14 → We print the response.

Done!

This DID NOT require us to do any of the following:

  • Store transcripts,

  • Ensure that the transcript fits into the model’s context window,

  • Invoke an LLM, etc.

LeMUR handles everything under the hood.

Isn’t that cool?

Run the app

Finally, we run the application as follows:
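
Assuming the script is saved as app.py:

```bash
streamlit run app.py
```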

…which opens a window in the browser:

Uploading an audio file produces the expected results in seconds:

Moreover, the LeMUR framework works as expected as well:

That was simple to build, wasn’t it?

Note that it did not require more than 80-100 lines of code to implement.

You can find the code for this project here: AssemblyAI audio analysis app.

Performance was never really a focus of this app. That is why I used Streamlit. Had performance been a factor, I would have used Taipy instead.

A departing note

I first used AssemblyAI two years ago, and in my experience, it has the most developer-friendly and intuitive SDKs to integrate speech AI into applications.

Earlier this year, they released Universal-1, which is a state-of-the-art multimodal speech recognition model trained on 12.5M hours of multilingual audio data.

The results are shown below:

  • Universal-1 exhibits 10% or greater improvement in English, Spanish, and German speech-to-text accuracy compared to the next-best system.

  • Universal-1 has 30% lower hallucination rate than OpenAI Whisper Large-v3.

  • Universal-1 is capable of transcribing multiple languages within a single audio file.

Isn’t that impressive?

I love AssemblyAI’s mission of supporting developers in building next-gen voice applications in the simplest and most effective way possible.

They have already made a big dent in speech technology, and I’m eager to see how they continue from here.

Get started with their API docs if you want to explore their services: AssemblyAI API docs.

🙌 Also, a big thanks to AssemblyAI, who very kindly partnered with me on this post and let me share my thoughts openly.

👉 Over to you: What would you use AssemblyAI for?

SPONSOR US

Get your product in front of ~90,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email.
