A few years ago I inherited some digital files which included ~76 hours of audio. At the time I skimmed through various bits to see if there was anything interesting, but it largely contained daily happenings. Recently I was thinking about a problem involving transcribing audio and realized I had a good sample to practice with. I had heard of OpenAI's Whisper model through using the OpenAI API, but I soon found that Whisper is open source and decided to try running it locally.
First Steps #
Following the set-up instructions was easy enough, though make sure to install/update ffmpeg if you don't have it already. Whisper has command-line support, which worked well, but with several hundred files I needed to use Python.
import whisper
model = whisper.load_model("base")
result = model.transcribe("Test1.mp3")
print(result["text"])
This is simply a test of this external microphone attached to my mini micro digital voice recorder.
The initial test worked as expected. Next, I needed a way to write the transcription to a .txt file, so I replaced the print statement.
import whisper
model = whisper.load_model("base")
result = model.transcribe("Test1.mp3")
with open("transcription.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])
This successfully created a text file containing the result.
Scaling the script #
So far so good, but there were a few more things on my wishlist to make all of this usable:
1. Run whisper on GPU #
I needed to ensure the Whisper model was running on my GPU. I have an NVIDIA GTX 1070, which is capable of running the CUDA Toolkit. I installed the toolkit and also installed the CUDA version of PyTorch.
After installation I ran a simple test script to check that it was working:
import torch
x = torch.rand(5, 3)
print(x)
if torch.cuda.is_available():
    print("CUDA is available. PyTorch can use the GPU.")
    print("CUDA version:", torch.version.cuda)
    print("Number of GPUs available:", torch.cuda.device_count())
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available. PyTorch cannot use the GPU.")
tensor([[0.1242, 0.0886, 0.0134],
        [0.6424, 0.4822, 0.7987],
        [0.6571, 0.8142, 0.3190],
        [0.9580, 0.7326, 0.6206],
        [0.8626, 0.5179, 0.9533]])
CUDA is available. PyTorch can use the GPU.
CUDA version: 11.8
Number of GPUs available: 1
GPU: NVIDIA GeForce GTX 1070
If this doesn't work, you may need to uninstall PyTorch and install the version built with CUDA support.
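One quick way to tell whether you ended up with a CPU-only build is to check which CUDA version PyTorch was built against (a minimal check; on CPU-only builds this reports None):
import torch
# None on CPU-only builds; a version string such as "11.8" on CUDA builds.
print("Built with CUDA:", torch.version.cuda)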
With this working, now to add it to my script by passing the device argument to whisper.load_model:
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
Additionally, since we are using a GPU we can use 16-bit floating point matrices. We specify fp16 as a model.transcribe argument, again checking whether our GPU is available.
result = model.transcribe("Test1.mp3", fp16=torch.cuda.is_available())
Using fp16 will increase the speed but does, by design, reduce the accuracy. We’ll be making greater improvements to accuracy in the next step.
2. Improve Accuracy #
In my initial results on some longer audio samples the basic script worked alright, but there were areas with noticeable mistakes. I opted to use a different size Whisper model, and since my audio is in English I used the English-only version.
model = whisper.load_model("medium.en", device=device)
This model has 769M parameters compared to the base model's 74M, but it will be much slower to run. For a comparison of models, refer to the whisper README.md.
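If you want to see which model names you can pass to load_model, the whisper package includes a small helper; a quick check might look like:
import whisper
# Prints the available model names, e.g. "tiny", "base.en", "medium.en", "large", ...
print(whisper.available_models())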
3. Iterating through subdirectories #
Next I needed to iterate through all .mp3 files within a directory (including subdirectories). For this I used the built-in os package; I just needed a simple loop:
for root, dirs, files in os.walk("."):
    for file in files:
        if file.lower().endswith(".mp3"):
            file_path = os.path.join(root, file)
            transcribe(model, file_path)
Note that I am now passing the model and filepath to a transcribe function. An initial test of a few test files within subdirectories worked as expected.
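At this point the transcribe function is just a thin wrapper around the earlier snippet; a minimal sketch (writing a plain .txt next to each audio file) might look like:
import os
import torch

def transcribe(model, file_path):
    # Transcribe a single file and write the raw text next to it.
    result = model.transcribe(file_path, fp16=torch.cuda.is_available())
    output_file = f"{os.path.splitext(file_path)[0]}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(result["text"])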
4. Format Text #
The short sample I showed doesn't capture it, but the entire result text was printed on a single line, which made longer passages very hard to read. Breaking the text into separate lines can be done a few different ways. I went with a simple approach initially:
formatted_text = result["text"].replace('. ', '.\n')
And then passed formatted_text to our .txt file instead of result. This worked by itself, though it would insert a line break after something like spelled-out initials.
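For example, a hypothetical sentence containing initials ends up split mid-name:
text = "I spoke with J. R. Smith about the recorder. It worked fine."
print(text.replace('. ', '.\n'))
# I spoke with J.
# R.
# Smith about the recorder.
# It worked fine.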
A better approach I found was to utilize result["segments"], which is already broken up into pieces that we just need to stitch back together:
formatted_text = ""
for segment in result["segments"]:
    text = segment["text"]
    formatted_text += f'{text}\n'
And as before, passing formatted_text to the .txt file. This could be condensed, but we'll be adding onto it in the next step.
5. Add Timestamps #
Since I wanted the ability to reference the audio based on the transcript, I needed a way to add timestamps. Fortunately each segment in result["segments"] already carries start and end times, and specifying the word_timestamps argument gives even finer, word-level timing:
result = model.transcribe(file_path, fp16=torch.cuda.is_available(), word_timestamps=True)
formatted_text = ""
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"]
    formatted_text += f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}\n"
Here I am capturing the start and end of each segment (in seconds), and I added a format_timestamp function as well, though if you were happy with just seconds you could forgo it:
def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"
Now the .txt file for our test audio reads as:
[00:00:00 --> 00:00:07] This is simply a test of this external microphone attached to my mini micro
[00:00:07 --> 00:00:09] digital voice recorder.
Obviously there is a lot of flexibility in how this is formatted; for example, you could list just the start of each segment in seconds, which would condense things quite a bit and still allow you to look sections up fairly easily.
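A condensed variant along those lines might look like this (same segment loop, just the start time in whole seconds):
formatted_text = ""
for segment in result["segments"]:
    formatted_text += f"[{int(segment['start'])}s] {segment['text']}\n"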
Putting it together #
The final version of my script ended up looking like this:
import os
import torch
import whisper


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model("medium.en", device=device)
    for root, dirs, files in os.walk("."):
        for file in files:
            if file.lower().endswith(".mp3"):
                file_path = os.path.join(root, file)
                transcribe(model, file_path)


def transcribe(model, file_path):
    result = model.transcribe(file_path, fp16=torch.cuda.is_available(), word_timestamps=True, verbose=True)
    formatted_text = ""
    for segment in result["segments"]:
        start = segment["start"]
        end = segment["end"]
        text = segment["text"]
        formatted_text += f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}\n"
    output_file = f"{os.path.splitext(file_path)[0]}_transcription.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write(formatted_text)
    # os.remove(file_path)  # uncomment to delete the audio file after transcription


def format_timestamp(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = int(seconds % 60)
    return f"{hours:02}:{minutes:02}:{seconds:02}"


if __name__ == "__main__":
    main()
I did end up wanting to remove files as they were converted, since the audio files were already backed up separately and I was going to be pushing the .txt files to those backup locations. I also made each text file keep the same name as the audio file, with _transcription appended.
Overall, it worked as expected. In the end I had 76 hours (~4 GB) of audio and was able to convert it all to text overnight. Not bad for throwing something together at night and letting it run while I slept. The final .txt files were not all that interesting or illuminating, but running a search through text files is better than listening to (or skipping through) all of it.
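If you want to search the transcripts afterwards, a small script along these lines works; the search term here is just a placeholder:
import os

search_term = "microphone"  # hypothetical term to look for
for root, dirs, files in os.walk("."):
    for file in files:
        if file.endswith("_transcription.txt"):
            path = os.path.join(root, file)
            with open(path, encoding="utf-8") as f:
                for line in f:
                    if search_term.lower() in line.lower():
                        print(f"{path}: {line.strip()}")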
Of course if your data is more interesting you could do a sentiment analysis or something else with it!