Whisper: The Voice-to-Text Model Hidden Beneath the Spotlight

Photographer: Product ManagerThe product manager’s photography skills have reached Unsplash levels, impressive!

On the day the ChatGPT model gpt-3.5-turbo was released, OpenAI also open-sourced a voice-to-text model: Whisper. However, due to the overwhelming attention on ChatGPT, many people overlooked Whisper’s existence.

I was one of them. I once thought that Whisper was also an API that required sending POST requests to OpenAI’s servers, which would then return the recognition results. As a result, I did not try this model for a long time.

It wasn’t until a few days ago that I saw someone on a minority platform post an article introducing a voice recognition app they had just created, stating that this app was based on Whisper and did not require an internet connection. I was puzzled; how could you call Whisper’s API without being online? So, I finally took the time to learn about Whisper and discovered that it is an open-source voice-to-text model from OpenAI, not an API service. This model can run locally offline as long as you have Python installed.

The Github address for Whisper is: https://github.com/openai/whisper. It is very simple to use in Python:

First, install the third-party library:

python3 -m pip install openai-whisper

Next, install ffmpeg on your computer. Here are the installation commands for various systems:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

That’s all the preparation work done. Let’s test how accurate this model is. Below is a recording of mine:

The recording file is located at: /Users/kingname/Downloads/公众号演示.m4a. Now, write the following code:

import whisper

model = whisper.load_model("base")
result = model.transcribe("/Users/kingname/Downloads/公众号演示.m4a")
print(result["text"])

When the model is loaded for the first time, it will automatically fetch the model files. Different model files vary in size. Once the fetching is complete, you won’t need an internet connection for subsequent uses.

The results are shown in the image below:

Whisper: The Voice-to-Text Model Hidden Beneath the Spotlight

Although there are one or two typos, they are not significant. By switching to a larger model, the accuracy can be further improved:

We know that the most troublesome aspect of voice recognition is homophones. In such cases, we can use Whisper in conjunction with ChatGPT to make corrections:

Whisper: The Voice-to-Text Model Hidden Beneath the Spotlight

After testing, the small model has shown excellent recognition performance for Chinese, occupying about 2GB of memory during operation. It is also very fast. When we want to convert audio from a video into text or generate subtitles for a podcast, this model is very convenient, completely free, and there are no worries about voice leaks.

Although Whisper is developed by a foreign company, its recognition performance for Chinese currently surpasses many domestic voice recognition products from major companies. This includes a certain well-known company renowned for its voice recognition, which has been tested and found to be inferior to Whisper. This also indicates that domestic voice recognition technology still needs further improvement and requires more research and development. In this area, domestic products have significant room for improvement and need continuous exploration and innovation to better meet user needs.

END

Whisper: The Voice-to-Text Model Hidden Beneath the Spotlight

Unheard Code·Knowledge Planet is now open!

One-on-one Q&A about crawling-related issues

Career consulting

Interview experience sharing

Weekly live sharing

……

Unheard Code·Knowledge Planet looks forward to meeting you~

Whisper: The Voice-to-Text Model Hidden Beneath the Spotlight

Employees from major companies in first and second-tier cities

Programming veterans with over a decade of experience

Students studying in universities at home and abroad

Newcomers just starting in middle and primary schools

Waiting for you in the “Unheard Code Technical Exchange Group”!

How to join: Add WeChat “mekingname”, and note “Fan Group” (no advertising, serious inquiries only!)

Share this good article with friends~

Leave a Comment Cancel reply