Using OpenAI’s Whisper Model for Speech Recognition

Source: DeepHub IMBA
This article is about 2,200 words and takes roughly 5 minutes to read.
This article will explain the types of datasets used for training, the training methods of the model, and how to use Whisper.

Speech recognition is a field of artificial intelligence that allows computers to understand human speech and convert it into text. This technology is used in devices such as Alexa and various chatbot applications. The most common application is voice transcription, which converts speech into written records or subtitles.

Recent developments in state-of-the-art models like wav2vec 2.0, Conformer, and HuBERT have greatly advanced the field of speech recognition. These models use techniques that allow them to learn from raw audio without the need for manually labeled data, enabling effective use of large datasets of unlabeled speech, and they have been scaled to as much as 1,000,000 hours of training data, far exceeding the roughly 1,000 hours typical of academic supervised datasets. However, because this pre-training is unsupervised, the resulting models still need to be fine-tuned on labeled data before they can perform tasks like speech recognition, which limits their practical reach; models pre-trained with supervision across multiple datasets and domains have, by contrast, been shown to be more robust and to generalize better to held-out datasets. To address this, OpenAI developed Whisper, a model trained with large-scale weak supervision.

Introduction to the Whisper Model

Dataset Used:

The Whisper model was trained on a dataset of 680,000 hours of labeled audio data, including 117,000 hours of speech covering 96 languages other than English and 125,000 hours of translation data from “any language” to English. The transcripts were collected from audio paired with text found on the internet; heuristics were used to detect and remove transcripts that had been generated by other automatic speech recognition (ASR) systems rather than written by humans. The language of each audio clip was tagged with a detector trained on VoxLingua107, a collection of short audio clips extracted from YouTube videos and labeled according to the language of the video titles and descriptions, with additional steps taken to remove false positives.

Model:

The primary architecture used is an encoder-decoder structure.

Resampling: 16,000 Hz.

Feature extraction method: An 80-channel log Mel spectrogram representation is computed using a 25 ms window and a 10 ms stride.

Feature normalization: Inputs are globally scaled to between -1 and 1, with an approximate mean of zero on the pre-training dataset.
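For illustration, the helper functions shipped with the openai-whisper package carry out exactly these preprocessing steps; the snippet below is a minimal sketch (the file path is just a placeholder):

import whisper

audio = whisper.load_audio("audio.wav")     # decode and resample to 16,000 Hz; float32 roughly in [-1, 1]
audio = whisper.pad_or_trim(audio)          # pad or trim to the 30-second context window
mel = whisper.log_mel_spectrogram(audio)    # 80-channel log Mel spectrogram (25 ms window, 10 ms stride)
print(mel.shape)                            # torch.Size([80, 3000])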

Encoder/Decoder: The model’s encoder and decoder use Transformers.

Encoder Process:

The encoder first processes the input representation using a stem of two convolutional layers (filter width 3) with the GELU activation function.

The stride of the second convolutional layer is 2.

Sinusoidal positional embeddings are then added to the output of the stem, followed by the application of the encoder Transformer blocks.

The Transformer blocks use pre-activation residual connections, and the encoder output is passed through a final layer normalization.
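As a rough PyTorch sketch (not Whisper's actual code), the stem described above could be written as follows; the width of 384 is an assumption matching the smallest model, and the padding values are chosen to preserve the frame count:

import torch
from torch import nn

d_model = 384  # assumed width (roughly the 'tiny' model)
stem = nn.Sequential(
    nn.Conv1d(80, d_model, kernel_size=3, padding=1),                 # first conv over the 80 Mel channels
    nn.GELU(),
    nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),  # second conv with stride 2
    nn.GELU(),
)

mel = torch.randn(1, 80, 3000)    # (batch, Mel bins, frames) for 30 seconds of audio
x = stem(mel).permute(0, 2, 1)    # -> (batch, 1500, d_model)
# Sinusoidal positional embeddings are added to x here, then the encoder Transformer blocks are applied.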

Model Block Diagram:

[Figure: Whisper encoder-decoder architecture block diagram]

Decoding Process:

In the decoder, learned positional embeddings and tied input-output token representations are used.

The encoder and decoder have the same width and number of Transformer blocks.
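Tying the input and output token representations simply means the decoder's output projection reuses its token embedding matrix; the sketch below illustrates the idea, with the vocabulary size, width, and context length chosen as assumptions rather than taken from the article:

import torch
from torch import nn

class TiedDecoderEmbedding(nn.Module):
    def __init__(self, vocab_size=51865, d_model=384, max_len=448):   # assumed sizes
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(max_len, d_model))    # learned positional embeddings

    def embed(self, tokens):
        # input representation: token embedding plus learned position embedding
        return self.token_emb(tokens) + self.pos_emb[: tokens.shape[-1]]

    def logits(self, hidden):
        # output projection reuses (ties) the same embedding matrix
        return hidden @ self.token_emb.weight.T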

Training

To study its scaling properties, a suite of models of different sizes was trained.

The model was trained using FP16, dynamic loss scaling, and data parallelism.

Using AdamW and gradient norm clipping, the learning rate decays linearly to zero after warming up for the first 2048 updates.

Using a batch size of 256, the model was trained for 2^20 updates, which corresponds to between two and three passes over the dataset.
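In PyTorch terms, the optimization setup described above could look like the following sketch; this is assumed code in the spirit of the description rather than OpenAI's training script, and the peak learning rate, weight decay, and clipping norm are placeholder values:

import torch
from torch import nn

model = nn.Linear(80, 384)     # stand-in for the real model
total_updates = 2 ** 20        # roughly two to three passes over the dataset
warmup_updates = 2048

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)  # placeholder hyperparameters

def lr_lambda(step):
    if step < warmup_updates:
        return step / warmup_updates                                             # linear warmup
    return max(0.0, (total_updates - step) / (total_updates - warmup_updates))   # linear decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient norm clipping
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()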

As the model was only trained for a few epochs, overfitting was not a significant issue, and no data augmentation or regularization techniques were used. Instead, it relied on the diversity within the large dataset to promote generalization and robustness.

Whisper demonstrated good accuracy on previously used datasets and has been tested against other state-of-the-art models.

Advantages:

  • Whisper was trained under weak supervision on real-world data, in addition to the kinds of data used by other models.

  • The model’s accuracy has been evaluated by comparing its performance against human listeners.

  • It can detect which regions of audio contain clear speech and applies NLP techniques to insert punctuation correctly in transcripts.

  • The model is scalable: transcripts can be extracted from an audio signal without manually chunking or batching the recording, which reduces the risk of missing audio.

  • The model has achieved higher accuracy across various datasets.

Whisper’s comparative results on different datasets show that it currently achieves a lower word error rate than wav2vec 2.0.

[Figure: word error rate comparison between Whisper and wav2vec 2.0 across datasets]

The model was not evaluated on the TIMIT dataset, so to check its word error rate there, we will demonstrate how to run Whisper on TIMIT ourselves; in other words, we will build our own speech recognition application using Whisper.

Using Whisper Model for Speech Recognition

The TIMIT read speech corpus is a collection of speech data specifically designed for acoustic-phonetic research and for the development and evaluation of automatic speech recognition systems. It includes recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences, together with time-aligned orthographic, phonetic, and word transcriptions and a 16-bit, 16 kHz speech waveform file for each utterance. The TIMIT corpus was jointly developed by MIT, SRI International, and Texas Instruments. Its transcriptions have been manually verified, and designated test and training subsets balance phonetic and dialect coverage.

Installation:

!pip install git+https://github.com/openai/whisper.git
!pip install jiwer
!pip install datasets==1.18.3

The first command installs the Whisper model along with all of its dependencies. jiwer is the package we use to compute the word error rate, and datasets is a Hugging Face package that lets us download the TIMIT dataset.

Import Libraries

import whisper
from pytube import YouTube
from glob import glob
import os
import pandas as pd
from tqdm.notebook import tqdm

Load TIMIT Dataset

from datasets import load_dataset, load_metric
timit = load_dataset("timit_asr")
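The returned object can be inspected directly; the field names below are the ones exposed by the timit_asr loader in datasets 1.18.3:

print(timit)                  # DatasetDict with 'train' and 'test' splits
sample = timit['train'][0]
print(sample['file'])         # path to a 16-bit, 16 kHz wav file
print(sample['text'])         # its reference transcription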

Calculate Word Error Rate for Different Model Sizes

Considering the need to filter English and non-English data, we choose to use a multilingual model instead of one specifically designed for English.

Although the TIMIT dataset is entirely in English, we will still apply the same language detection step before recognition. Additionally, TIMIT already comes split into training and test sets, which we can use directly.

To use Whisper, we first need to understand the parameters, sizes, and speeds of different models.

[Table: parameter counts, sizes, and relative speeds of the available Whisper models]

Load Model

 model = whisper.load_model('tiny')

Here 'tiny' can be replaced with any of the model names mentioned above.
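If you are unsure which names are available, the package can list its checkpoints directly; the names in the comment are an example of what is typically returned and may differ between versions:

import whisper

print(whisper.available_models())
# e.g. ['tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large']
model = whisper.load_model('tiny')    # swap 'tiny' for any other listed name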

Define Language Detector Function

def lan_detector(audio_file):
    print('reading the audio file')
    audio = whisper.load_audio(audio_file)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    if max(probs, key=probs.get) == 'en':
        return True
    return False

Function to Convert Speech to Text

def speech2text(audio_file):
    text = model.transcribe(audio_file)
    return text["text"]

Running the functions above with the different model sizes, the word error rates obtained on the TIMIT training and test sets are as follows:

[Table: word error rates on the TIMIT training and test sets for each Whisper model size]
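A minimal sketch of how such numbers can be computed with jiwer is shown below; this is an assumed loop over the test split rather than the article's exact script, and it can take a long time with the larger models:

import jiwer
from tqdm.notebook import tqdm

references, hypotheses = [], []
for sample in tqdm(timit['test']):
    references.append(sample['text'])               # ground-truth transcription
    hypotheses.append(speech2text(sample['file']))  # Whisper's transcription of the wav file

print('word error rate:', jiwer.wer(references, hypotheses))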

Transcribing Speech from YouTube

Compared to other speech recognition models, Whisper not only recognizes speech but also interprets the intonation of a person’s voice and inserts appropriate punctuation. We will test this using videos from YouTube.

Here, we need a package called pytube, which can easily help us download and extract audio.

def youtube_audio(link):
    youtube_1 = YouTube(link)
    videos = youtube_1.streams.filter(only_audio=True)
    name = str(link.split('=')[-1])
    out_file = videos[0].download(name)
    link = name.split('=')[-1]
    new_filename = link + ".wav"
    print(new_filename)
    os.rename(out_file, new_filename)
    print(name)
    return new_filename, link

Once we have the wav file, we can apply the above functions to extract text from it.
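Putting the helper functions together might look like this; the URL is a placeholder rather than a real video:

wav_file, video_id = youtube_audio("https://www.youtube.com/watch?v=VIDEO_ID")
if lan_detector(wav_file):           # optional check that the audio is in English
    print(speech2text(wav_file))     # transcript with punctuation inserted by Whisper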

Conclusion

The code from this article is available here: https://drive.google.com/file/d/1FejhGseX_S1Ig_Y5nIPn1OcHN8DLFGIO/view. There is much more that can be done with Whisper, and you can experiment further starting from the code provided in this article.

Editor: Yu Tengkai

Proofreader: Lin Yilin
