Excerpt from Awni Hannun's blog
Translation by Machine Heart
Contributors: Nurhachu Null, Lu Xue
Since deep learning was applied to speech recognition, the word error rate has dropped significantly. However, speech recognition has still not reached human-level performance and faces several unresolved problems. This article discusses those open problems, including accents, noise, multiple speakers, context, and deployment.
After deep learning was applied to speech recognition, the word error rate dropped significantly. However, despite what you may have read, we still have not achieved human-level speech recognition. Speech recognizers have many failure modes, and recognizing these failures and taking steps to address them is the key to progress. It is the only way to take automatic speech recognition (ASR) from “working for some people, most of the time” to “working for everyone, all of the time.”
Improvement over time in word error rate on the Switchboard conversational speech recognition benchmark. The test set was collected in 2000 and consists of 40 phone conversations, each between two randomly selected native English speakers.
Claiming that conversational speech recognition has reached human-level performance based on Switchboard alone is like claiming that an autonomous car drives as well as a human after testing it in a single sunny, traffic-free town. The recent progress in conversational speech recognition is real, but the claim of human-level performance is far too broad. Here are some of the areas in speech recognition that still need improvement.
Accents and Noise
One of the most obvious shortcomings of speech recognition lies in its handling of accents and background noise. The most direct reason is that the vast majority of training data consists of American English with a high signal-to-noise ratio. For example, the Switchboard conversational speech training and test sets are recorded by native English speakers (mostly Americans) in almost noise-free environments.
However, more training data alone will not overcome this issue. Many languages have numerous dialects and accents, and it is impractical to collect enough labeled data for every case. Building a speech recognizer just for American-accented English already requires more than 5,000 hours of transcribed audio.
Comparison of word error rates between Baidu's Deep Speech 2 model and human transcribers on different types of speech data. Notice that the human transcribers do worse on non-American-accented speech, probably because most of them are American; transcribers native to a region should have a much lower error rate on that region's accents.
As for background noise, it is not unusual for the signal-to-noise ratio (SNR) inside a moving car to be as low as -5 dB. Humans have little trouble understanding one another in such environments, yet speech recognizers degrade much more rapidly with noise. The figure above shows that the gap between human and model word error rates widens dramatically going from the high-SNR to the low-SNR audio.
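To make the -5 dB figure concrete: the signal-to-noise ratio on a decibel scale is SNR(dB) = 10·log10(P_signal / P_noise), so -5 dB means the noise carries roughly three times the power of the speech. Below is a minimal Python sketch, assuming only NumPy and two equal-length waveforms, of how one might mix noise into clean speech at a chosen SNR when building a noise-robustness test set; the arrays here are random placeholders rather than real recordings.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that adding it to `speech` yields the target SNR.

    SNR(dB) = 10 * log10(P_speech / P_noise), with P the mean signal power.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power needed to hit the requested SNR.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Example: simulate a -5 dB "moving car" condition from placeholder signals.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)      # stand-in for 1 s of 16 kHz speech
car_noise = rng.standard_normal(16000)  # stand-in for recorded car noise
noisy = mix_at_snr(clean, car_noise, snr_db=-5.0)
```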
Semantic Errors
Usually, the word error rate is not the objective we actually care about in a speech recognition system. What we care about is the semantic error rate: the fraction of utterances whose meaning is misinterpreted.
For example, if someone says “let’s meet up Tuesday” and the speech recognizer outputs “let’s meet up today,” that is a semantic error. Conversely, a word error does not necessarily cause a semantic error: if the recognizer drops “up” and outputs “let’s meet Tuesday,” the meaning of the sentence is unchanged.
We must be careful when using the word error rate as a yardstick. A 5% word error rate corresponds to roughly one error every 20 words. If a sentence has 20 words (about average for English), the sentence error rate could be as high as 100%. We can only hope that the erroneous words do not change the meaning; otherwise, even at a 5% word error rate, the recognizer may still misinterpret every sentence.
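To see the arithmetic behind these examples, recall that word error rate is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The sketch below computes it with a standard dynamic program: both hypotheses from the example above score the same 25% WER, yet only one of them changes the meaning, which is exactly why WER alone cannot capture semantic errors.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

reference = "let's meet up Tuesday"
print(word_error_rate(reference, "let's meet up today"))  # 0.25, meaning changed
print(word_error_rate(reference, "let's meet Tuesday"))   # 0.25, meaning preserved
```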
When comparing models to humans, it is important to look at the nature of the errors rather than just the word error rate (WER) as a single summary number. In my experience, human transcribers make fewer errors than recognizers, and in particular far fewer serious semantic errors.
Researchers from Microsoft recently compared the errors made by humans and Microsoft’s human-level speech recognizer [3]. They found that one difference is that the model confuses “uh” and “uh huh” more frequently than humans. These two phrases have very different meanings: “uh” is merely a filler word, while “uh huh” indicates agreement and affirmation. Both humans and models make many similar errors.
Single-Channel, Multi-Speaker
The Switchboard conversational speech recognition task is relatively easy because each speaker uses a separate microphone for recording. There is no overlapping speech from multiple speakers in the same audio stream. However, humans can understand spoken content even when multiple speakers talk simultaneously.
A good conversational speech recognizer must be able to segment the audio based on who is speaking (speaker diarization). It should also be able to make sense of audio in which multiple speakers overlap (source separation). And it should be able to do this without requiring a microphone at each speaker's mouth, so that conversational speech recognition works wherever a conversation happens.
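As a rough illustration of the “who spoke when” half of the problem, here is a toy diarization sketch: it clusters short MFCC frames into two speakers with k-means and merges consecutive frames into labeled segments. It assumes the librosa and scikit-learn packages and a hypothetical conversation.wav recording; real diarization systems rely on learned speaker embeddings, handle an unknown number of speakers, and, unlike this sketch, must also cope with overlapping speech.

```python
import librosa
from sklearn.cluster import KMeans

# Toy diarization: cluster windows of MFCC features into two speakers.
audio, sr = librosa.load("conversation.wav", sr=16000)  # hypothetical two-speaker clip
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)  # shape: (20, n_frames)

# Assign each frame to one of two clusters, treated here as "speakers".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(mfcc.T)

# Merge consecutive frames with the same label into (start_s, end_s, speaker) segments.
hop_s = 512 / sr  # librosa's default hop length, in seconds
segments, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        segments.append((start * hop_s, i * hop_s, int(labels[start])))
        start = i
print(segments[:5])
```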
Domain Variation
Accents and background noise are just two of the factors a speech recognizer needs to be robust to. Several others include:
- Echo and reverberation in varying acoustic environments
- Hardware artifacts
- Artifacts from audio codecs and compression
- The sampling rate
- The speaker's age
Most people cannot even distinguish between mp3 and wav files. Before we claim that speech recognizers perform at human levels, they need to be sufficiently robust to these issues.
Context
You may have noticed that the human-level error rate on benchmarks like Switchboard is actually quite high. If a friend misinterpreted one of every 20 words you said, you would have a hard time holding a conversation.
The reason for this is that the evaluation is conducted without considering context. In real life, there are many other cues that help us understand what someone is saying. Humans use context that speech recognizers do not utilize, including:
- The history of the conversation and the topic being discussed.
- Visual cues while a person is speaking, such as facial expressions and lip movements.
- Knowledge of the person being spoken to.
Currently, Android’s speech recognizer knows your contacts, so it can accurately recognize your friends’ names. Voice search in mapping products uses your geographic location to narrow down the area you want to navigate to.
The accuracy of automatic speech recognition (ASR) systems has indeed improved with the help of signals like these. However, we have only scratched the surface of the kinds of context that can be used and how to use them.
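One common way to fold in signals like these is to bias the decoder toward in-context words, for example by rescoring an n-best list so that hypotheses containing a name from the user's contact list receive a small score bonus. The sketch below only illustrates the idea; the hypotheses, scores, contact names, and bias weight are all invented for the example.

```python
# Hedged sketch of contextual biasing via n-best rescoring. Everything below
# (contacts, hypotheses, scores, bonus) is made up for illustration.
CONTACTS = {"aditya", "marie", "xuan"}
BIAS_BONUS = 2.0  # how strongly context is allowed to shift the ranking

def rescore(hypotheses):
    """hypotheses: list of (transcript, log_score) pairs from a first pass."""
    rescored = []
    for text, score in hypotheses:
        if set(text.lower().split()) & CONTACTS:
            score += BIAS_BONUS  # reward transcripts that mention a known contact
        rescored.append((text, score))
    return max(rescored, key=lambda pair: pair[1])

n_best = [("call marty", -3.1), ("call marie", -3.4)]
print(rescore(n_best))  # the contact name "marie" wins after biasing
```

In practice this kind of biasing is often applied inside the beam search itself rather than as a separate pass, and the bonus has to be tuned so that context cannot override strong acoustic evidence.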
Deployment and Application
The latest advancements in conversational speech recognition are often not deployable. When considering what makes a new speech recognition algorithm deployable, it is helpful to measure its latency and computational requirements. These two are related; generally, if an algorithm requires more computational power, the latency will also increase. However, for simplicity, I will discuss them separately.
Latency: By “latency” I mean the time from when the user finishes speaking to when the transcription is complete. Low latency is a common product constraint in ASR and can significantly affect the user experience; latency requirements in the tens of milliseconds are not unusual. That may sound extreme, but remember that producing the transcription is usually only the first step in a series of expensive computations. In voice search, for example, the actual web-scale search has to happen after the speech recognition.
Bidirectional recurrent layers are a good example of a latency-killing improvement. All the recent state-of-the-art results in conversational speech recognition use them. The problem is that nothing after the first bidirectional layer can be computed until the user has finished speaking, so latency grows with the length of the utterance.
Left: with a forward-only recurrence, we can start computing the transcription immediately.
Right: with a bidirectional recurrence, we must wait for the entire utterance to arrive before the transcription can begin.
A good way to incorporate future information into speech recognition is still an open area of research.
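To make the latency trade-off concrete, here is a small PyTorch sketch, with layer sizes and chunk lengths chosen only for illustration: a forward-only LSTM can consume audio features chunk by chunk as they stream in, carrying its hidden state forward, whereas a bidirectional LSTM cannot emit anything until the final frame has arrived.

```python
import torch
import torch.nn as nn

FEAT = 40  # e.g. 40-dimensional filterbank frames; all sizes here are illustrative

# Forward-only recurrence: process audio chunk by chunk as it streams in,
# carrying the hidden state, so transcription can begin immediately.
streaming_rnn = nn.LSTM(input_size=FEAT, hidden_size=256, batch_first=True)
state = None
for chunk in torch.randn(5, 1, 20, FEAT):     # five 20-frame chunks arriving over time
    out, state = streaming_rnn(chunk, state)  # outputs are available per chunk

# Bidirectional recurrence: the backward pass starts from the last frame, so
# nothing after this layer can be computed until the user stops speaking.
bidi_rnn = nn.LSTM(input_size=FEAT, hidden_size=256,
                   batch_first=True, bidirectional=True)
full_utterance = torch.randn(1, 100, FEAT)    # must wait for all 100 frames
out, _ = bidi_rnn(full_utterance)
```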
Computation: The computational power required to transcribe a phrase is an economic constraint. We must consider the cost-effectiveness of improving the accuracy of the speech recognizer. If an improvement does not meet the economic threshold, it cannot be deployed.
The Next Five Years
There are still many open challenges in the field of speech recognition, including:
- Expanding speech recognition to new domains, accents, and far-field, low signal-to-noise-ratio speech.
- Incorporating more context into the recognition process.
- Speaker diarization and source separation.
- Semantic error rates and new ways of evaluating recognizers.
- Ultra-low-latency and highly efficient inference.
I look forward to progress in the field of speech recognition in these areas over the next five years.
Original link: https://awni.github.io/speech-recognition/