Will Speech Recognition Accuracy Ever Reach 100%?

Illustration by Jay Bendt

Written by Wade Roush

Translated by Zhao Jianlin

Back in 2010, Matt Thompson predicted in a commentary for NPR that “in the near future, automatic speech transcription technology will become fast, easy to use, and free.” He called that moment the “speech singularity,” cleverly borrowing from inventor Ray Kurzweil’s “singularity theory,” the idea that our consciousness could one day be uploaded to computers. Thompson also predicted that reliable Automatic Speech Recognition (ASR) software would change the work of journalists, not to mention lawyers, salespeople, and people with hearing impairments; every profession that deals with spoken and written language would be affected.

Thompson’s prediction excited me; I was eager for a technology that could free me from the tedious task of organizing interview notes. Thompson has since gone on to an illustrious career in broadcasting (he is currently the director of NPR’s Investigative Reporting Unit, responsible for programs like Reveal), but the “speech singularity” he predicted still seems far off.

Nonetheless, we have clearly made significant progress. A plethora of startups, such as Otter, Temi, and Trint, have begun to offer online services. Users can upload digital audio files and receive transcribed text within minutes. During my time as an audio producer, I used these services almost every day. The speed at which the software generates text is improving, and the costs are continually decreasing, which is indeed encouraging.

But the accuracy of the text is another matter. In 2016, a team at Microsoft Research announced that their machine learning algorithm achieved an impressive 94% accuracy in converting recordings from a standard corpus into text. In Microsoft’s testing, this software performed almost as well as professional transcribers, leading many media outlets to proclaim that the era of speech recognition software being “on par with humans” had arrived.

However, the last 6% of accuracy is where the real challenge lies. The more painful lesson is that proofreading a text with 94% accuracy takes almost as much time as transcribing the original recording by hand. Four years after this breakthrough, services like Temi have still not managed to push accuracy above 95%, and even that figure holds only for clear recordings of speakers without strong accents.

Why is accuracy so important? For example, more and more audio producers follow web usability guidelines by providing a text version when releasing podcasts. However, if the text contains an error every 20 words, no one will want to read it. Consider how much time could be saved if voice assistants like Alexa, Bixby, Cortana, Google Assistant, and Siri could correctly recognize every question or command they receive.
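To make the arithmetic concrete: accuracy figures like these are commonly derived from the word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the machine's transcript into the correct one, divided by the number of words in the correct one. The Python sketch below is only an illustration under that assumption, not the scoring procedure used by Microsoft or by any of the services mentioned here. The function name `word_error_rate` and the sample sentence are invented for the example, and treating accuracy as simply 1 minus WER is a simplification.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 20-word sentence with a single misheard word.
reference = ("please update the phone to the latest version of ios "
             "before you install the new app from the store today")
hypothesis = reference.replace("ios", "ayahuasca")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2%}, accuracy: {1 - wer:.2%}")
```

One wrong word in a 20-word sentence gives a WER of 5%, which is exactly the "95% accurate" transcript with an error every 20 words.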

ASR software may never achieve 100% accuracy. After all, people do not always speak fluently, even in their native language, and language is full of homophones and sound-alike words that can be told apart only from context. (Speech transcription services have been known to confuse “iOS” with “ayahuasca.”)

What I hope for is that these speech services can improve their accuracy by another one or two percentage points. In machine learning, a crucial way to reduce an algorithm's error rate is to give it more high-quality training data, so most transcription service providers are collecting more data, in ways that respect users' privacy. For example, every time I correct a transcript produced by Trint or Sonix, I create a verified text aligned with the original recording, which can be used to improve the underlying models. If this leads to lower error rates in the future, I am happy to let these companies use that data.

Clearly, more training data is one of the paths to the “speech singularity.” As we hold more and more conversations with machines, the amount of audio we produce will grow as well, and reliable speech transcription will no longer be an extravagant fantasy or a distant goal; it will inevitably be realized.

The June issue of Global Science is now available
