Is 100% Accuracy in Speech Recognition Possible?


Illustration by Jay Bendt

Written by Wade Roush

Translated by Zhao Jianlin

Back in 2010, Matt Thompson predicted in a commentary for NPR that “in the near future, automatic speech transcription technology will become quick, user-friendly, and free.” He called that moment the “speech singularity,” a play on inventor Ray Kurzweil’s “singularity theory,” the idea that our consciousness might one day be uploaded to computers. Thompson also predicted that reliable automatic speech recognition (ASR) software would change the work of journalists, not to mention lawyers, salespeople, and people with hearing impairments; anyone who works with spoken and written language would be affected.

Thompson’s prediction excited me greatly, because I was eager for a technology that could free me from the tedious task of transcribing interviews. But although his own career in broadcasting has flourished (he currently serves as the director of NPR’s Investigative Reporting Unit, overseeing programs like “Reveal”), the “speech singularity” he predicted still seems a long way off.

Nevertheless, we have clearly made significant progress. Numerous startups, such as Otter, Temi, and Trint, now offer online services: users upload a digital audio file and receive a transcript within minutes. As an audio producer, I use these services almost every day. They keep getting faster and cheaper, which is encouraging.

Accuracy, however, is another matter. In 2016, a team at Microsoft Research announced that its trained machine learning algorithm had transcribed recordings from a standard corpus with 94% accuracy. In Microsoft’s tests the software performed nearly as well as professional transcribers, which led many media outlets to proclaim that speech recognition software had reached parity with humans.

In reality, the final 6% is the real challenge. The more painful lesson is that proofreading a transcript that is 94% accurate takes almost as long as transcribing the original recording by hand. Four years after that breakthrough, services like Temi still have not pushed their accuracy above 95%, and they reach even that level only on clear, unaccented audio.

Why does accuracy matter so much? For one thing, more and more audio producers are following web usability standards by publishing a text version alongside each podcast. But if the text contains an error every 20 words, no one will want to read it. And consider how much time could be saved if voice assistants like Alexa, Bixby, Cortana, Google Assistant, and Siri correctly recognized every question or command they received.
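To make the arithmetic concrete: accuracy figures like these are commonly derived from the word error rate (WER), the number of word substitutions, insertions, and deletions needed to turn the machine transcript into the correct one, divided by the length of the correct one. The sketch below is only an illustration under that assumption; the function name and the sample sentences are hypothetical, and it is not the scoring code used by Microsoft or by any of the services named here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: real benchmarks also normalize punctuation,
    numerals, and spelling variants before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # reference word was dropped
                curr[j - 1] + 1,                  # extra word was inserted
                prev[j - 1] + substitution_cost,  # words match or were swapped
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 20-word sentence with a single misheard word ("fox" -> "box").
reference = ("the quick brown fox jumps over the lazy dog and then "
             "runs back home before the sun goes down again")
hypothesis = reference.replace("fox", "box")

wer = word_error_rate(reference, hypothesis)
print(f"errors per word: {wer}, accuracy: {1 - wer:.0%}")
```

One wrong word in a 20-word sentence gives a WER of 0.05, which is the 95% accuracy that the best services reach only on clean audio.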

ASR software may never achieve 100% accuracy. After all, people do not always speak fluently, even in their native language, and language is full of homophones and homographs that can be resolved only from context. (One speech transcription service once rendered “iOS” as “Ayahuasca.”)

What I hope for is that these speech services can raise their accuracy by one or two percentage points. In machine learning, a key way to reduce an algorithm’s error rate is to give it more high-quality training data. Most transcription providers are therefore collecting more data, in ways that respect privacy. For instance, every time I correct a transcript produced by Trint or Sonix, I create a verified text aligned with the original recording, which can be used to improve the model. If that lowers the error rate in the future, I am more than happy to let these companies use the data.
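As a rough illustration of why a hand-corrected transcript is useful, the sketch below compares a machine transcript with the corrected version and pulls out the spans that changed. Each pair is a small verified example of what the software heard versus what was actually said. The function name and sample sentences are hypothetical, and this is an assumption about the general idea, not a description of how Trint, Sonix, or any other provider actually processes user corrections.

```python
from difflib import SequenceMatcher


def extract_corrections(machine_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """Return (machine_span, corrected_span) pairs where the two texts differ."""
    machine = machine_text.split()
    corrected = corrected_text.split()
    # Compare word lists rather than characters so each span is whole words.
    matcher = SequenceMatcher(a=machine, b=corrected, autojunk=False)

    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means a pure insertion or deletion.
            corrections.append((" ".join(machine[i1:i2]), " ".join(corrected[j1:j2])))
    return corrections


# Hypothetical sentences built around the mix-up mentioned earlier.
pairs = extract_corrections(
    "please update ayahuasca on my phone",
    "please update iOS on my phone",
)
print(pairs)
```

Pairs like these, tied back to the matching stretch of audio, are the kind of verified data that could help a model stop repeating the same mistake.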

Clearly, more training data is one path to the “speech singularity.” As we talk to machines more often, the volume of audio we produce will grow, and reliable speech transcription will stop being a luxury or a distant fantasy; it will inevitably arrive.

The June issue of Scientific American is now available
