Ever since 1972, when a lonely programmer typed the first “hello world” on a computer, the dialogue between humans and artificial intelligence has never stopped.
Can humans really teach a cold machine to understand what we say?
We have kept dreaming and imagining, while technology steadily blurs the line between science fiction and reality. In Spielberg’s “A.I. Artificial Intelligence”, the robot boy David is adopted by a human mother; through day-to-day contact with people, the tender-voiced child gradually learns warmth, courage, and love.
In ‘Resident Evil’, the advanced AI system “Red Queen” appears as a holographic image of an innocent little girl, but her coldness and cruelty during conversations with the protagonist Alice are chilling.
In ‘The Three-Body Problem’, the sophon speaks in a gentle tone, and the profound “tea ceremony conversation” with the two sword bearers directly influences the fate of human society.
1
However, in reality, we often encounter situations like this—
A man from Shandong tries to make a phone call using the car’s voice system, and within three minutes, he is driven mad by the system…
“Baba Yao Ba (8818)! I said Baba Yao Ba! Are you deaf?!”
Or, for example, you are leading your team through a fierce fight in a game when a voice-to-text command from the team leader pops up: “Go kill the five piles of poop!”
Excuse me?
Only after getting beaten up did I learn that the leader’s original command was “Go kill bosses 1 to 5”.
Team wipeout… the friendship boat capsized.
One of the foundations of human-machine communication is voice recognition.
Misunderstandings happen all the time even between humans, so how could things be any easier between humans and machines?
You might think that WeChat’s voice chat and voice recognition already work quite well. In fact, the team behind them, the voice technology group in WeChat’s technical architecture department, spent four full years “teaching” WeChat to understand human speech better.
For this artificial intelligence, the WeChat voice recognition team has acted as a professional, authoritative teacher, helping WeChat grow from its initial “elementary school” level into a sharp-eared, articulate top student: it now reaches an industry-leading recognition accuracy of over 95% and understands English, Mandarin, Cantonese, and several other languages.
Without diving into the complicated technology just yet, let’s go back to that “five piles of poop” incident—
2
Why does your phone often fail to understand what you say?
01
You didn’t “speak properly”
Don’t get me wrong: this isn’t about whether you have a speech impediment, can’t pronounce characters properly, or speak a heavy dialect; it’s about the way you talk.
For example, Apple’s Siri is fairly smart, right? When we talk to Siri, we subconsciously switch to a clearer, more deliberate way of speaking. Our pronunciation comes out close to standard, which greatly reduces the difficulty of recognition.
In games and casual conversation, however, the relaxed setting means fast talking, accents, slurring, and repetition happen all the time. For example: “Hey, hurry, I’m almost out of health, healer, come heal me!” All of this drags the recognition rate down.
Colloquial Chinese is a headache for tech companies everywhere. Once we move away from careful, read-aloud speech, mishearings like “Brian” turning into “testicular inflammation” or “Joe Hisaishi” turning into “that’s it” are common.
The WeChat voice recognition team explains that because such speech is highly spontaneous, the audio quality varies, and people talk fast, even the best current voice recognition systems can still have an error rate of around 25%.
02
Noise and distance are recognition “killers”
Some will protest: my Mandarin is certified top-grade and I enunciate clearly, so why does voice recognition still get me wrong?
This depends on whether your speaking environment is noisy and how far you are from the microphone.
In a car, for example, echo and road noise can make performance drop sharply. Likewise, most mobile usage today involves near-field recognition, where the microphone sits close to the sound source; indoors, once the speaker is more than about a meter away, it effectively becomes far-field recognition: the signal degrades on its way to the microphone, and accuracy drops accordingly.
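To make the effect concrete, here is a minimal, illustrative sketch in Python with NumPy. The toy sine-wave “utterance”, the inverse-distance attenuation, and the noise level are all invented for illustration and have nothing to do with WeChat’s actual pipeline; the point is simply that the farther the source, the less signal survives relative to the same noise floor.

```python
# A toy illustration in NumPy: attenuate a stand-in "utterance" with distance,
# add background noise, and report the signal-to-noise ratio that reaches the
# recognizer. The sine wave and noise level are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
sr = 16000                                    # 16 kHz, a common sample rate for speech
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)    # stand-in for one second of clean speech

def received(signal, distance_m, noise_level):
    """Weaken the signal with distance and add a constant noise floor."""
    attenuated = signal / max(distance_m, 0.1)
    noise = noise_level * rng.standard_normal(len(signal))
    return attenuated, attenuated + noise

def snr_db(clean, noisy):
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

for d in (0.1, 1.0, 3.0):                     # phone at your mouth vs. across the room
    clean, mixed = received(speech, d, noise_level=0.05)
    print(f"distance {d} m -> SNR {snr_db(clean, mixed):.1f} dB")
```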
03
AI does its homework too slowly
How can we make AI more competent? Two words: do homework!
For voice recognition, the more data a machine gets to “hear”, the smarter it becomes over time. But when we train it, we also have to tell it which words are being spoken (so-called supervised learning), and that makes useful data accumulate slowly.
So an important direction for the technology is to stop making the teacher stand over the machine with a whip: unsupervised or semi-supervised training, as sketched below, would let the machine keep evolving and improving its performance largely on its own.
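As a rough illustration of the semi-supervised idea, here is a toy pseudo-labeling loop in Python. The Recognizer class, the clip names, and the 0.9 confidence threshold are invented stand-ins, not WeChat’s system; the point is only the shape of the loop: train on the small hand-labeled set, let the model transcribe unlabeled audio, keep only the transcripts it is confident about, and retrain on the enlarged set.

```python
# A toy pseudo-labeling loop. The Recognizer below is a stand-in that just
# memorizes transcripts; it is NOT WeChat's model.
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    audio: str              # placeholder for an audio clip
    text: Optional[str]     # transcript if a human labeled it, else None

class Recognizer:
    """Stand-in model: 'training' just memorizes the transcripts it has seen."""
    def __init__(self):
        self.memory = {}

    def train(self, examples):
        for ex in examples:
            self.memory[ex.audio] = ex.text

    def transcribe(self, audio):
        # Return (hypothesis, confidence); unseen audio gets a low-confidence guess.
        if audio in self.memory:
            return self.memory[audio], 0.99
        return "unknown", random.random() * 0.5

labeled = [Example("clip_a", "hello"), Example("clip_b", "send a message")]
unlabeled = [Example("clip_c", None), Example("clip_d", None), Example("clip_a", None)]

model = Recognizer()
model.train(labeled)                 # step 1: supervised training on hand-labeled data

for ex in unlabeled:                 # step 2: let the model label new audio itself
    hypothesis, confidence = model.transcribe(ex.audio)
    if confidence > 0.9:             # keep only transcripts the model is confident about
        labeled.append(Example(ex.audio, hypothesis))

model.train(labeled)                 # step 3: retrain on the enlarged training set
print(len(labeled), "examples after pseudo-labeling")
```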
04
The machines aren’t smart enough
When a program converts a piece of speech into text, it doesn’t know which parts are correct or incorrect, nor does it know whether the sentence is coherent human language.
In actual usage, people differ in speaking speed, enunciation, pitch, and volume, on top of dialects and background noise. In short, reaching a passable recognition rate is relatively easy, but pushing accuracy up to a high standard is not: the higher the recognition rate already is, the harder further improvement becomes.
However, after WeChat entered the voice recognition field, it quickly rose to a leading industry level within just a few years and continues to optimize and improve.
3
How does WeChat “hear everything”?
Since we can’t control how users from all corners of the world speak, we must teach WeChat how to “listen attentively”.
In 2012, the WeChat team quietly began researching voice systems.
At that time, however, the attempt went no further than “cautiously” launching a voice-reminder public account, which saw little real use.
It wasn’t until 2013 that WeChat’s voice input gained tremendous success in the industry, followed by the official launch of the voice-to-text function in 2014.
Interestingly, despite being such a practical feature, its access point is deeply hidden within WeChat, yet the user base continues to grow.
Have you noticed? Voice input is in an additional menu, while voice-to-text must be activated by long-pressing the voice message.
The WeChat team explains that every interface and function in WeChat is extremely “restrained”, and all designs follow actual user needs rather than showcasing technology. By hiding the access point deeper, it avoids disturbing users who do not need to use that feature.
4
WeChat adopts deep learning and faces challenges head-on
Returning to technology—
First, WeChat employs deep learning methods.
In simple terms, the input of the voice recognition system is speech, and the output is Chinese characters. The machine needs to learn the mapping relationship from speech to language.
First, the speech side: we need to teach WeChat how to listen. Human speech begins with the vibration of the vocal cords and is then shaped by the vocal tract, the mouth, and many muscle movements, as if an original signal were passed through a series of complex function transformations. A deep learning model, thanks to its many stacked layers, is well suited to approximating exactly this kind of complex function.
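As a sketch of why many layers help, here is a tiny feed-forward network in Python with NumPy. The feature size, the layer widths, and the six output “syllables” are made up for illustration, and the weights are random rather than trained; the point is only that stacking simple layers (a linear transform plus a nonlinearity) yields a function complex enough to map an acoustic frame to a distribution over speech sounds.

```python
# A tiny feed-forward network in NumPy. Feature size, layer widths, and the
# six output "syllables" are invented; the weights are random, not trained.
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b):
    """One layer: a linear transform followed by a nonlinearity."""
    return np.tanh(x @ w + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_features, hidden, n_syllables = 40, 128, 6       # e.g. 40 filterbank features per frame
w1, b1 = 0.1 * rng.standard_normal((n_features, hidden)), np.zeros(hidden)
w2, b2 = 0.1 * rng.standard_normal((hidden, hidden)), np.zeros(hidden)
w3, b3 = 0.1 * rng.standard_normal((hidden, n_syllables)), np.zeros(n_syllables)

frame = rng.standard_normal(n_features)            # stand-in for one short audio frame
h = layer(layer(frame, w1, b1), w2, b2)            # two stacked hidden layers
probs = softmax(h @ w3 + b3)                       # distribution over candidate syllables
print(probs.round(3))
```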
Next, the language side: we need to teach WeChat how to understand. Normally, what we say has to follow grammar (how words can be combined) and collocation habits (which words tend to appear together), and the machine has to learn these rules. The hard part is word meaning: “know” and “understand”, for example, sound completely different, yet in many sentences they mean nearly the same thing.
“Research shows that the order of Chinese characters does not affect reading.”
“For example, after reading this sentence, you will find that the characters are all jumbled up.”
—— Xiao Pai
As you can see, we often don’t catch every word of a sentence, yet we can still work out its meaning from the context and from the way the sounds combine into words.
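Here is a toy sketch of that “language” side: a tiny bigram model in Python, built from a made-up three-sentence corpus, that prefers whichever candidate transcription contains word pairs it has seen before. Real language models are vastly larger, but the scoring idea is the same, and it is exactly what should save you from “five piles of poop”.

```python
# A toy bigram language model built from a made-up three-sentence corpus.
from collections import defaultdict

corpus = [
    "go kill boss one to five",
    "come heal me please",
    "go kill the boss now",
]

bigram_counts = defaultdict(int)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for a, b in zip(words, words[1:]):
        bigram_counts[(a, b)] += 1

def score(sentence):
    """Count how many familiar adjacent word pairs the candidate contains."""
    words = ["<s>"] + sentence.split()
    return sum(bigram_counts[(a, b)] for a, b in zip(words, words[1:]))

candidates = ["go kill boss one to five", "go kill five piles of poop"]
print({c: score(c) for c in candidates})
print("chosen:", max(candidates, key=score))
```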
Deep learning mimics the neurons of the human brain: the more language data the network digests, the better it comes to understand language. Simply put, a voice recognition system is like a person learning a language; given the same “intelligence”, the more it hears (the more training data), the better its recognition results.
WeChat employs deep learning technology, and with a vast user base and natural voice interaction scenarios, it has accumulated a wealth of voice resources, which is one of the key reasons for the rapid development of WeChat’s voice interaction technology.
Meanwhile, the dedicated technical team continues to tackle challenges.
In addition to deep learning, what other efforts has WeChat made to enhance voice recognition?
The WeChat voice recognition team provided numerous examples, and after careful consideration, Xiao Pai chose to share those he could understand…
As WeChat’s voice recognition technology keeps advancing, higher accuracy translates directly into a better experience in real applications, and users come to rely on voice interaction more and more.
5
In the future, WeChat will chat directly with you
Once AI can truly understand us, how far away can a real conversation be?
The human-machine voice interaction scenes depicted in those sci-fi movies are becoming tangible.
Just as humans have five senses, smartphones have their counterparts: image recognition, voice recognition, NFC communication, and so on. Voice in particular has become a crucial entry point, and assistants like Apple’s Siri, Microsoft’s Cortana, and Google Now are flooding the market.
Many people may not have noticed that at the end of last year, the WeChat team and the Hong Kong University of Science and Technology announced a joint AI laboratory, focusing on data mining, conversational robots, machine vision, voice recognition, and related directions. With WeChat’s huge user base and natural voice interaction scenarios, if an ever-smarter voice assistant were built in as one of WeChat’s entry points, the WeChat ecosystem would evolve even further.
Smart homes, connected cars, smart healthcare, online education, automated phone customer service, machine simultaneous interpretation: all of these fields will be permeated by voice interaction technology. Imagine not only chatting and typing by voice, but also telling your alarm to give you another ten minutes, searching for a restaurant by voice, or dashing off a message or email while driving. Your robot assistant might even understand every word you say in passing and talk with you like a wise human being. What an exciting prospect!
All of this is bound to happen in the future, perhaps in the near future.

WeChat Pai
WeChat ID: wx-pai
WeChat Pai is the official platform for the WeChat team’s entire product line, helping you understand the latest trends, coolest features, and exclusive insights from the WeChat team, as well as the authentic voices of product managers, programmers, and designers.