Since 1972, when a lone programmer typed the first “hello world” into a computer, the conversation between humans and artificial intelligence has never ceased.
Can humans really teach cold AI (artificial intelligence) to understand our words and calls?
We have always yearned and imagined, and technology is gradually breaking the boundaries between science fiction and reality.
In Spielberg’s “Artificial Intelligence”, the robot boy David is adopted by a human mother; through daily interactions with humans, his tender voice gradually acquires warmth, courage, and love.
In “Resident Evil”, the top AI system “Red Queen” appears as the hologram of an innocent little girl, yet her cold, cruel exchanges with the protagonist Alice are chilling.
In “The Three-Body Problem”, the sophon’s gentle tone during a profound “tea ceremony conversation” with the two Swordholders directly influenced the fate of human society.
However, in reality, we often experience this –
A Shandong man attempted to make a call using his car’s voice command system and was driven mad by the system in just three minutes…
“Babaoyababa (8188)! I meant Babaoyababa! Are you deaf?!”
For example, mid-game, while fiercely battling the enemy, a teammate’s voice command arrives – transcribed by the system into gibberish:
Excuse me?
After getting beaten up, you learn that the leader’s original command was “go kill bosses 1 to 5”.
Team wipe… the friendship boat capsized.
One of the foundations of human-machine communication is voice recognition.
Misunderstandings occur even in conversations between people – how much more so between humans and machines?
You might think that WeChat’s voice chat and recognition experience are quite good. In fact, the technical team behind the scenes, the Voice Technology Group of WeChat’s Technical Architecture Department, spent four full years teaching WeChat how to better understand human speech.
For artificial intelligence, the WeChat voice recognition team is like a professional, authoritative teacher, helping WeChat grow from the “primary school level” of its launch into a sharp-eared, articulate star student: its voice recognition accuracy now exceeds 95%, and it understands multiple languages, including English, Mandarin, and Cantonese.
Let’s set the complex technology aside for now and return to humorous mix-ups like the “five poo” incident above –
Why does your phone often fail to understand what you say?
Don’t misunderstand – this isn’t about whether you have a speech impediment, mumble, or speak a heavy dialect, but about the tone and register you speak in.
For example, Apple’s Siri is quite smart, right? When we talk to Siri, we usually and unconsciously adopt a more formal tone; our pronunciation is then close to standard, which greatly reduces the difficulty of recognition.
During gaming or casual conversation, however, the setting is more relaxed: fast speech, accents, slurring, and words running together become very common – “Hey, hurry up, I’m almost out of health, healer come heal me!” – which drags recognition rates down sharply.
The challenge of recognizing colloquial Chinese troubles technology companies worldwide. Once speech drifts away from that careful “reading aloud” register, mishearings like “Brian” becoming “testicular inflammation” or “Joe Hisaishi” becoming “that’s it” are not uncommon.
The WeChat voice recognition team explains that because conversational speech is so variable – audio quality is inconsistent and speaking speed is fast – even the best current voice recognition systems can post error rates of nearly 25% under these circumstances.
Some may protest: “My Mandarin is first-class, and I speak clearly and accurately – why does voice recognition still get me wrong?”
That depends on how noisy the environment you are speaking in is, and how far you are from the microphone.
For instance, in a car, echo and road noise can drastically reduce performance; likewise, indoors at distances beyond one meter, the signal attenuates in transmission and performance drops.
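To get a feel for why distance matters, here is a rough back-of-the-envelope sketch (our own illustration, not the team’s figures), assuming simple free-field inverse-square attenuation and a fixed noise floor:

    import math

    def snr_db(signal_power, noise_power):
        # Signal-to-noise ratio in decibels
        return 10 * math.log10(signal_power / noise_power)

    speech_at_1m = 1.0    # normalized speech power measured at 1 meter
    ambient_noise = 0.01  # constant background noise power

    for d in (0.3, 1.0, 2.0, 4.0):
        received = speech_at_1m / d ** 2  # inverse-square attenuation
        print(f"{d:4.1f} m -> SNR {snr_db(received, ambient_noise):5.1f} dB")

Under these toy assumptions, every doubling of distance costs about 6 dB of signal-to-noise ratio before the recognizer hears a single word – and real rooms add reverberation on top.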
How can we make artificial intelligence more understanding? Two words: do homework!
For voice recognition, letting machines “hear” more data makes them smarter. But when training machines, we must tell them which words were spoken (so-called supervised learning), and such labeled data accumulates slowly.
Thus, how to let the teacher supervise without constantly wielding the whip – achieving unsupervised or semi-supervised training, so machines can evolve on their own and keep improving – will be an important direction for the technology.
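As a concrete illustration, here is a minimal sketch of one common semi-supervised recipe, pseudo-labeling. The article does not say which method WeChat actually uses, and asr_model.transcribe returning a transcript plus a confidence score is a hypothetical API:

    def pseudo_label(asr_model, unlabeled_audio, threshold=0.9):
        # Turn confident machine transcripts into extra training pairs,
        # so the human "teacher" rarely has to crack the whip.
        extra_data = []
        for audio in unlabeled_audio:
            text, confidence = asr_model.transcribe(audio)  # hypothetical API
            if confidence >= threshold:
                extra_data.append((audio, text))
        return extra_data

The model is then retrained on the labeled data plus extra_data, and the loop repeats: as recognition improves, more of the machine’s own transcripts become usable labels.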
When a program converts speech into text, it doesn’t know which parts are correct or incorrect, nor does it understand whether the sentence is coherent human language.
In actual use, speaking speed, articulation, pitch, and volume vary from person to person, on top of differing dialects and surroundings. In short, reaching a passable recognition rate is relatively easy; reaching a high one is not – and the higher the rate, the harder each further gain becomes.
However, WeChat has reached a leading level in the industry in just a few years since entering the voice recognition field, and it continues to optimize and improve.
How does WeChat “hear all directions”?
Since we can’t control how users from all over speak, we need to teach WeChat how to “listen attentively”.
In 2012, the WeChat team quietly began researching voice systems.
At that time, however, the attempt was merely a “cautious” one: launching a voice-reminder public account with little functionality.
It wasn’t until 2013 that WeChat’s voice input made a real splash in the industry, followed by the official launch of the voice-to-text feature in 2014.
Interestingly, such a practical feature is deeply hidden in WeChat, yet the number of users continues to grow.
Have you noticed? Voice input is in the additional menu, while voice-to-text must be accessed by long-pressing the voice message.
The WeChat team explains that every interface and function in WeChat is extremely “restrained”, and all designs follow actual user needs rather than showcasing technology. Hiding the entrance deeper avoids disturbing users who do not need the feature.
WeChat adopts deep learning and rises to the challenge
Returning to technology –
First, WeChat adopts deep learning.
In simple terms, the input of the voice recognition system is sound and its output is Chinese characters; the machine needs to learn the mapping from sound to language.
First, let’s teach WeChat how to listen. Human vocalization starts with vibrating vocal cords; the sound then travels through the vocal tract and mouth, shaped by many muscle movements – much like a raw signal passing through a chain of complex function transformations. A deep learning model, with its multi-layered structure, can effectively approximate such a complex function.
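As a sketch of the idea – a toy model, not WeChat’s actual architecture – a few stacked layers can already map per-frame acoustic features to scores over Chinese characters. The feature and character counts below are illustrative assumptions:

    import torch
    import torch.nn as nn

    NUM_FEATURES = 40      # e.g. 40 filterbank energies per audio frame
    NUM_CHARACTERS = 6000  # rough size of a common Chinese character set

    # Each layer adds one more "bend" to the complex function that
    # turns vocal-tract acoustics into language.
    acoustic_model = nn.Sequential(
        nn.Linear(NUM_FEATURES, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, NUM_CHARACTERS),
    )

    frames = torch.randn(100, NUM_FEATURES)  # 100 frames of fake audio
    char_probs = acoustic_model(frames).softmax(dim=-1)
    print(char_probs.shape)  # torch.Size([100, 6000])

Real systems add recurrence or convolution to use context across frames, but the principle is the same: stacked layers approximating one big sound-to-language function.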
Next, we need to teach WeChat how to understand. Generally, the sentences we speak must conform to syntax (combinatorial) and collocation habits (causal), and we need to help the machine learn these rules. The difficulty lies in word meanings; for example, “know” and “understand” have completely different pronunciations but sometimes similar meanings.
“Research shows that the order of Chinese characters does not affect reading.”
“For example, after reading this sentence, you will find that all the characters are jumbled.”
You see, even when we can’t make out every word of a sentence, we can still grasp its meaning from context and from familiar patterns of speech.
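A toy example of those “collocation habits”: a language model can score word sequences by how often their pairings occur, so jumbled orderings score lower. The counts here are invented purely for illustration:

    # Toy bigram counts; a real system learns these from huge corpora.
    bigram_counts = {
        ("I", "love"): 50, ("love", "WeChat"): 30,
        ("WeChat", "love"): 1, ("love", "I"): 2,
    }

    def fluency(sentence):
        # Higher score = the word order looks more like real language.
        words = sentence.split()
        return sum(bigram_counts.get(pair, 0)
                   for pair in zip(words, words[1:]))

    print(fluency("I love WeChat"))  # 80: a plausible sentence
    print(fluency("WeChat love I"))  # 3:  jumbled order scores far lower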
The machine’s deep learning method mimics the neurons of the human brain: as it processes more and more language, the network gradually comes to understand it. Simply put, a voice recognition system is like a person learning a language – at the same level of intelligence, the more it hears (training data), the easier it is to achieve good recognition.
WeChat employs deep learning and enjoys a vast user base, natural voice interaction scenarios, and a rich accumulation of voice resources – important reasons for the rapid development of its voice interaction technology.
At the same time, the dedicated technical team continues to rise to the challenge.
In addition to deep learning, what other efforts has WeChat made to improve voice recognition?
The WeChat voice recognition team provided numerous examples, and after careful consideration, I chose the ones I could understand to discuss…
For example, to address performance in conversational settings (like phone calls), WeChat adopted a robust segmentation engine that combines audio attributes, speaker information, and some semantic information to split audio into sentences effectively.
To overcome noise interference, WeChat uses algorithms that simulate real-world scenes to convert its older noise-free data into data containing various types of noise, so the model learns the content while also adapting to different environmental interference.
As for the challenges of big data: every user’s voice is unique, and the long-tail problem of a universal model is a significant cause of errors, so WeChat applies a variety of algorithms to strip speaker-specific information out of the learning process, which has also helped improve recognition rates.
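Here is a minimal sketch of that noise-augmentation idea (our reconstruction – the article does not disclose the actual algorithm): mix recorded noise into clean speech at a chosen signal-to-noise ratio, producing a “noisy twin” of each training utterance:

    import numpy as np

    def add_noise(clean, noise, snr_db):
        # Mix `noise` into `clean` speech at the requested SNR (in dB).
        noise = np.resize(noise, clean.shape)  # repeat/trim noise to length
        clean_power = np.mean(clean ** 2)
        noise_power = np.mean(noise ** 2)
        # Scale the noise so the mixture hits exactly the target SNR.
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise

    rng = np.random.default_rng(0)
    clean_speech = rng.standard_normal(16000)  # 1 s of fake speech at 16 kHz
    street_noise = rng.standard_normal(16000)  # stand-in for recorded noise
    noisy_copy = add_noise(clean_speech, street_noise, snr_db=10)

Sweeping snr_db over, say, 0 to 20 dB gives the model everything from a quiet study to a busy street, without recording a single new utterance.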
As WeChat’s voice recognition technology continues to evolve, the higher the recognition rate, the better the user experience in practical scenarios, which may even fundamentally change how users interact with voice technology.
The future: WeChat will chat directly with you
Once AI can truly understand us, how far off can real conversation be?
The human-machine voice interaction scenes in those sci-fi movies are already becoming tangible.
Just as humans have five senses, smartphones have corresponding “senses” – image recognition, voice recognition, NFC communication, and so on. Voice is an especially important entry point: applications like Apple’s Siri, Microsoft’s Cortana, and Google Now are emerging one after another.
Many people may not have noticed that at the end of last year, the WeChat team and the Hong Kong University of Science and Technology announced the establishment of a joint AI laboratory, focusing on research in data mining, robot dialogue, machine vision, and voice recognition. With a large user base and natural voice interaction scenarios, if the increasingly intelligent voice assistant becomes one of WeChat’s entry points, WeChat’s ecosystem will further evolve.
Smart homes, internet cars, smart healthcare, online education, automated phone customer service, and machine simultaneous interpretation will all be filled with voice interaction technology.
Imagine being able not only to chat and type by voice, but also to tell your alarm to give you 10 more minutes, search for a restaurant by voice, or dictate a text or email while driving.
Better yet, your robotic assistant could understand everything you say offhand, engaging with you like a wise companion – how exciting would that be?
All of this will happen in the future, perhaps the near future.
This is the first issue of WeChat Team’s Technical Box, aiming to discuss the stories behind WeChat products in an engaging and light-hearted way.
If you are interested, please follow the “WeChat派” public account for more interesting stories behind WeChat’s cutting-edge technology.