Ever since 1972, when a lonely programmer typed the first "hello world" into a computer, the dialogue between humans and machines has never stopped.
1. Can Humans Really Make Cold AI Understand Us?
In reality, we often find ourselves in situations like this:
A burly guy from Shandong wanted to place a voice call through his car's system, but within three minutes it had driven him crazy…
"Ba yao ba ba (8-1-8-8), I said ba yao ba ba! Are you deaf?" (In Chinese, the digit 1 is read aloud as "yao".)
For example, while you are locked in a heated battle in a game, your teammate's voice command comes through transcribed as something about "five piles of poop".
Excuse me?
Only after being beaten up do you realize the leader's actual words were "Go kill bosses 1 to 5".
Team wipe… and the little boat of friendship capsizes without warning.
One of the foundations of human-machine communication is voice recognition.
If people can misunderstand each other, how much more so between humans and machines?
You might think that WeChat’s voice chat and recognition experience are quite good. In fact, the technical team behind the scenes, the Voice Technology Team of WeChat’s Technical Architecture Department, spent a full four years teaching WeChat how to better understand human speech.
For artificial intelligence, the WeChat voice recognition team is like a professional and authoritative teacher, helping WeChat evolve from a “primary school level” at its launch to a smart student that can understand and communicate well: achieving an industry-leading voice recognition accuracy rate of over 95%, capable of understanding multiple languages including English, Mandarin, and Cantonese.
Let's set the complex technology aside for now and return to that funny "five piles of poop" incident—
2. Why Does Your Phone Often Fail to Understand What You Say?
1. You Are Not “Speaking Properly”
Don't get me wrong: this isn't about a speech impediment, poor literacy, or a heavy dialect. It refers to your speaking style.
For example, Apple's Siri is quite smart, right? When we talk to Siri, we subconsciously enunciate more clearly, so our speech comes out close to standard Mandarin, which greatly reduces the difficulty of recognition.
However, gaming and casual conversation are far more relaxed, and the speech is full of fast talking, accents, slurring, and repetition, such as "Hey, quick, I'm almost out of health, healer, come heal me!", which greatly hurts recognition rates.
Colloquial Chinese recognition has tech companies worldwide struggling. Outside of careful read-aloud speech, blunders like mistaking "Brian" for "orchitis" or "Joe Hisaishi" for "that's it" are common.
The WeChat voice recognition team explains that because of high variability, uneven audio quality, and fast speaking speed, even the best voice recognition systems still have an error rate of nearly 25% in such cases.
2. Noise and Distance Are Recognition “Killers”
Some might protest: "My Mandarin is certified first-class, and I speak clearly and accurately. Why does voice recognition still get me wrong?"
This depends on whether your speaking environment is noisy and whether you are too far from the microphone.
For instance, in a car, echo and outdoor noise can cause a dramatic drop in recognition accuracy. Likewise, the recognition commonly used on the mobile internet is close-range recognition: the microphone and the sound source sit close together. Even indoors, once that distance exceeds about one meter it counts as too far; the signal attenuates along the way and accuracy drops.
3. Artificial Intelligence Does Its Homework Slowly
How can we make artificial intelligence smarter? In a word: homework!
For voice recognition, letting machines "hear" more data makes them smarter. However, when training machines we must tell them what words each sentence contains (so-called supervised learning), and this labeled data accumulates very slowly.
Thus, an important direction for the technology is to let the teacher put down the whip: unsupervised or semi-supervised training that lets machines evolve and keep improving on their own.
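One common semi-supervised idea is self-training (pseudo-labeling): a model trained on a small labeled set guesses labels for unlabeled data, and only its confident guesses are fed back as new training examples. The toy "model", the pinyin/character pairs, and the confidence threshold below are all invented for illustration; this is a minimal sketch of the general technique, not WeChat's actual pipeline.

```python
# Self-training sketch: confident machine guesses become new "labeled" data,
# so the dataset grows without a human transcribing everything.

def train(labeled):
    """Toy 'model': remember the label seen for each feature value."""
    return {feat: lab for feat, lab in labeled}

def predict(model, feat):
    """Return (label, confidence). Known features are fully confident."""
    if feat in model:
        return model[feat], 1.0
    return "?", 0.0

def self_train(labeled, unlabeled, threshold=0.9):
    model = train(labeled)
    for feat in unlabeled:
        label, conf = predict(model, feat)
        if conf >= threshold:                    # keep only confident guesses
            labeled = labeled + [(feat, label)]  # grows with no human labeling
    return labeled

seed = [("ni hao", "你好"), ("zai jian", "再见")]   # tiny hand-labeled set
pool = ["ni hao", "xie xie", "zai jian"]           # unlabeled audio stand-ins
grown = self_train(seed, pool)                     # "xie xie" stays unlabeled
```

In a real system the model would be retrained between rounds; the single pass here just shows the labeling loop.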
4. Machines Are Not Smart Enough Yet
Once a program converts a segment of speech into text, it does not know whether the sentence is right or wrong, nor whether it is even coherent human language.
Moreover, in actual use, the speed, enunciation, frequency, and volume of speech vary among people, compounded by dialects and surrounding environments. In summary, achieving a certain recognition rate is relatively easy, but reaching a high standard of recognition accuracy is very difficult. This means that as recognition rates improve, the challenges increase.
However, since WeChat entered the voice recognition field, it has quickly risen to an industry-leading level in just a few years and continues to optimize and improve.
3. How Does WeChat “Listen to All Directions”?
Since we cannot control how users speak from all over the world, we can only teach WeChat how to “listen carefully”.
In 2012, the WeChat team quietly began researching voice systems.
However, the initial attempt was merely a "cautious" launch of a voice-reminder public account, which saw little use.
It wasn’t until 2013 that WeChat’s voice input achieved great success in the industry, and in 2014, the voice-to-text feature was officially launched.
Interestingly, this feature is deeply hidden by WeChat, yet the number of users continues to grow.
Have you noticed? Voice input is in the additional menu, while voice-to-text must be accessed by long-pressing the voice message.
The WeChat team explains that every interface and feature is launched with extreme "restraint": all designs follow actual user needs rather than showing off technology. Hiding the entry point deeper avoids disturbing users who do not need the feature.
4. WeChat Uses Deep Learning and Faces Challenges
Back to technology—
First, WeChat employs deep learning methods.
Simply put, the voice recognition system inputs speech and outputs Chinese characters; the machine must learn the mapping relationship from speech to language.
First, regarding speech, we need to teach WeChat how to listen. Human vocalization starts with vocal-cord vibration, shaped by the vocal tract, the mouth, and many muscle movements, much like a raw signal passing through a series of complex function transformations. Deep learning models, with their multi-layer structure, can effectively approximate such complex functions.
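The "stacked transformations" idea can be sketched with a toy forward pass: each layer mixes the incoming signal through weights and a non-linearity, and the final layer turns scores into probabilities over candidate characters. The layer sizes, random weights, and the four made-up "acoustic features" are purely illustrative assumptions; real acoustic models are vastly larger.

```python
import math
import random

random.seed(0)

def layer(vec, weights, bias):
    """One fully connected layer followed by a tanh non-linearity."""
    return [math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def softmax(vec):
    """Turn raw scores into a probability distribution over characters."""
    exps = [math.exp(v) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sizes: 4 acoustic features in, 3 hidden units, 2 candidate characters.
features = [0.2, -0.5, 0.1, 0.9]   # stand-in for spectral features of one audio frame
w1 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b1 = [0.0] * 3
w2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0] * 2

hidden = layer(features, w1, b1)        # each layer bends the signal further
probs = softmax(layer(hidden, w2, b2))  # probability of each candidate character
```

Training would adjust the weights so that the right character gets the highest probability; only the forward direction is shown here.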
Next, regarding language, we need to teach WeChat how to understand. Typically, what we say must follow syntax (combinability) and collocation habits (causality), and we need to help the machine learn these rules. One of the challenges is word meaning: for example, "know" and "understand" sound completely different, yet their meanings can at times be nearly the same.
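A classic way to encode collocation habits is a bigram language model: count which word pairs occur in real text, then prefer the candidate transcription whose adjacent pairs have been seen before. The corpus and the two sound-alike candidates below are made up for illustration; modern systems use far richer models, but the rescoring idea is the same.

```python
# Toy bigram language model: given two candidate transcriptions that
# sound alike, prefer the one whose word pairs occur in training text.

from collections import Counter

corpus = "go kill boss one | go kill boss five | heal me now".split(" | ")
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))   # count adjacent word pairs

def score(sentence):
    """Count how many familiar adjacent word pairs the sentence contains."""
    words = sentence.split()
    return sum(bigrams[pair] for pair in zip(words, words[1:]))

candidates = ["go kill boss five", "go kill piles five"]
best = max(candidates, key=score)           # "boss" collocates; "piles" does not
```

This is how a recognizer can reject "piles" even when the audio alone is ambiguous: the language model has simply never seen "kill piles".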
"Rscheearch swohs taht the oredr of ltteers deos not affcet raednig."
"For examlpe, you may olny now noitce taht the ltteers in these sentecnes are all jmbuled."
— Xiao Pai
As you can see, many times, we may not fully understand a sentence, but we can still grasp its meaning based on context and the way words are pronounced.
The machine’s deep learning approach mimics the neurons in the human brain; as more language is processed, this network can gradually understand language. Simply put, the voice recognition system is like a person learning a language; given the same level of intelligence, the more they hear (training data), the easier it is to achieve good results.
WeChat employs deep learning, and its vast user base supplies natural voice-interaction scenarios and a large accumulation of voice data, which is one of the key reasons its voice technology has developed so quickly.
At the same time, the dedicated technical team continues to tackle challenges.
Besides deep learning, what other efforts has WeChat made to improve voice recognition?
The WeChat voice recognition team has provided numerous examples. After careful consideration, Xiao Pai will share what he can understand…
For instance, to handle conversational styles (like phone calls), WeChat adopted a segmentation and sentence-break engine that combines audio attributes, speaker information, and some semantic cues to split speech into sentences effectively.

To overcome noise interference, WeChat uses algorithms that simulate real scenes, converting past noise-free data into noisy data covering many different environments, so the model learns to cope with interference while it learns content.

As for the challenges of big data: every user's voice is unique, and the long tail of a universal model is a major source of errors. WeChat takes a flexible approach, using algorithms to strip speaker information out during learning, which also improves recognition rates.
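The noise-augmentation step above can be sketched very simply: mix stored noise into a clean recording at varying levels, producing several "new" training utterances per clean one. The tiny number lists standing in for waveforms, the gain range, and the fake noise clips are all illustrative assumptions, not WeChat's actual pipeline.

```python
# Data-augmentation sketch: turn one clean utterance into several noisy
# variants so the model sees many simulated environments during training.

import random

random.seed(42)

def mix(clean, noise, gain):
    """Add scaled noise to a clean signal, sample by sample."""
    return [c + gain * n for c, n in zip(clean, noise)]

def augment(clean, noises, copies=3):
    """Produce several noisy variants of one clean utterance."""
    out = []
    for _ in range(copies):
        noise = random.choice(noises)        # pick a simulated environment
        gain = random.uniform(0.1, 0.5)      # vary the noise level per copy
        out.append(mix(clean, noise, gain))
    return out

clean = [0.0, 1.0, -1.0, 0.5]                             # tiny fake waveform
noises = [[0.3, -0.2, 0.1, 0.0], [-0.1, 0.4, 0.2, -0.3]]  # fake car / street noise
variants = augment(clean, noises)
```

Real systems also vary reverberation and recording channels, but the principle is the same: cheap synthetic variety in place of expensive new recordings.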
As WeChat’s voice recognition technology continues to evolve, higher recognition rates will provide users with a better experience in practical scenarios and could even fundamentally change how they interact with the app, significantly increasing their reliance on voice.
5. In the Future, WeChat Will Chat Directly with You
Once AI can truly understand us, how far off can real conversation be?
Just as humans have five senses, phones have corresponding capabilities like image recognition, voice recognition, NFC communication, etc. Especially since voice is an important entry point, applications like Apple’s Siri, Microsoft’s Cortana, and Google Now have emerged one after another.
Many people may not have noticed that at the end of last year, the WeChat team and the Hong Kong University of Science and Technology announced a joint artificial intelligence laboratory, primarily researching data mining, robotic dialogue, machine vision, and voice recognition. With a vast user base and natural voice-interaction scenarios, if an ever-smarter voice assistant becomes one of WeChat's entry points, WeChat's ecosystem will undoubtedly evolve further.
Smart homes, internet cars, smart healthcare, online education, automated telephone customer service, machine simultaneous interpretation, and other fields will be filled with voice interaction technology. Imagine, when you can not only chat and input via voice but also tell your alarm clock to wake you up 10 minutes later, search for a restaurant to eat with voice, or casually send a message or email while driving. Even your robotic assistant could completely understand every word you casually say and interact with you like a wise person; how exciting would that be?
All of this will happen in the future, perhaps very soon.
