Do you know Xiaoice? Have you called Duer? Have you interacted with Xiaona? Have you used Siri? If not, you are out of the loop. They are all currently popular intelligent voice robots that many people have chatted with.
What exactly is voice technology? What applications does it have? And what is a voice operating system?
Interacting with machines through pure voice information
Like image recognition and machine learning, intelligent voice is a branch of artificial intelligence. Amid the current wave of enthusiasm for artificial intelligence, from Siri to Duer and from Xiaoice to Xiaona, intelligent voice is integrating into people's lives.
Intelligent voice technology studies the theories and methods that allow humans and computers to communicate effectively through natural speech, covering speech recognition, content understanding, and dialogue question answering. Broadly speaking, intelligent voice refers to technology that uses computers to automatically process and recognize voice information.
Currently, intelligent voice technology is mainly applied in areas such as smart homes, virtual assistants, wearable devices, smart in-car systems, intelligent customer service, smart healthcare, and companion robots. A virtual assistant is an intelligent voice assistant whose core capability is that humans interact with the machine purely through voice, letting the intelligent "assistant" complete assigned tasks.
In Zhao Qingwei's view, a voice operating system is a bold idea: voice-based human-computer interaction has great development potential, which is why many internet companies are optimistic about this direction. Amazon, for example, has built an intelligent voice cloud platform, Alexa, which hosts a wide range of voice applications (some 80,000 skills). On this platform, users can issue commands by voice, such as shopping, searching, listening to music, or telling stories.
The Past and Present of Intelligent Voice Technology
In fact, research on intelligent voice technology dates back to the 1950s. In 1952, Bell Labs in the United States built "Audrey," a 6-foot-tall automatic digit recognition machine that could recognize the spoken digits 0 through 9 with an accuracy above 90%. Its accuracy was high for familiar voices but lower for strangers. In 1958, the acoustics research laboratory of the Institute of Electronics, Chinese Academy of Sciences used vacuum tubes to achieve recognition of 10 vowels. "Because computing power was so weak at the time, intelligent voice could only perform very simple recognition of letters or digits," Zhao Qingwei said.
From the 1960s to the early 1970s, speech recognition research made steady progress. "During this period, intelligent voice technology began to take on a systematic framework: feature extraction methods based on Linear Predictive Coding (LPC) were proposed, along with Dynamic Time Warping (DTW), and template matching was used for simple speech recognition (small vocabularies, speaker-dependent, isolated words)."
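The template-matching idea mentioned above can be made concrete with a small sketch of Dynamic Time Warping: a spoken utterance (represented here as a toy 1-D feature sequence) is aligned against a stored template, with the time axis allowed to stretch or compress. The function name and the toy sequences below are illustrative, not from the article.

```python
# Illustrative sketch of Dynamic Time Warping (DTW) for template matching.
# A stretched-in-time copy of a template aligns at low cost, while a
# different pattern costs more; early recognizers compared an utterance
# against one template per word and picked the cheapest alignment.

def dtw_distance(seq_a, seq_b):
    """Return the minimal cumulative alignment cost between two sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])  # local distance
            # allow diagonal (match), vertical, or horizontal steps
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# A template, a time-stretched version of it, and a different pattern.
template = [0.0, 1.0, 2.0, 1.0, 0.0]
stretched = [0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 0.0]
other = [2.0, 0.0, 2.0, 0.0, 2.0]

print(dtw_distance(template, stretched))  # → 0.0 (shapes match exactly)
print(dtw_distance(template, other))      # larger: shapes differ
```

Because the warping path may repeat frames, the stretched sequence aligns to the template at zero cost even though the two differ in length, which is exactly why DTW suited isolated-word, speaker-dependent recognition.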
From the mid-1970s to the 1980s, the speech recognition framework achieved a breakthrough: statistical models gradually replaced template matching, and the Hidden Markov Model (HMM) became the foundational model for speech recognition systems. At the same time, Gaussian Mixture Models were adopted as the main modeling method for acoustic models, bringing significant advances in connected-word recognition and medium-vocabulary continuous speech recognition.
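The statistical approach can be sketched with the forward algorithm for a tiny discrete HMM: it computes the probability of an observed acoustic sequence under the model, which is what lets a recognizer score competing word hypotheses. The two-state model and its probabilities below are toy values for illustration; real acoustic models of that era emitted continuous features through Gaussian mixtures rather than discrete symbols.

```python
# Minimal sketch of the HMM forward algorithm, the statistical framework
# that replaced template matching. Hidden states play the role of speech
# units; observations stand in for acoustic symbols. All numbers are toy
# values, not from any real system.

def forward(obs, start_p, trans_p, emit_p):
    """Probability of an observation sequence under a discrete HMM."""
    n_states = len(start_p)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [start_p[s] * emit_p[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[sp] * trans_p[sp][s] for sp in range(n_states)) * emit_p[s][o]
            for s in range(n_states)
        ]
    return sum(alpha)

# Two hidden states, two observable acoustic symbols (0 and 1).
start = [0.6, 0.4]
trans = [[0.7, 0.3],
         [0.4, 0.6]]
emit = [[0.9, 0.1],   # state 0 mostly emits symbol 0
        [0.2, 0.8]]   # state 1 mostly emits symbol 1

p = forward([0, 1, 0], start, trans, emit)
print(p)  # a probability strictly between 0 and 1
```

A recognizer of this era would run the forward computation for each candidate word's HMM and pick the word whose model assigns the observations the highest probability.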
From the 1990s to the early 21st century, speaker-independent, large-vocabulary continuous speech recognition became the mainstream research direction in the international speech community. In 1997, IBM launched its first dictation product, ViaVoice: users spoke into a microphone, and the system automatically recognized the speech and output it as text.
In 2002, the Institute of Automation, Chinese Academy of Sciences launched the "Tianyu" series of Chinese voice products, Pattek ASR. In 2005, the Institute of Acoustics, Chinese Academy of Sciences launched the first domestically developed telecom-grade speech recognition platform. It achieved the first large-scale deployment of domestic speech recognition software, in China Mobile's value-added services across 23 provinces, capturing 80% of the domestic market share and ending the monopoly of American companies in China's speech recognition market.
Deep Neural Network Framework Becomes Mainstream
In 2010, server computing power rose sharply (thanks to GPUs) and the amount of training speech data grew substantially (driven by mobile internet and cloud computing). Against this backdrop, Microsoft's research on speech recognition based on deep neural networks made significant progress, with "recognition error rates dropping by more than 20%." Since then, the modeling advantages of deep neural networks have been validated by many well-known speech research institutions internationally and domestically, and the industry has come to recognize that modeling frameworks based on deep neural networks clearly outperform the earlier framework. "Now essentially everyone adopts a modeling framework based on deep neural networks," said Zhao Qingwei.
In recent years, speech recognition based on deep neural networks has continued to iterate, evolving from basic deep neural networks to Time Delay Neural Networks (TDNN), Bidirectional Long Short-Term Memory networks (BLSTM), and Convolutional Neural Networks (CNN). End-to-end speech recognition architectures have been studied intensively by both academia and industry, and some systems have already gone online; the Institute of Acoustics has applied its latest research results to the customer service hotlines of China Mobile and China Telecom, directly serving hundreds of millions of customers.
According to reports, the Institute of Acoustics has long been committed to researching core speech recognition technologies. For real-time speech recognition, its researchers proposed a low-latency acoustic modeling technique based on a hybrid neural network (Time Delay Neural Networks combined with output-projected gated recurrent units), which can capture long-range information while keeping the network structure simple, computation fast, and training easy to parallelize. This model structure has been adopted by Kaldi, the mainstream international open-source speech recognition toolkit, as a new type of recurrent neural network structure. For non-real-time speech recognition, the institute proposed a deep neural network structure based on BLSTM-E (Bidirectional Long Short-Term Memory Extended), which improves on the performance of mainstream BLSTM and addresses the poor robustness of LSTMs (Long Short-Term Memory networks) to speech inputs of different lengths under sequence training.
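The "time delay" idea behind TDNN layers can be sketched in a few lines: each output frame is built from input frames at fixed offsets (for example {-1, 0, +1}), so stacking layers widens the temporal context the model sees without any recurrence. The splicing function, offsets, and edge-clamping behavior below are illustrative simplifications, not the institute's actual model.

```python
# Illustrative sketch of TDNN-style frame splicing: each output frame
# concatenates input frames at fixed time offsets, so stacked layers see
# progressively wider context. Offsets and clamping are toy choices.

def splice(frames, offsets):
    """For each frame t, concatenate frames[t + k] for k in offsets,
    clamping indices at the sequence edges."""
    n = len(frames)
    out = []
    for t in range(n):
        ctx = []
        for k in offsets:
            idx = min(max(t + k, 0), n - 1)  # clamp at the edges
            ctx.extend(frames[idx])
        out.append(ctx)
    return out

# Three 2-dimensional feature frames (e.g. toy acoustic features).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

layer1 = splice(frames, offsets=[-1, 0, 1])   # each frame sees +/-1 frame
layer2 = splice(layer1, offsets=[-2, 0, 2])   # effective context grows to +/-3
print(len(layer1[0]))  # → 6: three 2-dim frames concatenated
print(len(layer2[0]))  # → 18: three 6-dim spliced frames concatenated
```

In a real TDNN each spliced vector would then pass through a learned affine transform and nonlinearity; the sketch only shows why a few stacked time-delay layers capture long-range information while each layer's computation stays simple and parallelizable.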
Source: Science and Technology Daily