An In-Depth Analysis of Baidu's Speech Recognition and Wake-Up Technology

As artificial intelligence becomes widespread, speech has become an important mode of interaction. Baidu’s speech recognition and wake-up technology, in particular, has attracted wide attention from developers since its launch.

On August 6, Baidu Developer Center and InfoQ jointly held the 65th salon, “Analysis and Practice of Baidu Speech Recognition and Wake-Up Technology.” Senior product manager He Dang shared the latest developments and solutions in Baidu’s speech technology. Senior R&D engineers Wei Likai and Tang Liliang from Baidu’s speech open platform then covered the details of Baidu’s speech recognition and wake-up technology, along with specific practices. The event closed with a demo session for interacting with developers.

Reply with the keyword ‘Salon’ to the WeChat account to get the download link for the three keynote slide decks.

The Latest Developments and Solutions of Baidu’s Speech Open Technology

He Dang began with an overview of the current state of Baidu’s speech technology, focusing on speech recognition and speech synthesis, and later demonstrated the latest results. On speech recognition, he said that machine recognition has surpassed human capabilities. On speech synthesis, the main focus is emotional speech synthesis, which uses big-data splicing techniques to add emotion to the voice, so that both the recorded voice and the final synthesized speech carry emotion.


He Dang also outlined the plan for opening up more of Baidu’s speech platform:

The first is far-field recognition, which is planned to open by the end of this year. It will be free of charge, so that developers can build hands-free applications.

The second is emotional speech synthesis, which adds emotion to the voice to replace the original mechanical sound and bring it closer to a real human voice. It is also expected to open by the end of the year.

The third is Deep Speech, which MIT Technology Review recognized as one of its ten breakthrough technologies of 2016. Further improvements and optimizations are planned by the end of this year, and an updated version of Deep Speech will be released on the speech platform.

Personalized Speech Recognition – Offline Command Word Recognition and Custom Semantics

Wei Likai, a senior R&D engineer on Baidu’s speech open platform, is currently responsible for online and offline speech, integrated wake-up, and custom semantics. His talk covered four parts:

Online Customization
Offline Customization
Custom Semantics
Grammar Editor

Online customization lets developers list uncommon, hard-to-recognize, or otherwise desired content in a text file known as a hotword list, so that the content in the list is recognized precisely. With online customization, every developer, every application, and every device can have its own recognition strategy.

Offline customization provides command word recognition, which allows high-accuracy speech recognition even with a poor network or no network at all, for example in a car.

Custom semantics lets developers define the verticals (domains) they need. This capability is initially offered on top of the offline capability, so the defined verticals can be used offline.
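To make the hotword idea concrete, here is a minimal sketch of one way a hotword list could bias recognition: rescoring an n-best list of hypotheses. This is an illustration only, not Baidu's API or method. The function names, the one-entry-per-line file format, the `boost` value, and the example scores are all assumptions.

```python
def load_hotwords(text):
    """Parse a hotword list: one entry per line, blank lines ignored (assumed format)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rescore(hypotheses, hotwords, boost=2.0):
    """Pick the best hypothesis after adding `boost` for each hotword it contains.

    `hypotheses` is a list of (text, score) pairs where a higher score is better.
    """
    def boosted(item):
        text, score = item
        hits = sum(1 for word in hotwords if word in text)
        return score + boost * hits

    return max(hypotheses, key=boosted)[0]


# Hypothetical hotword list and recognizer output.
hotwords = load_hotwords("InfoQ\nDeep Speech\n")
n_best = [("read the info queue news", -4.1), ("read the InfoQ news", -5.0)]
best = rescore(n_best, hotwords)
print(best)
```

Without the hotword list, the first hypothesis would win on raw score. With it, the hypothesis containing the listed term is preferred.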

Each of the three newly opened functions solves a different problem: online customization addresses inaccurate online recognition, offline customization addresses the inability to recognize speech without a network, and custom semantics addresses spoken content that is not parsed or is parsed into the wrong domain.

Finally, Wei Likai introduced a grammar editor built for these new functions, which makes them easier for developers to use.

Analysis and Practice of Baidu’s Speech Wake-Up Technology

What are the core technologies of Baidu’s speech wake-up, and what are their principles and implementation methods?

Tang Liliang explained that there are three common approaches to speech wake-up: confidence-based systems, recognition-based systems, and systems based on garbage word networks. Baidu’s speech wake-up technology combines the strengths of all three. It uses garbage phonemes and statistical models to represent all pronunciations, followed by a confidence system that greatly reduces the false positive rate.
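One common way to combine a garbage model with a confidence check is a log-likelihood ratio: compare how well the wake-up word model explains the audio against how well the garbage model does. The sketch below assumes per-frame log-likelihoods are already available from both models. The function names, the threshold, and the numbers are hypothetical, and the talk did not state that Baidu computes confidence this way.

```python
def wake_confidence(keyword_logliks, garbage_logliks):
    """Average per-frame log-likelihood ratio of the keyword model vs. the garbage model."""
    if len(keyword_logliks) != len(garbage_logliks) or not keyword_logliks:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(k - g for k, g in zip(keyword_logliks, garbage_logliks))
    return total / len(keyword_logliks)


def should_wake(keyword_logliks, garbage_logliks, threshold=0.5):
    """Wake up only if the keyword model beats the garbage model by the threshold."""
    return wake_confidence(keyword_logliks, garbage_logliks) >= threshold


# Hypothetical scores: the keyword model fits the first clip better than the garbage model.
wake_clip = should_wake([-1.0, -1.2, -0.8], [-2.0, -2.5, -1.5])
other_clip = should_wake([-3.0, -3.0], [-2.0, -2.5])
```

Raising the threshold trades missed wake-ups for fewer false positives, which is the balance a confidence stage is meant to control.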

Tang Liliang then walked through the Baidu speech wake-up process, which he illustrated with a diagram.


First, the user speaks. Endpoint detection then locates the segments that contain speech, and signal processing handles noise and similar issues. Next, acoustic features are extracted, recognition decoding is performed, and a confidence decision is made. Because this is a combined wake-up and recognition system, after a successful wake-up the audio is sent to the server for online decoding, which produces the final recognition result.
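The front of this pipeline can be sketched in a few lines. The example below uses a simple energy threshold for endpoint detection, which is a textbook technique and not necessarily what Baidu uses. The `score_fn` parameter is a hypothetical stand-in for signal processing, feature extraction, and decoding, and the frame length and thresholds are assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of the speech region, or None if silent."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


def wake_pipeline(samples, score_fn, threshold=0.5):
    """Endpoint detection, then scoring, then a confidence decision."""
    region = detect_endpoints(samples)
    if region is None:
        return False  # no speech found, nothing to decode
    start, end = region
    return score_fn(samples[start:end]) >= threshold


# Toy signal: silence, a short burst of "speech", silence.
samples = [0.0] * 8 + [0.5, -0.5, 0.4, -0.4] + [0.0] * 4
region = detect_endpoints(samples)
```

In a full wake-up and recognition system, a `True` result from `wake_pipeline` would be the point at which the audio is sent to the server for online decoding.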

Tang Liliang also discussed how to evaluate the quality of a wake-up system. The two most important indicators are wake-up accuracy (how often the wake-up word is correctly detected when it is spoken) and false positive rate (how often the system wakes up when it should not). A good wake-up system has high wake-up accuracy and a low false positive rate.
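As a rough illustration, both indicators can be computed from a labeled test set. The sketch below assumes each test clip is labeled with whether it contains the wake-up word and whether the system triggered, and it reports false positives per hour of non-wake audio, which is one common convention. The talk did not specify the exact definitions, so the function name and the data are hypothetical.

```python
def wake_metrics(results, negative_hours):
    """Compute (wake-up accuracy, false wake-ups per hour).

    `results` is a list of (is_wake_word, triggered) boolean pairs, one per clip.
    `negative_hours` is the total duration of the clips without the wake-up word.
    """
    positives = [triggered for is_wake, triggered in results if is_wake]
    false_alarms = sum(1 for is_wake, triggered in results if not is_wake and triggered)
    accuracy = sum(positives) / len(positives) if positives else 0.0
    return accuracy, false_alarms / negative_hours


# Hypothetical test run: four wake-word clips, two background clips (2 hours total).
results = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True),
]
accuracy, false_per_hour = wake_metrics(results, negative_hours=2.0)
```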

Tang Liliang then shared some application scenarios for Baidu’s speech wake-up: replacing common user operations in mobile apps, photography, robots, in-car use, smart homes, and smart hardware.

He also offered the following suggestions for choosing wake-up words:

Wake-up words can be customized according to the application’s personalized needs
Each wake-up word should be 3 to 5 Chinese characters long, with 4 characters being optimal
The syllables should cover as broad a range as possible, differ clearly from one another, and be sonorous
It is recommended to choose uncommon words
A wake-up word evaluation system is available to help you select suitable wake-up words
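Some of these suggestions can be checked mechanically. The sketch below is a toy heuristic, not the evaluation system mentioned above: it assumes the candidate word is supplied as one pinyin syllable per character and checks only length and repeated syllables. The function name and messages are hypothetical.

```python
def check_wake_word(syllables):
    """Return a list of notes for a wake-up word given as one syllable per character."""
    notes = []
    if not 3 <= len(syllables) <= 5:
        notes.append("length should be 3 to 5 characters")
    elif len(syllables) != 4:
        notes.append("4 characters is optimal")
    if len(set(syllables)) < len(syllables):
        notes.append("repeated syllables reduce coverage")
    return notes


ok = check_wake_word(["bai", "du", "yi", "xia"])
too_short = check_wake_word(["ni", "hao"])
repetitive = check_wake_word(["ha", "ha", "ha"])
```

An empty list means the candidate passes these two checks. It says nothing about loudness or how common the word is, which would need acoustic and usage data.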

Finally, Tang Liliang said that future plans include English wake-up, interrupt wake-up, wake-up on commonly used commands, and far-field wake-up. These technologies will be made available on the platform as soon as they are ready.

Baidu Technology Series Salon, More Than Just Practical Content!

A free opportunity to learn and exchange ideas on a Saturday is a rare chance!

Looking forward to the 66th Baidu Technology Salon, see you there! We are waiting for you!
