With the spread of artificial intelligence, speech has become an important mode of interaction. Baidu’s speech recognition and wake-up technology, in particular, has attracted widespread attention from developers since its launch.
On August 6, at the 65th salon, “Analysis and Practice of Baidu Speech Recognition and Wake-Up Technology,” jointly held by Baidu Developer Center and InfoQ, senior product manager He Dang shared the latest developments and solutions in Baidu’s speech technology. Senior R&D engineers Wei Likai and Tang Liliang from Baidu’s speech open platform then presented the details of Baidu’s speech recognition and wake-up technology, along with specific practices. The event closed with a demo session that gave developers a chance to interact with the speakers.
Wei Likai, a senior R&D engineer on Baidu’s speech open platform, is currently responsible for technologies such as online and offline speech, integrated wake-up, and custom semantics. His talk covered the following four parts:
- Online Customization
- Offline Customization
- Custom Semantics
- Grammar Editor
Online customization lets developers list uncommon, hard-to-recognize, or otherwise desired terms in a text file known as a hotword list, so that the content in the list is recognized precisely. With online customization, every developer, application, and machine can have its own recognition strategy. Offline customization provides command-word recognition, allowing high-accuracy speech recognition under poor or no network conditions, such as in cars. Custom semantics let developers define the verticals they want and use them offline; this technology was initially built on the offline capabilities.
Each of the three newly opened functions addresses a distinct problem: online customization addresses inaccurate online recognition, offline customization addresses the inability to recognize speech without a network, and custom semantics addresses spoken content that is not parsed or is parsed into the wrong domain.
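To make the hotword-list idea concrete, here is a minimal sketch of one way a per-application list could bias recognition toward listed terms. It is not Baidu’s implementation: the one-entry-per-line file format, the function names `load_hotwords` and `rescore`, and the fixed score bonus are all assumptions made for illustration.

```python
def load_hotwords(text):
    """Parse a hotword list, assuming one entry per line (blank lines ignored)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rescore(hypotheses, hotwords, bonus=1.0):
    """Return the hypothesis text with the best boosted score.

    hypotheses: list of (text, score) pairs, where a higher score is better.
    Each hotword found in a hypothesis adds `bonus` to its score
    (a hypothetical scheme, chosen only to show the idea).
    """
    def boosted(item):
        text, score = item
        return score + bonus * sum(1 for word in hotwords if word in text)

    return max(hypotheses, key=boosted)[0]


hotwords = load_hotwords("He Dang\n\nWei Likai\n")
candidates = [("call way lie kai", -1.0), ("call Wei Likai", -1.5)]
best = rescore(candidates, hotwords)
```

Because each application supplies its own list, the same audio can resolve differently per application, which matches the per-developer strategy described above.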
Finally, Wei Likai introduced a grammar editor built for these new functions, which makes them easier for developers to use.
Next, Tang Liliang walked through Baidu’s speech wake-up process using a diagram:
First, the user’s speech is captured, and endpoint detection finds the segments where someone is speaking. Signal processing then handles noise and other interference. Next, acoustic features are extracted, recognition decoding is performed, and a confidence decision is made. Because this is a combined wake-up and recognition system, after a successful wake-up the audio is sent to the server for online decoding, which yields the final recognition result.
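The stages above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the system described in the talk: endpoint detection is a simple energy threshold, signal processing is reduced to removing the mean offset, frame energies stand in for acoustic features, and `score_fn` is a hypothetical placeholder for the decoder that returns a confidence value.

```python
def frame_energy(samples, frame_len):
    """Split samples into non-overlapping frames and return the mean energy of each."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(energies, threshold):
    """Return (first, last) indices of frames above the threshold, or None if silent."""
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    return (active[0], active[-1]) if active else None


def wake_pipeline(samples, score_fn, frame_len=4,
                  energy_threshold=0.01, confidence_threshold=0.5):
    """Run the sketched stages and report what would happen next."""
    # 1. Endpoint detection: find the span where someone is speaking.
    span = detect_endpoints(frame_energy(samples, frame_len), energy_threshold)
    if span is None:
        return "no_speech"
    first, last = span
    speech = samples[first * frame_len:(last + 1) * frame_len]
    # 2. Signal processing (stand-in): remove the constant offset.
    mean = sum(speech) / len(speech)
    cleaned = [x - mean for x in speech]
    # 3. Feature extraction (stand-in): per-frame energy.
    features = frame_energy(cleaned, frame_len)
    # 4. Decoding and confidence decision via the placeholder scorer.
    confidence = score_fn(features)
    # 5. On a successful wake-up, the audio would go to the server.
    return "send_to_server" if confidence >= confidence_threshold else "rejected"


audio = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
result = wake_pipeline(audio, score_fn=lambda features: 0.9)
```

Keeping the confidence check on the device means only audio that passes the wake-up decision is sent for online decoding.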
Tang Liliang also explained how to evaluate the quality of wake-up technology. Two key indicators are wake-up accuracy and false positive rate: good wake-up technology has high wake-up accuracy and a low false positive rate.
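As a rough illustration of the two indicators, the sketch below computes them from labeled test trials. The definitions are assumptions, since the talk summary does not give formulas: wake-up accuracy is taken as the share of wake-word utterances that triggered, and false positive rate as the share of other utterances that triggered anyway. The function name `wake_metrics` is hypothetical.

```python
def wake_metrics(trials):
    """Compute (wake_up_accuracy, false_positive_rate).

    trials: list of (is_wake_word, triggered) boolean pairs, one per test utterance.
    """
    positives = [triggered for is_wake, triggered in trials if is_wake]
    negatives = [triggered for is_wake, triggered in trials if not is_wake]
    accuracy = sum(positives) / len(positives) if positives else 0.0
    false_positive_rate = sum(negatives) / len(negatives) if negatives else 0.0
    return accuracy, false_positive_rate


trials = [
    (True, True), (True, True), (True, False), (True, True),      # wake word spoken
    (False, False), (False, True), (False, False), (False, False) # other speech
]
accuracy, false_positive_rate = wake_metrics(trials)
```

In practice false triggers are also often reported per hour of audio rather than per utterance; that variant is not shown here.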
Tang Liliang then shared some application scenarios for Baidu’s speech wake-up, including replacing common user operations in mobile apps, photography, robots, in-car scenarios, smart homes, and smart hardware.
Regarding how to choose wake-up words, he also provided the following suggestions:
- Wake-up words can be customized according to the application’s personalized needs
- Each wake-up word should be between 3 and 5 Chinese characters long, with 4 characters being optimal
- Syllables should cover as broad a range as possible, differ clearly from one another, and sound clear and loud
- It is recommended to choose uncommon words
- A wake-up word evaluation system is available to help you choose your wake-up words sensibly
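Some of these suggestions can be checked mechanically. The sketch below is a hypothetical helper, not the evaluation system mentioned in the talk: it checks only the length rule and uses repeated characters as a crude stand-in for low syllable variety, because real syllable analysis would need a pronunciation dictionary.

```python
def is_cjk(ch):
    """True if ch falls in the basic CJK Unified Ideographs range."""
    return "\u4e00" <= ch <= "\u9fff"


def check_wake_word(word):
    """Return a list of notes about a candidate wake-up word (empty if none)."""
    notes = []
    if not all(is_cjk(ch) for ch in word):
        notes.append("contains non-Chinese characters")
    if not 3 <= len(word) <= 5:
        notes.append("length should be 3 to 5 characters")
    elif len(word) != 4:
        notes.append("4 characters is considered optimal")
    # Assumption: a repeated character implies a repeated syllable.
    if len(set(word)) < len(word):
        notes.append("repeated characters reduce syllable variety")
    return notes


notes_for_short_word = check_wake_word("你好")
```

Whether a word is uncommon enough, or how it performs acoustically, cannot be judged this way and would still need testing against real audio.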
Finally, Tang Liliang said that Baidu is considering developing technologies such as English wake-up, interrupt wake-up, wake-up by commonly used commands, and far-field wake-up. These technologies will be made available on the platform as soon as they are completed.
Baidu Technology Series Salon: more than just hands-on content!