Baidu AI Series: Open Capabilities

One original article every week on 5G, IoT, and artificial intelligence. Follow 【Top Viewpoint】 to put spare moments to use for learning.

In the previous articles, we detailed Huawei’s AI capabilities and layout. Starting today, we will further explore Baidu’s AI capabilities and layout, one article per week. Everyone is welcome to join the discussion.

Introduction

Baidu’s “all in AI” strategy has been under way for four years and is carefully thought out. It focuses mainly on:

(1) Open capabilities: Technical capabilities, industry capabilities, scenario solutions

(2) Technical capabilities: Releasing general AI capabilities to the public, such as: voice, image, text recognition, face and body recognition, video, AR/VR, natural language processing, knowledge graphs, data intelligence

(3) Industry capabilities: Retail, education, parks, services, industry, government, hardware, agriculture, healthcare

(4) Practical solutions based on industry scenarios: For example, the “smart cabinet solution” for intelligent retail.

This layout stays very close to practical scenarios. Baidu has worked in the internet sector for many years and understands a wide range of industries, so its classification of technical capabilities comes from practical experience rather than from abstraction. Listing “data intelligence” as a separate category, in particular, shows that they are seasoned players. Because there is so much material, today we focus on one area of Baidu’s “technical capabilities”: voice technology.

Capability Classification

If a company has not truly practiced AI, it cannot accurately classify technical capabilities. Baidu’s classification is very close to practical scenarios: voice recognition, voice synthesis, voice wake-up, call center, intelligent hardware, voice translation.

Voice interaction is widely regarded as the next generation of human-computer interaction: people expect to control electronic devices with natural spoken commands. Major platform vendors are therefore investing heavily in intelligent voice technology, with iFlytek leading the field. Yet even within China, language poses many difficulties, since Mandarin, Cantonese, and the other dialects all have to be handled. “Languages of China,” a large national study published on January 18, 2018, by the Institute of Ethnology and Anthropology of the Chinese Academy of Social Sciences, is organized into seven sections (Overview, Sino-Tibetan languages, Altaic languages, Austronesian languages, Austroasiatic languages, Indo-European languages, and mixed languages) and covers 129 languages spoken within China. Of course, even casual conversations with friends reveal that China has far more than 129 dialects, with Hakka being one of the most complex. Fortunately, more and more people use Mandarin, which allows voice recognition technology to be applied quickly. Put simply, voice recognition converts speech into text for a computer to process.

Capability Analysis

1. Voice Recognition

Baidu provides short voice recognition, rapid short voice recognition, real-time voice recognition, audio file transcription, and a voice self-training platform.

(1) Short Voice Recognition: Accurately converts speech of under 60 seconds into text, suitable for mobile voice input, smart voice interaction, voice commands, voice search, and other short voice interaction scenarios.

Recognition rate 98%: Using an internationally leading streaming end-to-end method that models speech and language jointly, combined with Baidu’s natural language processing technology, near-field Mandarin recognition accuracy reaches 98%;

Dialect recognition: Supports Mandarin and slightly accented Chinese recognition; supports Cantonese and Sichuan dialect recognition; supports English recognition;

Semantic understanding: Supports semantic understanding across more than 50 fields, such as weather, traffic, entertainment, etc. It can also connect to the intelligent dialogue customization and service platform UNIT to customize semantic understanding and dialogue services, allowing for more accurate understanding of user intentions;

Punctuation segmentation: Uses large-scale datasets to train language models, intelligently matching appropriate punctuation marks (including , . ! ? ) based on the content and pauses in the speech, making the recognition results more comprehensible;

Smart numeric format conversion: Based on speech content understanding, it can correctly convert number sequences, decimals, time, fractions, and basic operators into numeric formats, making the recognized numeric results more intuitive and natural;

Exclusive model training: Supports training your own model on the voice self-training platform. Uploading vocabulary text is enough to complete training, with no coding required. This improves recognition of business-domain vocabulary by 5-25%, and the trained model is for your exclusive use;

(2) Rapid Short Voice Recognition: Quickly converts speech of up to 60 seconds into text, suitable for mobile voice input, voice search, human-computer dialogue, and other voice interaction scenarios.

The feature set is the same as in (1), so it is not repeated here. It covers fewer scenarios than (1) and prioritizes speed, whereas (1) prioritizes accuracy.
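Both short-voice interfaces share the 60-second limit, so it can help to check a clip's length before submitting it. Below is a minimal sketch using only Python's standard `wave` module. The function names are my own, the clip is assumed to be an uncompressed WAV file, and treating exactly 60 seconds as acceptable is an assumption on my part.

```python
import io
import wave

# Limit quoted above for the short-voice interfaces. Whether exactly
# 60 seconds is accepted is an assumption in this sketch.
MAX_SHORT_SECONDS = 60


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_short_recognition(source):
    """True if the clip is short enough for the short-voice interfaces."""
    return wav_duration_seconds(source) <= MAX_SHORT_SECONDS


def make_silent_wav(seconds, rate=16000):
    """Build an in-memory 16-bit mono WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer
```

For example, `fits_short_recognition(make_silent_wav(2))` is true, while a 61-second clip is rejected. Longer recordings would go to the real-time or file transcription interfaces described next.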

(3) Real-Time Voice Recognition: Based on Deep Peak2’s end-to-end modeling, it recognizes audio streams in real-time as text and returns the start and end time of each sentence, suitable for long sentence voice input, audio and video subtitles, conferences, etc.

Recognition rate 98%: Based on Deep Peak2 end-to-end modeling, trained on over 100,000 hours of data, multi-sampling rate multi-scenario acoustic modeling, the near-field Mandarin recognition accuracy reaches 98%;

Dialect recognition: Supports Mandarin and slightly accented Chinese recognition; supports Cantonese and Sichuan dialect recognition; supports English recognition;

Intelligent language recognition: Uses large-scale datasets to train language models, intelligently correcting intermediate recognition results and matching appropriate punctuation marks (including , . ! ?) based on the content and pauses in the speech;

Multiple interfaces: Supports WebSocket API, Android, iOS, Linux SDK, can be called on various operating systems and devices, quick to get started, simple to use;

Fast response speed: The first packet is returned within milliseconds, and intermediate text results are displayed in real time as the audio stream is recognized;

Timestamped: The recognized text results come with timestamps, showing the start and end times of VAD-segmented sentences, facilitating functional development;

Main application scenarios: Real-time voice input, live video subtitles, on-screen subtitles for speeches, real-time meeting records, classroom audio recognition.
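Since the real-time interface returns a start and end time for each sentence, producing subtitles is mostly a formatting exercise. The sketch below turns such sentences into SRT subtitle text. The `(start_ms, end_ms, text)` tuple layout is a hypothetical stand-in that I chose for illustration, not the actual response format of Baidu's API.

```python
def format_srt_time(ms):
    """Format a millisecond offset as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(sentences):
    """Turn (start_ms, end_ms, text) tuples into SRT subtitle text.

    The tuple layout is an assumption made for this sketch.
    """
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(sentences, start=1):
        time_line = f"{format_srt_time(start_ms)} --> {format_srt_time(end_ms)}"
        blocks.append(f"{index}\n{time_line}\n{text}\n")
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)
```

Calling `to_srt([(0, 1500, "Hello."), (1500, 3200, "Welcome.")])` yields two numbered subtitle entries, which is roughly what a live-subtitle or meeting-record feature would write out.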

(4) Far-field Voice Recognition: Through microphone array front-end processing algorithms, it can accurately recognize speech even when spoken from three to five meters away;

This capability mainly serves intelligent robots and smart home products, where speech has to be recognized from a distance.

(5) Call Center Audio File Transcription: Based on a dedicated call center model, it transcribes phone recordings sampled at 8 kHz into text in bulk and at low cost. It is suitable for phone content analysis and quality inspection scenarios;

(6) Voice Self-Training Platform: Uses text corpora from a business scenario to train language models without writing any code, effectively improving recognition accuracy in that business domain;

This platform clearly grew out of practical needs. Many industries have specialized vocabulary that a basic model cannot fully cover, and many enterprises have no technical staff who can retrain models; Baidu’s platform fills that gap. Of course, some large enterprises, or those with high security requirements, may choose private deployment instead of online training.

2. Voice Synthesis

Provides highly human-like, smooth, and natural voice synthesis services, supporting various online and offline invocation methods to meet voice broadcasting needs in scenarios such as general reading, order reporting, and smart hardware.

(1) Online Voice Synthesis

Provides a multi-scenario sound library: Offers a total of 9 sound libraries, including basic and premium sound libraries, suitable for general reading, order reporting, smart hardware, etc. More featured sound libraries will be launched soon;

Adjustable speech rate and pitch: Supports various parameter configurations, allowing flexible settings for speech rate, pitch, and volume based on scene requirements to meet personalized needs;

Support for polyphonic characters: Chinese polyphonic characters can be annotated with pinyin and tone to define pronunciation, such as “轻舟已过万重（chong2）山” and “脑筋急转（zhuan3）弯”;

Multiple invocation methods: Provides a REST API and an online SDK, covering mobile apps, web pages, mini-programs, hardware, and other scenarios, with a smooth and natural synthesis experience;
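The polyphonic-character annotation mentioned above is easy to picture as a small parsing step. The sketch below extracts the pinyin overrides from annotated text such as “重（chong2）”. It only illustrates the notation; how Baidu's service parses it internally is not known to me, and accepting both full-width and half-width parentheses is my own assumption.

```python
import re

# One character followed by a parenthesised pinyin syllable and a tone digit.
# Both full-width and half-width parentheses are accepted (an assumption).
ANNOTATION = re.compile(r"(.)[（(]([a-z]+)([1-5])[）)]")


def parse_annotations(text):
    """Return (plain_text, overrides).

    overrides maps the index of a character in plain_text to its
    (pinyin, tone) pronunciation override.
    """
    overrides = {}
    pieces = []
    plain_length = 0
    position = 0
    for match in ANNOTATION.finditer(text):
        # Copy the unannotated text before this match.
        before = text[position:match.start()]
        pieces.append(before)
        plain_length += len(before)
        char, pinyin, tone = match.groups()
        overrides[plain_length] = (pinyin, int(tone))
        pieces.append(char)
        plain_length += 1
        position = match.end()
    pieces.append(text[position:])
    return "".join(pieces), overrides
```

For the first example sentence above, this returns the plain text “轻舟已过万重山” together with an override for the character at index 5, read as chong with the second tone.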

(2) Offline Voice Synthesis: In environments without or with weak networks, it can perform voice broadcasting on mobile apps or intelligent hardware devices such as story machines and robots, converting text into sound, providing a stable, consistent, and smooth synthesis experience.

Device-side real-time offline synthesis: A real-time offline voice synthesis engine serves apps, children’s story machines, and smart hardware devices in environments without or with weak networks, providing a stable and consistent synthesis experience;

High-quality multi-scenario offline sound libraries: Provides a total of 14 high-smoothness offline sound libraries, including “Basic Sound Library” + “Premium Sound Library.” Supports mixed reading of Chinese and English, and supports adjustments of speech rate, pitch, and volume;

Supports multiple platforms and usage modes: Provides Android and iOS offline voice synthesis SDK; supports pure offline and online-offline hybrid modes, allowing for free combination based on application scenarios;

3. Voice Wake-up

Baidu’s voice wake-up technology pre-sets wake-up words in devices or software. When users issue the voice command, the device wakes from sleep and responds accordingly, greatly improving human-computer interaction efficiency.

Predefined wake-up words currently supported include:

Camera: Take a photo, cheese

Music: Increase volume, decrease volume, play, stop, pause, previous song, next song

Light: Turn on the light, turn off the light, increase brightness, decrease brightness
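The predefined wake-up words above amount to a small lookup table from phrase to device category. The sketch below shows only that table and a text lookup. Real wake-up detection runs on the audio signal itself, so this is an illustration of the mapping, not of Baidu's wake-up engine, and the function name is hypothetical.

```python
# Predefined wake-up words as listed in the article, grouped by category.
WAKE_WORDS = {
    "camera": ["take a photo", "cheese"],
    "music": [
        "increase volume", "decrease volume", "play", "stop",
        "pause", "previous song", "next song",
    ],
    "light": [
        "turn on the light", "turn off the light",
        "increase brightness", "decrease brightness",
    ],
}

# Reverse index (phrase -> category), built once for fast lookup.
PHRASE_TO_CATEGORY = {
    phrase: category
    for category, phrases in WAKE_WORDS.items()
    for phrase in phrases
}


def match_wake_word(transcript):
    """Return (category, phrase) if the whole utterance is a wake-up word, else None."""
    # Lower-case, collapse whitespace, and drop surrounding punctuation.
    normalized = " ".join(transcript.lower().split()).strip(" .,!?")
    category = PHRASE_TO_CATEGORY.get(normalized)
    return (category, normalized) if category else None
```

With this table, “Turn ON the light!” resolves to the light category, while a longer sentence such as “play some jazz” matches nothing, because only exact command phrases count.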

Baidu already supports rapid customization of wake-up words. In 2017, Baidu acquired KITT.AI, a well-known US wake-up technology company, whose Snowboy technology supports custom wake-up words and needs only three steps to train the machine on a new wake-up word. On September 19, 2019, Baidu Maps released a voice customization feature, internally codenamed “Bai Que Ling”: users record just 20 sentences in the Baidu Maps App, and a complete personal voice package is generated in about 20 minutes.

4. Voice Translation

Machine translation is a major field with a promising future. Language differences have always stood in the way of free communication between people. People keep looking for ways to close this gap, and machine translation already performs well.

General translation: Supports real-time mutual translation of 28 languages, covering Chinese, English, Japanese, Korean, Spanish, French, Thai, Arabic, Russian, Portuguese, German, Italian, Dutch, Finnish, Danish, etc.; also supports language detection for 28 languages.

Vertical domain translation: To improve the accuracy of machine translation in specific fields, Baidu Translation optimizes models for several vertical domains, so that domain terminology is translated more accurately than by the general translation API and sentence structures better match industry usage. Three vertical domains are currently available (electronics technology, water conservancy and machinery, and biomedicine), and more are being added.

Language recognition: The first batch supports language recognition for Chinese, English, Japanese, Korean, Thai, Vietnamese, and Russian, can recognize the language of given text, and return recognition results.

Photo translation: The photo translation SDK combines image recognition and text translation, easily achieving image-text translation without the hassle of secondary integration.

AI simultaneous interpretation: A high-quality, low-latency machine simultaneous interpretation service solution.

Personally, I think simultaneous interpretation is still better done by humans, especially in serious business and government settings, where there is very little tolerance for error.

5. Call Center Solutions

End-to-end voice technology solutions for call center scenarios, including dedicated 8 kHz voice recognition, voice synthesis, and MRCP services, helping enterprises add voice capabilities to their call centers quickly and efficiently.

Main application scenarios include: intelligent voice IVR, real-time voice quality inspection and reminders.

Intelligent voice IVR: In customer conversations with intelligent voice customer service (IVR), real-time voice recognition is used to accurately recognize customer speech as text, determine customer intent, and respond smoothly and naturally through voice synthesis, handling consultation, processing, and diverting customer service tasks.

Real-time voice quality inspection and reminders: During the communication process between the agent and the customer, real-time voice recognition is used to conduct quality inspection of the entire conversation in real-time, which can remind the agent of speaking points, improving customer service quality.
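The intelligent voice IVR flow described above (recognize the speech, determine the intent, reply through synthesis) can be sketched as a single conversation turn. Everything here is hypothetical: the keyword lists and replies are made up, and `recognize` and `synthesize` are placeholders for whatever recognition and synthesis services are actually called.

```python
# Hypothetical keyword lists; a real system would use a trained intent model.
INTENT_KEYWORDS = {
    "consult": ["balance", "opening hours"],
    "process": ["cancel", "change my address"],
}

# Hypothetical canned replies, one per intent.
REPLIES = {
    "consult": "Here is the information you asked for.",
    "process": "Your request has been submitted.",
    "transfer": "Transferring you to a human agent.",
}


def determine_intent(text):
    """Pick an intent by keyword match; unrecognized requests go to a human."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return intent
    return "transfer"


def handle_turn(audio, recognize, synthesize):
    """One IVR turn: speech -> text -> intent -> spoken reply.

    recognize and synthesize are caller-supplied functions standing in
    for the speech recognition and speech synthesis services.
    """
    text = recognize(audio)
    intent = determine_intent(text)
    return intent, synthesize(REPLIES[intent])
```

Passing the two services in as functions keeps the turn logic testable without any network calls, and the three intents mirror the consultation, processing, and diverting tasks named above.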

6. Honghu Voice Chip

Baidu’s Honghu voice chip is designed specifically for voice interactions in smart home, intelligent in-vehicle, and IoT scenarios, featuring ultra-low power consumption, far-field voice interaction capabilities, high-precision low-false-alarm voice wake-up, and offline voice recognition.

Real-time processing of far-field array signals: Supports voice input from microphone arrays of up to six channels, as well as traditional digital signal processing functions such as dual-channel stereo echo cancellation, sound source localization, and beamforming.

High-precision low-false-alarm voice wake-up: Based on Baidu’s leading Deep Peak and Deep CNN voice wake-up technology, it achieves high-precision wake-up in complex internal and external noise environments, with no more than one false alarm per day.

Offline voice recognition: Supports, by default, voice recognition for smart IoT scenarios in environments without a network, and command word recognition for in-vehicle scenarios.
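Beamforming, mentioned above for far-field array signals, is easiest to understand through its simplest textbook form, delay-and-sum: shift each microphone channel so the target speech lines up, then average. The sketch below assumes the per-microphone delays are already known whole numbers of samples. It is the generic technique only and says nothing about how the Honghu chip implements it.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay, then average the channels.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the target sound at each microphone,
              in whole samples (non-negative integers).
    """
    length = len(channels[0])
    usable = length - max(delays)  # samples for which every channel has data
    output = []
    for n in range(usable):
        # Reading channel i at n + delay_i undoes that microphone's delay.
        total = sum(channel[n + delay] for channel, delay in zip(channels, delays))
        output.append(total / len(channels))
    return output
```

After alignment the target speech adds up coherently across microphones, while noise arriving from other directions tends to average out, which is why a microphone array helps at three to five meters.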

7. Private Deployment

This section will be analyzed and introduced in detail in other chapters.

Conclusion

Viewed as an open platform, the classification is quite well organized, although it should also include “voiceprint recognition,” a biometric application. That said, voiceprint applications are not yet mature, and collecting voiceprints at scale touches on user privacy, which makes this a sensitive area at present.

Overall, in the voice field, Baidu is the best among the large companies after iFlytek, and its applications are quite widespread. The wake-up word “Xiao Du Xiao Du” has also become a brand that is increasingly integrated into smart hardware. Voice technology is becoming an important search entry point.

The next article will cover “Baidu AI Series: Open Capabilities - Image Technology.” Finally, here is the complete video of the 2019 Baidu Developer Conference for you to watch at your convenience. Everyone is welcome to join the discussion.

Disclaimer:

This public account is for personal research and learning sharing, not a commercial account with any commercial purpose. If the content of the article infringes or contains illegal information, please contact this account immediately for deletion. Thank you. Contact: [email protected]
