Exploring Xiaomi’s Practical Applications of BERT in NLP

Machine Heart Column

Author: Xiaomi AI Lab NLP Team

When a technology is applied in practice, it often encounters various challenges. Taking BERT as an example, when adapting to business needs, engineers need to make various adjustments according to specific scenarios. This article introduces the practical exploration of the Xiaomi AI Lab NLP team in applying BERT.


In recent years, pre-trained models have made a significant impact in the field of Natural Language Processing (NLP), with one of the most important works being Google’s release of the BERT pre-trained model in 2018 [1]. Since its release, the BERT pre-trained model has achieved excellent results on various natural language understanding tasks, ushering in the era of pre-training and fine-tuning in NLP, inspiring a series of subsequent pre-trained model works in the NLP field. At the same time, the BERT model has been widely applied in industrial fields related to NLP and has achieved good results. However, due to the complexity of data formats in industrial applications and the performance requirements for inference, the BERT model often cannot be directly applied to NLP businesses; it needs to be adjusted and modified according to specific scenarios and data to meet the practical needs of the business.
The Xiaomi AI Lab NLP team has actively explored the forefront of NLP since its establishment, aiming to apply advanced NLP technology to the company’s core businesses and to support the NLP needs of information flow, search and recommendation, voice interaction, and more. Recently, we conducted exploratory research on applying the BERT pre-trained model to several businesses, using a range of deep learning techniques to adapt and extend this powerful pre-trained model to each business’s data forms and performance requirements. This work has achieved good results and has been deployed in dialogue understanding, voice interaction, foundational NLP algorithms, and other areas. Below, we introduce this practical exploration of BERT in Xiaomi’s NLP business in three parts: an introduction to the BERT pre-trained model, the practical exploration of the BERT model, and a summary with reflections.
BERT Pre-trained Model Introduction
The full name of the BERT pre-trained model [1] is Bidirectional Encoder Representations from Transformers. Its main idea is to use the Transformer network [2] as the basic structure of the model and to pre-train it on a large-scale unsupervised corpus with two pre-training tasks, masked language modeling and next sentence prediction, which yields the pre-trained BERT model. Fine-tuning on downstream NLP tasks then starts from this pre-trained model. In this way, BERT can fully utilize the language prior knowledge learned during unsupervised pre-training and transfer it to downstream tasks during fine-tuning; it achieved excellent results on 11 downstream natural language processing tasks and initiated a new pre-training paradigm in natural language processing.
The structure of the BERT model mainly consists of three parts: the input layer, the encoding layer, and the task-related layer. The input layer includes token embeddings, position embeddings, and segment embeddings, and the three are summed to obtain the input representation of each word. The encoding layer directly uses the Transformer encoder [2] to encode the representation of the input sequence. The task-related layer of the BERT model varies depending on the downstream task; for example, for text classification tasks, the task-related layer is typically a linear classifier with softmax.
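As a concrete illustration of this input layer, here is a minimal PyTorch sketch that sums the three embeddings; the vocabulary size, maximum length, and hidden size are illustrative defaults, not the exact values of any particular BERT release.

```python
import torch
import torch.nn as nn

class BertInputLayer(nn.Module):
    """Minimal sketch of the BERT input layer: token + position + segment embeddings."""
    def __init__(self, vocab_size=21128, max_len=512, num_segments=2, hidden=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.position_emb = nn.Embedding(max_len, hidden)
        self.segment_emb = nn.Embedding(num_segments, hidden)
        self.layer_norm = nn.LayerNorm(hidden)   # BERT normalizes the summed embeddings
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        summed = (self.token_emb(token_ids)
                  + self.position_emb(positions)[None, :, :]
                  + self.segment_emb(segment_ids))
        return self.dropout(self.layer_norm(summed))
```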
The BERT model employs two pre-training tasks: one is the Masked Language Model (MLM) and the other is Next Sentence Prediction (NSP). Through these two pre-training tasks, the BERT model can learn prior language knowledge and later transfer it to downstream tasks. The principle of the first pre-training task, the Masked Language Model (MLM), is to randomly select a certain proportion (15%) of words in the input sequence and replace them with a mask token [MASK], and then predict these masked words based on the bidirectional context. The main goal of the second pre-training task, Next Sentence Prediction (NSP), is to predict whether sentence B is the next sentence of sentence A based on the two input sentences A and B.
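The masking step of MLM can be sketched as follows; this is a simplification, since the full BERT recipe also leaves some selected tokens unchanged or replaces them with random tokens, and special tokens such as [CLS] and [SEP] should be excluded from masking.

```python
import torch

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, ignore_index=-100):
    """Simplified MLM masking: mask ~15% of tokens and keep their originals as targets."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~selected] = ignore_index          # loss is computed only at masked positions
    masked_inputs = token_ids.clone()
    masked_inputs[selected] = mask_token_id   # replace selected tokens with [MASK]
    return masked_inputs, labels
```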
The pre-trained BERT model can be used for downstream natural language processing tasks. When using it, the task-related layer is added on top of the pre-trained BERT model, and fine-tuning is performed on specific tasks. Typically, we extract the vector representation corresponding to the last layer of the BERT model and feed it into the task-related layer to obtain the target probabilities for the task to be modeled. For example, in a text classification task, we extract the vector representation corresponding to the last [CLS] token and then perform a linear transformation and softmax normalization to obtain the classification probabilities. During fine-tuning, all parameters of the BERT model and the task-related layer are updated together to optimize the loss function for the current downstream task.
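For example, a text classification fine-tuning setup along these lines might look like the sketch below, built on the Hugging Face transformers library; the checkpoint name and number of labels are placeholders rather than the exact configuration used in our businesses.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertTextClassifier(nn.Module):
    """Pre-trained BERT encoder plus a task-related linear classification layer."""
    def __init__(self, model_name="bert-base-chinese", num_labels=3):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = outputs.last_hidden_state[:, 0]       # last-layer vector at the [CLS] position
        return torch.softmax(self.classifier(cls_vec), dim=-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertTextClassifier()
batch = tokenizer(["an example query"], return_tensors="pt", padding=True)
probs = model(batch["input_ids"], batch["attention_mask"])  # class probabilities
```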
The BERT model, based on the pre-training and fine-tuning paradigm, has many advantages. Through pre-training on a large-scale unsupervised corpus, the BERT model can introduce rich prior semantic knowledge, providing better initialization for downstream tasks, reducing overfitting on downstream tasks, and decreasing the reliance on large-scale data for downstream tasks. At the same time, the deep Transformer encoding layers used by the BERT model have strong representational capabilities, so downstream tasks do not require overly complex task-related layers, simplifying the model design for downstream tasks and providing a unified framework for various natural language processing tasks.
Practical Exploration of the BERT Model
In practical applications, the BERT model often needs to be adjusted to the specific form of each business. The original BERT model can only model text sequences and cannot model other features; at the same time, it has a very large number of parameters and slow inference, so it cannot meet the performance requirements of online deployment. In addition, the BERT model has to be fine-tuned separately for each sub-task, which limits its generality and prevents it from effectively exploiting information shared across tasks. In practice, we have used deep learning techniques such as feature fusion, attention mechanisms, ensemble learning, knowledge distillation, and multi-task learning to enhance or modify the BERT model, applying it to multiple specific tasks with good business results. Below, we introduce the practical exploration of the BERT model in three business areas: dialogue system intent recognition, voice interaction query completion, and multi-granularity word segmentation on the Xiaomi NLP platform.
Dialogue System Intent Recognition
Intent recognition is a crucial component of task-oriented dialogue systems, playing an essential role in the natural language understanding of user dialogues. Given a user query, the intent recognition module can identify the intent that the user wants to express, providing important intent label information for subsequent dialogue feedback. However, during the intent recognition process, due to the sparsity of entity slot knowledge, a model based solely on user query text struggles to further improve its effectiveness. When modeling the query text, if slot knowledge features can be integrated, the model’s intent recognition performance may be further enhanced. Therefore, the input to our intent recognition module consists of user query text and slot labels, with the output being the intent category, as shown in the following example.
  • Query Text / Slot Labels: Play Zhang/b-music_artist/b-mobileVideo_artist Jie/i-music_artist/i-mobileVideo_artist ’s song

  • Intent Category: music

Unlike typical intent recognition, which uses only query text features, our slot-integrated intent recognition model also takes slot label features as input, and each position may carry more than one slot label. When applying the BERT model to this intent recognition task, how to combine the slot label features with the BERT model appropriately becomes the key problem. Since no slot information was present during BERT pre-training, inserting slot label features directly into the BERT input would disturb the input distribution the model was pre-trained on and largely negate the benefit of pre-training. We therefore need a more reasonable way to integrate the slot label features. After exploration and experimentation, we ultimately adopted a slot attention and fusion gating mechanism to integrate the slot features, as shown in the structure of the slot-integrated intent recognition model in Figure 4.


Figure 4 Intent recognition model with integrated slot features
First, we use the pre-trained BERT model to encode the query text, obtaining a text vector Q that integrates pre-trained prior knowledge.
Next, we embed the slot labels to obtain slot embeddings ES. Since each position may have multiple slot labels, we need to perform a pooling operation on the slot embeddings, and here we adopted a slot attention mechanism to perform a weighted sum of multiple slot embeddings. We used the Scaled Dot-Product Attention [2] as our slot attention mechanism, and before applying the dot-product attention mechanism, we first linearly transformed the text vector and slot embeddings to map them into a common dimensional subspace. After applying the slot attention, multiple slot embeddings are weighted and averaged to form a single slot vector S.
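Concretely, scaled dot-product attention [2] computes, at each position, a weighted sum of that position’s projected slot embeddings using the projected text vector as the query. Written as a sketch in our own notation, with $q'$ the projected text vector at a position, $E_S'$ the projected embeddings of that position’s slot labels, and $d$ the common projected dimension:

$$ S = \mathrm{softmax}\!\left( \frac{q' {E_S'}^{\top}}{\sqrt{d}} \right) E_S' $$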


Then, we use a fusion gating mechanism to combine the text vector Q and the slot vector S to obtain the fused vector F. The fusion gating mechanism here is similar to the gating mechanism in Long Short-Term Memory (LSTM).
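One typical LSTM-style formulation of such a gate (a sketch, with $\sigma$ the sigmoid function and $W_g$, $b_g$ hypothetical learned parameters, not necessarily the exact form used in our model) is:

$$ g = \sigma\big(W_g [Q; S] + b_g\big), \qquad F = g \odot Q + (1 - g) \odot S $$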


The fused vector F represents the text information and slot information for each position; however, since the fusion gating mechanism is applied at each position, the fused vector at a single position lacks contextual information about slots from other positions. To encode contextual information, we used a multi-head attention mechanism with residual connections and layer normalization [2] to encode the fused vector F, resulting in the final output vector O.
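In the notation of [2], this encoding step corresponds to the standard Transformer self-attention sublayer (a sketch; dropout is omitted):

$$ O = \mathrm{LayerNorm}\big(F + \mathrm{MultiHead}(F, F, F)\big) $$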
Finally, we extract the output vector corresponding to the first position (the position corresponding to the [CLS] token), concatenate it with the text length feature, and feed it into a linear classifier with softmax to obtain the probabilities for each intent category, thereby predicting the intent category label corresponding to the query.
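A minimal PyTorch sketch of this final step is shown below; representing the text-length feature as a single scalar appended to the [CLS]-position vector is an illustrative assumption, as are the dimensions.

```python
import torch
import torch.nn as nn

class IntentClassifierHead(nn.Module):
    """Sketch of the final intent layer: [CLS]-position output + length feature -> softmax."""
    def __init__(self, hidden=768, num_intents=32):
        super().__init__()
        self.classifier = nn.Linear(hidden + 1, num_intents)

    def forward(self, fused_outputs, text_lengths):
        # fused_outputs: (batch, seq_len, hidden) output vectors O from the attention layer
        # text_lengths:  (batch,) query lengths used as an extra scalar feature
        cls_vec = fused_outputs[:, 0]                                  # [CLS] position
        feats = torch.cat([cls_vec, text_lengths.float().unsqueeze(-1)], dim=-1)
        return torch.softmax(self.classifier(feats), dim=-1)          # intent probabilities
```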
Fine-tuning experiments on business data show that the BERT intent recognition model with integrated slot features achieves better results than the BERT model that uses text information alone.
Voice Interaction Query Completion
During voice input, users may pause or leave an utterance unfinished, so the query text produced by ASR recognition may be incomplete, which affects subsequent intent recognition and slot extraction. Therefore, before a query is sent on to the platform, it passes through a completion judgment module that identifies incomplete queries and returns them to ASR for further input until a complete query is received. The completion judgment module is thus an important component for determining whether a user query is complete: it takes the user query text as input and outputs a completion label, where ‘incomplete’ indicates an incomplete query and ‘normal’ indicates a complete one, as shown in the following example.
  • Query Text: Play Zhang Jie’s

  • Completion Label: incomplete

The main difficulty of the completion judgment model lies in the large volume of online user query requests, which places strict limits on inference speed; a standard BERT model therefore cannot meet the online performance requirements of the completion judgment business. On the other hand, smaller models that do meet the performance requirements are usually too simple in structure and fall short of BERT’s accuracy. After weighing these trade-offs, we adopted an “ensemble learning + knowledge distillation” framework for the completion judgment system: we first use ensemble learning to combine BERT with other effective models to reach high accuracy, and then use knowledge distillation to transfer the knowledge of the ensemble into the Albert Tiny model [3], significantly improving the small model’s accuracy while preserving its inference speed. The “ensemble learning + knowledge distillation” completion judgment framework is shown in Figure 5.


Figure 5 Ensemble Learning + Knowledge Distillation Completion Judgment Framework
The first step of the completion judgment framework is ensemble learning, whose goal is to combine multiple models to mitigate overfitting and obtain an ensemble that performs better than any single model. First, we train several effective large models, such as the BERT model, on the completion judgment dataset; these large models are referred to as teacher models. Next, each teacher model predicts the logits for every data point, where the logits are the model’s output scores before the softmax is applied and reflect the teacher’s knowledge about the data. Finally, for each data point we integrate the logits predicted by the teacher models into ensemble logits, which correspond to an ensemble model that combines the strengths of the individual teachers and outperforms any single BERT teacher model.
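One simple way to integrate the teacher logits, shown here purely as an illustration, is to average them over the $K$ teacher models, so that for the $i$-th data point

$$ t_i = \frac{1}{K} \sum_{k=1}^{K} z_i^{(k)}, $$

where $z_i^{(k)}$ denotes the logits predicted by the $k$-th teacher.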
The second step of the completion judgment framework is knowledge distillation. Due to inference performance constraints, the ensemble model cannot be directly applied to the online completion judgment business. Therefore, we use knowledge distillation [4] to transfer the knowledge of the ensemble model to a smaller model, the Albert Tiny model, which can meet online performance requirements. This smaller model is also referred to as the student model. When distilling the student model, we use both the ensemble logits and the true labels to train the student model, allowing it to learn the knowledge of the ensemble model and further enhance its performance. Specifically, for the i-th training data point, we first use the Albert Tiny model to predict the corresponding student logits, denoted as z. Additionally, the ensemble model logits for this data point are denoted as t, and the true label is denoted as y. We compute the usual cross-entropy loss using the true label y and student logits z, and simultaneously compute the temperature-smoothed distillation cross-entropy loss using the ensemble logits t and student logits z. The final loss function is a weighted sum of the two loss functions.
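With temperature $T$ and mixing weight $\alpha$ as hyperparameters, the loss described above takes the standard form from [4], written here as a sketch in our own notation:

$$ \mathcal{L}_i = \alpha \,\mathrm{CE}\big(y_i, \mathrm{softmax}(z_i)\big) + (1 - \alpha)\, T^2 \,\mathrm{CE}\big(\mathrm{softmax}(t_i / T), \mathrm{softmax}(z_i / T)\big) $$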


Ultimately, the Albert Tiny model obtained through the “ensemble learning + knowledge distillation” framework can achieve results comparable to the BERT single model while also meeting online inference performance requirements.
Xiaomi NLP Platform Multi-Granularity Word Segmentation
Chinese word segmentation has become a foundational requirement for many NLP tasks, providing essential segmentation services for downstream tasks. The Xiaomi NLP platform (MiNLP platform) has also developed a Chinese word segmentation function to support other NLP-related businesses within the company. In practice, we have found that different businesses have varying granularity requirements for Chinese word segmentation; short-text related businesses often require finer-grained segmentation services, while long-text related businesses prefer coarser segmentation granularity. To address the different granularity segmentation needs of various businesses, we have developed a multi-granularity segmentation algorithm that can support both coarse and fine-grained Chinese word segmentation services. Examples of multi-granularity segmentation are as follows:
  • Text: This is a mobile internet company

  • Coarse Granularity Segmentation: This/Is/A/Mobile Internet/Company

  • Fine Granularity Segmentation: This/Is/A/Mobile/Internet/Company

When developing the multi-granularity segmentation algorithm based on the BERT model, we found that if fine-tuning is done separately on the datasets for each granularity, it would require training and deploying multiple different granularity BERT segmentation models, leading to a linear increase in overall model resource consumption with the number of granularities. At the same time, while there are some differences in the segmentation results for different granularities, there are still many commonalities; training different granularity BERT segmentation models separately does not fully leverage this shared knowledge to enhance segmentation performance. To fully utilize the shared information from different granularity segmentations and reduce training and deployment costs, we proposed a “Unified Multi-Granularity Word Segmentation Model based on BERT” [5], which uses a single unified BERT model to train a Chinese word segmenter that supports multiple granularities.


Figure 6 Unified Multi-Granularity Word Segmentation Model based on BERT
The proposed unified multi-granularity word segmentation model, shown in Figure 6, is based on the idea of first adding a special granularity marker to the input text to indicate the desired granularity, such as [fine] for fine granularity and [coarse] for coarse granularity. The text characters together with the granularity marker are then fed into the BERT model, followed by a linear classifier with softmax that maps the representation at each position to probabilities over the four segmentation labels BMES: B for the beginning of a word, M for the middle of a word, E for the end of a word, and S for a single-character word. For the above text, the fine granularity segmentation label sequence would be:
  • Fine Granularity Segmentation Labels: This/S Is/S A/S Mobile/S Internet/S Company/S (in the original Chinese, labels are assigned per character, so a multi-character word is labeled B/M/E across its characters)

Finally, decoding the segmentation labels for the entire text sequence yields the final segmentation results.
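As a rough illustration of this decoding step, the sketch below converts a BMES label sequence over tokens into words; the character-level example at the bottom is hypothetical.

```python
def decode_bmes(tokens, labels):
    """Decode BMES segmentation labels into a list of words (simplified sketch)."""
    words, current = [], []
    for token, label in zip(tokens, labels):
        if label == "S":                  # single-token word
            if current:                   # flush any unfinished word defensively
                words.append("".join(current))
                current = []
            words.append(token)
        elif label == "B":                # beginning of a multi-token word
            if current:
                words.append("".join(current))
            current = [token]
        elif label == "M":                # middle of a word
            current.append(token)
        else:                             # "E": end of a word
            current.append(token)
            words.append("".join(current))
            current = []
    if current:                           # flush any trailing partial word
        words.append("".join(current))
    return words

# decode_bmes(list("移动互联网"), ["B", "E", "B", "M", "E"]) -> ["移动", "互联网"]
```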
The proposed unified multi-granularity word segmentation model not only achieves better results on the company’s internal multi-granularity segmentation data but also achieves state-of-the-art (SOTA) performance on publicly available multi-standard segmentation datasets, with experimental results available in reference [5].
Summary and Reflections
The Xiaomi AI Lab NLP team has explored the practical applications of the BERT model in specific businesses, using techniques such as feature fusion, ensemble learning, knowledge distillation, and multi-task learning to modify and enhance the BERT pre-trained model, achieving good results in dialogue system intent recognition, voice interaction query completion, and multi-granularity word segmentation on the Xiaomi NLP platform.
At the same time, we offer some thoughts on how to apply the BERT model further, leaving room for future exploration and research. First, how to integrate non-text features, especially external knowledge features, into the BERT model more effectively is a question worth studying in depth. Second, some businesses have strict requirements on online inference performance, and in such scenarios the BERT model often cannot be applied directly; we need to further investigate transferring knowledge from large models to small models, or model compression, to improve accuracy while guaranteeing inference speed. Finally, internal business units often work independently, which leads to redundant effort when each applies the BERT model and leaves the information shared across tasks unexploited. Building a BERT model platform based on multi-task learning that allows sufficient sharing among business units, so that one platform serves and benefits multiple businesses, is the direction we will strive for next.
References
[1] Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
[2] Vaswani et al. Attention Is All You Need. NIPS 2017.
[3] Lan et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. ICLR 2020.
[4] Hinton et al. Distilling the Knowledge in a Neural Network. NIPS Deep Learning Workshop 2014.
[5] Ke et al. Unified Multi-Criteria Chinese Word Segmentation with BERT. 2020.
