Innovations in the Era of BERT: Applications of BERT in NLP


Article Author: Zhang Junlin, Senior Algorithm Expert at Weibo AI Lab

Content Source: Deep Learning Frontier Notes @ Zhihu Column

Community Production: DataFun


BERT brought great surprises when it first appeared, and in the roughly half a year since then, many new BERT-related works have emerged.

In recent months, aside from my main work on recommendation algorithms, I have been quite curious about the following two questions:

Question 1: The original BERT paper proved that pre-training BERT has a significant promoting effect on almost all types of NLP tasks (except generation models) under comprehensive NLP datasets like GLUE. However, since various tasks in GLUE have a certain proportion of small dataset sizes and relatively limited domains, does pre-training technology really have a significant promoting effect on many application areas under larger datasets and more domains, as demonstrated in the original BERT paper’s experiments? If so, how significant is this effect? Is the magnitude of this effect related to the domain? This is my first question of concern.

Question 2: As a new technology, BERT certainly has many immature or improvable aspects. So, what problems does BERT currently face? What are the directions worth improving in the future? This is my second question of concern.

Gradually, I collected about 70-80 published works related to BERT up until the end of May 2019. Initially, I planned to answer both questions in a single article, but as I wrote, it grew too long, and the two themes could indeed stand on their own, so I split it into two articles. This one answers the first question, focusing on the applications of BERT in various NLP fields. Generally, these applications do not involve improvements to BERT itself; they simply apply and leverage the capabilities of the pre-trained BERT model, which is relatively straightforward. Enhancing or improving BERT’s capabilities is more technically involved; I have summarized about ten future improvement directions for BERT, which will be covered in the second article. I will revise and publish that article later, summarizing the potential development directions for BERT.

These two articles require a certain knowledge structure from readers, so I recommend familiarizing yourself with the working mechanism of BERT before reading. You can refer to my previous article introducing BERT:

https://zhuanlan.zhihu.com/p/49271699

This article mainly answers the first question. In addition, from the application perspective, what characteristics of tasks is BERT particularly good at handling? What types of applications is it not good at handling? Which NLP application fields is BERT good at but still unexplored? What impact will BERT’s emergence have on traditional technologies in various NLP fields? Will a unified technical solution be formed in the future? Including how we should innovate in the BERT era… these questions will also be addressed, along with my personal views.

Before answering these questions, I will first discuss some abstract views on BERT. The original content was supposed to be in this position in the article, but it was too long, so I have extracted it separately. If you are interested, you can refer to the following article:

https://zhuanlan.zhihu.com/p/65470719

Flourishing: The Application Progress of BERT in Various NLP Fields

It has been half a year since the birth of BERT. To summarize this period: a large number of works have applied BERT directly in various NLP fields. The methods are simple and direct, and the overall results are fairly good, although the degree of benefit varies from field to field.

This section introduces these works by field and provides some subjective judgments and analyses.

Application Field: Question Answering (QA) and Reading Comprehension

QA, generally referred to as question-answering systems in Chinese, is an important application field in NLP and has a long history. I remember almost choosing this direction for my PhD proposal when I was studying… that was close… At that time, the level of technological development was various tricks flying around, and the reliability of various technologies was questionable… Of course, I eventually chose an even worse PhD proposal direction… This should be a specific example of Murphy’s Law? “Choice is greater than effort,” this famous saying has always been proven and has never been overturned. PhD students, please pay attention to this heartfelt advice and choose your proposal direction wisely. Of course, sometimes the choice of direction is not entirely up to you, but the above is what you need to pay attention to when you have the freedom to choose.

The core question of QA is: Given a user’s natural language query Q, for example, “Who is the president most like a 2B pencil in American history?” the system hopes to find a language segment from a large number of candidate documents that can correctly answer the user’s question and ideally return the answer A directly to the user, such as the correct answer to the above question: “Donald Trump”.

It is obvious that this is a very practical direction. In fact, the future of search engines may be QA + reading comprehension, where machines learn to read and understand each article and directly return answers to user questions.

The QA field is currently one of the best-performing areas for BERT applications, and it may even be possible to remove the “one of”. I personally believe that the possible reason is that QA questions are relatively pure. The so-called “purity” means that this is a relatively pure natural language or semantic problem, where the required answer exists in the text content, and at most some knowledge graphs may be needed. Therefore, as long as NLP technology improves, this field will directly benefit. Of course, it may also be related to the fact that the form of QA questions matches BERT’s advantages. So what kinds of problems is BERT particularly suitable for solving? This will be specifically analyzed later in this article.

Currently, different technical solutions that utilize BERT in the QA field are quite similar and generally follow the process below:

[Figure omitted: the general two-stage BERT-based QA pipeline]

Applying BERT in QA, from a process perspective, is generally divided into two stages: retrieval + QA judgment. First, long documents are usually segmented into paragraphs or groups of consecutive sentences, commonly referred to as passages, and a fast lookup mechanism is built over them using an inverted index. The first stage is retrieval, which is similar to a conventional search process: generally, the BM25 model (or BM25 + RM3 and similar techniques) is used to retrieve candidate paragraphs or sentences that may contain the answer to the query. The second stage is QA judgment. During model training, a larger QA dataset such as SQuAD is generally used to fine-tune the BERT model. In the application stage, for each of the top K high-scoring candidate passages returned from the first stage, the user query and the candidate passage are input into BERT, which either classifies whether the current passage contains the correct answer to the query or outputs the starting and ending positions of the answer. This is a fairly general approach to tackling QA problems with BERT. Different solutions are quite similar; the differences may lie only in the datasets used for fine-tuning.
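To make the two stages concrete, here is a minimal sketch of the process, under the assumption that the rank_bm25 package handles the retrieval stage and a SQuAD-fine-tuned BERT checkpoint served through the Hugging Face transformers library handles the QA-judgment stage. Neither library nor the checkpoint is mentioned in the article; they are stand-ins for whatever retrieval model and fine-tuned BERT a real system would use:

```python
from rank_bm25 import BM25Okapi
from transformers import pipeline

# A toy passage collection; a real system would index a large corpus.
passages = [
    "Donald Trump served as the 45th president of the United States.",
    "The Apollo 11 mission landed the first humans on the Moon in 1969.",
    "BERT is a pre-trained language model released by Google in 2018.",
]

query = "Who was the 45th president of the United States?"

# Stage 1: retrieval -- score every passage against the query with BM25
# and keep the top-K candidates (a real system would use an inverted index).
bm25 = BM25Okapi([p.lower().split() for p in passages])
scores = bm25.get_scores(query.lower().split())
top_k = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: QA judgment -- feed (query, candidate passage) pairs into a BERT
# model fine-tuned on SQuAD; it returns an answer span plus a confidence score.
qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")
candidates = [qa(question=query, context=passages[i]) for i in top_k]
best = max(candidates, key=lambda c: c["score"])
print(best["answer"], best["score"])
```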

QA and reading comprehension are fundamentally similar tasks when applying BERT; if you simplify your understanding, you can actually throw away the first stage of the above QA process and keep only the second stage, which is the process of applying BERT to reading comprehension tasks. Of course, this is a simplified understanding. In terms of the tasks themselves, there are significant commonalities, but there are also some subtle differences; generally, ordinary QA questions rely on shorter contexts when searching for answers, and the reference information is more localized, with answers being more surface-level. In contrast, reading comprehension tasks may require longer contextual references to correctly locate answers, and some high-difficulty reading comprehension questions may require machines to engage in a certain degree of reasoning. Overall, reading comprehension feels like a more challenging version of ordinary QA tasks. However, from the perspective of BERT application, the processes of both are similar, so I have combined these two tasks together. I know that the above statements may be controversial, but this is purely my personal view; please take it with caution.

As mentioned earlier, the QA field may be the most successful application area for BERT, and many studies have proven that applying the BERT pre-trained model often leads to significant improvements in tasks. Below are some specific examples.

[Slides omitted: specific experimental results of BERT on QA tasks]

In reading comprehension tasks, applying BERT has also had a huge impact on the previously complex array of technologies. A few years ago, I personally felt that although many new technologies had been proposed in the reading comprehension field, the methods were overly complex and tended to become increasingly complicated, which is definitely not a normal or good technological development route. I feel that the path may have gone awry, and I have always had a psychological aversion to overly complex technologies. Perhaps my limited intelligence prevents me from understanding the profound secrets behind complex technologies? Regardless of the reason, I have not followed this direction any further. Of course, the direction is a good one. I believe that the emergence of BERT will bring the technology in this field back to its essence, and the simplification of models will be more thorough. Perhaps this is not the case now, but I believe it will certainly happen in the future, and the technical solutions in reading comprehension should be a concise and unified model. As for the effects of applying BERT in reading comprehension, you can look at the SQuAD competition leaderboard, where the top entries are all BERT models, which also indicates BERT’s tremendous influence in the reading comprehension field.

Application Field: Search and Information Retrieval (IR)

Regarding the application of BERT, the problem patterns and solution processes in search or IR are very similar to those in QA tasks, but there are still some differences due to the different tasks.

The differences between search tasks and QA tasks can be summarized into three main points:

First, although both tasks are about matching queries and documents, the factors emphasized during matching are different. The connotations represented by the “relevance” and “semantic similarity” of the two texts differ; “relevance” emphasizes precise matching of literal content, while “semantic similarity” encompasses another meaning: that even if the literal content does not match, the underlying semantics may be similar. QA tasks pay attention to both, but may lean more toward semantic similarity, while search tasks focus more on the relevance of text matching.

Second, there is a difference in document length. QA tasks often seek answers to question Q, and the answers are likely to be short language segments within a Passage. Generally, the correct answer will be contained within this relatively short range, so the answers to QA tasks are usually brief, or the search objects are short enough to cover the correct answers, meaning that the processing objects of QA tasks tend to be short texts. In contrast, search tasks generally deal with longer documents. Although the judgment of whether a document is relevant to a query may rely only on a few key passages or sentences within a long document, these key segments may be scattered throughout different parts of the document. When applying BERT, due to its input length limitation, with a maximum input length of 512 units, how to handle long documents becomes quite important for search.

Lastly, for tasks like QA, the information contained in the text may be sufficient for making judgments, so no additional feature information is needed. However, for search tasks, especially practical searches in real life rather than the relatively simple Ad hoc retrieval tasks in evaluations, relying solely on text may not effectively judge the relevance between queries and documents. Many other factors also significantly impact search quality, such as link analysis, page quality, user behavior data, and various other features that also play an important role in the final judgment. For non-textual information, BERT seems unable to effectively integrate and represent this information. The recommendation field faces similar issues as search.

Having said all this about the differences between the two fields, it may not matter much in practice. If you look at the current works that use BERT to improve retrieval, especially for passage-level information retrieval problems, you may not be able to tell whether the task being solved is a search problem or a QA problem. Of course, for long-document search there are still separate issues to address. Why is it so hard to distinguish QA from search here? On the one hand, the current works that use BERT to improve retrieval focus mainly on passage-level tasks; on the other hand, the tasks are usually Ad Hoc retrieval, which focuses mainly on content matching and differs significantly from the feature set used in real search engines. These two reasons largely explain the phenomenon.

Let’s summarize how BERT is generally applied in Ad Hoc retrieval tasks: it is generally divided into two stages. First, using classic text matching models like BM25 or other simple and fast models for preliminary sorting of documents to obtain the top K documents with the highest scores, and then using complex machine learning models to reorder the top K results. The application of BERT is evident in the search re-ranking stage, and the application pattern is similar to QA, where the query and document are input into BERT to utilize its deep language processing capabilities to determine whether they are relevant. If it is passage-level short document retrieval, the process is basically the same as QA; if it is long document retrieval, a technical solution for handling long documents needs to be added before using BERT for relevance judgment.

Therefore, regarding how to apply BERT in the information retrieval field, we can discuss it from two different angles: short document retrieval and long document retrieval.

For short document retrieval, you can treat it as a QA task, so there is no need to elaborate further; I will directly present the results. The effects of several works in passage-level document retrieval tasks can be referenced in the following PPT:

[Slide omitted: results of several works on passage-level retrieval tasks]

From the various experimental data above, it can be seen that for short document retrieval, using BERT generally leads to significant performance improvements.

For long document retrieval tasks, because BERT cannot accept too long inputs at the input end, it faces the issue of how to shorten long documents. The other processes are basically similar to short document retrieval. So how do we solve the long document problem in search? You can refer to the ideas in the following paper.

Paper: Simple Applications of BERT for Ad Hoc Document Retrieval

This paper was the first to apply BERT in the field of information retrieval and proved that BERT can effectively improve performance in search applications. It conducted experiments on both short and long documents. For short documents, it used the TREC 2014 Microblog Track data. After introducing BERT, on the microblog search task, it achieved a 5% to 18% performance improvement over the best current retrieval models, and a 20% to 30% improvement over the baseline method (BM25 + RM3, a strong baseline that surpasses most other improvement methods).

The second dataset is the TREC long document retrieval task, where the differences between search and QA tasks are highlighted. Because the documents to be searched are relatively long, it is challenging to input the entire document into BERT during the re-ranking stage. Therefore, this work adopted a simple method: it segmented the document into sentences and used BERT to judge the relevance of each sentence to the query Q, then accumulated the scores of the highest-scoring Top N sentences (the conclusion is that obtaining the three highest-scoring sentences is sufficient; any more may decrease performance) to obtain the relevance score between the document and query Q, thus transforming the long document problem into a scoring accumulation model of portions of the document’s sentences. Experiments showed that compared to the strong baseline BM25 + RM3, using BERT yielded about a 10% performance improvement.
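The sentence-scoring trick just described is easy to sketch. The snippet below assumes a BERT cross-encoder exposed through the Hugging Face transformers library; the checkpoint is a placeholder rather than the one used in the paper, and its relevance head would need to be fine-tuned on query-sentence relevance data before the scores mean anything. It scores each sentence of the document against the query and sums the top-3 scores as the document score:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A plain BERT encoder with a 2-class relevance head; the head is untrained here
# and would be fine-tuned on (query, sentence) relevance data in practice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

def document_score(query: str, document: str, top_n: int = 3) -> float:
    """Score a long document by summing the scores of its Top-N best sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    sentence_scores = []
    for sentence in sentences:
        inputs = tokenizer(query, sentence, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Probability of the "relevant" class as the sentence-level score.
        sentence_scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    # Accumulate only the Top-N highest-scoring sentences (the paper finds N=3 is enough).
    return sum(sorted(sentence_scores, reverse=True)[:top_n])

print(document_score("what is BERT",
                     "BERT is a pre-trained language model. It was released by Google. The weather is nice."))
```

The same per-sentence scoring function could also play the role of the F(Q,S) sentence-selection function discussed in the next subsection.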

A Solution Idea for Long Documents in the Search Field

From the above paper’s handling process of long documents in search, we can further think deeply about this issue. Considering the uniqueness of search tasks, the relevance of documents and user queries is not reflected in all sentences in the article, but is concentrated in certain sentences within the document. If this fact holds, then an intuitive solution to the long document problem in search tasks can be as follows: first, through certain methods, determine the relevance between the sentences in the query and document, generating a judgment function Score=F(Q,S), and based on the Score, filter out a smaller subset of sentences Sub_Set(Sentences) that represent the content of the document. This way, the long document can be shortened in a targeted manner. From the perspective of relevance to the query, this approach will not lose too much information. The key is how to define this function F(Q,S); different definitions may yield different performance results. This function can be referred to as the sentence selection function for documents in the search field, and different DNN models can be used to implement it. There are many articles to be written here, and those interested can pay attention to it.

If we summarize: in the search field, applying BERT for passage-level short document retrieval often leads to significant performance improvements; while for long document searches, using BERT can also yield some improvement, but the effect is not as pronounced as for short documents. The likely reason is that the handling of long documents in search has its own characteristics, and further exploration of more reasonable methods that reflect the characteristics of long documents in search is needed to further leverage the effects of BERT.

Application Field: Dialogue Systems / Chatbots

Chatbots or dialogue systems have also been very active in recent years, which is related to the emergence of a large number of chatbot products in the market. Personally, I believe this direction aligns with future development trends; however, the current technology is not mature enough to support a product that meets people’s expectations for usability. Therefore, I am not too optimistic about the recent product forms in this direction, mainly limited by the current stage of technological development, which cannot support a good user experience. This is a digression.

Chatbots can be divided into two types by task: casual chatting and task-solving. Casual chatting is easy to understand; it is aimless chitchat that helps pass the time, assuming you have free time that needs passing. I found that young children are likely the target audience for this task; two years ago, my daughter could chat with Siri for half an hour, until Siri got annoyed with her. Of course, the last thing my daughter usually says is: “Siri, you are too dumb!”

Task-solving involves helping users handle daily tasks and real problems they encounter. For example, 99% of straight men suffer from holiday phobia because they forget certain holidays; with a chatbot, you no longer have to worry. You can tell the chatbot: “From now on, whenever it’s a holiday, remember to remind me to order flowers for XX.” Thus, you will receive over 500 reminders from the chatbot throughout the year, you won’t have to worry about being scolded, and life gets better. Moreover, if you happen to find yourself unemployed in midlife, your newfound familiarity with the flower-ordering business means you could open a chain of flower shops, which might go public in a few years, perhaps even faster than Luckin Coffee’s IPO… This is the concrete benefit that task-solving chatbots bring you; if you’re lucky, they might even push you unexpectedly toward the pinnacle of your life.

Jokes aside, from a technical perspective, chatbots primarily face two technical challenges.

The first is for single-turn dialogues, which is a question-and-answer scenario where task-solving chatbots need to parse the user’s intent from their utterances. For example, whether the user wants to order food or request a song is generally a classification problem called user intent classification, which categorizes the user’s intent into various service types. Additionally, if the user’s intent is confirmed, key elements must be extracted based on that intent. For instance, when booking a flight, one needs to extract the departure location, destination, departure time, return time, and other information. Currently, the slot-filling technique is generally used to do this, where each key element corresponds to a slot, and the value extracted from user interactions corresponds to the filling process. For instance, in a song request scenario, one slot might be “singer,” and if the user says, “Play a song by TFBoys,” the slot filling will yield “singer=TFBoys.” This is a typical slot-filling process.


The paper “BERT for Joint Intent Classification and Slot Filling” utilizes BERT to address the single-turn conversation tasks of intent classification and slot filling. The solution is quite straightforward: input a conversational sentence, and the high-level Transformer output corresponding to the [CLS] input position classifies the sentence’s intent. This is a typical application of BERT for text classification; on the other hand, for each word in the conversational sentence, it is treated as a sequence labeling problem, where each word at the corresponding position in the highest layer of the Transformer is classified to mark which type of slot the word corresponds to using IOE tagging. This is a typical approach to using BERT for sequence labeling, where this method accomplishes both tasks simultaneously, which is quite good. By adopting the BERT pre-training process, the results on two datasets show that the performance improvement in intent classification is not significant, possibly because the baseline methods have already achieved relatively high metrics. In terms of slot filling, compared to RNN + Attention and other baseline methods, performance varies across two datasets, with one dataset showing a 2% improvement and the other a 12% improvement. Overall, the performance is decent but not outstanding.
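The two heads described above are easy to picture in code. Here is a minimal sketch, not the authors’ implementation, that puts an intent classifier on the [CLS] output and a per-token slot classifier on the token outputs, using the Hugging Face transformers library; the numbers of intents and slot labels are made-up placeholders:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class JointIntentSlotModel(nn.Module):
    """BERT encoder with an intent head on [CLS] and a slot head on every token."""
    def __init__(self, num_intents: int, num_slot_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)     # classifies the whole utterance
        self.slot_head = nn.Linear(hidden, num_slot_labels)   # labels each token position

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(outputs.pooler_output)    # from the [CLS] position
        slot_logits = self.slot_head(outputs.last_hidden_state)    # one prediction per token
        return intent_logits, slot_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = JointIntentSlotModel(num_intents=7, num_slot_labels=21)   # made-up label counts
batch = tokenizer(["play a song by TFBoys"], return_tensors="pt")
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
print(intent_logits.shape, slot_logits.shape)   # (1, 7) and (1, seq_len, 21)
```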

The second challenge is for multi-turn dialogues, where users interact with the chatbot through several rounds of Q&A. How to improve the model so that the chatbot remembers historical user interaction information and correctly uses that historical information in subsequent responses is a crucial issue that significantly affects how intelligent users perceive the chatbot to be. Therefore, effectively integrating more historical information and using it correctly in context is key to improving the model.

So what effect would applying BERT to multi-turn dialogue issues have? The paper “Comparison of Transfer-Learning Approaches for Response Selection in Multi-Turn Conversations” provides experimental results. It utilizes GPT and BERT and other pre-trained models to improve how historical information is integrated into multi-turn dialogues for response selection. The results show significant improvements, with BERT outperforming GPT, and GPT performing significantly better than baseline methods. Compared to baseline methods, BERT shows performance improvements ranging from 11% to 41% across different datasets.

In summary, BERT’s application in the chatbot field has considerable potential. Single-turn dialogue issues are relatively straightforward; in multi-turn dialogues, how to incorporate context is more complex, and I believe BERT can play a significant role here.

Application Field: Text Summarization

Text summarization can be divided into two types: abstractive summarization and extractive summarization. Abstractive summarization involves inputting a longer original document and generating a summary that is not limited to sentences appearing in the original text but autonomously creates a shorter summary that reflects the main ideas of the article. Extractive summarization means selecting sentences from the original document that can reflect the main ideas; the summary is constructed from several original sentences extracted from the text.

Below are the key points for applying BERT in these two different types of summarization tasks.

Abstractive Text Summarization


It is evident that the abstractive summarization task fits the typical Encoder-Decoder framework: the original article is fed into the Encoder, and the Decoder generates the summary sentences. Given this, BERT’s pre-training results can be brought in at two places. One is the Encoder side, which is straightforward: simply initialize the Encoder’s Transformer parameters with the pre-trained BERT model. The other is the Decoder side, which is more complicated. The main issue is that BERT is pre-trained as a bidirectional language model, while the Decoder generates words step by step from left to right. This mismatch with BERT’s bidirectional training mode means the contextual information learned during BERT’s pre-training cannot be exploited effectively, so the pre-trained model cannot fully exert its power on the Decoder side.

Therefore, if you want to utilize BERT’s power in generating text summaries within an Encoder-Decoder framework, it is not easy. This faces the same issues as other NLP generation tasks involving BERT, and despite some recent proposals, this problem still seems to remain unresolved, making it heavily reliant on advancements in BERT’s generative models.

As for how to leverage BERT’s potential in generative tasks, this is an important direction for BERT model improvement. I will analyze the current solutions and effectiveness evaluations in the next article, “Model Section,” so I will temporarily skip this here.

Extractive Text Summarization

Extractive text summarization is a typical sentence classification problem. This means the model inputs the overall text content of the article, and given a specific sentence in the text, the model needs to perform a binary classification task to determine whether this sentence should be included in the summary. Therefore, extractive text summarization is essentially a sentence classification task, but it has its unique characteristics compared to conventional text classification tasks. The key difference is that the input must include the entire article content, while the classification target is only the current sentence being judged. The entire article serves as context for the current sentence, but it must be inputted. In contrast, typical text or sentence classification tasks do not require this additional context.

Thus, while extractive text summarization can be viewed as a sentence classification task, the input content and output target do not match well, which is a critical difference. Consequently, how to express the relationship between the sentence and the article in the model input needs careful consideration.

To use BERT for extractive summarization, you can construct a sentence classification task using the Transformer as a feature extractor and initialize the Transformer parameters with BERT’s pre-trained model. From a model perspective, BERT can certainly support this task; it only requires initializing the Transformer parameters with BERT’s pre-trained model. The key issue is how to design and construct the input part of the Transformer. The requirements are that the input must include the entire content of the article and indicate which sentence is currently being judged.

I previously published an article on methodological thinking:

https://zhuanlan.zhihu.com/p/51934140

Now I can apply that method here. Before looking at how others did it, I first thought about how I would approach the problem myself. So, before reading any papers on using BERT for extractive summarization, I came up with a couple of relatively straightforward methods:


Method 1: Divide the Transformer’s input into two parts. The first part is the original text of the article; of course, since BERT supports a maximum input length of 512, the original text cannot be too long. The second part is the sentence currently being judged. A separator <SEP> is placed between the two parts. For the output, the highest-layer output at the initial [CLS] position is required to produce a binary 0/1 classification: 1 means the sentence should be included in the summary, 0 means it should not. By judging each sentence in the text in turn, the sentences classified as 1 can be concatenated in order to produce the summary. This is one possible method; I have not yet seen a model doing it this way, and I personally find the approach a bit cumbersome.

Method 2: The input part of the Transformer consists of only one part, which is the complete content of the article made up of multiple sentences. If this is the case, a new question arises: how do we know which sentence we are currently judging as a potential summary sentence? This can be done by adding a sentence start marker <BOS> at the beginning of each sentence in the input part of the Transformer, or treating this separator as a delimiter between sentences; alternatively, you could also add sentence numbers to the corresponding embeddings of the sentences to distinguish different sentence boundaries (BERT’s input part includes not only conventional word embeddings but also sentence embeddings, where words belonging to the same sentence share the same sentence embedding). Although various specific approaches can be taken, the core idea is similar: to clearly add some markers in the input part to distinguish different sentences.

Once the input part issue is resolved, the remaining question is how to design the output part of the Transformer. Similarly, several different approaches may exist here. For example, one possibility is to mimic the output of reading comprehension, requiring the Transformer output to provide several sentences’ <start Position, end Position> based on input. Another possibility is to bind K output heads to the initial [CLS] symbol, with each output head indicating the sentence number selected as a summary, specifying a maximum of K summary sentences. Additionally, there are other potential approaches, such as finding the corresponding position of the sentence start markers <BOS> in the high-level embeddings corresponding to the input tokens, where each <BOS> corresponds to an input position in the high-level embedding as the information set for that sentence, and designing a classification layer based on this sentence information embedding. This effectively classifies each sentence.

Are there other methods? There should be. For instance, the summary can also be treated as a single-word/word classification problem, where each input word corresponds to a high-level node of the Transformer, requiring classification of each word, with the output categories designed as [BOS (summary sentence start word marker), MOS (summary sentence word marker), EOS (summary sentence end word marker), OOS (non-summary sentence word marker)]. This is also a possible approach, treating the summary as a sequence labeling problem. Of course, you can brainstorm and estimate that there are many other methods.


Currently, there is a paper (Fine-tune BERT for Extractive Summarization) that focuses on extractive text summarization. Its specific approach generally follows the framework of Method 2 described above, using special separators to distinguish different sentences in the input part, adding a sentence start marker <CLS> to each sentence, and in the output part, requiring the output layer to determine whether each sentence will be selected as a summary sentence based on the highest layer embedding corresponding to the input <CLS> position. Unlike the methods described above, it adds an additional network layer between the output layer and the actual classification layer to integrate the relationships between different sentences in the article, such as using linear classifiers/transformers/RNNs to capture the common information between sentences, and then outputting the actual sentence classification results based on this. However, experimental results indicate that the newly added intermediate network layer does not significantly capture new information, which suggests that this layer could be simplified by removing it. As for the summarization system that incorporates BERT’s pre-trained model, while it outperforms the current SOTA model, the improvement is not substantial. This raises an interesting question: what does this indicate?
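The input/output design of Method 2 (and, roughly, of the paper above) can be sketched as follows. This is a minimal illustration with an untrained classification head, using the Hugging Face transformers library rather than the paper’s released code: a per-sentence [CLS] marker is inserted, and each sentence is scored from the top-layer embedding at its [CLS] position.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
sentence_classifier = nn.Linear(bert.config.hidden_size, 1)   # untrained "keep in summary?" head

sentences = [
    "BERT was released in 2018.",
    "It quickly set new records on many NLP benchmarks.",
    "The weather today is sunny.",
]
# Insert a [CLS] marker before (and a [SEP] after) every sentence, then concatenate.
text = " ".join(f"[CLS] {s} [SEP]" for s in sentences)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False,
                   truncation=True, max_length=512)

hidden = bert(**inputs).last_hidden_state[0]                       # (seq_len, hidden)
cls_positions = inputs["input_ids"][0] == tokenizer.cls_token_id   # mask of the [CLS] slots
sentence_vectors = hidden[cls_positions]                           # one vector per sentence
keep_scores = torch.sigmoid(sentence_classifier(sentence_vectors)).squeeze(-1)
print(keep_scores)   # after fine-tuning, threshold these to select summary sentences
```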

Application Field: Data Augmentation in NLP

We know that in the CV field, image data augmentation plays a vital role in performance, such as rotating images or cropping parts of images to create new training instances. Similarly, NLP tasks face similar needs; the reason for this need is that the more comprehensive the training data covering various scenarios, the better the model’s performance, which is easy to understand. The question is how to expand the training data for the task at a low cost.

Of course, you can choose to manually label more training data, but unfortunately, the cost of manual labeling is too high. Can we leverage some models to assist in generating new training data instances to enhance model performance? NLP data augmentation is aimed at achieving this.

Returning to our topic: Can we use the BERT model to expand the manually labeled training data? This is the core goal of applying BERT in the data augmentation field. The goal is clear; the remaining issue is the specific methods. This field is relatively innovative in terms of BERT application.

Paper: Conditional BERT Contextual Augmentation

This is a relatively interesting application from the Institute of Computing Technology, Chinese Academy of Sciences. Its aim is to generate new training data by modifying the BERT pre-trained model to enhance task classification performance. This means that for a certain task, input training data a, and generate training data b through BERT, using b to enhance classifier performance.


Its modification method is as follows: transforming BERT’s bidirectional language model into a conditional language model. The term “conditional” means that for training data a, certain words are masked, and BERT is required to predict these masked words. However, unlike the usual BERT pre-training, when predicting the masked words, a condition is added at the input end, which is the class label of the training data a. Assuming the class label of the training data is known, BERT is required to predict certain words based on the class label of the training data a and the context. The reason for this is to generate more meaningful training data. For instance, in sentiment analysis tasks, if a training instance S with positive sentiment has the sentiment word “good” masked, BERT is tasked with generating a new training instance N. If no conditional constraints are applied, BERT might generate the predicted word “bad,” which is reasonable but completely reverses the sentiment meaning for the sentiment analysis task, which is not what we want. We want the new training instance N to also express positive sentiment, such as “funny”. By adding class label constraints, this can be achieved. The specific method of adding constraints is to replace the sentence embedding part of the original BERT input with the embedding of the corresponding class label of the input sentence, thereby generating new training data that meets class constraints. This approach is quite interesting.
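A minimal sketch of this label-conditioning idea follows, assuming the Hugging Face transformers implementation of BERT: the two-row segment (token-type) embedding table is swapped for a table with one row per class label, and the label id is fed in through token_type_ids. The checkpoint, label ids, and example sentence are placeholders, and the new label embeddings would of course need to be fine-tuned on labeled data before the conditional predictions behave as described:

```python
import torch
import torch.nn as nn
from transformers import BertForMaskedLM, BertTokenizerFast

NUM_CLASSES = 2   # e.g. 0 = negative sentiment, 1 = positive sentiment (assumed labels)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Swap the 2-row segment (token-type) embedding table for a label embedding table
# with one row per class; these new embeddings must be fine-tuned on labeled data.
model.bert.embeddings.token_type_embeddings = nn.Embedding(NUM_CLASSES, model.config.hidden_size)

sentence = "the movie was [MASK] and I enjoyed every minute of it"
label_id = 1   # condition the masked-word prediction on the "positive" label
inputs = tokenizer(sentence, return_tensors="pt")
inputs["token_type_ids"] = torch.full_like(inputs["token_type_ids"], label_id)

with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))   # after conditional fine-tuning: a positive word
```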

If the new training data generated by BERT is added to the original training data, the paper demonstrates that it can provide stable performance improvements for CNN and RNN classifiers.

Another paper, “Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering,” also addresses data augmentation in NLP, but unlike the previous work it does not generate training data with BERT; instead, it expands positive and negative examples using rules within QA problems. The approach itself does not involve much technical novelty and has little to do with BERT. Its valuable conclusions are that simultaneously increasing both the augmented positive and negative examples helps BERT’s application performance, and that adding augmented data in stages (i.e., training on the original training data and the augmented data in separate stages, with the original training data trained first) performs better than mixing the augmented and original data into a single training stage.

Thus, when viewed together, these two articles represent a complete process of generating new training instances with BERT and how to apply these augmented instances.

Application Field: Text Classification

Text classification is a long-standing and well-established application field in NLP. It involves assigning a document to a category, determining whether it is about “sports” or “entertainment,” and so on.

So, how does BERT perform in this field? There are currently works addressing this.

Paper: DocBERT: BERT for Document Classification

In tests on four commonly used standard text classification datasets, the BERT pre-trained model achieved results that met or exceeded previous methods. However, overall, the improvement over previous common methods such as LSTM or CNN models was not substantial, with basic improvements ranging from 3% to 6%.

For text classification, BERT has not achieved a significant performance boost, which is understandable. This is because categorizing a relatively long document into a category relies heavily on shallow linguistic features, and indicative words are relatively numerous, making it a relatively straightforward task. Thus, the potential of BERT may not be easily realized.

Application Field: Sequence Labeling

Strictly speaking, sequence labeling is not a specific application field but a problem-solving pattern in NLP. Many NLP tasks can be mapped to sequence labeling problems, such as word segmentation, part-of-speech tagging, semantic role labeling, and many others. A characteristic of this pattern is that every word in a sentence has a corresponding classification output result. The original BERT paper also illustrated how to utilize BERT’s pre-training process for sequence labeling tasks, and the application pattern is generally that.

If we disregard specific application scenarios and map different application scenarios to the sequence labeling problem-solving pattern, there are currently works using BERT to enhance application performance.

Paper: Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning


This work uses BERT as a feature extractor for multi-criteria word segmentation. The term “multi-criteria” refers to different granularities of segmentation results for the same language segment under different scenarios. It uses the pre-trained BERT Transformer as the main feature extractor and builds a unique parameter head for each segmentation dataset to learn their respective standards while adding a shared parameter head to capture common information. On this basis, CRF is employed for global optimal planning. This model achieved the highest word segmentation performance on multiple datasets. However, overall, the performance improvement was not very significant. This may be related to the fact that previous technical methods have already solved word segmentation quite well, resulting in a high baseline.

Paper: BERT Post-Training for Review Reading Comprehension and Aspect-based Sentiment Analysis

This work addressed two tasks: reading comprehension and sentiment analysis. The aspect extraction task of sentiment analysis was conducted using a sequence labeling approach, and after utilizing BERT’s pre-training process, performance improvement was not significant compared to the best previous methods.

Paper: Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection

This work primarily addresses grammatical error detection. It maps the problem to a sequence labeling task and, after using BERT’s pre-training, shows dramatic performance improvements over baseline methods.

Earlier, when discussing the application field of dialogue systems, I also mentioned a single-turn dialogue system that utilized BERT for slot filling, which was also mapped to a sequence labeling problem. The conclusions from that work showed a 2% improvement on one dataset and a 12% improvement on another.

Overall, most sequence labeling tasks utilizing BERT achieve the current best performance, but the extent of performance improvement is relatively limited compared to many other fields. This may be related to the specific application fields.

Application Field: Others

In addition to the BERT applications I have categorized above, there are also sporadic works in other areas that utilize BERT, which I will briefly mention.

Paper: Assessing BERT’s Syntactic Abilities

This work tested BERT’s syntactic abilities using a subject-verb agreement task. The test data showed that BERT’s performance significantly exceeded that of traditional LSTM models, although the authors emphasized that, for various reasons, the numbers cannot be compared directly. Still, it at least indicates that, in terms of syntax, BERT is no weaker than, and is probably stronger than, LSTM.

Additionally, there are two papers focused on NLP to SQL tasks (Achieving 90% accuracy in WikiSQL/ Table2answer: Read the database and answer without SQL), which means issuing commands in natural language without writing SQL statements, and the system automatically converts them into executable SQL statements. After using BERT, there was also a noticeable performance improvement. I understand that this task is relatively easy due to its domain limitations, so I am not particularly interested in this direction, and I will not elaborate further; those interested can look for the papers.

Furthermore, there are two papers on information extraction that also showed average results after utilizing BERT. This field is indeed worth paying close attention to.

Most of the published BERT applications have already been categorized into the various fields mentioned above. The papers explicitly mentioned here are only a portion of those I consider to have reference value; I have filtered out a batch that I believe has low quality or reference value or whose methods are overly complex, so please note this. Of course, there may also be valuable works that did not enter my narrow view, so this is also possible.

Everything Returns to Origin: The Facts May Be Different from What You Think

Having introduced many ways in which BERT can be applied to enhance performance in various NLP fields, the methods are numerous, and the effects vary, which can easily make one feel dazzled and unable to grasp the essence. Perhaps before reading this content, you had a single impression: BERT is great, and it can significantly enhance performance across various NLP applications. Is this really the case? Actually, it is not. After reviewing the content above, having seen the myriad methods, you may feel more confused, thinking that no conclusions can be drawn, with a glimmer of confusion in your eyes…

In fact, this is just the surface phenomenon. I will summarize here. This is purely a personal analysis, and I do not guarantee correctness; guessing right is a coincidence, and guessing wrong is normal.

Despite the appearance of various NLP tasks, how to apply BERT is primarily answered in the original BERT paper, where most of the processes are discussed. The core process is essentially the same: use the Transformer as a feature extractor, initialize the Transformer parameters with the BERT pre-trained model, and then fine-tune the parameters for the current task. That’s all.
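That core process, reduced to code, looks roughly like the sketch below: load a pre-trained BERT checkpoint, attach a task-specific head, and fine-tune on the downstream data. The Hugging Face transformers and datasets libraries, the checkpoint, the toy two-example dataset, and the hyperparameters are all illustrative assumptions, not anything prescribed by the article:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A two-example toy dataset, only to keep the sketch self-contained.
data = Dataset.from_dict({"text": ["a great movie", "a terrible plot"], "labels": [1, 0]})
data = data.map(lambda batch: tokenizer(batch["text"], padding="max_length",
                                        truncation=True, max_length=32),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-5),
    train_dataset=data,
)
trainer.train()   # fine-tunes all Transformer parameters plus the new task head
```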

If we delve deeper and analyze by task type, the conclusions may likely be as follows:

If the application problem can be transformed into a standard sequence labeling problem (word segmentation, part-of-speech tagging, semantic role labeling, dialogue slot filling, information extraction, etc., many NLP tasks can be transformed into sequence labeling problem-solving forms), or single-sentence or document classification problems (text classification/extractive text summarization can be seen as a single-sentence classification problem with context), then BERT’s pre-training process can be directly utilized without special modifications. Current experimental results seem to indicate that in these two types of tasks, using BERT should achieve the best results, but the performance improvements compared to the best models that did not adopt BERT seem relatively limited. What could be the underlying reasons for this? I have a judgment that I will discuss later.

If it is a short document’s dual-sentence relationship judgment task, such as typical QA/reading comprehension/short document search/dialogue tasks, the intuitive way to utilize BERT is the input method proposed in the original BERT paper, where two sentences are separated by a delimiter, without the need for special modifications. Currently, it appears that for these tasks, utilizing BERT often leads to significant performance improvements.

However, why do you feel that there are many different models when reading the above text? This is mainly because in some NLP fields, despite the utilization of BERT being fundamentally the same, certain task-specific characteristics need to be addressed separately, such as how to solve the long document input problem in the search field, what methods are needed for coarse sorting in the search field, how to design input-output for extractive summarization, and how to effectively utilize historical information in multi-turn dialogue systems. These issues need to be addressed. In essence, the key parts of applying BERT do not differ significantly. Everything has long been discussed in the original BERT paper. “Do not be surprised by the spring sleep, the tea aroma dissipates after the gamble, back then it was just ordinary.” Many things are just like that.

After this free-spirited explanation, has the glimmer of confusion in your eyes extinguished? Or has it become even more intense?

72 Transformations: Reconstructing Application Problems

If the above judgments are correct, you should ask yourself a question: “Since BERT seems more suitable for handling sentence relationship judgment problems, can we utilize this aspect? For instance, can we transform single-sentence classification problems or sequence labeling problems into dual-sentence relationship judgment problems?”

If you can ask this question, it indicates that you are quite suitable for cutting-edge research.

In fact, some works have already done this, and it has been proven that if application problems can be reconstructed, many tasks can directly improve their performance, with some tasks showing significant improvements. How to reconstruct? For certain NLP tasks with specific characteristics, if they appear to be single-sentence classification problems, you can introduce auxiliary sentences to transform the single-sentence input problem into a sentence pair matching problem, thus fully leveraging BERT’s strengths.

That’s how reconstruction is done.

Paper: Utilizing BERT for Aspect-Based Sentiment Analysis via Constructing Auxiliary Sentence

This work comes from Fudan University. It utilizes BERT to optimize fine-grained sentiment classification tasks. Fine-grained sentiment classification refers to making nuanced judgments about different aspects of a certain entity, such as the example “LOCATION1 is often considered the coolest area of London,” which means that the sentiment regarding a certain aspect (e.g., price) of an entity (here, a location) is assessed.

This work is quite clever; it transforms the conventional single-sentence classification problem of sentiment analysis by adding auxiliary sentences to turn it into a sentence pair matching task (for example, the auxiliary sentence could be: “What do you think of the safety of LOCATION 1?”). As mentioned earlier, many experimental results indicate that BERT is particularly suited for sentence pair matching tasks. Therefore, this transformation can undoubtedly leverage BERT’s application advantages more fully. The experiments also prove that through this problem transformation, significant performance improvements are achieved.
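Concretely, the reconstruction amounts to nothing more than building a second sentence from the aspect and feeding the pair to BERT in its standard sentence-pair format. A minimal sketch follows; the checkpoint and three-way label set are assumptions, and the paper’s actual auxiliary-sentence templates and label scheme differ in detail:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Three-way sentiment head (positive / neutral / negative); untrained here.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

review = "LOCATION1 is often considered the coolest area of London."
aspect = "safety"
auxiliary = f"What do you think of the {aspect} of LOCATION1 ?"   # constructed second sentence

# BERT's standard sentence-pair input: [CLS] review [SEP] auxiliary [SEP]
inputs = tokenizer(review, auxiliary, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))   # after fine-tuning: sentiment toward this (entity, aspect) pair
```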

Salesforce also has a similar idea in their work (Unifying Question Answering and Text Classification via Span Extraction), with some experimental results indicating that single-sentence classification tasks can also improve performance by incorporating auxiliary sentences to transform them into sentence pair matching tasks.

Why does the performance suddenly improve when applying BERT to the same data and task, simply by transforming a single-sentence task into a sentence pair matching task? This is indeed a good question, and I will provide two possible explanations later.

This direction is worth further exploration; there are currently few works in this area, and I sense that there is still considerable potential to be tapped, whether from exploratory or application perspectives.

Competitive Advantage: What is BERT Good At?

Thus, a new question arises: What types of NLP tasks does BERT excel at solving? What scenarios are more suitable for BERT to address?

To answer this question, I compared various tasks and attempted to summarize and reason some conclusions, hoping to identify the features of tasks that can leverage BERT’s model advantages. The analysis results are as follows; these are purely personal judgments, and errors are inevitable, so please approach them critically and cautiously to avoid misleading you.

First, if NLP tasks lean toward having answers contained within the language itself and do not particularly depend on external textual features, applying BERT can significantly improve application performance. Typical tasks include QA and reading comprehension, where the correct answers lean more toward understanding the language itself; the better the understanding, the better the resolution, without relying too much on external judgment factors. Conversely, for some tasks where features beyond text are crucial, such as user behavior, link analysis, and content quality in search, BERT’s advantages may be more challenging to realize. Similarly, the recommendation system shares a similar rationale; BERT may only help with text content encoding, while other user behavior features may not be easily integrated into BERT.

Second, BERT is particularly suited for solving sentence or paragraph matching tasks. In other words, BERT excels at addressing sentence relationship judgment problems compared to typical single-text classification tasks and sequence labeling tasks, as many experimental results have indicated. The two possible reasons for this may be: Firstly, BERT’s pre-training phase includes the Next Sentence Prediction task, which allows it to learn some inter-sentence relationship knowledge during pre-training. If downstream tasks involve sentence relationship judgments, this aligns well with BERT’s strengths, thus yielding particularly significant effects. Secondly, the self-attention mechanism inherently allows for attention calculations between any words in sentence A and any words in sentence B. This fine-grained matching is especially important for sentence matching tasks, which is why the nature of the Transformer makes it particularly suitable for solving these types of tasks.

From the above characteristic of BERT excelling at handling inter-sentence relationship tasks, we can further infer the following viewpoint: Since the pre-training phase includes the Next Sentence Prediction task, it can positively influence similar downstream tasks. Can we continue to introduce other new auxiliary tasks during the pre-training phase? If this auxiliary task has a certain universality, it may directly enhance the performance of a class of downstream tasks. This is also a very interesting exploratory direction, although this direction falls within the realm of the affluent in the NLP community, while the less fortunate can only observe, applaud, and cheer.

Third, BERT’s applicable scenarios are related to the degree of demand for deep semantic features in NLP tasks. The more a task requires deep semantic features, the more suitable it is for BERT to address. In contrast, for some NLP tasks, shallow features suffice to solve the problem. Typical shallow feature tasks include word segmentation, POS tagging, named entity recognition, text classification, etc. These types of tasks can be effectively solved with relatively short contexts and shallow non-semantic features, so there is less room for BERT to demonstrate its strengths, as it feels somewhat excessive.

This is likely due to the depth of the Transformer layer, which can capture various features at different levels and depths. Thus, for tasks requiring semantic features, BERT’s ability to capture various deep features is more easily realized, whereas for shallow tasks like word segmentation/text classification, traditional methods may perform adequately.

Fourth, BERT is more suitable for solving NLP tasks with relatively short inputs, while tasks with longer inputs, such as document-level tasks, may not be well solved by BERT. The primary reason is that the self-attention mechanism of the Transformer requires attention calculations for any two words, leading to a time complexity of n squared (where n is the input length). If the input length is long, the training and inference speed of the Transformer suffers significantly, which limits BERT’s input length to not be too long. Thus, BERT is more suited for sentence-level or paragraph-level NLP tasks.

There may be other factors, but they do not seem as obvious as the four listed above, so I will summarize these four basic principles.

Gold Mining: How to Find Untapped BERT Application Areas

Since we have summarized the characteristics of tasks that BERT excels at, the next step is to look for application areas that have not yet been explored but are particularly suited for BERT to tackle. You can unleash your talents and skills…

How to find these areas? You can look for application fields that simultaneously meet one or more of the following criteria; the more criteria met, the theoretically more suitable it is for BERT:

1. The input is not too long, ideally sentences or paragraphs, avoiding BERT’s long document issues;

2. The language itself can effectively solve the problem without relying on other types of features;

3. Avoid generative tasks, steering clear of the pitfalls where BERT’s performance in generative tasks is insufficient;

4. Ideally, it should involve multi-sentence relationship judgment tasks, fully leveraging BERT’s strengths in sentence matching;

5. Ideally, it should involve semantic-level tasks, fully utilizing BERT’s ability to encode deep language knowledge;

6. If it is a single-input problem, think about whether you can add an auxiliary sentence to transform it into a sentence matching task;

Having read this, you might start feigning curiosity and asking me: Which application areas meet these characteristics?…

Well, brother, this is not a matter of curiosity but rather a matter of laziness. I have prepared the vinegar for you, and now it just needs the dumplings. It depends on you now; please take the time to ask such questions and go make those dumplings yourself… It is better to take action than to envy others…

New Trend: Can BERT Unify the World of NLP?

Before the advent of BERT, different application fields in NLP often used various models unique to those fields, leading to a diverse and significant disparity. For instance, reading comprehension involved a variety of attention mechanisms; the search field, while entering the DNN era, still relied heavily on learning-to-rank frameworks; and text classification was a typical domain for LSTM…

However, with the emergence of BERT, I believe that the chaotic situation of different technical means competing in various application fields will not last long. BERT, with its pre-trained model, should gradually unify the various NLP application fields with a relatively unified solution, restoring the old order and returning to the celestial throne. This likely signifies the beginning of a new era in NLP, as historically, there has not been such a powerful unified NLP model.

Why do I say this? In fact, you should have already seen some clues in the first section of this article. The content above involves many application fields in NLP. Although BERT’s promoting effects vary across different application areas, with some being significant and others less so, the reality is that all have outperformed the previous SOTA methods in their respective fields. The question is merely how much better they are, not whether they are better.

This means that, at least in the fields mentioned above, BERT’s architecture and model can replace all the other SOTA methods in those domains. It also indicates that the principle of “long separation must unite, and long unity must separate” is at play; the era of unification, led by BERT, has arrived. It also means that you will have to learn far less than before, and the cost-effectiveness of learning NLP will improve dramatically. Can you confidently say that this is a good thing? Well, you didn’t ask anyone else, brother…

With the gradual enhancement of BERT’s own capabilities, this unified pace will likely accelerate. I estimate that within the next year or two, most NLP subfields will likely be unified under the BERT two-phase + Transformer feature extractor framework. I believe this is a very positive development because everyone can focus their energy on enhancing the foundational model’s capabilities. As long as the foundational model improves, it means that the application effects in most application fields will directly improve, without the need to devise personalized solutions for each field, which is somewhat inefficient.

Will this really happen? Can such a powerful model exist in the vast and diverse field of NLP research? Let us wait and see.

Of course, I personally hold an optimistic attitude toward this.

That’s all for now; I find this too lengthy. To those who read to the end, I applaud your curiosity and patience… But please consider how much time you would estimate it took me to write this article after reading it… Also, I haven’t been feeling too well lately, but I still have to find ways to come up with jokes to make you laugh… After writing these AI articles, my technical level hasn’t seen improvement, but my potential to transform into a joke writer has certainly increased… Li Dan, just you wait…

It’s all tears; in fact, it doesn’t matter. Many things are just like that.

Author Introduction:

Zhang Junlin, Director of the Chinese Society for Chinese Information Processing, PhD from the Institute of Software, Chinese Academy of Sciences. Currently a Senior Algorithm Expert at Weibo AI Lab. Previously, Zhang Junlin served as a Senior Technical Expert at Alibaba, leading a new technology team, and held positions as a Technical Manager and Technical Director at Baidu and Yonyou, respectively. He is also the author of the technical books “This Is Search Engine: Core Technology Explained” (which won the 12th National Excellent Book Award) and “Big Data Daily Record: Architecture and Algorithms”.


——END——

About DataFun:

DataFun is positioned as the most practical data intelligence platform, mainly in the form of offline deep salons and online content organization. We hope to disseminate and spread the practical experiences of industry experts in their respective scenarios through the DataFun platform, providing inspiration and reference for those who are about to or have already begun related attempts.

DataFun’s vision is to create a platform for sharing, communicating, learning, and growing for big data and artificial intelligence practitioners and enthusiasts, allowing knowledge and experience in the field of data science to be better disseminated and realized for value.

Since its establishment, DataFun has successfully held dozens of offline technical salons nationwide, with over three hundred industry experts participating in sharing, gathering tens of thousands of practitioners in big data and algorithm-related fields.

