Do You Really Need GPT-3? BERT’s MLM Model Also Enables Few-Shot Learning



Source|PaperWeekly

©PaperWeekly Original · Author|Su Jianlin

Affiliation|Zhuiyi Technology

Research Direction|NLP, Neural Networks

As we all know, GPT-3 is currently very popular. However, with GPT-3 being promoted everywhere, do readers still remember the actual title of the GPT-3 paper? It is Language Models are Few-Shot Learners [1]; the title no longer contains the letters G, P, or T, and the model is called GPT-3 simply because it is a continuation of the original GPT.

As the name suggests, GPT-3 focuses on few-shot learning, i.e., learning from a small number of samples. Its other defining feature is its size: the largest version has up to 175 billion parameters, more than a thousand times as many as BERT Base.

Because of this, a recent paper on arXiv titled It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners [2] caught my attention. The title can be read as saying, "Who says you have to be big? Small models can do few-shot learning too."

Clearly, the title takes direct aim at GPT-3, so I was curious to see who had the courage to challenge it, and with what kind of small model. After reading, I found that the authors propose that, with an appropriate construction, BERT's MLM model can also perform few-shot learning. It gave me one of those "ah, so it can be done this way too" moments, which I would like to share with everyone.


The Rise of MLM

MLM, short for "Masked Language Model", is essentially a fill-in-the-blank (cloze) task: certain tokens in the text are randomly masked, and the model is asked to predict the masked tokens, as illustrated below:


▲ Simple illustration of BERT’s MLM model

The masked parts can be randomly selected individual tokens, or randomly selected runs of consecutive tokens that form whole words; the latter is referred to as WWM (Whole Word Masking).

Initially, MLM was regarded merely as a pre-training task for BERT, something to be discarded after training. As a result, some open-source checkpoints did not retain the weights of the MLM head, such as the brightmart version [3] and CLUE version [4] of RoBERTa, while the MLM weights of Harbin Institute of Technology's open-source RoBERTa-wwm-ext-large [5] were, for unknown reasons, randomly initialized. These versions are therefore not suitable for reproducing the results discussed later in this article.

However, as research deepened, researchers found that not only is BERT's Encoder useful, but the pre-trained MLM model itself is also quite valuable.

For instance, the paper BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model [6] suggests that MLM can be used as a general generative model, while the paper Spelling Error Correction with Soft-Masked BERT [7] utilizes MLM for text correction.

In my earlier article "From Language Models to Seq2Seq: Transformers Are Like Plays, Relying on Masks", I found experimentally that MLM pre-trained weights can also be used as a UniLM for Seq2Seq tasks; and in "Unsupervised Word Segmentation and Syntactic Analysis! So BERT Can Be Used This Way Too", the ideas behind MLM were applied to unsupervised word segmentation and parsing. It is fair to say that MLM has already shone in many places.


Transforming Tasks into Fill-in-the-Blank

In this article, we will learn another exciting application of MLM: using it for few-shot learning or semi-supervised learning, and in some scenarios, even achieving zero-shot learning.

How do we combine the tasks we want to do with MLM? It’s simple, give the task a text description and then convert it into a fill-in-the-blank question. For example, if we have the sentence “I feel quite good about this trip to Beijing.”, we can add a description and construct the following fill-in-the-blank:

______ satisfied. I feel quite good about this trip to Beijing.

Furthermore, we can restrict the blank to be filled with either "very" or "not", which makes the question clear: judge, by contextual coherence, whether the sentiment is positive. If the probability of "very" is greater than that of "not", the sentiment is positive; otherwise it is negative. In this way we have transformed sentiment classification into a fill-in-the-blank question that an MLM model can answer, and since MLM training requires no supervised data, this in principle enables zero-shot learning.
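As a concrete illustration of this scoring scheme, here is a minimal zero-shot sketch using the HuggingFace transformers library. The model id hfl/chinese-roberta-wwm-ext and the exact pattern string are my own assumptions for illustration, not the author's code (which is linked later in this article).

```python
# Minimal zero-shot sketch: fill the pattern's [MASK] with an off-the-shelf Chinese MLM
# and compare the probabilities of "很" ("very") and "不" ("not").
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "hfl/chinese-roberta-wwm-ext"   # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

# Pattern + original sentence: "____ satisfied. I feel quite good about this trip to Beijing."
text = "[MASK]满意。这趟北京之旅我感觉很不错。"
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    probs = model(**inputs).logits[0, mask_pos].softmax(-1)[0]

p_very = probs[tokenizer.convert_tokens_to_ids("很")].item()
p_not = probs[tokenizer.convert_tokens_to_ids("不")].item()
print("positive" if p_very > p_not else "negative", p_very, p_not)
```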

Multi-class problems can also be transformed similarly, for instance, for news topic classification, the input sentence is “After eight months, I can finally see the women’s volleyball team back on the field.” We can construct:

Here comes a ______ news report. After eight months, I can finally see the women’s volleyball team back on the field.

Thus, we have transformed news topic classification into a fill-in-the-blank problem, and a good MLM model should be able to predict the word “sports”.

Some simple reasoning tasks can also be transformed in this way. A common approach is to determine whether two sentences are compatible, for example, “I went to Beijing” and “I went to Shanghai” are contradictory, while “I went to Beijing” and “I am at Tiananmen Square” are compatible. The common practice is to concatenate the two sentences and input them into the model as a binary classification task. To transform this into a fill-in-the-blank, a natural construction could be:

Did I go to Beijing? ______, I went to Shanghai.

Did I go to Beijing? ______, I am at Tiananmen Square.

where the candidates for the blank are "yes" and "no".


Pattern-Exploiting Training

At this point readers will have noticed the pattern: add a prefix or suffix description to the input text and mask certain tokens, turning the task into a fill-in-the-blank question. This added description is called a Pattern in the original paper, and it should form a natural sentence together with the original text rather than read stiffly, since the pre-trained MLM model was trained on natural language.

Clearly, the same question can have many different patterns. For example, in the sentiment classification case, the description can be placed at the end, becoming “I feel quite good about this trip to Beijing. ____ satisfied.”; or we can add a few more words, such as “How do you feel? ____ satisfied. I feel quite good about this trip to Beijing.”.

Next, we need to construct the candidate space of tokens to predict and establish a mapping from tokens to actual categories; this is called the Verbalizer in the original paper. For example, in the sentiment classification case, the candidate space is {"very", "not"}, with the mapping "very" → positive and "not" → negative. The candidate space and the actual categories need not be in one-to-one correspondence: we could also include words such as "quite", "too", and "difficult", and map "very" / "quite" / "too" → positive and "not" / "difficult" → negative, and so on.
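In code, a pattern and its verbalizer are simply a template plus a token-to-label mapping. The sketch below is an illustrative representation only; the names and the Chinese renderings of the extra candidate words are my assumptions, not the original paper's implementation.

```python
# Patterns: templates that wrap the input text around a [MASK] slot.
PATTERNS = [
    lambda x: "[MASK]满意。" + x,              # prefix pattern ("____ satisfied. <text>")
    lambda x: x + "[MASK]满意。",              # suffix pattern ("<text> ____ satisfied.")
    lambda x: "你感觉如何？[MASK]满意。" + x,   # pattern with a few extra words ("How do you feel?")
]

# Verbalizer: candidate tokens -> task labels; the mapping need not be one-to-one.
VERBALIZER = {
    "很": "positive", "挺": "positive", "太": "positive",   # "very" / "quite" / "too"
    "不": "negative", "难": "negative",                      # "not" / "difficult"
}
```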

It is not difficult to see that many NLP tasks can be transformed in this way, although the transformation generally only applies to tasks with a limited candidate space; essentially, it works for "multiple-choice" problems, text classification being the most common case. When labeled data (plus a larger pool of unlabeled data) is available, the original paper proposes the following training procedure:

1. For each pattern, separately finetune an MLM model using the training set;

2. Then, integrate the models corresponding to different patterns to obtain a fusion model;

3. Use the fusion model to predict pseudo-labels for unannotated data;

4. Finetune a conventional (non-MLM) model using pseudo-labeled data.

The specific integration method can be found in the paper and is not the focus here. This training scheme is called Pattern-Exploiting Training (PET), and it first appeared in the paper Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference [8].
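As a rough sketch of the four steps above: the helper functions finetune_mlm_on_pattern, predict_label_distribution, and finetune_classifier are hypothetical placeholders, and the fusion here is a plain average rather than the paper's exact weighting.

```python
def pet_train(patterns, labeled_data, unlabeled_data, verbalizer):
    # Step 1: fine-tune one MLM per pattern on the small labeled set.
    mlm_models = [finetune_mlm_on_pattern(p, labeled_data, verbalizer) for p in patterns]

    # Steps 2-3: fuse the pattern-specific models (here, a simple average of their
    # label distributions) and pseudo-label the unlabeled data with soft labels.
    pseudo_labeled = []
    for x in unlabeled_data:
        dists = [predict_label_distribution(m, p, x, verbalizer)
                 for m, p in zip(mlm_models, patterns)]
        avg = {label: sum(d[label] for d in dists) / len(dists) for label in dists[0]}
        pseudo_labeled.append((x, avg))

    # Step 4: fine-tune a conventional (non-MLM) classifier on the pseudo-labeled data.
    return finetune_classifier(pseudo_labeled)
```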

The paper introduced in this article further affirms and refines the value of Pattern-Exploiting Training, integrating multi-task learning and thereby surpassing GPT-3's few-shot results on the SuperGLUE leaderboard. The two papers share the same authors, so this is a continuation of the earlier work.

▲ PET's few-shot learning results on SuperGLUE

However, one point deserves criticism: the "223M parameters" of PET in the figure above refers to the ALBERT-xxlarge-v2 model. Calling ALBERT a "small model" is somewhat misleading, since its forward computation is no faster at all. ALBERT-xxlarge has 12 layers with parameters shared across layers, so in terms of forward computation it is roughly equivalent to a model with about 2700M (12 × 223M) parameters.


Testing the Effectiveness on Chinese Tasks

To truly confirm the value of a method or model, it is not enough to look at the tables in a paper: the reported results are not necessarily reproducible, and even if they reproduce in English, that does not mean they carry over to Chinese. The most practical approach is therefore to run experiments yourself. Below is my experimental code for readers' reference:

Github Address:

https://github.com/bojone/Pattern-Exploiting-Training

We will explore the feasibility of PET from the following angles:

1. How effective is it to directly use existing MLM models? (Zero-shot learning 1)
2. How effective is it to finetune existing MLM models using “large amounts of unlabeled data”? (Zero-shot learning 2)
3. How effective is it to finetune existing MLM models using “small amounts of labeled data”? (Few-shot learning)
4. How effective is it to finetune existing MLM models using “small amounts of labeled data + large amounts of unlabeled data”? (Semi-supervised learning)

The following mainly presents the results for binary sentiment classification. There is also a multi-class news topic classification experiment; its code is likewise available on GitHub, so I will not repeat it here.

4.1 Zero-Shot Learning 1

Here we mainly explore how accurately existing MLM models fill in the blank once the corresponding pattern has been added to the input text. Since the whole process involves no supervised training on labeled data, it counts as a form of "zero-shot learning". We compare the results across different patterns and different MLM models.

Below are several patterns for the experiment, where the candidates for the blank are “very” and “not”:

P1: ____ satisfied. I feel quite good about this trip to Beijing.
P2: I feel quite good about this trip to Beijing. ____ satisfied.
P3: ____ good. I feel quite good about this trip to Beijing.
P4: ____ ideal. I feel quite good about this trip to Beijing.
P5: How do you feel? ____ satisfied. I feel quite good about this trip to Beijing.
As for the MLM models, they are as follows:

M1: Google’s open-source Chinese BERT Base:

https://github.com/google-research/bert

M2: Harbin Institute of Technology’s open-source RoBERTa-wwm-ext Base:

https://github.com/ymcui/Chinese-BERT-wwm

M3: Tencent UER’s open-source BERT Base:

https://share.weiyun.com/5QOzPqq

M4: Tencent UER’s open-source BERT Large:

https://share.weiyun.com/5G90sMJ

The experimental results are shown in the table below (validation/test set):

▲ Zero-shot results for each pattern and MLM model (validation set / test set)

The best result reaches 88%! This means that, simply by loading an existing MLM and pairing it with a suitable pattern, we can correctly identify the sentiment of most samples without any labeled data, which makes us see the potential of MLM models in a new light.

It can be seen that there are still noticeable differences between patterns and between pre-trained models. Overall, the large versions perform significantly better than the base versions, which suggests that, much like the progression from GPT to GPT-2 to GPT-3, making the model larger generally improves performance.

In addition, this may also indicate that MLM training is not yet sufficient, perhaps because BERT's masking-based training is inefficient; improvements may be possible with a modified Transformer structure, as mentioned in my earlier article on building a better MLM model.

4.2 Zero-Shot Learning 2

After seeing the results above, readers may wonder: if I continue pre-training the MLM with domain-specific data, will the performance improve further? The answer is: yes! Below are our experimental results. Due to limited computing power, we only ran the comparison on RoBERTa-wwm-ext (the M2 above; the continued-pretraining version is denoted "M2 + unsupervised"):

▲ Zero-shot results after continued MLM pre-training on domain data (validation set / test set)

Note that here we only continue MLM training on domain-specific data, which is an unsupervised process requiring no labels, so it still counts as "zero-shot learning". From the results so far, we can also see that adding a "prefix" to the input text has a slight edge over adding a "suffix".
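For reference, continued MLM pre-training on unlabeled in-domain text can be run roughly as below. This is a hedged sketch with HuggingFace transformers and illustrative hyperparameters; the author's own code (linked above) is the authoritative reference.

```python
# Continued (domain-adaptive) MLM pre-training on unlabeled text -- a sketch only.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "hfl/chinese-roberta-wwm-ext"   # assumed checkpoint, as before
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

texts = ["这趟北京之旅我感觉很不错。"]          # replace with your unlabeled in-domain corpus
dataset = [tokenizer(t, truncation=True, max_length=128) for t in texts]

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="mlm-continued", num_train_epochs=1,
                         per_device_train_batch_size=32, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```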
4.3 Few-Shot Learning

Having discussed the improvements from continuing pre-training with unlabeled data, what happens if we return to the target scenario of PET, directly training the MLM with a small amount of labeled data paired with specific patterns?

This is true "few-shot learning" training. We keep about 200 labeled samples. When constructing training samples, we first add the pattern to each sentence; besides the mask position inherent in the pattern, we also randomly mask some other tokens to provide extra regularization. The final experimental results are as follows:

▲ Few-shot learning results for each pattern (validation set / test set)

The conclusion is that, apart from the "suffix" pattern P2, the results are all quite close, which again suggests that "prefix" patterns are more competitive. For comparison, directly fine-tuning a BERT classifier on the same data gives an accuracy of around 88.93%, so the "MLM + pattern" few-shot approach appears to bring a modest improvement.
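To make the sample construction described in this subsection concrete, here is a hedged sketch (a BERT-style tokenizer is assumed; the extra masking ratio and the helper name are illustrative, not the author's implementation):

```python
import random

def build_pet_example(tokenizer, text, label, extra_mask_prob=0.15):
    """Prepend the pattern, supervise the pattern's [MASK] slot with the label word,
    and randomly mask a few additional tokens as regularization."""
    label_token = "很" if label == "positive" else "不"     # verbalizer: "very" / "not"
    ids = tokenizer("[MASK]满意。" + text, truncation=True, max_length=128)["input_ids"]
    mask_id, specials = tokenizer.mask_token_id, set(tokenizer.all_special_ids)

    input_ids, labels = list(ids), [-100] * len(ids)        # -100 is ignored by the MLM loss
    slot = input_ids.index(mask_id)
    labels[slot] = tokenizer.convert_tokens_to_ids(label_token)

    for i, tok in enumerate(ids):                           # extra random masking
        if i != slot and tok not in specials and random.random() < extra_mask_prob:
            input_ids[i], labels[i] = mask_id, tok
    return {"input_ids": input_ids, "attention_mask": [1] * len(ids), "labels": labels}
```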

4.4 Semi-Supervised Learning

Having covered unsupervised zero-shot learning and supervised few-shot learning, it is time to combine labeled and unlabeled data for "semi-supervised learning". Using the same task, the ratio of labeled to unlabeled data is about 1:99; the labeled data carry patterns while the unlabeled data do not, and both have some tokens masked for MLM training. The final results are as follows:

▲ Semi-supervised learning results for each pattern (validation set / test set)

Once again, the "suffix" clearly performs worse than the "prefixes", whose results are all quite close. More importantly, the extra unlabeled data is confirmed to be effective.

Intuitively, the "prefix" may work better than the "suffix" because the masked position in a "prefix" is more fixed, so the weak supervisory signals can accumulate and reinforce each other. However, this does not explain why the "prefix" is also better in the zero-shot setting. It may also relate to learning difficulty: a pattern at the beginning of the sentence is more salient and easier to learn, so the model learns it more thoroughly. All of this remains speculation.
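The labeled/unlabeled mix used in this subsection could be assembled along the following lines, reusing build_pet_example from the previous sketch; labeled_data and unlabeled_texts are placeholders for your own data, and the ratio follows the roughly 1:99 split described above.

```python
def build_unlabeled_example(tokenizer, text, mask_prob=0.15):
    """Plain MLM example for unlabeled text: no pattern, random masking only."""
    ids = tokenizer(text, truncation=True, max_length=128)["input_ids"]
    specials = set(tokenizer.all_special_ids)
    input_ids, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if tok not in specials and random.random() < mask_prob:
            input_ids[i], labels[i] = tokenizer.mask_token_id, tok
    return {"input_ids": input_ids, "attention_mask": [1] * len(ids), "labels": labels}

# Labeled examples carry patterns; unlabeled examples do not.
train_set = ([build_pet_example(tokenizer, t, y) for t, y in labeled_data] +
             [build_unlabeled_example(tokenizer, t) for t in unlabeled_texts])
```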
4.5 Summary and Conclusion
Below is a summary of the results mentioned above:

▲ Summary of all the experimental results above

Readers can also compare these results with those in my previous article "Ramblings on Generalization: From Random Noise and Gradient Penalty to Virtual Adversarial Training"; whether for zero-shot, few-shot, or semi-supervised learning, the MLM-based approach can match the results of VAT-based semi-supervised learning.

Our results on the multi-class short-news classification experiment are similar, which indicates that the MLM model can indeed serve as an excellent zero-shot / few-shot / semi-supervised learner.

Of course, MLM-based models still have drawbacks. The independence assumption in MLM limits its ability to predict longer spans (simply put, the text in the blank cannot be too long), and its inability to predict variable-length answers also restricts its application scenarios (currently it is only suitable for "multiple-choice" tasks). We look forward to more powerful MLM models that can compete with GPT-3 across all tasks.


Time for a Summary

This article introduced a novel application of BERT’s MLM model: transforming tasks into fill-in-the-blank questions with specific descriptions, utilizing the MLM model for zero-shot learning, few-shot learning, and semi-supervised learning.

In the SuperGLUE experiments of the original paper, it achieved results comparable to GPT-3, and I have also conducted some experiments on Chinese tasks, further confirming the effectiveness of the approach. The whole idea is quite novel and gives one a sense of "ah, so it can be done this way too". It is well worth studying.

References

[1] https://arxiv.org/abs/2005.14165
[2] https://arxiv.org/abs/2009.07118
[3] https://github.com/brightmart/roberta_zh
[4] https://github.com/CLUEbenchmark/CLUEPretrainedModels
[5] https://github.com/ymcui/Chinese-BERT-wwm
[6] https://arxiv.org/abs/1902.04094
[7] https://kexue.fm/archives/7661
[8] https://arxiv.org/abs/2001.07676