Overview of Transformer Pre-trained Models in NLP

The revolution the Transformer has brought to natural language processing (NLP) is hard to overstate. Recently, researchers from the National Institute of Technology, Tiruchirappalli and the biomedical AI startup Nference.ai conducted a comprehensive survey of Transformer-based pre-trained models in NLP and compiled the results into a review paper. This article is a rough translation and introduction of that paper, with a focus on the discussion section, where the researchers highlight new research opportunities in the field. It is worth noting that the researchers named the paper “AMMUS”, short for “AMMU Smiles”, in memory of their friend Ammu.
In today’s NLP field, the success of “Transformer-based Pre-trained Language Models (T-PTLM)” can be seen in almost every task. These models originated from GPT and BERT, and their technical foundations are the Transformer, self-supervised learning, and transfer learning. T-PTLM learn universal language representations from large-scale text data using self-supervised learning and then transfer the acquired knowledge to downstream tasks. They provide high-quality background knowledge for downstream tasks, avoiding the need to train downstream models from scratch.
This detailed review of T-PTLM first gives a brief introduction to self-supervised learning. It then explains several core concepts, including pre-training, pre-training methods, pre-training tasks, embeddings, and downstream task adaptation methods. Next, it proposes a new taxonomy of T-PTLM and briefly introduces various benchmarks, both intrinsic and extrinsic. The researchers also summarize some software libraries applicable to T-PTLM. Finally, the paper discusses some future research directions that may help further improve these models.

Paper link: https://arxiv.org/pdf/2108.05542.pdf
The researchers believe that this comprehensive and detailed review paper can serve as a good reference for readers to understand the core concepts and recent research progress related to T-PTLM.
Introduction
Transformer-based Pre-trained Language Models (T-PTLM) can learn universal language representations from large-scale unlabeled text data and transfer the learned knowledge to downstream tasks, and have thus achieved great success in NLP. Such models include GPT-1, BERT, XLNet, RoBERTa, ELECTRA, T5, ALBERT, BART, and PEGASUS. In earlier times, most NLP systems adopted rule-based methods, which were later replaced by machine learning models. Machine learning models require feature engineering, which in turn demands domain expertise and considerable time.
With the emergence of better computing hardware such as GPUs and word embedding methods such as Word2Vec and GloVe, deep learning models like CNNs and RNNs became more widely used for building NLP systems. The main drawback of these deep learning models is that, apart from the word embeddings, they must be trained from scratch. Training such models from scratch requires a large number of labeled instances, which are costly to produce. Yet we would like to obtain well-performing models using only a small number of labeled instances.
Transfer learning allows us to effectively reuse the knowledge learned from the source task to the target task. The target task should be similar to the source task. Based on the idea of transfer learning, researchers in the computer vision field have been using large-scale labeled datasets like ImageNet to train large CNN models. The image representations learned by these models are universal for all tasks. Then, these large pre-trained CNN models can adapt to downstream tasks by adding a few task-specific layers and fine-tuning on the target dataset. Since pre-trained CNN models can provide good background knowledge for downstream models, they have achieved great success in many computer vision tasks.
Deep learning models such as CNNs and RNNs struggle to model long-range context and can only learn word representations with a locality bias. In addition, because an RNN processes its input sequentially (word by word), it can exploit parallel computing hardware only to a limited extent. To overcome these drawbacks, Vaswani et al. proposed a deep learning model based entirely on self-attention: the Transformer. Compared with RNNs, self-attention supports a much higher degree of parallelization and can easily model long-range context, since each token in the input sequence attends to all other tokens.
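To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, in which every token attends to every other token in a single matrix product; the dimensions, weight matrices, and random inputs are illustrative assumptions rather than any particular model's configuration.

```python
# Minimal sketch of scaled dot-product self-attention: every token attends to
# every other token, so long-range context is captured in one matrix product.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len): token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # context-aware token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, 16-dim embeddings (illustrative)
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 16)
```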
The Transformer consists of several stacked encoder and decoder layers. With the help of stacked encoder and decoder layers, the Transformer can learn complex linguistic information. In the NLP field, generating large amounts of labeled data is very costly and time-consuming. However, large amounts of unlabeled text data are easily available. Inspired by the success of using CNN-based pre-trained models in the computer vision community, the NLP research community combined the capabilities of Transformers and self-supervised learning to develop T-PTLM. Self-supervised learning allows the Transformer to learn using pseudo-supervision provided by one or more pre-training tasks.
GPT and BERT are the earliest T-PTLM, built on Transformer decoder and encoder layers, respectively. Subsequently, models such as XLNet, RoBERTa, ELECTRA, ALBERT, T5, BART, and PEGASUS were developed. Among them, XLNet, RoBERTa, ELECTRA, and ALBERT are improvements on BERT, while T5, BART, and PEGASUS are based on encoder-decoder architectures.
Kaplan et al. showed that simply increasing the scale of T-PTLM can lead to performance improvements. This finding has driven the development of large-scale T-PTLM and spawned models with hundreds of billions of parameters, such as GPT-3 (175B), PanGu-α (200B), and GShard (600B), while Switch Transformers (1.6T) has crossed the trillion-parameter mark.
After succeeding on general English, T-PTLM have begun to spread to other areas, including finance, law, news, programming, dialogue, the web, academia, and biomedicine. T-PTLM also support transfer learning, which allows these models to be adapted to downstream tasks through fine-tuning and prompt-based tuning on the target dataset. This article comprehensively reviews recent research results related to T-PTLM. The highlights of the review paper are summarized as follows:
  • Section 2 will briefly introduce self-supervised learning, which is the core technology of T-PTLM.

  • Section 3 will introduce some core concepts related to T-PTLM, including pre-training, pre-training methods, pre-training tasks, embeddings, and downstream adaptation methods.

  • Section 4 will provide a new taxonomy of T-PTLM, considering four aspects: pre-training corpus, architecture, type of self-supervised learning, and extensions.

  • Section 5 will provide a new classification method for different downstream adaptation methods and explain each category in detail.

  • Section 6 will briefly introduce various benchmarks used to evaluate the progress of T-PTLM, including intrinsic and extrinsic benchmarks.

  • Section 7 will present some software libraries applicable to T-PTLM, from Hugging Face Transformers to Transformers-interpret.

  • Section 8 will briefly discuss some future research directions that may help further improve these models.

Self-Supervised Learning (SSL)
The drawbacks of supervised learning are summarized as follows:
  • It relies heavily on human-annotated instances, which are time-consuming and costly to obtain.

  • It lacks generalization ability and is prone to spurious correlations.

  • Many fields, such as healthcare and law, lack labeled data, which limits the application of AI models in these areas.

  • It cannot make use of the large amounts of freely available unlabeled data.

SSL shares some similarities with other popular learning paradigms such as supervised and unsupervised learning. Like unsupervised learning, SSL requires no human-annotated instances. The differences are that a) SSL still involves (pseudo-)supervision, while unsupervised learning does not, and b) unsupervised learning aims to identify hidden patterns, whereas SSL aims to learn meaningful representations. Like supervised learning, SSL involves supervision during learning. The differences are that a) SSL generates its labels automatically without any human intervention, and b) supervised learning imparts task-specific knowledge, whereas SSL provides the model with general knowledge.
The goals of SSL can be summarized as follows:
  • Learn universal language representations that can provide excellent background for downstream models.

  • Obtain better generalization ability by learning from a large amount of freely available unlabeled text data.

Self-supervised learning can be roughly divided into generative SSL, contrastive SSL, and adversarial SSL.
Core Concepts of T-PTLM
Pre-training
Pre-training offers the following advantages:
  • By leveraging a large amount of unlabeled text, pre-training helps the model learn universal language representations.

  • Pre-trained models can adapt to downstream tasks by adding just one or two task-specific layers. This provides a good initialization and avoids training the downstream model from scratch (only the task-specific layers need to be trained from scratch).

  • Allows models to achieve better performance with small datasets, thus reducing the need for a large number of labeled instances.

  • Deep learning models tend to overfit when trained on small datasets due to their large number of parameters. Pre-training can provide good initialization, thus avoiding overfitting on small datasets, and can be seen as a form of regularization.

Steps of Pre-training
Pre-training a model involves the following five steps:
  • Prepare the pre-training corpus

  • Generate the vocabulary

  • Design pre-training tasks

  • Select pre-training methods

  • Select pre-training dynamics

Pre-training Corpus

Figure 1: Pre-training Corpus

Figure 2: Pre-training methods, where PTS is pre-training from scratch, CPT is continual pre-training, SPT is simultaneous pre-training, TAPT is task-adaptive pre-training, and KIPT is knowledge-inherited pre-training
Pre-training Tasks
  • Causal Language Modeling (CLM)

  • Masked Language Modeling (MLM)

  • Replaced Token Detection (RTD)

  • Shuffled Token Detection (STD)

  • Random Token Substitution (RTS)

  • Swapped Language Modeling (SLM)

  • Translation Language Modeling (TLM)

  • Alternating Language Modeling (ALM)

  • Span Boundary Objective (SBO)

  • Next Sentence Prediction (NSP)

  • Sentence Order Prediction (SOP)

  • Sequence-to-Sequence Language Model (Seq2SeqLM)

  • Denoising Autoencoder (DAE)

Embeddings

Figure 8: Embeddings in T-PTLM
Classification Method
To understand and track the development of the various T-PTLM, the researchers categorize them along four axes: pre-training corpus, model architecture, type of SSL, and extensions, as shown in Figure 9:

Figure 9: Classification of T-PTLM.
Downstream Adaptation Methods
Once the language model is trained, it can be used for downstream tasks. There are three ways to apply a pre-trained language model to downstream tasks: feature-based methods, fine-tuning, and prompt-based tuning.
As shown in Figure 10, feature-based methods use the pre-trained language model to produce contextual word embeddings, which are then fed as input features to a separate model for the specific downstream task. Fine-tuning instead adjusts the pre-trained model's weights for the downstream task by minimizing that task's loss.
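As a hedged illustration of these first two routes, the sketch below uses the Hugging Face Transformers library; the checkpoint name, label count, and example sentences are placeholders, not choices made by the survey.

```python
# Sketch of feature-based adaptation vs. fine-tuning with Hugging Face Transformers.
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 1) Feature-based: freeze the encoder and use its outputs as input features
#    for a separate downstream model.
encoder = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    batch = tokenizer(["a sentence to embed"], return_tensors="pt")
    features = encoder(**batch).last_hidden_state[:, 0]   # [CLS] vector as a sentence feature

# 2) Fine-tuning: add a task head and update the weights by minimizing the task loss.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tokenizer(["a sentence to classify"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1])).loss       # task loss; backprop touches every layer
loss.backward()
```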

Figure 10: Downstream Adaptation Methods.
Evaluation
During the pre-training phase, T-PTLM acquires knowledge encoded in the pre-training corpus. This knowledge includes syntax, semantics, facts, and common sense. There are two evaluation methods for T-PTLM’s effectiveness: intrinsic and extrinsic. See Figure 11.
The intrinsic evaluation method assesses T-PTLM by probing the knowledge encoded in it, while the extrinsic evaluation method assesses how T-PTLM performs in real-world downstream tasks. The intrinsic evaluation method helps us understand the knowledge that T-PTLM has acquired during the pre-training phase, which aids in designing better pre-training tasks so that the model can learn more knowledge during the pre-training phase.

Figure 11: Benchmarks for Evaluating T-PTLM Research Progress.
Useful Software Libraries
The researchers also summarize some commonly used software libraries applicable to T-PTLM. Among them, libraries such as Transformers and Fairseq are suited to model training and evaluation. SimpleTransformers, HappyTransformer, AdaptNLP, and others are built on top of the Transformers library and let users train and evaluate models with minimal code. FastSeq, DeepSpeed, FastT5, OnnxT5, and LightSeq can be used to improve model inference speed. Ecco, BertViz, and exBERT are visualization tools for exploring the layers of Transformer models. Transformers-interpret and Captum can be used to explain model decisions.

Table 11: Software Libraries for T-PTLM.
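As a minimal usage example for one of the libraries in Table 11, the sketch below runs a fill-mask pipeline with the Transformers library; the checkpoint and prompt are illustrative choices.

```python
# One-call inference with the Transformers `pipeline` API, which wraps
# tokenization, model inference, and decoding.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Pre-trained language models learn [MASK] representations."))
```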
Discussion and Future Directions
Better Pre-training Methods
Training models using only SSL (especially large models with trillions of parameters) is very costly. New pre-training methods such as knowledge-inherited pre-training (KIPT) combine SSL with knowledge distillation. SSL lets the model learn the knowledge available in the pre-training corpus, while knowledge distillation lets it absorb knowledge already encoded in existing pre-trained models. Because the model gains this additional knowledge through distillation during pre-training, a) it converges faster, shortening pre-training time, and b) it performs better on downstream tasks than models pre-trained with SSL alone. The research community should focus on developing better pre-training methods like KIPT that let models gain more knowledge while reducing pre-training time.
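The intuition can be sketched as a combined objective, an SSL term plus a distillation term toward a teacher model; the PyTorch code below is an illustration of that mixture with made-up shapes and weights, not the exact KIPT formulation.

```python
# Rough sketch: self-supervised loss on the corpus + distillation loss toward a teacher.
import torch
import torch.nn.functional as F

def kipt_style_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    # Self-supervised term, e.g. masked-token prediction against pseudo-labels.
    ssl_loss = F.cross_entropy(student_logits, labels)
    # Distillation term: match the teacher's softened token distribution.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - alpha) * ssl_loss + alpha * kd_loss

student = torch.randn(8, 30522, requires_grad=True)   # fake logits over a ~30k vocabulary
teacher = torch.randn(8, 30522)
labels = torch.randint(0, 30522, (8,))
kipt_style_loss(student, teacher, labels).backward()
```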
Sample-Efficient Pre-training Tasks
A pre-training task is sample-efficient if it makes maximal use of each training instance, i.e., if it is defined over all tokens in the instance. Sample-efficient pre-training tasks improve the computational efficiency of pre-training. The most commonly used pre-training task, MLM, is not very sample-efficient: it involves only a subset of tokens, namely the masked tokens, which account for just 15% of the total. Pre-training tasks such as RTD, RTS, and STD can be viewed as early attempts at sample-efficient pre-training tasks. These three tasks are defined over all tokens in each training instance, requiring the model to identify whether each token has been replaced, randomly substituted, or shuffled. Future work should introduce further sample-efficient pre-training tasks that improve computational efficiency.
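The sample-efficiency argument can be illustrated with a toy comparison: MLM produces a training signal only at the masked positions, while an RTS-style objective labels every token. The sentence, corruption rule, and probabilities below are purely illustrative.

```python
# Toy comparison of where the training signal lives: MLM vs. an RTS-style objective.
import random

tokens = "the quick brown fox jumps over the lazy dog".split()

# MLM: supervision exists only where a token was masked (~15% of positions).
mlm_labels = [tok if random.random() < 0.15 else None for tok in tokens]
mlm_signal = sum(label is not None for label in mlm_labels)

# RTS: randomly substitute some tokens; every position gets a binary label
# (was this token substituted or not?).
corrupted = [("cat" if random.random() < 0.15 else tok) for tok in tokens]
rts_labels = [int(new != old) for new, old in zip(corrupted, tokens)]

print(f"MLM positions with a loss: {mlm_signal}/{len(tokens)}")
print(f"RTS positions with a loss: {len(rts_labels)}/{len(tokens)}")
```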
Efficient Models
Because of the large model sizes and the need for vast amounts of unlabeled text data, pre-training T-PTLM is expensive. Moreover, long pre-training runs are not environmentally friendly, since the process emits carbon dioxide, and many fields, such as biomedicine, lack large-scale unlabeled text. Recently, models like DeBERTa, which improves on BERT, have outperformed RoBERTa despite being pre-trained on only 78 GB of data, about half the amount used to pre-train RoBERTa. Similarly, ConvBERT, which employs a novel mixed attention block, outperforms ELECTRA at only a quarter of its pre-training cost. To reduce the data and compute required for pre-training, more efficient models like DeBERTa and ConvBERT are needed.
Better Positional Encoding Mechanisms
Self-attention is permutation-invariant and has no built-in positional bias. Positional bias can be provided with absolute or relative position embeddings, and absolute position embeddings can be either fixed (predetermined) or learned. Both approaches have pros and cons: absolute position embeddings are easy to implement but may have generalization issues, while relative position embeddings handle variations in sequence length robustly but are harder to implement and offer lower performance. We still need new positional encoding mechanisms, such as CAPE, that combine the advantages of absolute and relative position embeddings.
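For reference, the sketch below implements the classic fixed sinusoidal form of absolute position encoding, one of the predetermined variants mentioned above; relative and learned schemes are not shown, and the dimensions are illustrative.

```python
# Fixed (sinusoidal) absolute position encoding, added to token embeddings
# to give the otherwise permutation-invariant self-attention a positional bias.
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                      # even dimensions
    enc[:, 1::2] = np.cos(angles)                      # odd dimensions
    return enc

print(sinusoidal_positions(seq_len=4, d_model=8))
```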
Improving Existing T-PTLM
Models like BERT and RoBERTa have already achieved excellent results on many NLP tasks. Recent research indicates that these models can be improved further by injecting sentence-level semantics through continual pre-training with adversarial or contrastive pre-training tasks. For example, Panda et al. showed that continual pre-training with a shuffled token detection objective improves RoBERTa's performance on GLUE tasks, as it lets the model learn more coherent sentence representations. Similarly, continual pre-training with contrastive objectives improves the performance of T-PTLM on GLUE tasks and of multilingual T-PTLM on the Mickey Probe. Further research is needed to extend this to other monolingual and domain-specific T-PTLM.
Beyond Naive Fine-tuning
Fine-tuning is the most common way to adapt pre-trained models to downstream tasks. Its main drawback is that it alters all layers of the pre-trained model, so a separate copy of the model must be maintained for each task, which increases deployment costs. To adapt pre-trained models to downstream tasks in a parameter-efficient manner, methods such as adapters and pruning-based fine-tuning have been proposed.
For example, adapters are small task-specific layers added to each Transformer layer. During downstream adaptation, only the parameters of the adapter layers are updated while the parameters of the Transformer layers stay frozen. Poth et al. further showed that adapters can also be used for intermediate fine-tuning. More recently, prompt-based tuning methods have drawn the research community's attention by achieving strong performance with far greater parameter efficiency: Prefix-tuning, for instance, requires only about 0.1% task-specific parameters, whereas adapter-based fine-tuning requires around 3%.
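A rough PyTorch sketch of the adapter idea follows: a small bottleneck module with a residual connection sits alongside the frozen Transformer layers, and only its few parameters would be updated during adaptation. The hidden and bottleneck sizes are illustrative assumptions, not a specific published recipe.

```python
# Sketch of an adapter: down-project, non-linearity, up-project, residual connection.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project to a small bottleneck
        self.up = nn.Linear(bottleneck, d_model)     # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual keeps the frozen Transformer's computation intact;
        # only self.down / self.up would be trained during adaptation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
out = adapter(torch.randn(2, 10, 768))               # (batch, seq_len, d_model)
print(out.shape)
```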
Benchmark Evaluation
In recent years, many benchmarks have been introduced to evaluate the progress of general and domain-specific pre-trained models. Besides English, benchmarks for evaluating other monolingual and multilingual models have also emerged. However, existing benchmarks do not cover all scenarios. For example, there are no benchmarks for evaluating a) the progress of compact pre-trained models, b) the robustness of pre-trained models, or c) PTLM developed for specialized domains such as social media and academia.
Recently, leaderboards such as ExplainaBoard have begun to analyze models' strengths and weaknesses rather than relying only on existing benchmarks or a single metric. Such leaderboards should be extended to other fields as well. In addition, benchmarks such as FewGLUE, FLEX, and FewCLUE that evaluate few-shot learning techniques should be extended to other languages and domains.
Compact Models
T-PTLM have achieved state-of-the-art performance on almost every NLP task. However, these models are large and require substantial storage. Because of their many layers, an input takes time to pass through the entire model to produce a prediction, resulting in high latency. Real-world applications have limited resources and demand low latency, so model compression methods such as pruning, quantization, knowledge distillation, parameter sharing, and decomposition have been explored, mostly for general-domain English models.
Researching the application of these model compression methods in other languages and domains has great potential.
Robustness to Noise
T-PTLM are easily affected by noise, both adversarial and natural. The main reason is their use of subword embeddings: a word is decomposed into multiple subword tokens, so even a small spelling error can change the word's overall representation, hindering learning and affecting predictions. To improve robustness to noise, models such as CharacterBERT rely on character-only embeddings, while models such as CharBERT use both character and subword embeddings; both approaches improve robustness to noise.
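The subword-fragility point can be checked directly with a tokenizer; the sketch below assumes the Hugging Face Transformers package and the bert-base-uncased checkpoint, and the typo is an arbitrary example.

```python
# A one-character typo usually changes how a word is split into subwords,
# and therefore changes its representation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("transformer"))    # typically a single known subword
print(tok.tokenize("transfromer"))    # the typo is usually split into several unrelated pieces
```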
Recently, researchers have proposed token-free models like CANINE, ByT5, and Charformer to improve robustness to noise. To enable these models to be applied in the real world, especially in sensitive fields like medicine, we need to enhance their robustness.
Novel Adaptation Methods
To adapt general models to specialized fields such as biomedicine, or multilingual models to specific languages, the common strategy is continual pre-training. Although this works well, downstream performance can suffer when domain- or language-specific vocabulary is missing. Researchers have therefore proposed expanding the vocabulary before continual pre-training. Such methods overcome the out-of-vocabulary (OOV) problem, but at the cost of a larger vocabulary and embedding matrix. Recently, Yao et al. proposed Adapt and Distill, which combines vocabulary expansion with knowledge distillation to adapt general models to specific domains. Unlike existing adaptation methods, it not only adapts the general model to the target domain but also reduces the model's size. This direction deserves further research and has the potential to yield novel adaptation methods.
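The vocabulary-expansion step can be sketched with standard Hugging Face APIs, as below; the added terms and checkpoint are placeholders, and continual pre-training on domain text would follow this step.

```python
# Sketch of vocabulary expansion before continual pre-training on domain text.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_terms = ["myocarditis", "pharmacokinetics"]          # illustrative domain words
num_added = tokenizer.add_tokens(new_terms)
model.resize_token_embeddings(len(tokenizer))            # grow the embedding matrix to match
print(f"Added {num_added} domain tokens; new vocab size: {len(tokenizer)}")
```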
Privacy Issues
T-PTLM have achieved excellent results on many NLP tasks, but they also carry some unexpected risks. Data leakage is a major concern, especially when models are pre-trained on private data. Because the models are pre-trained on large amounts of text, it may be possible to recover sensitive information such as personally identifiable information. It is therefore necessary to guard against the public release of models pre-trained on private data.
Recently, Carlini et al. demonstrated that GPT-2 can generate a person's complete postal address, which appeared in the training data and could be extracted simply by prompting the model with the person's name. In the biomedical domain, the recently proposed KART framework assesses data leakage through various attacks. The research community needs to develop more sophisticated attacks for assessing data leakage, as well as methods to prevent pre-trained models from leaking sensitive data.
Reducing Bias
Deep learning-based methods are increasingly being deployed in the real world, including in specialized fields such as biomedicine and law. However, these models easily learn and amplify biases present in the training data, and as a result may discriminate against particular races, genders, or age groups. Such models are undesirable.
Recently, some research has focused on identifying and reducing bias. For example, Minot et al. proposed a data augmentation method for reducing gender bias, while Liang et al. proposed the A-INLP method, which can dynamically identify bias-sensitive tokens. Further research in this area can help reduce biases in pre-trained models and assist them in making fair decisions.
Reducing Fine-tuning Instability
Fine-tuning is the most common method for adapting pre-trained models to downstream tasks. Although fine-tuning performs well, it is not stable; using different random seeds to perform fine-tuning can lead to significant differences in downstream performance. It is believed that the instability of fine-tuning is due to catastrophic forgetting and the small scale of the dataset. However, Mosbach et al. indicated that neither of these reasons is the cause of fine-tuning instability and further showed that the causes of fine-tuning instability include: a) optimization difficulties leading to gradient vanishing, b) generalization issues. Possible solutions to reduce fine-tuning instability include: a) intermediate fine-tuning, b) mix-out, c) using a smaller learning rate in the early epochs and increasing the number of fine-tuning epochs, d) simultaneously using supervised contrastive loss and cross-entropy loss. Methods to make fine-tuning more stable deserve further research.
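As a small illustration of mitigation (c), the sketch below configures a warmup schedule so the learning rate ramps up gently over the first fine-tuning steps; the model, optimizer settings, and step counts are placeholder assumptions.

```python
# Warmup scheduling for fine-tuning: a small learning rate early, then a linear decay.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_training_steps = 1000                                # placeholder: total fine-tuning steps
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)
# Inside the training loop, call optimizer.step() followed by scheduler.step().
```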
