Overview of Latest Transformer Pre-training Models

Reported by Machine Heart

In today’s NLP field, we can see the success of “Transformer-based Pre-trained Language Models (T-PTLM)” in almost every task. These models originated from GPT and BERT. Their technical foundations include the Transformer, self-supervised learning, and transfer learning. T-PTLM learn universal language representations from large-scale text data using self-supervised learning and then transfer the acquired knowledge to downstream tasks. These models provide high-quality background knowledge for downstream tasks, thus avoiding the need to train downstream models from scratch.
This comprehensive review of T-PTLM first briefly introduces self-supervised learning. Next, it explains several core concepts, including pre-training, pre-training methods, pre-training tasks, embeddings, and downstream task adaptation methods. The article then proposes a new classification method for T-PTLM, followed by a brief introduction to various benchmarks, both intrinsic and extrinsic. The researchers also summarize some software libraries applicable to T-PTLM. Finally, the paper discusses potential future research directions that may help further improve these models.

Paper link: https://arxiv.org/pdf/2108.05542.pdf
The researchers believe that this comprehensive and detailed review paper can serve as a good reference material to help readers understand the core concepts and recent research progress related to T-PTLM.
Introduction
Transformer-based Pre-trained Language Models (T-PTLM) have the ability to learn general language representations from large-scale unlabeled text data and transfer the learned knowledge to downstream tasks, and have thus achieved great success in the NLP field. These models include GPT-1, BERT, XLNet, RoBERTa, ELECTRA, T5, ALBERT, BART, and PEGASUS. Early NLP systems were mostly rule-based and were later replaced by machine learning models. Machine learning models require feature engineering, which in turn requires domain expertise and takes a long time.
With the emergence of better computing hardware such as GPUs and word embedding methods like Word2Vec and GloVe, deep learning models such as CNNs and RNNs have been more widely used in building NLP systems. The main disadvantage of these deep learning models is that, apart from the word embeddings, they need to be trained from scratch. Training such models from scratch requires a large number of labeled instances, which are very expensive to obtain. However, we hope to achieve well-performing models using only a small number of labeled instances.
Transfer learning allows us to effectively reuse the knowledge learned from the source task to the target task, where the target task should be similar to the source task. Based on the idea of transfer learning, researchers in the computer vision field have been using large labeled datasets like ImageNet to train large CNN models. The image representations learned by these models are universal for all tasks. Then, these large pre-trained CNN models can be adapted to downstream tasks by adding a few task-specific layers and fine-tuning on the target dataset. Since pre-trained CNN models can provide good background knowledge for downstream models, they have achieved great success in many computer vision tasks.
Deep learning models such as CNNs and RNNs struggle to model long-range context, and the word representations they learn carry a locality bias. Additionally, since an RNN processes the input sequentially (word by word), it can exploit parallel computing hardware only to a limited extent. To overcome these shortcomings of existing deep learning models, Vaswani et al. proposed a deep learning model based entirely on self-attention: the Transformer. Compared to RNNs, self-attention supports a higher degree of parallelization and can easily model long-range context, as each token in the input sequence attends to all other tokens.
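To make the parallel, all-pairs nature of self-attention concrete, here is a minimal single-head sketch of scaled dot-product self-attention in PyTorch; the tensor shapes and variable names are illustrative, not taken from the paper.
```python
# Minimal sketch of scaled dot-product self-attention (single head),
# illustrating how every token attends to every other token in parallel.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v      # project tokens to queries, keys, values
    scores = q @ k.T / k.shape[-1] ** 0.5    # (seq_len, seq_len): every token vs. every token
    weights = F.softmax(scores, dim=-1)      # attention distribution per token
    return weights @ v                       # context-aware token representations

d_model, d_head, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # shape (5, 8)
```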
The Transformer consists of several stacked encoder and decoder layers. With the help of stacked encoder and decoder layers, the Transformer can learn complex language information. In the NLP field, generating a large amount of labeled data is very costly and time-consuming. However, large amounts of unlabeled text data are readily available. Inspired by the success of using CNN-based pre-trained models in the computer vision community, the NLP research community combined the capabilities of Transformer and self-supervised learning to develop T-PTLM. Self-supervised learning allows the Transformer to learn from pseudo-supervision provided by one or more pre-training tasks.
GPT and BERT are the earliest T-PTLM, developed on top of Transformer decoder and encoder layers, respectively. Subsequently, models such as XLNet, RoBERTa, ELECTRA, ALBERT, T5, BART, and PEGASUS were developed. Among them, XLNet, RoBERTa, ELECTRA, and ALBERT are improved models based on BERT, while T5, BART, and PEGASUS are encoder-decoder based models.
Kaplan et al. demonstrated that simply increasing the scale of T-PTLM models can lead to performance improvements. This finding has driven the development of large-scale T-PTLMs and has led to the emergence of models with hundreds of billions of parameters, such as GPT-3 (175B), PANGU (200B), and GShard (600B), while Switch-Transformers (1.6T) have even reached trillions of parameters.
After achieving success in general English, T-PTLM began to expand into other fields, including finance, law, news, programming, dialogue, the web, academia, and biomedicine. T-PTLM also support transfer learning, allowing these models to be adapted to downstream tasks through fine-tuning and prompt tuning on the target dataset. This article comprehensively reviews recent research achievements related to T-PTLM. The highlights of this review paper are summarized as follows:
  • Section 2 will briefly introduce self-supervised learning, which is the core technology of T-PTLM.

  • Section 3 will introduce some core concepts related to T-PTLM, including pre-training, pre-training methods, pre-training tasks, embeddings, and downstream adaptation methods.

  • Section 4 will provide a new classification method for T-PTLM. This classification considers four aspects: pre-training corpus, architecture, type of self-supervised learning, and expansion methods.

  • Section 5 will provide a new classification for different downstream adaptation methods and explain each category in detail.

  • Section 6 will briefly introduce various benchmarks used to evaluate the progress of T-PTLM, including intrinsic and extrinsic benchmarks.

  • Section 7 will introduce some software libraries applicable to T-PTLM, from Hugging Face Transformers to Transformers-interpret.

  • Section 8 will briefly discuss some future research directions that may help further improve these models.

Self-Supervised Learning (SSL)
The disadvantages of supervised learning are summarized as follows:
  • It heavily relies on human-labeled instances, the acquisition of which is time-consuming and labor-intensive.

  • It lacks generalization ability and is prone to learning spurious correlations.

  • Many fields, such as healthcare and law, lack labeled data, which limits the application of AI models in these areas.

  • It cannot exploit the large amounts of freely available unlabeled data.

SSL shares some similarities with other popular learning paradigms such as supervised learning and unsupervised learning. The similarity between SSL and unsupervised learning is that neither requires human-labeled instances. However, there are also differences: a) SSL needs supervision, which it generates automatically from the data itself, while unsupervised learning involves no supervision at all; b) the goal of unsupervised learning is to identify hidden patterns, while the goal of SSL is to learn meaningful representations. The similarity between SSL and supervised learning is that both learning paradigms require supervision. However, there are also differences: a) SSL generates labels automatically without any human intervention; b) the goal of supervised learning is to provide task-specific knowledge, while the goal of SSL is to provide general knowledge to the model.
The goals of SSL are summarized as follows:
  • To learn universal language representations that can provide excellent background for downstream models.

  • To achieve better generalization by learning from a large amount of freely available unlabeled text data.

Self-supervised learning can be roughly divided into three types: generative SSL, contrastive SSL, and adversarial SSL.
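As an illustration of the contrastive flavor, here is a hedged sketch of an InfoNCE-style objective, where two views of the same sentence are pulled together and the other sentences in the batch act as negatives; this is a generic example, not the exact objective of any particular T-PTLM.
```python
# Sketch of a contrastive SSL objective (InfoNCE-style): embeddings of two
# "views" of the same sentence should be similar; other sentences in the
# batch serve as negatives. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same batch of sentences."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```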
Core Concepts of T-PTLM
Pre-training
Pre-training offers the following advantages:
  • By utilizing a large amount of unlabeled text, pre-training helps the model learn universal language representations.

  • Pre-trained models can adapt to downstream tasks with just a couple of additional task-specific layers. Therefore, this provides a good initialization, avoiding the need to train downstream models from scratch (only training task-specific layers).

  • Allows models to achieve better performance with small datasets, thus reducing the need for a large number of labeled instances.

  • Deep learning models tend to overfit when trained on small datasets due to their large number of parameters. Pre-training can provide a good initialization, thus avoiding overfitting on small datasets, making pre-training a form of regularization.

Steps of Pre-training
Pre-training a model involves the following five steps:
  • Preparing the pre-training corpus

  • Generating the vocabulary

  • Designing pre-training tasks

  • Selecting pre-training methods

  • Selecting pre-training dynamics

Pre-training Corpus

Figure 1: Pre-training Corpus

Figure 2: Pre-training Methods, where PTS is from-scratch pre-training, CPT is continuous pre-training, SPT is simultaneous pre-training, TAPT is task-adaptive pre-training, KIPT is knowledge-inheritance pre-training
Pre-training Tasks
  • Causal Language Modeling (CLM)

  • Masked Language Modeling (MLM)

  • Replaced Token Detection (RTD)

  • Shuffled Token Detection (STD)

  • Random Token Substitution (RTS)

  • Swapped Language Modeling (SLM)

  • Translation Language Modeling (TLM)

  • Alternating Language Modeling (ALM)

  • Span Boundary Objective (SBO)

  • Next Sentence Prediction (NSP)

  • Sentence Order Prediction (SOP)

  • Sequence-to-Sequence Language Modeling (Seq2SeqLM)

  • Denoising Autoencoder (DAE)

Embeddings

Figure 8: Embeddings in T-PTLM
Classification Method
To understand and track the development of various T-PTLM, researchers classified T-PTLM from four aspects: pre-training corpus, model architecture, type of SSL, and expansion methods. As shown in Figure 9:

Figure 9: Classification of T-PTLM.
Downstream Adaptation Methods
Once the language model has been trained, it can be used for downstream tasks. There are three ways to apply the pre-trained language model to downstream tasks: feature-based methods, fine-tuning, and prompt-based tuning.
As shown in Figure 10, feature-based methods generate contextual word embeddings with the language model, which are then used as input features in a separate model built for the specific downstream task. Fine-tuning adjusts the model weights for the downstream task by minimizing the task-specific loss.
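The difference between the first two approaches can be sketched with the Hugging Face Transformers API; the model name and toy inputs below are illustrative assumptions, not taken from the paper.
```python
# Sketch contrasting the feature-based approach with fine-tuning.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("T-PTLM provide good background knowledge.", return_tensors="pt")

# Feature-based: freeze the encoder and feed its contextual embeddings
# to a separate task-specific model.
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)

# Fine-tuning: add a task-specific head and update all weights by
# minimizing the downstream loss (only the head is shown here).
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
logits = classifier(encoder(**inputs).last_hidden_state[:, 0])  # [CLS] token
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()   # gradients flow into the encoder too
```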

Figure 10: Downstream Adaptation Methods.
Evaluation
In the pre-training phase, T-PTLM acquires knowledge encoded in the pre-training corpus. This knowledge includes syntax, semantics, facts, and common sense. There are two evaluation methods for T-PTLM’s performance: intrinsic and extrinsic. See Figure 11.
Intrinsic evaluation is done by probing the knowledge encoded in T-PTLM, while extrinsic evaluation assesses how T-PTLM performs on real-world downstream tasks. Intrinsic evaluation allows us to understand the knowledge T-PTLM has gained during the pre-training phase, which helps us design better pre-training tasks so that the model can learn more knowledge during pre-training.
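A minimal sketch of intrinsic evaluation by probing might look as follows, assuming a frozen encoder and a toy diagnostic task; the sentences, labels, and class count are placeholders.
```python
# Probing sketch: train a small linear classifier on frozen T-PTLM
# representations for a diagnostic property. Data here is a placeholder.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()
probe = torch.nn.Linear(encoder.config.hidden_size, 3)   # 3 placeholder classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

sentences, labels = ["Dogs bark.", "She runs fast.", "Old car."], torch.tensor([0, 1, 2])
with torch.no_grad():                                    # the encoder stays frozen
    enc = tokenizer(sentences, return_tensors="pt", padding=True)
    features = encoder(**enc).last_hidden_state[:, 0]    # [CLS] representations

for _ in range(100):                                     # only the probe is trained
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(features), labels)
    loss.backward()
    optimizer.step()
```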

Figure 11: Benchmarks for Evaluating T-PTLM Research Progress.
Useful Software Libraries
The researchers also summarize some commonly used software libraries applicable to T-PTLM. Among them, libraries like Transformers and Fairseq are suitable for model training and evaluation. SimpleTransformers, HappyTransformer, AdaptNLP, etc., are built on top of the Transformers library, allowing users to train and evaluate models more easily with minimal code. FastSeq, DeepSpeed, FastT5, OnnxT5, and LightSeq can be used to improve the inference speed of models. Ecco, BertViz, and exBERT are visualization tools for exploring the layers of Transformer models. Transformers-interpret and Captum can be used to explain model decisions.
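For instance, a minimal usage sketch of the Transformers library (with an illustrative model name and prompt) is:
```python
# Minimal usage sketch for the Transformers library mentioned above.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Pre-trained language models learn [MASK] representations."))
```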

Table 11: Software Libraries Applicable to T-PTLM.
Discussion and Future Directions
Better Pre-training Methods
Using only SSL to train models (especially large models with trillions of parameters) is very costly. New pre-training methods such as knowledge-inheritance pre-training (KIPT) involve both SSL and knowledge distillation: SSL lets the model learn from the knowledge available in the pre-training corpus, while knowledge distillation lets it learn from the knowledge already encoded in existing pre-trained models. Because the model gains additional knowledge through knowledge distillation during the pre-training phase, a) it converges more quickly, shortening pre-training time, and b) it performs better on downstream tasks than models pre-trained using only SSL. The research community should focus on developing better pre-training methods like KIPT so that models gain more knowledge while pre-training time is reduced.
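A schematic of such a combined objective, assuming an MLM loss plus a temperature-scaled distillation term, could look like the sketch below; this is an illustration of the idea, not the exact KIPT recipe.
```python
# Combine a self-supervised loss (MLM cross-entropy) with a distillation loss
# that pulls the student's predictions toward an existing teacher model.
import torch
import torch.nn.functional as F

def kipt_style_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """student_logits/teacher_logits: (batch, vocab); labels: (batch,) masked-token ids."""
    mlm_loss = F.cross_entropy(student_logits, labels)   # learn from the corpus
    kd_loss = F.kl_div(                                  # learn from the teacher
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * mlm_loss + (1 - alpha) * kd_loss

loss = kipt_style_loss(torch.randn(4, 30522), torch.randn(4, 30522),
                       torch.randint(0, 30522, (4,)))
```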
Sample-Efficient Pre-training Tasks
A pre-training task can be considered sample-efficient if it makes maximal use of each training instance, i.e., if it is defined over all the tokens in the instance. Sample-efficient pre-training tasks improve the computational efficiency of pre-training. The most commonly used pre-training task, MLM, is not very sample-efficient, as it involves only a subset of tokens, namely the masked tokens, which account for just 15% of the total. Tasks like RTD, RTS, and STD can be seen as early attempts at developing sample-efficient pre-training tasks. These three pre-training tasks are defined over all tokens in each training instance: they require identifying whether each token has been replaced, randomly substituted, or shuffled. Future work should continue to develop sample-efficient pre-training tasks that improve computational efficiency.
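The contrast in loss coverage can be sketched as follows; the shapes, vocabulary size, and masking rates are illustrative.
```python
# MLM's loss is computed only on the ~15% masked positions, whereas an
# RTD-style replaced-or-not loss is defined over every token.
import torch
import torch.nn.functional as F

seq_len, vocab = 128, 30522
mlm_mask = torch.rand(seq_len) < 0.15                    # ~15% of positions are masked

# MLM: cross-entropy only where mlm_mask is True
mlm_logits = torch.randn(seq_len, vocab)
mlm_labels = torch.randint(0, vocab, (seq_len,))
mlm_loss = F.cross_entropy(mlm_logits[mlm_mask], mlm_labels[mlm_mask])

# RTD: binary "replaced or not?" decision for every position
rtd_logits = torch.randn(seq_len)
rtd_labels = (torch.rand(seq_len) < 0.15).float()        # which tokens were replaced
rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels)

print(f"MLM supervises {mlm_mask.sum().item()} / {seq_len} positions; RTD supervises all {seq_len}")
```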
Efficient Models
Pre-training T-PTLM is costly because of the large model sizes and the large amounts of unlabeled text required. Long pre-training runs are also not environmentally friendly, since they emit considerable carbon dioxide, and many fields, such as biomedicine, lack large-scale unlabeled text in the first place. Recently, DeBERTa, a newly improved BERT-style model, achieved better performance than RoBERTa despite being pre-trained on only 78 GB of data, roughly half of what was used to pre-train RoBERTa. Similarly, thanks to its new hybrid attention module, ConvBERT outperforms ELECTRA with only a quarter of its pre-training cost. To reduce the amount of pre-training data and the training cost, efficient models like DeBERTa and ConvBERT are needed.
Better Positional Encoding Mechanisms
The self-attention mechanism is permutation-invariant, meaning it has no built-in positional bias. Positional bias can be provided with absolute or relative positional embeddings, and absolute positional embeddings can be either predetermined or learned. Both approaches have pros and cons: absolute positional embeddings are easy to implement but can have generalization issues, while relative positional embeddings handle changes in sequence length more robustly but are harder to implement and can perform worse. New positional encoding mechanisms such as CAPE, which combines the advantages of absolute and relative positional embeddings, are therefore also needed.
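As a reference point, here is a minimal sketch of predetermined absolute positional embeddings (the sinusoidal scheme from the original Transformer), added to token embeddings to inject positional bias; sizes are illustrative.
```python
# Sinusoidal absolute positional embeddings, added to token embeddings so the
# permutation-invariant self-attention receives a positional signal.
import math
import torch

def sinusoidal_positions(seq_len, d_model):
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)    # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

token_embeddings = torch.randn(10, 64)
inputs = token_embeddings + sinusoidal_positions(10, 64)   # inject positional bias
```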
Improving Existing T-PTLM
T-PTLM such as BERT and RoBERTa have already achieved excellent results on many NLP tasks. Recent research has shown that these models can be further improved by injecting sentence-level semantics through continuous pre-training with adversarial or contrastive pre-training tasks. For instance, Panda et al. demonstrated that continuous pre-training with a shuffled token detection objective improves the performance of RoBERTa on the GLUE tasks, as it allows the model to learn more coherent sentence representations. Similarly, continuous pre-training with a contrastive objective improves the performance of T-PTLM on GLUE tasks and of multilingual T-PTLM on the Mickey Probe. Further research is needed to extend this to other monolingual and domain-specific T-PTLM.
Beyond Naive Fine-Tuning
Fine-tuning is the most common method for adapting pre-trained models to downstream tasks. However, the main drawback of naive fine-tuning is that it modifies all layers of the pre-trained model, so a separate copy of the model must be maintained for each task, which increases deployment costs. To adapt pre-trained models to downstream tasks in a parameter-efficient manner, methods such as adapters and pruning-based fine-tuning have been proposed.
For example, an adapter is a small layer added to each Transformer layer for a specific task. During downstream task adaptation, only the parameters of the adapter layer are updated, while the parameters of the Transformer layers remain unchanged. Moreover, Poth et al. demonstrated that adapters can also be used for intermediate fine-tuning. Recently, prompt-based tuning methods have shown significantly better performance in terms of parameter efficiency and have gained attention from the research community. For instance, prompt-based tuning methods like Prefix-tuning require only 0.1% of the task-specific parameters, while adapter-based fine-tuning requires 3% of the task-specific parameters.
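A bottleneck adapter can be sketched as below; the hidden and bottleneck sizes are illustrative, and in practice only the adapter parameters would be optimized while the pre-trained Transformer weights stay frozen.
```python
# Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual
# connection; one such block is inserted into each Transformer layer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))  # residual

adapter = Adapter()
hidden_states = torch.randn(2, 16, 768)   # (batch, seq_len, hidden)
adapted = adapter(hidden_states)
# During adaptation: freeze the pre-trained model, optimize only adapter.parameters().
```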
Benchmarking
Over the last few years, numerous benchmarks have been introduced to assess the progress of general-purpose and domain-specific pre-trained models. In addition to English, benchmarks for evaluating the progress of other monolingual and multilingual models have also emerged. However, existing benchmarks do not cover all scenarios. For example, there are no benchmarks to evaluate a) the progress of compact pre-trained models, b) the robustness of pre-trained models, and c) pre-trained models developed for specialized fields such as social media and academia.
Recently, leaderboards like ExplainaBoard have moved beyond using existing benchmarks as a single aggregate indicator of progress and instead analyze models’ strengths and weaknesses in depth. Such leaderboards should be extended to other fields as well. Additionally, benchmarks like FewGLUE, FLEX, and FewCLUE, which evaluate few-shot learning techniques, should be extended to other languages and domains.
Compact Models
T-PTLM has achieved the best performance on almost every NLP task. However, these models are large and require substantial storage. Because they have many layers, an input takes time to pass through the whole model to produce a prediction, leading to high latency. Real-world applications have limited resources and require low latency, so model compression methods such as pruning, quantization, knowledge distillation, parameter sharing, and decomposition have been explored, mainly for the general English domain. Applying these model compression methods to other languages and fields holds great promise.
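As one example of these techniques, post-training dynamic quantization of a model's linear layers can be sketched with PyTorch; the model name is an assumption, and the accuracy/latency trade-off would need to be measured per task.
```python
# Post-training dynamic quantization: store Linear-layer weights in int8
# to shrink the model and speed up CPU inference.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8   # int8 weights for all Linear layers
)
```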
Robustness to Noise
T-PTLM is susceptible to noise, including adversarial noise and natural noise. The main reason is the use of subword embeddings. When using subword embeddings, a word is decomposed into multiple subword tokens, so even a small spelling error can change the overall representation of the word, hindering model learning and affecting model predictions. To enhance the model’s robustness to noise, models like CharacterBERT have adopted methods that use only character embeddings, while models like CharBERT combine character embeddings with subword embeddings. Both methods can improve robustness to noise.
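The effect of a typo on subword segmentation can be illustrated as below; the exact splits depend on the tokenizer's learned vocabulary, so the output is not asserted here.
```python
# A one-character typo can change the subword segmentation of a word,
# and hence its representation in a subword-based T-PTLM.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("language"))   # clean spelling
print(tokenizer.tokenize("lnaguage"))   # typo: typically a different subword split
```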
Recently, researchers have proposed token-free models like CANINE, ByT5, and Charformer to enhance robustness to noise. To enable these models for real-world applications, especially in sensitive areas like healthcare, we need to enhance their robustness.
New Adaptation Methods
To adapt general models to specialized fields like biomedicine, or to adapt multilingual models to specific languages, a common strategy is continuous pre-training. Although this approach yields good results, downstream performance can suffer when domain- or language-specific vocabulary is missing. Recently, researchers have proposed expanding the vocabulary first and then continuing pre-training. These methods overcome the out-of-vocabulary (OOV) problem, but they increase the vocabulary size, and hence the model size, because of the added words. Recently, Yao et al. proposed the Adapt and Distill method, which uses vocabulary expansion and knowledge distillation to adapt general models to specific domains. Unlike existing adaptation methods, this approach not only adapts general models to specific domains but also reduces model size. It deserves further research and is expected to inspire new adaptation methods.
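A hedged sketch of the vocabulary-expansion step with the Transformers API might look like this, using placeholder domain terms; it is not the full Adapt and Distill method, whose distillation stage is omitted.
```python
# Vocabulary expansion before continued pre-training: add domain-specific
# terms to the tokenizer and resize the model's embedding matrix.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

new_domain_terms = ["nephropathy", "angiotensin"]   # example biomedical terms
num_added = tokenizer.add_tokens(new_domain_terms)
model.resize_token_embeddings(len(tokenizer))       # grow the embedding matrix
# ...then continue pre-training on in-domain text so the new embeddings are learned.
```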
Privacy Issues
T-PTLM has achieved excellent results on many NLP tasks. However, these models also pose unexpected risks. For example, data leakage is a major concern, especially when the models are pre-trained on private data. Since the models are pre-trained on large amounts of text, it may be possible to recover sensitive information from them, such as personally identifiable information. Therefore, the public release of models trained on private data needs to be prevented.
Recently, Carlini et al. showed that the GPT-2 model can generate a person’s complete postal address, which appears in its training data, when prompted with that person’s name. The recently proposed KART framework in the biomedical field can assess data leakage through various attacks. The research community needs to develop more sophisticated attacks for evaluating data leakage, as well as methods to prevent pre-trained models from leaking sensitive data.
Reducing Bias
Deep learning methods are increasingly being applied in the real world, including in specialized fields such as biomedicine and law. However, these models can easily learn and amplify biases present in the training data. As a result, they may exhibit bias against specific races, genders, or age groups. Such models are undesirable.
Recently, some research has focused on identifying and reducing bias. For example, Minot et al. proposed a data augmentation method to reduce gender bias, while Liang et al. proposed the A-INLP method, which can dynamically identify bias-sensitive tokens. Further research in this area can help reduce bias in pre-trained models and assist them in making fair decisions.
Reducing Fine-Tuning Instability
Fine-tuning is the most common method for adapting pre-trained models to downstream tasks. Although fine-tuning performs well, it is unstable: fine-tuning with different random seeds can lead to significant differences in downstream performance. Fine-tuning instability has been attributed to catastrophic forgetting and the small size of fine-tuning datasets. However, Mosbach et al. demonstrated that neither of these is the actual cause and instead pointed to a) optimization difficulties that lead to vanishing gradients and b) generalization issues. Possible ways to reduce fine-tuning instability include a) intermediate fine-tuning, b) mixout, c) using a smaller learning rate in earlier epochs and increasing the number of fine-tuning epochs, and d) combining a supervised contrastive loss with the cross-entropy loss. Methods to stabilize fine-tuning deserve further research.
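As one illustration, remedy c) can be sketched as a fine-tuning configuration with a small learning rate, warmup, and more epochs; the model name and hyperparameters below are illustrative assumptions.
```python
# Stabilization sketch: small learning rate with warmup and more epochs
# than the usual short fine-tuning schedule.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # small learning rate
num_epochs, steps_per_epoch = 20, 100                        # more epochs than the usual 3
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps
)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```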
