Where Has BERT Gone? Insights on the Shift in LLM Paradigms


Reprinted from | Machine Heart
Editor | Panda
Where have the encoder models gone? If BERT worked so well, why not scale it up? And what about encoder-decoder and encoder-only models?
In the field of large language models (LLMs), decoder-only models (such as the GPT series) currently dominate. So how have encoder-decoder and encoder-only models fared? And why has the once-famous BERT gradually faded from attention?
Recently, Yi Tay, the chief scientist and co-founder of the AI startup Reka, published a blog post sharing his views. Before co-founding Reka, Yi Tay worked at Google Research and Google Brain for over three years, participating in the development of famous LLMs such as PaLM, UL2, Flan-2, Bard, as well as multimodal models like PaLI-X and ViT-22B. Below is the content of his blog post.

Basic Introduction

Overall, the LLM architectures of the past few years fall mainly into three paradigms: encoder-only models (like BERT), encoder-decoder models (like T5), and decoder-only models (like the GPT series). People often confuse these paradigms and misunderstand both the classification and the underlying architectures.
One key point to understand is that encoder-decoder models are still autoregressive models. The decoder in an encoder-decoder model is literally and fundamentally a causal decoder. Instead of pre-filling a decoder-only model, some of the text can be offloaded to the encoder, which the decoder then reads through cross-attention. Yes, the T5 model is also a language model!
A variant of this type of model is the Prefix Language Model (PrefixLM), which works in almost the same way minus the cross-attention (and some other small details, such as shared weights between encoder and decoder and the absence of an encoder bottleneck). PrefixLM is sometimes called a non-causal decoder. In short, encoder-decoder models, decoder-only models, and PrefixLMs are not that different overall!
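To make these relationships concrete, here is a minimal sketch (plain NumPy, an illustration rather than any specific implementation) of the attention masks that separate a causal decoder from a PrefixLM; an encoder-decoder instead moves the fully visible portion into a separate encoder that the decoder reads through cross-attention.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only (causal) mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """PrefixLM ('non-causal decoder') mask: the prefix attends to itself
    bidirectionally, while the remaining (target) positions stay causal."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full visibility within the prefix
    return mask

# 6 positions, of which the first 3 are the "input" prefix.
print(causal_mask(6).astype(int))
print(prefix_lm_mask(6, prefix_len=3).astype(int))
```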
In a recent excellent lecture, Hyung Won skillfully explained the relationships between these models. For details, refer to the Machine Heart report: “What Will Be the Main Driving Force Behind AI Research? ChatGPT Team Research Scientist: The Cost of Computing Power Will Decrease”.
At the same time, the denoising method of encoder-only models like BERT is different (i.e., in-place); and to some extent, encoder-only models need to rely on a classification “task” head to truly function after pre-training. Later, models like T5 adopted a “modified” denoising objective, which uses a sequence-to-sequence format.
For this reason, it is worth pointing out that the denoising in T5 is not a new objective function (in the machine learning sense) but a data transformation applied to the inputs; you can just as well train a causal decoder on a span corruption objective.
People tend to assume that encoder-decoder models must be denoising models, partly because T5 is such a representative example. But that is not always the case. You can train an encoder-decoder model with a standard language modeling task (i.e., causal language modeling), and conversely, you can train a causal decoder on a span corruption task. As I said earlier, this is essentially a data transformation.
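As a hedged illustration of the "data transformation" point, the toy sketch below applies a T5-style span corruption to a token list and shows that the resulting (inputs, targets) pair can feed either an encoder-decoder or a causal decoder (by simple concatenation, with the loss masked on the input part). The sentinel naming and corruption probability are illustrative, not the exact T5 recipe.

```python
import random

def span_corrupt(tokens, corrupt_prob=0.3, span_len=2, seed=0):
    """Toy T5-style span corruption: replace random spans with sentinel tokens.
    Returns (inputs, targets) -- a data transformation, not a new loss function."""
    rng = random.Random(seed)
    inputs, targets, i, sentinel_id = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_prob:
            span = tokens[i:i + span_len]
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.extend([sentinel] + span)
            sentinel_id += 1
            i += len(span)
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = span_corrupt(tokens)
print(inputs)   # e.g. ['the', 'quick', 'brown', '<extra_id_0>', 'over', ...]
print(targets)  # e.g. ['<extra_id_0>', 'fox', 'jumps']

# Encoder-decoder: encode `inputs`, decode `targets`.
# Causal decoder: concatenate and predict left to right (loss only on `targets`).
decoder_only_sequence = inputs + ["<sep>"] + targets
print(decoder_only_sequence)
```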
Another point worth noting: generally speaking, an encoder-decoder with 2N parameters has the same computational cost as a decoder-only model with N parameters, which gives the two a different FLOP-to-parameter ratio. This is like “model sparsity” that is split between the input and the target.
This is not something new, nor did I come up with it myself. It was mentioned in the 2019 T5 paper and reiterated in the UL2 paper.
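A back-of-the-envelope check of this claim, under the common "~2 x parameters FLOPs per token" approximation (an assumption here, ignoring attention terms): a decoder-only model with N parameters processes the whole concatenated sequence, while an encoder-decoder with 2N total parameters splits the same tokens between an N-parameter encoder and an N-parameter decoder, so the compute per example comes out about the same.

```python
# Rough FLOPs comparison, assuming ~2 * params FLOPs per token per forward pass
# (attention and cross-attention terms are ignored in this sketch).
N = 1e9            # parameters of the decoder-only model
input_len = 768    # tokens going to the encoder (or forming the prefix)
target_len = 256   # tokens handled autoregressively

flops_decoder_only = 2 * N * (input_len + target_len)    # N params, full sequence
flops_enc_dec = 2 * N * input_len + 2 * N * target_len   # 2N params, split sequence

print(flops_decoder_only == flops_enc_dec)  # True: same compute, twice the parameters
```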
Anyway, I am glad to have clarified that point. Now, let’s talk about the objectives.

About the Denoising Objective (Does It Not Work? Does It Not Scale? Or Is It Just Too Easy?)

The denoising objective here refers to any variant of the “span corruption” task. It is sometimes called “infilling” or “fill-in-the-blank”. There are many ways to vary it (span length, randomness, sentinel tokens, and so on), but you get the idea.
Whereas the denoising objective of BERT-style models is basically in-place (e.g., a classification head sits on top of the masked tokens), the slightly more modern “T5 style” works through a data transformation that can be processed by an encoder-decoder or a decoder-only model. In such a transformation, the masked tokens are simply “shifted to the back” for the model to predict.
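For concreteness, here is a toy contrast between the two styles (illustrative token strings, not tied to any particular tokenizer):

```python
# BERT-style (in-place): predictions are made at the masked positions themselves.
bert_input  = "The [MASK] sat on the [MASK]".split()
bert_labels = {1: "cat", 5: "mat"}        # position -> token predicted in place

# T5-style (shifted to the back): the masked spans are regenerated as a target
# sequence, which an autoregressive decoder can produce token by token.
t5_input  = "The <extra_id_0> sat on the <extra_id_1>"
t5_target = "<extra_id_0> cat <extra_id_1> mat"
```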
The main goal of pre-training is to build internal representations aligned with downstream tasks as efficiently and effectively as possible. The better this internal representation, the easier it is to apply these learned representations to subsequent tasks. We all know that simple next-word prediction “causal language modeling” objectives perform excellently and have become central to the LLM revolution. Now, the question is whether the denoising objective is equally excellent.
According to public information, we know that T5-11B performs quite well, even after alignment and supervised fine-tuning (the MMLU score of Flan-T5 XXL is 55+, which was quite good for a model of this scale at that time). Therefore, we can conclude that the transfer process of the denoising objective (pre-training → alignment) works relatively well at this scale.
In my view, the denoising objective works well, but it is not sufficient as a standalone objective. A major drawback is the reduced “loss exposure”: in the denoising objective, only a small fraction of the tokens are masked and learned from (i.e., counted in the loss), whereas in regular language modeling this is close to 100%. This makes the sample efficiency per FLOP quite low, putting the denoising objective at a significant disadvantage in FLOP-based comparisons.
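A rough illustration of that FLOP-efficiency gap (the 15% corruption rate is a typical setting, assumed here rather than taken from any particular paper):

```python
seq_len = 1024
corruption_rate = 0.15                                     # typical span-corruption setting

tokens_in_loss_denoising = int(seq_len * corruption_rate)  # ~153 positions in the loss
tokens_in_loss_causal_lm = seq_len - 1                     # 1023 positions in the loss

print(tokens_in_loss_causal_lm / tokens_in_loss_denoising) # ~6.7x more loss signal per pass
```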
Another disadvantage of the denoising objective is that it is less natural than regular language modeling, since it reformats the input/output in a somewhat strange way, making it less suitable for few-shot learning. (That said, such models can still be tuned to perform reasonably well on few-shot tasks.) Therefore, I believe the denoising objective should serve only as a supplementary objective to regular language modeling.

Early Unification, and Why BERT-like Models Faded Away

BERT-like models are gradually disappearing, and few people talk about them anymore. This also explains why we no longer see large-scale BERT models. The reason? It largely comes down to the unification and shift of task/modeling paradigms. BERT-style models are clunky, but the real reason they were abandoned is that people wanted to do all tasks at once, which led to a better way of doing denoising: with autoregressive models.
During 2018 to 2021, a paradigm shift took place from single-task fine-tuning to large-scale multi-task models, which gradually led us to the unified SFT models, the general-purpose models we see today. This was hard to do with BERT. I do not think this has much to do with “denoising” itself. For those who still wanted such a denoising model, a way was found to re-express the denoising pre-training task (i.e., T5), which has left BERT-style models essentially deprecated, since strictly better alternatives now exist.
More specifically, encoder-decoder and decoder-only models can be used for many tasks without task-specific classification heads. For encoder-decoders, researchers and engineers began to find that forgoing the encoder performs about as well as a BERT encoder. Moreover, this retains the benefit of bidirectional attention, the very advantage that made BERT competitive with GPT at small (often production) scale.

The Value of the Denoising Objective

The denoising pre-training objective can also learn to predict the next word in a manner similar to regular language modeling. However, unlike regular causal language modeling, this requires a data transformation on the sequence so that the model can learn “fill-in-the-blank” rather than simply predicting natural text from left to right.
It is worth noting that the denoising objective is sometimes called the “infilling task”, and it may be mixed with regular language modeling tasks during pre-training.
While exact configurations and implementation details vary, today’s modern LLMs may combine language modeling and infilling to some extent. Interestingly, this “language model + infilling” mixture seems to have propagated around roughly the same time (e.g., UL2, FIM, GLM, CM3), with many teams bringing their own flavor of the mixture. As an aside, the largest model known to be trained this way is likely PaLM-2.
It should also be noted that pre-training task mixtures can be stacked sequentially rather than mixed at the same time. For example, Flan-T5 was initially trained on 1T span corruption tokens, then switched to 100B tokens of a prefix language modeling objective, before undergoing flan instruction fine-tuning. To some extent, this is a model trained on a mixed denoising/LM objective. To be clear, the prefix language modeling objective (not to be confused with the architecture) is simply causal language modeling with a randomly determined split point sent to the input side (with no loss and non-causal masking applied there).
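A minimal sketch of the prefix language modeling objective as described above (illustrative; it reuses the prefix-mask idea from the earlier sketch): pick a random split point, give the prefix non-causal attention, and compute the loss only on the continuation.

```python
import random
import numpy as np

def prefix_lm_example(tokens, seed=0):
    """Prefix LM *objective* (not the architecture): causal LM with a random
    split point; the prefix gets non-causal (bidirectional) attention and
    contributes nothing to the loss."""
    rng = random.Random(seed)
    n = len(tokens)
    split = rng.randint(1, n - 1)

    attn = np.tril(np.ones((n, n), dtype=bool))  # causal by default
    attn[:split, :split] = True                  # bidirectional over the prefix

    loss_mask = np.arange(n) >= split            # loss only on the continuation
    return split, attn, loss_mask

tokens = "the quick brown fox jumps over the lazy dog".split()
split, attn, loss_mask = prefix_lm_example(tokens)
print(split, loss_mask.astype(int))
```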
As an aside, infilling may have originated in the code LLM world, where “fill-in-the-blank” is more of a feature needed for writing code. Meanwhile, UL2’s motivation was more to unify the denoising objective and the task classes that bidirectional LLMs excel at with inherently generative tasks (such as summarization or open-ended generation). The advantage of this “shift to the back” style of autoregressive decoding is that it not only lets the model learn longer-range dependencies but also lets it implicitly benefit from non-explicit bidirectional attention (because in order to fill in the blank, you have already seen the future).
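For the code case, one widely used formulation is the "fill in the middle" rearrangement, sketched below with placeholder tag names: the middle span is moved to the end so a plain causal decoder can condition on both the prefix and the suffix before generating it.

```python
def fim_transform(text: str, start: int, end: int) -> str:
    """Toy 'fill in the middle' rearrangement: move the middle span to the end,
    so a causal decoder sees prefix and suffix before generating the middle."""
    prefix, middle, suffix = text[:start], text[start:end], text[end:]
    return f"<PRE>{prefix}<SUF>{suffix}<MID>{middle}"

code = "def add(a, b):\n    return a + b\n"
start = code.index("a + b")
end = start + len("a + b")
print(repr(fim_transform(code, start, end)))
# '<PRE>def add(a, b):\n    return <SUF>\n<MID>a + b'
```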
Anecdotally, representations learned with denoising objectives perform better on certain task classes and sometimes come with higher sample efficiency. In the U-PaLM paper, we showed how a small amount of span corruption up-training changed behavior and emergence phenomena on a set of BIG-Bench tasks. Building on this, fine-tuning models trained with this objective usually yields better supervised fine-tuned models, especially at smaller scales.
In single-task fine-tuning, we can see that the PaLM-1 62B model was outperformed by a much smaller T5 model. At relatively smaller scales, “bidirectional attention + denoising objective” is a beautiful combination! I believe many practitioners have also noticed this, especially in production applications.

How About Bidirectional Attention?

For language models, bidirectional attention is an interesting “inductive bias”, one that people often conflate with objectives and model backbones. The usefulness of an inductive bias varies across compute regimes and can have different effects on the scaling curve in different regimes. That said, bidirectional attention may matter less at larger scales than at smaller ones, or it may have different effects on different tasks or modalities. For example, PaliGemma uses the PrefixLM architecture.
Hyung Won also pointed out in his talk that PrefixLM models (decoder-only models with bidirectional attention) also have caching issues, which is an inherent flaw of this architecture. However, I believe there are many ways to address this flaw, but that is beyond the scope of this article.

Advantages and Disadvantages of Encoder-Decoder Architecture

Compared to decoder-only models, encoder-decoder architectures have both advantages and disadvantages. One advantage is that the encoder side is not limited by causal masking. To some extent, you can unleash creativity at the attention layer, executing pooling or any form of linear attention without worrying about the limitations of autoregressive design. This is a good way to offload less important “context” to the encoder. You can also make the encoder smaller, which is an advantage.
A good example of putting the encoder to work is Charformer, which makes bold use of the encoder to mitigate the speed disadvantages of byte-level models. Innovating on the encoder side can yield quick wins without having to worry about the major drawbacks of causal masking.
At the same time, a disadvantage of the encoder-decoder relative to PrefixLM is that the input and target must each be allotted a fixed budget. For example, if the input budget is 1024 tokens, the encoder side must be padded up to that length, which can waste a lot of computation. In PrefixLM, by contrast, the input and target can simply be concatenated, alleviating this issue.
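A small sketch of the budget point (all lengths are made up for illustration): with fixed encoder/decoder budgets the full allocation is processed even for short examples, whereas a PrefixLM simply concatenates input and target.

```python
encoder_budget, decoder_budget = 1024, 256   # fixed allocation (illustrative)
input_len, target_len = 100, 50              # actual lengths of one example

# Encoder-decoder: the fixed budgets are processed regardless of actual lengths.
enc_dec_positions = encoder_budget + decoder_budget                            # 1280
wasted_padding = (encoder_budget - input_len) + (decoder_budget - target_len)  # 1130

# PrefixLM: input and target are concatenated, so only real tokens are processed.
prefix_lm_positions = input_len + target_len                                   # 150

print(enc_dec_positions, wasted_padding, prefix_lm_positions)
```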

Relevance to Today’s Models and Key Points

In today’s era, a key ability to become a qualified LLM researcher and practitioner is the ability to infer inductive bias from both architectural and pre-training perspectives. Understanding the subtle differences can help people extrapolate and innovate continuously.
Here are my key points:
  1. Encoder-decoder and decoder-only models are both autoregressive models, differing in implementation and having their respective advantages and disadvantages. They represent slightly different inductive biases. The choice between them depends on downstream use cases and application constraints. Meanwhile, for most LLM use cases and niche cases, BERT-style encoder models can be considered outdated.
  2. The denoising objective mainly serves as a supplement to causal language modeling. They have been successfully used as “support objectives” during the training phase. Using denoising objectives to train causal language models usually brings some level of benefit. While this is quite common in the code model field (i.e., code filling), for today’s general models, using causal language modeling with some denoising objective for pre-training is also quite common.
  3. Bidirectional attention can greatly benefit smaller models but may be optional for larger ones. This is mostly anecdotal. I see bidirectional attention as a kind of inductive bias, similar to many other modifications made to Transformer models.
Finally, to summarize: today, no one is running scaled-up BERT models; BERT-style models have been deprecated in favor of the more flexible T5-style denoising (autoregressive) models. This is primarily due to paradigm unification: people prefer a single general model that can perform all sorts of tasks over task-specific models. At the same time, autoregressive denoising sometimes serves as a secondary objective for causal language models.
Original link: https://www.yitay.net/blog/model-architecture-blogpost-encoders-prefixlm-denoising