Google & Hugging Face: The Most Powerful Language Model Architecture for Zero-Shot Learning

Data Digest authorized reprint from Xi Xiaoyao’s Cute Selling House

Author: iven

From GPT-3 to prompting, more and more people have found that large models perform remarkably well in zero-shot settings, raising expectations for the arrival of AGI.

However, one thing is puzzling: back in 2019, T5 found through what amounted to "hyperparameter tuning" over design choices that, when designing pre-trained models, the encoder-decoder structure + MLM objective gave the best fine-tuning results on downstream tasks. Yet as of 2022, the mainstream large models are all decoder-only, such as OpenAI's GPT series, Google's PaLM [1], and DeepMind's Chinchilla [2]. Why is this? Could the design of these large models be flawed?

Today, we bring an article from Hugging Face and Google. It takes an experimental approach similar to T5's and, through an extensive set of controlled comparisons, reaches a notable conclusion: if the goal is to maximize zero-shot generalization, the decoder-only structure + language modeling objective is best; if multitask fine-tuning is also applied, the encoder-decoder structure + MLM objective is best.

Beyond identifying the best training recipe, the authors also found the most cost-effective one through extensive experiments: it requires only about one-ninth of the compute of training directly!

Paper Title: What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Paper Link: https://arxiv.org/abs/2204.05832

Model Design

[Figure: the four model design dimensions: architecture, training objective, adaptation, and multitask fine-tuning]

As shown in the figure, model design can be broken down into four questions: which architecture to use, which training objective, whether to perform adaptation, and whether to apply multitask fine-tuning. The paper also evaluates on two benchmarks.

Model Structure

The model structures are based on transformers, with three options as shown in the figure:

[Figure: the three architectures: causal decoder-only, non-causal decoder-only, and encoder-decoder]

  1. Causal decoder-only (CD): Uses only the transformer decoder with a fully causal attention mask. Most models of this type are trained with the language modeling objective, predicting each token from its preceding context. Representative works include the GPT series.

  2. Non-causal decoder-only (ND): To condition generation on a given input, the attention mask is relaxed so that the tokens in a prefix region are fully visible to one another during training, while the rest of the sequence remains causal.

  3. Encoder-decoder (ED): The original transformer structure: the encoder maps the input sequence to a same-length sequence of vector representations, and the decoder generates autoregressively conditioned on the encoder output.

To summarize: CD is a decoder with a fully causal mask, ND is a decoder with a fully visible (non-causal) prefix, and ED is the encoder-decoder; these abbreviations are used below. A small sketch of the corresponding attention patterns follows.
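To make the distinction concrete, here is a minimal NumPy sketch (ours, not the paper's) of the attention patterns behind the three architectures; the sequence and prefix lengths are arbitrary illustrative values.

```python
import numpy as np

def causal_mask(seq_len):
    """CD: each token attends only to itself and earlier tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def non_causal_mask(seq_len, prefix_len):
    """ND: tokens inside the prefix attend to the whole prefix (bidirectional);
    everything after the prefix stays causal."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # make the prefix fully visible
    return mask

def encoder_decoder_masks(src_len, tgt_len):
    """ED: bidirectional self-attention over the input, causal self-attention
    over the output, plus cross-attention from decoder to all encoder states."""
    enc_self = np.ones((src_len, src_len), dtype=bool)
    dec_self = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    cross = np.ones((tgt_len, src_len), dtype=bool)
    return enc_self, dec_self, cross

print(causal_mask(4).astype(int))
print(non_causal_mask(4, prefix_len=2).astype(int))
```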

Training Objective

Corresponding to the model structures, there are also three types of training objectives:

[Figure: the three training objectives: full, prefix, and masked language modeling]

  1. Full language modeling (FLM): Commonly used with CD architectures; the model predicts each token from its preceding context. During training the loss at every position can be computed in parallel, while generation at inference time proceeds token by token.

  2. Prefix language modeling (PLM): Usable with ND and ED architectures. A prefix region is defined in the attention mask, and the model is trained to generate the tokens that follow the prefix.

  3. Masked language modeling (MLM): Commonly used by encoder-only models. Later, T5, a seq2seq model, adopted a span-masking variant of MLM in which the masked spans are generated by the decoder.

To summarize: FLM is the standard language modeling objective, PLM is the prefix language modeling objective, and MLM is the masking objective; these abbreviations are also used below. A small sketch of the three objectives follows.
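As a rough illustration (a simplified sketch, not the paper's implementation), the three objectives differ mainly in how the input is presented and which positions contribute to the loss; T5-style span masking is heavily simplified here.

```python
import random

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# FLM: predict every token from its left context; the loss covers all positions.
flm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# PLM: the first `prefix_len` tokens form a fully visible prefix;
# the loss covers only the continuation.
prefix_len = 3
plm_pairs = [(tokens[:i], tokens[i]) for i in range(prefix_len, len(tokens))]

# MLM (T5-style, heavily simplified): replace a random span with a sentinel
# token and train the model to reconstruct the masked-out span.
masked = list(tokens)
start = random.randrange(len(tokens) - 1)
span = masked[start:start + 2]
masked[start:start + 2] = ["<extra_id_0>"]
mlm_input, mlm_target = masked, ["<extra_id_0>"] + span

print("FLM:", flm_pairs[:2])
print("PLM:", plm_pairs)
print("MLM:", mlm_input, "->", mlm_target)
```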

Adaptation Tasks

Adaptation means continuing training with a new training objective after pre-training. Unlike fine-tuning, adaptation does not use data from downstream tasks; it keeps using the pre-training data. The paper considers two kinds of adaptation (a rough sketch of both follows the list).

  1. Language modeling adaptation (LM-A): Pre-train with MLM, then continue training with PLM or FLM. MLM + FLM is the recipe used with T5, while MLM + PLM was used by earlier continuous prompt-tuning methods such as prefix-tuning.

  2. Non-causal MLM adaptation (NC-A): Pre-train with FLM as a causal decoder, then continue training with MLM under a non-causal (prefix-visible) attention mask. This method is proposed for the first time in this paper: a prefix in front of the decoder is made fully visible, which is conceptually similar to giving a GPT-style model the prefix conditioning of prefix-tuning.
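A minimal sketch of the two adaptation recipes, assuming illustrative stub names (`train`, `pretraining_corpus`, and the step counts are ours, not the paper's); the point is only that the same pre-training data is reused while the objective and attention mask change.

```python
def train(model, corpus, objective, mask, steps):
    """Illustrative stub: continue training `model` on the pre-training corpus
    with the given objective ('FLM', 'PLM', 'MLM') and mask ('causal', 'non_causal')."""
    print(f"{steps:>7} steps | objective={objective:<3} | mask={mask}")
    return model

pretraining_corpus = "large web-text corpus"  # same data before and after adaptation
model = object()                              # stand-in for a transformer model

# Language modeling adaptation (LM-A): MLM pre-training, then an LM objective.
model = train(model, pretraining_corpus, objective="MLM", mask="non_causal", steps=160_000)
model = train(model, pretraining_corpus, objective="FLM", mask="causal", steps=40_000)

# Non-causal MLM adaptation (NC-A): FLM pre-training as a causal decoder,
# then MLM with a prefix-visible (non-causal) mask.
model = train(model, pretraining_corpus, objective="FLM", mask="causal", steps=160_000)
model = train(model, pretraining_corpus, objective="MLM", mask="non_causal", steps=40_000)
```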

Multitask Fine-tuning

[Figure: multitask prompted fine-tuning]

Multitask fine-tuning (MT-F) comes from Hugging Face's work in late 2021 [3]: a pre-trained model is fine-tuned simultaneously on 171 tasks, each rendered through prompts. This significantly improves the pre-trained model's zero-shot ability.
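A toy sketch of the idea (the task names, templates, and data here are made up, not the actual prompt collection used in that work): each supervised example is rendered as a text-to-text pair via a prompt template, and training batches mix examples from many tasks.

```python
import random

# Made-up templates in the spirit of prompted multitask training;
# the real work draws many templates per task from a large prompt collection.
TEMPLATES = {
    "nli": lambda ex: (f"Premise: {ex['premise']} Hypothesis: {ex['hypothesis']} "
                       f"Does the premise entail the hypothesis?", ex["label"]),
    "sentiment": lambda ex: (f"Review: {ex['text']} Is this review positive or negative?",
                             ex["label"]),
}

DATASETS = {
    "nli": [{"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "yes"}],
    "sentiment": [{"text": "Great movie!", "label": "positive"}],
}

def sample_prompted_batch(batch_size=2):
    """Mix examples across tasks, each as an (input text, target text) pair."""
    batch = []
    for _ in range(batch_size):
        task = random.choice(list(DATASETS))
        example = random.choice(DATASETS[task])
        batch.append(TEMPLATES[task](example))
    return batch

# The model is then fine-tuned to generate each target given its prompt,
# with all tasks trained at the same time.
for prompt, target in sample_prompted_batch():
    print(prompt, "->", target)
```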

Experiments and Conclusions

Evaluation Tasks

This article used two benchmarks:

  1. EleutherAI LM Evaluation Harness (EAI-Eval): A benchmark suite used to evaluate the zero-shot ability of language models (in particular, models trained with the FLM objective).

  2. T0’s test set (T0-Eval): This is the test set previously used in Hugging Face’s multitask fine-tuning work.

Both benchmarks are evaluated with prompts: a prompt is constructed, fed to the pre-trained model, and the model's prediction is read off its output. The difference between the two is that EAI-Eval provides only one prompt per task, so its results are sensitive to prompt wording. In this paper's evaluation, the authors designed multiple prompts per task to mitigate this prompt-induced variance.
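For concreteness, a common way to run such prompted zero-shot evaluation is to score each candidate answer with the language model and pick the most likely one; the sketch below uses a dummy scoring function as a stand-in for the model.

```python
def log_likelihood(prompt, continuation):
    """Placeholder: a real harness would return the model's log-probability
    of `continuation` given `prompt`."""
    return -len(continuation)  # dummy score so the example runs

def zero_shot_predict(prompt, choices):
    """Rank the answer choices by model likelihood and return the best one."""
    scores = {c: log_likelihood(prompt, c) for c in choices}
    return max(scores, key=scores.get)

prompt = ("Premise: A dog runs. Hypothesis: An animal moves. "
          "Does the premise entail the hypothesis?")
print(zero_shot_predict(prompt, choices=["yes", "no"]))
```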

Conclusions

The experiments yielded the following conclusions:

  1. With unsupervised pre-training only:

The CD model structure + FLM training objective = the best zero-shot model.

[Figure: zero-shot results after unsupervised pre-training only]

This matches current practice: today's large models all use this combination, which gives the best zero-shot generalization when only unsupervised pre-training is done.

  2. With multitask fine-tuning after pre-training:

The ED model structure + MLM training objective = the best zero-shot model.

[Figure: results on EAI-Eval and T0-Eval for the nine architecture-objective combinations]

The figure shows results on the two evaluation sets. Each panel has nine points representing the nine combinations of model structure and training objective. The T0-Eval results are very clear: the nine combinations fall into three groups, with several baselines on the left, the three model structures trained with language modeling objectives in the middle, and the three model structures trained with MLM on the right. The MLM objective is clearly better, and MLM + ED is the best.

  3. The role of adaptation:

[Figure: compute savings from adaptation]

Switching to a new training objective after pre-training and continuing to train mainly serves to reduce training cost. For example, in the left panel, to obtain a CD + FLM model one can first train ND + MLM and then switch to CD + FLM for adaptation, which makes the overall process about 1.6 times faster.

After this series of experiments, the authors' final recommendation is: to build a strong large model at the lowest cost, pre-train with CD + FLM, then switch to ND + MLM for adaptation, and finally apply multitask fine-tuning. This recipe is 9.1 times faster than training directly, while achieving the best results.

Summary

This paper is very much in the spirit of T5: it treats model design choices like hyperparameters and runs controlled experiments to find the best architecture and training method. Papers like this read as logically clear and rigorous.

From another perspective, however, such work can also feel a bit dull: using large models has turned into feature engineering with prompts, and here designing and training them has likewise turned into hyperparameter tuning, losing some of the spark of innovation. Perhaps this reflects how fiercely competitive the large-model field has become.

References:

[1] Aakanksha Chowdhery et al., "PaLM: Scaling Language Modeling with Pathways", https://arxiv.org/abs/2204.02311

[2] Jordan Hoffmann et al., "Training Compute-Optimal Large Language Models", https://arxiv.org/abs/2203.15556

[3] Victor Sanh et al., "Multitask Prompted Training Enables Zero-Shot Task Generalization", https://arxiv.org/abs/2110.08207
