MLNLP Community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master's and doctoral students, university teachers, and industry researchers.
The community's vision is to promote exchange and progress between academia and industry in natural language processing and machine learning, in China and abroad, especially for beginners.
Reprinted from | PaperWeekly
Author | Yu Yangmu
Affiliation | NIO
Research Direction | NLU, AIGC
In recent years, the NLP field has developed rapidly. Following the recent wave of contrastive learning, prompt learning has attracted even more attention. It is well known that data labeling largely determines the ceiling of AI algorithms and is very costly. Both contrastive learning and prompt learning target the few-shot learning problem and can achieve good results even without labeled data. This article introduces the concept of prompt learning and its commonly used methods.
What Are the Training Paradigms of NLP
Currently, the academic community generally divides the development of NLP tasks into four stages, namely the four paradigms of NLP:
1. The first paradigm: based on traditional machine learning models, such as tf-idf features + Naive Bayes and other machine learning algorithms;
2. The second paradigm: based on deep learning models, such as word2vec features + LSTM and other deep learning algorithms. Compared with the first paradigm, model accuracy improved and the workload of feature engineering decreased;
3. The third paradigm: based on pre-trained models + fine-tuning, such as BERT + fine-tuning for NLP tasks. Compared with the second paradigm, model accuracy improved significantly, and good models can be trained even on small datasets, but the models also became much larger;
4. The fourth paradigm: based on pre-trained models + Prompt + prediction. Compared with the third paradigm, the amount of training data required is significantly reduced.
Across the whole NLP field, development is moving towards higher accuracy, less supervision, and even no supervision. Prompt learning is currently the latest and most popular line of research in this direction.
Why Do We Need Prompt Learning
Why? A good new method is usually proposed to address the deficiencies of an existing one, so let us start from the previous paradigm, the widely used pre-trained model (PLM) + fine-tuning, typified by BERT + fine-tuning.
This paradigm adapts pre-trained models to downstream tasks by fine-tuning model parameters on downstream data. First, the pre-training objectives are autoregressive or autoencoding, which differ significantly from the downstream task format, so the capabilities of the pre-trained model cannot be fully exploited.
This inevitably means more data is needed to adapt to the new task format, leading to poor few-shot learning ability and easy overfitting.
▲ There is a gap between upstream and downstream task forms
Second, the parameter counts of pre-trained models keep growing. Fine-tuning a separate model for each specific task and then deploying it for online business also causes a tremendous waste of deployment resources.
▲ Fine-tuning a dedicated model for each task leads to high deployment costs
What Is Prompt Learning
First, we should establish a consensus: pre-trained models contain a wealth of knowledge, and pre-trained models themselves have few-shot learning capabilities.
The In-Context Learning introduced with GPT-3 has effectively demonstrated that, in zero-shot and few-shot scenarios, models can achieve good results without updating any parameters, especially in the recently popular GPT-3.5 series, including ChatGPT.
The essence of prompt learning: unify all downstream tasks into the pre-training task; use specific templates to convert downstream task data into natural language form, fully tapping the capabilities of the pre-trained model itself.
In essence, it is about designing a template that closely matches the upstream pre-training task. Through template design, the potential of the upstream pre-trained model is exploited, allowing it to perform downstream tasks well even without labeled data. The key steps are:
1. Design the pre-trained language model's task
2. Design the input template style (Prompt Engineering)
3. Design the label style and the mapping from model output to labels (Answer Engineering)
Forms of prompt learning: take sentiment classification of movie reviews as an example, where the model performs binary classification on the input sentence:
Original input: The special effects are very cool, I really like it.
Prompt input:
Prompt template 1: The special effects are very cool, I really like it. This is a [MASK] movie.
Prompt template 2: The special effects are very cool, I really like it. This movie is very [MASK].
The role of the prompt template is to convert training data into natural language form and to place the [MASK] at an appropriate position to elicit the capabilities of the pre-trained model.
▲ Prompt learning template framework
Category mapping / Verbalizer: choose appropriate prediction words and map them to different categories.
▲ Category mapping
By constructing prompt learning samples in this way, prompt tuning with only a small amount of data can be effective, demonstrating strong zero-shot/few-shot learning capabilities.
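To make the template + verbalizer idea above concrete, here is a minimal zero-shot sketch. The backbone model (bert-base-uncased) and the verbalizer words "good"/"bad" are illustrative assumptions, not the exact setup of the papers discussed below:

```python
# Minimal sketch: prompt-based zero-shot classification with a masked LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The special effects are very cool, I really like it."
prompt = text + " This movie is very " + tokenizer.mask_token + "."

# Verbalizer: map candidate words at the [MASK] position to classes.
verbalizer = {"positive": "good", "negative": "bad"}

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
mask_logits = logits[0, mask_pos]

# Compare the logits of the verbalizer words at the masked position.
scores = {
    label: mask_logits[tokenizer.convert_tokens_to_ids(word)].item()
    for label, word in verbalizer.items()
}
print(max(scores, key=scores.get), scores)
```

No labeled data or parameter updates are involved here; the template and verbalizer alone turn the masked-LM pre-training task into a classifier.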
Common Prompt Learning Methods
4.1 Hard Template Method
4.1.1 Hard Template – PET (Pattern-Exploiting Training)
PET is a classic prompt learning method. Similar to the example above, it models the problem as a cloze task and then optimizes the word predicted at the masked position. Although PET also optimizes the parameters of the entire model, it requires less data than traditional fine-tuning.
Modeling approach: previously the model only needed to model P(l|x) (where l is the label); the problem is now reformulated with a prompt P and a label mapping v (which the authors call a verbalizer):
s_P(l | x) = M(v(l) | P(x))
where M is the model and s is the logit of the verbalizer word corresponding to label l under prompt P. Applying softmax over the labels gives the probabilities:
q_P(l | x) = exp(s_P(l | x)) / Σ_{l'} exp(s_P(l' | x))
During training, the authors also add an MLM loss for joint training.
▲ Training architecture
Specific procedure:
1. Train a separate model for each prompt on a small amount of supervised data;
2. For unsupervised data, aggregate the predictions of the multiple prompts for the same sample, using averaging or weighting (weights assigned by accuracy), to obtain a probability distribution as the soft label for that sample (a minimal sketch of this step is given at the end of Section 4.1);
3. Fine-tune a final model on the resulting soft labels.
4.1.2 Hard Template – LM-BFF
LM-BFF is work from Danqi Chen's team; building on prompt tuning, it proposes prompt-based tuning with demonstrations and automatic prompt generation.
Defects of hard template methods: hard templates rely on two approaches, manual design based on experience and automated search. However, manually designed templates are not necessarily better than automatically searched ones, while automatically searched templates often lack readability and interpretability.
As the experimental results show, changing a single word in a prompt can make a significant difference in the results. This pointed the way for subsequent optimization, such as abandoning hard templates entirely and directly optimizing prompt token embeddings.
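As promised above, here is a minimal sketch of PET's step 2, combining the predictions of several prompt-specific models into soft labels. The arrays and accuracies below are illustrative placeholders, not data from the paper:

```python
# Minimal sketch of PET step 2: ensemble per-prompt predictions into soft labels.
import numpy as np

# probs[i]: class distributions predicted by the model trained with prompt i,
# one row per unlabeled example, shape (num_examples, num_classes).
probs = [
    np.array([[0.9, 0.1], [0.3, 0.7]]),   # model trained with prompt 1
    np.array([[0.8, 0.2], [0.4, 0.6]]),   # model trained with prompt 2
]
# Weight each prompt's model by its accuracy on a small dev set (or use uniform weights).
dev_accuracy = np.array([0.80, 0.70])
weights = dev_accuracy / dev_accuracy.sum()

# Weighted average -> soft labels used to train the final model in step 3.
soft_labels = sum(w * p for w, p in zip(weights, probs))
print(soft_labels)
```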
4.2 Soft Template Method
4.2.1 Soft Template – P-tuning
Rather than designing or searching for hard templates, several optimizable pseudo prompt tokens are inserted directly at the input, automating the search for knowledge templates in a continuous space:
1. It does not rely on manual design;
2. Very few parameters need to be optimized, which avoids overfitting (it can also be fully fine-tuned, degenerating into traditional fine-tuning).
Traditional discrete prompts map each token of the template T directly to its embedding, whereas P-tuning maps each pseudo prompt Pi in the template T to a trainable parameter hi.
The key to the optimization is to replace the hard natural-language prompt with a trainable soft prompt: a bidirectional LSTM encodes the sequence of pseudo tokens in the template T, and a small number of natural-language anchor tokens are kept to improve efficiency, such as the anchor "capital" in the figure above. P-tuning is therefore a hybrid of hard and soft forms, not entirely soft.
Specific procedure:
1. Initialize a template: The capital of [X] is [MASK].
2. Replace the input: [X] is replaced with the input "Britain", and the model predicts the capital of Britain.
3. Select one or more tokens in the template as soft prompts.
4. Feed all soft prompts into an LSTM to obtain a hidden state vector h for each soft prompt.
5. Feed the initial template into the BERT embedding layer, replace the token embeddings of all soft prompts with h, and then predict the [MASK].
Core conclusion: with full data and large models, fine-tuning only the prompt-related parameters achieves performance comparable to full fine-tuning.
Code: https://github.com/THUDM/P-tuning
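As a rough illustration of steps 3-5 above, here is a minimal sketch of a P-tuning-style prompt encoder. The backbone (bert-base-uncased), the number of pseudo tokens, and the way the encoded prompts are spliced into the input are simplified assumptions, not the authors' exact implementation (anchor tokens and the attention mask are omitted for brevity):

```python
# Minimal sketch: LSTM-encoded pseudo prompt tokens prepended to a frozen masked LM.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"           # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModelForMaskedLM.from_pretrained(model_name)
backbone.requires_grad_(False)             # freeze the pre-trained model

hidden = backbone.config.hidden_size
n_prompts = 4                              # number of pseudo prompt tokens (illustrative)

class PromptEncoder(nn.Module):
    """Learnable pseudo-token embeddings re-parameterized by a bidirectional LSTM + MLP."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_prompts, hidden)
        self.lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

    def forward(self):
        ids = torch.arange(n_prompts).unsqueeze(0)      # (1, n_prompts)
        h, _ = self.lstm(self.embed(ids))               # (1, n_prompts, hidden)
        return self.mlp(h)

prompt_encoder = PromptEncoder()            # the only module with trainable parameters

# Build the input "[soft prompts] Britain [MASK]".
ids = tokenizer("Britain " + tokenizer.mask_token, return_tensors="pt")["input_ids"]
word_emb = backbone.get_input_embeddings()(ids)          # (1, L, hidden)
inputs_embeds = torch.cat([prompt_encoder(), word_emb], dim=1)

logits = backbone(inputs_embeds=inputs_embeds).logits    # gradients flow only into the encoder
print(logits.shape)
```

During training, only the prompt encoder's parameters would be passed to the optimizer, which is what makes the method lightweight.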
4.2.2 Soft Template – Prefix Tuning
P-tuning only updates the prompt token embeddings, so relatively few parameters are optimized. Prefix tuning aims to optimize more parameters to improve performance, without introducing excessive overhead. Although prefix tuning was proposed for generation tasks, it has been influential for the subsequent development of soft prompts.
▲ Optimize the prompt token embeddings of every layer, not just the input layer
As shown in the figure, a prefix is prepended at every layer of the transformer. The key characteristic is that the prefix consists not of real tokens but of continuous vectors (soft prompts). During prefix-tuning training, the parameters of the transformer are frozen and only the prefix parameters are updated.
Only one copy of the large transformer needs to be stored, together with the learned task-specific prefix, so each additional task incurs very little overhead.
▲ Autoregressive model
Taking the autoregressive model in the figure above as an example:
1. The input is represented as Z = [prefix; x; y];
2. Prefix tuning initializes a trainable matrix P to store the prefix parameters;
3. The parameters of the tokens in the prefix are taken from this trainable matrix, while the parameters of the other tokens are fixed and are those of the pre-trained language model.
Core conclusion: for prefix tuning on generation tasks, with full data and large models, fine-tuning only the prompt-related parameters achieves performance comparable to full fine-tuning.
Code: https://github.com/XiangLi1999/PrefixTuning
4.2.3 Soft Template – Soft Prompt Tuning
Soft Prompt Tuning validates the effectiveness of the soft template approach and shows that, by freezing the base model and using task-specific soft prompt tokens, resource consumption can be greatly reduced while retaining the generality of a single large model.
It simplifies prefix tuning by freezing the pre-trained model and only adding k learnable tokens to the input of the downstream task. With sufficiently large pre-trained models, this approach can match the performance of traditional fine-tuning.
Code: https://github.com/kipgparker/soft-prompt-tuning
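To contrast with the P-tuning sketch above, here is a correspondingly minimal sketch of soft prompt tuning: k learnable embeddings prepended at the input layer only, with no LSTM re-parameterization, and only those embeddings handed to the optimizer. The backbone, k, the verbalizer word, and the toy loss are illustrative assumptions (the original paper uses T5):

```python
# Minimal sketch of prompt tuning: k trainable soft tokens at the input layer only.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"   # assumed backbone for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
backbone = AutoModelForMaskedLM.from_pretrained(model_name)
backbone.requires_grad_(False)     # freeze the entire pre-trained model

k, hidden = 20, backbone.config.hidden_size
soft_prompt = nn.Parameter(torch.randn(1, k, hidden) * 0.02)   # the only trainable weights
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

text = ("The special effects are very cool, I really like it. This movie is very "
        + tokenizer.mask_token + ".")
ids = tokenizer(text, return_tensors="pt")["input_ids"]
word_emb = backbone.get_input_embeddings()(ids)                 # (1, L, hidden)
inputs_embeds = torch.cat([soft_prompt, word_emb], dim=1)       # prepend soft tokens

logits = backbone(inputs_embeds=inputs_embeds).logits
# Toy objective: push the [MASK] prediction towards the verbalizer word "good".
mask_pos = k + (ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
target = torch.tensor([tokenizer.convert_tokens_to_ids("good")])
loss = nn.functional.cross_entropy(logits[:, mask_pos], target)
loss.backward()
optimizer.step()
print(float(loss))
```

Because the backbone never changes, a single deployed model can serve many tasks, each carrying only its own small prompt matrix.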
Summary
Components of Prompt Learning
1. Prompt Template: Construct cloze-style or prefix-style templates according to the pre-trained model being used.
2. Category Mapping / Verbalizer: Choose appropriate category mapping words based on experience.
3. Pre-trained Language Model
Summary of Typical Prompt Learning Methods
1 Hard Template Methods: Manually designed/automatically constructed discrete token-based templates
1) PET
2) LM-BFF
2 Soft Template Methods: No longer pursue intuitively interpretable templates; instead, directly optimize the Prompt Token Embeddings as trainable vectors
1) P-tuning
2) Prefix Tuning
References
[1] https://arxiv.org/pdf/2107.13586.pdf
[2] https://arxiv.org/pdf/2009.07118.pdf
[3] https://arxiv.org/pdf/2012.15723.pdf
[4] https://arxiv.org/pdf/2103.10385.pdf
[5] https://aclanthology.org/2021.acl-long.353.pdf
[6] https://arxiv.org/pdf/2104.08691.pdf
About Us
MLNLP Community is a grassroots academic community jointly built by machine learning and natural language processing scholars in China and abroad. It has since grown into a well-known machine learning and natural language processing community, aiming to promote progress among academia, industry, and the wider community of enthusiasts in machine learning and natural language processing.
The community provides an open communication platform for practitioners' further education, employment, and research. Everyone is welcome to follow and join us.