In-Depth Guide to Prompt Learning and Tuning

Reprinted from | PaperWeekly
Author | Zhang Jianwei
Institution | Zhejiang University
Research Direction | Few-shot Learning, Image Segmentation
Since the advent of Self-Attention and the Transformer, they have become the new stars of natural language processing. Thanks to the global attention mechanism, Transformer-based language models can conveniently encode long-range dependencies, and their highly parallel computation makes training on large-scale natural language corpora feasible. However, because natural language tasks are numerous and often differ from one another only slightly, fine-tuning a separate large model for every task is not cost-effective.
In CV, different image recognition tasks likewise often require fine-tuning the entire large model, which is just as uneconomical. Prompt Learning offers a promising direction for this problem.
The NLP part of this article mainly refers to the review [1].

Development of NLP Models

Many past machine learning methods were based on fully supervised learning.
Since supervised learning needs a large amount of data to reach high performance, and large-scale task-specific labeled data is scarce in NLP, researchers long focused on feature engineering, that is, using domain knowledge to extract good features from the data.
After the rise of deep learning, features could be learned from the data itself, so researchers shifted to architecture engineering, that is, designing network structures whose inductive biases help the model learn good features.
From 2017 to 2019, NLP moved to a new paradigm exemplified by BERT: pre-training + fine-tuning. A language model (LM) with a fixed architecture is pre-trained by having it complete text (for example, fill in blanks).
Since pre-training requires no expert annotation, it can be carried out directly on large-scale text collected from the internet. The pre-trained LM is then adapted to downstream tasks by adding a few parameters or by fine-tuning. At this point, researchers turned to objective engineering, that is, designing better objective functions for the pre-training and fine-tuning stages.

Prompt Learning

2.1 What is Prompt?

During the process of objective engineering, researchers found that aligning the objective of a downstream task with the pre-training objective works well. Therefore, by introducing textual prompts, the original objectives of downstream tasks are reformulated as fill-in-the-blank questions consistent with the pre-trained model.
For example, a reconstruction of the input “I missed the bus today.” is:
Sentiment Prediction Task. Input: “I missed the bus today. I felt so___.” where “I felt so” is the prompt, and then use the LM to fill in the blank with a word representing sentiment.
Translation Task. Input: “English: I missed the bus today. French: ___.” where “English:” and “French:” are the prompts, and then use the LM to fill in the corresponding French sentence in the blank.
Note that adding different prompts to the same input turns it into different tasks, so downstream tasks can align well with the pre-training task and obtain better predictions.
Researchers later found that, for the same task, different prompts can also lead to markedly different predictions, so much current work focuses on prompt engineering.
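As a concrete illustration of the fill-in-the-blank reconstruction above, the sketch below uses the Hugging Face transformers fill-mask pipeline; the model choice (roberta-base) and the exact prompt wording are illustrative assumptions, not part of the original article.

```python
# Minimal sketch: let a masked LM "fill in the blank" created by a prompt.
# Model name and prompt wording are illustrative, not prescribed by the article.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# Sentiment-style prompt: the LM's top predictions for the blank act as the label.
prompt = "I missed the bus today. I felt so <mask>."
for candidate in fill_mask(prompt, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```

In prompt-based methods, mapping the predicted words (e.g., "sad") back to task labels is typically handled by a verbalizer.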

2.2 What Pre-trained Models Are There?

● Left-to-Right LM: GPT, GPT-2, GPT-3
● Masked LM: BERT, RoBERTa
● Prefix LM: UniLM, UniLMv2
● Encoder-Decoder: T5, MASS, BART

2.3 What Methods Are There for Prompt Learning?

● Classified by the shape of the prompt: fill-in-the-blank style, prefix style.
● Classified by human involvement: manually designed, automatic (discrete, continuous)
▲ Manually Designed Prompt

Prompt Tuning

3.1 Fine-tuning Strategies

Fine-tuning large-scale pre-trained models on downstream tasks has become a common training paradigm for many NLP and CV tasks. However, as models grow larger and the number of tasks increases, fine-tuning the entire model means storing a separate copy of it for every task, which consumes a large amount of storage. This is especially problematic on edge devices, where storage and network bandwidth are limited, so sharing parameters becomes crucial.
A straightforward way to share parameters is to fine-tune only a portion of the parameters, or to add a small number of extra parameters to the pre-trained model. For example, for classification tasks (two of these options are sketched in code after the list):
● Linear: Fine-tune only the classifier (a linear layer), freezing the entire backbone network.
● Partial-k: Fine-tune only the last k layers of the backbone network, freezing other layers [2][3].
● MLP-k: Add a k-layer MLP as a classifier.
● Side-tuning [4]: Train a “side” network, then merge the pre-trained features and the features of the “side” network before inputting to the classifier.
● Bias: Fine-tune only the bias parameters of the pre-trained network [5][6].
● Adapter [7]: Insert additional MLP modules into the Transformer via residual connections.
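The snippet below is a minimal PyTorch sketch of two of the strategies above, Linear probing and Bias-only tuning. The backbone (a torchvision ResNet-50) is an illustrative stand-in; BitFit [5] was proposed for Transformer language models, and the cited methods include details omitted here.

```python
# Minimal sketch of two parameter-efficient strategies from the list above.
# The backbone is a placeholder; the cited methods are simplified.
import torch.nn as nn
from torchvision.models import resnet50  # requires torchvision >= 0.13

def linear_probe(num_classes: int) -> nn.Module:
    """'Linear': freeze the whole backbone, train only a new classifier head."""
    model = resnet50(weights="IMAGENET1K_V2")
    for p in model.parameters():
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trainable
    return model

def bias_only(num_classes: int) -> nn.Module:
    """'Bias' (BitFit-style): train only bias terms plus the classifier head."""
    model = resnet50(weights="IMAGENET1K_V2")
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith(".bias") or name.startswith("fc.")
    return model
```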
In recent years, Transformer models have shone in NLP and CV. Transformer-based models have already matched or even surpassed convolution-based models in many CV tasks.
Comparison of Transformer and ConvNet: a key difference between the two lies in how they operate along the spatial (or temporal) dimension.
● ConvNet: convolution kernels slide over the spatial dimension, so features at different spatial positions are fused through learnable convolution weights, and only within a local neighborhood.
● Transformer: features at different spatial (temporal) positions are fused through Attention, whose fusion weights are computed from the features themselves rather than learned per position, and the fusion is global.
Because the fusion operation carries no position-specific learned parameters, additional features (tokens) can simply be appended to the input, which makes the model easy to extend.

3.2 Prompt-based Fine-tuning in NLP

● Prefix-Tuning
● Prompt-Tuning
● P-Tuning
● P-Tuning-v2
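These methods differ mainly in where the learnable prompt vectors are injected: Prefix-Tuning and P-Tuning-v2 prepend trainable vectors to every layer, while Prompt-Tuning and P-Tuning operate only on the input embeddings. The sketch below captures just the common core in PyTorch, prepending a few trainable soft-prompt embeddings to a frozen LM; class and argument names are illustrative, and it assumes the wrapped LM accepts pre-computed embeddings (e.g., the inputs_embeds argument of Hugging Face models).

```python
# Minimal sketch of soft prompt tuning: only `self.prompt` is trained.
# `embed` is assumed to be the LM's own (frozen) token embedding layer.
import torch
import torch.nn as nn

class SoftPromptModel(nn.Module):
    """Prepend trainable soft-prompt embeddings to a frozen language model."""
    def __init__(self, lm: nn.Module, embed: nn.Embedding, prompt_len: int = 20):
        super().__init__()
        self.lm, self.embed = lm, embed
        for p in self.lm.parameters():                  # freeze the backbone LM
            p.requires_grad = False
        self.prompt = nn.Parameter(torch.randn(prompt_len, embed.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.Tensor):
        tok = self.embed(input_ids)                                    # (B, L, d)
        prompt = self.prompt.unsqueeze(0).expand(tok.size(0), -1, -1)  # (B, P, d)
        inputs_embeds = torch.cat([prompt, tok], dim=1)                # prepend prompts
        return self.lm(inputs_embeds=inputs_embeds)                    # only self.prompt gets gradients
```

In practice the attention mask must also be extended to cover the prompt positions.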

3.3 Prompt-based Fine-tuning in CV

3.3.1 Classification

Visual Prompt Tuning [8]
▲ Visual Prompt Tuning
● VPT-Shallow
● VPT-Deep
▲ VPT Results
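In VPT-Shallow, learnable prompt tokens are inserted only before the first Transformer layer, whereas VPT-Deep inserts a separate set of prompt tokens at the input of every layer; in both cases the backbone stays frozen and only the prompts and the classification head are trained. Below is a minimal sketch of the VPT-Shallow variant with placeholder module names (and mean pooling instead of a [CLS] token, for brevity); it is not the authors' implementation.

```python
# Minimal sketch of VPT-Shallow: learnable prompt tokens are prepended to the
# patch tokens of a frozen ViT; only the prompts and the head receive gradients.
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    def __init__(self, patch_embed, encoder, dim, num_classes, num_prompts=10):
        super().__init__()
        self.patch_embed = patch_embed          # frozen: images -> (B, N, dim)
        self.encoder = encoder                  # frozen Transformer blocks
        for p in list(patch_embed.parameters()) + list(encoder.parameters()):
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.normal_(self.prompts, std=0.02)
        self.head = nn.Linear(dim, num_classes)  # trainable classification head

    def forward(self, images):
        tokens = self.patch_embed(images)                                  # (B, N, dim)
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([prompts, tokens], dim=1))              # (B, P+N, dim)
        return self.head(x.mean(dim=1))                                    # pooled -> logits
```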

3.3.2 Continual Learning

Learning to Prompt for Continual Learning [9]
A prompt pool is introduced; for each input, the N prompts closest to it are selected from the pool and prepended to the image tokens. The distance between an input and a prompt is measured between the input feature and a learnable key attached to that prompt, and the keys are optimized by gradient descent together with the classification objective.
▲ L2P
Note that, at the end, the features at the prompt positions are the ones used for classification.
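A minimal sketch of this selection step, with illustrative names: each prompt in the pool has a learnable key, the query is the frozen backbone's feature of the input, and the key–query similarity both drives the top-N selection and contributes a "pull" term to the loss.

```python
# Minimal sketch of an L2P-style prompt pool: pick the top-N prompts whose
# keys are closest (cosine similarity) to the query feature of the input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPool(nn.Module):
    def __init__(self, pool_size=10, prompt_len=5, dim=768, top_n=5):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, dim) * 0.02)
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_len, dim) * 0.02)
        self.top_n = top_n

    def forward(self, query):
        # query: (B, dim), e.g. the frozen backbone's feature of the input image
        sim = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)  # (B, pool)
        idx = sim.topk(self.top_n, dim=1).indices                  # (B, top_n)
        selected = self.prompts[idx].flatten(1, 2)                 # (B, top_n * prompt_len, dim)
        key_loss = (1 - sim.gather(1, idx)).mean()                 # pull chosen keys toward queries
        return selected, key_loss
```

The selected prompts are then prepended to the image tokens before the (frozen) Transformer, and the key loss is added to the classification loss.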

3.3.3 Multimodal Models

Vision-Language Model: Context Optimization (CoOp) [10]
CoOp builds on pre-trained multimodal models; for example, CLIP aligns the feature spaces of text and images through contrastive learning.
▲ CLIP
Choosing different text prompts has a significant impact on accuracy.
▲ Prompt Engineering vs Context Optimization (CoOp)
Replace manually set prompts with learnable prompts:
● [CLASS] placed at the end:
● [CLASS] placed in the middle:
Prompts can be shared across classes, or each class can have its own prompt (which works better for fine-grained classification tasks).
▲ Learning to Prompt for Vision-Language Models (CoOp)
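A minimal sketch of this idea under stated assumptions: M learnable context vectors shared by all classes are concatenated with each class-name embedding, passed through CLIP's frozen text encoder, and classification uses the usual CLIP image–text cosine similarity. text_encoder and name_embeddings are placeholders for the corresponding frozen CLIP components, and details such as start/end tokens are omitted.

```python
# Minimal sketch of CoOp: learnable context vectors [V]_1..[V]_M shared by all
# classes, with the [CLASS] token embedding placed at the end of the sequence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOpHead(nn.Module):
    def __init__(self, text_encoder, name_embeddings, dim=512, n_ctx=16):
        super().__init__()
        self.text_encoder = text_encoder          # frozen CLIP text encoder
        self.name_embeddings = name_embeddings    # (num_classes, name_len, dim), frozen
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learnable context

    def class_features(self):
        C = self.name_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(C, -1, -1)            # (C, n_ctx, dim)
        tokens = torch.cat([ctx, self.name_embeddings], dim=1)   # [V]_1 .. [V]_M [CLASS]
        return F.normalize(self.text_encoder(tokens), dim=-1)    # (C, feat_dim)

    def forward(self, image_features, logit_scale=100.0):
        image_features = F.normalize(image_features, dim=-1)     # from frozen image encoder
        return logit_scale * image_features @ self.class_features().t()
```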
Conditional Prompt Learning for Vision-Language Models [11]
CoOp performs poorly when generalized to new classes.
▲ To learn generalizable prompts
Therefore, the prompts are designed to be instance-conditional.
▲ To learn generalizable prompts
A feature of the current image is added to the prompt to improve generalization. Specifically, the Image Encoder first computes the feature of the current image; a Meta-Net then maps this feature into the prompt's embedding space, and the result is added to the (learnable) context vectors of the prompt.
▲ To learn generalizable prompts
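A minimal sketch of the instance-conditional context, with placeholder names: a lightweight Meta-Net maps the image feature to a bias vector that is added to every learnable context token, so each image induces its own prompt.

```python
# Minimal sketch of CoCoOp's conditional prompt: a small Meta-Net turns the
# image feature into a shift that is added to every learnable context vector.
import torch
import torch.nn as nn

class ConditionalContext(nn.Module):
    def __init__(self, feat_dim=512, ctx_dim=512, n_ctx=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)   # shared learnable context
        self.meta_net = nn.Sequential(                                # lightweight Meta-Net
            nn.Linear(feat_dim, feat_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 16, ctx_dim),
        )

    def forward(self, image_features):            # (B, feat_dim)
        bias = self.meta_net(image_features)      # (B, ctx_dim)
        # each image shifts the shared context, yielding an instance-specific prompt
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)   # (B, n_ctx, ctx_dim)
```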

3.3.4 Domain Adaptation

Domain Adaptation via Prompt Learning [12]
Use prompts to indicate domain information.
▲ Example Prompt Structure
Class and domain information are disentangled in the representation through contrastive learning.
▲ Domain Adaptation with Prompt Learning
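As a rough sketch of this idea (an assumed structure, not the authors' code): the prompt for a given domain and class concatenates shared context vectors, domain-specific context vectors, and the class-name embedding, and the resulting text features are contrasted against image features so that class and domain information stay disentangled. Only the prompt construction is shown below.

```python
# Minimal sketch of domain-conditional prompts: each (domain, class) pair gets
# a prompt built from shared context + domain-specific context + class name.
import torch
import torch.nn as nn

class DomainPrompt(nn.Module):
    def __init__(self, num_domains, dim=512, n_shared=8, n_domain=4):
        super().__init__()
        self.shared_ctx = nn.Parameter(torch.randn(n_shared, dim) * 0.02)
        self.domain_ctx = nn.Parameter(torch.randn(num_domains, n_domain, dim) * 0.02)

    def forward(self, name_embeddings, domain_id):
        # name_embeddings: (num_classes, name_len, dim); domain_id: int
        C = name_embeddings.size(0)
        shared = self.shared_ctx.unsqueeze(0).expand(C, -1, -1)
        domain = self.domain_ctx[domain_id].unsqueeze(0).expand(C, -1, -1)
        return torch.cat([shared, domain, name_embeddings], dim=1)  # prompt tokens per class
```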
References
[1] Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, Graham Neubig. In arXiv 2021 https://arxiv.org/abs/2107.13586
[2] How transferable are features in deep neural networks? Jason Yosinski, Jeff Clune, Yoshua Bengio, Hod Lipson. In NeurIPS 2014 https://proceedings.neurips.cc/paper/2014/hash/375c71349b295fbe2dcdca9206f20a06-Abstract.html
[3] Masked autoencoders are scalable vision learners. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. In arXiv 2021 https://arxiv.org/abs/2111.06377
[4] Side-tuning: a baseline for network adaptation via additive side networks. Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, Jitendra Malik. In ECCV 2020 https://link.springer.com/chapter/10.1007/978-3-030-58580-8_41
[5] BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. Elad Ben Zaken, Shauli Ravfogel, Yoav Goldberg. In ACL 2022 https://arxiv.org/abs/2106.10199
[6] TinyTL: Reduce memory, not parameters for efficient on-device learning. Han Cai, Chuang Gan, Ligeng Zhu, Song Han. In NeurIPS 2020 https://proceedings.neurips.cc/paper/2020/hash/81f7acabd411274fcf65ce2070ed568a-Abstract.html
[7] Parameter-efficient transfer learning for NLP. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly. In ICML 2019 http://proceedings.mlr.press/v97/houlsby19a.html
[8] Visual Prompt Tuning. Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim. In arXiv 2022 https://arxiv.org/abs/2203.12119
[9] Learning to Prompt for Continual Learning. Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister. In CVPR 2022 https://arxiv.org/abs/2112.08654
[10] Learning to Prompt for Vision-Language Models. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu. In arXiv 2021 https://arxiv.org/abs/2109.01134
[11] Conditional Prompt Learning for Vision-Language Models. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu. In CVPR 2022 https://arxiv.org/abs/2203.05557
[12] Domain Adaptation via Prompt Learning. Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, Gao Huang. In arXiv 2022 https://arxiv.org/abs/2202.06687