
- Higher parameter efficiency: each task needs only a small number of additional parameters, so it trains faster, occupies less memory, is less prone to overfitting on small datasets, and is easier to store and distribute.
- Less forgetting in continual learning: adapters freeze the parameters of the original model, so the original knowledge is not forgotten.
- Multi-task learning: adapters can also learn multiple tasks with relatively few parameters; compared to traditional multi-task learning, the advantage is that interference between tasks is reduced, while the disadvantage is that the mutual supervision between tasks may also decrease.
AdapterHub: https://adapterhub.ml/
2 Bottleneck Adapter
- Fine-tuning only the adapters achieves performance close to fine-tuning the entire model; if the adapter size is additionally tuned per task, the performance drop can be reduced further.
- Adapters are more parameter-efficient than fine-tuning the few layers of BERT closest to the output, and they outperform training only the layer-normalization parameters.
- At inference time, pruning the adapter at a single layer is feasible and does not noticeably affect performance, but pruning adapters at multiple layers leads to a significant drop. Layers closer to the input (bottom layers) are less sensitive to pruning than layers closer to the output (top layers).
- Initializing the adapter weights with a standard deviation below 0.01 works well; larger standard deviations can hurt performance.
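To make the bottleneck structure concrete, here is a minimal PyTorch sketch of an adapter module in the spirit of [1]. The class name, the choice of GELU, and the initialization values are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Minimal bottleneck adapter sketch: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden_size: int, bottleneck_size: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # project down to the bottleneck
        self.up = nn.Linear(bottleneck_size, hidden_size)    # project back up
        self.act = nn.GELU()
        # Near-identity initialization: small std so the adapter barely perturbs
        # the frozen model at the start of training (cf. the std < 0.01 observation above).
        nn.init.normal_(self.down.weight, std=1e-2)
        nn.init.normal_(self.up.weight, std=1e-2)
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the module close to the identity mapping.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Example: wrap a 768-dim hidden state with a 64-dim bottleneck.
x = torch.randn(2, 16, 768)
adapter = BottleneckAdapter(hidden_size=768, bottleneck_size=64)
assert adapter(x).shape == x.shape
```

Only the adapter (and typically layer-norm / task-head) parameters are trained; the pre-trained transformer weights stay frozen.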
3 Improvements to the Adapter Structure or the Training/Inference Process
- AdapterFusion [3]: How can multi-task learning and adapters be combined so as to keep the advantages of multi-task learning while avoiding its disadvantages?
- AdapterDrop [4]: How much slower is inference with adapters, and how can adapters be pruned?
- Compacter [5]: Can adapter layers be made even more lightweight without sacrificing performance?
3.1 AdapterFusion
- First step: train an adapter for each task. The authors experiment with two settings: (1) each task independently initializes its own adapter parameters and learns only the current task, without updating the pre-trained model (ST-A); (2) all adapters are assembled together and trained jointly with a multi-task learning loss, while also fine-tuning all pre-trained parameters (MT-A).
- Second step: assemble all adapters from the first step (only needed for ST-A; MT-A is already assembled), add the AdapterFusion layer, and train it on the target task, using the same version of the dataset as in the first step. The authors also experiment with using MT-A in the second step as a control.
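The fusion step can be pictured as an attention layer whose query is the transformer layer output and whose keys/values are the per-task adapter outputs. Below is a rough PyTorch sketch of that idea; the tensor layouts and module names are my own simplifications, and details from [3] such as the identity initialization of the value projection are omitted.

```python
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    """Sketch of AdapterFusion-style attention: the layer output queries the
    outputs of the task adapters, and a softmax over adapters mixes them."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, layer_output: torch.Tensor, adapter_outputs: torch.Tensor) -> torch.Tensor:
        # layer_output:    (batch, seq, hidden)
        # adapter_outputs: (batch, seq, n_adapters, hidden), one slice per ST-A
        q = self.query(layer_output).unsqueeze(2)      # (b, s, 1, h)
        k = self.key(adapter_outputs)                  # (b, s, n, h)
        v = self.value(adapter_outputs)                # (b, s, n, h)
        scores = (q * k).sum(-1)                       # (b, s, n): one score per adapter
        weights = torch.softmax(scores, dim=-1)        # attention over adapters
        return (weights.unsqueeze(-1) * v).sum(2)      # (b, s, h): fused representation
```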
- In the first stage, ST-A achieves performance close to fine-tuning the entire model, whereas MT-A hurts performance somewhat. The authors' explanation is that training only the adapters acts as a form of regularization and helps generalization.
- In the second stage, adding AdapterFusion significantly improves performance on tasks with smaller training sets.
- The best results are obtained with ST-A in the first stage and AdapterFusion in the second stage, which also allows the adapters to be reused. If MT-A is used in the first stage, MT-A must also be used in the second stage to obtain some improvement.
- For tasks where AdapterFusion brings large gains, the AdapterFusion layers tend to attend more to the adapters of other tasks.
3.2 AdapterDrop
- Removing the first few AdapterFusion (AF) layers affects different tasks differently: it has little impact on RTE but hurts CoLA considerably. This indicates that simply removing AF layers is not a universally good strategy.
- Pruning the adapters that contribute little to the output within each layer: the authors measure each adapter's average activation (i.e., its fusion-weighted output) on the training set and keep only the two highest-contributing adapters per layer, which keeps performance close to the original model while improving inference speed by 68%.
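As a rough illustration of this pruning heuristic, the sketch below ranks the adapters attended to by a fusion layer by their average weighted contribution on training data and keeps the top two. How exactly the contribution is measured here is my reading of "average activation of the weighted output", not the paper's exact procedure, and the function and argument names are hypothetical.

```python
import torch

def prune_fusion_adapters(adapter_outputs: torch.Tensor,
                          fusion_weights: torch.Tensor,
                          keep: int = 2) -> torch.Tensor:
    """Rank adapters by their average fusion-weighted contribution and keep the strongest ones.

    adapter_outputs: (n_examples, n_adapters, hidden) pooled adapter outputs on training data
    fusion_weights:  (n_examples, n_adapters) fusion attention weights for the same examples
    Returns the indices of the adapters to retain in this layer.
    """
    # Contribution of each adapter on each example, weighted by how much
    # the fusion layer actually attends to it.
    contribution = (fusion_weights.unsqueeze(-1) * adapter_outputs).norm(dim=-1)  # (N, n_adapters)
    avg_contribution = contribution.mean(dim=0)                                   # (n_adapters,)
    return torch.topk(avg_contribution, k=keep).indices
```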
3.3 Compacter
- Compacter builds on the Kronecker product. The Kronecker product of an m×f matrix A and a p×q matrix B is the (mp)×(fq) block matrix whose (i, j)-th block is a_ij·B.
- Assume the model's hidden size is k and the adapter bottleneck size is d, so the adapter layer in [1] contains two k×d projection matrices. Compacter first borrows the idea of parameterized hypercomplex multiplication (PHM) layers and expresses each such weight as a sum of n Kronecker products of n×n matrices A_i and (k/n)×(d/n) matrices B_i, which significantly reduces the parameter count.
- On top of this, all adapter layers are required to share the matrices A_i.
- In addition, each B_i is further decomposed into the product of two low-rank matrices of sizes (k/n)×r and r×(d/n); to reduce the parameter count further, the authors fix r = 1.
- The structure of the Compacter layer is shown in the figure below, which depicts two Compacter layers, with the colored parts being the trainable parameters. Expressing the parameters of the adapter layer of [1] in this form yields the Compacter structure.
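A minimal sketch of this parameterization (shared Kronecker factors A_i plus rank-1 B_i) is shown below. The class name, the initialization, and the absence of the adapter's nonlinearity and residual connection are simplifications for illustration, not the released Compacter implementation.

```python
import torch
import torch.nn as nn

class PHMLowRankLinear(nn.Module):
    """Compacter-style linear layer sketch: W = sum_i A_i ⊗ B_i, with each B_i
    rank-1 (B_i = s_i t_i^T, i.e. r = 1) and the A_i shared across all adapter layers."""
    def __init__(self, k: int, d: int, n: int, shared_A: nn.Parameter):
        super().__init__()
        assert k % n == 0 and d % n == 0
        self.A = shared_A                                  # shape (n, n, n): n shared n×n factors
        self.s = nn.Parameter(torch.randn(n, k // n, 1) * 1e-2)
        self.t = nn.Parameter(torch.randn(n, 1, d // n) * 1e-2)
        self.bias = nn.Parameter(torch.zeros(d))

    def weight(self) -> torch.Tensor:
        B = self.s @ self.t                                # (n, k/n, d/n), each slice rank-1
        # Sum of Kronecker products: W = sum_i A_i ⊗ B_i  ->  shape (k, d)
        return torch.stack([torch.kron(a, b) for a, b in zip(self.A, B)]).sum(0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight() + self.bias

# The shared factors A_i are created once and passed to every adapter layer,
# so each layer only adds the tiny s, t and bias parameters.
shared_A = nn.Parameter(torch.randn(4, 4, 4) * 1e-2)       # n = 4
down_proj = PHMLowRankLinear(k=768, d=48, n=4, shared_A=shared_A)
```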
- On T5-base, the adapter layers of [3] perform better than those of [1]. Both AdapterDrop and a plain low-rank decomposition of the adapter layers perform worse than fine-tuning the entire model.
- The three innovations in Compacter together allow it to match full-model fine-tuning while training only around 0.1% new parameters.
- Compared with full-model fine-tuning, Compacter performs better when the training set is small (0.1k–4k examples).
4 Applications of Adapters and Improvements for Specific Applications
4.1 Bapna & Firat (2019)
- Adapter structure: this paper uses a structure similar to that of [3] (although it predates [3]), inserting an adapter layer only at the end of each transformer layer. In addition, the authors re-initialize the layer-normalization parameters (unlike [1], which continues training from the pre-trained layer-norm parameters).
- Domain adaptation: the authors train on WMT En-Fr, then freeze the parameters, insert adapters, and transfer to IWSLT’15 and JRC. The model outperforms LHUC [7] and comes close to full-model fine-tuning.
- Multilingual NMT: the authors first train a single model for English <=> 102 other languages, then freeze its parameters and fine-tune by inserting one adapter per source-target language pair. The main baseline is a model trained only on the data of the corresponding (source, target) pair. When English is the source language, most target languages perform comparably to or better than the baseline; when English is the target language, low-resource languages improve significantly while high-resource languages degrade.
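The per-language-pair setup can be sketched as a dictionary of small residual adapters keyed by the language pair and selected at run time on top of the frozen shared model. All names below are hypothetical and the adapter internals are simplified.

```python
import torch.nn as nn

def make_adapter(hidden_size: int, bottleneck: int) -> nn.Module:
    # Simple bottleneck adapter (down-project, ReLU, up-project);
    # the residual connection is added by the caller.
    return nn.Sequential(
        nn.Linear(hidden_size, bottleneck),
        nn.ReLU(),
        nn.Linear(bottleneck, hidden_size),
    )

class PerPairAdapters(nn.Module):
    """One small adapter per (source, target) language pair, on top of a
    shared, frozen multilingual NMT backbone (in the spirit of [6])."""
    def __init__(self, hidden_size: int, bottleneck: int, pairs):
        super().__init__()
        self.adapters = nn.ModuleDict({
            f"{src}-{tgt}": make_adapter(hidden_size, bottleneck)
            for src, tgt in pairs
        })

    def forward(self, hidden_states, src: str, tgt: str):
        # Residual adapter for the requested language pair; the backbone stays frozen.
        return hidden_states + self.adapters[f"{src}-{tgt}"](hidden_states)
```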
4.2 K-Adapter
- Model structure: K-Adapter does not modify the original transformer layers; instead, it inserts adapter layers between transformer layers. Within each adapter layer, two transformer layers are added between the down- and up-projection fully connected layers (left in the figure below), increasing the module's expressiveness. Each adapter layer's input sees both the output of the previous adapter layer and the output of the nearest preceding transformer layer (right in the figure below).
▲Structure of K-Adapter (left) and training method (right)
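A rough PyTorch sketch of a single K-Adapter layer follows the textual description above: projection down, a small transformer stack, projection up, fed by the previous adapter output together with the nearest frozen hidden state. The sizes, the concatenation-then-projection, and the residual placement are my assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

class KAdapterLayer(nn.Module):
    """Sketch of one K-Adapter layer [8]: down-projection, a small transformer
    stack, and an up-projection, with a residual to the previous adapter output."""
    def __init__(self, hidden_size: int = 768, adapter_size: int = 128,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Combine the previous adapter output and the backbone hidden state.
        self.down = nn.Linear(2 * hidden_size, adapter_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=adapter_size, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.up = nn.Linear(adapter_size, hidden_size)

    def forward(self, prev_adapter_out: torch.Tensor,
                backbone_hidden: torch.Tensor) -> torch.Tensor:
        # For the first adapter layer, prev_adapter_out can simply be the
        # backbone hidden state itself (an assumption made for this sketch).
        x = torch.cat([prev_adapter_out, backbone_hidden], dim=-1)
        out = self.up(self.encoder(self.down(x)))
        return out + prev_adapter_out                   # residual to the previous adapter
```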
- Pre-training: insert the adapter layers, freeze the parameters of the original model, concatenate the output of the original model's last layer with the output of the last adapter layer as features, and then train on specific pre-training tasks.
- Factual knowledge: trained on a relation classification task with 430 classes and 5.5M sentences. By learning to predict the relations between entities, the model picks up basic facts and common sense.
- Linguistic knowledge: trained on a dependency classification task; the authors prepared roughly 1M training examples with Stanford's parser. By learning to predict the head position of each token, the model acquires syntax- and semantics-related knowledge.
- Downstream fine-tuning: the task-specific parameters are trained on top of the output of the last adapter layer; if several adapters are used simultaneously, their outputs are concatenated as features. The parameters of the original pre-trained model are also fine-tuned.
- The downstream tasks cover relation classification, entity typing, and question answering.
- The baselines include not only previously effective language-model + knowledge models but also the original RoBERTa model, RoBERTa with randomly initialized adapter parameters, and RoBERTa multi-task-trained on the two pre-training tasks. The experiments show that using both adapters together works best; factual knowledge helps more on the last two downstream tasks, while linguistic knowledge helps more on the first.
4.3 MAD-X
▲Placement of the three types of adapters in MAD-X (left) and the structure of the invertible adapter (right)
First, train a language adapter and an invertible adapter for each language with the MLM objective. Then insert a task-specific adapter and train it on the source-language training set, with the source language's language adapter and invertible adapter in place. At inference time, swap in the target language's language adapter and invertible adapter while keeping the source-language task adapter.
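The stacking-and-swapping logic can be sketched as follows (the invertible adapters at the embedding layer are omitted for brevity). Module names and adapter internals are illustrative, not the MAD-X code.

```python
import torch.nn as nn

class MADXBlock(nn.Module):
    """Sketch of the MAD-X stacking idea [9]: in each transformer layer a
    language adapter is applied first and a task adapter is stacked on top.
    Only the choice of language adapter changes between training and inference."""
    def __init__(self, hidden_size: int, bottleneck: int, languages):
        super().__init__()
        def adapter():
            return nn.Sequential(nn.Linear(hidden_size, bottleneck), nn.ReLU(),
                                 nn.Linear(bottleneck, hidden_size))
        self.lang_adapters = nn.ModuleDict({lang: adapter() for lang in languages})
        self.task_adapter = adapter()

    def forward(self, hidden_states, lang: str):
        # Language adapter first, task adapter stacked on top (both residual).
        h = hidden_states + self.lang_adapters[lang](hidden_states)
        return h + self.task_adapter(h)

# Training: forward(..., lang="en") on source-language data, updating only the task adapter.
# Zero-shot inference: forward(..., lang="sw") reuses the same task adapter with the
# target-language adapter swapped in.
```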
4.4 Other Applications
- UDapter [10]: uses adapters to train a parameter-efficient multilingual dependency parser.
- Philip et al. (2020) [11]: similar to [6] for multilingual machine translation, but introduces adapter parameters per language rather than one adapter per language pair as in [6].
- Lauscher et al. (2020) [12]: similar to [8], using adapters to inject knowledge in a modular way.
5 Adapter-Like Structures
- BERT and PALs [13]
- Prefix Tuning [14]
- LoRA [15]
▲Structures of three types of adapters (adapter, prefix tuning, LoRA), image from [16].
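As a point of comparison with bottleneck adapters, here is a minimal LoRA-style linear layer in the spirit of [15]: the pre-trained weight stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. Names and default hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA sketch)."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)              # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init: start exactly at W
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base output plus the scaled low-rank update; B @ A can be merged into
        # the base weight after training, so inference adds no extra latency.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```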
References
[1] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., Laroussilhe, Q. D., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. Proceedings of the 36th International Conference on Machine Learning, 2790–2799. https://proceedings.mlr.press/v97/houlsby19a.html
[2] Rebuffi, S. A., Bilen, H., & Vedaldi, A. (2017). Learning multiple visual domains with residual adapters. Advances in Neural Information Processing Systems, 30. https://proceedings.neurips.cc/paper/2017/file/e7b24b112a44fdd9ee93bdf998c6ca0e-Paper.pdf
[3] Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., & Gurevych, I. (2021). AdapterFusion: Non-Destructive Task Composition for Transfer Learning. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 487–503. https://doi.org/10.18653/v1/2021.eacl-main.39
[4] Rücklé, A., Geigle, G., Glockner, M., Beck, T., Pfeiffer, J., Reimers, N., & Gurevych, I. (2021). AdapterDrop: On the Efficiency of Adapters in Transformers. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 7930–7946. https://doi.org/10.18653/v1/2021.emnlp-main.626
[5] Karimi Mahabadi, R., Henderson, J., & Ruder, S. (2021). Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. Advances in Neural Information Processing Systems, 34, 1022–1035. https://proceedings.neurips.cc/paper/2021/hash/081be9fdff07f3bc808f935906ef70c0-Abstract.html
[6] Bapna, A., & Firat, O. (2019). Simple, Scalable Adaptation for Neural Machine Translation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 1538–1548. https://doi.org/10.18653/v1/D19-1165
[7] Vilar, D. (2018). Learning Hidden Unit Contribution for Adapting Neural Machine Translation Models. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 500–505. https://doi.org/10.18653/v1/N18-2080
[8] Wang, R., Tang, D., Duan, N., Wei, Z., Huang, X., Ji, J., Cao, G., Jiang, D., & Zhou, M. (2021). K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 1405–1418. https://doi.org/10.18653/v1/2021.findings-acl.121
[9] Pfeiffer, J., Vulić, I., Gurevych, I., & Ruder, S. (2020). MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer (arXiv:2005.00052). arXiv. http://arxiv.org/abs/2005.00052
[10] Üstün, A., Bisazza, A., Bouma, G., & van Noord, G. (2020). UDapter: Language Adaptation for Truly Universal Dependency Parsing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2302–2315. https://doi.org/10.18653/v1/2020.emnlp-main.180
[11] Philip, J., Berard, A., Gallé, M., & Besacier, L. (2020). Monolingual Adapters for Zero-Shot Neural Machine Translation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4465–4470. https://doi.org/10.18653/v1/2020.emnlp-main.361
[12] Lauscher, A., Majewska, O., Ribeiro, L. F. R., Gurevych, I., Rozanov, N., & Glavaš, G. (2020). Common Sense or World Knowledge? Investigating Adapter-Based Knowledge Injection into Pretrained Transformers. Proceedings of Deep Learning Inside Out (DeeLIO): The First Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 43–49. https://doi.org/10.18653/v1/2020.deelio-1.5
[13] Stickland, A. C., & Murray, I. (2019). BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. Proceedings of the 36th International Conference on Machine Learning, 5986–5995. https://proceedings.mlr.press/v97/stickland19a.html
[14] Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
[15] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685
[16] He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., & Neubig, G. (2022). Towards a Unified View of Parameter-Efficient Transfer Learning (arXiv:2110.04366). arXiv. http://arxiv.org/abs/2110.04366