The MLNLP community is a well-known machine learning and natural language processing community at home and abroad, covering NLP master’s and doctoral students, university faculty, and industry researchers. Its vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.
Reprinted from | RUC AI Box
Author | Liu Zikang
Institution | Gaoling School of Artificial Intelligence, Renmin University of China
Research Direction | Multimodal Learning
This article summarizes commonly used lightweight methods for Transformers and introduces several mainstream lightweight Transformer models across different fields.
0
Introduction
In recent years, Transformer models have been widely applied across artificial intelligence and have become the mainstream approach in computer vision, natural language processing, and multimodal learning. Although Transformers perform excellently on most tasks, their large parameter counts limit their use in real-world scenarios and pose challenges for related research, which has driven the development of lightweight Transformers. The author first summarizes some commonly used lightweight methods for Transformers and then introduces several mainstream lightweight Transformers across different fields. Comments and suggestions are welcome.
Trend of Pre-trained Model Parameter Growth
1
Common Model Compression Methods in Transformers
Quantization: The basic idea of quantization is to replace high-bit data types with low-bit ones, thereby reducing both storage footprint and computational cost. Simple post-training quantization, which directly quantizes the parameters of an already trained model, can introduce significant error. Current practice therefore mainly relies on quantization-aware training (QAT), which optimizes in full precision during training while simulating the low-precision inference process, effectively reducing the performance loss caused by quantization.
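For intuition, here is a minimal sketch of the fake-quantization trick behind QAT, assuming PyTorch: values are rounded to an int8 grid in the forward pass while gradients flow through unchanged. This is an illustrative simplification, not any specific paper's exact scheme.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate low-bit quantization during full-precision training."""
    qmax = 2 ** (num_bits - 1) - 1                          # 127 for int8
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    x_int = torch.round(x / scale).clamp(-qmax - 1, qmax)   # quantize
    x_dq = x_int * scale                                    # de-quantize
    # Straight-through estimator: the forward pass sees quantized values,
    # the backward pass treats the rounding as the identity function.
    return x + (x_dq - x).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w).sum().backward()   # gradients still reach w
```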
Pruning: Pruning is motivated by the lottery ticket hypothesis, which suggests that only a small portion of a model's parameters play a core role, while most contribute little and can be removed. Pruning is divided into unstructured and structured pruning. Unstructured pruning identifies parameters that are close to zero (or trending toward zero) and sets them to zero, making the parameter matrix sparse. Structured pruning removes structured parts of the model, such as redundant attention heads in multi-head attention or unnecessary layers in a multi-layer Transformer.
Unstructured Pruning
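A minimal sketch of unstructured magnitude pruning, assuming PyTorch (illustrative only; real pruning pipelines typically prune iteratively and fine-tune afterwards):

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest |w|
    mask = (weight.abs() > threshold).to(weight.dtype)      # keep only large entries
    return weight * mask

w = torch.randn(768, 768)
w_sparse = magnitude_prune(w, sparsity=0.9)   # roughly 90% of entries become zero
```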
Knowledge Distillation: Knowledge distillation typically takes place between a large teacher model and a small student model. The student is trained on supervised data using the “soft label” distribution output by the teacher. Learning from soft labels can effectively counteract label bias in the supervised data and yields good knowledge transfer.
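The standard soft-label distillation loss (a generic sketch, not tied to any particular paper in this article) can be written as a temperature-scaled KL divergence between teacher and student outputs:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t*t factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t
```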
Parameter Sharing: Reusing the same parameters across structurally identical parts of the model, so that those parts count only once toward the total parameter budget.
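A tiny illustration of the idea, assuming PyTorch (a generic sketch): two identical linear layers are tied to the same underlying tensors, so the shared parameters are stored and counted only once.

```python
import torch.nn as nn

proj_a = nn.Linear(768, 768)
proj_b = nn.Linear(768, 768)
proj_b.weight = proj_a.weight   # both modules now reference the same Parameter
proj_b.bias = proj_a.bias

# nn.Module.parameters() de-duplicates shared tensors, so the total is
# 768*768 + 768 = 590,592 rather than twice that.
total = sum(p.numel() for p in nn.ModuleList([proj_a, proj_b]).parameters())
print(total)
```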
2
Lightweight Transformers in Pre-trained Language Models
Transformers were first widely applied in natural language processing, and their strong capabilities drove rapid progress in pre-training, bringing innovation to related areas. However, as pre-trained models keep growing in scale, the cost of training and deploying them also rises, giving rise to the research direction of lightweight pre-trained language models. Since the mainstream architecture of pre-trained language models, represented by BERT, is still the Transformer, lightweighting Transformers has become an important topic in NLP. Below are some classic lightweight works on pre-trained language models compiled by the author for reference.
Q8BERT: Quantized 8Bit BERT
https://arxiv.org/pdf/1910.06188.pdf
Q8BERT is a straightforward application of quantization to BERT. Besides basic quantization-aware training, Q8BERT uses an exponential moving average to smooth the statistics collected during QAT. With roughly 4x parameter compression and about 4x faster inference, Q8BERT achieves results close to BERT on the GLUE and SQuAD benchmarks, demonstrating the effectiveness of quantization on BERT.
Q8BERT’s Performance on Downstream Tasks
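For the EMA smoothing mentioned above, here is a minimal sketch of tracking an activation range with an exponential moving average (illustrative only, not Q8BERT's exact implementation; `x` is assumed to be a torch.Tensor):

```python
class EMARange:
    """Track the maximum absolute activation value with an EMA so that the
    quantization scale is not dominated by a single outlier batch."""
    def __init__(self, decay: float = 0.99):
        self.decay = decay
        self.max_val = None

    def update(self, x):
        batch_max = float(x.detach().abs().max())   # x: torch.Tensor
        if self.max_val is None:
            self.max_val = batch_max
        else:
            self.max_val = self.decay * self.max_val + (1 - self.decay) * batch_max
        return self.max_val
```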
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
https://arxiv.org/pdf/1910.01108.pdf
DistilBERT is a smaller version of BERT released by Hugging Face. It distills mainly at the output layer using the teacher's soft label distribution, halving the number of Transformer layers while still achieving good results on various downstream tasks.
On GLUE, DistilBERT used only 60% of the parameters while retaining 97% of the performance
TinyBERT: Distilling BERT for Natural Language Understanding
https://arxiv.org/pdf/1909.10351.pdf
Like DistilBERT, TinyBERT applies knowledge distillation to compress BERT. Unlike DistilBERT, which distills only during the pre-training phase, TinyBERT uses a two-stage distillation scheme covering both pre-training and fine-tuning, allowing the small model to learn both general and task-specific semantic knowledge.
TinyBERT’s Distillation Strategy
At the same time, TinyBERT adopts a more fine-grained distillation method, distilling knowledge from the embedding-layer outputs, the hidden states of each Transformer layer, and the soft labels of the whole model, achieving more precise knowledge transfer.
A More Fine-grained Distillation Method
Ultimately, TinyBERT distills BERT into a 4-layer model with a smaller hidden dimension, achieving results comparable to DistilBERT, which has a larger parameter count.
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
https://arxiv.org/pdf/1909.11942.pdf
ALBERT is a work by Google presented at ICLR 2020. It first applies factorization to the vocabulary embedding matrix: because BERT uses a very large vocabulary, the embedding layer contains a large number of parameters. Specifically, for a vocabulary of V words, if the embedding dimension E equals the model's hidden size H, the embedding layer holds V × H parameters; since V is typically large, this total is substantial. Instead of mapping one-hot word vectors directly into the H-dimensional space, ALBERT first maps them into a smaller E-dimensional space and then projects them into the H-dimensional space, reducing the parameter count from O(V × H) to O(V × E + E × H) and thereby compressing the embedding layer.

In addition, ALBERT shares parameters across Transformer layers; this complete cross-layer parameter sharing provides substantial compression. With far fewer parameters, ALBERT matches BERT-base while allowing a deeper model with a larger hidden dimension, which in turn yields better downstream performance.
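To make the factorization concrete, here is a quick parameter count with illustrative sizes roughly matching BERT-base (V ≈ 30000, H = 768) and a factorized embedding size of E = 128; the specific numbers are for illustration only.

```python
V, H, E = 30000, 768, 128      # vocab size, hidden size, factorized embedding size
full = V * H                   # direct V -> H embedding:        23,040,000 params
factorized = V * E + E * H     # V -> E projection plus E -> H:   3,938,304 params
print(full, factorized, round(full / factorized, 1))   # ~5.9x smaller embedding
```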
ALBERT’s Parameter Count and Downstream Task Performance
3
Lightweight Transformers in Computer Vision
Although Transformers arrived in computer vision somewhat later than in NLP, the emergence of the Vision Transformer has made Transformers a mainstream model for visual tasks, and subsequent pre-training methods such as MAE and BEiT have further solidified their position in the field. As in natural language understanding, vision Transformers also suffer from large parameter counts and deployment difficulties, so lightweight Transformers are needed to perform downstream tasks efficiently. Below are several lightweight works in computer vision compiled by the author for reference.
Training data-efficient image transformers & distillation through attention
DeiT is a 2021 work by Facebook whose core method is knowledge distillation for compressing vision Transformers. DeiT employs two distillation schemes to transfer knowledge from the teacher model to the student model:
Soft Distillation: Knowledge distillation through the soft labels output by the teacher model.
Hard-label Distillation: Knowledge distillation through the hard labels predicted by the teacher model, aimed at correcting potential label biases in the supervised data.
DeiT introduces a Distillation token during distillation, which plays a role similar to the Class token; but whereas the Class token is trained with cross-entropy against the ground-truth labels on the original data, the Distillation token is trained to match the teacher's soft output distribution or the teacher's hard predicted labels.
The Distillation Token
Through distillation from the teacher model, DeiT achieves better results than ViT with a smaller parameter scale and faster inference speed.
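A simplified sketch of the hard-label distillation objective described above (assuming PyTorch; details such as loss weighting follow common practice and are not taken verbatim from the paper):

```python
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """Class token learns from the ground truth; the distillation token learns
    from the teacher's predicted (hard) labels."""
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist
```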
TinyViT: Fast Pretraining Distillation for Small Vision Transformers
TinyViT is a 2022 work by Microsoft. Its core is still knowledge distillation, but with engineering optimizations that allow a small model to absorb knowledge from a large model at much larger data scales. The distillation recipe used by DeiT is expensive: the teacher and student occupy GPU memory at the same time, which limits the batch size and slows the student's training, and repeatedly transferring the teacher's soft labels to the student also costs significant compute.

To address this, TinyViT pre-generates the soft labels, decoupling soft-label generation from student training: soft labels are first computed and stored, and the stored labels are then used to train the student. Because storing full soft-label vectors would incur a large storage overhead, and because these vectors are mostly sparse (for a well-trained teacher, only a small number of classes receive non-negligible probability for a given image), the authors store sparse labels instead, keeping only the top-k most probable classes and their probabilities. When training the student, the sparse labels are restored to a complete probability distribution for distillation. The entire pipeline is illustrated below:
TinyViT’s Distillation Process
At the same model scale, TinyViT increases the speed and data volume of knowledge distillation, yielding improvements on classification tasks.
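A minimal sketch of the sparse soft-label storage described above, assuming PyTorch tensors and spreading the leftover probability mass uniformly when restoring; TinyViT's actual storage format and normalization may differ.

```python
import torch
import torch.nn.functional as F

def sparsify_soft_labels(teacher_logits: torch.Tensor, k: int = 10):
    """Keep only the top-k class probabilities per image for offline storage."""
    probs = F.softmax(teacher_logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)
    return topk_probs, topk_idx

def restore_soft_labels(topk_probs, topk_idx, num_classes: int):
    """Expand stored sparse labels back into a full distribution for distillation."""
    batch, k = topk_probs.shape
    rest = (1.0 - topk_probs.sum(dim=-1, keepdim=True)) / (num_classes - k)
    full = rest.expand(batch, num_classes).clone()       # fill remaining classes uniformly
    full.scatter_(dim=-1, index=topk_idx, src=topk_probs)
    return full
```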
MiniViT: Compressing Vision Transformers with Weight Multiplexing
MiniViT is a work by Microsoft presented at CVPR 2022 that compresses model parameters through weight multiplexing. Unlike in ALBERT, however, the authors found that naively reusing weights homogenizes the l2 norms of gradients across layers and reduces the correlation of output features in the final layers. To solve this, MiniViT introduces Weight Transformation and Weight Distillation. Weight Transformation inserts small, adapter-like structures between layers so that layer outputs do not become homogeneous despite the shared weights; Weight Distillation uses a teacher model to guide MiniViT's outputs to boost performance. The overall pipeline is illustrated below:
As a general compression method, it was tested on DeiT and Swin Transformer. With fewer parameters, the Mini versions of these models achieve comparable or even better results on the ImageNet dataset.
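A rough sketch of the weight-multiplexing idea, assuming PyTorch: the per-layer transformation here is a plain linear adapter chosen for illustration, not MiniViT's exact design.

```python
import torch.nn as nn

class MultiplexedEncoder(nn.Module):
    """One shared Transformer layer reused at every depth, with a small
    layer-specific transformation to keep the layers' outputs from collapsing
    into the same representation."""
    def __init__(self, d_model=256, nhead=4, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.per_layer_transform = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        for transform in self.per_layer_transform:
            x = transform(self.shared_layer(x))  # shared weights + per-layer adapter
        return x
```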
DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
This paper, from Tsinghua University and presented at NeurIPS 2021, borrows the idea of pruning from model compression but instead sparsifies the Transformer's input tokens at each layer. The basic assumption behind token sparsification is that, for a given image, some regions are redundant and have little impact on the model's prediction, so removing them can significantly speed up inference.

Concretely, DynamicViT adds a lightweight prediction module between layers to predict which tokens can be discarded; a binary decision mask then performs the actual token dropping. The authors also modify the training objective so that the number of tokens discarded at each layer stays limited.
DynamicViT’s Sparsification Module
Ultimately, with this sparsification strategy, DynamicViT achieves significant acceleration over the original ViT while striking a good balance between model performance and inference speed.
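A simplified sketch of token sparsification, assuming PyTorch and an inference-style hard top-k selection (during training DynamicViT instead uses a differentiable binary mask, and the class token is normally kept aside):

```python
import torch
import torch.nn as nn

class TokenPruner(nn.Module):
    """Score each token with a lightweight head and keep only the top-scoring
    fraction for the following Transformer layers."""
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.score_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score_head(tokens).squeeze(-1)               # (batch, num_tokens)
        num_keep = max(1, int(tokens.size(1) * self.keep_ratio))
        keep_idx = scores.topk(num_keep, dim=-1).indices           # kept token indices
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(dim=1, index=keep_idx)                # (batch, num_keep, dim)
```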
4
Lightweight Transformers in Multimodal Learning
Multimodal models often borrow their designs from both vision and language models and therefore also use Transformers as their mainstream architecture. Lightweighting work in the multimodal setting, however, is relatively scarce; the author has compiled several representative lightweight multimodal works for reference.
MiniVLM: A Smaller and Faster Vision-Language Model
https://arxiv.org/pdf/2012.06946.pdf
MiniVLM is Microsoft's lightweight version of the Oscar model. Its design rests on an empirical observation: in most multimodal tasks, the visual branch does not need particularly strong object-detection features, yet the object detector often becomes the model's bottleneck. Using a less precise object detector can therefore substantially reduce the parameter count and speed up inference with minimal performance loss.

To this end, MiniVLM adopts a lightweight object detector built on EfficientNet and BiFPN. To compress the model further, MiniVLM also shrinks the multimodal Transformer, replacing the original BERT structure with the more lightweight MiniLM, as shown below:
The Lightweight Object Detector
Ultimately, within an acceptable accuracy loss, MiniVLM greatly improves inference speed compared with the original Oscar model:
MiniVLM’s Acceleration Effect and Downstream Task Performance
Compressing Visual-linguistic Model via Knowledge Distillation
https://arxiv.org/pdf/2104.02096
DistilVLM is a follow-up to MiniVLM. In addition to replacing the object detector and Transformer architecture, DistilVLM applies knowledge distillation to preserve model performance. Its distillation strategy is similar to TinyBERT's, with two-stage distillation during both pre-training and fine-tuning:
DistilVLM’s Distillation Strategy
Because different object detectors produce different detected regions, naively distilling between them would be ineffective. To resolve this, DistilVLM performs visual token alignment: the teacher and student models use the same object detector during distillation, so their detection regions are aligned and the subsequent knowledge distillation remains effective.
Visual Alignment
Ultimately, with the same parameter count and inference speed as MiniVLM, DistilVLM achieves notable performance improvements.
References
[1] Q8BERT: Quantized 8Bit BERT
[2] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
[3] TinyBERT: Distilling BERT for Natural Language Understanding
[4] ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS
[5] Training data-efficient image transformers & distillation through attention
[6] TinyViT: Fast Pretraining Distillation for Small Vision Transformers
[7] MiniViT: Compressing Vision Transformers with Weight Multiplexing
[8] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
[9] MiniVLM: A Smaller and Faster Vision-Language Model
[10] Compressing Visual-linguistic Model via Knowledge Distillation