
Authors: Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu
This article is about 1,800 words; recommended reading time: 5 minutes.
This article introduces the “Patient Knowledge Distillation” model.
Reply “191010” to the Data Department THU backend to get the link to the paper.
In the past year, there have been many groundbreaking advances in language model research: GPT can generate sentences that are convincingly realistic [1], and BERT, XLNet, RoBERTa [2,3,4], and others have dominated the NLP leaderboards as feature extractors. However, the number of parameters in these models is staggering: BERT-base has 109 million parameters, while BERT-large has as many as 330 million, which makes the models slow to run. To reduce this runtime cost, the paper proposes a new Knowledge Distillation [5] method to compress the model, saving inference time and memory without losing much accuracy. The paper was published at EMNLP 2019.
Specifically, for sentence classification tasks, compressing a model with regular knowledge distillation usually causes a significant loss of accuracy. The reason is that the student model learns only the probability distribution predicted by the teacher model and completely ignores the representations of the intermediate hidden layers.
It is like a teacher instructing a student: if the student only memorizes the final answer and learns nothing about the intermediate steps, the student is more likely to make mistakes on new problems. Based on this hypothesis, the paper proposes a loss function that pushes the student model’s hidden-layer representations toward those of the teacher model, thereby improving the student model’s generalization ability. The resulting model is referred to as the “Patient Knowledge Distillation” (PKD) model.
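For readers who want a concrete picture of the regular distillation objective mentioned above, here is a minimal PyTorch sketch of a soft-target loss in the style of Hinton et al. [5]; the temperature and weighting values are illustrative assumptions, not settings taken from the paper.

```python
import torch.nn.functional as F

def standard_kd_loss(student_logits, teacher_logits, labels,
                     temperature=4.0, alpha=0.5):
    """Vanilla KD: the student matches only the teacher's output distribution.

    student_logits, teacher_logits: [batch, num_classes]
    labels: [batch] gold labels for the supervised task loss
    temperature, alpha: illustrative hyper-parameters (assumptions)
    """
    # Soft targets from the teacher, softened by the temperature
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy between the teacher's and the student's distributions
    distill = -(soft_targets * log_student).sum(dim=-1).mean() * temperature ** 2
    # Ordinary supervised loss on the gold labels
    task = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * task
```

Nothing in this objective touches the intermediate hidden layers, which is exactly the gap the patient loss below is meant to fill.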
For sentence classification, the model’s prediction is based on the feature representation of the [CLS] token, for example by adding two fully connected layers on top of this feature. The researchers therefore proposed a new loss function that lets the student model also learn the teacher’s feature representation of the [CLS] token at the hidden layers:

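In a slightly simplified per-example form (the paper’s version also sums over the training set), the patient-teacher loss measures the normalized squared distance between matched [CLS] hidden states:

```latex
L_{\mathrm{PT}} \;=\; \sum_{(i,\,j)} \left\| \frac{h^{s}_{j}}{\lVert h^{s}_{j} \rVert_{2}} \;-\; \frac{h^{t}_{i}}{\lVert h^{t}_{i} \rVert_{2}} \right\|_{2}^{2}
```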
Here, M is the number of layers in the student model (e.g., 3 or 6), N is the number of layers in the teacher model (e.g., 12 or 24), h is the [CLS] representation at a given hidden layer, and (i, j) indexes the matched teacher-student hidden layers, as shown in the figure below. For example, a 6-layer student distilled from a 12-layer teacher can learn the representations of the teacher’s layers (2, 4, 6, 8, 10) (PKD-Skip, left) or of the teacher’s last few layers (7, 8, 9, 10, 11) (PKD-Last, right). The teacher’s last layer is skipped in both schemes because the student already learns the teacher’s predicted probabilities directly, so matching the last hidden layer would be redundant.
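A minimal PyTorch sketch of this patient-teacher loss and of the two layer-mapping schemes may make the idea concrete; the function names and index bookkeeping below are illustrative assumptions and are not taken from the released code (linked at the end of this article).

```python
import torch.nn.functional as F

def teacher_layer_indices(num_student_layers, num_teacher_layers, scheme="skip"):
    """Map the first M-1 student layers to teacher layers (1-indexed).

    "skip": every other teacher layer, e.g. a 6-layer student with a 12-layer
            teacher gives [2, 4, 6, 8, 10] (PKD-Skip).
    "last": the layers just below the teacher's top, e.g. [7, 8, 9, 10, 11]
            (PKD-Last).
    The student's own last layer is excluded, since it is trained against the
    teacher's output probabilities instead of a hidden state.
    """
    m = num_student_layers - 1
    if scheme == "skip":
        step = num_teacher_layers // num_student_layers
        return [step * (j + 1) for j in range(m)]
    if scheme == "last":
        return [num_teacher_layers - m + j for j in range(m)]
    raise ValueError(f"unknown scheme: {scheme}")

def patient_loss(student_cls_states, teacher_cls_states, scheme="skip"):
    """Normalized squared distance between matched [CLS] hidden states.

    student_cls_states: list of [batch, hidden] tensors, one per student layer
    teacher_cls_states: list of [batch, hidden] tensors, one per teacher layer
    """
    mapping = teacher_layer_indices(len(student_cls_states),
                                    len(teacher_cls_states), scheme)
    loss = 0.0
    for j, i in enumerate(mapping):
        s = F.normalize(student_cls_states[j], p=2, dim=-1)      # student layer j+1
        t = F.normalize(teacher_cls_states[i - 1], p=2, dim=-1)  # teacher layer i
        loss = loss + ((s - t) ** 2).sum(dim=-1).mean()
    return loss
```

During training, this term is added to the usual soft-target distillation loss and the cross-entropy loss on the gold labels, with weighting hyper-parameters balancing the three.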

The researchers compared the proposed model with direct fine-tuning and with standard knowledge distillation on 7 standard sentence-classification datasets. When distilling a 12-layer teacher into a 6-layer or 3-layer student, PKD outperformed both baselines in most cases. Moreover, on five datasets, SST-2 (-2.3% accuracy relative to the teacher), QQP (-0.1%), MNLI-m (-2.2%), MNLI-mm (-1.8%), and QNLI (-1.4%), the student’s performance came close to the teacher’s. Detailed results are shown in Figure 1. This further supports the researchers’ hypothesis that a student that learns the hidden-layer representations outperforms one that learns only the teacher’s predicted probabilities.

Figure 1
In terms of speed, the 6-layer transformer student almost doubles inference speed while reducing the total number of parameters by a factor of 1.64, and the 3-layer transformer student achieves a 3.73x speedup while reducing the total number of parameters by a factor of 2.4. Detailed results are shown in Figure 2.

Figure 2
- [1] Radford, Alec, et al. “Language Models are Unsupervised Multitask Learners.” OpenAI Blog 1.8 (2019).
- [2] Devlin, Jacob, et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805 (2018).
- [3] Yang, Zhilin, et al. “XLNet: Generalized Autoregressive Pretraining for Language Understanding.” arXiv preprint arXiv:1906.08237 (2019).
- [4] Liu, Yinhan, et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv preprint arXiv:1907.11692 (2019).
- [5] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).
Siqi Sun: is a Research SDE at Microsoft. He is currently working on commonsense reasoning and knowledge-graph-related projects. Prior to joining Microsoft, he was a PhD student in computer science at TTI-Chicago, and before that he was an undergraduate student in the School of Mathematics at Fudan University.
Yu Cheng: is a senior researcher at Microsoft. His research is about deep learning in general, with specific interests in model compression, deep generative models, and adversarial learning. He is also interested in solving real-world problems in computer vision and natural language processing. Yu received his Ph.D. from Northwestern University in 2015 and his bachelor’s degree from Tsinghua University in 2010. Before joining Microsoft, he spent three years as a Research Staff Member at IBM Research/MIT-IBM Watson AI Lab.
Zhe Gan: is a senior researcher at Microsoft, primarily working on generative models, visual QA/dialog, machine reading comprehension (MRC), and natural language generation (NLG). He also has broad interests in various machine learning and NLP topics. Zhe received his PhD degree from Duke University in Spring 2018. Before that, he received his Master’s and Bachelor’s degrees from Peking University in 2013 and 2010, respectively.
Jingjing (JJ) Liu: is a Principal Research Manager at Microsoft, leading a research team in NLP and Computer Vision. Her current research interests include Machine Reading Comprehension, Commonsense Reasoning, Visual QA/Dialog, and Text-to-Image Generation. She received her PhD degree in Computer Science from MIT EECS in 2011. She also holds an MBA degree from the Judge Business School at the University of Cambridge. Before joining MSR, Dr. Liu was the Director of Product at Mobvoi Inc and a Research Scientist at MIT CSAIL.
The code has been open-sourced at: https://github.com/intersun/PKD-for-BERT-Model-Compression.
