©PaperWeekly Original · Author|Su Jianlin
Affiliation|Zhuiyi Technology
Research Direction|NLP, Neural Networks
Model Compression
1.1 Basic Concept
In simple terms, model compression is about “simplifying a large model to obtain a smaller model with faster inference speed.” Of course, generally speaking, model compression comes with certain sacrifices, such as the most obvious drop in final evaluation metrics. After all, free lunches that are “better and faster” are rare, so the premise for choosing model compression is to allow for a certain degree of accuracy loss.
Additionally, the speedup from model compression usually shows up only at inference time; the training time, if anything, tends to get longer. So if your bottleneck is training time, model compression is probably not for you.
The reason model compression takes longer is that it requires “first training a large model and then compressing it to a small model.” Readers may wonder: why not directly train a small model? The answer is that many experiments have shown that training a large model first and then compressing it usually yields better final accuracy than directly training a small model.
1.2 Common Techniques
Common model compression techniques can be divided into two main categories: 1) directly simplifying a large model to obtain a small model; 2) retraining a small model using a large model. Both methods require first training a reasonably effective large model, followed by subsequent operations.
The representative methods of the first category are pruning and quantization.
Pruning, as the name suggests, attempts to remove some components of the original large model to transform it into a smaller model while keeping the model’s performance within an acceptable range;
As for quantization, it refers to changing the numerical format used by the model without altering its structure, while keeping the performance loss small. Models are typically built and trained in float32, but switching to float16 already speeds up inference and reduces memory usage. Going further to 8-bit integers, or even binarizing the weights (just two possible values), makes the speedup and memory savings more pronounced still.
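To make this concrete, here is a minimal numpy sketch of per-tensor int8 weight quantization. The function names are illustrative rather than taken from any particular toolkit, and a real deployment would rely on a framework's own quantization utilities.

```python
# A minimal sketch of post-training per-tensor int8 quantization (numpy only).
import numpy as np

def quantize_int8(w):
    """Map a float32 tensor onto int8 with a single per-tensor scale."""
    scale = np.abs(w).max() / 127.0                  # largest value maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor for computation."""
    return q.astype(np.float32) * scale

w = np.random.randn(768, 768).astype(np.float32)     # a BERT-sized weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```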
The representative method of the second category is distillation. The basic idea of distillation is to use the output of the large model as labels for training the small model. For classification problems, the actual labels are in one-hot format, while the output of the large model (e.g., logits) contains richer signals, allowing the small model to learn better features.
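For comparison, here is a hedged numpy sketch of the classic soft-label distillation loss: a temperature-softened cross-entropy against the teacher's logits plus the ordinary hard-label loss. The temperature and the weight alpha are illustrative choices, and the usual T² rescaling of the soft term is omitted for brevity.

```python
# A sketch of soft-label distillation: the student matches the teacher's
# softened output distribution in addition to the one-hot labels.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft term: cross-entropy between temperature-softened teacher and student.
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean()
    # Hard term: ordinary cross-entropy against the one-hot labels.
    log_p = np.log(softmax(student_logits) + 1e-12)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft + (1.0 - alpha) * hard

student = np.random.randn(4, 3)   # small-model logits for a 3-class task
teacher = np.random.randn(4, 3)   # large-model logits for the same batch
labels = np.array([0, 2, 1, 0])
print(distill_loss(student, teacher, labels))
```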

BERT-of-Theseus
The compression method to be introduced in this article is called “BERT-of-Theseus,” which belongs to the second category mentioned above. In other words, it also retrains a small model using a large model, but it is designed based on the replaceability of modules.
2.1 Core Idea
Let’s give a practical analogy:
Suppose we have two teams, A and B, each with five members. Team A is a star team with exceptional strength; Team B is a rookie team that needs training. To train Team B, we pick one member of Team B to replace one member of Team A, and then let this "4+1" version of Team A keep practicing and competing. After a while, the newcomer's skill improves, and this "4+1" team plays at a level close to the original Team A. If we keep rotating different Team B members into the lineup in this way, every member of Team B eventually gets trained alongside the stars, and in the end Team B on its own can reach a strength close to the original Team A.
Returning to BERT compression, suppose we have a 6-layer BERT that we fine-tune directly on the downstream task, obtaining a reasonably good model that we call the Predecessor; our goal is to compress it into a 3-layer model, the Successor.
Throughout BERT-of-Theseus, the Predecessor's weights are kept frozen. The 6-layer Predecessor is divided into 3 modules of two layers each, corresponding one-to-one with the 3 single-layer modules of the Successor. During training, each Predecessor module is randomly replaced by its Successor counterpart, and the mixed model is fine-tuned directly on the downstream task objective, with only the Successor's layers being updated. Once this replacement training is done, the Successor modules are assembled into a standalone 3-layer model and fine-tuned on the same task by themselves.
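To make the replacement mechanism concrete, here is a minimal sketch assuming each "module" is simply a callable; theseus_forward and the toy stand-in modules are hypothetical names, not taken from the official code. The 0/1 gate below corresponds to the equal-probability random replacement discussed in the experiments later.

```python
# A toy sketch of progressive module replacement with a per-module 0/1 gate.
import numpy as np

rng = np.random.default_rng(0)

def theseus_forward(x, predecessor_modules, successor_modules,
                    replace_rate=0.5, training=True):
    """Run the stack, swapping each Predecessor module for its Successor
    counterpart with probability `replace_rate` during training."""
    for pred, succ in zip(predecessor_modules, successor_modules):
        gate = float(rng.random() < replace_rate) if training else 1.0
        # gate == 1 -> Successor module, gate == 0 -> frozen Predecessor module.
        # In a real framework the Predecessor is frozen, so gradients only
        # flow through the Successor modules.
        x = gate * succ(x) + (1.0 - gate) * pred(x)
    return x

# Toy stand-ins: 3 Predecessor modules and their 3 Successor counterparts.
pred_modules = [lambda x, w=rng.standard_normal((8, 8)): np.tanh(x @ w)
                for _ in range(3)]
succ_modules = [lambda x, w=rng.standard_normal((8, 8)): np.tanh(x @ w)
                for _ in range(3)]
x = rng.standard_normal((2, 8))
print(theseus_forward(x, pred_modules, succ_modules).shape)   # (2, 8)
```

At inference time the gate is fixed to 1, i.e. only the Successor runs, which is exactly the compressed model we keep.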

2.3 Method Analysis
What advantages does BERT-of-Theseus have over distillation? First, since the paper was published, its results should be at least comparable, so we will not compare numbers here but the methods themselves. Clearly, the defining feature of BERT-of-Theseus is its simplicity.
In practice, distillation often requires matching intermediate-layer outputs as well, which brings a whole pile of training objectives: the downstream task loss, intermediate-layer output losses, correlation-matrix losses, attention-matrix losses, and so on. Just balancing all of these losses is enough to give you a headache.
In contrast, BERT-of-Theseus directly enforces that the Successor has outputs similar to the Predecessor through the replacement operation, and the final training objective is only the downstream task loss, which is undeniably simple.
Moreover, BERT-of-Theseus has a special advantage: many distillation methods need to operate simultaneously during both pre-training and fine-tuning phases to achieve significant results, while BERT-of-Theseus directly acts on fine-tuning for downstream tasks, achieving comparable results. This advantage is not reflected in the algorithm but is an experimental conclusion.
Formally, the random replacement in BERT-of-Theseus is somewhat reminiscent of image data augmentation schemes such as SamplePairing and mixup (see "From SamplePairing to mixup: The Magic of Regularization" [4]), which also randomly sample two objects and combine them by a weighted sum to regularize the model; it also resembles the progressive training of PGGAN [5], which transitions between two models through a gradual degree of mixing.

Qiu Zhenyu has also shared his own write-up [6], along with a TensorFlow implementation based on the original BERT code: qiufengyuyi/bert-of-theseus-tf [7]. And of course, since I decided to write this introduction, a Keras implementation based on bert4keras cannot be missing:
https://github.com/bojone/bert-of-theseus
This is probably the most concise and readable implementation of BERT-of-Theseus available.

It can be seen that, compared with directly fine-tuning only the first few layers, BERT-of-Theseus does bring a clear performance improvement. As for the random 0/1 replacement scheme, besides picking 0 or 1 with equal probability, the original paper also tried other strategies, such as gradually increasing the replacement rate over training, with slight gains; however, these introduce extra hyperparameters, so I did not experiment with them further. Interested readers can modify the code and try them out.
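As an illustration, here is a small sketch of one such strategy, a replacement rate that increases linearly over training; the constants k and b are hypothetical values, not the paper's exact settings.

```python
# Hypothetical linear schedule for the replacement probability: start at a base
# rate b and grow towards 1, so the forward pass gradually relies more and more
# on the Successor modules as training proceeds.
def replace_rate(step, k=1e-4, b=0.3):
    return min(1.0, b + k * step)

for step in (0, 2000, 5000, 10000):
    print(step, replace_rate(step))
```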
In addition, with distillation, if the Successor has the same structure as the Predecessor (i.e., self-distillation), the Successor generally ends up performing better than the Predecessor. Does BERT-of-Theseus share this property?
References
[1] https://github.com/bojone/bert4keras
[2] https://arxiv.org/abs/2002.11794
[3] https://www.zhihu.com/question/303922732
[4] https://kexue.fm/archives/5693
[5] https://arxiv.org/abs/1710.10196
[6] https://zhuanlan.zhihu.com/p/112787764
[7] https://github.com/qiufengyuyi/bert-of-theseus-tf