Machine Heart Reprint
Source: Xixiaoyao’s Cute Selling House
Author: Sheryc_Wang Su
There are two kinds of truly challenging engineering projects in this world: the first is to scale something ordinary up as far as it will go, such as growing a language model until, like GPT-3, it can write poetry, prose, and code; the other is the exact opposite, shrinking something ordinary down as far as it will go. For NLPers, the model most urgently in need of this kind of "shrinking" is BERT.

- Paper Title: EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP
- Paper Link: https://arxiv.org/pdf/2011.14203.pdf
- Source: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (ICLR’20)
- Link: https://arxiv.org/pdf/1909.11942.pdf
- Embedding layer factorization: In BERT, the WordPiece embedding dimension is tied to the hidden dimension of the network. The authors argue that the embedding layer only encodes context-independent information, while the hidden layers add contextual information on top of it, so the hidden dimension should be the larger of the two; moreover, when the two dimensions are tied, enlarging the hidden layer inflates the embedding parameter count as well. ALBERT therefore factorizes the embedding matrix into two smaller matrices by introducing an intermediate embedding dimension. Let the WordPiece vocabulary size be V, the embedding dimension be E, and the hidden dimension be H; the embedding parameter count then drops from O(V × H) to O(V × E + E × H) (see the sketch after this list).
- Cross-layer parameter sharing: In BERT, every Transformer layer has its own parameters. The authors propose sharing all Transformer parameters across layers, which compresses the Transformer parameter count to that of a single layer.
- Next Sentence Prediction → Sentence Order Prediction: Besides the MLM language-modeling objective, BERT is also trained on next sentence prediction, which asks whether sentence 2 actually follows sentence 1. However, RoBERTa, XLNet, and other follow-up models have shown this task to contribute little. The authors replace it with sentence order prediction, which asks whether sentences 1 and 2 appear in their original order, so the model learns text coherence.
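To make the first two ideas concrete, here is a minimal PyTorch sketch (the dimensions and layer settings below are illustrative placeholders, not ALBERT's exact configuration) contrasting a BERT-style setup with a factorized embedding plus a single shared Transformer layer:

```python
import torch.nn as nn

V, E, H, L = 30000, 128, 768, 12  # vocab size, embedding dim, hidden dim, depth (illustrative)

# BERT-style: embedding tied to the hidden size, separate weights for every layer
bert_emb = nn.Embedding(V, H)                                   # O(V x H) parameters
bert_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True) for _ in range(L)]
)

# ALBERT-style: factorized embedding, O(V x E + E x H), plus one layer reused L times
albert_emb = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))
shared_layer = nn.TransformerEncoderLayer(d_model=H, nhead=12, batch_first=True)

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print("embedding:", n_params(bert_emb), "->", n_params(albert_emb))       # ~23.0M -> ~3.9M
print("encoder:  ", n_params(bert_layers), "->", n_params(shared_layer))  # 12 layers -> 1 layer
```

Note that the shared layer is still applied L times at inference, so parameter sharing shrinks storage, not compute.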


- Source: DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference (ACL’20)
- Link: https://arxiv.org/pdf/2004.12993.pdf
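DeeBERT's role in this lineup is entropy-based early exiting: a small classifier (an "off-ramp") is attached after every Transformer layer, and inference stops at the first ramp whose prediction is confident enough. Below is a minimal sketch of that inference loop, assuming pre-built `layers` and `ramps` modules and an illustrative entropy threshold; EdgeBERT's actual implementation differs in detail:

```python
import torch

def early_exit_forward(hidden, layers, ramps, entropy_threshold=0.3):
    """Run encoder layers one at a time; return at the first confident off-ramp."""
    for layer, ramp in zip(layers, ramps):
        hidden = layer(hidden)
        logits = ramp(hidden[:, 0])                  # classify from the [CLS] position
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
        if entropy < entropy_threshold:              # confident enough: skip remaining layers
            return logits
    return logits                                    # fell through: used the full depth
```

The entropy threshold becomes a knob that trades accuracy against latency and energy.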


- Source: Adaptive Attention Span in Transformers (ACL’19)
- Link: https://arxiv.org/pdf/1905.07799.pdf
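The idea behind adaptive attention span is that many heads only need local context, so each head learns its own span z, and attention beyond that span is smoothly masked out, saving compute and memory on long sequences. A minimal sketch of the paper's soft masking function, with the ramp width as a hyperparameter (the numbers below are illustrative):

```python
import torch

def span_mask(distance, z, ramp=32.0):
    """Soft span mask m_z(x) = clamp((ramp + z - x) / ramp, 0, 1).

    distance: relative distances between query and key positions
    z:        learned span parameter (typically one per attention head)
    """
    return torch.clamp((ramp + z - distance) / ramp, min=0.0, max=1.0)

# Example: with z = 8 a head attends fully within 8 tokens and fades to zero by 8 + ramp.
d = torch.arange(0, 64, dtype=torch.float32)
print(span_mask(d, z=8.0, ramp=16.0))
```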


- Source: Movement Pruning: Adaptive Sparsity by Fine-Tuning (NeurIPS’20)
- Link: https://arxiv.org/pdf/2005.07683.pdf
- Source: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (ICLR’16)
- Link: https://arxiv.org/pdf/1510.00149.pdf
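The contrast between these two pruning references is, roughly: Deep Compression-style magnitude pruning keeps the weights with the largest absolute values, whereas movement pruning learns an importance score per weight during fine-tuning (trained through the mask with a straight-through estimator) and keeps the top-scoring ones. A minimal sketch of the two mask choices, with score initialization and the training loop omitted:

```python
import torch

def magnitude_mask(weight, keep_ratio=0.2):
    """Magnitude pruning: keep the weights with the largest absolute value."""
    k = max(1, int(weight.numel() * keep_ratio))
    threshold = weight.abs().flatten().kthvalue(weight.numel() - k + 1).values
    return (weight.abs() >= threshold).float()

def movement_mask(scores, keep_ratio=0.2):
    """Movement pruning: keep the weights whose learned importance scores are largest;
    the scores themselves are trained jointly with the weights during fine-tuning."""
    k = max(1, int(scores.numel() * keep_ratio))
    threshold = scores.flatten().kthvalue(scores.numel() - k + 1).values
    return (scores >= threshold).float()

w = torch.randn(768, 768)
s = torch.randn_like(w, requires_grad=True)   # importance scores, updated by fine-tuning
w_pruned = w * movement_mask(s.detach(), keep_ratio=0.2)
```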


- Source: AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference (arXiv Preprint)
- Link: https://arxiv.org/pdf/1909.13271.pdf


- Embedding layer: stores the embedding vectors. EdgeBERT generally does not modify the embedding layer when fine-tuning on downstream tasks, so these parameters are effectively read-only; they need fast reads and should survive power-off so that data does not have to be reloaded, which makes low-energy, fast-reading eNVM (embedded non-volatile memory) a natural fit. The specific choice is multi-level-cell (MLC) ReRAM, a low-power non-volatile memory with fast reads.
- Other parameters: these do change during fine-tuning, so they are kept in SRAM (unlike the DRAM used as a computer's main memory, SRAM is more expensive but consumes less energy per access and offers higher bandwidth, which is why it is typically used for caches and registers).



- With a 1-percentage-point accuracy drop relative to ALBERT, EdgeBERT reduces both memory footprint and inference latency; with a 5-point drop, the savings are even larger.
- After pruning, only 40% of the embedding parameters are retained, so the embedding layer stored in eNVM occupies just 1.73 MB.
- The Transformer parameters can be masked out by 80% for QQP, and by 60% for MNLI, SST-2, and QNLI, at a cost of only a 1-percentage-point drop in performance.




© THE END