Source | Zhihu
Address | https://zhuanlan.zhihu.com/p/268130746
Author | Mr.robot
Editor | Machine Learning Algorithms and Natural Language Processing WeChat Public Account
This article has been authorized by the author, and secondary reproduction is prohibited without permission.
Interviewer: Do you understand ALBERT?
Interviewee: Yes.
Interviewer: Then can you explain the advantages of ALBERT compared to BERT?
Interviewee: The optimization points of ALBERT are divided into three parts: Factorized Embedding Parameterization, Cross-layer Parameter Sharing, and Sentence Order Prediction.
These three parts are ALBERT’s optimizations over BERT: Factorized Embedding Parameterization and Cross-layer Parameter Sharing target the parameter count and reduce it greatly, while Sentence Order Prediction replaces BERT’s NSP pre-training task and improves what the model learns during pre-training.
Interviewer: Can you elaborate on these three parts?
Interviewee: Sure, let’s start with Factorized Embedding Parameterization. BERT-base consists of 12 Transformer encoder layers. When we use BERT to obtain vector representations of words or sentences, we take the outputs of these encoder layers; a common choice is the second-to-last layer, which usually gives a good representation. That output H is contextual: each token’s representation takes its surrounding words into account. The input token ids (input_ids) are first mapped by the embedding layer to vectors of dimension E. In BERT, E is tied to H, so E grows whenever H grows. For example, in BERT-Large H is 1024, so E is also 1024. This is unnecessary: what we ultimately care about is H, so only H needs to have the required dimension, while E is flexible. The embedding table has V*E parameters, where V is the vocabulary size, so we can shrink E to a much smaller value and then project it up to dimension H with an E*H matrix. The total embedding parameter count becomes V*E + E*H instead of the original V*H (where E = H). Since V is much larger than H, choosing E much smaller than H greatly reduces the parameter count.
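To make the arithmetic concrete, here is a minimal PyTorch sketch (not ALBERT’s actual implementation); the sizes V = 30000, E = 128, H = 768 are illustrative values chosen for the example.

```python
import torch
import torch.nn as nn

V, E, H = 30000, 128, 768  # illustrative vocab size, embedding dim, hidden dim

# BERT-style embedding: E is tied to H, so the table alone has V * H parameters.
bert_embedding = nn.Embedding(V, H)

# ALBERT-style factorized embedding: a small V x E table plus an E x H projection.
albert_embedding = nn.Sequential(
    nn.Embedding(V, E),
    nn.Linear(E, H, bias=False),
)

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print("BERT-style  :", num_params(bert_embedding))    # V*H       = 23,040,000
print("ALBERT-style:", num_params(albert_embedding))  # V*E + E*H =  3,938,304

# Both map token ids to H-dimensional vectors.
ids = torch.randint(0, V, (2, 16))            # (batch, seq_len)
assert albert_embedding(ids).shape == (2, 16, H)
```

The projection adds only E*H parameters, which is tiny next to the V*E (or V*H) embedding table, so almost all of the saving comes from shrinking E.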
Interviewer: Now can you tell me about Cross-layer Parameter Sharing?
Interviewee: Cross-layer Parameter Sharing means all encoder layers share the same set of parameters. A Transformer encoder layer’s parameters mainly come from the attention sub-layer and the FeedForward sub-layer; LayerNorm also has learnable parameters, but those are comparatively few. Cross-layer Parameter Sharing primarily shares the attention and FeedForward parameters across all layers, which reduces the parameter count dramatically. Sharing parameters does hurt performance somewhat, and the paper compensates for this by increasing the hidden dimension H.
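A rough sketch of the idea, using PyTorch’s nn.TransformerEncoderLayer as a stand-in for BERT’s encoder block (the real block differs in details such as activation and LayerNorm placement); the sizes H = 768, 12 heads, and a 3072-dimensional FeedForward are illustrative.

```python
import torch
import torch.nn as nn

H, num_layers = 768, 12  # illustrative hidden size and depth

def make_layer() -> nn.TransformerEncoderLayer:
    return nn.TransformerEncoderLayer(
        d_model=H, nhead=12, dim_feedforward=3072, batch_first=True
    )

# BERT-style: 12 independent encoder layers, i.e. 12x the parameters of one layer.
unshared = nn.ModuleList([make_layer() for _ in range(num_layers)])

# ALBERT-style cross-layer sharing: one set of weights, applied 12 times.
shared_layer = make_layer()

def albert_encoder(x: torch.Tensor) -> torch.Tensor:
    for _ in range(num_layers):
        x = shared_layer(x)  # same attention/FeedForward/LayerNorm weights each pass
    return x

def num_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print("unshared:", num_params(unshared))      # 12x one layer
print("shared  :", num_params(shared_layer))  # one layer's worth of parameters

x = torch.randn(2, 16, H)                     # (batch, seq_len, hidden)
assert albert_encoder(x).shape == x.shape
```

Note that sharing shrinks the stored parameters, not the compute: the forward pass still runs 12 layers.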
Interviewer: What about Sentence Order Prediction?
Interviewee: Sentence Order Prediction (SOP) replaces BERT’s NSP pre-training task. RoBERTa also pointed out that NSP does not help much and simply dropped it. The problem is that NSP conflates Topic Prediction with Coherence Prediction: because NSP’s negative samples come from different documents, the model can do reasonably well just by judging whether the two sentences are about the same topic. Topic Prediction is very easy, which lowers the difficulty of the task and limits what the model learns. ALBERT’s SOP keeps the positives (two consecutive segments from the same document) but builds negatives by swapping the order of those two segments, so both segments always come from the same article. This eliminates the Topic Prediction shortcut and forces the model to learn sentence coherence, improving the pre-training signal.
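A hedged sketch of how the two tasks build training pairs; the helper names make_sop_example and make_nsp_negative are hypothetical, for illustration only, and do not come from the ALBERT codebase.

```python
import random

def make_sop_example(doc_sentences, rng=random):
    """Build one Sentence Order Prediction example from consecutive sentences.

    Positive (label 1): two consecutive segments in their original order.
    Negative (label 0): the same two segments with their order swapped,
    so both come from the same document and only coherence/order differs.
    """
    i = rng.randrange(len(doc_sentences) - 1)
    a, b = doc_sentences[i], doc_sentences[i + 1]
    if rng.random() < 0.5:
        return (a, b), 1   # in order
    return (b, a), 0       # swapped order

def make_nsp_negative(doc_a, other_doc, rng=random):
    """For contrast, an NSP-style negative pairs a segment with a sentence
    from a different document, which also changes the topic."""
    a = rng.choice(doc_a)
    b = rng.choice(other_doc)
    return (a, b), 0

doc = ["ALBERT shares parameters across layers.",
       "This greatly reduces the model size.",
       "It also factorizes the embedding matrix."]
print(make_sop_example(doc))
```

Because an SOP negative is drawn from the same document, topic cues no longer separate positives from negatives; only the ordering does.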
Interviewer: Alright, you passed the interview.
PS: If you run into other ALBERT questions in interviews, feel free to leave a comment to add them.