Source | Zhihu
Link | https://zhuanlan.zhihu.com/p/132554155
Author | Haichen Wei
Editor | Machine Learning Algorithms and Natural Language Processing WeChat Account
With the continuous development of NLP, the research and application of knowledge related to BERT/Transformer have become increasingly detailed. Below, we attempt to delve into the details of BERT/Transformer through a QA format.
1. What problems arise if the word vectors in self-attention do not multiply by the QKV parameter matrix, without considering the multi-head aspect?
2. Why does BERT choose to mask 15% of the words? Can it be another proportion?
3. Why can the BERT pre-trained model only input a maximum of 512 words and combine at most two sentences?
4. Why does BERT add a [CLS] token before the first sentence?
5. How is the time complexity of Self-Attention calculated?
6. Where does Transformer share weights, and why can it share weights?
7. Where does the non-linearity of BERT come from?
8. Does directly summing the three embeddings in BERT affect semantics?
9. What is the reason for scaling in the dot product model of Transformer?
10. How to solve the long text problem in BERT applications?
1. What problems arise if the word vectors in self-attention do not multiply by the QKV parameter matrix, without considering the multi-head aspect?
The core of Self-Attention is to enhance the semantic representation of the target word using other words in the text, thus better utilizing contextual information.
In self-attention, each word in the sequence calculates similarity with every other word in the sequence through dot products, including itself.
If it does not multiply by the QKV parameter matrix, then the q, k, and v for this word would be exactly the same.
Under the same magnitude, the dot product of q_i with k_i (i.e., the word with itself) will be the largest (by analogy: for a fixed sum, the product of two numbers is maximized when they are equal).
Thus, in the weighted average after softmax, the weight of the word itself will be the largest, making the weights of other words very small, which cannot effectively utilize contextual information to enhance the current word’s semantic representation.
Multiplying by the QKV parameter matrix will make the q, k, and v for each word different, significantly alleviating the above impact.
Of course, the QKV parameter matrix also enables multi-head attention, similar to multiple kernels in CNN, to capture richer features/information.
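To make the point concrete, here is a toy PyTorch sketch (random, untrained projections, not BERT's actual weights): without Q/K/V projections the attention matrix is built from the raw embeddings and its diagonal, each word attending to itself, tends to dominate; with projections it no longer has to.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)  # 4 tokens, 8-dimensional embeddings

# No Q/K/V projections: q and k are just the raw embeddings.
attn_no_proj = F.softmax(x @ x.t() / 8 ** 0.5, dim=-1)

# With (randomly initialized) projection matrices for q and k.
w_q, w_k = torch.randn(8, 8), torch.randn(8, 8)
attn_proj = F.softmax((x @ w_q) @ (x @ w_k).t() / 8 ** 0.5, dim=-1)

print(attn_no_proj.diag())  # the self-attention weight tends to be the largest in each row
print(attn_proj.diag())     # no longer forced to dominate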
2. Why does BERT choose to mask 15% of the words? Can it be another proportion?
BERT’s Masked LM randomly masks 15% of all words in the corpus, inspired by the cloze task, but actually has similarities with CBOW.
From the CBOW perspective, a good explanation is: masking 15% of the words is like randomly choosing one word to predict in a window of size 100/15 ≈ 7, similar to the center word of a sliding window in CBOW, except that here the windows do not overlap.
From the CBOW sliding window perspective, a proportion of 10%~20% is still acceptable.
This unofficial explanation was provided by a friend of mine as a point of reference.
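As a side note, here is a minimal sketch of how the 15% selection could look. The 80%/10%/10% replacement scheme is the one described in the BERT paper, not something derived from the CBOW analogy above, and the function below is purely illustrative.

import random

def choose_masks(tokens, vocab, mask_prob=0.15):
    # Pick roughly 15% of positions; of those, 80% become [MASK],
    # 10% become a random token, and 10% keep the original token.
    masked = list(tokens)
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token (it is still predicted)
    return masked

print(choose_masks(["the", "cat", "sat", "on", "the", "mat"], vocab=["dog", "ran", "under"]))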
3. Why can the BERT pre-trained model only input a maximum of 512 words and combine at most two sentences?
This is due to the initial settings of Google’s BERT pre-trained model, where the former corresponds to Position Embeddings, and the latter corresponds to Segment Embeddings.
In BERT, Token, Position, and Segment Embeddings are all learned, and in the PyTorch code, they are as follows:
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)                    # Token Embeddings
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)   # Position Embeddings: at most max_position_embeddings positions
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)         # Segment Embeddings: at most type_vocab_size segments
The above BERT PyTorch code is from: github.com/xieyufei1993, with a very clear structural hierarchy.
In the BERT config:
"max_position_embeddings": 512
"type_vocab_size": 2
Thus, when directly using Google’s BERT pre-trained model, the input can contain at most 512 tokens in total (and this budget also has to cover [CLS] and [SEP]), and at most two sentences can be combined into one input. Tokens and sentences beyond these limits simply have no corresponding embeddings.
Of course, if you have sufficient hardware resources to retrain BERT, you can modify the BERT config to set larger values for max_position_embeddings and type_vocab_size to meet your needs.
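As a toy illustration (the numbers mirror the config above), a position index beyond max_position_embeddings simply has no embedding row to look up:

import torch
import torch.nn as nn

max_position_embeddings, hidden_size = 512, 768
position_embeddings = nn.Embedding(max_position_embeddings, hidden_size)

position_ids = torch.arange(512)                 # positions 0..511 are fine
print(position_embeddings(position_ids).shape)   # torch.Size([512, 768])
# position_embeddings(torch.tensor([512]))       # IndexError: position 512 has no row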
4. Why does BERT add a [CLS] token before the first sentence?
BERT adds a [CLS] token before the first sentence so that the vector corresponding to this position in the last layer can serve as the semantic representation of the entire sentence, which can be used for downstream classification tasks, etc.
Why choose it? Because compared to other words already in the text, this symbol with no obvious semantic information will more “fairly” integrate the semantic information of various words in the text, thus better representing the semantics of the entire sentence.
As a supplementary note, BERT provides two kinds of output:
One is get_pooled_out(), which is the representation of [CLS], with an output shape of [batch_size, hidden_size].
The other is get_sequence_out(), which retrieves the vector representation of each token in the entire sentence, with an output shape of [batch_size, seq_length, hidden_size], which also includes [CLS]. Therefore, care must be taken when performing token-level tasks.
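For illustration, here is a rough equivalent using the Hugging Face transformers library (my own substitution; the get_pooled_out()/get_sequence_out() names above come from Google's TF implementation):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT adds a [CLS] token before the first sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # [batch_size, seq_length, hidden_size], includes [CLS]
print(outputs.pooler_output.shape)      # [batch_size, hidden_size], derived from the [CLS] position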
5. How is the time complexity of Self-Attention calculated?
The time complexity of Self-Attention is O(n²·d), where n is the length of the sequence and d is the dimension of the embedding.
Self-Attention includes three steps: similarity calculation, softmax, and weighted average, with respective time complexities:
The similarity calculation can be viewed as multiplying two matrices of sizes (n,d) and (d,n), which costs O(n²·d) and results in an (n,n) matrix.
Softmax is then applied row-wise to this (n,n) matrix, with a time complexity of O(n²).
The weighted average can be viewed as multiplying matrices of sizes (n,n) and (n,d), which costs O(n²·d) and results in an (n,d) matrix.
Therefore, the time complexity of Self-Attention is O(n²·d).
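A shape-only sketch of the three steps (single head, no batch; the sizes are arbitrary):

import torch

n, d = 128, 64            # sequence length, embedding dimension
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

scores = Q @ K.t()                                   # (n,d) x (d,n) -> (n,n), O(n^2 * d)
weights = torch.softmax(scores / d ** 0.5, dim=-1)   # row-wise softmax over (n,n), O(n^2)
out = weights @ V                                    # (n,n) x (n,d) -> (n,d), O(n^2 * d)
print(out.shape)                                     # torch.Size([128, 64])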
Next, let’s analyze Multi-Head Attention, which functions similarly to multiple kernels in CNN.
The implementation of multi-head is not through looping and calculating each head, but through transposes and reshapes, using matrix multiplication to accomplish it.
In practice, the multi-headed attention is done with transposes and reshapes rather than actual separate tensors. —— From Google BERT source code
In Transformer/BERT, the dimension d, which is the hidden_size/embedding_size, is reshaped and split, and you can refer to Google’s TF source code or the above PyTorch source code:
hidden_size (d) = num_attention_heads (m) * attention_head_size (a), i.e., d = m * a
And the num_attention_heads dimension is transposed to the front, so that the dimensions of Q and K are both (m,n,a), ignoring the batch dimension.
Thus, the dot product can be viewed as multiplying tensors of sizes (m,n,a) and (m,a,n), resulting in an (m,n,n) tensor, i.e., m heads computed at once, with a time complexity of O(m·n²·a) = O(n²·d), since d = m·a.
For tensor multiplication time complexity analysis, refer to: Matrix and Tensor Multiplication Time Complexity Analysis.
Thus, the time complexity of Multi-Head Attention is also O(n²·d), but in practice tensor multiplication can be accelerated, so the actual cost may be lower.
However, regarding the logic of doing transposes and reshapes, I personally have not fully understood it, and I hope someone knowledgeable can clarify it. Thank you!
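For reference, here is a minimal sketch of that split (my own illustration, not the Google source): the hidden dimension d is viewed as (m, a), the head dimension is transposed forward, and one batched matmul then computes all heads at once.

import torch

batch, n, m, a = 2, 128, 12, 64   # batch size, sequence length, heads, head size
d = m * a                         # hidden size

x = torch.randn(batch, n, d)

def split_heads(t):
    # (batch, n, d) -> (batch, n, m, a) -> (batch, m, n, a)
    return t.view(batch, n, m, a).transpose(1, 2)

q, k = split_heads(x), split_heads(x)

# (batch, m, n, a) x (batch, m, a, n) -> (batch, m, n, n): all m heads in one matmul
scores = q @ k.transpose(-2, -1)
print(scores.shape)               # torch.Size([2, 12, 128, 128])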
6. Where does Transformer share weights, and why can it share weights?
Transformer shares weights in two places:
(1) Weight sharing between the embedding layers of the Encoder and Decoder;
(2) Weight sharing between the embedding layer and the FC layer in the Decoder.
For (1): in “Attention is All You Need”, the Transformer is applied to machine translation, where the source and target languages differ but can share one large vocabulary. Tokens that appear in both languages (numbers, punctuation, etc.) then obtain better representations, and for the Encoder and Decoder only the embeddings of the respective language are actually activated, so a common vocabulary makes it possible to share the embedding weights.
In the paper, the Transformer vocabulary used BPE for processing, so the smallest unit is subword. English and German belong to the same Germanic language family, sharing many similar subwords that can share similar semantics. However, for languages like Chinese and English, which are quite different, the semantic sharing effect might not be significant.
However, a shared vocabulary is larger, which increases the cost of the softmax over it, so whether to share in practice needs to be weighed case by case.
This point is referenced from: zhihu.com/question/3334
For (2), the embedding layer can be seen as looking up the embedding vector via a one-hot encoding, while the FC layer can be seen as the reverse: given a vector x, it produces (after softmax) the probability that x corresponds to each word in the vocabulary, and in the greedy case the word with the highest probability is taken as the prediction.
So which word gets the highest probability? Assuming the rows of the FC weight matrix have similar magnitudes, the row most similar to x will in theory yield the largest dot product and thus the highest softmax probability (by the same reasoning as in question 1 above).
Therefore, the embedding layer and FC layer share weights, and the word corresponding to the row closest to the vector x will obtain a greater predicted probability. In fact, the embedding layer and FC layer in the Decoder are somewhat like inverse processes.
Through such weight sharing, the number of parameters can be reduced, speeding up convergence.
However, I initially had a confusion: the parameter dimensions of the embedding layer are (v,d), and the parameter dimensions of the FC layer are (d,v), can they be shared directly, or do they need to be transposed? Where v is the vocabulary size and d is the embedding dimension.
Upon checking the PyTorch source code, I found that they can indeed be shared directly:
fc = nn.Linear(d, v, bias=False) # Decoder FC layer definition
weight = Parameter(torch.Tensor(out_features, in_features)) # Linear layer weight definition
In the definition of the Linear layer’s weights, they are ordered as (out_features, in_features), and the actual computation will first transpose the weight before multiplying it with the input matrix. Thus, the dimensions of the FC layer’s corresponding Linear weights are also (v,d), allowing for direct sharing.
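A minimal weight-tying sketch (hypothetical sizes), showing that the (v,d)-shaped embedding weight can be assigned to the Linear layer directly:

import torch
import torch.nn as nn

v, d = 30000, 512                 # vocabulary size, embedding dimension

embedding = nn.Embedding(v, d)    # weight shape: (v, d)
fc = nn.Linear(d, v, bias=False)  # weight shape: (out_features, in_features) = (v, d)

fc.weight = embedding.weight      # shapes already match, no transpose needed

x = torch.randn(3, d)             # three decoder output vectors
print(fc(x).shape)                # torch.Size([3, 30000]); computed internally as x @ weight.T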
7. Where does the non-linearity of BERT come from?
The non-linearity comes from the GELU activation function in the feed-forward layers, and also from self-attention itself, which is non-linear (the softmax is non-linear). Thanks to the readers who pointed this out in the comments.
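For reference, Google's BERT code implements GELU with a tanh approximation; a PyTorch re-implementation sketch:

import math
import torch

def gelu(x):
    # tanh approximation of GELU, as used in the Google BERT source
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))

x = torch.linspace(-3, 3, 7)
print(gelu(x))  # non-linear: negative inputs are squashed rather than simply zeroed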
Several other questions are also very good and worth focusing on, but there are already excellent answers online, as follows:
8. Does directly summing the three embeddings in BERT affect semantics?
Reference: zhihu.com/question/3748
9. What is the reason for scaling in the dot product model of Transformer?
Reference: zhihu.com/question/3397
10. How to solve the long text problem in BERT applications?
Reference: zhihu.com/question/3274