Over the past two years, BERT has become enormously popular. Many people have heard of BERT without knowing what it actually is. In short, the arrival of BERT completely changed the relationship between pre-trained word vectors and downstream NLP tasks, pushing the training of word representations down to a foundational, task-independent level.
To understand BERT, you first need to understand the Transformer framework. In this post we work through the details, from the Transformer all the way to BERT.
1. Attention
Before getting to the Transformer and BERT, we need to understand the Attention and Self-Attention mechanisms. The essence of Attention is to learn a weight distribution over the input features. The feature has a notion of length along some dimension: if the input feature has length n, Attention must learn a weight distribution of length n over it, computed from similarity scores, and the score it returns is the weighted sum of those weights and the features.
1.1 Calculation Process of Attention
The input of Attention is Q, K, V, and it returns a score. The calculation can be written as:

$$a_i = \frac{\exp\big(s(q, k_i)\big)}{\sum_{j=1}^{n}\exp\big(s(q, k_j)\big)}, \qquad \mathrm{Attention}(q, K, V) = \sum_{i=1}^{n} a_i\, v_i$$

Note that the subscripts in this formula matter: the weight distribution we need to learn is indexed over the feature whose weights we are looking for.
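As a minimal sketch of this process (dot-product similarity is assumed here, and the array shapes are purely illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(q, K, V):
    """Score every key against the query, normalize the scores into a
    weight distribution A of length n, and return the weighted sum of the values."""
    scores = K @ q          # similarity s(q, k_i) for each of the n keys
    A = softmax(scores)     # weight distribution of length n
    return A @ V            # weighted sum of the value vectors

# toy example: n = 3 positions, feature dimension d = 4
q, K, V = np.random.randn(4), np.random.randn(3, 4), np.random.randn(3, 4)
print(attention(q, K, V).shape)   # (4,)
```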
1.1.1 Meaning of QKV in English
- Q stands for Query: the sequence doing the querying. During the similarity computation, Q participates as a whole.
- K stands for Key: the index being queried. The weight distribution A we learn has length n, and the value at each index of A is the weight assigned to that position, so K controls the indexing.
- V stands for Value: the feature values themselves, which are weighted by the distribution A to obtain the final output.
1.1.2 Calculation Methods of Similarity
There are many ways to calculate similarity:
Similarity Name | Calculation Method |
---|---|
Dot Product | $s(q,k)=q^Tk$ |
Matrix Product | $s(q,k)=q^TWk$, parameter $W$ |
Cosine Similarity | $s(q,k)=\frac{q^Tk}{\|q\|\cdot\|k\|}$ |
Concat | $s(q,k)=W[q;k]$, parameter $W$ |
MLP | $s(q,k)=v^T\tanh(Wq+Uk)$, parameters $W$, $U$, $v$ |
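For concreteness, the table rows can be written out directly as code. In the sketch below, W, U, Wc and v are hypothetical, randomly initialized parameters of matching sizes:

```python
import numpy as np

d = 4
W  = np.random.randn(d, d)       # parameter for matrix-product and MLP scoring
U  = np.random.randn(d, d)       # second parameter for MLP scoring
Wc = np.random.randn(2 * d)      # parameter for concat scoring
v  = np.random.randn(d)          # projection vector for MLP scoring

def dot_product(q, k):    return q @ k
def matrix_product(q, k): return q @ W @ k
def cosine(q, k):         return (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))
def concat(q, k):         return Wc @ np.concatenate([q, k])
def mlp(q, k):            return v @ np.tanh(W @ q + U @ k)

q, k = np.random.randn(d), np.random.randn(d)
for score in (dot_product, matrix_product, cosine, concat, mlp):
    print(score.__name__, float(score(q, k)))
```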
1.2 Attention in HAN
First, let's look at how Q, K, V show up in the Attention used in HAN (Hierarchical Attention Networks).
In HAN there is only one input, the matrix of hidden states $h$, and the output is the weighted average of the weight distribution $A$ and $h$, so $h$ is the Value in the Attention mechanism. We apply a linear transformation (followed by $\tanh$) to $h$ to obtain $u$, and we randomly initialize a vector $u_w$ in order to compute $A$. The formulas are:

$$u_i = \tanh(W h_i + b), \qquad a_i = \frac{\exp(u_i^T u_w)}{\sum_{j}\exp(u_j^T u_w)}, \qquad s = \sum_i a_i\, h_i$$
As the formula shows, $u_w$ stays on the querying side and participates as a whole, so the randomly initialized vector $u_w$ is the Query in this Attention mechanism. The linear transformation produces $u$, which supplies a different weight value for each index of $A$, so $u$ plays the role of the Key. The similarity used here is obviously the dot product, although I ran into some difficulties when implementing it myself and switched to the MLP formulation.
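Assuming the input is the matrix of hidden states $h$ from HAN's word-level GRU encoder, a minimal sketch of this attention could look as follows (the names here are mine, not taken from the original HAN code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def han_attention(h, W, b, u_w):
    """h: (n, d) hidden states (the Value); u_w: randomly initialized query vector."""
    u = np.tanh(h @ W + b)    # linear transformation + tanh -> the Key
    A = softmax(u @ u_w)      # dot-product similarity of each u_i with the query u_w
    return A @ h              # weighted average of the hidden states

n, d = 5, 8
h = np.random.randn(n, d)
W, b, u_w = np.random.randn(d, d), np.random.randn(d), np.random.randn(d)
print(han_attention(h, W, b, u_w).shape)   # (8,)
```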
1.3 Attention in Seq2seq
Now let's look at the Attention mechanism in seq2seq. In this task we generate the output step by step, and at each step the model produces a distribution over the vocabulary, from which the corresponding word is chosen.
Our generation formula is:

$$y_t = f(y_1, \dots, y_{t-1}, c_t)$$

It can be seen that the context $c_t$ must be recomputed at every generation step, and in this model $c_t$ is exactly the final score returned by the Attention module.
In the seq2seq model, we denote the outputs of the Encoder by $h$. The weight distribution we need to learn is over $h$, so $h$ is the Value here, and the Key is also $h$ itself; unlike in HAN, it is not transformed. The Query at each step is the already generated sequence $y_1, \dots, y_{t-1}$, i.e. what the Decoder has produced so far, and this query obviously grows longer as generation proceeds. With these pieces we can produce the final output.
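A small sketch under these definitions, computing the context returned by Attention at one decoding step; dot-product scoring and a single decoder-state vector summarizing the generated sequence are assumptions made for brevity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def seq2seq_attention(s_prev, H):
    """s_prev: (d,) summary of what the Decoder has generated so far (the Query);
    H: (n, d) Encoder outputs, serving as both Key and Value."""
    A = softmax(H @ s_prev)   # weight distribution over the n source positions
    return A @ H              # context c_t fed into the next generation step

n, d = 6, 8
H, s_prev = np.random.randn(n, d), np.random.randn(d)
print(seq2seq_attention(s_prev, H).shape)   # (8,)
```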
2. Transformer
The Transformer addresses the slow training that RNNs are often criticized for, using the self-attention mechanism to process sequences quickly and in parallel.
2.1 Self-Attention
In the Transformer, the Attention variant used is Self-Attention, which differs slightly from the Attention described above. In short, three parameter matrices are learned that transform the same embedding, linearly projecting it into Q, K, and V, which are then used to compute the Attention scores for the sentence. The word 'Self' indicates that Q, K, and V are all generated from the input itself.
Normalization: before the weight distribution is normalized with softmax, the scores are divided by $\sqrt{d_k}$, the square root of the dimension of the key vectors, which stabilizes the gradients. In the original paper $d_k = 64$, so the divisor is 8; other values could be used as well.
Return: the output has the same length as the input. Each word's output is the weighted sum of all words' Values under the weight distribution computed for that word, so there are as many Attention scores as there are words; this is what Self-Attention returns.
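Putting the projections, the scaling by $\sqrt{d_k}$, and the weighted sum together, a minimal sketch of Self-Attention might look like this (the matrix names and sizes are illustrative, not taken from any particular implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model) embeddings of the n input words.
    Q, K, V are all projections of X itself -- hence 'Self'-Attention."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scale by sqrt(d_k) before softmax
    A = softmax(scores, axis=-1)      # one weight distribution per word
    return A @ V                      # (n, d_k): one attention output per word

n, d_model, d_k = 5, 16, 8
X = np.random.randn(n, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)   # (5, 8)
```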
2.2 Model Structure
- Transformer: Input (Embedding) → Encoder ×6 → Decoder ×6 → Output
- Encoder: Multi-headed attention → Add&Norm → Feed Forward → Add&Norm
- Decoder: Multi-headed attention → Add&Norm → Encoder-Decoder-Attention → Add&Norm → Feed Forward → Add&Norm
- Multi-headed attention: Self-Attention ×8
Here, Encoder-Decoder-Attention refers to the Attention structure in seq2seq, where K and V are the output from the top layer of the Encoder.
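For intuition only, here is a rough sketch of one Encoder block built from PyTorch's standard modules (d_model = 512, 8 heads, and d_ff = 2048 follow the paper's defaults; this is a stand-in, not the original implementation):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Multi-headed attention -> Add&Norm -> Feed Forward -> Add&Norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)       # Add & Norm (residual connection)
        x = self.norm2(x + self.ff(x))     # Feed Forward, then Add & Norm
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)                # (batch, sequence length, d_model)
print(layer(x).shape)                      # torch.Size([2, 10, 512])
```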
2.3 Multi-headed Attention
Self-Attention produces one set of Q, K, V and one output; Multi-headed Attention produces 8 sets in parallel. In practice, the 8 outputs are concatenated (and then projected back to the model dimension by an additional weight matrix).
It should be noted that the multi-headed self-attention on the Decoder side needs to be masked, because a position cannot be allowed to attend to words that have not been generated yet: the scores for all later positions are masked out before the softmax.
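A minimal sketch of such a mask, applied to the raw scores before the softmax (shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.randn(n, n)                   # raw Q·K^T scores on the Decoder side
mask = np.triu(np.full((n, n), -np.inf), k=1)    # -inf strictly above the diagonal
A = softmax(scores + mask, axis=-1)              # masked positions receive weight 0
print(np.round(A, 2))                            # upper triangle is all zeros
```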