Understanding Transformer in Ten Minutes

Transformer is a model that uses the attention mechanism to speed up model training. For background on the attention mechanism, see this article (https://zhuanlan.zhihu.com/p/52119092). Transformer can be described as a deep learning model built entirely on the self-attention mechanism: it lends itself well to parallel computation, and its model capacity gives it higher accuracy and better performance than the previously popular RNN (Recurrent Neural Network).

So what is a transformer?

You can simply think of it as a black box. For a text translation task, we feed in a piece of Chinese text, and after passing through this black box, out comes the translated English text.

So what is inside this black box?

It mainly consists of two parts: Encoder and Decoder.
When we input a piece of text, it first goes through a module called Encoders, which encodes it; the encoded data is then passed to a module called Decoders for decoding, and after decoding we get the translated text. Accordingly, we refer to Encoders as the encoder and Decoders as the decoder.

So what is inside the encoder and decoder?

Careful readers may have noticed the ‘s’ after Encoders and Decoders in the image above, indicating that there is more than one of each. Indeed, the encoding module contains many small encoders: typically there are 6 small encoders in the Encoders and, likewise, 6 small decoders in the Decoders.
We see that in the encoding part, the input of each small encoder is the output of the previous small encoder, while the input of each small decoder not only includes the output of its previous decoder but also the output of the entire encoding part.

So you might ask, what is in each small encoder?

If we zoom in on an encoder, we find that its structure consists of a self-attention mechanism plus a feedforward neural network.

Let’s first take a look at what self-attention looks like.

We will explain it through several steps:
1. First, the input to self-attention is the word vectors, i.e., the initial input to the entire model is in the form of word vectors. The self-attention mechanism, as the name suggests, calculates attention of the input over itself. To construct the input to self-attention, the transformer first multiplies each word vector by three matrices to obtain three new vectors. Multiplying by three learned matrices, rather than using the original word vector directly, adds parameters and improves model performance. For the input X1 ("machine"), multiplying by the three matrices gives Q1, K1, V1; similarly, the input X2 ("learning") is multiplied by the same three matrices to give Q2, K2, V2.
2. Next, we calculate the attention scores by taking the dot product of the Q vector with the K vector of every word. Taking X1 as an example, we compute the dot products of Q1 with K1 and with K2; suppose the resulting scores are 112 and 96, respectively.
3. Divide the scores by 8 (the square root of the K-vector dimension, which is 64 in the paper); this keeps the gradients more stable during training.
4. Apply the softmax function to the scaled scores. This normalizes them so that they are all positive and sum to 1.
5. Multiply the V vectors by the softmax results. The idea is to keep the values of the words we want to focus on essentially unchanged while drowning out unrelated words (their V vectors get multiplied by very small weights).
6. Sum the weighted V vectors to produce the output of the self-attention layer at this position (the first word). The self-attention outputs for other positions are computed in the same way.
We can summarize the above process with the scaled dot-product attention formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) V
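To make the six steps above concrete, here is a minimal NumPy sketch of this scaled dot-product self-attention. The dimensions, random matrices, and variable names are purely illustrative and not taken from the article; in the real model the three projection matrices are learned parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors.
    X: (seq_len, d_model) word vectors; W_q/W_k/W_v: (d_model, d_k) projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # step 1: build Q, K, V
    scores = Q @ K.T                             # step 2: dot-product attention scores
    scores = scores / np.sqrt(K.shape[-1])       # step 3: scale by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step 4: softmax
    return weights @ V                           # steps 5-6: weight and sum the V vectors

# Toy example: 2 words ("machine", "learning"), d_model = 4, d_k = 3 (all made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)    # (2, 3): one output vector per word
```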

Does the self-attention layer end here?

No. The paper introduces the "multi-head attention" mechanism to further refine the self-attention layer, improving its performance in two ways.
The first is that it expands the model's ability to focus on different positions, which is particularly useful when translating a sentence, for example when we want to know which word a pronoun like "it" refers to.
The second is that it gives the self-attention layer multiple "representation subspaces." With multi-head self-attention we have not just one set of Q/K/V weight matrices but several (the paper uses 8 sets), so each encoder/decoder uses 8 "heads" (which you can think of as 8 independent self-attention computations), each with its own Q/K/V. Each head therefore produces a different output matrix Z, projecting the input vectors into a different representation subspace.
After applying the multi-head attention mechanism, we obtain multiple output matrices Z, which together form the output of the self-attention layer.
After the self-attention layer we obtain its output, which becomes the input to the feed-forward neural network layer. The feed-forward network expects a single matrix rather than eight, so we need to compress these 8 matrices into one. How? We simply concatenate them and multiply by an additional weight matrix.
The final Z is used as the input for the feedforward neural network.
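As a rough sketch of how the heads are combined (this reuses the self_attention function from the sketch above; the number of heads and the dimensions are again illustrative):

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """Run several independent self-attention heads, concatenate their output
    matrices Z, and multiply by an extra weight matrix W_o to compress them
    back into a single matrix for the feed-forward layer.
    heads: list of (W_q, W_k, W_v) tuples, one per head."""
    Z_heads = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    Z_concat = np.concatenate(Z_heads, axis=-1)   # concatenate the 8 Z matrices
    return Z_concat @ W_o                          # compress back into one matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 4))
heads = [tuple(rng.normal(size=(4, 3)) for _ in range(3)) for _ in range(8)]
W_o = rng.normal(size=(8 * 3, 4))
print(multi_head_attention(X, heads, W_o).shape)  # (2, 4): same shape as the input
```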
Next, we enter the feed-forward neural network module inside the small encoder. There is already plenty of material online about feed-forward neural networks, so we won't go into detail here. Just know that its input is the output of self-attention, i.e., the Z in the diagram above, a matrix of shape (sequence length × word-vector dimension), and its output has the same shape.
This concludes the internal structure of a small encoder. The full encoding part simply stacks 6 of these encoders, and the output of the last one is the output of the entire encoding part. In addition, to counter the vanishing-gradient problem, a residual structure is used in both the Encoders and the Decoders: the input to each feed-forward neural network contains not only the self-attention output Z but also the original input.
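A minimal sketch of how one small encoder could chain these pieces together with the residual connections described above (layer normalization, which the paper also applies around each sublayer, is omitted, and a stand-in identity function replaces real attention just to keep the example self-contained):

```python
import numpy as np

def feed_forward(Z, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU.
    Input and output both have shape (seq_len, d_model)."""
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

def encoder_layer(X, attention_fn, ffn_params):
    """One small encoder: self-attention followed by a feed-forward network,
    each wrapped in a residual connection (the sublayer's input is added back)."""
    Z = X + attention_fn(X)                      # residual around self-attention
    return Z + feed_forward(Z, *ffn_params)      # residual around the feed-forward net

def encoder_stack(X, layers):
    """Stack several small encoders: each one's input is the previous one's output."""
    for attention_fn, ffn_params in layers:
        X = encoder_layer(X, attention_fn, ffn_params)
    return X

# Shape check with 6 stacked encoders (toy dimensions, random weights).
rng = np.random.default_rng(2)
d_model, d_ff = 4, 8
ffn_params = (rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
              rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
layers = [(lambda Z: Z, ffn_params)] * 6
print(encoder_stack(rng.normal(size=(2, d_model)), layers).shape)  # (2, 4)
```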
The encoder described above encodes the input ("machine learning") using the self-attention + feed-forward structure. The decoder uses a similar structure: it first computes self-attention over the output generated so far; the difference is that the result of this self-attention is then attended against the output of the Encoders module (encoder-decoder attention) before entering the feed-forward neural network module.
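A minimal sketch of that extra encoder-decoder attention step (shapes and weight matrices are again illustrative): the queries come from the decoder side, while the keys and values come from the encoder output.

```python
import numpy as np

def encoder_decoder_attention(Y, enc_out, W_q, W_k, W_v):
    """Attention between decoder and encoder: queries Q are built from the
    decoder's self-attention output Y, while keys K and values V are built from
    the encoder output, so each decoder position can attend over the whole
    source sentence."""
    Q, K, V = Y @ W_q, enc_out @ W_k, enc_out @ W_v
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(4)
Y = rng.normal(size=(3, 4))          # 3 decoder positions generated so far
enc_out = rng.normal(size=(2, 4))    # encoder output for the 2 source words
W_q, W_k, W_v = (rng.normal(size=(4, 3)) for _ in range(3))
print(encoder_decoder_attention(Y, enc_out, W_q, W_k, W_v).shape)  # (3, 3)
```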
That concludes the explanation of the Transformer's two main modules, encoding and decoding. Now, returning to the initial question: how do we actually produce the translated words "machine learning"? The decoder's output is just a vector of floating-point numbers; how do we convert it into those two words?
The answer is a final linear layer followed by a softmax. The linear layer is a simple fully connected neural network that projects the vector produced by the decoder into a much larger vector of scores called the logits. Assuming our model's vocabulary contains 10,000 words, the logits have 10,000 dimensions, one score per word. The softmax layer then converts these scores into probabilities, and the word corresponding to the dimension with the highest probability is emitted as the output for this time step.

Assuming a vocabulary of only 6 words, the process of selecting the word with the highest probability works in exactly the same way.

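As a rough illustration of that final step (toy vocabulary and random weights, nothing here is taken from the article), a sketch of the linear layer + softmax + argmax:

```python
import numpy as np

def decode_step(decoder_vector, W_vocab, b_vocab, vocab):
    """Final linear layer + softmax: project the decoder output onto
    vocabulary-sized logits, convert them into probabilities, and pick the
    most probable word for this time step."""
    logits = decoder_vector @ W_vocab + b_vocab      # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()                      # softmax: scores -> probabilities
    return vocab[int(np.argmax(probs))], probs

# Toy 6-word vocabulary, matching the example above (all values made up).
vocab = ["<pad>", "machine", "learning", "a", "I", "thanks"]
rng = np.random.default_rng(3)
d_model = 4
W_vocab = rng.normal(size=(d_model, len(vocab)))
b_vocab = np.zeros(len(vocab))
word, probs = decode_step(rng.normal(size=d_model), W_vocab, b_vocab, vocab)
print(word, probs.round(3))
```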
This is the framework of the Transformer, but one last question remains: in an RNN, inputs arrive sequentially and therefore carry a specific order, whereas the Transformer framework described so far takes no order information into account. This brings us to another concept: "positional encoding."
Indeed, the Transformer does not consider order information. So what do we do? We can manipulate the input to make it carry positional information. How do we turn the word vector input into an input that carries positional information?
We can add a vector with sequential features to each word vector. It turns out that the sine and cosine functions can express these features well, so the positional vector is typically represented by the following formula:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))
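A minimal sketch of these sinusoidal positional encodings (d_model is assumed to be even; the toy sizes are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the formulas above:
    even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Add the positional vectors to the word vectors before the first encoder.
word_vectors = np.zeros((2, 8))                 # toy input: 2 words, d_model = 8
inputs = word_vectors + positional_encoding(2, 8)
print(inputs.shape)                             # (2, 8)
```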
Finally, here is a classic diagram. When we first look at this diagram, it may be difficult to understand. I hope that after gaining a deeper understanding of Transformer, you can look at this diagram again and have a more profound understanding.
[Figure: the classic Transformer architecture diagram]
This concludes the introduction to Transformer. Many classic models such as BERT and GPT-2 are based on the ideas of Transformer. We will have the opportunity to introduce these two record-breaking classic models in detail later.

Editor / Zhang Zhihong

Reviewer / Fan Ruiqiang

Rechecker / Zhang Zhihong
