Understanding Transformer Models: A Comprehensive Guide

Source: Python Data Science

This article is about 7200 words long and is recommended to read in 14 minutes.
In this article, we will explore the Transformer model and understand how it works.

1. Introduction

Google’s BERT model achieved state-of-the-art results on 11 NLP tasks, energizing the entire NLP community. One key factor behind BERT’s success is the power of the Transformer. Google’s Transformer model was initially used for machine translation, where it achieved state-of-the-art results at the time. The Transformer addresses the slow training of RNNs by using self-attention to enable fast parallelization. In addition, Transformers can be stacked very deep, fully exploiting the characteristics of DNN models to improve accuracy. In this article, we will study the Transformer model and understand how it works.

Source:
https://blog.csdn.net/longxinchen_ml/article/details/86533005
Original Author: Jay Alammar
Original Link:
https://jalammar.github.io/illustrated-transformer

2. Main Content Begins

The Transformer was proposed in the paper “Attention is All You Need”, and is now recommended as a reference model by Google Cloud TPU. Related TensorFlow code can be obtained from GitHub, as part of the Tensor2Tensor package. Harvard’s NLP team has also implemented a PyTorch version with annotations of the paper.

In this article, we will attempt to simplify the model and introduce its core concepts step by step, hoping to make it easy for ordinary readers to understand.

Attention is All You Need:

https://arxiv.org/abs/1706.03762

Starting from a macro perspective, we first view this model as a black box operation. In machine translation, it means inputting one language and outputting another.

To break down this black box, we can see that it consists of an encoder component, a decoder component, and the connections between them.

The encoder component consists of a stack of encoders (the paper stacks 6 encoders together—the number 6 is not magical; you can try other numbers). The decoder component is also composed of the same number (corresponding to the encoders) of decoders.

All encoders are structurally identical, but they do not share parameters. Each encoder can be broken down into two sub-layers.

The input to the encoder first passes through a self-attention layer, which helps the encoder look at other words in the input sentence as it encodes each word. We will delve deeper into self-attention later in the article.

The output from the self-attention layer is passed to a feed-forward neural network. The feed-forward neural network corresponding to each position’s word is exactly the same (note: another interpretation is that it is a one-dimensional convolutional neural network with a window size of one word).

The decoder also has the self-attention and feed-forward layers found in the encoder. In addition, there is an attention layer between these two layers that helps the decoder focus on relevant parts of the input sentence (similar to the attention mechanism in seq2seq models).

Introducing Tensors into the Picture

Now that we understand the main parts of the model, let’s take a look at how various vectors or tensors (note: the concept of tensors is a generalization of vectors; simply put, vectors are first-order tensors, and matrices are second-order tensors) transform the input into output in different parts of the model.

Like most NLP applications, we first convert each input word into a word vector using a word embedding algorithm.

Each word is embedded as a 512-dimensional vector, and we represent these vectors with simple boxes.

The word embedding process occurs only in the bottom-most encoder. All encoders share a common feature: they receive a list of vectors, each of size 512. In the bottom (first) encoder this list is the word embeddings, but in the other encoders it is the output of the encoder directly below. The size of this list is a hyperparameter we can set; usually it is the length of the longest sentence in our training dataset.
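As a minimal sketch of this step (the toy vocabulary, token ids, and randomly initialized embedding table below are made-up placeholders, not the model's actual embeddings):

import numpy as np

d_model = 512                                    # embedding size used in the paper
vocab = {"je": 0, "suis": 1, "étudiant": 2}      # toy vocabulary (hypothetical)

# Randomly initialized embedding table; in a real model this is learned.
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

tokens = ["je", "suis", "étudiant"]
X = np.stack([embedding_table[vocab[t]] for t in tokens])
print(X.shape)   # (3, 512): one 512-dimensional vector per input word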

After the input sequence is word-embedded, each word flows through two sub-layers in the encoder.

Next, let’s look at a core property of the Transformer: the word at each position of the input sequence flows through its own path in the encoder. In the self-attention layer there are dependencies among these paths, but the feed-forward layer has no such dependencies, so the various paths can be executed in parallel while passing through the feed-forward layer.

Then let’s take a shorter sentence as an example to see what happens in each sub-layer of the encoder.

Now We Begin “Encoding”

As mentioned above, an encoder receives a list of vectors as input, passes the vectors in the list to the self-attention layer for processing, then to the feed-forward neural network layer, and finally passes the output to the next encoder.

The word at each position of the input sequence goes through a self-attention process. Then each of them passes through the same feed-forward neural network; each vector goes through it separately.

Macro Perspective of the Self-Attention Mechanism

Don’t be misled by my throwing around the term self-attention as if it were a concept everyone is already familiar with. In fact, I had not encountered it myself until I read the paper Attention is All You Need. Let’s distill how it works.

For example, the following sentence is the input sentence we want to translate:

The animal didn’t cross the street because it was too tired

What does “it” refer to in this sentence? Does it refer to the street or the animal? This is a simple question for humans but not for algorithms.

When the model processes the word “it”, the self-attention mechanism allows “it” to connect with “animal”.

As the model processes each word in the input sequence, self-attention focuses on all words in the entire input sequence, helping the model better encode the current word.

If you are familiar with RNNs (recurrent neural networks), recall how maintaining a hidden state lets an RNN combine the representations of previously processed words/vectors with the current word/vector being processed. The self-attention mechanism is how the Transformer folds its understanding of other relevant words into the word we are currently processing.

When we are encoding the word “it” in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on “The animal”, incorporating some of its representation into the encoding of “it”.

Please make sure to check the Tensor2Tensor notebook, where you can download a Transformer model and use interactive visualization to examine it.

Microscopic Perspective of the Self-Attention Mechanism

First, let’s understand how to use vectors to calculate self-attention, and then see how it is implemented using matrices.

The first step in calculating self-attention is generating three vectors from each encoder input vector (the word vector of each word). In other words, for each word, we create a query vector, a key vector, and a value vector. These three vectors are created by multiplying the word embedding with three weight matrices.

It can be observed that these new vectors have lower dimensions than the word embedding vectors. Their dimensions are 64, while the dimensions of the word embedding and the input/output vectors of the encoder are 512. However, there is no strict requirement for smaller dimensions; this is merely an architectural choice that keeps most of the calculations of multi-headed attention unchanged.

X1 is multiplied by the WQ weight matrix to obtain q1, which is the query vector related to this word. Ultimately, each word in the input sequence creates a query vector, a key vector, and a value vector.
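A minimal sketch of creating the three vectors for a single word, with random stand-ins for the trained WQ/WK/WV matrices:

import numpy as np

d_model, d_k = 512, 64

# Stand-ins for the trained weight matrices WQ, WK, WV.
W_Q = np.random.randn(d_model, d_k) * 0.01
W_K = np.random.randn(d_model, d_k) * 0.01
W_V = np.random.randn(d_model, d_k) * 0.01

x1 = np.random.randn(d_model)   # embedding of the first word, e.g. "Thinking"

q1 = x1 @ W_Q                   # query vector, 64-dimensional
k1 = x1 @ W_K                   # key vector, 64-dimensional
v1 = x1 @ W_V                   # value vector, 64-dimensional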

What are Query Vectors, Key Vectors, and Value Vectors?

They are all abstract concepts that help calculate and understand the attention mechanism. Please continue reading the following content to understand the role each vector plays in the calculation of the attention mechanism.

The second step in calculating self-attention is to calculate the scores. Suppose we are calculating the self-attention vector for the first word “Thinking” in this example; we need to score each word in the input sentence against “Thinking”. These scores determine how much attention is paid to other parts of the sentence when encoding the word “Thinking”.

These scores are calculated by taking the dot product of the key vectors of all the words (the words in the input sentence) with the query vector of “Thinking”. So, if we are processing the self-attention of the word at the front position, the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2.

The third and fourth steps involve dividing the scores by 8 (8 is the square root of the dimension of the key vector 64 used in the paper, which stabilizes the gradients. Other values can also be used; 8 is just the default value), followed by passing the results through softmax. The purpose of softmax is to normalize the scores of all words, resulting in positive values that sum to 1.

This softmax score determines the contribution of each word to the encoding of the current position (“Thinking”). Clearly, the words already at this position will receive the highest softmax scores, but sometimes it is also helpful to pay attention to another word related to the current word.

The fifth step is to multiply each value vector by the softmax score (this prepares them for summation later). The intuition here is to focus on semantically related words and diminish the impact of unrelated words (for example, multiplying them by a small decimal like 0.001).

The sixth step is to sum the weighted value vectors (note: another way to describe self-attention is that encoding a word amounts to taking a weighted sum of the representations (value vectors) of all words, where each weight comes from the dot product of that word’s key vector with the query vector of the word being encoded, followed by a softmax). This gives us the output of the self-attention layer at that position (in our example, for the first word).
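To make the six steps concrete, here is a minimal numpy sketch for computing the output at the first position (“Thinking”); all vectors are random placeholders rather than anything from a trained model:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

d_k = 64
n_words = 2                                   # e.g. "Thinking", "Machines"
q = np.random.randn(n_words, d_k)             # query vectors
k = np.random.randn(n_words, d_k)             # key vectors
v = np.random.randn(n_words, d_k)             # value vectors

# Self-attention output for the first word ("Thinking"):
scores = np.array([q[0] @ k[j] for j in range(n_words)])   # step 2: dot products
scores = scores / np.sqrt(d_k)                             # step 3: divide by 8
weights = softmax(scores)                                  # step 4: softmax
z1 = (weights[:, None] * v).sum(axis=0)                    # steps 5 and 6: weighted sum of values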

Thus, the computation of self-attention is complete. The resulting vector can then be passed to the feed-forward neural network. However, in practice, these calculations are done in matrix form for faster computation. Next, let’s see how it is implemented using matrices.

Implementing Self-Attention Mechanism through Matrix Operations

The first step is to compute the query matrix, key matrix, and value matrix. To do this, we pack the word embeddings of the input sentence into matrix X and multiply it by our trained weight matrices (WQ, WK, WV).

Each row in matrix X corresponds to one word in the input sentence. We again see the size difference between the word embedding vectors (512, or the 4 boxes in the figure) and the q/k/v vectors (64, or the 3 boxes in the figure).

Finally, since we are dealing with matrices, we can combine steps 2 through 6 into a single formula to compute the output of the self-attention layer.

Matrix operation form of self-attention
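In matrix form, steps 2 through 6 collapse into the paper’s formula softmax(QKᵀ/√dk)·V. A minimal sketch, assuming random stand-in weight matrices:

import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Matrix form of self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # one score per word pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # Z: one output row per word

# Toy usage: 3 words, 512-dim embeddings, 64-dim q/k/v.
X = np.random.randn(3, 512)
W_Q, W_K, W_V = (np.random.randn(512, 64) * 0.01 for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)                           # shape (3, 64)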

Multi-Headed Attention

By introducing a mechanism called “multi-headed” attention, the paper further improves the performance of the self-attention layer in two ways:

1. It expands the model’s ability to focus on different positions. In the example above, z1 contains a little of every other word’s encoding, but it may be dominated by the actual word itself. If we are translating a sentence like “The animal didn’t cross the street because it was too tired”, we want to know which word “it” refers to, and this is where the model’s “multi-headed” attention mechanism comes into play.

2. It provides multiple “representation subspaces” for the attention layer. Next, we will see that for the “multi-headed” attention mechanism, we have multiple sets of query/key/value weight matrices (the Transformer uses eight attention heads, so we have eight sets of matrices for each encoder/decoder). Each of these sets is randomly initialized, and after training, each set is used to project the input word embeddings (or vectors from lower encoders/decoders) into different representation subspaces.

In the “multi-headed” attention mechanism, we maintain independent query/key/value weight matrices for each head, resulting in different query/key/value matrices. As before, we multiply X by the WQ/WK/WV matrices to produce the query/key/value matrices.

If we perform the same self-attention calculations as above, we will obtain eight different Z matrices through eight different weight matrix operations.

This presents a challenge. The feed-forward layer does not expect eight matrices; it expects a single matrix (composed of the representation vector of each word). Therefore, we need a way to compress these eight matrices into one. How do we do that? We simply concatenate the matrices and then multiply them by an additional weight matrix WO.
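As a rough continuation of the previous sketch (reusing the self_attention function defined above; the eight per-head weight matrices and WO are random stand-ins for trained ones):

import numpy as np

n_heads, d_model, d_k = 8, 512, 64

X = np.random.randn(3, d_model)                  # 3 input words

heads = []
for _ in range(n_heads):
    # Each head has its own (here randomly initialized) W_Q / W_K / W_V.
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))
    heads.append(self_attention(X, W_Q, W_K, W_V))   # each Z_i has shape (3, 64)

Z_concat = np.concatenate(heads, axis=-1)        # shape (3, 512): the 8 Z matrices side by side
W_O = np.random.randn(n_heads * d_k, d_model) * 0.01
Z = Z_concat @ W_O                               # single matrix handed to the feed-forward layer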

This is almost everything there is to multi-headed self-attention. There are indeed many matrices, and we try to consolidate them into one image so they can be seen at a glance.

Now that we have explored the many “heads” of the attention mechanism, let’s revisit the previous example to see where the different attention “heads” focus when encoding the word “it”:

When encoding the word “it”, one attention head focuses on “animal”, while another focuses on “tired”, meaning that the model’s representation of the word “it” is somewhat a representation of both “animal” and “tired”.

However, if we add all the attention heads to the diagram, things become harder to interpret:

Using Positional Encoding to Represent Sequence Order

So far, our description of the model lacks a way to understand the order of input words.

To solve this problem, the Transformer adds a vector to each input word embedding. These vectors follow a specific pattern learned by the model, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding positional vectors to the word embeddings allows them to better express the distance between words in the subsequent calculations.

To help the model understand the order of words, we add positional encoding vectors, and the values of these vectors follow a specific pattern.

Assuming the dimensionality of the word embedding is 4, the actual positional encoding would look like this:

Example of positional encoding with size 4

What would this pattern look like?

In the following figure, each row corresponds to the positional encoding of a word vector, so the first row corresponds to the first word in the input sequence. Each row contains 512 values, each ranging between 1 and -1. We have color-coded them to make the pattern visible.

Example of positional encoding for 20 rows (lines), with word embedding size 512 (columns). You can see that it splits into two halves from the middle. This is because the left half’s values are generated by one function (using sine), while the right half is generated by another function (using cosine). They are then concatenated to obtain each positional encoding vector.

The original paper describes the formula for positional encoding (Section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. However, its advantage is that it can scale to unknown sequence lengths (for example, when the model we train needs to translate sentences much longer than those in the training set).
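A minimal sketch of the sinusoidal encoding from the paper’s formula (sine on even dimensions, cosine on odd dimensions); note that the get_timing_signal_1d() implementation concatenates the sine and cosine halves instead of interleaving them, as in the figure above, which is effectively a permutation of the same values:

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as in Section 3.5 of the paper."""
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

X = np.random.randn(20, 512)             # 20 word embeddings
X = X + positional_encoding(20, 512)     # add position information before the first encoder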

Residual Modules

Before proceeding, we need to mention a detail in the encoder architecture: there is a residual connection around each sub-layer (self-attention, feed-forward network) in each encoder, followed by a “layer normalization” step.

Layer normalization step:
https://arxiv.org/abs/1607.06450
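A rough sketch of the Add & Normalize wrapper around a sub-layer, assuming a simplified layer norm without learnable gain and bias; sublayer stands for either the self-attention layer or the feed-forward network:

import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Usage sketch (the sub-layer output must have the same shape as its input):
# z = add_and_norm(X, multi_head_attention)
# out = add_and_norm(z, feed_forward)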

If we visualize these vectors and the layer normalization operation associated with self-attention, it would look like the following diagram:

The decoder’s sub-layers are similar. If we imagine a transformer with a 2-layer encoder-decoder structure, it would look like the following diagram:

Decoder Component

Now that we have discussed most concepts of the encoder, we basically know how the decoder works. But it is still best to look at the details of the decoder.

The encoder begins its work by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors containing K (key vectors) and V (value vectors). These vectors will be used by each decoder for its own “encoder-decoder attention layer”, which helps the decoder focus on the appropriate positions in the input sequence:

After completing the encoding phase, the decoding phase begins. Each step of the decoding phase outputs an element of an output sequence (in this example, the translated English sentence).

The subsequent steps repeat this process until a special end-of-sequence symbol is produced, indicating that the Transformer’s decoder has completed its output. The output of each step is fed to the bottom decoder at the next time step, and the decoders bubble up their decoding results just as the encoders did. And just as we did with the encoder inputs, we embed and add positional encodings to the decoder inputs to indicate the position of each word.

The self-attention layers in the decoder behave differently from those in the encoder: in the decoder, the self-attention layers are only allowed to attend to earlier positions in the output sequence. This is done by masking the future positions (setting them to -inf) before the softmax step.
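A minimal sketch of that masking step, assuming raw score matrices in numpy:

import numpy as np

def causal_mask(scores):
    """Mask out future positions: set scores for j > i to -inf before softmax."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    return np.where(mask, -np.inf, scores)

scores = np.random.randn(4, 4)          # raw attention scores for 4 output positions
masked = causal_mask(scores)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # each row attends only to its own and earlier positions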

The “encoder-decoder attention” layer works just like multi-headed self-attention, except that it creates its query matrix from the layer below it and takes the key and value matrices from the output of the encoder stack.

The Final Linear Transformation and Softmax Layer

The decoder component ultimately outputs a real-valued vector. How do we turn those floating-point numbers into a word? That is the job of the linear transformation layer, followed by the Softmax layer.

The linear transformation layer is a simple fully connected neural network that projects the vector produced by the decoder component into a much larger vector called logits.

Let’s assume that our model has learned ten thousand different English words from the training set (our model’s “output vocabulary”). Therefore, the logits vector is a vector of length ten thousand cells—each cell corresponds to the score of a particular word.

The subsequent Softmax layer converts those scores into probabilities (all positive, summing to 1.0). The cell with the highest probability is selected, and the word corresponding to it is output as the result for that time step.
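A minimal sketch of this final step, assuming a ten-thousand-word output vocabulary and a random stand-in for the trained linear layer:

import numpy as np

vocab_size, d_model = 10_000, 512

W_proj = np.random.randn(d_model, vocab_size) * 0.01   # stand-in for the trained linear layer
b_proj = np.zeros(vocab_size)

decoder_output = np.random.randn(d_model)     # vector produced by the decoder stack

logits = decoder_output @ W_proj + b_proj     # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax: all positive, sums to 1.0

predicted_word_id = int(np.argmax(probs))     # index of the highest-probability word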

This image starts with the output vector produced by the decoder component and shows how it is transformed into an output word.

Summary of the Training Phase

Now that we have gone through the entire forward-propagation process of the Transformer, we can get an intuition for its training process.

During training, an untrained model will go through the same forward propagation. But since we train it with a labeled training set, we can compare its output with the true output.

To visualize this process, let’s assume our output vocabulary only contains six words: “a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for “end of sentence”).

The output vocabulary of our model is predefined in our preprocessing process before training.

Once we define our output vocabulary, we can use a vector of the same width to represent each word in our vocabulary. This is also known as one-hot encoding. So, we can represent the word “am” using the following vector:

Example: One-hot encoding of our output vocabulary
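A tiny sketch of this one-hot encoding with the six-word toy vocabulary (the exact index assignment is arbitrary):

import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # the six-word toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("am"))   # [0. 1. 0. 0. 0. 0.]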

Next, we discuss the model’s loss function—this is the standard we use to optimize during training. It allows us to train a model that produces results as accurately as possible.

Loss Function

For example, suppose we are training the model and it is the first step of training, with a simple example: translating “merci” into “thanks”.

This means we want an output that represents the probability distribution of the word “thanks”. However, since this model is not well trained yet, it is unlikely to yield this result right now.

Because the model’s parameters (weights) are randomly generated, the probability distribution produced by the (untrained) model assigns random values to each cell/word. We can compare it with the true output and slightly adjust all model weights using the backpropagation algorithm to generate outputs closer to the result.

How would you compare two probability distributions? We can simply subtract one from the other. More details can be found in cross-entropy and KL divergence.

Cross-Entropy:
https://colah.github.io/posts/2015-09-Visual-Information/
KL Divergence:
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
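As a minimal illustration of the comparison (the model’s distribution here is made up to stand in for an untrained model’s output):

import numpy as np

target = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])          # one-hot for "am"
model_output = np.array([0.1, 0.4, 0.1, 0.2, 0.1, 0.1])    # untrained model: nearly random

def cross_entropy(p_target, p_model, eps=1e-12):
    """Smaller when the model puts more probability on the correct word."""
    return -np.sum(p_target * np.log(p_model + eps))

print(cross_entropy(target, model_output))   # ~0.92; training pushes this toward 0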

But note that this is an overly simplified example. A more realistic scenario is processing a sentence. For example, inputting “je suis étudiant” and expecting the output to be “i am a student”. We would want our model to successfully output probability distributions in these cases:

Each probability distribution is represented by a vector with a width equal to the size of the vocabulary (in our example, it is 6, but in reality, it is usually 3000 or 10000).

The first probability distribution has the highest probability associated with the cell for “i”

The second probability distribution has the highest probability associated with the cell for “am”

And so on, until the fifth output distribution assigns the highest probability to the cell for “<end of sentence>”.

Target probability distribution obtained by training the model based on the example

After training on a sufficiently large dataset, we hope the probability distributions output by the model look like this:

We expect that after training, the model will output the correct translation. Of course, if this segment comes entirely from the training set, it is not a very good evaluation metric. Note that each position (word) receives some probability, even if it is not very likely to be the output at that time step—this is a useful property of softmax that helps the model train.

Since this model produces one output at a time, we might assume it simply selects the word with the highest probability and discards the rest. That is one method (called greedy decoding). Another way is to keep, say, the two words with the highest probabilities (for example “I” and “a”), then run the model twice in the next step: once assuming the first output position was the word “I”, and once assuming it was the word “a”; whichever version produces less error considering both positions is kept. We then repeat this for the second and third positions, and so on. This method is called beam search (in our example, the beam size is 2, meaning that two partial hypotheses, i.e. unfinished translations, are kept in memory at each step), and it ultimately returns the top two translations (top_beams is also 2). Both are hyperparameters you can set in advance.
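A minimal sketch of greedy decoding, assuming a hypothetical decode_step function that returns the next-word probability distribution given the words produced so far:

import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

def greedy_decode(decode_step, max_len=10):
    """Pick the highest-probability word at each step until <eos>."""
    output = []
    for _ in range(max_len):
        probs = decode_step(output)          # distribution over vocab given words so far
        word = vocab[int(np.argmax(probs))]  # keep only the single best word
        if word == "<eos>":
            break
        output.append(word)
    return output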

Going Further

I hope the above has helped you understand the main concepts of the Transformer. If you want to delve deeper into this field, I recommend taking the following steps: read Attention Is All You Need, the Transformer blog, and the Tensor2Tensor announcement, and check out Łukasz Kaiser’s introduction to learn about the model and its details.

Attention Is All You Need:
https://arxiv.org/abs/1706.03762
Transformer Blog:
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Tensor2Tensor Announcement:
https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html
Łukasz Kaiser’s Introduction:
https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb


