In-Depth Understanding of Transformer


Author: Wang Bo Kings, Sophia

Overview: Wang Bo Kings' recent learning notes on the Transformer
Table of Contents:
  • Background Knowledge

  • High-Level Understanding

  • Understanding Tensor Through Examples

    • Encoding

  • High-Level Understanding of Self-Attention

    • Details of Self-Attention

    • Self-Attention Matrix Multiplication

  • The Multi-Head Mechanism

    • Overall Process

  • Using Positional Encoding

    • Encoding Rules

  • Residuals and Layer Normalization

  • Decoder

  • Linear and Softmax Layers

  • Review of the Training Process

    • Loss Function

    • Target Model Outputs

    • Trained Model Outputs

1. Background Knowledge

  • Transformer was proposed in Google’s paper “Attention is All You Need”

  • Google open-sourced a third-party library based on TensorFlow called Tensor2Tensor

  • Harvard's NLP group published a detailed PyTorch interpretation of the paper: http://nlp.seas.harvard.edu/2018/04/03/attention.html

2. High-Level Understanding

First, understand the Transformer as a black box whose function is translation: you feed in a sentence, and it outputs the translation.

[Figure: the Transformer as a black-box translator]

The black box can be expanded, consisting of two parts: Encoders and Decoders

[Figure: the black box expanded into Encoders and Decoders]

Further refinement of the black box reveals that it consists of 6 Encoders and 6 Decoders

[Figure: a stack of 6 Encoders and 6 Decoders]

Each Encoder has the same structure, but the weights are not shared. Each layer consists of two sublayers: self-attention + a fully connected feed-forward network.

[Figure: the two sublayers of an Encoder: self-attention and feed-forward]

The output of the self-attention layer is fed into a fully connected feed-forward neural network. The feed-forward networks of the encoders all have the same structure, but their parameters are independent.

The Decoder has a similar hierarchical structure but adds an Encoder-Decoder Attention layer in the middle to help the decoder focus on the relevant parts of the input sentence (similar to the attention in Seq2Seq models).

[Figure: Decoder sublayers, including the Encoder-Decoder Attention layer]
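
To make this structure concrete, here is a minimal numpy sketch of the encoder stack's data flow. It is an illustration only: self_attention is an identity stand-in for the computation detailed in Section 4, the 2048 inner feed-forward dimension is the value used in the paper, and the residual connections and layer normalization of Section 6 are omitted.

import numpy as np

d_model = 512

def self_attention(x):
    # Stand-in for the self-attention computation detailed in Section 4;
    # the identity function is used here so that the sketch runs.
    return x

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward network: linear -> ReLU -> linear.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, ffn_params):
    # Sublayer 1: self-attention; sublayer 2: feed-forward network.
    return feed_forward(self_attention(x), *ffn_params)

# Two words, each a 512-dimensional vector, passed through 6 stacked encoders;
# the output of each encoder is the input of the next.
x = np.random.randn(2, d_model)
layers = [(0.01 * np.random.randn(d_model, 2048), np.zeros(2048),
           0.01 * np.random.randn(2048, d_model), np.zeros(d_model))
          for _ in range(6)]
for ffn_params in layers:
    x = encoder_layer(x, ffn_params)
print(x.shape)  # (2, 512)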

3. Understanding Tensor Through Examples

First, perform an embedding to transform the input words into vectors

For specifics, see this blog: https://blog.csdn.net/qq_41664845/article/details/84313419

[Figure: input words embedded as vectors]

The length of the input list and the dimension of the word vectors are both hyperparameters; the list length is generally set to the length of the longest sentence in the training set. In this example, each word is encoded as a 512-dimensional vector.

[Figure: each word embedded as a 512-dimensional vector]
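
A minimal sketch of the embedding step, assuming a toy three-word vocabulary and a randomly initialized embedding table (in a real model the table is learned):

import numpy as np

vocab = {"je": 0, "suis": 1, "étudiant": 2}  # toy vocabulary, illustration only
d_model = 512
embedding_table = np.random.randn(len(vocab), d_model)  # learned in practice

sentence = ["je", "suis", "étudiant"]
x = embedding_table[[vocab[w] for w in sentence]]  # one 512-d vector per word
print(x.shape)  # (3, 512)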

It can be observed that after x1, x2, and x3 pass through self-attention, z1, z2, and z3 are obtained. Importantly, each of z1, z2, and z3 is produced collaboratively by all of x1, x2, and x3.

Encoding

Each Encoder receives a list of 512-dimensional vectors x as input, passes it through Self-Attention to produce equally sized 512-dimensional vectors z, then through the fully connected feed-forward network, whose 512-dimensional output r is passed on to the next encoder.

[Figure: data flow through one Encoder]

Note that the same feed-forward network structure is applied independently at each position.

4. High-Level Understanding of Self-Attention

Suppose the input is:

The animal didn't cross the street because it was too tired

How does the word “it” relate to “animal” in this sentence?

[Figure: attention weights linking "it" to "the animal"]

Details of Self-Attention

Step 1: Q, K, V Calculation

For each word, we create a Query vector, a Key vector, and a Value vector. These vectors are produced by multiplying the word embedding by three trainable weight matrices (WQ, WK, WV) that are learned during training.

The input vectors are 512-dimensional and the new vectors are 64-dimensional; 64 is an architectural choice rather than a strict requirement.

[Figure: Query, Key, and Value vectors for each word]

Multiplying x1 by the WQ weight matrix produces q1.

[Figure: computing q1 by multiplying x1 with WQ]
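
A numpy sketch of Step 1, with randomly initialized projection matrices standing in for the learned WQ, WK, and WV:

import numpy as np

d_model, d_k = 512, 64
x1 = np.random.randn(d_model)           # embedding of the first word

WQ = np.random.randn(d_model, d_k)      # learned query projection (random here)
WK = np.random.randn(d_model, d_k)      # learned key projection (random here)
WV = np.random.randn(d_model, d_k)      # learned value projection (random here)

q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV  # each is 64-dimensional
print(q1.shape, k1.shape, v1.shape)     # (64,) (64,) (64,)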

Step 2: Dot Product

[Figure: scoring q1 against k1 and k2]

The score is the dot product of q1 with k1, then of q1 with k2. Note!!! The second score uses q1 and k2! Look carefully: it is not q2.

Steps 3 and 4:

The result of the dot product is divided by sqrt(dk). Here the key vectors are 64-dimensional, and the square root of 64 is 8, so the scores are divided by 8.

Then, a Softmax operation is performed.

[Figure: dividing the scores by 8 and applying Softmax]

The scores obtained from the Softmax operation represent how much each word in the sentence contributes to the encoding of the current position.

Steps 5 and 6:

Each Value vector is multiplied by its Softmax score (yielding the weighted v1, v2), keeping attention on the relevant words while drowning out unrelated ones.

The weighted sum of the vectors yields z1.

[Figure: weighting the Value vectors and summing to get z1]

Self-Attention Matrix Multiplication

Step one is to calculate the Query, Key, and Value matrices. X is the matrix whose rows are the embeddings x1 and x2.

[Figure: computing the Q, K, and V matrices from X]

Then Z is obtained in a single matrix operation.

[Figure: computing Z in matrix form]
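
Written out, this operation is the scaled dot-product attention formula from the paper, Z = softmax(Q K^T / sqrt(dk)) V. A numpy sketch with random inputs (the softmax helper is defined here for self-containment):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # steps 2-3: dot products, scaled by sqrt(dk)
    weights = softmax(scores)        # step 4: softmax over each row
    return weights @ V               # steps 5-6: weighted sum of the values

Q, K, V = (np.random.randn(2, 64) for _ in range(3))  # two words, 64-d vectors
Z = attention(Q, K, V)
print(Z.shape)  # (2, 64)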

The Multi-Head Mechanism

  • Expands the model’s ability to focus on different positions

  • Provides projections into different subspaces

[Figure: separate Q/K/V weight matrices per attention head]

With multi-head attention, we maintain an independent set of Q/K/V weight matrices for each "head".

Running the attention computation 8 times with 8 different sets of weight matrices yields 8 different Z matrices.

[Figure: 8 attention heads producing 8 Z matrices]

The 8 matrices are concatenated together and then multiplied by an additional weight matrix WO.

[Figure: concatenating the 8 heads and multiplying by WO]
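
A numpy sketch of the whole multi-head computation under the same toy setup: 8 heads, each with its own random stand-in Q/K/V projections; the 8 outputs are concatenated and projected back to 512 dimensions with WO.

import numpy as np

d_model, d_k, n_heads = 512, 64, 8
X = np.random.randn(2, d_model)  # two words

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(n_heads):
    WQ, WK, WV = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # one Z per head

WO = np.random.randn(n_heads * d_k, d_model)  # learned output projection (random here)
Z = np.concatenate(heads, axis=-1) @ WO       # concatenate, then project
print(Z.shape)  # (2, 512)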

Overall Process

[Figure: the complete multi-head self-attention process]

We examine two different attention heads (out of 8; columns 2 and 3 in the image below) to see how their focus differs.

[Figure: attention patterns of two different heads]

In these two heads, when encoding "it", attention is focused most on two words: "animal" and "tired".

Adding all eight heads to the picture, however, makes it harder to interpret.

[Figure: attention patterns of all eight heads]

5. Using Positional Encoding

The model must also take the order of the words in the input sequence into account.

The Transformer adds a positional encoding vector to each input word embedding.

[Figure: positional encoding vectors added to the embeddings]

To let the model know the order of the words, the positional encoding vectors are generated directly by a fixed rule rather than learned.

For example, if the embedding dimension is 4, the actual encoding effect is as follows:

[Figure: positional encodings with embedding dimension 4]

Encoding Rules

The even dimensions of each positional vector are generated with sine and the odd dimensions with cosine:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

For example, suppose there are 20 words, each position encoded as a 512-dimensional vector. The visualization then has 20 rows, one per position; each row contains 512 values, each between -1 and 1:

[Figure: heatmap of positional encodings, 20 positions by 512 dimensions]

The figure splits down the middle: the left half of each row is generated by the sine function and the right half by the cosine function, and the two halves are concatenated.

The above comes from the Tensor2Tensor implementation; the method in the Transformer paper differs slightly, interleaving the two signals rather than concatenating them, as shown below:

[Figure: positional encoding heatmap with interleaved sine and cosine]

import numpy as np
import matplotlib.pyplot as plt

# https://github.com/jalammar/jalammar.github.io/blob/master/notebookes/transformer/transformer_positional_encoding_graph.ipynb
# Code from https://www.tensorflow.org/tutorials/text/transformer

def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # Apply sin to even indices (2i) and cos to odd indices (2i+1).
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[np.newaxis, ...]
    return pos_encoding

tokens = 10
dimensions = 64

pos_encoding = positional_encoding(tokens, dimensions)
print(pos_encoding.shape)  # (1, 10, 64)

plt.figure(figsize=(12, 8))
plt.pcolormesh(pos_encoding[0], cmap='viridis')
plt.xlabel('Embedding Dimensions')
plt.xlim((0, dimensions))
plt.ylim((tokens, 0))
plt.ylabel('Token Position')
plt.colorbar()
plt.show()
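
In use, the positional encodings are simply added to the word embeddings before the first encoder. A minimal continuation of the code above, with random stand-ins for the embeddings:

embeddings = np.random.randn(1, tokens, dimensions)       # stand-in word embeddings
encoder_input = embeddings + pos_encoding[:, :tokens, :]  # element-wise addition
print(encoder_input.shape)  # (1, 10, 64)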

6. Residuals and Layer Normalization

Each sublayer in the encoder has a residual connection around it, followed by a layer-normalization step.

[Figures: residual connections and layer normalization around each sublayer]
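
A numpy sketch of this Add & Norm step, LayerNorm(x + Sublayer(x)); the learnable gain and bias of layer normalization are fixed to 1 and 0 here for simplicity:

import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sublayer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

x = np.random.randn(2, 512)
out = add_and_norm(x, lambda v: 0.5 * v)  # stand-in sublayer
print(out.shape)  # (2, 512)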

Further visualization

[Figure: Add & Norm steps visualized inside an Encoder]

The sublayers of the Decoder work the same way. A Transformer stacking 2 Encoders and 2 Decoders looks like this:

[Figure: a Transformer with 2 Encoders and 2 Decoders]

7. Decoder

The top Encoder's output is transformed into a set of attention vectors K and V, which each Decoder uses in its Encoder-Decoder Attention layer. Decoding proceeds step by step: each step outputs one word, which is fed back in at the next step, until the end-of-sequence symbol appears.

[Figure: the Decoder consuming the Encoder's K and V]
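
A schematic of this step-by-step (greedy) decoding loop; the decoder argument is a stand-in for the full decoder stack plus the Linear and Softmax layers of the next section:

import numpy as np

def greedy_decode(decoder, encoder_K, encoder_V, bos_id, eos_id, max_len=50):
    # Each step feeds the words produced so far back into the decoder
    # and appends the most probable next word, until <eos> appears.
    output = [bos_id]
    for _ in range(max_len):
        probs = decoder(output, encoder_K, encoder_V)  # distribution over the vocabulary
        next_id = int(probs.argmax())
        output.append(next_id)
        if next_id == eos_id:
            break
    return output

def dummy_decoder(output_so_far, K, V):
    # Stand-in for the real decoder: returns a random distribution over 6 words.
    p = np.random.rand(6)
    return p / p.sum()

print(greedy_decode(dummy_decoder, None, None, bos_id=0, eos_id=5))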

8. Linear and Softmax Layers

The Linear layer is a simple fully connected neural network that projects the vector produced by the Decoder stack into a much larger vector called the logits vector.

Assuming the model's vocabulary consists of 10,000 English words, the logits vector has 10,000 cells, each holding the score of one word.

After the Linear layer comes a Softmax layer, which converts the scores into probabilities; the index with the highest probability is selected, and the word at that index is the output.

[Figure: the Linear layer and Softmax producing the output word]
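
A numpy sketch of the Linear + Softmax step, with a randomly initialized projection standing in for the learned one:

import numpy as np

vocab_size, d_model = 10000, 512
z = np.random.randn(d_model)                     # vector from the top decoder
W = 0.01 * np.random.randn(d_model, vocab_size)  # learned projection (random here)
b = np.zeros(vocab_size)

logits = z @ W + b                   # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax: scores -> probabilities
index = int(probs.argmax())          # index of the most probable word
print(index, probs[index])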

9. Review of the Training Process

Assume the output vocabulary includes only "a", "am", "I", "thanks", "student", and "<eos>" (the end-of-sequence symbol).

[Figure: the output vocabulary]

Once the output vocabulary is determined, each word in it can be represented by a vector of the same width containing a single 1, known as one-hot encoding.

For example, in the sentence, the one-hot encoding for “am” is:

[Figure: one-hot encoding of "am"]
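
In code, with the six-word vocabulary above:

import numpy as np

vocab = ["a", "am", "I", "thanks", "student", "<eos>"]
one_hot_am = np.zeros(len(vocab))
one_hot_am[vocab.index("am")] = 1.0
print(one_hot_am)  # [0. 1. 0. 0. 0. 0.]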

Loss Function

The model's parameter weights are randomly initialized, so before training the output distribution for each word is arbitrary; we compare it with the desired distribution and adjust the weights accordingly.

[Figure: untrained model output versus the desired output]

How do you compare two probability distributions? In practice, simply subtracting one from the other is enough to get started; cross-entropy and KL divergence are the standard measures.
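
A minimal sketch comparing a made-up model output with the one-hot target for "am", showing both the raw difference and the cross-entropy loss:

import numpy as np

target = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])     # one-hot target for "am"
predicted = np.array([0.1, 0.5, 0.1, 0.1, 0.1, 0.1])  # made-up model output

difference = predicted - target                       # simple subtraction
cross_entropy = -np.sum(target * np.log(predicted))   # standard training loss
print(difference, cross_entropy)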

Target Model Outputs

[Figure: target probability distributions for each position]

Trained Model Outputs

[Figure: output distributions of the trained model]

Although the probabilities are not very sharp, the correct word can still be found at each position by taking the maximum.


