Understanding Transformer Models: A Comprehensive Guide


Author: Chen Zhi Yan





This article is approximately 3500 words long and is recommended for a 7-minute read.
The Transformer is the first model that completely relies on the self-attention mechanism to compute its input and output representations.


Mainstream sequence transduction models are based on encoder-decoder architectures built from recurrent or convolutional neural networks, and the best-performing models connect the encoder and decoder through an attention mechanism. The Transformer builds a new network architecture entirely on attention and outperforms recurrent and convolutional networks, while also allowing far more parallel training and therefore shorter training times.
1 Transformer Model Architecture
The sequence-to-sequence model employs an encoder-decoder structure, where the encoder maps the input sequence (x1, x2, …, xn) into a sequence of continuous representations z = (z1, z2, …, zn). Given z, the decoder generates the output sequence (y1, y2, …, ym) one symbol at a time. The model is auto-regressive: at each time step it consumes the previously generated symbols as additional input when producing the next output symbol.
The architecture of the Transformer model is shown in Figure 1-1, where the encoder-decoder structure employs stacked multi-head attention mechanisms and fully connected layers. The left side of the figure represents the encoder structure, while the right side represents the decoder structure:


Figure 1-1 Stacked Encoder-Decoder Structure (Source: Internet)

The encoder consists of 6 identical blocks stacked together (N = 6). Each block contains two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feedforward layer. Each sub-layer is followed by a residual connection and layer normalization (Add & Norm), so the output of each sub-layer is LayerNorm(x + Sublayer(x)). All sub-layers in the model produce outputs of dimension 512, i.e., d_model = 512.
The decoder is likewise composed of 6 identical blocks stacked together (N = 6). Each block adds a third sub-layer on top of the two sub-layers found in the encoder block: a multi-head attention sub-layer that attends over the output of the encoder stack. As in the encoder, a residual connection and layer normalization (Add & Norm) follow each sub-layer. The self-attention sub-layer in the decoder stack is additionally modified with a mask that prevents positions from attending to subsequent positions. Through this masking, the prediction for position i can depend only on the known outputs at positions less than i; a minimal sketch of the mask is shown below.
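The masking can be implemented as a lower-triangular boolean matrix applied to the attention scores before the Softmax. A minimal NumPy sketch (the function name and shapes are illustrative, not from the original paper):

```python
import numpy as np

def subsequent_position_mask(seq_len):
    """Boolean mask that is True where attention is allowed: position i
    may attend only to positions <= i (the decoder's self-attention mask)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Example for a length-4 target sequence:
print(subsequent_position_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```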
2 Self-Attention Mechanism
An attention function maps a query Q and a set of key-value pairs (K, V) to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, with the weight assigned to each value computed from the compatibility of the query with the corresponding key.
1) Scaled Dot-Product Attention
The formula for scaled dot-product attention is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

In the above formula, the query and key vectors have dimension d_k, while the value vectors have dimension d_v. The dot products of the query with all keys are computed, each is divided by √d_k, and a Softmax function is applied to obtain the weights on the values. Because this dot-product formulation can be expressed as matrix multiplication, attention over an entire sequence can be computed with highly optimized matrix operations, which is faster than alternatives such as MLP-based (additive) scoring.
The two most commonly used attention functions are additive attention and dot-product attention. Dot-product attention is identical to the mechanism above except for the scaling factor of 1/√d_k, while additive attention computes the compatibility function using a feedforward network with a single hidden layer. Although the two are similar in theoretical complexity, dot-product attention is faster and more space-efficient in practice because it can use highly optimized matrix multiplication code. For small values of d_k the two mechanisms perform similarly, but for larger d_k the dot products grow large in magnitude, pushing the Softmax function into regions with extremely small gradients. To counteract this effect, the dot products are scaled by 1/√d_k.
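For concreteness, here is a minimal NumPy sketch of scaled dot-product attention following the formula above (function and argument names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    `mask` is an optional boolean array that is True where attention is allowed
    (for example, the decoder's subsequent-position mask).
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)            # block masked positions
    # numerically stable softmax over the key dimension
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```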
The Transformer model employs multi-head attention mechanisms in three places:
  • In the encoder-decoder attention layers, the queries come from the previous decoder layer, while the keys and values come from the encoder’s output, so every position in the decoder can attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanism of sequence-to-sequence models.
  • The encoder contains self-attention layers, in which the queries, keys, and values all come from the output of the previous encoder layer, so each position in the encoder can attend to all positions in the previous layer of the encoder.
  • Similarly, the self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position.
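All three uses rely on the same multi-head attention module, which projects the queries, keys, and values h times with different learned linear projections, applies scaled dot-product attention to each projection in parallel, concatenates the results, and projects them once more. A minimal PyTorch sketch, assuming the base configuration d_model = 512 and h = 8 (class and variable names are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention sketch (d_model = 512, h = 8 in the base model)."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        # learned projections for queries, keys, values, and the final output
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch = query.size(0)

        def split_heads(x, proj):
            # (batch, seq, d_model) -> (batch, heads, seq, d_k)
            return proj(x).view(batch, -1, self.num_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.w_q)
        k = split_heads(key, self.w_k)
        v = split_heads(value, self.w_v)

        # scaled dot-product attention for every head in parallel
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)

        # concatenate heads and apply the final linear projection
        out = (attn @ v).transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_k)
        return self.w_o(out)
```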
2) Feedforward Network
Each block in the Transformer encoder-decoder architecture contains, in addition to the multi-head attention mechanism, a fully connected feedforward network applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The parameters of the fully connected feedforward network differ from layer to layer. Its input and output dimensionality is d_model = 512, while the inner layer has dimensionality d_ff = 2048.
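A minimal PyTorch sketch of this position-wise feedforward network with the base-model dimensions (class name is illustrative):

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # d_ff -> d_model

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return self.linear2(torch.relu(self.linear1(x)))
```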
3) Embedding and Softmax
Similar to other sequence-to-sequence models, the Transformer uses learned embeddings to convert the input and output tokens into vectors of dimension d_model, and a learned linear transformation followed by a Softmax function to convert the decoder’s output into predicted next-token probabilities. In the Transformer, the two embedding layers and the pre-Softmax linear transformation share the same weight matrix.
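This weight sharing can be expressed as a simple parameter tie; below is a minimal PyTorch sketch assuming a shared source-target vocabulary (the vocabulary size is illustrative):

```python
import torch.nn as nn

d_model, vocab_size = 512, 37000

embedding = nn.Embedding(vocab_size, d_model)            # used for source and target tokens
generator = nn.Linear(d_model, vocab_size, bias=False)   # pre-Softmax linear projection
generator.weight = embedding.weight                      # all three share one weight matrix
```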
3 Positional Encoding
Since the Transformer uses neither recurrence nor convolution, positional encodings must be injected so that the model can exploit the order of the input sequence. The positional encoding describes the absolute and relative position of each token and is added to the input embeddings at the bottom of the encoder and decoder stacks. It has the same dimension d_model as the input embedding, so the two can be summed. There are many possible positional encodings; the Transformer adopts sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Here pos is the position and i is the dimension, so each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000·2π. This function was chosen because it allows the model to easily learn to attend by relative position: for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos).
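A minimal NumPy sketch of this sinusoidal positional encoding (function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                            # (max_len, 1)
    div_term = np.power(10000.0, np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_term)   # even dimensions
    pe[:, 1::2] = np.cos(positions / div_term)   # odd dimensions
    return pe

# The encoding is added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```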

First, we compare the self-attention mechanism with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for mapping one variable-length sequence of representations to another. Three factors are considered: the computational complexity per layer, the amount of computation that can be parallelized (measured by the minimum number of sequential operations required), and the maximum path length between any two positions in the network. Learning long-range dependencies is a key challenge in sequence learning tasks, and the length of the paths that forward and backward signals must traverse strongly affects how easily such dependencies are learned: the shorter the path between any pair of input and output positions, the easier it is to learn long-range dependencies. Comparing the maximum path lengths of the different layer types explains why self-attention was chosen to build the Transformer.
Table 3-1 Maximum Path Length, Complexity per Layer, and Minimum Sequential Operations for Different Layer Types

Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length
Self-Attention | O(n²·d) | O(1) | O(1)
Recurrent | O(n·d²) | O(n) | O(n)
Convolutional | O(k·n·d²) | O(1) | O(log_k(n))
Self-Attention (restricted) | O(r·n·d) | O(1) | O(n/r)
In Table 3-1, n is the sequence length, d is the representation dimension, k is the convolution kernel size, and r is the neighborhood size in restricted self-attention. A self-attention layer connects all positions with a constant number of sequential operations, whereas a recurrent layer requires O(n) sequential operations. In terms of per-layer complexity, self-attention is faster than recurrence when the sequence length n is smaller than the representation dimension d, which is usually the case for the sentence representations used in machine translation. To improve computational performance on very long input sequences, self-attention can be restricted to a neighborhood of size r around each position, which increases the maximum path length to O(n/r).
A single convolutional layer with kernel width k < n does not connect all pairs of input and output positions, so a stack of convolutional layers is required to lengthen the longest paths between any two positions in the network. Convolutional layers are also generally more expensive than recurrent layers, by a factor of k.
From the comparisons in Table 3-1, it can be seen that the self-attention mechanism has advantages in complexity, parallel computation, and the length of the longest relevant path in the network.
4 Training the Transformer Model

4.1 Training Data and Batch Size

Training was conducted on the standard WMT 2014 English-German dataset, which contains approximately 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of about 37,000 tokens. For English-French, the much larger WMT 2014 English-French dataset of 36 million sentence pairs was used, with tokens split into a 32,000 word-piece vocabulary. Sentence pairs were batched together by approximate sequence length, with each training batch containing a set of sentence pairs of approximately 25,000 source tokens and 25,000 target tokens.

4.2 Hardware Configuration

The Transformer was trained on a single machine with 8 NVIDIA P100 GPUs. With the base-model hyperparameters, each training step took about 0.4 seconds, and the base model was trained for a total of 100,000 steps, or 12 hours. For the big model, each step took about 1.0 seconds, and training ran for 300,000 steps (3.5 days).

4.3 Optimizer

The Adam optimizer was used, with parameters β1 = 0.9, β2 = 0.98, and ε = 10^-9, and the learning rate was varied over the course of training according to the following formula:

lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))

The learning rate increases linearly during the first warmup_steps training steps and thereafter decreases proportionally to the inverse square root of the step number, with warmup_steps = 4000.
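A minimal Python sketch of this schedule (the function name is illustrative):

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))."""
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Linear warmup up to step 4000, then inverse-square-root decay:
for step in (1, 1000, 4000, 40000, 100000):
    print(step, transformer_lrate(step))
```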

4.4 Regularization

Three regularization techniques were used during training:
Residual Dropout: Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized. Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks, with a rate of 0.1.
Label Smoothing: Label smoothing with a value of 0.1 is employed during training; it hurts perplexity but improves accuracy and BLEU score.
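The residual dropout can be wrapped in a small sub-layer connection module; a minimal PyTorch sketch, assuming post-layer normalization as in the architecture described above (class name is illustrative):

```python
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Applies LayerNorm(x + Dropout(Sublayer(x))): dropout on the sub-layer
    output, a residual connection, then layer normalization."""

    def __init__(self, d_model=512, p_drop=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, sublayer):
        # `sublayer` is a callable, e.g. a multi-head attention or feedforward block
        return self.norm(x + self.dropout(sublayer(x)))
```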

4.5 Training Results

Machine Translation
In the WMT 2014 English-German translation task, the Transformer (big) outperformed the best previously reported models (including ensembles) by more than 2.0 BLEU, achieving a BLEU score of 28.4. Training on 8 P100 GPUs took 3.5 days. Even the base model surpassed all previously published models and ensembles, at a fraction of their training cost.
In the WMT 2014 English-French translation task, the Transformer (big) achieved a BLEU score of 41.0, outperforming all previously published single models at less than 1/4 of their training cost.

Table 4.5-1 Comparison of BLEU Scores for the Transformer Model in English-German and English-French Translation Tasks with Other Models (Source: Internet)
Table 4.5-1 also compares translation quality and training cost against other model architectures. The number of floating-point operations used to train each model is estimated from the training time, the number of GPUs used, and an estimate of each GPU’s sustained single-precision floating-point capacity.
To evaluate whether the Transformer generalizes to other tasks, experiments were conducted on English constituency parsing. This task poses specific challenges: the output is subject to strong structural constraints and is significantly longer than the input. Moreover, RNN sequence-to-sequence models have not been able to reach state-of-the-art results in small-data settings.
A 4-layer Transformer was trained on the Wall Street Journal (WSJ) portion of the Penn Treebank, about 40K training sentences. It was also trained in a semi-supervised setting using a larger high-confidence corpus produced with the Berkeley parser, comprising approximately 17 million sentences. A vocabulary of 16K tokens was used for the WSJ-only setting and a vocabulary of 32K tokens for the semi-supervised setting.
Conclusion: The Transformer is a sequence-to-sequence model built entirely on self-attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-head attention. For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers, and it achieved notable results on the WMT 2014 English-German and English-French translation tasks. On the former, it surpassed all previously reported models.

Author Biography

Chen Zhi Yan, graduated from Beijing Jiaotong University with a master’s degree in Communication and Control Engineering. He has served as an engineer at Great Wall Computer Software and Systems Company and Datang Microelectronics Company. Currently, he is engaged in the operation and maintenance of intelligent translation teaching systems and has accumulated experience in artificial intelligence deep learning and natural language processing (NLP).

Editor: Yu Teng Kai

Proofreader: Lin Yi Lin

Data Research Department Introduction

The Data Research Department was established in early 2017, dividing into several groups based on interest. Each group adheres to the overall knowledge sharing and practical project planning of the research department while having its own characteristics:

Algorithm Model Group: Actively participates in competitions like Kaggle, original hands-on teaching series articles;

Research Analysis Group: Explores the beauty of data products through interviews and other methods;

System Platform Group: Tracks cutting-edge technologies in big data & artificial intelligence systems and dialogues with experts;

Natural Language Processing Group: Focuses on practice, actively participates in competitions and plans various text analysis projects;

Manufacturing Big Data Group: Upholds the dream of an industrial power, integrating industry, academia, research, and government to explore data value;

Data Visualization Group: Merges information with art, explores the beauty of data, and learns to tell stories with visualization;

Web Crawling Group: Crawls web information and collaborates with other groups to develop creative projects.


Reprint Notice

If you need to reprint, please indicate the author and source (Reprinted from: Data Research THUID: DatapiTHU) prominently at the beginning of the article, and place a prominent QR code for Data Research at the end of the article. For articles with original identification, please send the 【Article Name – Pending Authorized Public Account Name and ID】 to the contact email to apply for whitelist authorization and edit as required.

Unauthorized reprints and adaptations will be legally pursued.
