Ten Questions and Answers About Transformers



Author | Hong Dou Jun

Source | Zhihu

Address | https://zhuanlan.zhihu.com/p/429061708


01
How Does Transformer Solve the Gradient Vanishing Problem?

Residual connections. The skip connection around each sub-layer gives gradients a direct identity path back through the network, so stacking many layers does not make the gradients vanish.
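As a minimal PyTorch-style sketch (an illustration, not the paper's exact code), the residual path is simply the "+ x" term that lets gradients bypass the sub-layer:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sub-layer (attention or feed-forward) with a residual connection
    and layer normalization; the identity path carries gradients straight back."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: [batch, seq_len, d_model]; "+ x" is the residual (skip) path
        return self.norm(x + self.sublayer(x))
```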

02
Why Use LN Instead of BN in Transformers?

BatchNorm normalizes each feature across the samples in a batch, while LayerNorm normalizes all the features within each individual sample.

To illustrate, imagine a two-dimensional matrix whose rows are the samples in a batch and whose columns are the features of each sample. BN normalizes vertically (down each column), while LN normalizes horizontally (along each row).

Both have the same starting point: stabilizing the inputs to each layer so that gradients neither vanish nor explode, which makes subsequent learning easier. Their emphases differ, however.

Generally speaking, if your features rely on statistics across different samples, BN works better: it removes the relative magnitudes between different features while preserving the relative magnitudes between different samples (as in the CV field).

In the NLP field, LN is more suitable: it removes the relative magnitudes between different samples while preserving the relative magnitudes between the features within each sample. For NLP and other sequence tasks, the features of a sample are the values of the sequence unfolding over time, so the relationships among the features within a sample are very tight.
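A minimal numeric sketch of the two normalization directions, assuming the [batch, features] matrix from the illustration above (hand-rolled statistics rather than the nn.BatchNorm1d / nn.LayerNorm modules):

```python
import torch

x = torch.randn(4, 8)  # 4 samples (rows) x 8 features (columns)

# BN-style: normalize each feature (column) using statistics across the batch
bn_out = (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-5)

# LN-style: normalize all features (row) within each individual sample
ln_out = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-5)

print(bn_out.mean(dim=0))  # ~0 per feature: statistics shared across samples
print(ln_out.mean(dim=1))  # ~0 per sample: statistics computed within each sample
```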

03
What Is the Role of LN?

It allows larger learning rates and therefore faster training, has a certain regularizing (anti-overfitting) effect, and makes the training process smoother.

04
How to Understand the “Multiple Heads” in Multi-Head Self-Attention Layers, and What Is Its Role?

It is somewhat similar to having multiple convolution kernels in a CNN. Through the three linear projections, the Q, K, and V matrices in different heads are different; the weights of these three linear layers are randomly initialized and then learned during training, and the different weights allow different heads to capture different correlations in the sequence.
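A minimal sketch of the "multiple heads" idea (class and argument names here are illustrative, not from the original code): one set of learned Q/K/V projections whose output is split into per-head subspaces, each attending independently:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: three learned linear maps produce Q, K, V,
    which are split into num_heads subspaces so each head can learn a different
    pattern of correlations, much like different convolution kernels in a CNN."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):                        # x: [batch, seq_len, d_model]
        b, n, _ = x.shape
        def split(t):                            # -> [batch, heads, seq_len, d_head]
            return t.view(b, n, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v  # each head attends independently
        out = out.transpose(1, 2).reshape(b, n, self.num_heads * self.d_head)
        return self.w_o(out)                     # final projection merges the heads
```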

05
Is the Transformer an Autoregressive Model or an Autoencoding Model?

It is an autoregressive model.

Autoregressive means using the tokens the model itself has already predicted to predict the following ones. During inference (for example, in machine translation), the Transformer first predicts the first character and then predicts each subsequent character conditioned on what it has already generated, which makes it a typical autoregressive model. BERT's masked-language-modeling task, by contrast, is a typical autoencoding setup: it predicts the masked token from the surrounding context.

06
Why Is the Q, K Matrix Multiplication Divided by √d_k in the Original Paper?

When the dot products are small, dividing or not makes little difference. In the self-attention layers, Q and K are linear projections of the same input matrix, so multiplying Q by Kᵀ behaves much like multiplying a matrix by its own transpose, and the resulting values can become very large or very small. Small values are manageable, but large values push the subsequent softmax into its saturated region, where the gradients become vanishingly small, which is bad for backpropagation. For roughly unit-variance inputs, each dot product has variance on the order of d_k, so dividing by √d_k rescales the scores back to a reasonable range.
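A small sketch of the effect of the √d_k scaling (the tensor shapes here are arbitrary and purely illustrative): without it, the softmax tends to collapse onto a single position, which is exactly the saturated regime where gradients die:

```python
import torch

d_k = 64
q = torch.randn(1, 10, d_k)              # [batch, seq_len, d_k]
k = torch.randn(1, 10, d_k)

raw = q @ k.transpose(-2, -1)            # entries have variance on the order of d_k
scaled = raw / d_k ** 0.5                # variance brought back to roughly 1

# With large unscaled scores, softmax tends to put nearly all of its mass on one
# position, so its gradient with respect to the other positions is close to zero.
print(torch.softmax(raw, dim=-1).max(dim=-1).values)
print(torch.softmax(scaled, dim=-1).max(dim=-1).values)
```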

07
Why Are the Weights of the Embedding Layer in the Encoder and Decoder Multiplied by √d_model in the Original Paper?

To keep the embedding weights from being too small: multiplying by √d_model brings them up to a magnitude comparable to the positional encoding values that are added next, so the positional signal does not drown out the token embeddings and the original vector space is preserved.
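A rough sketch of the magnitudes involved, assuming the common initialization of the embedding weights with standard deviation 1/√d_model (the concrete sizes below are just illustrative):

```python
import torch
import torch.nn as nn

d_model, vocab = 512, 10000
emb = nn.Embedding(vocab, d_model)
nn.init.normal_(emb.weight, mean=0.0, std=d_model ** -0.5)  # small initial weights

tokens = torch.randint(0, vocab, (1, 6))
raw = emb(tokens)                  # typical magnitude ~ 1/sqrt(d_model) ~ 0.04
scaled = raw * d_model ** 0.5      # typical magnitude ~ 1, comparable to the
                                   # sinusoidal positional encodings in [-1, 1]
print(raw.std().item(), scaled.std().item())
```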

08
What Are the Differences Between Training and Validation in Transformers?

Transformers run in parallel during training (the entire target sequence is fed in at once, with teacher forcing) and sequentially during evaluation/inference (tokens are generated one at a time). This question tests the same knowledge point as whether the Transformer is an autoregressive model.

For details, see the author's companion article: "How Do the Encoder and Decoder Work During Training and Evaluation in Transformers?"
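A rough sketch of the contrast, assuming a hypothetical seq2seq interface model(src, tgt) that returns per-position logits (this is not the code from the referenced article): training does one parallel pass with teacher forcing, while evaluation decodes token by token:

```python
import torch
import torch.nn.functional as F

def train_step(model, src, tgt):
    """Training: feed the whole (shifted) target at once; every position is
    predicted in a single parallel forward pass under a causal mask."""
    logits = model(src, tgt[:, :-1])               # [batch, tgt_len-1, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt[:, 1:].reshape(-1))

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Evaluation: generate one token at a time, feeding each prediction back in."""
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(src, ys)                    # re-run with the sequence so far
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return ys
```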

09
What Is the Computational Complexity of the Transformer Model?

Let n be the sequence length and d the embedding dimension. The dominant cost in the Transformer is the multi-head self-attention layer, and its main computation is two matrix multiplications: Q times Kᵀ, and then the result times V. Multiplying Q ([n × d]) by Kᵀ ([d × n]) costs O(n²·d), and multiplying the resulting [n × n] attention matrix by V ([n × d]) costs another O(n²·d), so the overall complexity is O(n²·d), i.e., quadratic in the sequence length.
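A quick back-of-the-envelope check of that count (n = 1024 and d = 512 below are just illustrative values):

```python
n, d = 1024, 512             # sequence length and embedding dimension

flops_qk = n * n * d         # Q [n, d] @ K^T [d, n] -> scores [n, n]
flops_sv = n * n * d         # scores [n, n] @ V [n, d] -> output [n, d]

print(flops_qk + flops_sv)   # ~1.07e9 multiply-adds: O(n^2 * d) overall
```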

10
What Is the Significance and Role of the Three Multi-Head Self-Attention Layers in Transformers?

The Transformer has three kinds of multi-head self-attention layers: one in the encoder and two in the decoder.

The role of the multi-head self-attention layer in the encoder is to integrate the information of the source text sequence: after this layer, every position in the transformed sequence carries information about the entire sequence (this is also the most innovative idea in the Transformer, although according to recent survey studies, while Transformers perform very well, the multi-head self-attention layer's contribution is not overwhelming). The schematic diagram is as follows:

[Figure: schematic of the multi-head self-attention layer in the encoder]

The first multi-head self-attention layer in the decoder is special; the original paper calls it Masked Multi-Head Attention. Its first job is to integrate the decoder's input text (for a translation task, the encoder's input is the source-language text and the decoder's input is the target-language text). Its second job is masking to prevent information leakage: during this integration, the first position must not see any of the later positions, the second position may only see the first two positions, and so on.

The second multi-head self-attention layer in the decoder works the same way as the one in the encoder, but it is worth emphasizing where its inputs come from. The layer integrates information by computing with the Q, K, and V matrices: here Q is the information integrated by the decoder, while K and V are the information integrated by the encoder (both are derived from the same encoder output). Multiplying Q against K and V lets the text before and after translation interact and fuse fully; the resulting vector matrix is then used for the subsequent downstream prediction.
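A minimal sketch of this cross-attention setup using PyTorch's nn.MultiheadAttention (shapes and sizes are illustrative): the only thing distinguishing it from the encoder's layer is that Q comes from the decoder while K and V come from the encoder output:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

enc_out = torch.randn(2, 20, d_model)     # encoder output: source-side information
dec_hidden = torch.randn(2, 15, d_model)  # output of the decoder's masked self-attention

# Q comes from the decoder; K and V both come from the same encoder output,
# so every target position can attend over the full source sequence.
out, attn_weights = cross_attn(query=dec_hidden, key=enc_out, value=enc_out)
print(out.shape)           # [2, 15, 512]
print(attn_weights.shape)  # [2, 15, 20]
```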

11
What Is the Role of the Mask Mechanism in Transformers?

It has two functions.

  1. A padding mask, so that sequences of unequal length can be padded and batched together without the padded positions taking part in attention.

  2. A look-ahead (causal) mask, to prevent information leakage from future positions.

12
Where Is the Mask Mechanism Used?

Refer to question eleven.

The first use (the padding mask) appears in all three multi-head self-attention layers, while the second (the look-ahead mask) is used only in the first multi-head self-attention layer of the decoder.
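A minimal sketch of the two masks (the pad_id value and shapes are illustrative; real implementations differ in how they broadcast the masks into the attention scores):

```python
import torch

def padding_mask(token_ids, pad_id=0):
    """Marks padded positions so attention ignores them; used in all three
    multi-head attention layers when batching sequences of unequal length."""
    return token_ids == pad_id               # [batch, seq_len], True = masked

def causal_mask(seq_len):
    """Upper-triangular look-ahead mask; used only in the decoder's first
    (masked) self-attention layer so position i cannot see positions > i."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

tokens = torch.tensor([[5, 7, 9, 0, 0]])     # batch of one, padded to length 5
print(padding_mask(tokens))
print(causal_mask(4))
```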

