Understanding Attention Mechanism and Transformer in NLP

This article summarizes the attention mechanism in natural language processing (NLP) in a Q&A format and provides an in-depth analysis of the Transformer.

Table of Contents

1. Analysis of Attention Mechanism
   1. Why introduce the attention mechanism?
   2. What types of attention mechanisms are there? (How are they classified?)
   3. What is the computational process of the attention mechanism?
   4. What are the variants of the attention mechanism?
   5. A powerful attention mechanism: Why is the self-attention model so powerful for long-distance sequences?
      (1) Can’t convolutional or recurrent neural networks handle long-distance sequences?
      (2) What methods can solve the short-distance dependency problem to establish long-distance dependencies on the input sequence?
      (3) What is the specific computational process of the self-attention model?
2. Detailed Explanation of Transformer (Attention Is All You Need)
   1. What is the overall architecture of the Transformer? What components does it consist of?
   2. What are the differences between Transformer Encoder and Transformer Decoder?
   3. What are the differences between Encoder-Decoder attention and self-attention mechanism?
   4. What is the specific computational process of the multi-head self-attention mechanism?
   5. How is the Transformer applied in pre-trained models like GPT and BERT? What changes are there?

1. Analysis of Attention Mechanism

1. Why introduce the attention mechanism?

According to the universal approximation theorem, both feedforward networks and recurrent networks have strong representational power. Why, then, introduce the attention mechanism?

  • Limitations of computational power: the more “information” a model needs to remember, the more complex it becomes, yet computational resources remain a bottleneck for scaling neural networks.

  • Limitations of optimization algorithms: although local connections, weight sharing, and pooling can simplify neural networks and help reconcile model complexity with expressive power, recurrent neural networks still have limited capacity to memorize information over long sequences because of the long-distance dependency problem.

We can borrow the way the human brain handles information overload: introducing an attention mechanism, for example, improves a neural network’s ability to process large amounts of information.

2. What types of attention mechanisms are there? (How are they classified?)

When using neural networks to process large amounts of input information, we can also draw on the attention mechanism of the human brain, selectively processing key information to improve the efficiency of the neural network. Based on cognitive neuroscience, attention can be broadly divided into two categories:

  • Focused attention: top-down, conscious, active attention; it is purpose-driven, task-dependent attention that is consciously directed at a specific object;

  • Saliency-based attention: bottom-up, unconscious, passive attention; it is driven by external stimuli, requires no active intervention, and is not task-dependent. Max-pooling and gating mechanisms can be viewed as approximations of bottom-up saliency-based attention.

In artificial neural networks, the attention mechanism generally refers specifically to focused attention.

3. What is the computational process of the attention mechanism?

Figure: The essence of the attention mechanism is addressing

The essence of the attention mechanism is an addressing process, as shown in the figure above: given a task-related query vector q, we compute an attention distribution over the Keys and apply it to the Values to obtain the Attention Value. This also shows how the attention mechanism reduces the complexity of neural network models: instead of feeding all N pieces of input information into the network, we only need to select the task-relevant information from X.

The attention mechanism can be divided into three steps: first, information input; second, calculating the attention distribution α; third, calculating the weighted average of the input information based on the attention distribution α.

Step 1 – Information Input: Let X = [x1, · · · , xN] represent N pieces of input information;

Step 2 – Calculate Attention Distribution: Let Key = Value = X; the attention distribution is then given by

$$\alpha_i = \mathrm{softmax}\big(s(k_i, q)\big) = \frac{\exp\big(s(k_i, q)\big)}{\sum_{j=1}^{N}\exp\big(s(k_j, q)\big)}$$

We call $\alpha_i$ the attention distribution (a probability distribution), and $s(x_i, q)$ the attention scoring function, which can take several forms:

  • Additive model: $s(x_i, q) = v^\top \tanh(W x_i + U q)$

  • Dot-product model: $s(x_i, q) = x_i^\top q$

  • Scaled dot-product model: $s(x_i, q) = \dfrac{x_i^\top q}{\sqrt{d}}$

  • Bilinear model: $s(x_i, q) = x_i^\top W q$

Step 3 – Weighted Average of Information: The attention distribution $\alpha_i$ can be interpreted as the degree to which the i-th piece of information is attended to given the query q. The input information X is then encoded with a “soft” information-selection mechanism as follows:

$$\mathrm{att}(X, q) = \sum_{i=1}^{N} \alpha_i x_i$$

This encoding scheme is called the soft attention mechanism, and it comes in two modes: the ordinary mode (Key = Value = X) and the key-value pair mode (Key ≠ Value).

Figure: Soft attention mechanism (ordinary mode on the left, key-value pair mode on the right)
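
To make the three steps concrete, here is a minimal NumPy sketch of the ordinary soft attention mode (Key = Value = X) using the scaled dot-product scoring function; the function names, shapes, and toy data are illustrative assumptions rather than code from the original sources.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())          # subtract max for numerical stability
        return e / e.sum()

    def soft_attention(X, q):
        """Ordinary soft attention (Key = Value = X).
        X: (N, d) array of N input vectors; q: (d,) task-related query vector."""
        d = X.shape[1]
        scores = X @ q / np.sqrt(d)      # Step 2a: scaled dot-product scoring s(x_i, q)
        alpha = softmax(scores)          # Step 2b: attention distribution (sums to 1)
        return alpha @ X                 # Step 3: weighted average of the inputs

    # Toy usage: N = 5 pieces of information, each 4-dimensional
    X = np.random.randn(5, 4)
    q = np.random.randn(4)
    print(soft_attention(X, q).shape)    # (4,)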

4. What are the variants of the attention mechanism?

Compared to the ordinary attention mechanism (left in the figure), what variants of the attention mechanism are there?

  • Variant 1 – Hard Attention: The attention described above is soft attention, where the selected information is the expectation of all input information under the attention distribution. Another form attends only to the information at a single position, called hard attention. Hard attention can be implemented in two ways: (1) select the input information with the highest attention probability, or (2) sample randomly from the attention distribution. The drawback of hard attention:

The drawback of hard attention is that it selects information based on maximum sampling or random sampling. Therefore, the final loss function is not differentiable with respect to the attention distribution, making it unsuitable for training with backpropagation algorithms. To use backpropagation, soft attention is generally used instead of hard attention. Hard attention needs to be trained through reinforcement learning. — “Neural Networks and Deep Learning”

  • Variant 2 – Key-Value Attention: This is the key-value pair mode shown on the right in the figure, where Key ≠ Value, and the attention function becomes:

$$\mathrm{att}\big((K, V), q\big) = \sum_{i=1}^{N} \alpha_i v_i = \sum_{i=1}^{N} \frac{\exp\big(s(k_i, q)\big)}{\sum_{j}\exp\big(s(k_j, q)\big)}\, v_i$$
  • Variant 3 – Multi-Head Attention: Multi-head attention uses multiple queries Q = [q1, · · · , qM] to select several pieces of information from the input in parallel; each attention head focuses on a different part of the input, and the results are concatenated (a small code sketch after this list illustrates these variants):

$$\mathrm{att}\big((K, V), Q\big) = \mathrm{att}\big((K, V), q_1\big) \oplus \cdots \oplus \mathrm{att}\big((K, V), q_M\big)$$
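
As a rough illustration of the variants above, the following NumPy sketch contrasts hard selection with the key-value pair mode and with multiple queries; all names, shapes, and random data are assumptions made for this example.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    d, N, M = 4, 6, 3
    K = np.random.randn(N, d)              # keys
    V = np.random.randn(N, d)              # values (Key != Value: key-value pair mode)
    Q = np.random.randn(M, d)              # M queries, as in multi-head attention

    outputs = []
    for q in Q:                            # each query attends to the input independently
        alpha = softmax(K @ q / np.sqrt(d))    # attention distribution over the N keys
        soft = alpha @ V                       # variant 2: key-value soft attention
        hard = V[np.argmax(alpha)]             # variant 1: hard attention (not differentiable)
        outputs.append(soft)
    result = np.concatenate(outputs)       # variant 3: concatenate the M selections
    print(result.shape)                    # (M * d,) = (12,)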

5. A powerful attention mechanism: Why is the self-attention model so powerful for long-distance sequences?

(1) Can’t convolutional or recurrent neural networks handle long-distance sequences?

When using neural networks to process a variable-length vector sequence, we can typically use convolutional networks or recurrent networks for encoding to obtain an output vector sequence of the same length, as shown in the figure:

Figure: Variable-length sequence encoding based on convolutional and recurrent networks

From the figure above, it can be seen that both convolutional and recurrent neural networks are essentially a type of “local encoding” for variable-length sequences: convolutional neural networks are evidently based on N-gram local encoding; for recurrent neural networks, due to issues like vanishing gradients, they can only establish short-distance dependencies.

(2) What methods can solve the short-distance dependency problem to establish long-distance dependencies on the input sequence?

If we want to establish long-distance dependencies among input sequences, we can use the following two methods: one method is to increase the number of layers in the network to obtain long-distance information interaction through a deep network; the other method is to use fully connected networks. — “Neural Networks and Deep Learning”

Figure: Fully connected model vs. self-attention model; solid lines indicate learnable weights, dashed lines indicate dynamically generated weights.

As can be seen from the figure above, the fully connected network is a very direct way to model long-distance dependencies, but it cannot handle variable-length input sequences: the set of connection weights it needs is tied to the input length.

At this point, we can leverage the attention mechanism to “dynamically” generate different connection weights, which is the self-attention model (self-attention model). Because the weights of the self-attention model are dynamically generated, it can handle variable-length information sequences.

Overall, why is the self-attention model so powerful: it uses the attention mechanism to dynamically generate different connection weights to handle variable-length information sequences.

(3) What is the specific computational process of the self-attention model?

Similarly, given the information input: let X = [x1, · · · , xN] represent N pieces of input information; through linear transformations, we obtain the query vector sequence, key vector sequence, and value vector sequence:

$$Q = W_Q X, \qquad K = W_K X, \qquad V = W_V X$$

The formula above shows that Q in self-attention refers to transformations of its own (self) input, whereas in traditional attention, Q comes from external sources.

Figure: Breakdown of the self-attention computation (from “Attention Is All You Need”)

The attention computation formula is:

$$h_i = \mathrm{att}\big((K, V), q_i\big) = \sum_{j=1}^{N} \alpha_{ij} v_j = \sum_{j=1}^{N} \mathrm{softmax}\big(s(k_j, q_i)\big)\, v_j$$

In the self-attention model, scaled dot-product is typically used as the attention scoring function, and the output vector sequence can be expressed as:

$$H = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
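
Putting the formulas above together, here is a minimal NumPy sketch of one self-attention layer with scaled dot-product scoring; the projection matrices W_q, W_k, W_v and all dimensions are illustrative assumptions.

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_q, W_k, W_v):
        """X: (N, d_model) input sequence; returns an output sequence of the same length N."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v      # Q, K, V all come from the input itself
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)          # (N, N) pairwise compatibility scores
        A = softmax(scores, axis=-1)             # each row is an attention distribution
        return A @ V                             # (N, d_k) output sequence

    d_model = 8
    W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
    # The same parameters handle sequences of different lengths:
    print(self_attention(np.random.randn(5, d_model), W_q, W_k, W_v).shape)   # (5, 8)
    print(self_attention(np.random.randn(9, d_model), W_q, W_k, W_v).shape)   # (9, 8)

Because the attention weights are computed from the inputs rather than stored as fixed parameters, the same projection matrices work for length-5 and length-9 sequences, which is exactly the “dynamically generated connection weights” point made above.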

2. Detailed Explanation of Transformer (Attention Is All You Need)

From the title of the Transformer paper, it can be seen that the core of the Transformer is Attention, which is why this article discusses the attention mechanism before introducing the Transformer. If one understands the attention mechanism, especially the self-attention model, the Transformer becomes easy to understand.

1. What is the overall architecture of the Transformer? What components does it consist of?

Figure: Transformer model architecture

The Transformer is essentially a Seq2Seq model, with an encoder on the left reading the input and a decoder on the right producing the output:

Figure: Seq2Seq model

Transformer = Transformer Encoder + Transformer Decoder

(1) Transformer Encoder (N=6 layers, each layer includes 2 sub-layers):

Figure: Transformer Encoder
  • Sub-layer 1: a multi-head self-attention mechanism that performs self-attention over the input sequence.

  • Sub-layer 2: Position-wise Feed-forward Networks, a simple fully connected network applied identically to the vector at each position; it consists of two linear transformations with a ReLU activation in between (input and output dimensions are 512, the hidden layer is 2048):

$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2$$

Each sub-layer is wrapped in a residual connection followed by layer normalization: $\mathrm{LayerNorm}\big(x + \mathrm{SubLayer}(x)\big)$
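
The following NumPy sketch shows the position-wise feed-forward sub-layer together with the residual connection and layer normalization described above; the dimensions follow the paper (512 and 2048), while the weight names and initialization are assumptions for the example.

    import numpy as np

    d_model, d_ff = 512, 2048
    W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

    def ffn(x):
        """Position-wise FFN: the same two linear maps + ReLU applied at every position."""
        return np.maximum(0, x @ W1 + b1) @ W2 + b2

    def layer_norm(x, eps=1e-6):
        mu = x.mean(-1, keepdims=True)
        sigma = x.std(-1, keepdims=True)
        return (x - mu) / (sigma + eps)

    def sublayer(x, f):
        """Residual connection around a sub-layer, followed by layer normalization."""
        return layer_norm(x + f(x))

    x = np.random.randn(10, d_model)       # a sequence of 10 positions
    print(sublayer(x, ffn).shape)          # (10, 512): the shape is preserved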

(2) Transformer Decoder (N=6 layers, each layer includes 3 sub-layers):

Figure: Transformer Decoder
  • Sub-layer 1: a Masked multi-head self-attention mechanism. It differs from the encoder because the decoder generates the sequence step by step: at time step i, the outputs at positions greater than i do not exist yet, only those before i, so a mask is required (see the mask sketch after this list).

  • Sub-layer 2: Position-wise Feed-forward Networks, same as the encoder.

  • Sub-layer 3: Encoder-Decoder attention calculation.
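
A minimal sketch of the masking idea used in sub-layer 1, assuming (as in the paper) that the mask is added to the attention scores before the softmax; the helper name and shapes are illustrative.

    import numpy as np

    def causal_mask(n):
        """Upper-triangular mask: position i may only attend to positions j <= i."""
        mask = np.triu(np.ones((n, n)), k=1)   # 1 above the diagonal, 0 elsewhere
        return mask * -1e9                     # large negative scores become ~0 after softmax

    scores = np.random.randn(4, 4)             # raw attention scores for a length-4 sequence
    masked = scores + causal_mask(4)
    # After the softmax, row i assigns (almost) zero probability to columns j > i.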

2. What are the differences between Transformer Encoder and Transformer Decoder?

(1) The multi-head self-attention mechanism differs; the encoder does not require masking, while the decoder does.

(2) The decoder has an additional encoder-decoder attention sub-layer, which differs from the self-attention mechanism.

3. What are the differences between Encoder-Decoder attention and self-attention mechanism?

Both use multi-head computation; however, encoder-decoder attention follows the traditional attention pattern: the Query is the decoder’s representation at the current step, produced by the masked self-attention sub-layer, while the Key and Value are both outputs of the encoder, unlike in self-attention. This is reflected in the code:

            ## Multihead Attention (self-attention)
            self.dec = multihead_attention(queries=self.dec,
                                           keys=self.dec,
                                           num_units=hp.hidden_units,
                                           num_heads=hp.num_heads,
                                           dropout_rate=hp.dropout_rate,
                                           is_training=is_training,
                                           causality=True,
                                           scope="self_attention")

            ## Multihead Attention (Encoder-Decoder attention)
            self.dec = multihead_attention(queries=self.dec,
                                           keys=self.enc,
                                           num_units=hp.hidden_units,
                                           num_heads=hp.num_heads,
                                           dropout_rate=hp.dropout_rate,
                                           is_training=is_training,
                                           causality=False,
                                           scope="vanilla_attention")

4. What is the specific computational process of the multi-head self-attention mechanism?

Figure: Computational process of the multi-head self-attention mechanism

The attention mechanism in the Transformer consists of Scaled Dot-Product Attention and Multi-Head Attention, as shown in the overall process in the figure above. Below is a detailed introduction to each step:

  • Expand: This involves linear transformations to generate Q, K, and V vectors;

  • Split heads: This performs head splitting; in the original paper, each position’s 512-dimensional vector is split into 8 heads of 64 dimensions each;

  • Self Attention: Perform self-attention on each head; this process is consistent with what was introduced in the first part;

  • Concat heads: Concatenate each head after self-attention.

The formula for the above process is:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\, K W_i^K,\, V W_i^V\big), \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
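
The following NumPy sketch walks through the four steps (expand, split heads, per-head scaled dot-product attention, concat heads) for a single sequence; the 512 model dimension and 8 heads follow the paper, while the weight names and data are assumptions.

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    d_model, h = 512, 8
    d_k = d_model // h                                     # 64 dimensions per head
    W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) * 0.03 for _ in range(4))

    def multi_head_self_attention(X):
        N = X.shape[0]
        # Expand: linear transformations produce Q, K, V
        Q, K, V = X @ W_q, X @ W_k, X @ W_v                # each (N, 512)
        # Split heads: reshape each to (h, N, d_k)
        split = lambda M: M.reshape(N, h, d_k).transpose(1, 0, 2)
        Qh, Kh, Vh = split(Q), split(K), split(V)
        # Self Attention on each head (scaled dot-product)
        scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k) # (h, N, N)
        heads = softmax(scores, axis=-1) @ Vh              # (h, N, d_k)
        # Concat heads, then apply the output projection
        concat = heads.transpose(1, 0, 2).reshape(N, d_model)
        return concat @ W_o

    print(multi_head_self_attention(np.random.randn(10, d_model)).shape)   # (10, 512)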

5. How is the Transformer applied in pre-trained models like GPT and BERT? What changes are there?

  • In GPT, a unidirectional language model is trained, which directly applies Transformer Decoder;

  • In BERT, a bidirectional language model is trained, applying the Transformer Encoder part, with a Masked (masked language modeling) training objective added on top of the encoder;

The BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention, where each token can only attend to the context to its left. A bidirectional Transformer is generally referred to as a “Transformer encoder,” while the left-context-only version is referred to as a “Transformer decoder,” since the decoder cannot access the information it is about to predict.

References

  1. “Neural Networks and Deep Learning”

  2. “Attention Is All You Need”

  3. Analysis of Google’s BERT: 2 hours to master the strongest NLP training model

  4. Detailed discussion: Attention Is All You Need

  5. Attention Models in Deep Learning (2017 version)
