
- In the “encoder-decoder attention” layer, the queries come from the previous decoder layer, while the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend to all positions in the input sequence, mimicking the typical encoder-decoder attention mechanism in sequence-to-sequence models.
- The encoder contains a self-attention layer. In the self-attention layer, all of the keys, values, and queries come from the same place, namely the output of the previous layer of the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.
- Similarly, the self-attention layer in the decoder allows each position in the decoder to attend to all positions in the decoder up to and including that position. To preserve the autoregressive property, we need to prevent information from flowing leftward in the decoder; this is achieved by masking out (setting to -∞) all values in the softmax input that correspond to illegal connections.
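As a quick numerical check of the masking just described (toy scores chosen purely for illustration), any score set to -∞ receives exactly zero attention weight after the softmax:

```python
import torch
import torch.nn.functional as F

# Toy attention scores for one decoder position; the last two entries stand in for
# "future" positions and are masked to -inf before the softmax
scores = torch.tensor([2.0, 1.0, float("-inf"), float("-inf")])
print(F.softmax(scores, dim=-1))  # tensor([0.7311, 0.2689, 0.0000, 0.0000])
```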
This article covers three topics, Self Attention, Cross Attention, and Causal Attention, to help you understand the three attention mechanisms in Transformer at a glance. (Every article’s standard opening can’t be skipped; no matter how hard and tiring life gets, a sense of ritual must not be missing.)

Three Attention Mechanisms in Transformer
1. Self Attention


Scaled Dot-Product Attention and Multi-Head Attention

Self Attention: within a single sequence, attention scores are calculated using scaled dot-product attention, and the value vectors are weighted and summed to obtain a weighted representation of each position in the input sequence.
It describes an attention mechanism: how scaled dot-product attention is used to calculate attention scores over a single sequence, and thus obtain the attention weights for each position in that sequence.
Self Attention
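To make this concrete, here is a minimal sketch of scaled dot-product self attention in PyTorch, where Q, K, and V are all projections of one and the same input sequence; the sequence length, model dimension, and weight matrices below are illustrative assumptions, not values from the article:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: a weighted sum of the value vectors."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # attention scores between all pairs of positions
    weights = F.softmax(scores, dim=-1)             # attention weights, one row per query position
    return weights @ V                              # weighted representation of each position

# Self attention: Q, K, and V are projections of the same sequence
seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)                   # one input sequence
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # torch.Size([5, 16]), one output vector per input position
```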
It emphasizes a practical method: in practice, we do not run a single attention function over the full dimensionality; instead, attention is computed with h = 8 heads in parallel and the results are concatenated and projected back to the model dimension, so that no single attention calculation has to carry all the information.

Multi-Head Attention
Question 1: The encoder diagram clearly says Multi-Head Attention, so why is it called Self Attention?
Answer: Scaled Dot-Product Attention, Self Attention, and Multi-Head Attention actually refer to the same thing, described from different angles: how the attention weights for each position within a single sequence are obtained. The label Multi-Head Attention in the diagram emphasizes that multiple heads are used to compute those attention weights.
A more rigorous description of the first attention in Transformer (Self Attention) is: the encoder’s input sequence computes its attention weights through Multi-Head Self Attention.
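A minimal sketch of that more rigorous description, using PyTorch’s built-in nn.MultiheadAttention with h = 8 heads and passing the same sequence as query, key, and value; the batch size and sequence length are arbitrary choices:

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8                     # h = 8 heads, as in the original Transformer
self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Encoder self attention: query, key, and value are all the same sequence
src = torch.randn(2, 10, d_model)               # (batch, source length, model dimension)
out, attn_weights = self_attn(src, src, src)    # Q = K = V = src
print(out.shape)            # torch.Size([2, 10, 512])
print(attn_weights.shape)   # torch.Size([2, 10, 10]), averaged over the 8 heads by default
```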
2. Cross Attention
Question 2: The diagram clearly says Multi-Head Attention for this block as well, so why is it called Cross Attention?

Encoder-Decoder Cross Attention

Cross Attention
- Input source:
  - Cross Attention: the inputs come from two different sequences, one from the encoder and one from the decoder.
  - Self Attention: the inputs come from a single sequence, the encoder’s own sequence.
- Implementation goal:
  - Cross Attention: the decoder sequence serves as the query (Q), while the encoder sequence provides the keys (K) and values (V); it is used to transfer attention between the two different sequences of the encoder and decoder.
  - Self Attention: the queries (Q), keys (K), and values (V) all come from the same encoder sequence, computing attention within that single sequence.
Answer: Cross Attention and Multi-Head Attention actually refer to the same thing, described from different angles: how attention is transferred between two different sequences. The label Multi-Head Attention in the diagram emphasizes that multiple heads are used to perform this attention transfer.
A more rigorous description of the second attention in Transformer (Cross Attention) is: the two sequences of the encoder and decoder transfer attention through Multi-Head Cross Attention.
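A minimal sketch of this Multi-Head Cross Attention, under the same illustrative assumptions as the earlier snippets: the decoder sequence supplies the query, the encoder output supplies the keys and values.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

memory = torch.randn(2, 10, d_model)   # encoder output: provides K and V
tgt = torch.randn(2, 7, d_model)       # decoder sequence: provides Q
out, attn_weights = cross_attn(query=tgt, key=memory, value=memory)
print(out.shape)            # torch.Size([2, 7, 512]), one vector per decoder position
print(attn_weights.shape)   # torch.Size([2, 7, 10]): each decoder position attends over all encoder positions
```

Note that the output length follows the query (decoder) sequence, while the attention weights span the encoder positions.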
3. Causal Attention
Question 3: The decoder diagram clearly says Masked Multi-Head Attention, so why is it called Causal Attention?

Decoder Causal Attention
Predicting the Next Word: the model usually needs to predict the next word based on the words that have already been generated. This requires that the model not “see” future information when making a prediction, so that its predictions are not influenced by information from the future.
Predicting the Next Word
Masked Language Model: some words are masked so that the model learns to predict the masked words, which helps it learn language patterns.

Masked Language Model
Autoregressive: when generating a word in a sequence, the decoder attends to all of the words generated so far, up to and including the current position. To maintain the autoregressive property, that is, so that the model can only condition on already generated information while producing the sequence, we need to prevent information from flowing leftward in the decoder. In other words, when the decoder is generating the t-th word, it must not see future information (positions t+1, t+2, …).
Causal Attention: to ensure that the model relies only on earlier inputs when generating a sequence and is not influenced by future information, Causal Attention masks the future positions, so that when predicting the output at a given position the model can only see that position and the inputs before it.

Causal Attention
Answer: Causal Attention and Masked Multi-Head Attention actually refer to the same thing: Self Attention in the decoder is combined with causal masking to preserve the autoregressive property.
Masked Multi-Head Attention emphasizes the use of multiple independent attention heads, each of which can learn different attention weights, thereby enhancing the model’s representational power. Causal Attention emphasizes that the model can only rely on already generated information when making predictions and cannot see future information.
A more rigorous description of the third attention in Transformer (Causal Attention) is: the decoder’s single sequence computes attention through Multi-Head Causal Self Attention.
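A minimal sketch of this Multi-Head Causal Self Attention, again with illustrative shapes: an upper-triangular mask filled with -∞ is added to the attention scores, so position t can only attend to positions 1 through t.

```python
import torch
import torch.nn as nn

d_model, num_heads, tgt_len = 512, 8, 7
causal_self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Float mask: 0 on allowed positions, -inf strictly above the diagonal,
# i.e. every "future" connection is set to -inf before the softmax
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

tgt = torch.randn(2, tgt_len, d_model)                 # decoder input sequence
out, attn_weights = causal_self_attn(tgt, tgt, tgt, attn_mask=causal_mask)
print(out.shape)          # torch.Size([2, 7, 512])
print(attn_weights[0])    # row t has zero weight on positions t+1, t+2, ...
```

This mask is exactly what the paper means by setting the illegal connections in the softmax input to -∞.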