This article covers three aspects of the Transformer: its essence, its principles, and its applications, helping you understand the Transformer (overall architecture and its three types of attention layers) in one article.

1. Essence of Transformer
The origin of Transformer: The Google Brain translation team proposed a novel, simple network architecture called the Transformer, based entirely on the attention mechanism and discarding recurrent and convolutional operations.

Attention mechanism is all you need
The mainstream sequence transduction model, the RNN: Before the Transformer, mainstream sequence transduction models were based on complex recurrent neural networks (RNNs) consisting of an encoder and a decoder. The best-performing models at the time connected the encoder and decoder through an attention mechanism.
Transformer vs RNN
RNN-family models (RNN, LSTM, GRU) encode the entire input sequence x1, x2, x3, x4 into a single vector c, which may result in information loss: because all of the input's information is compressed into this one vector, the risk of losing information grows, and it is also difficult for the decoder to extract the relevant information from it.
RNN encoder-decoder architecture
Attention mechanism: A mechanism that allows the model to focus on the key parts of the information while ignoring irrelevant parts, improving processing efficiency and accuracy. It mimics the selective attention of human visual processing.


Q, K, V: The attention mechanism matches queries (Q) against keys (K) to compute attention scores (scaled vector dot products), converts the scores into weights with a softmax, and then uses the weights to form a weighted sum over the values (V), producing the final attention output.
Calculating attention scores using Q, K, V
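To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function name and toy shapes are illustrative assumptions, not taken from the original article.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Attention scores: dot products of queries with keys, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Convert the scores into weights with a row-wise softmax.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors gives the attention output.
    return weights @ V

# Toy usage: a sequence of 4 positions with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```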

- Advantage 1: Handling long sequences. The Transformer's self-attention mechanism processes all positions in the sequence simultaneously and captures long-distance dependencies, so it can understand the meaning of the text more accurately. RNN models, by contrast, are limited by their recurrent structure and struggle with long sequences.
- Advantage 2: Parallel computation. Because RNN models must process the sequence element by element, their computation speed is severely limited. The Transformer, by contrast, processes the entire sequence at once, greatly improving computational efficiency.

2. Principles of Transformer
The core idea of the encoder-decoder architecture is to encode the input sequence into vector representations and then use these representations to generate the output sequence.

Machine translation
Architecture of the Transformer: The Transformer follows the overall encoder-decoder architecture, using stacked self-attention layers and position-wise fully connected layers for both the encoder and the decoder, shown in the left and right halves of the diagram respectively.
Architecture of Transformer
- Encoder: The Transformer encoder consists of 6 identical layers, each with two sub-layers: a multi-head self-attention layer and a position-wise feedforward network. Each sub-layer is followed by a residual connection and layer normalization, collectively referred to as Add & Norm (a minimal sketch of one such layer is given after this list). This structure helps the encoder capture dependencies between all positions in the input sequence.

- Decoder: The Transformer decoder also consists of 6 identical layers, each with three sub-layers: a masked self-attention layer, an encoder-decoder attention layer, and a position-wise feedforward network. Each sub-layer is likewise followed by a residual connection and layer normalization (Add & Norm). This structure ensures that, when generating the sequence, the decoder considers only previous outputs and is not influenced by future information.
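Here is a minimal PyTorch-style sketch of one encoder layer (Add & Norm around multi-head self-attention and the position-wise FFN). The hyperparameters (d_model=512, 8 heads, d_ff=2048) follow the original paper, but the class and variable names are illustrative assumptions, not the article's implementation.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self-attention + position-wise FFN,
    each followed by a residual connection and layer normalization (Add & Norm)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, then Add & Norm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feedforward network, then Add & Norm.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

layer = EncoderLayer()
x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
out = layer(x)                # same shape
```

The full encoder stacks 6 such layers; a decoder layer would additionally contain a masked self-attention sub-layer and an encoder-decoder (cross) attention sub-layer, as described above.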


Essential differences between encoder and decoder
Core components of Transformer
- Input embedding: Converts the input text into vectors the model can process.
- Position encoding: Adds positional information to the input vectors, since the Transformer processes the sequence in parallel and has no built-in notion of order (see the sketch after this list).
- Multi-head attention: Lets the model attend to different parts of the input sequence simultaneously, capturing complex dependencies.
- Residual connections and layer normalization: Add cross-layer connections and normalize sub-layer outputs, helping the model train more stably and avoiding gradient problems.
- Masked multi-head attention: Ensures that the model relies only on already known information when generating text, not on future content.
- Feedforward network: Applies a nonlinear transformation to each position to extract higher-level features.
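Because the attention layers themselves are order-agnostic, position encoding is what carries order information. Below is a minimal NumPy sketch of the sinusoidal position encoding described in the original Transformer paper; the function name and toy sizes are illustrative.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len, d_model):
    """Returns an array of shape (seq_len, d_model) that is added to the input embeddings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)   # cosine on odd dimensions
    return pe

pe = sinusoidal_position_encoding(seq_len=10, d_model=512)   # added to the 10 input embeddings
```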
Three types of attention layers in the Transformer: The Transformer architecture contains three different attention layers: Self Attention, Cross Attention, and Causal Attention.
- Self Attention layer in the encoder: the encoder's input sequence computes attention weights over itself through Multi-Head Self Attention.
- Cross Attention layer in the decoder: attention is passed between the encoder and decoder sequences through Multi-Head Cross Attention.
- Causal Attention layer in the decoder: the decoder's single sequence computes attention through Multi-Head Causal Self Attention.
Three types of attention layers in Transformer

Scaled Dot-Product Attention and Multi-Head Attention

Self Attention: Within a single sequence, scaled dot-product attention is used to compute attention scores between positions; the value vectors are then weighted and summed, yielding an attention-weighted representation of each position in the input sequence.
Self Attention
Emphasizes a practical refinement: in practice, we do not perform a single attention function over the full model dimension. Instead, the queries, keys, and values are projected h = 8 times, attention is computed in each head in parallel, and the heads' outputs are concatenated and linearly projected. This lets the model attend to information from different representation subspaces and reduces the risk that a single attention computation misses important patterns.

Multi-Head Attention
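A minimal NumPy sketch of multi-head attention over a single sequence, assuming the model dimension is divisible by the number of heads; the projection matrices and names are illustrative. Note how the heads are computed in parallel and then concatenated and linearly projected.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=8):
    """X: (seq_len, d_model); the W_* matrices are learned projections of shape (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project the inputs and split the result into n_heads lower-dimensional heads.
    def split(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    Q, K, V = split(W_q), split(W_k), split(W_v)
    # Scaled dot-product attention inside each head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                      # (heads, seq, d_head)
    # Concatenate the heads and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Toy usage with d_model = 64 and 8 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))
W = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W)   # shape (5, 64)
```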

Causal Attention: Ensures that, when generating a sequence, the model relies only on previous inputs and not on future information. Causal Attention achieves this by masking future positions, so that when predicting the output at a given position the model can see only that position and the ones before it.
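A minimal sketch of how the causal mask can be applied: future positions are set to negative infinity in the score matrix before the softmax, so their attention weights become zero. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Self-attention where each position attends only to itself and earlier positions."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Mask out future positions (strict upper triangle) with -inf before the softmax.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V   # position i depends only on positions 0..i
```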


Question 2: In the diagram, this layer is labeled simply Multi-Head Attention, the same as in the encoder; how can it be called Cross Attention?

Cross Attention of encoder-decoder
Answer to Question 2: The layer is implemented as multi-head attention, but its queries come from the decoder while its keys and values come from the encoder's output, so the attention crosses between the two sequences; that is why it is called Cross Attention.
Question 3: In the diagram, the decoder clearly says Masked Multi-Head Attention; how can it be called Causal Attention?

Decoder’s Causal Attention
Answer to Question 3: Causal Attention and Masked Multi-Head Attention actually refer to the same thing; this also explains how the decoder's self-attention is combined with the causal mask to preserve the autoregressive property.
The name Masked Multi-Head Attention emphasizes the use of multiple independent attention heads, each of which can learn different attention weights, enhancing the model's representational ability. The name Causal Attention emphasizes that the model can rely only on already generated information when making predictions and cannot see future information.
3. Applications of Transformer
Applications of the Transformer in NLP: Owing to its powerful performance, the Transformer and its variants are widely used in natural language processing tasks such as machine translation, text summarization, and question answering.
- Transformer: Vaswani et al. first proposed the attention-based Transformer for machine translation and English constituency parsing.
- BERT: Devlin et al. introduced the language representation model BERT, which pre-trains a bidirectional Transformer on unlabeled text so that each word's representation takes both its left and right context into account. On release, BERT achieved state-of-the-art results on 11 NLP tasks.
- GPT: Brown et al. pre-trained GPT-3, a Transformer-based model with 175 billion parameters, on a dataset drawn from 45 TB of compressed plain text. It achieved strong performance on many kinds of downstream natural language tasks without any fine-tuning.

Applications of the Transformer in CV:
- ViT uses the Transformer's self-attention mechanism to model image features, unlike CNNs, which extract local image features through convolutional and pooling layers.
- ViT's main building block is based on the Transformer encoder structure, including the multi-head attention structure.

Vision Transformer

Workflow of ViT: Split the image into fixed-size patches, convert them into patch embeddings, add positional encoding, feed these embeddings through a Transformer encoder (multi-head self-attention plus feedforward networks), and finally use the classification token for tasks such as image classification.
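The patch-embedding front end of this workflow can be sketched as follows, assuming 224x224 images and 16x16 patches as in the original ViT; the tensor sizes and names are illustrative, and the learnable parameters are shown as zero tensors only for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical ViT front end: a 224x224 image split into 16x16 patches -> 196 patch embeddings.
image = torch.randn(1, 3, 224, 224)                          # (batch, channels, H, W)
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # equivalent to flattening each patch + a linear projection
patches = patch_embed(image).flatten(2).transpose(1, 2)      # (1, 196, 768)

cls_token = torch.zeros(1, 1, 768)    # learnable classification token (zeros here for illustration)
pos_embed = torch.zeros(1, 197, 768)  # learnable positional embeddings (zeros here for illustration)
tokens = torch.cat([cls_token, patches], dim=1) + pos_embed  # (1, 197, 768), fed into the Transformer encoder
```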