Unlocking Model Performance with Attention Mechanism

The author of this article: Teacher Tom

▷ Doctorate from a "Double First-Class" university in China, affiliated with a national key laboratory

▷ Published 12 papers at top international conferences, holds 2 national invention patents, and serves as a reviewer for multiple international journals

▷ Guided more than ten doctoral and master’s students

Research areas: pre-training of general vision-language cross-modal models, parameter-efficient fine-tuning of large models, and cutting-edge topics in efficient computation.

01
Introduction

Why Introduce the Attention Mechanism?


Starting from the way humans observe things, we can understand the necessity of the attention mechanism.

When faced with complex scenes or information, humans do not distribute their attention evenly to every detail, but rather focus their attention on certain key parts based on the current task or need.

This selective attention ability allows us to effectively process complex information and respond quickly.

For example, when reading a novel, although every word on the page is in front of us, our attention is only focused on the current plot and story development, ignoring other unimportant information.

This mechanism of “selectively attending to certain information” is also very important in computer vision and natural language processing tasks.

In machine translation, when we try to translate a sentence from the source language to the target language, the model needs to dynamically decide which word or phrase is most important at the current translation step based on the context.

02
Encoder-Decoder Structure

Suppose we want to translate a sentence; the word order in the target language may differ from that in the source language.

For example, the Chinese "我爱你" translates to "I love you" in English. In this simple case both languages happen to follow a subject-verb-object order, but for longer sentences the word order of the two languages often diverges.

A naive approach might simply replace each source-language word, in order, with its target-language equivalent, but this clearly cannot handle differences in word order.

To solve this problem, researchers proposed the Encoder-Decoder structure.

In this structure, the Encoder processes the input source language sentence through a Recurrent Neural Network (RNN), compressing all the information from the sentence into a fixed-size intermediate state vector.

Then, the Decoder uses another RNN to gradually decode and generate the target language sentence. This way, regardless of the order of words in the source sentence, they can be transmitted to the Decoder through the intermediate state, allowing for the generation of the target language sentence.

However, this method has a problem: the intermediate state is a fixed-size vector, meaning the information it can hold is limited.

If the source sentence is very long, the Encoder must compress all of its content into this vector, whose capacity is limited, so it cannot perfectly carry that much information. As the source sentence grows longer, the Decoder may be unable to make full use of all the source information, and details in the sentence may be lost.

At this point, the introduction of the attention mechanism provides us with a breakthrough.

With the attention mechanism, the Decoder no longer relies solely on a fixed intermediate state vector, but instead, based on each decoding step, dynamically selects and focuses on the most relevant parts of the source sentence.

For example, when translating the sentence "我喜欢吃苹果" (I like to eat apples), although "我" (I) and "苹果" (apples) sit in different positions in the source sentence, the attention mechanism lets the Decoder focus particularly on "苹果" (apples) when translating "吃" (eat), while largely ignoring "我" (I).

This mechanism allows the Decoder to flexibly retrieve the most relevant information from the source sentence according to the current needs during the decoding process.

The introduction of attention makes translation models more flexible, enabling them to handle long sentences more accurately and improving translation quality, especially between languages with significantly different sentence structures.

Therefore, the attention mechanism improves the model’s performance in handling long sentences by solving the information bottleneck problem and allows the model to dynamically select which parts of the information are most important, thus generating more accurate translation results.

Weighted Sum Based Attention

The earliest applications of attention mechanisms in machine translation aimed to solve the information bottleneck problem in traditional Encoder-Decoder structures.

Traditional Encoder-Decoder structures compress the entire input sentence into a fixed-size vector, which is passed to the Decoder for translation. However, this approach has limited information carrying capacity for long sentences, leading to a decline in translation quality.

The introduction of the attention mechanism allows the Decoder to no longer rely on a fixed intermediate vector, but rather to dynamically extract information from various parts of the input sentence.

The basic weighted-sum attention mechanism calculates the relevance between the query (Query) and each key (Key), converts those relevance scores into normalized weights (typically via softmax), and uses the weights to form a weighted sum of the values (Value).

This mechanism allows the model to focus on different parts of the input, thus generating better output.

$$\mathrm{Attention}(q, K, V) = \sum_{i} \alpha_i\, v_i, \qquad \alpha_i = \frac{\exp\big(\mathrm{score}(q, k_i)\big)}{\sum_{j}\exp\big(\mathrm{score}(q, k_j)\big)}$$
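To make this concrete, here is a minimal PyTorch sketch of weighted-sum attention with a generic score function. The function name, tensor shapes, and the choice of a dot-product score are illustrative assumptions, not taken from a specific paper:

```python
import torch
import torch.nn.functional as F

def weighted_sum_attention(query, keys, values):
    """Generic weighted-sum attention (illustrative sketch).

    query:  (d,)      current query vector
    keys:   (L, d)    one key per source position
    values: (L, d_v)  one value per source position
    """
    # Relevance between the query and every key (dot-product score here;
    # other score functions, e.g. additive scoring, are also possible).
    scores = keys @ query                 # (L,)
    # Turn relevance scores into normalized weights.
    weights = F.softmax(scores, dim=0)    # (L,), sums to 1
    # Weighted sum of the values: the context vector.
    return weights @ values               # (d_v,)

# Toy usage with random tensors.
q = torch.randn(8)
K = torch.randn(5, 8)
V = torch.randn(5, 16)
context = weighted_sum_attention(q, K, V)   # shape (16,)
```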

Dot-Product Attention: Simplification and Expansion of Attention

As models became deeper, computational complexity gradually became a problem.

The earlier additive attention computed alignment scores by passing the query and key through a small feed-forward network, which is relatively computationally intensive.

To address this issue, researchers proposed dot-product attention, which simplifies the computation process by directly calculating the dot product between the query and key. Its computation method is shown below:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}\big)\,V$$

The advantage of dot-product attention is its simplicity and efficiency, especially when training on large datasets. Currently, this type of attention is widely used in Transformer models, becoming one of the core technologies in modern natural language processing.
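A minimal sketch of (unscaled) dot-product attention, assuming the queries, keys, and values have already been arranged into matrices Q, K, and V; the shapes and names are illustrative:

```python
import torch

def dot_product_attention(Q, K, V):
    """Unscaled dot-product attention (illustrative sketch).
    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)
    """
    scores = Q @ K.T                      # (L_q, L_k) pairwise dot products
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                    # (L_q, d_v)
```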

Scaled Dot-Product Attention: Handling High-Dimensional Issues

Although dot-product attention is efficient, as the dimensions of the query and key increase, the values of the dot product can become very large, which may lead to instability in softmax calculations, affecting gradient propagation.

To address this issue, researchers proposed scaled dot-product attention, which controls the size of the values by dividing the result of the dot product by a scaling factor (usually the square root of the key’s dimension). Its computation method is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V$$

This mechanism ensures the stability of the computation process and improves the effectiveness of attention calculations, being key to implementing multi-head attention and other complex structures.
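A minimal sketch of scaled dot-product attention; the only change from the previous sketch is the division by the square root of the key dimension (tensor shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    Q: (..., L_q, d_k), K: (..., L_k, d_k), V: (..., L_k, d_v)
    """
    d_k = Q.size(-1)
    # Dividing by sqrt(d_k) keeps the score magnitudes under control as the
    # dimension grows, so the softmax does not saturate and gradients stay usable.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```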


Multi-Head Attention: Learning Information from Multiple Perspectives

Although dot-product attention performs well in computational efficiency, the representational capacity of a single attention head is limited.

To enhance the model’s expressiveness, multi-head attention was introduced, which maps the queries, keys, and values into multiple subspaces, performing attention calculations independently in each subspace, and finally concatenating all results. Its computation method is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)\,W^{O}$$

Each head is calculated as:

$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\big)$$

Multi-head attention can learn different features from multiple subspaces, thus greatly enhancing the model’s expressiveness.
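A minimal PyTorch sketch of multi-head attention, assuming d_model is divisible by num_heads; the module and variable names are illustrative, and masking and dropout are omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Multi-head attention sketch: project Q, K, V into h subspaces,
    attend in each subspace independently, then concatenate."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, q, k, v):
        B, L, _ = q.shape
        # Project and split into heads: (B, h, L, d_head)
        def split(x):
            return x.view(B, -1, self.num_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ V          # (B, h, L, d_head)
        # Concatenate heads and apply the output projection.
        out = out.transpose(1, 2).contiguous().view(B, L, -1)
        return self.w_o(out)

# Example usage: mha = MultiHeadAttention(d_model=512, num_heads=8)
#                y = mha(x, x, x)  # self-attention over x of shape (B, L, 512)
```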


Self-Attention: Capturing Long-Distance Dependencies in Sequences

Self-attention is a key component of the Transformer architecture, initially applied in machine translation, but not limited to it.

The core idea of self-attention is that every element of the input sequence attends to every other element of the same sequence, exchanging information directly with all positions and thus capturing long-distance dependencies in the sequence.

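In code, self-attention is simply the scaled dot-product attention above with the queries, keys, and values all projected from the same sequence; a minimal sketch, with illustrative dimensions and projection names:

```python
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(10, d_model)            # one sequence of 10 tokens

# Separate learned projections for queries, keys, and values,
# all computed from the same input sequence.
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
Q, K, V = w_q(x), w_k(x), w_v(x)

scores = Q @ K.T / d_model ** 0.5       # every token scores every other token
weights = torch.softmax(scores, dim=-1) # (10, 10) attention matrix
out = weights @ V                       # each output mixes the whole sequence
```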

Local Attention: Addressing Computational Issues with Long Sequences

Although the self-attention mechanism can capture global dependencies, its computational complexity is quadratic in the sequence length (for a sequence of length $L$, the complexity is $O(L^2)$). This computation becomes very inefficient when processing very long sequences.

Therefore, local attention was proposed, which restricts the range of attention, allowing each element to only focus on a local area of the input sequence, rather than global information.
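A minimal sketch of local attention using a banded mask; the window parameter and names are illustrative. Note that for clarity this sketch still builds the full score matrix, whereas efficient implementations compute only the in-window scores to realize the complexity savings:

```python
import torch

def local_attention(Q, K, V, window):
    """Local attention sketch: each position only attends to keys within
    `window` positions of itself, instead of the full sequence.
    Q, K: (L, d_k), V: (L, d_v)
    """
    L, d_k = Q.shape
    scores = Q @ K.T / d_k ** 0.5                       # (L, L)
    # Band mask: True where |i - j| > window (outside the local window).
    idx = torch.arange(L)
    outside = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(outside, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```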


Channel Attention and Spatial Attention: Applications in Computer Vision

In computer vision, the application of attention mechanisms has gradually expanded from sequential data to image data.

Channel attention and spatial attention mechanisms have been proposed to adjust the weights of channel-wise and spatial information, making tasks such as image classification, object detection, and semantic segmentation more effective.


The core idea of the channel attention mechanism is: different channels contain information of varying importance, and the model needs to dynamically adjust the contribution of each channel based on their importance.

Each channel can be seen as a different “perspective” on the input feature map, where some channels contribute more to the current task than others.

The channel attention mechanism enhances the feature response of important channels and suppresses less important channels by assigning a weight to each channel.

Typically, after aggregating global information, a simple fully connected layer or convolutional layer is used to generate the attention weights for the channels, indicating the importance of each channel.

These weights are normalized using a sigmoid function to ensure they are within the range of [0, 1]:

$$\mathbf{a} = \sigma\big(W\,\mathbf{z}\big), \qquad \mathbf{z} = [z_1, z_2, \ldots, z_C]$$

where $z_c$ is the globally pooled feature of the $c$-th channel (e.g., obtained by average pooling), $W$ is the learned weight matrix, and $\sigma$ is the sigmoid function.
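As an illustration, here is a minimal sketch in the style of a Squeeze-and-Excitation block; unlike the single weight matrix in the formula above, this common variant uses a two-layer bottleneck before the sigmoid, and the reduction ratio and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-Excitation style channel attention (illustrative sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # weights in [0, 1]
        )

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)                  # (B, C) pooled descriptor
        a = self.fc(z).view(b, c, 1, 1)              # per-channel weights
        return x * a                                 # reweight each channel
```

Multiplying the input by the per-channel weights enhances informative channels and suppresses less useful ones, matching the description above.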

The core idea of the spatial attention mechanism is: the contribution of pixels at different positions in the feature map of the same channel varies for the task.

The spatial attention mechanism dynamically adjusts the area of focus by calculating the importance of various spatial positions in the feature map.

That is, the model learns to assign different weights to different positions in the spatial dimension, allowing the network to focus on important regions in the image while suppressing unimportant parts.

Spatial attention first aggregates information across channels to obtain a summary of each spatial position, commonly by taking the average and/or the maximum over the channel dimension, which gives the overall response of each spatial position across all channels.

The resulting feature map is fed into a small convolutional network, typically a single-layer convolutional layer or a convolution + activation function (like ReLU) combination, to generate the spatial attention map.

The spatial attention map is normalized using the sigmoid function to generate attention weights for each spatial position.

$$M_s = \sigma\big(W * [F_{\mathrm{avg}};\, F_{\mathrm{max}}]\big)$$

where $F_{\mathrm{avg}}$ is the feature map after channel-wise average pooling, $F_{\mathrm{max}}$ is the feature map after channel-wise max pooling, $[\,\cdot\,;\,\cdot\,]$ denotes concatenating them along the channel dimension, $W$ is the learned convolution kernel, and $*$ denotes convolution.
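A minimal CBAM-style sketch of spatial attention following the description above: average pooling and max pooling across channels, concatenation, a single convolution, and a sigmoid. The kernel size and names are illustrative:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (illustrative sketch)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        # Single conv over the 2-channel [avg; max] map, as described above.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                            # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)            # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)           # channel-wise max pooling
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)
        return x * m                                 # reweight each position
```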

03
Conclusion

The attention mechanism has evolved from basic weighted sums to more complex forms like self-attention, multi-head attention, local attention, etc. In this process, models continuously attempt to improve computational efficiency, capture long-distance dependencies, and enhance the expressiveness of important information by selectively attending to different parts of the input.

Each attention mechanism was introduced to optimize for different computational budgets, task requirements, and application scenarios.

Ultimately, the combination and development of these mechanisms enable deep learning models to achieve unprecedented performance in fields such as language processing and computer vision.

The editor has compiled a collection of classic CV papers for reference. Students in need can scan the code to receive it!


Reply “Classic CV” to receive it


Teacher Tom focuses on discovering students’ interests, respecting their choices, and is dedicated to providing complete research ideas and guidance on revising papers, helping students continuously progress on their academic paths.


Students with paper publishing needs, please scan the code and reply “Consultation”

One-on-one custom learning plans


Worn Wisdom, the only company independently developing Turbo research models, has served no fewer than 400,000 university students over the past 20 years, collaborates with over 1,000 top research PhDs from QS top-100 universities, has built a complete teaching and research system, and has carefully crafted foundational research textbooks suitable for various academic backgrounds. It has also achieved remarkable results in global study abroad, international competitions, language tutoring, and more.

The team of nearly 1,000 top PhD mentors will not only guide you in research, publishing research papers, and obtaining offers from prestigious universities, but also provide recommendations for doctoral and master's programs and internship opportunities at major companies.

Worn's mentors are professors, PhD supervisors, algorithm engineers, and researchers from QS top-100 universities, including the University of Oxford, the University of California, Johns Hopkins University, Tsinghua University, Peking University, Fudan University, the National University of Singapore, and other world-class institutions. All teachers hold doctoral degrees.

For more valuable content, follow CV Frontier Paper Interpretations to stay updated!
