Nine Layers of Understanding Attention Mechanism

Author丨Electric Light Phantom Alchemy@Zhihu (Authorized)
Source丨https://zhuanlan.zhihu.com/p/362366192
Editor丨Extreme City Platform


Attention has become popular across the entire AI field: whether in computer vision or natural language processing, it is hard to get away from Attention, the Transformer, or BERT. Below, I imitate the EM nine-layer tower and propose an Attention nine-layer tower. I hope to exchange ideas with everyone; if you have better ideas, feel free to raise them in the comments for discussion.

The Attention Nine-Layer Tower—The Nine Realms of Understanding Attention are as follows:

  1. Seeing the Mountain as a Mountain—Attention is a type of attention mechanism
  2. Seeing the Mountain and the Stone—Mathematically, Attention is a widely used weighted average
  3. Seeing the Mountain and the Peak—In natural language processing, Attention is all you need
  4. Seeing the Mountain and the Water—The BERT series of large-scale unsupervised learning pushes Attention to new heights
  5. Water Turns to Mountain—In computer vision, Attention is an effective non-local information fusion technique
  6. The Mountain is High and the Water is Deep—In computer vision, Attention will be all you need
  7. Mountain Water Reincarnation—In structured data, Attention is a powerful tool to assist GNN
  8. There are Mountains in the Mountains—The relationship between logical interpretability and Attention
  9. Mountain Water Unity—Various variants of Attention and their internal connections

1. Attention is a Type of Attention Mechanism

As the name suggests, Attention is essentially the application of the biological attention mechanism to artificial intelligence. What is the benefit of an attention mechanism? Simply put, it can focus on the features needed for the target scene. For example, given a series of features, a target scene may only need two of them; attention lets the model effectively "pay attention to" those two features while ignoring the rest. Attention first appeared in recurrent neural networks [1], where the author Sukhbaatar gave the following example:

[Figure: a memory-network QA example with sentences (1) to (4), a question Q, and the answer A]

In the image above, we need to combine sentences (1) to (4) and, based on question Q, arrive at the correct answer A. Q has no direct connection with sentence (3), yet the correct answer, bedroom, must be obtained from (3). A natural idea is to guide the model's attention: starting from the question, search for clues among the four sentences and thus answer the question.

[Figure: attention moving from the question to the fourth sentence and then to the third]

As shown in the figure, the word apple in the question leads us to the fourth sentence, and attention then shifts to the third sentence, which determines the answer bedroom.

At this point, we should have grasped the earliest understanding of attention, reaching the first layer—Seeing the Mountain as a Mountain.

Now our question is, how to design such a model to achieve this effect?

The earliest implementation was based on explicit memory, storing the results of each step, “manually implementing” the transfer of attention.

Using the previous example,

[Figure: explicit memory being processed and updated to shift attention]

As shown in the figure, attention is achieved by processing the memory and updating what is attended to. This method is simple but very hand-crafted, and it has gradually been phased out; we need to upgrade our understanding to a more abstract level.
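To make this concrete, here is a minimal NumPy sketch of a single memory-attention "hop" in the spirit of the end-to-end memory network [1]. The embeddings are random toys, the variable names are my own, and the real model learns separate input and output embeddings rather than reusing one `memory` matrix.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy setup: 4 "sentence" memories and 1 question, each already embedded into d = 8 dims.
rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 8))   # one row per stored sentence (1)-(4)
query = rng.normal(size=(8,))      # embedded question Q

# One hop: score each memory against the query, normalise, take a weighted sum.
scores = memory @ query            # relevance of each sentence to the question
weights = softmax(scores)          # attention distribution over the 4 sentences
readout = weights @ memory         # what the model "reads" from memory

# A second hop is how attention can shift (question -> sentence (4) -> sentence (3)):
query2 = query + readout           # fold what was read back into the internal state
weights2 = softmax(memory @ query2)
```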

2. Attention is a Weighted Average

The classic definition of Attention comes from the groundbreaking paper Attention Is All You Need [2]. Although some earlier works had already discovered similar techniques (such as self-attention), this paper earned its place in history by making the bold, and gradually confirmed, assertion that "attention is all you need." The classic definition is given by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The meaning of the formula will be discussed below; first, let's talk about its significance. Over the past five years, this has essentially been the first classic formula that newcomers to the field encounter. Its status in natural language processing is akin to that of Newton's laws in classical mechanics: it has become the fundamental formula for building complex models.

This formula may look complex, but once understood it is very simple and fundamental. Let's go through the meaning of each letter. Literally, Q stands for query, K for key, V for value, and $d_k$ is the dimension of K. At this point someone might ask: what exactly are query, key, and value? Since these three concepts were introduced by this very paper, the best interpretation is that whatever occupies the Q position is the query, whatever occupies the K position is the key, and whatever occupies the V position is the value. In other words, like Newton's laws, this formula can itself serve as a definition.

To facilitate understanding, I will provide a few examples to explain these three concepts.

1. 【Search Field】When searching for videos on Bilibili, the key is the sequence of keywords in the Bilibili database (such as dance, meme, Ma Baoguo, etc.), the query is the sequence of keywords you input, such as Ma Baoguo, meme, and the value is the sequence of videos you find.

2. 【Recommendation System】When buying items on Taobao, the key is all the product information in the Taobao database, the query is the product information you have recently focused on, such as high heels, skinny pants, and the value is the product information pushed to you.

The above two examples are quite specific; in artificial intelligence applications, key, query, and value are often latent variable features. Therefore, their meanings are often not so obvious; what we need to grasp is this computational structure.

Returning to the formula itself: it essentially describes a weighted average according to a relationship matrix. The relationship matrix is $QK^T$ (scaled by $\sqrt{d_k}$); softmax normalizes each row of this matrix into a probability distribution, and V is then weighted and summed according to that distribution, giving the new attention result.
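For concreteness, here is a minimal NumPy sketch of the formula, with toy shapes chosen purely for illustration and a row-wise softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with softmax applied per query."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relationship matrix, shape [n_q, n_k]
    scores -= scores.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row becomes a probability distribution
    return weights @ V                              # weighted average of the value vectors

# Toy shapes: 3 queries, 5 key/value pairs, feature dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
out = scaled_dot_product_attention(Q, K, V)         # shape (3, 4)
```

Each output row is a convex combination of the rows of V, which is exactly the "weighted average according to the relationship matrix" described above.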

The following diagram illustrates the concrete meaning of Attention in NLP. Consider the features of a word such as it: its new features are a weighted combination of the features of the other words. For example, the relationship between the animal and it may be close (since it refers to the animal), so that weight is high, and it influences the next layer's features for it. For more on this, see The Annotated Transformer [3] and the Illustrated Self-Attention [4].

[Figure: self-attention weights for the word it, with the animal receiving a high weight]

At this point, you can roughly understand the basic module of attention, reaching the second layer, seeing the mountain and the stone.

3. In Natural Language Processing, Attention is All You Need.

The importance of the paper Attention is all you need is not only that it proposed the concept of attention but more importantly, it proposed the Transformer, a structure entirely based on attention. Being entirely based on attention means no recurrent or convolutional operations, but solely using attention. The following diagram compares the computational load of attention with recurrent and convolutional operations.

[Figure: per-layer complexity and number of sequential operations for self-attention, recurrent, and convolutional layers]

It can be seen that, compared with recurrent models, attention reduces the number of sequential operations to O(1), even though the per-layer complexity increases. This is the classic computer-science idea of trading space for time; thanks to improvements in the computational structure (such as adding constraints and sharing weights) and advances in hardware, the extra cost is not a significant problem.

Convolution is another typical model that needs no sequential operations, but its problem is that it relies on a 2D locality structure (which is why it is naturally suited to images): the number of layers needed to relate two distant positions still grows with the input length, on the order of $O(\log_k(n))$ for dilated convolutions. The advantage of attention is that, in the ideal case, it can relate any two positions in $O(1)$. In other words, we can already see here that attention has stronger potential than convolution.

The Transformer model is basically a simple stacking of attention modules. Since many articles have explained its structure, this article will not elaborate further. It has outperformed other models in machine translation and other fields, demonstrating its powerful potential. Understanding the Transformer means you have a preliminary grasp of the strength of attention, entering the realm of seeing the mountain and the peak.

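"Stacking attention modules" can be sketched with PyTorch's built-in encoder layers. The hyperparameters below are illustrative rather than the exact configuration from the paper, and `batch_first=True` assumes a reasonably recent PyTorch:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6   # illustrative sizes

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

tokens = torch.randn(2, 10, d_model)     # [batch, sequence length, features]
contextualised = encoder(tokens)         # same shape; every position attends to every other
```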

4. Seeing the Mountain and the Water—The BERT Series of Large-Scale Unsupervised Learning Pushes Attention to New Heights.

The introduction of BERT [5] pushed attention to a whole new level. BERT creatively proposed unsupervised pre-training on large-scale data followed by fine-tuning on the target dataset, using a unified model to solve a large number of different problems. BERT performs exceptionally well, achieving remarkable improvements on 11 natural language processing tasks: absolute gains of 7.7% on GLUE, 4.6% on MultiNLI, and 5.1 points on SQuAD v2.0.

BERT's approach is actually very simple: it fundamentally relies on large-scale pre-training, learning semantic information from large amounts of data and then transferring it to smaller datasets. BERT's contributions are mainly: (1) proposing a bidirectional pre-training method; (2) proving that a single unified model can solve different tasks, without designing a different network for each task; (3) achieving improvements on 11 natural language processing tasks.

Points (2) and (3) need little explanation, so let's focus on (1). The earlier OpenAI GPT inherited Attention Is All You Need and adopted unidirectional attention (the right side of the diagram), meaning each position can only attend to earlier content, while BERT (the left side of the diagram) adopts bidirectional attention. This simple design allowed BERT to surpass GPT significantly, a typical example in AI of a small design choice making a big difference.

[Figure: comparison of BERT (bidirectional attention) and GPT (unidirectional attention)]
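The difference between the two amounts to nothing more than the attention mask. Here is a small NumPy sketch, assuming the common additive-mask convention in which disallowed positions receive negative infinity before the softmax:

```python
import numpy as np

n = 5  # toy sequence length

# Bidirectional (BERT-style): every token may attend to every other token.
bidirectional_mask = np.zeros((n, n))

# Unidirectional (GPT-style): token i may attend only to positions <= i.
# Disallowed positions get -inf so that the softmax assigns them zero weight.
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)

# Either mask would be added to the Q K^T / sqrt(d_k) scores before the softmax.
```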

BERT proposes two simple unsupervised pre-training tasks. The first is the Masked LM: part of a sentence is masked out and the model predicts the masked tokens from the rest. The second is Next Sentence Prediction (NSP): given two sentences, the model predicts whether the second actually follows the first. This simple pre-training lets BERT capture basic semantic information and logical relationships, which helps it achieve extraordinary results on downstream tasks.
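A rough sketch of how Masked LM inputs can be built; the 15% masking rate matches the paper, while the tokenizer and BERT's 80/10/10 replacement rule are deliberately left out:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the cat sat on the mat and watched the dog".split()

mask_prob = 0.15                       # BERT masks roughly 15% of the tokens
is_masked = rng.random(len(tokens)) < mask_prob

inputs = [("[MASK]" if m else t) for t, m in zip(tokens, is_masked)]
targets = [(t if m else None) for t, m in zip(tokens, is_masked)]
# inputs:  the sentence with some tokens replaced by [MASK]
# targets: the original tokens at the masked positions (None elsewhere);
#          the model is trained to predict exactly these masked tokens.
```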

Understanding how BERT unified the NLP landscape brings us into the new realm of seeing the mountain and the water.

5. Water Turns to Mountain—In Computer Vision, Attention is an Effective Non-Local Information Fusion Technique.

Can the attention mechanism help in computer vision? Returning to our initial definition, attention itself is a weighting, and weighting means the ability to fuse different information. CNNs inherently have a flaw: each operation can only focus on local information near the convolution kernel and cannot fuse information from afar (non-local information). Attention can help weight and fuse information from afar, playing a supportive role. Networks based on this idea are called non-local neural networks [6].

[Figure: non-local dependencies in an image, such as between the ball and the person]

For example, the information of the ball in the image may be related to the person’s information; this is where attention comes into play.

The non-local operation proposed there is very similar to attention. Suppose $x_i$ and $x_j$ are the features at two positions of the image; the new feature $y_i$ is computed as:

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

Here $\mathcal{C}(x)$ is the normalization term, and the functions f and g can be chosen flexibly (note that the attention described earlier is in fact a special case of f and g). In the paper, f is taken to be a Gaussian relationship function and g a linear function. Adding the proposed non-local module to CNN baselines achieved SOTA results on multiple datasets.
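For concreteness, here is a NumPy sketch of the non-local operation with an embedded-Gaussian f and a linear g; the paper's final projection back to the original channel count and its residual connection are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x, w_theta, w_phi, w_g):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j) with f = exp(theta(x_i)^T phi(x_j)).

    x: [N, C] flattened spatial positions; w_*: [C, C'] linear embeddings.
    With this choice of f, the normalisation is exactly a softmax, so the block
    reduces to the attention formula from layer 2.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    weights = softmax(theta @ phi.T)     # [N, N] pairwise relations
    return weights @ g                   # every position aggregates all the others

# Toy feature map: a 7x7 spatial grid with 16 channels, flattened to 49 positions.
rng = np.random.default_rng(0)
x = rng.normal(size=(49, 16))
w_theta, w_phi, w_g = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
y = non_local(x, w_theta, w_phi, w_g)    # shape (49, 8)
```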

Subsequently, some literature has proposed other methods to combine CNNs and attention [7], all achieving improved effects. Seeing this, we have gained a new level of understanding of attention.

6. The Mountain is High and the Water is Deep—In Computer Vision, Attention Will Be All You Need.

In NLP, the transformer has already unified the landscape; can the transformer also dominate in computer vision? This idea is non-trivial because language is serialized one-dimensional information, while images are inherently two-dimensional information. CNNs are naturally suited for such two-dimensional information, while transformers are suited for one-dimensional information. The previous section has discussed many works that consider combining CNN and attention, but can a pure transformer network be designed for visual tasks?

Recently, more and more articles indicate that Transformers can adapt well to image data and are likely to achieve dominance in the field of vision.

The first visual Transformer came from Google and is called the Vision Transformer [8]. Its title is also interesting: an image is worth 16×16 words. The core idea of the paper is to split an image into patches of 16×16 pixels, treat each patch as a "word," feed the resulting sequence into a Transformer for encoding, and then use a simple small network for the downstream task, as shown in the diagram below.

[Figure: the Vision Transformer pipeline, an image split into patches and encoded by a Transformer]
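The "image to 16×16-pixel words" step itself is just a reshape. A small NumPy sketch, using the paper's typical 224×224 input; the linear projection, class token, and position embeddings are only mentioned in the comment:

```python
import numpy as np

# Split a 224x224 RGB image into 16x16 patches, i.e. (224/16)^2 = 196 "words".
image = np.random.rand(224, 224, 3)
P = 16

patches = image.reshape(224 // P, P, 224 // P, P, 3)                # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)   # (196, 768)

# Each 768-dimensional row is then linearly projected to the model width and,
# together with a class token and position embeddings, fed to a Transformer encoder.
```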

The Vision Transformer mainly applies Transformers to image classification. Can Transformers also be used for object detection? DETR (DEtection TRansformer), the model proposed by Facebook, gives a positive answer [9]. The DETR architecture is also very simple, as shown in the diagram below: the input is a set of image features extracted by a CNN, which pass through a Transformer encoder and decoder, producing a set of object features that a feed-forward network then regresses to bounding boxes (bbox) and classes (cls). For a more detailed introduction, please refer to @the Turing Machine's article: https://zhuanlan.zhihu.com/p/266069794

[Figure: the DETR architecture]
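Below is a rough PyTorch sketch of a DETR-flavoured pipeline under several simplifying assumptions: the backbone features are random stand-ins for a real CNN's output, positional encodings and the Hungarian matching loss are omitted, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

hidden_dim, num_queries, num_classes = 256, 100, 91      # illustrative sizes

backbone_features = torch.randn(1, hidden_dim, 32, 32)   # [B, C, H, W] from any CNN backbone
src = backbone_features.flatten(2).permute(2, 0, 1)      # [H*W, B, C] token sequence
object_queries = torch.randn(num_queries, 1, hidden_dim) # learned embeddings in the real model

transformer = nn.Transformer(d_model=hidden_dim, nhead=8,
                             num_encoder_layers=3, num_decoder_layers=3)
hs = transformer(src, object_queries)                    # [num_queries, B, hidden_dim]

class_head = nn.Linear(hidden_dim, num_classes + 1)      # +1 for the "no object" class
bbox_head = nn.Linear(hidden_dim, 4)

logits = class_head(hs)                                  # per-query class scores
boxes = bbox_head(hs).sigmoid()                          # normalised (cx, cy, w, h) boxes
```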

In other areas of computer vision, Transformers are also showing new vitality. The trend of Transformers replacing CNNs increasingly looks irreversible, which is to say that "Attention is all you need" will hold in computer vision as well. At this point, you will find that attention is like a high mountain and deep water: very profound.

7. Mountain Water Reincarnation—In Structured Data, Attention is a Powerful Tool to Assist GNN.

In the previous layers, we saw that attention works well on one-dimensional data (such as language) and two-dimensional data (such as images). Can it also perform well on more general structured data, such as graphs?

The earliest classic paper applying attention to graph structures is Graph Attention Networks (GAT; no, it cannot be abbreviated to GAN) [10]. The basic problem graph neural networks solve is: given a graph's structure and node features, obtain feature representations of the graph that work well for downstream tasks (such as node classification). Readers who have climbed this far should be able to see that attention is well suited to this kind of relational modeling.

The GAT network structure is not complicated, even if the mathematical formulas are a bit numerous. Just look at the diagram below.

[Figure: GAT network structure]

First, attention coefficients are computed between pairs of connected nodes, giving a set of weights; for example, the weight between nodes 1 and 2 is the $\alpha_{12}$ shown in the diagram (a LeakyReLU is applied inside the scoring function before the softmax). These weights are then used to take a weighted average of the neighbors' features, followed by a nonlinearity. Finally, the results of multiple attention heads are averaged or concatenated.
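A single attention head of GAT fits in a few lines of NumPy. The toy graph, feature sizes, and self-loop convention below are my own assumptions, and multi-head concatenation and the output nonlinearity are left out:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(h, adj, W, a):
    """One GAT attention head.

    h: [N, F] node features, adj: [N, N] adjacency (with self-loops),
    W: [F, F'] shared linear map, a: [2*F'] attention vector.
    """
    z = h @ W                                         # transform every node's features
    f_out = z.shape[1]
    left, right = z @ a[:f_out], z @ a[f_out:]        # split a^T [z_i || z_j] into two dot products
    e = leaky_relu(left[:, None] + right[None, :])    # raw score e_ij for every node pair
    e = np.where(adj > 0, e, -np.inf)                 # keep only the edges of the graph
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)         # softmax over each node's neighbours
    return alpha @ z                                  # weighted average of neighbour features

# Toy graph: 4 nodes in a ring plus self-loops, 5 input features, 3 output features.
rng = np.random.default_rng(0)
adj = np.eye(4) + np.roll(np.eye(4), 1, axis=1) + np.roll(np.eye(4), -1, axis=1)
h_new = gat_layer(rng.normal(size=(4, 5)), adj, rng.normal(size=(5, 3)), rng.normal(size=6))
```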

Understanding that GAT is really just a straightforward application of attention brings us to the seventh layer, Mountain Water Reincarnation.

8. There are Mountains in the Mountains—The Relationship Between Logical Interpretability and Attention

Although we have found attention to be very useful, how to deeply understand attention remains an open question in the research community. Indeed, what "deeply understand" even means is itself a new question. Consider when CNNs were proposed: LeNet appeared in 1998. We still have not fully understood CNNs, and attention has already updated our understanding once again.

I believe attention can be understood better than CNNs. Why? Simply put, attention's weighting naturally lends itself to visualization, and visualization is a powerful tool for understanding high-dimensional spaces.

Let me give two examples. The first is BERT in NLP: analysis papers show [11] that the learned features have very strong structural characteristics.

[Figure: structural patterns learned by BERT]

Another recent work, DINO from Facebook [12], shows the attention maps obtained from self-supervised training. Isn't it quite striking?

[Figure: attention maps from DINO's self-supervised training]

Up to now, the reader has reached a new realm, there are mountains in the mountains.

9. Mountain Water Unity—Various Variants of Attention and Their Internal Connections

Just as CNNs can build very powerful detection models or more advanced models, the most impressive aspect of attention is that it can serve as a basic module to build very complex (and sometimes convoluted) models.

Here are some variants of attention [13]. First is global attention and local attention.

[Figure: global attention vs. local attention]

Global attention is what was discussed above; local attention restricts the attention computation to a local window of features, an idea that the recently popular Swin Transformer has pushed forward considerably.

Video: The Soul Painter takes you through understanding the Swin Transformer that beats CNN in one minute.

https://www.zhihu.com/zvideo/1359837715438149632
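A minimal NumPy sketch of the windowed idea behind local attention: split the sequence into non-overlapping windows and attend only inside each one. Swin additionally works on 2D windows and shifts them between layers; both refinements are omitted here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_self_attention(x, window=4):
    """Attend only within non-overlapping windows of the sequence (n must divide by `window`)."""
    n, d = x.shape
    xw = x.reshape(n // window, window, d)               # [num_windows, window, d]
    scores = xw @ xw.transpose(0, 2, 1) / np.sqrt(d)     # attention scores within each window
    return (softmax(scores) @ xw).reshape(n, d)

tokens = np.random.default_rng(0).normal(size=(16, 8))
out = local_self_attention(tokens, window=4)             # cost grows with the window, not with 16
```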

Next is hard attention and soft attention.

[Figure: hard attention vs. soft attention]

Previously we mainly discussed soft attention. From a sampling perspective, however, we can instead consider hard attention: treat the attention weights as a probability distribution and draw samples from it (multinomial sampling). This can be useful in reinforcement-learning settings.
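A small NumPy sketch contrasting the two: soft attention takes a weighted average of the values, while hard attention samples a single index from the same distribution (the toy scores and values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([2.0, 0.5, 1.0, -1.0])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                        # attention weights as a probability distribution
values = np.arange(12, dtype=float).reshape(4, 3)

soft_output = weights @ values                  # soft attention: weighted average of the values
picked = rng.choice(len(weights), p=weights)    # hard attention: sample one index (multinomial)
hard_output = values[picked]                    # use only the sampled value

# The sampling step is not differentiable, which is why hard attention is usually
# trained with reinforcement-learning-style estimators such as REINFORCE.
```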

Recently there has been a wave of work suggesting that MLPs are also quite powerful [14]. In my view, these works take the attention model as a reference and use different structures to achieve a similar effect. Of course, it is also possible that attention will ultimately be surpassed by MLPs.

However, the concept of attention will never become outdated. As the most fundamental and powerful data relationship modeling module, attention will become a fundamental skill for every AI practitioner.

What will also never become outdated is the ability to understand and analyze data. A large number of models have been introduced above, but truly solving a concrete problem still relies on a thorough understanding of that problem's structure, a topic we can discuss at length another time.

At this point, we have reached the ninth layer, Mountain Water Unity. All methods return to one source: every model exists merely to help us understand the data more deeply.

Author: Top 1 answerer in the Zhihu graduate student section @Electric Light Phantom Alchemy

https://www.zhihu.com/people/zhao-ytc

References:

  • ^Sukhbaatar, Sainbayar, et al. “End-to-end memory networks.” arXiv preprint arXiv:1503.08895 (2015).

  • ^Vaswani, Ashish, et al. “Attention is all you need.” arXiv preprint arXiv:1706.03762 (2017).
  • ^http://nlp.seas.harvard.edu/2018/04/03/attention.html
  • ^https://jalammar.github.io/illustrated-gpt2/#part-2-illustrated-self-attention
  • ^Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  • ^Wang, Xiaolong, et al. “Non-local neural networks.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
  • ^Fu, Jun, et al. “Dual attention network for scene segmentation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
  • ^Dosovitskiy, Alexey, et al. “An image is worth 16×16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
  • ^Carion, Nicolas, et al. “End-to-end object detection with transformers.” European Conference on Computer Vision. Springer, 2020: 213-229.
  • ^Veličković, Petar, et al. “Graph attention networks.” arXiv preprint arXiv:1710.10903 (2017).
  • ^https://arxiv.org/abs/2002.12327
  • ^https://ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
  • ^https://towardsdatascience.com/attention-in-neural-networks-e66920838742
  • ^https://arxiv.org/abs/2105.02723
