This article is written by Electric Light Phantom Alchemist: top-ranked graduate student in Computer Science at Shanghai Jiao Tong University, first prize in the high school physics competition, meme master, winner of a national computer science award at Shanghai Jiao Tong University, currently a PhD student at CUHK. Original post: https://zhuanlan.zhihu.com/p/362366192
Attention has become a hot topic across the entire AI field: whether in computer vision or natural language processing, it is hard to get away from Attention, the Transformer, or BERT. Below, following the style of the Nine Layers of EM, I propose the Nine Layers of Attention, and I hope to discuss them with everyone. If you have better ideas, feel free to share them in the comments.
The Nine Layers of Attention – the nine realms of understanding Attention – are as follows:
Seeing Mountains as Mountains – Attention is a type of attention mechanism
Seeing Mountains and Stones – Mathematically, Attention is a widely used weighted average
Seeing Mountains and Peaks – In natural language processing, Attention is all you need
Seeing Mountains and Water – The BERT series large-scale unsupervised learning has pushed Attention to new heights
Water Turns Back to Mountains – In computer vision, Attention is an effective non-local information fusion technology
Mountains High and Water Deep – In computer vision, Attention will be all you need
Mountains and Water Cycle – In structured data, Attention is a powerful tool to assist GNN
There Are Mountains Within Mountains – The relationship between logical interpretability and Attention
Mountains and Water Unite – Various variants of Attention and their internal connections
01
Attention is a type of attention mechanism
As the name suggests, attention is essentially the application of biological attention mechanisms to artificial intelligence. What is the benefit of an attention mechanism? Simply put, it can focus on the features that the target scenario actually needs. For example, given a series of features $f_1, f_2, \dots, f_n$, the target scenario may only need two of them, say $f_1$ and $f_2$; attention can then effectively "notice" these two features while ignoring the others. Attention first appeared in the context of recurrent neural networks[1]; Sukhbaatar et al. give the following example. In the figure above, we need to combine sentences (1) to (4) to give the correct answer A to question Q. Q has no direct connection with sentence (3), yet we need to derive the correct answer, bedroom, from (3). A natural idea is to guide the model's attention: starting from the question, search the four sentences for clues to answer it.
As shown in the figure, the word "apple" in the question leads us to the fourth sentence, and attention then shifts to the third sentence, from which we determine the answer, bedroom.
At this point, we should have grasped the earliest understanding of attention, reaching the first layer – Seeing Mountains as Mountains.
Now our question is, how to design such a model to achieve this effect?
The earliest implementations were based on explicit memory: the result of each step is stored, and the transfer of attention is implemented "by hand".
Still using the previous example,
As shown in the figure, attention is realized through operations on the memory and updates to the attended items. This method is simple but very hand-crafted; it has gradually been phased out, and we need to upgrade our understanding to a more abstract level.
02
Attention is a weighted average
The classic definition of Attention comes from the groundbreaking paper Attention Is All You Need[2]. Although some earlier works discovered similar techniques (such as self-attention), this paper earned supreme glory for proposing the bold, and gradually validated, assertion that "attention is all you need." The classic definition is the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
The meaning of the formula will be explained below, but first let's talk about its significance: this is basically the classic formula that researchers have kept encountering over the past five years. Its status in natural language processing is akin to Newton's laws in classical mechanics; it has become the basic formula for constructing complex models.
This formula seems complex, but once understood it turns out to be very simple and fundamental. Let's discuss the meaning of each letter. Literally: Q stands for query, K stands for key, V stands for value, and $d_k$ is the dimension of K. At this point someone might ask: what exactly are query, key, and value? Because these three concepts were introduced in this very paper, whatever is placed in the Q position is called the query, whatever is placed in the K position is called the key, and whatever is placed in the V position is called the value; this is really the best interpretation. In other words, the formula, much like Newton's laws, serves as a definition.
To facilitate understanding, let me provide a few examples to explain these three concepts.
1. 【Search Field】 When you search for videos on Bilibili, the keys are the keywords in Bilibili's database (such as dance, meme, Ma Baoguo, etc.), the query is the keyword sequence you type in, such as Ma Baoguo or meme, and the values are the videos you get back.

2. 【Recommendation System】 When you shop on Taobao, the keys are all the product information in Taobao's database, the query is the product information you are currently interested in, such as high heels or tight pants, and the values are the products pushed to you.

The two examples above are quite concrete; in AI applications, key, query, and value are usually latent features, so their meanings are often not so apparent. What we need to grasp is the computational structure.

Returning to the formula itself, it essentially describes a weighted average based on a relationship matrix. The relationship matrix is $QK^{\top}/\sqrt{d_k}$; softmax normalizes it into a probability distribution, and V is then re-weighted according to this distribution, yielding the new attention output.

The figure below illustrates the concrete meaning of attention in NLP. Consider the feature of the word "it": its new feature is a weighted combination of the features of the other words. For example, "the animal" is closely related to "it" (because "it" refers to the animal), so its weight is high, and it strongly influences the next layer's feature for "it". More interesting material can be found in The Annotated Transformer[3] and the illustrated self-attention post[4].

At this point one roughly understands the basic attention module, reaching the second layer, seeing mountains and stones.
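Before moving on, here is a minimal PyTorch sketch of the formula above, to make the "weighted average over a relationship matrix" reading concrete. The shapes, the function name, and the self-attention call at the end are illustrative, not tied to any particular model.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Weighted average of V, with weights softmax(QK^T / sqrt(d_k))."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # the relationship matrix
    weights = F.softmax(scores, dim=-1)             # normalize to a probability distribution
    return weights @ V, weights                     # re-weighted values and the attention map

# toy example: 5 "words" with 64-dimensional features, self-attention (Q = K = V = x)
x = torch.randn(5, 64)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)   # torch.Size([5, 64]) torch.Size([5, 5])
```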
03
In Natural Language Processing
Attention Is All You Need
The importance of the paper Attention Is All You Need lies not only in proposing the concept of attention but, more importantly, in introducing the Transformer, a structure built entirely on attention. "Entirely on attention" means no recurrent and no convolutional operations: everything is attention. The table below (from the paper) compares the computational complexity of attention with that of recurrent and convolutional operations.

Compared with recurrence, attention reduces the number of required sequential operations to O(1), although the per-layer complexity increases. This is the classic computer-science trade of space for time; thanks to improvements in the computational structure (such as added constraints and weight sharing) and advances in hardware, this extra space is not a real concern.

Convolution is another model that needs no sequential operations, but it relies on a 2D structure (and is therefore naturally suited to images), and connecting distant positions still requires a number of layers that grows logarithmically with the input length, i.e. $O(\log_k(n))$. Attention, in contrast, can in the ideal case bring this down to O(1). Here we can already see that attention has greater potential than convolution.

The Transformer model itself is essentially a simple stacking of attention modules. Since many articles have already explained its structure, this one will not elaborate on it. It has outperformed other models in fields such as machine translation, demonstrating its powerful potential. Understanding the Transformer gives us a preliminary grasp of the power of attention, entering the realm of seeing mountains and peaks.
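For readers who want a concrete picture of "a simple stacking of attention modules", here is one encoder block sketched in PyTorch under simplifying assumptions (no positional encoding, no dropout, no masking; the dimensions are toy values). A full Transformer encoder stacks several of these.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One attention + feed-forward block; a Transformer encoder stacks several of these."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)              # self-attention: Q = K = V = x
        x = self.norm1(x + a)                  # residual connection + layer norm
        return self.norm2(x + self.ff(x))      # feed-forward + residual + layer norm

x = torch.randn(2, 10, 64)                     # every position attends to every other in one step
y = EncoderBlock()(x)
print(y.shape)                                 # torch.Size([2, 10, 64])
```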
04
Seeing Mountains and Water
The BERT series of large-scale unsupervised learning has pushed Attention to new heights

The introduction of BERT[5] pushed attention to a whole new level. BERT creatively proposed unsupervised pre-training on large corpora followed by fine-tuning on the target dataset, using one unified model to solve a large number of different problems. BERT performed exceptionally well, achieving remarkable improvements across 11 natural language processing tasks: 7.7% on GLUE, 4.6% on MultiNLI, and 5.1% on SQuAD v2.0.

BERT's approach is actually quite simple; at its core it is large-scale pre-training: it learns semantic information from large corpora and then brings that semantic information to smaller downstream datasets. BERT's contributions are: (1) proposing a bidirectional pre-training approach; (2) demonstrating that a single unified model can solve different tasks, without designing a different network for each task; (3) achieving improvements across 11 natural language processing tasks.

(2) and (3) need no further explanation, so let me explain (1). The earlier OpenAI GPT inherited the "attention is all you need" line and used unidirectional attention (right image below), meaning each output can only attend to previous content, whereas BERT (left image below) uses bidirectional attention. This simple design choice allows BERT to clearly outperform GPT, a typical example in AI of a small design making a big difference.

Comparison of BERT and GPT

BERT proposed several simple unsupervised pre-training tasks. The first is the Masked LM, which masks part of a sentence and predicts the masked part from the rest. The second is Next Sentence Prediction (NSP), which predicts whether one sentence actually follows another. This simple pre-training lets BERT capture basic semantic information and logical relationships, helping it achieve extraordinary results on downstream tasks.

Understanding how BERT unified the NLP landscape brings us to the new realm of seeing mountains and water.
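To show what the Masked LM task amounts to, here is a deliberately simplified sketch: real BERT replaces 80% of the selected tokens with [MASK], 10% with random tokens, and leaves 10% unchanged, which is omitted here; mask_id is simply whatever id the tokenizer assigns to [MASK].

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Hide ~15% of tokens; the model is trained to predict them (Masked LM)."""
    labels = token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob   # choose positions to predict
    labels[~masked] = -100                             # ignore un-masked positions in the loss
    inputs = token_ids.clone()
    inputs[masked] = mask_id                           # replace chosen tokens with [MASK]
    return inputs, labels

token_ids = torch.randint(0, 1000, (2, 16))            # toy batch of token ids
inputs, labels = mask_tokens(token_ids, mask_id=103)   # 103: illustrative [MASK] id
```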
05
Water Turns Back to Mountains

In Computer Vision, Attention is an effective non-local information fusion technology

Can the attention mechanism help in computer vision? Going back to our original definition, attention itself is a weighting process, which means it can fuse different pieces of information. CNNs have a limitation: each operation can only look at local information (nearby pixels) and cannot fuse distant, non-local information. Attention can weight and fuse distant information, playing a supporting role. Networks built on this idea are called non-local neural networks[6].

For example, the information about the ball may be related to the information about the person, and this is where attention comes into play.

The proposed non-local operation is very similar to attention. Given image features $x_i$ and $x_j$ at two positions, the new feature $y_i$ is computed as

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{j} f(x_i, x_j)\, g(x_j)$$

where $\mathcal{C}(x)$ is a normalization term, and the functions f and g can be chosen flexibly (note that the attention described earlier is a special case of particular choices of f and g). In the paper, f is a Gaussian relationship function and g is a linear function. Adding the proposed non-local module to CNN baselines achieved SOTA results on multiple datasets.

Subsequently, several papers proposed other ways of combining CNNs and attention[7], all with improved results. Seeing this, we gain another level of understanding of attention.
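Before moving on, here is a minimal sketch of a non-local block following the formula above, with an embedded-Gaussian choice of f and a linear g; the 1×1 convolutions, channel reduction, and residual connection follow the common recipe, but the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """y_i = (1 / C(x)) * sum_j f(x_i, x_j) g(x_j), with embedded-Gaussian f and linear g."""
    def __init__(self, channels):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // 2, 1)   # embeddings used inside f
        self.phi = nn.Conv2d(channels, channels // 2, 1)
        self.g = nn.Conv2d(channels, channels // 2, 1)       # linear g
        self.out = nn.Conv2d(channels // 2, channels, 1)

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)         # (B, HW, C/2)
        k = self.phi(x).flatten(2)                           # (B, C/2, HW)
        v = self.g(x).flatten(2).transpose(1, 2)             # (B, HW, C/2)
        f = torch.bmm(q, k)                                  # pairwise relations between all positions
        y = torch.bmm(F.softmax(f, dim=-1), v)               # softmax plays the role of C(x)
        y = y.transpose(1, 2).reshape(B, C // 2, H, W)
        return x + self.out(y)                               # residual connection

x = torch.randn(1, 32, 14, 14)
print(NonLocalBlock(32)(x).shape)                            # torch.Size([1, 32, 14, 14])
```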
06
Mountains High and Water Deep

In Computer Vision, Attention will be all you need

In NLP, Transformers have come to dominate the field, but can they also dominate computer vision? This question is non-trivial, because language is serialized one-dimensional information while images are inherently two-dimensional. CNNs are naturally suited to two-dimensional data like images, while Transformers were designed for one-dimensional data like language. The previous section discussed many works that combine CNNs with attention; the question now is whether a pure Transformer network can be designed for vision tasks.

Recently, a growing number of papers indicate that Transformers adapt well to image data and are expected to take a dominant position in vision.

The first vision Transformer came from Google and is called the Vision Transformer (ViT)[8]. The paper's title is also fun: "An Image is Worth 16x16 Words". Its core idea is to split an image into 16×16-pixel patches, treat the patches as words, feed them into a Transformer encoder, and attach a simple small head for the downstream task, as shown in the figure below.

The Vision Transformer mainly applies Transformers to image classification. Can Transformers also be used for object detection? DETR (Detection Transformer), proposed by Facebook, gives a positive answer[9]. The DETR architecture is also very simple, as shown in the figure below: the input is a set of extracted image features, which pass through a Transformer encoder and decoder to output a set of object features; a feed-forward network then regresses each object feature to a bounding box and a class.

More details can be found in @Tuo Fei Lun's article, A New Paradigm in Computer Vision: Transformer (https://zhuanlan.zhihu.com/p/266069794).

In other areas of computer vision, Transformers are also showing new vitality. The trend of Transformers replacing CNNs now looks inevitable, meaning that "Attention is all you need" will also hold true in computer vision. At this point, you will find that attention is deep and profound: mountains high and water deep.
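A short sketch of the "image as 16×16-pixel words" idea: split the image into non-overlapping patches, flatten and linearly embed each one, and feed the resulting token sequence to a Transformer encoder. The real ViT also prepends a class token and adds positional embeddings, which are omitted here; the embedding dimension is a toy value.

```python
import torch
import torch.nn as nn

# toy settings: one 224x224 RGB image, 16x16 patches -> 196 patch "words"
patch, d_model = 16, 64
img = torch.randn(1, 3, 224, 224)

# split the image into non-overlapping 16x16 patches and flatten each one
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)                   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)  # (1, 196, 768)

# linearly embed each patch; this token sequence is what a Transformer encoder consumes
tokens = nn.Linear(3 * patch * patch, d_model)(patches)                         # (1, 196, 64)
print(tokens.shape)
```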
07
Mountains and Water Cycle
In Structured Data
Attention is a powerful tool to assist GNN
In the previous layers we have seen that attention performs well on one-dimensional data (like language) and two-dimensional data (like images); can it also excel on structured data like graphs? The classic paper that first applied attention to graph structures is Graph Attention Networks (GAT; no, it cannot be called GAN)[10]. The basic problem addressed by graph neural networks is: given the structure of a graph and the features of its nodes, how do we obtain feature representations that work well for downstream tasks (like node classification)? Readers who have climbed to the seventh layer should realize that attention is well suited to this kind of relational modeling.

The GAT network structure is not complex, even if its mathematical formulas look a bit dense. Let's look directly at the diagram below.

GAT Network Structure

For every pair of connected nodes, an attention score is computed to obtain a set of weights; for example, $\alpha_{12}$ denotes the weight between nodes 1 and 2 (a LeakyReLU is applied to the raw scores before the softmax). These weights are then used to compute a weighted average of neighbour features, and finally the outputs of multiple heads are averaged or concatenated.

Once you understand that GAT is essentially a straightforward application of attention, you have reached the seventh layer, the cycle of mountains and water.
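To see how small the gap is between GAT and plain attention, here is a single-head sketch. It uses a dense adjacency mask for clarity (real implementations work on sparse edge lists), and the multi-head averaging or concatenation and the output nonlinearity are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention: weights alpha_ij are computed between connected nodes only."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # scores a pair of node features

    def forward(self, h, adj):                           # h: (N, in_dim), adj: (N, N) 0/1 mask
        z = self.W(h)                                    # (N, out_dim)
        N = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(N, N, -1),
                           z.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))      # raw attention scores e_ij
        e = e.masked_fill(adj == 0, float('-inf'))       # only attend along edges
        alpha = torch.softmax(e, dim=-1)                 # alpha_ij: weights over node i's neighbours
        return alpha @ z                                 # weighted average of neighbour features

h = torch.randn(4, 8)                                    # 4 nodes, 8-dimensional features
adj = torch.eye(4) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                                   [0, 1, 0, 1], [0, 0, 1, 0.]])   # edges + self-loops
print(GATLayer(8, 16)(h, adj).shape)                     # torch.Size([4, 16])
```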
08
There Are Mountains Within Mountains

The Relationship Between Logical Interpretability and Attention

Although we have found attention to be very useful, how to understand it deeply remains an open question in the research community; indeed, what counts as a "deep understanding" is itself a new question. Consider this: when was the CNN proposed? LeNet was proposed in 1998, and we still do not fully understand CNNs; attention is even newer to us.

I believe attention can be understood even better than CNNs. Why? Simply put, because attention is a weighting process, it is inherently visualizable, and visualization is a powerful tool for understanding high-dimensional spaces.

Let me give two examples. The first comes from BERT in NLP, where analysis shows[11] that the learned features have very strong structural characteristics. The second is a recent work from Facebook, DINO[12], which shows the attention maps obtained from unsupervised training. Aren't they astonishing?

With this, the reader has reached a new realm, mountains within mountains.
09
Mountains and Water Unite

Various Variants of Attention and Their Internal Connections

Just as CNNs can be used to build very powerful detection models and beyond, the most remarkable aspect of attention is that it can serve as a basic module for constructing very complex (and sometimes convoluted) models.

Here I will briefly list some variants of attention[13]. First, global attention versus local attention: global attention is what we discussed earlier, while local attention fuses certain features within a restricted region before attention is applied more widely. The recently popular Swin Transformer can be seen as a major development of this variant.

Next, hard attention versus soft attention. What we discussed earlier was mainly soft attention. From a sampling perspective, however, we can consider hard attention: treat the weights as a probability distribution and draw samples from it by multinomial sampling. This has uses in reinforcement learning; a minimal sketch of the two is given at the end of this section.

Recently, a number of works have argued that MLPs are surprisingly powerful[14]. In my view they also draw on the attention model, using a different structure to achieve a similar effect. It is of course possible that attention will eventually be overshadowed by MLPs.

What will never go out of style, however, is the concept of attention. As the simplest and most powerful basic module for modeling relationships in data, attention is bound to become a fundamental skill for every AI practitioner. What will also never fade is the ability to understand and analyze data: the many models introduced above can help us solve specific problems, but only if they grow out of a sufficient understanding of the problem's structure. We can discuss this topic at leisure another time.

At this point we have reached the ninth layer, the realm of mountains and water uniting. All models return to the same source; they merely help us understand data more deeply.
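As promised above, a minimal sketch of the soft/hard distinction under toy shapes: soft attention is the differentiable weighted average used throughout this article, while hard attention samples an index from the same distribution (and is typically trained with reinforcement-learning style tricks).

```python
import torch
import torch.nn.functional as F

scores = torch.randn(5)                     # unnormalized attention scores over 5 features
values = torch.randn(5, 8)                  # the 5 feature vectors being attended over

# soft attention: a differentiable weighted average (what the earlier sections used)
weights = F.softmax(scores, dim=-1)
soft_out = weights @ values

# hard attention: treat the weights as a distribution and sample a single feature
idx = torch.multinomial(weights, num_samples=1)
hard_out = values[idx.item()]               # non-differentiable selection
```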