Editor’s Recommendation
This article summarizes the main content of Teacher Li Hongyi’s introduction to various attention mechanisms in the Spring 2022 Machine Learning course, which also serves as a supplement to the 2021 course.
Reprinted from PaperWeekly
Reference video can be found at:
In the Transformer video of the 2021 course, Teacher Li covered part of the self-attention material in detail, but self-attention actually comes in many variants:
First, let's briefly review self-attention. Suppose the input sequence (of queries) has length N. To capture the relationship between every pair of tokens, each of the N queries needs N corresponding keys, and the dot product between queries and keys produces an N*N Attention Matrix. The biggest problem with this approach is that when the sequence is very long, the Attention Matrix becomes enormous and the computation becomes prohibitively expensive.
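As a reference point, here is a minimal NumPy sketch of the full self-attention just described; the single-head, unbatched setting and the tiny shapes are illustrative simplifications.

```python
import numpy as np

def full_self_attention(X, Wq, Wk, Wv):
    """Vanilla self-attention: builds the full N x N attention matrix.

    X  : (N, d_in)   input sequence of N tokens
    Wq : (d_in, d)   query projection
    Wk : (d_in, d)   key projection
    Wv : (d_in, d_v) value projection
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # (N, d), (N, d), (N, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (N, N) attention matrix -- the bottleneck
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)         # row-wise softmax
    return A @ V                               # (N, d_v) outputs

# The N x N matrix is what blows up for long sequences:
N, d_in, d, d_v = 6, 8, 4, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_in))
out = full_self_attention(X, rng.normal(size=(d_in, d)),
                          rng.normal(size=(d_in, d)),
                          rng.normal(size=(d_in, d_v)))
print(out.shape)  # (6, 4)
```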
For a Transformer, self-attention is only one module in a much larger network. From the analysis above, the cost of self-attention grows with the square of N. When N is small, speeding up self-attention alone does little for the overall efficiency of the network. Improving self-attention therefore pays off mainly when N is particularly large, for example in image processing, where the sequence of pixels or patches can be very long.
How can we speed up self-attention? From the analysis above, the main cost is computing the Attention Matrix. If, guided by human knowledge or experience, we can compute only some of its entries, or fill in some entries without computing them at all, we can in principle reduce the computation and improve efficiency.
For example, in text translation, translating the current token often does not require looking at the whole sequence; knowing just the neighboring tokens is enough for an accurate translation. This is local attention. It greatly improves efficiency, but the downside is that it only looks at local context, which makes it very similar to a CNN.
If local attention alone is not enough, another approach is to let the current token also attend to tokens a fixed interval (stride) away on its left and right, so it can relate to tokens further in the past and the future. The stride value can be chosen freely.
Another method is global attention: choose certain tokens in the sequence as special tokens (for example, punctuation marks), or add special tokens to the original sequence. The special tokens attend to the whole sequence and are attended to by every token, while the remaining token pairs compute no attention. For instance, two special tokens can be added at the beginning of the original sequence:
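All three hand-crafted patterns can be described as boolean masks over the N*N Attention Matrix (entries that are False are simply not computed, or set to -inf before the softmax). The sketch below builds such masks; the window size, stride, and number of special tokens are arbitrary illustrative choices.

```python
import numpy as np

def local_mask(N, window=1):
    """Attend only to tokens within `window` positions of the current one."""
    idx = np.arange(N)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def strided_mask(N, stride=2):
    """Attend to the token itself and to tokens exactly `stride` positions away."""
    idx = np.arange(N)
    diff = np.abs(idx[:, None] - idx[None, :])
    return (diff == 0) | (diff == stride)

def global_mask(N, n_special=2):
    """The first `n_special` tokens attend to everything and are attended to by everyone."""
    m = np.zeros((N, N), dtype=bool)
    m[:n_special, :] = True   # special tokens look at the whole sequence
    m[:, :n_special] = True   # every token looks at the special tokens
    return m

N = 8
print(local_mask(N).astype(int))
print(strided_mask(N).astype(int))
print(global_mask(N).astype(int))
```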
So which attention pattern is best? Only children make choices… In a multi-head network, some heads can do local attention while others do global attention, so there is no need to choose. See the following examples:
- Longformer combines the three attention patterns described above.
- Big Bird builds on Longformer and additionally places attention at randomly selected positions.
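As a rough illustration in the same boolean-mask view (parameters again arbitrary), combining the patterns is just an element-wise OR, and a Big-Bird-style pattern additionally switches on a few randomly chosen entries:

```python
import numpy as np

N, window, stride, n_special, n_random = 8, 1, 3, 1, 6
idx = np.arange(N)
diff = np.abs(idx[:, None] - idx[None, :])

local = diff <= window                     # local attention
strided = (diff == 0) | (diff == stride)   # strided attention
glob = np.zeros((N, N), dtype=bool)        # global attention via one special token
glob[:n_special, :] = True
glob[:, :n_special] = True

longformer_like = local | strided | glob   # union of the hand-crafted patterns

rng = np.random.default_rng(0)             # Big-Bird-style: add random positions
rand = np.zeros(N * N, dtype=bool)
rand[rng.choice(N * N, size=n_random, replace=False)] = True
bigbird_like = longformer_like | rand.reshape(N, N)

print(bigbird_like.astype(int))
```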
The methods above all rely on human-defined rules to decide where attention should and should not be computed. But is that the best we can do? Not necessarily. If certain entries of the Attention Matrix are very small, we can simply set them to 0 with little effect on the final prediction; in other words, we only need to find the entries that are relatively large. But how do we know which positions hold very small or very large values?
The following two papers propose a clustering scheme: first cluster the queries and keys; attention is computed only between queries and keys that belong to the same cluster, and all other pairs are skipped, which speeds up the computation of the Attention Matrix. In the example below there are 4 clusters: 1 (red box), 2 (purple box), 3 (green box), and 4 (yellow box). Both papers introduce methods for fast, rough clustering.
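The following is only a toy sketch of the idea, not the actual algorithm of either paper (which rely on fast, approximate clustering): cluster the queries and keys together, then compute attention only inside each cluster.

```python
import numpy as np

def clustered_attention(Q, K, V, n_clusters=4, n_iters=5, seed=0):
    """Toy clustered attention: naive k-means over the stacked queries/keys,
    then each query attends only to keys that fall into the same cluster."""
    rng = np.random.default_rng(seed)
    pts = np.concatenate([Q, K], axis=0)
    centers = pts[rng.choice(len(pts), n_clusters, replace=False)]
    for _ in range(n_iters):                      # a few naive k-means steps
        assign = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):
                centers[c] = pts[assign == c].mean(axis=0)
    q_cl, k_cl = assign[:len(Q)], assign[len(Q):]

    out = np.zeros((len(Q), V.shape[1]))
    for i in range(len(Q)):
        keys = np.where(k_cl == q_cl[i])[0]       # only same-cluster keys
        if len(keys) == 0:
            continue
        s = Q[i] @ K[keys].T / np.sqrt(Q.shape[1])
        w = np.exp(s - s.max()); w /= w.sum()
        out[i] = w @ V[keys]
    return out

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(clustered_attention(Q, K, V).shape)  # (16, 8)
```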
Can the decision of whether to compute attention be learned? Yes. We can train another network that takes the input sequence and outputs a weight sequence of the same length; concatenating all the weight sequences and transforming the result gives a matrix indicating where attention should and should not be computed. One detail is that different inputs may produce the same weight sequence through this network, which greatly reduces the computational load.
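A minimal sketch of this idea, with heavy simplifications that are my own assumptions (a single random linear map stands in for the learned network, and a hard threshold stands in for whatever differentiable binarization is actually used):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 8
X = rng.normal(size=(N, d))          # input sequence, one token per row
W = rng.normal(size=(d, N)) * 0.1    # a tiny "network": one linear map per token

scores = X @ W                        # each token emits a length-N weight vector
mask = scores > 0                     # binarize: True = compute attention here
print(mask.astype(int))               # (N, N) pattern of where to attend
```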
Everything above concerns the N*N matrix, but in practice this matrix is usually not full rank, which means we can reduce its size: remove the redundant columns and work with a much smaller matrix.
Specifically, select K representative keys from the N keys, each with a corresponding value, compute the dot products between the queries and these K keys, and then take the attention-weighted sum of the K values to produce the outputs.
Why pick representative keys rather than representative queries? Because each query corresponds to one output; reducing the number of queries would shorten the output sequence and lose information.
How do we pick the representative keys? Two methods: one is to run a convolution (conv) over the keys, the other is to multiply the key matrix by a projection matrix.
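A minimal sketch of this key/value-reduction idea, using the projection-matrix option; the random projection here stands in for a learned one, and the sizes are illustrative.

```python
import numpy as np

def reduced_key_attention(Q, K, V, n_rep=4, seed=0):
    """Compress N keys/values down to n_rep representatives with an N -> n_rep
    projection (random here, learned in practice), then run ordinary attention
    of the N queries against only the n_rep representatives."""
    rng = np.random.default_rng(seed)
    N = K.shape[0]
    P = rng.normal(size=(n_rep, N)) / np.sqrt(N)   # projection that mixes keys
    K_rep, V_rep = P @ K, P @ V                    # (n_rep, d), (n_rep, d_v)

    scores = Q @ K_rep.T / np.sqrt(Q.shape[1])     # (N, n_rep) -- no longer N x N
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V_rep                               # (N, d_v): one output per query

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(reduced_key_attention(Q, K, V).shape)  # (16, 8)
```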
Let’s review the computation process of the attention mechanism, where I is the input matrix and O is the output matrix.
First, ignore softmax, and it can be expressed in the following form:
The above process can be accelerated: computing V*K^T first and then multiplying by Q yields exactly the same result as computing K^T*Q first and then multiplying by V, but with far less computation.
Appendix: Linear algebra explanation regarding this part
Continuing the example above: computing A = K^T*Q takes N*d*N multiplications, and V*A then takes d'*N*N multiplications, so the total is (d+d')*N^2.

If we compute V*K^T, it will execute d’*N*d multiplications. Then multiplying by Q will execute d’*d*N multiplications, so the total computational load is 2*d’*d*N.
Since N is typically much larger than d and d', (d+d')*N^2 >> 2*d'*d*N, so simply changing the order of the matrix multiplications greatly improves efficiency.
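The reordering is just matrix-multiplication associativity. The NumPy check below confirms the two orders give the same output and compares the multiplication counts for some illustrative sizes (queries, keys, and values stored as columns, matching the formulas above):

```python
import numpy as np

N, d, d_out = 1000, 64, 64           # illustrative sizes
rng = np.random.default_rng(0)
Q = rng.normal(size=(d, N))           # queries as columns
K = rng.normal(size=(d, N))
V = rng.normal(size=(d_out, N))

O1 = V @ (K.T @ Q)                    # (d + d_out) * N^2 multiplications
O2 = (V @ K.T) @ Q                    # 2 * d_out * d * N multiplications
print(np.allclose(O1, O2))            # True: same result, different cost

print("order 1:", (d + d_out) * N**2)   # 128,000,000
print("order 2:", 2 * d_out * d * N)    # 8,192,000
```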
Now let’s bring back softmax. The original self-attention looks like this, taking b1 as an example:
If we can write exp(q*k) (at least approximately) as the product of two mappings, exp(q*k) ≈ φ(q)*φ(k), we can further simplify the formula above:
▲ Simplification of the denominator
▲ Simplification of the numerator
Treating the terms inside the parentheses as vectors, the M such vectors (one per dimension of φ) form a matrix which, when multiplied by φ(q1), yields the numerator.
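Putting the two simplifications together, a compact reconstruction of the derivation (assuming φ maps each vector to M dimensions) is:

```latex
% Original attention output for position 1:
%   b_1 = \sum_i \frac{\exp(q_1 \cdot k_i)}{\sum_j \exp(q_1 \cdot k_j)}\, v_i
% If exp(q \cdot k) can be written (approximately) as \phi(q) \cdot \phi(k):
\[
b_1 \;\approx\;
\frac{\sum_i \bigl(\phi(q_1)\cdot\phi(k_i)\bigr)\, v_i}
     {\sum_j \phi(q_1)\cdot\phi(k_j)}
\;=\;
\frac{\Bigl(\sum_i v_i\,\phi(k_i)^{\top}\Bigr)\,\phi(q_1)}
     {\phi(q_1)\cdot \sum_j \phi(k_j)}
\]
% The sums over i and j do not depend on q_1, so they are computed once
% and reused for b_2, b_3, \dots
```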
Graphically represented as follows:
From the above, the blue and yellow vectors do not depend on the index 1 in b1, i.e., they do not involve q1. So when we compute b2, b3, …, the blue and yellow vectors do not need to be recomputed.
Self-attention can thus be viewed in another way. This computation produces almost exactly the same result as the original self-attention but with far less work. Simply put: first find a transformation φ(·); apply it to the keys and combine them with the values to obtain M template vectors; then apply φ to the query and combine it with those M templates. The M templates only need to be computed once.
b1 is computed as follows:
b2 is computed as follows:
This can be understood as follows: combining φ(k) with v produces a set of templates, and φ(q) then determines which templates matter most; the matrix operations that combine them give the output b.
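A minimal NumPy sketch of this template view. The feature map φ used here, elu(x)+1, is one choice taken from the linear-attention literature; as the next paragraph notes, different papers pick different φ.

```python
import numpy as np

def phi(x):
    """One possible feature map, elu(x) + 1: strictly positive, so the denominator is safe."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernel-style attention: cost grows linearly with sequence length N.

    Q, K: (N, d)    V: (N, d_v)
    """
    Qf, Kf = phi(Q), phi(K)          # (N, M) with M = d here
    templates = Kf.T @ V             # (M, d_v): computed once, shared by every output
    normalizer = Kf.sum(axis=0)      # (M,)   : denominator summary, also computed once
    num = Qf @ templates             # (N, d_v)
    den = Qf @ normalizer            # (N,)
    return num / den[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(linear_attention(Q, K, V).shape)  # (16, 8)
```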
So how should φ be chosen? Different papers propose different choices:
Do we even need q and k to compute self-attention? Not necessarily. In the Synthesizer paper, the attention matrix is not computed from q and k; instead it is learned directly as network parameters. Even though every input sequence then shares the same attention weights, performance does not degrade much. This raises a deeper question: what is the real value of attention?
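A minimal sketch of the parameter-only variant of this idea (often called the Random Synthesizer); the fixed sequence length and single head are simplifying assumptions:

```python
import numpy as np

class RandomSynthesizerLayer:
    """Attention weights are a trainable N x N matrix, not computed from q and k."""

    def __init__(self, N, d_in, d_v, seed=0):
        rng = np.random.default_rng(seed)
        self.attn_logits = rng.normal(size=(N, N)) * 0.02   # learned parameters
        self.Wv = rng.normal(size=(d_in, d_v)) * 0.02

    def __call__(self, X):                       # X: (N, d_in)
        A = np.exp(self.attn_logits - self.attn_logits.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)       # same weights for every input sequence
        return A @ (X @ self.Wv)                 # (N, d_v)

layer = RandomSynthesizerLayer(N=6, d_in=8, d_v=8)
X = np.random.default_rng(1).normal(size=(6, 8))
print(layer(X).shape)  # (6, 8)
```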
Do we need attention at all to process sequences? Can we drop it entirely? Are there attention-free methods? Below are several approaches that use MLPs in place of attention for sequence processing.
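As one illustration of the attention-free direction (an MLP-Mixer-style token-mixing step; the layer sizes are arbitrary assumptions), positions can be mixed by a plain MLP applied along the sequence dimension:

```python
import numpy as np

def token_mixing_mlp(X, W1, W2):
    """Mix information across positions with an MLP instead of attention.

    X: (N, d). The MLP acts on each feature channel's length-N profile.
    """
    H = np.maximum(X.T @ W1, 0.0)   # (d, hidden): mix over positions, per channel
    return (H @ W2).T               # back to (N, d)

rng = np.random.default_rng(0)
N, d, hidden = 6, 8, 16
X = rng.normal(size=(N, d))
out = token_mixing_mlp(X, rng.normal(size=(N, hidden)) * 0.1,
                       rng.normal(size=(hidden, N)) * 0.1)
print(out.shape)  # (6, 8)
```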
The last slide summarizes all the methods discussed today. In the figure below, the vertical axis is the LRA (Long Range Arena) score, where higher means better performance; the horizontal axis is how many sequences can be processed per second, where further right means faster; and larger circles mean more memory used (a heavier computational load).
That concludes Teacher Li’s summary of “Various Amazing Self-Attention Mechanisms” for this class.