Is CNN a Type of Local Self-Attention?

Author: Hohou
https://www.zhihu.com/question/448924025/answer/1791134786
(This answer references: Li Hongyi’s 2021 Machine Learning Course)
CNN is not a type of local attention. Let's analyze what CNN and attention each do.

1: CNN can be understood as a fully connected layer with local connectivity and weight sharing. Thus, CNN has two features that fundamentally distinguish it from a fully connected layer: weight sharing and local connectivity. This greatly reduces the number of parameters while ensuring that the essential features are not lost.
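A minimal PyTorch sketch of this point (the 32x32 input size and the 3x3 kernel are just illustrative choices, not from the original answer): a convolution with weight sharing and local connectivity needs far fewer parameters than a fully connected layer mapping the same input to a same-size output.

```python
import torch.nn as nn

# Hypothetical sizes: a 32x32 single-channel input mapped to a same-size output.
fc = nn.Linear(32 * 32, 32 * 32)                    # fully connected: every output sees every input
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)    # convolution: local connectivity + weight sharing

n_fc = sum(p.numel() for p in fc.parameters())      # 1024*1024 weights + 1024 biases = 1,049,600
n_conv = sum(p.numel() for p in conv.parameters())  # one shared 3x3 kernel + 1 bias = 10

print(n_fc, n_conv)
```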

2: The steps of attention: the attention matrix is obtained from the dot product of Q and K and represents their similarity; the more similar Q and K are, the larger the dot product. After scaling and applying softmax, we obtain the attention scores, which are then multiplied by V to obtain the result after applying attention.

The essence of attention is to calculate the similarity between Q and K, emphasizing the parts of Q that are similar to K.
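As a minimal sketch of these steps (assuming PyTorch; the token count and dimensions are arbitrary):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_tokens, d) tensors
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # dot product: similarity of each query with each key
    weights = F.softmax(scores, dim=-1)          # attention scores, summing to 1 over the keys
    return weights @ V                           # weighted sum of values

# Toy usage: a 5-token sequence with random projection matrices
x = torch.randn(5, 16)
Wq, Wk, Wv = (torch.randn(16, 16) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # torch.Size([5, 16])
```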

The fundamental difference can also be understood as CNN extracting features while attention emphasizes features. The two are quite different; saying that all models are "doing something with features" and are therefore the same is clearly unreasonable.

Update: since my understanding of CNN has changed, I am updating my previous answer, but I will not delete the earlier response. I will add further supplements and explanations on this topic in the future; everyone is welcome to bookmark it.

Regarding the difference between RNN and self-attention, you can check my answer; I hope it helps.
https://zhuanlan.zhihu.com/p/360374591

First, to conclude, CNN can be seen as a simplified version of Self-Attention, or Self-Attention can be viewed as a generalization of CNN.


Previously, when comparing CNN and self-attention, we subconsciously thought of CNN as something used for image processing and self-attention as something used for NLP, which created the illusion that the two methods are unrelated. For a better comparison, let us now discuss the differences and connections between CNN and self-attention from the perspective of image processing. The process of using self-attention for image processing is as follows:

First, if we use self-attention to process an image, each pixel (pixel value) generates a Query, while the other pixels generate Keys. We compute the dot product of the Query with each Key and apply softmax (as mentioned in the comments, it does not have to be softmax; other activation functions such as ReLU can also be used). Finally, each pixel's output is obtained by multiplying the softmax weights with the corresponding Values and summing. Notice that when performing self-attention on an image, each pixel considers every pixel in the entire image, taking all of the image's information into account.
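A toy sketch of this (assuming PyTorch; the 8x8 image and projection sizes are made up): every pixel becomes a token, and the resulting attention map compares each pixel with every other pixel in the image.

```python
import torch
import torch.nn.functional as F

# Toy image: 8x8 pixels, 3 channels; every pixel becomes one token of dimension 3.
H, W, C, d = 8, 8, 3, 16
img = torch.randn(H, W, C)
tokens = img.reshape(H * W, C)

Wq, Wk, Wv = (torch.randn(C, d) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / d ** 0.5          # each pixel is compared with all 64 pixels of the image
weights = F.softmax(scores, dim=-1)
out = weights @ V                    # every output mixes information from the whole image
print(weights.shape, out.shape)      # (64, 64) attention map, (64, 16) pixel outputs
```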

Now, if we use CNN to process an image, we select different convolution kernels to process it. Each pixel (pixel value) only needs to consider the other pixels within the convolution kernel; that is, it only needs to consider its receptive field, not all of the information in the entire image.
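For contrast, the same kind of toy image processed by a single convolution (again just an illustrative PyTorch sketch): each output value is computed only from the neighbourhood fixed by kernel_size.

```python
import torch
import torch.nn as nn

# The same kind of toy image, now processed by a single 3x3 convolution.
img = torch.randn(1, 3, 8, 8)                      # (batch, channels, H, W)
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
out = conv(img)                                    # (1, 16, 8, 8)

# Each output pixel is computed only from the 3x3 neighbourhood around it:
# the receptive field is fixed by kernel_size rather than learned from the data.
print(out.shape)
```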

Thus, we can arrive at a general conclusion: CNN can be seen as a simplified version of self-attention, where CNN only needs to consider information within the convolution kernel (receptive field), while self-attention needs to consider global information.

Conversely, we can also view self-attention as a more complex version of CNN. A CNN has to define a receptive field and only considers information within it, and the range and size of that receptive field must be set manually. With self-attention, using attention to find the related pixels is like learning the receptive field automatically: for each pixel, the model decides which other pixels to consider based on their relevance.

In simple terms, a CNN learns only from the pixels within its convolution kernel, while self-attention learns from all pixels in the entire image. (This considers only a single convolutional layer; with multiple stacked layers, the effective receptive field grows and a CNN can achieve effects similar to self-attention, as the small calculation below illustrates.)
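A quick back-of-the-envelope illustration of the parenthetical point: for a stack of stride-1 3x3 convolutions, each additional layer widens the receptive field by kernel_size - 1.

```python
# Receptive field of a stack of stride-1 3x3 convolutions:
# each extra layer adds (kernel_size - 1) pixels to the receptive field.
kernel_size = 3
for n_layers in range(1, 6):
    rf = 1 + n_layers * (kernel_size - 1)
    print(f"{n_layers} layer(s): {rf}x{rf} receptive field")
# 1 layer sees 3x3; 5 layers already see 11x11, so a deep stack of convolutions
# gradually approaches the global view that one self-attention layer has immediately.
```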

Now that we know the connection between self-attention and CNN, what conclusions can we draw?

We understand that CNN is a special case of Self-Attention, or Self-Attention is a generalization of CNN; Self-Attention is the more flexible model, and by imposing certain restrictions on it, Self-Attention can be made to do the same thing as CNN. (Conclusion derived from the paper: https://arxiv.org/abs/1911.03584)
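One way to make the restriction concrete (a rough PyTorch sketch, not the construction used in the paper): mask the attention scores so that each pixel can only attend within a 3x3 neighbourhood, which turns global self-attention into something CNN-like.

```python
import torch
import torch.nn.functional as F

# Restrict self-attention on an 8x8 image so that each pixel may only attend
# to pixels within a 3x3 neighbourhood (CNN-style locality).
H, W, d = 8, 8, 16
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)
dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)  # Chebyshev distance between pixels
mask = dist <= 1                                                 # True inside each 3x3 window

Q, K, V = (torch.randn(H * W, d) for _ in range(3))
scores = Q @ K.T / d ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))                # forbid attention outside the window
out = F.softmax(scores, dim=-1) @ V                              # now acts like a content-dependent local filter
print(out.shape)  # torch.Size([64, 16])
```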

Self-attention is a very flexible model, so it requires more data for training; with insufficient data it may overfit. CNN, because of its stronger constraints, can train a reasonably good model even with less training data.
[Figure: performance of CNN and self-attention as the amount of training data grows]
As shown in the figure, when the training data is small, CNN performs better, while when the training data is large, self-attention performs better. Self-attention's greater flexibility means it needs more training data and is prone to overfitting when data is scarce, whereas CNN's lower flexibility lets it perform well with less data but also prevents it from benefiting as much from a larger dataset.
Author: Anonymous User
https://www.zhihu.com/question/448924025/answer/1784363556
There are similarities, but also differences. CNN can be seen as performing an inner product with a fixed, static template at each position, which is a local projection, while attention computes inner products between different positions, which can be seen as a similarity measure; the weight matrices effectively define that measure. In a more general sense, CNN is more local, while self-attention emphasizes relations. It may be more appropriate to say that CNN is a special, degenerate form of attention.
Author: Lin Jianhua
https://www.zhihu.com/question/448924025/answer/1793085963
I believe that the convolutional layer of CNN and self-attention are not the same thing.
The K and Q in self-attention are generated by the data, reflecting the internal relationships of the data.

The convolutional layer of a CNN can be seen as having K composed of parameters and Q generated by the data, reflecting the relationship between the data and the parameters.

This means that self-attention constructs different spaces through parameters, allowing data to present different self-correlation properties in different spaces.

CNN convolution, on the other hand, constructs certain fixed features through its parameters and processes the data according to how the data responds to those features.

Author: Alang
https://www.zhihu.com/question/448924025/answer/1786277036
The core of CNN is to use local features to obtain global features. One could say its attention is focused on local convolution kernels, and the overall feature representation is ultimately formed from these local kernel features.

The self-attention mechanism focuses on the input itself, computing with itself so that each word incorporates global information; the aim is to use global information to help local features represent themselves better.

Thus, CNN goes from local to global, while the self-attention mechanism uses the global view to assist local features. If one must relate CNN to attention, I personally understand it as local attention (note: without the "self").

Author: Aluea
https://www.zhihu.com/question/448924025/answer/179309914
Putting it the other way around: self-attention is a CNN with a strong inductive bias.
This is not difficult to understand; let’s look at what self-attention specifically does.

Assume that for a layer of self-attention there are four candidate features a, b, c, d that can appear in the input simultaneously, but only the combined representations ac and bd contribute to the downstream task. Self-attention will focus on these two combinations while masking out the others; for instance, [a, b, c] -> [a', 0, c']. (Here a' denotes the output representation of a.)

For a layer of CNN, it is more straightforward: it simply maps [a, b, c] -> [a', b', c']. Can CNN achieve the same functionality as self-attention? Absolutely. Just add another layer of CNN with two filters, one filtering for ac and the other for bd, and that's it.

Of course, CNN can also fit distributions without doing this; however, self-attention must do this, which constitutes a stronger inductive bias.

Regarding the importance of inductive bias, I won’t elaborate further.
Author: mof.ii
https://www.zhihu.com/question/448924025/answer/1797006034
CNN is not a type of local self-attention, but it is possible to implement local self-attention as a layer and build a fully self-attentional network from it. See Google Brain's Stand-Alone Self-Attention in Vision Models, presented at NeurIPS 2019.
The second section of the article compares the computational methods of convolutional layers and Self-Attention layers in detail, which is worth a look.
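For intuition, here is a heavily simplified sketch of such a local self-attention layer (assuming PyTorch; unlike the paper's layer it has no heads and no relative position embeddings): keys and values are gathered from a window around each pixel, and attention is computed only within that window.

```python
import torch
import torch.nn.functional as F

def local_self_attention(x, window=3, d=8):
    """Toy local self-attention over an image x of shape (C, H, W):
    each pixel attends only to a (window x window) neighbourhood."""
    C, H, W = x.shape
    Wq, Wk, Wv = (torch.randn(C, d) for _ in range(3))
    pad = window // 2

    # Gather the (window x window) patch of neighbours around every pixel.
    patches = F.unfold(x.unsqueeze(0), window, padding=pad)                # (1, C*window*window, H*W)
    patches = patches.reshape(C, window * window, H * W).permute(2, 1, 0)  # (H*W, win*win, C)

    q = (x.reshape(C, H * W).T @ Wq).unsqueeze(1)                 # one query per pixel: (H*W, 1, d)
    k, v = patches @ Wk, patches @ Wv                             # keys/values per window: (H*W, win*win, d)
    attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)    # attention only inside the window
    return (attn @ v).squeeze(1).T.reshape(d, H, W)

out = local_self_attention(torch.randn(3, 8, 8))
print(out.shape)  # torch.Size([8, 8, 8])
```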