Is CNN a Type of Local Self-Attention?

Reprinted from: Deep Learning Enthusiasts
Link: https://www.zhihu.com/question/448924025/answer/1801015343
Editor: Deep Learning and Computer Vision
Statement: For academic sharing only; please contact us for removal in case of infringement.

Author: Houhou
https://www.zhihu.com/question/448924025/answer/1791134786
(This answer draws on Li Hongyi's 2021 Machine Learning course.)

CNN is not a type of local attention. Let's analyze what CNN and attention each actually do.

1: CNN can be understood as a locally connected, weight-shared fully connected layer. It therefore differs from a fully connected layer in two fundamental ways: weight sharing and local connectivity. This greatly reduces the number of parameters while ensuring that the essential features are not lost.
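As a rough back-of-the-envelope illustration (not from the original answer), here is a minimal Python sketch that counts the weights of a fully connected layer, a locally connected but unshared layer, and a shared-kernel convolution, assuming a hypothetical 32x32 single-channel input and a 3x3 kernel (biases ignored):

```python
# Illustrative parameter counts for a hypothetical 32x32 single-channel input.
H = W = 32          # input height and width (assumed for illustration)
K = 3               # kernel size
n_in = H * W        # number of input units
n_out = H * W       # assume "same" padding, so one output per input position

fc_params = n_in * n_out        # fully connected: every output sees every input
local_params = n_out * K * K    # local connectivity, but separate weights per position
conv_params = K * K             # local connectivity + weight sharing (one shared kernel)

print(f"fully connected: {fc_params:,} weights")     # 1,048,576
print(f"local, unshared: {local_params:,} weights")  # 9,216
print(f"convolution:     {conv_params:,} weights")   # 9
```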

2: Attention first multiplies Q and K to obtain an attention matrix that represents their similarity: the more similar Q and K are, the larger the product. The result is scaled and passed through a softmax to get the attention scores, which are then multiplied by V to produce the attended output.
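A minimal NumPy sketch of the steps just described (scaled dot-product attention; the shapes and variable names are illustrative assumptions, not the original answer's code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity between Q and K, scaled
    weights = softmax(scores, axis=-1)   # attention scores
    return weights @ V                   # weighted sum of V

# Toy usage with random data
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```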

The essence of attention is to calculate the similarity between Q and K, emphasizing the parts in Q that are similar to K.

The fundamental difference can also be understood as follows: CNN extracts features, while attention re-weights (emphasizes) features. These are quite different operations. If the argument is merely that all models revolve around features, that is clearly not enough to claim they are the same thing.

Update: Having gained a new understanding of CNN, I am updating my previous answer, but I will keep my earlier response. Any further supplements and explanations on this topic will be added here later; feel free to bookmark.

For the differences between RNN and self-attention, you can refer to my answer here; I hope it helps: https://zhuanlan.zhihu.com/p/360374591

First, the conclusion is that CNN can be seen as a simplified version of Self-Attention, or Self-Attention can be considered a generalization of CNN.


Previously, when comparing CNN and self-attention, we subconsciously associated CNN with image processing and self-attention with NLP, which makes it hard to see how the two methods could be related at all. The following discussion therefore focuses on the differences and connections between CNN and self-attention in image processing, so that the two can be compared directly. The process of applying self-attention to image processing is as follows:

First, if we use self-attention to process an image, each pixel produces a Query while the other pixels produce Keys. We compute Query*Key, apply a softmax (as mentioned in the comments, it does not have to be softmax; other activation functions such as ReLU also work), then multiply the resulting weights by the corresponding Values and sum them to get the output value for each pixel. Notice that when self-attention is applied to an image, every pixel attends to all the pixels in the entire image, taking all of the image's information into account.
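A minimal sketch of this per-pixel, global self-attention in NumPy, assuming an (H, W, C) feature map and hypothetical projection matrices Wq, Wk, Wv (single head, no positional encoding):

```python
import numpy as np

def image_self_attention(x, Wq, Wk, Wv):
    """x: (H, W, C) feature map; Wq, Wk, Wv: (C, d) projections.
    Every pixel attends to every other pixel in the image."""
    H, W, C = x.shape
    tokens = x.reshape(H * W, C)               # each pixel becomes one token
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (H*W, H*W): every pixel vs. every pixel
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over all pixels
    return (weights @ V).reshape(H, W, -1)

# Toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
Wq = Wk = Wv = rng.normal(size=(16, 32))
print(image_self_attention(x, Wq, Wk, Wv).shape)  # (8, 8, 32)
```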

In contrast, when we use a CNN to process an image, we apply different convolutional kernels, and each pixel only needs to consider the other pixels inside the kernel, i.e. its receptive field, without considering the rest of the image.

Thus, we can arrive at a general conclusion: CNN can be seen as a simplified version of self-attention, meaning CNN only needs to consider the information within the convolutional kernel (receptive field), while self-attention needs to consider global information.

Conversely, we can also understand that self-attention is a more complex version of CNN; CNN defines the receptive field and only considers the information within that receptive field, while the range and size of the receptive field must be set manually. In self-attention, the attention mechanism finds relevant pixels, as if the receptive field is learned automatically, determining which other pixels need to be considered based on the relevance to the center pixel.

In simple terms, CNN learns, for each pixel, only the information within the convolutional kernel, while Self-Attention learns, for each pixel, information from the entire image. (Here we only consider a single convolutional layer; with multiple stacked layers the receptive field grows, for example two stacked 3x3 convolutions cover a 5x5 region, so a deep CNN can achieve effects similar to self-attention.)

Now that we know the connection between self-attention and CNN, what conclusions can we draw?

So CNN is a special case of Self-Attention; equivalently, Self-Attention is a generalization of CNN, a very flexible CNN. By imposing suitable restrictions on Self-Attention, it becomes equivalent to CNN. (This conclusion comes from the paper "On the Relationship between Self-Attention and Convolutional Layers", https://arxiv.org/abs/1911.03584.)

Because self-attention is a very flexible model, it requires more data for training; if the data is insufficient, it tends to overfit. CNN, by contrast, has many built-in constraints, so it can train a reasonably good model even with less training data.

As shown in the figure, when the training data is small, CNN performs better; when the training data is large, self-attention performs better. That is, self-attention is more flexible and needs more training data, and with little data it is prone to overfitting, whereas CNN is less flexible, performs better on smaller training sets, and cannot benefit as much from very large amounts of data.

Author: Anonymous User
https://www.zhihu.com/question/448924025/answer/1784363556

There are similarities, but also differences. CNN can be seen as taking an inner product with a fixed, static template at each position, which is a local projection, while attention computes inner products between different positions, which can be seen as a distance metric (the weighting matrix defines that metric). More generally, CNN is more local, while self-attention emphasizes relations. Saying that CNN is a special case of attention may be more appropriate.

Author: Lin Jianhua
https://www.zhihu.com/question/448924025/answer/1793085963

I believe the convolutional layer of a CNN and self-attention are not the same thing. Self-attention's K and Q are both generated from the data, reflecting the internal relationships within the data.

The convolutional layer of a CNN, by contrast, can be seen as having K composed of parameters while Q is generated from the data, so it reflects the relationship between data and parameters.

That is, self-attention builds different spaces through parameters, allowing data to present different self-correlation properties in different spaces.

Conversely, CNN convolutions build certain fixed features and process the data according to how it responds to these features.
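A toy NumPy sketch of the contrast this answer draws, on a hypothetical 1-D signal: the CNN-style response is an inner product between the data and a fixed parameter template, whereas the attention-style scores are inner products between different positions of the data itself (all names and shapes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))        # 10 positions, 4 channels of data

# CNN view: each position is compared against a *fixed* learned template.
template = rng.normal(size=(4,))    # plays the role of K, stored in the parameters
cnn_response = x @ template         # (10,) inner products of the data with the template

# Self-attention view: positions are compared against *each other*; the
# projection matrices only define the space in which that comparison happens.
Wq, Wk = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
attn_scores = (x @ Wq) @ (x @ Wk).T  # (10, 10) inner products between data positions

print(cnn_response.shape, attn_scores.shape)  # (10,) (10, 10)
```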

Author: Aliang
https://www.zhihu.com/question/448924025/answer/1786277036

The core of CNN lies in using local features to build up global features. You could say that each time, attention is focused on the local convolution kernel, and the overall feature representation is eventually formed from these local kernel features.

The self-attention mechanism emphasizes the 'self': it computes over the input itself, which allows each word to incorporate global information and thus better represent its local features.

Thus, CNN goes from local to global, while self-attention uses the global to assist the local. If one must relate CNN to attention, I personally understand it as local attention (note the absence of 'self').

Author: Aluea
https://www.zhihu.com/question/448924025/answer/179309914

Reversing the order: self-attention is a CNN with a strong inductive bias. This is not hard to understand; let's look at what self-attention specifically does.

Suppose a single self-attention layer has four candidate features a, b, c, d that may appear in the input, but only the combined representations of ac and of bd contribute to the downstream task. Self-attention will then focus on these two combinations and suppress the other features; for example, [a, b, c] -> [a', 0, c']. (Here a' denotes the output representation of a.)

A single CNN layer is more straightforward: it reports things as they are, [a, b, c] -> [a', b', c']. Can a CNN perform a function similar to self-attention? Absolutely: just add another CNN layer with two filters, one for ac and one for bd, and that's it.

Of course, a CNN is free not to do this and can fit other distributions as well; self-attention, however, must behave this way, which is a stronger inductive bias.

I won't go into detail here about the importance of inductive bias.

Author: mof.ii
https://www.zhihu.com/question/448924025/answer/1797006034

CNN is not a type of local self-attention, but it is possible to implement local self-attention as a layer and build a fully self-attentional network. See Google Brain's Stand-Alone Self-Attention in Vision Models, presented at NeurIPS 2019. The second section of that paper compares the computation of convolutional layers and self-attention layers in detail and is worth a look.
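A heavily simplified sketch (not the paper's actual implementation) of what a local self-attention layer can look like in NumPy: each pixel produces a query, but keys and values come only from a small window around it, with no positional embeddings and a single head:

```python
import numpy as np

def local_self_attention(x, Wq, Wk, Wv, window=3):
    """x: (H, W, C) feature map; Wq, Wk, Wv: (C, d) projections.
    Each pixel attends only to a window x window neighborhood around itself."""
    H, W, C = x.shape
    r = window // 2
    pad = np.pad(x, ((r, r), (r, r), (0, 0)))   # zero-pad the spatial borders
    out = np.zeros((H, W, Wv.shape[1]))
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + window, j:j + window].reshape(-1, C)  # local keys/values
            q = x[i, j] @ Wq                      # one query per center pixel
            k, v = patch @ Wk, patch @ Wv
            scores = k @ q / np.sqrt(len(q))      # similarity to each neighbor
            w = np.exp(scores - scores.max())
            w /= w.sum()                          # softmax over the window only
            out[i, j] = w @ v
    return out

# Toy usage
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
Wq = Wk = Wv = rng.normal(size=(16, 32))
print(local_self_attention(x, Wq, Wk, Wv, window=3).shape)  # (8, 8, 32)
```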


