Overview: Attention Mechanisms in Computer Vision

Author: xys430381_1 | Translation: Jishi Platform | Link: https://blog.csdn.net/xys430381_1/article/details/89323444

This article is for academic sharing only; copyright belongs to the author. If there is any infringement, please contact us to delete the article.

Table of Contents

  • Overview

    • Why Visual Attention is Needed
    • Classification of Attention and Basic Concepts
  • Soft Attention

    • The application of two-level attention models in deep convolutional neural networks for fine-grained image classification—CVPR2015
    • 1. Spatial Transformer Networks (Spatial Domain Attention)—2015 NIPS
    • 2. SENet (Channel Domain)—2017 CVPR
    • 3. Residual Attention Network (Mixed Domain)—2017
    • Non-local Neural Networks, CVPR2018
    • Interaction-aware Attention, ECCV2018
    • CBAM: Convolutional Block Attention Module (Channel Domain + Spatial Domain), ECCV2018
    • DANet: Dual Attention Network for Scene Segmentation (Spatial Domain + Channel Domain), CVPR2019
    • CCNet
    • OCNet
    • GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond
    • Attention-Augmented Convolutions
    • PAN: Pyramid Attention Network for Semantic Segmentation (Layer Domain)—CVPR2018
    • Multi-Context Attention for Human Pose Estimation
    • Tell Me Where to Look: Guided Attention Inference Network
  • Hard Attention

    • A study introducing hard attention mechanisms to guide visual question-answering tasks
    • 1. Diversified visual attention networks for fine-grained object classification—2016
    • 2. Deep networks with internal selective attention through feedback connections (Channel Domain)—NIPS 2014
    • 3. Fully Convolutional Attention Networks for Fine-Grained Recognition
    • 4. Temporal Attention (RNN)
  • Self-Attention

  • Related Works

  • Disadvantages and Improvement Strategies of Self-Attention

  • Summary of Self-Attention

Overview

Why Visual Attention is Needed

The basic idea of the attention mechanism in computer vision is to let the system learn to focus on important information while ignoring irrelevant information. Why ignore irrelevant information? Because computational resources are limited and irrelevant regions add noise: concentrating capacity on the task-relevant parts of the input both saves computation and improves accuracy, much as human vision foveates on salient regions.

Classification of Attention and Basic Concepts

What is “attention” in neural networks, and how is it used? A detailed explanation can be found here: http://www.sohu.com/a/198312880_390227. That article divides attention into hard attention and soft attention, as well as Gaussian attention and spatial transformation.

Divided by the differentiability of attention:

  1. Hard attention is a 0/1 problem: which regions are attended to and which are not. Hard attention has long been familiar in image applications in the form of image cropping. It differs from soft attention in that it focuses on points rather than regions: every point in the image can become an attention target, and selection is a stochastic prediction process that emphasizes dynamic change. Crucially, hard attention is non-differentiable, so the training process usually involves reinforcement learning (see: Mnih, Volodymyr, Nicolas Heess, and Alex Graves. “Recurrent Models of Visual Attention.” Advances in Neural Information Processing Systems, 2014).

Hard attention can be implemented in Python (e.g., NumPy or TensorFlow) as a simple crop:

# Crop an h-by-w glimpse at (y, x): a discrete, non-differentiable selection
g = I[y:y+h, x:x+w]

The only remaining problem is that cropping is non-differentiable; if you want to learn the model parameters, you must use a score-function estimator (such as REINFORCE). I briefly introduced this in my previous article.

  2. Soft attention is a continuous distribution over [0, 1], indicating the degree to which each region is attended, represented by a score from 0 to 1. The key point of soft attention is that it focuses on regions or channels, and it is deterministic: once trained, the attention map is generated directly by the network in a single forward pass. Crucially, soft attention is differentiable, which is very important: differentiable attention lets gradients flow through the network, so the attention weights can be learned by ordinary forward propagation and backpropagation. However, this type of soft attention is computationally expensive: the zero-weighted (black) parts of the input contribute nothing to the result but must still be processed. It is also over-parameterized: the sigmoid activations used to implement attention are independent of one another, so multiple targets can be selected at once, whereas in practice we often want exclusive selectivity, focusing on a single element of the scene. The two mechanisms introduced by DRAW and Spatial Transformer Networks effectively solve this problem; they can also adjust the input size, further improving performance.
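To make the contrast with hard attention concrete, here is a minimal soft-attention sketch in PyTorch (the per-location scoring function is a toy assumption, not from the article): the mask is continuous, every location is still processed, and gradients flow through the weights.

import torch

feat = torch.randn(1, 64, 32, 32, requires_grad=True)  # B x C x H x W features
scores = feat.mean(dim=1, keepdim=True)                # toy per-location scores
mask = torch.sigmoid(scores)                           # soft weights in [0, 1]
attended = feat * mask                                 # every location is still processed
attended.sum().backward()                              # gradients flow through the mask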

Divided by the domain of attention:

  1. Spatial Domain
  2. Channel Domain
  3. Layer Domain
  4. Mixed Domain
  5. Time Domain: There is another special implementation of hard attention in the time domain, but since hard attention is implemented using reinforcement learning, training is somewhat different.

One more concept: self-attention is the network's autonomous learning of weight distributions from the feature maps themselves (the weights can be over space, time, or channels).

Soft Attention

The application of two-level attention models in deep convolutional neural networks for fine-grained image classification—CVPR2015

1. Spatial Transformer Networks (Spatial Domain Attention)—2015 NIPS

The Spatial Transformer Networks (STN) model [4] is an article from NIPS 2015, which transforms spatial information from the original image into another space while preserving key information through the attention mechanism.

This article argues that previous pooling methods are too crude: directly merging information makes key details unrecoverable. It therefore proposes a module called the spatial transformer, which transforms the spatial-domain information in the image so as to extract the key information.

The spatial transformer is essentially an implementation of the attention mechanism: the trained spatial transformer can identify the regions of the image that deserve attention, and it can also perform rotation and scaling transformations, so that important local information can be extracted by transformation. Consider, for example, this intuitive experimental figure:

Column (a) shows the original image information: the first handwritten digit 7 has no transformation, the second handwritten digit 5 has undergone some rotation, and the third handwritten digit 6 has had noise added. Column (b) shows the colored bounding boxes learned by the spatial transformer, each corresponding to a spatial transformer learned from the image. Column (c) shows the feature maps after transformation: the key region of the 7 is selected, the 5 is rotated back upright, and the noise around the 6 is filtered out.

2. SENet (Channel Domain)—2017 CVPR

The middle module is the innovative part of SENet: the attention mechanism module. This attention mechanism is divided into three parts: squeeze, excitation, and scale (attention). The process is as follows (a minimal code sketch follows the list):

  1. Perform global average pooling on the input features to obtain a 1×1×C descriptor.
  2. Pass the descriptor through a bottleneck so the channels interact: first compress the channel count, then expand it back to the original number of channels.
  3. Finally, apply a sigmoid to generate channel-wise attention weights between 0 and 1, and scale the original input features by these weights.
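The three steps map directly onto code. Below is a minimal PyTorch sketch of an SE block, assuming the commonly used reduction ratio r=16 (the exact layer shapes are illustrative assumptions, not taken from this article):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Minimal Squeeze-and-Excitation sketch: squeeze -> excitation -> scale.
    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # compress the channel count
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),  # reconstruct the channel count
            nn.Sigmoid(),                        # channel weights in [0, 1]
        )

    def forward(self, x):                        # x: B x C x H x W
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                   # squeeze: global average pooling -> B x C
        w = self.fc(s).view(b, c, 1, 1)          # excitation: per-channel weights
        return x * w                             # scale: reweight the input features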

For details, see “Paper Reading Notes—SENET” https://blog.csdn.net/xys430381_1/article/details/89158063

3. Residual Attention Network (Mixed Domain)—2017

The attention mechanism in this article uses the basic masking mechanism of soft attention, but its mask borrows the idea of residual networks: it not only applies a mask based on the current layer's information, but also passes the previous layer's information downward. This avoids the problem that masking leaves too little information, which would otherwise prevent deep stacking of network layers.

The proposed attention mask is not only focused on spatial or channel domains; this mask can be seen as the weight of each feature element. By finding the corresponding attention weight for each feature element, it simultaneously forms the attention mechanism in both spatial and channel domains.

Assigning a weight to every feature element may seem like a very natural extension of spatial or channel attention, so why had no one done single-element attention before? The reasons are:

  • If you assign a mask weight to each feature element, very little information survives the masking, which can directly destroy the network's deep features;
  • Moreover, a naively applied mask destroys the identity-mapping property of the residual unit, making the network difficult to train.

The innovation of this article is residual attention learning: the next layer receives not only the masked feature tensor but also the unmasked one, so richer features are retained and key features are still emphasized. The model structure in the article is very clear; overall it consists of three stages of attention modules. Each attention module has two branches (see stage 2): the upper trunk branch is the basic structure of a residual network (ResNet), while the lower soft mask branch contains the main part of the residual attention learning mechanism, forming the attention map through down-sampling, up-sampling, and residual modules.

The innovative residual attention mechanism is H_{i,c}(x) = (1 + M_{i,c}(x)) · F_{i,c}(x), where H is the output of the attention module, F is the feature tensor from the previous layer, and M is the soft mask's attention weights. This residual form feeds both the original image features and the attention-enhanced features into the next module. Choosing different activation functions for the mask yields attention in different domains:

  1. f1 is the direct sigmoid activation function applied to the image feature tensor, which is the mixed domain attention;

  2. f2 directly performs global average pooling on the image feature tensor, resulting in channel domain attention (similar to SENet);

  3. f3 is the activation function that takes the average of the image feature tensor in the channel domain, effectively ignoring the channel domain information, thus obtaining spatial domain attention.

Non-local Neural Networks, CVPR2018

FAIR’s masterpiece, mainly inspired by traditional methods using non-local similarity for image denoising.

The main idea is simple: the convolution units in CNNs only focus on the local area of the kernel size, and even as the receptive field grows, it ultimately still operates in a local area, thus ignoring the contributions of distant pixels to the current area.

Thus, non-local blocks aim to capture this long-range relationship: for 2D images, it is the relationship weight of any pixel in the image to the current pixel; for 3D videos, it is the relationship weight of all pixels in all frames to the current frame’s pixel.

The network architecture is also straightforward. The article discusses several implementations; here we briefly describe the matmul version, which is the easiest to implement in a deep learning framework (a sketch follows the list):

  1. First, apply linear maps (essentially 1×1×1 convolutions that compress the channel count) to the input feature map X to obtain the θ, ϕ, and g features.
  2. Reshape the three features so that all dimensions except the channels are merged, then perform a matrix multiplication to obtain something like a covariance matrix (this step is the crucial one: it computes the self-correlation of the features, i.e., the relationship of each pixel in each frame to all other pixels in all frames).
  3. Apply a softmax to the self-correlation matrix by columns or rows (depending on the layout of g) to obtain weights between 0 and 1; these are the self-attention coefficients we need.
  4. Finally, multiply the attention coefficients back onto the feature matrix g, restore the channel count, and add a residual connection to the original input feature map X to complete the bottleneck.
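A minimal PyTorch sketch of the 2D matmul version just described; the halved intermediate channel count and layer layout are common bottleneck assumptions rather than details from this article:

import torch
import torch.nn as nn

class NonLocalBlock2D(nn.Module):
    # Minimal non-local block sketch in matmul/softmax form.
    def __init__(self, channels):
        super().__init__()
        self.inter = channels // 2                       # compressed channel count
        self.theta = nn.Conv2d(channels, self.inter, 1)  # produces the "query" feature
        self.phi = nn.Conv2d(channels, self.inter, 1)    # produces the "key" feature
        self.g = nn.Conv2d(channels, self.inter, 1)      # produces the "value" feature
        self.out = nn.Conv2d(self.inter, channels, 1)    # restores the channel count

    def forward(self, x):                                    # x: B x C x H x W
        b, c, h, w = x.shape
        q = self.theta(x).view(b, self.inter, -1)            # B x C' x HW
        k = self.phi(x).view(b, self.inter, -1)
        v = self.g(x).view(b, self.inter, -1)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # B x HW x HW self-correlation
        y = (attn @ v.transpose(1, 2)).transpose(1, 2)       # weighted sum of values
        y = self.out(y.reshape(b, self.inter, h, w))         # restore channels
        return x + y                                         # residual connection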

Visual effects of attention maps when the block is embedded in an action recognition framework:

In the diagram, the arrows indicate the contribution relationship of certain pixels from previous frames to the current pixel of the ankle joint in the last image (current frame). Since it is soft attention, every pixel in every frame has a contribution relationship, and the yellow arrows describe the relationships with the highest responses.

Summary

Pros: Non-local blocks are very general and can easily be embedded in any existing 2D and 3D convolutional networks to improve or visualize related CV tasks. For example, a recent article has applied non-local to the Video ReID task [2].

Cons: The article's results suggest placing non-local blocks as early in the network as possible, but in practice, for 3D tasks the temporal extent T is still relatively large in the early layers, so constructing and multiplying the attention matrices there involves an enormous amount of computation and GPU memory.

Interaction-aware Attention, ECCV2018

Article from Meitu and the Chinese Academy of Sciences.

This article discusses a lot about Multi-scale feature fusion and tells a bunch of stories, but ultimately, the key contribution is the design of a new loss based on PCA on the covariance matrix of the non-local block for better feature interaction. The authors believe that this process allows features to perform better non-local interactions in the channel dimension, hence the name Interaction-aware attention.

So the question arises, how to achieve attention weights through PCA?

The article does not use the eigenvalue decomposition of the covariance matrix directly, but instead uses an equivalent form (see the paper for the derivation).

CBAM: Convolutional Block Attention Module (Channel Domain + Spatial Domain), ECCV2018

This work further extends the Squeeze-and-Excitation module of SENet [5].

Specifically, the article treats channel-wise attention as teaching the network “what” to look at, while spatial attention teaches the network “where” to look; its main addition relative to the SE module is the latter.

Overall Diagram
Sub-diagram of Channel Attention and Spatial Attention

Channel attention formula: Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

Spatial attention formula (spatial attention is obtained by applying AvgPool and MaxPool along the channel axis, followed by a 7×7 convolution): Ms(F) = σ(f7×7([AvgPool(F); MaxPool(F)]))

CBAM is especially lightweight and easy to deploy on the edge.
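A minimal PyTorch sketch stringing the two formulas together (the reduction ratio r=16 and the 7×7 convolution follow the paper's defaults; other details are simplified assumptions):

import torch
import torch.nn as nn

class CBAM(nn.Module):
    # Minimal CBAM sketch: channel attention ("what"), then spatial attention ("where").
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for avg- and max-pooled vectors
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                         # x: B x C x H x W
        b, c, _, _ = x.shape
        # channel attention: Mc = sigmoid(MLP(AvgPool) + MLP(MaxPool))
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * w.view(b, c, 1, 1)
        # spatial attention: pool along the channel axis, then a 7x7 conv
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))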

DANet: Dual Attention Network for Scene Segmentation (Spatial Domain + Channel Domain), CVPR2019

Code: https://github.com/junfu1115/DANet. Recently posted on arXiv, this work applies the idea of self-attention to image segmentation, leveraging long-range contextual relationships to achieve precise segmentation.

The main idea is a fusion of the two approaches above, CBAM and non-local:

Perform spatial-wise self-attention on the deep feature maps and, in parallel, channel-wise self-attention, then merge the two results by element-wise sum. The benefit of this approach is:

It keeps CBAM's idea of performing spatial and channel attention separately, but computes each directly with the non-local self-correlation matrix in matmul form, avoiding the complex hand-designed pooling and multi-layer perceptron operations.
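As an illustration, here is a minimal PyTorch sketch of the channel-wise branch (the spatial branch mirrors the non-local block sketched earlier); the learnable fusion scale and the omission of 1×1 projections are simplifying assumptions, not the paper's exact implementation:

import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    # Channel-wise self-attention: the affinity matrix is C x C instead of HW x HW.
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))            # learnable residual scale

    def forward(self, x):                                    # x: B x C x H x W
        b, c, h, w = x.shape
        f = x.view(b, c, -1)                                 # B x C x HW
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # B x C x C channel affinity
        y = (attn @ f).view(b, c, h, w)                      # reweighted channels
        return x + self.gamma * y                            # residual fusion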

CCNet

The highlight of this article is the clever method used to reduce computation. In DANet above, the attention map computes the similarity between all pairs of pixels, giving a spatial complexity of (H×W)×(H×W). This article instead adopts a criss-cross idea: each pixel computes similarity only with the pixels in its own row and column, and the similarity between every pair of pixels is then obtained indirectly, reducing the spatial complexity to (H×W)×(H+W−1).

The overall network architecture is the same as DANet, only the attention module differs: during the matrix multiplication, each pixel extracts only the feature-map pixels at its criss-cross positions for the dot product when computing similarity. After one round of this attention computation, each element of the resulting attention map (R1) has similarities only at its criss-cross positions; after two rounds, each element obtains similarities over the entire image (R2).
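To get a feel for the savings, a quick back-of-the-envelope comparison in Python (the feature-map size is an arbitrary example, not a number from the paper):

# Attention-map entries per image: full pairwise vs. criss-cross.
H, W = 64, 64                       # example feature-map size
full = (H * W) ** 2                 # DANet-style: every pixel vs. every pixel
criss = H * W * (H + W - 1)         # CCNet: each pixel vs. its own row and column
print(full, criss, full / criss)    # 16777216 520192 ~32x fewer entries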

The reason two rounds suffice: after the first round, each pixel obtains similarities at its criss-cross positions, while pixels in other rows and columns (outside its criss-cross) have no direct similarity with it. But those pixels have likewise computed similarities over their own criss-crosses, and any two criss-crosses must intersect; so in the second round of attention, the similarity between two pixels in different rows and columns is computed indirectly through the intersection point.

The experimental results reach SOTA level; notably, this attention method, despite not computing similarities over all pixel pairs, achieves even higher accuracy.

OCNet

OCNet: Object Context Network for Scene Parsing (Microsoft Research) Paper Analysis https://blog.csdn.net/mieleizhi0522/article/details/84873101 Image Semantic Segmentation (13)—OCNet: Object Semantic Network for Scene Parsing https://blog.csdn.net/kevin_zhao_zl/article/details/88732455

Abstract: The paper focuses on the semantic aggregation strategy in semantic segmentation: instead of predicting pixel by pixel, it aggregates similar pixels, and proposes an object context pooling strategy that uses the set of pixels belonging to the same object to obtain the label of a pixel contained in that object; this pixel set is called the object context. The implementation is inspired by the self-attention mechanism and has two steps: 1) compute the similarity between a single pixel and all pixels, obtaining the mapping between the object context and each pixel; 2) obtain the target pixel's label. The results are more accurate than existing semantic aggregation strategies such as PPM and ASPP, which do not distinguish whether a pixel belongs to a given object context.

It must be said that this paper overlaps with DANet, and quite severely, sharing the same core idea.

GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond

Interpretation of the paper Non-local Networks Meet Squeeze-Excitation Networks and Beyond https://blog.csdn.net/nijiayan123/article/details/90035599

The GCNet structure combines non-local networks and squeeze-excitation networks. We know that non-local networks (NLNet) can capture long-range dependencies; NLNet uses the self-attention mechanism to model pixel-pair relationships. However, the authors observe that the global context computed by a non-local block is almost the same at different query positions, indicating that it has learned a position-independent global context, so much of its computation is wasted. The authors therefore propose a simplified, query-independent model for obtaining global context information, and this simplified structure can also share its form with SENet. Combining these three ingredients yields the global context (GC) block.
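A minimal PyTorch sketch of a GC block as just described: query-independent context pooling followed by an SE-style bottleneck transform added back to every position (layer sizes here are assumptions; the LayerNorm in the bottleneck follows the paper's design):

import torch
import torch.nn as nn

class GCBlock(nn.Module):
    # Global context block sketch: one attention map shared by all query positions.
    def __init__(self, channels, r=16):
        super().__init__()
        self.mask = nn.Conv2d(channels, 1, 1)                # per-position attention logits
        self.transform = nn.Sequential(                      # SE-style bottleneck transform
            nn.Conv2d(channels, channels // r, 1),
            nn.LayerNorm([channels // r, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
        )

    def forward(self, x):                                    # x: B x C x H x W
        b, c, _, _ = x.shape
        w = torch.softmax(self.mask(x).view(b, 1, -1), dim=-1)         # B x 1 x HW
        ctx = (x.view(b, c, -1) @ w.transpose(1, 2)).view(b, c, 1, 1)  # pooled global context
        return x + self.transform(ctx)                       # broadcast add to all positions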

Attention-Augmented Convolutions

Using self-attention to enhance convolutions: This is a dialogue between old and new generations of neural networks (with implementation) https://mp.weixin.qq.com/s/z2scG7VvguXXZVpRhyLpxw

PyTorch implementation address: https://github.com/leaderj1001/Attention-Augmented-Conv2d

PS: The insight is to add multiple attention heads on top of the convolution operator (each self-attention head is a non-local block), and then concatenate these heads together.

Mechanism

Convolution operations have a significant flaw: they only operate on a local neighborhood and thus miss global information. Self-attention, on the other hand, is a recent advance in capturing long-range interactions, but has mainly been applied to sequence modeling and generative modeling tasks. In this paper, we investigate using self-attention (as a replacement for convolution) in discriminative visual tasks. We propose a novel two-dimensional relative self-attention mechanism and show that it is competitive enough to replace convolution as a standalone primitive for image classification. In controlled experiments, we find the best results are obtained when convolution is combined with self-attention. We therefore propose augmenting the convolution operator with this self-attention mechanism, specifically by concatenating the convolutional feature maps with a set of feature maps produced by self-attention.

Figure: Attention-Augmented Convolution. For each spatial position (h, w), N_h attention maps are computed from queries and keys over the image; these attention maps are used to compute N_h weighted averages of the values V. The results are then reshaped to the original spatial dimensions and mixed with a pointwise convolution. Multi-head attention is applied in parallel to a standard convolution operation, and the outputs are concatenated.

Single Attention Head

Given an input tensor of shape (H, W, F_in), we flatten it into a matrix X and perform multi-head attention as proposed in the Transformer architecture. The output of a single head h's self-attention is

O_h = Softmax( (X W_q)(X W_k)ᵀ / √(d_k^h) ) (X W_v)

where W_q, W_k, W_v are the learnable linear transformations that map the input X to queries Q, keys K (xys: when performing the self-correlation matrix multiplication, one operand is called the query and the other the key), and values V. The outputs of all heads are then concatenated and projected:

MHA(X) = Concat[O_1, …, O_{N_h}] W^O
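A minimal PyTorch sketch of the augmented convolution (the relative position embeddings from the paper are omitted, and the channel split between the conv and attention paths is an assumption):

import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    # Concatenate a standard conv's output with multi-head self-attention features.
    def __init__(self, c_in, c_out, dk=32, dv=32, n_heads=4):
        super().__init__()
        self.n_heads, self.dk, self.dv = n_heads, dk, dv
        self.conv = nn.Conv2d(c_in, c_out - dv, 3, padding=1)  # convolution path
        self.qkv = nn.Conv2d(c_in, 2 * dk + dv, 1)             # joint Q, K, V projection
        self.mix = nn.Conv2d(dv, dv, 1)                        # pointwise mix of head outputs

    def forward(self, x):                                      # x: B x C x H x W
        b, _, h, w = x.shape
        q, k, v = self.qkv(x).split([self.dk, self.dk, self.dv], dim=1)
        split = lambda t, d: t.reshape(b * self.n_heads, d // self.n_heads, h * w).transpose(1, 2)
        q, k, v = split(q, self.dk), split(k, self.dk), split(v, self.dv)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        o = (attn @ v).transpose(1, 2).reshape(b, self.dv, h, w)  # concat heads back to dv channels
        return torch.cat([self.conv(x), self.mix(o)], dim=1)      # B x c_out x H x W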

We conducted extensive experiments with models of different scales and types (including ResNet and a state-of-the-art resource-constrained mobile network), and the results show that attention augmentation yields consistent improvements on ImageNet image classification and COCO object detection while keeping the number of parameters roughly constant. Notably, our method achieves a top-1 ImageNet accuracy 1.3% above the ResNet50 baseline, and exceeds the RetinaNet baseline by 1.4 mAP on COCO object detection. Attention augmentation achieves these systematic improvements with minimal computational overhead, and outperformed the popular Squeeze-and-Excitation channel attention method in all experiments. A striking result is that the fully self-attentional model (a special case of attention augmentation) performs only slightly worse than its fully convolutional counterpart, indicating that self-attention by itself is a powerful fundamental method for image classification.

PAN: Pyramid Attention Network for Semantic Segmentation (Layer Domain)—CVPR2018

Highlight 1: The paper combines the attention mechanism with a pyramid structure, allowing relatively precise dense features to be extracted under high-level semantic guidance, replacing the complex dilated convolutions and repeated encoder-decoder operations of other methods and departing from the commonly used U-Net structure.

Highlight 2: It uses a global pooling operation to weight the low-level features, serving to select features.

Module One: FPA (Feature Pyramid Attention)

  • Problem addressed: Different scale images and different sized objects pose challenges for object segmentation.
  • Existing methods: PSPNet-style spatial pyramid pooling over multiple scales, and the atrous spatial pyramid pooling (ASPP) structure of DeepLab. Problem 1: pooling easily loses local information. Problem 2: ASPP, being a sparse sampling operation, can cause checkerboard artifacts. Problem 3: simply concatenating multiple scales without attending to contextual information performs poorly (the original figure depicts these existing methods). This module mainly processes high-level features.
  • Proposed solution: As shown in the right-hand diagram, after the high-level features are extracted, no pooling is performed; instead, three successive convolutions produce still-higher-level semantics. Since higher-level semantics are closer to the ground truth and focus on object information, they are used as attention guidance: passed through a 1×1 convolution that keeps the size unchanged and multiplied with the high-level features, thereby strengthening the weight of regions carrying object information to form the attention output. In addition, because the pyramid convolution structure uses convolution kernels of different sizes, representing different receptive fields, it also addresses objects appearing at different scales.

Module Two: GAU (Global Attention Upsample)

  • Problem addressed: While high-level features can achieve effective classification, reconstructing the original image’s resolution or making precise predictions is often not feasible.
  • Existing methods: Similar to SegNet, Refinenet, Tiramisu structures, etc., they all adopt U-Net structures, using decoders (deconvolutions) and adding low-level features layer by layer to restore image details. The paper mentions that while this can achieve a combination of low and high-level features and image reconstruction, it incurs a computation burden.
  • Proposed solution: As shown in the diagram below, the decoder structure is discarded. In its basic form, low-level features would simply be added to the high-level features obtained from FPA; instead, when skipping in the low-level features, the paper weights them under high-level guidance, ensuring consistency between the weighting of low- and high-level features. The high-level features produce weights via global pooling, while the low-level features pass through a convolution layer to reach the same number of maps as the high-level features; the two are then multiplied and added, reducing the decoder's complexity while introducing a new form of high-low feature fusion. Specifically (a code sketch follows the list):
    • We perform 3×3 convolution operations on low-level features to reduce the number of channels in CNN feature maps.
    • The global context information generated from high-level features undergoes 1×1 convolution, batch normalization, and nonlinear transformation operations, and is then multiplied with low-level features.
    • Finally, high-level features are added to the weighted low-level features and undergo a stepwise up-sampling process.
    • Our GAU module can not only adapt more effectively to feature maps at different scales but also provides guiding information for low-level feature maps in a straightforward manner.
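Here is a minimal PyTorch sketch of the GAU steps above; the channel counts, the 2× upsampling factor, and the layer choices are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    # Global Attention Upsample sketch: high-level global context weights low-level features.
    def __init__(self, c_low, c_high):
        super().__init__()
        self.reduce = nn.Conv2d(c_low, c_high, 3, padding=1)  # 3x3 conv on low-level features
        self.gate = nn.Sequential(                            # 1x1 conv + BN + nonlinearity
            nn.Conv2d(c_high, c_high, 1),
            nn.BatchNorm2d(c_high),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):     # low: B x c_low x 2H x 2W, high: B x c_high x H x W
        low = self.reduce(low)                                # match high-level channel count
        gp = high.mean(dim=(2, 3), keepdim=True)              # global pooling -> B x c_high x 1 x 1
        low = low * self.gate(gp)                             # weight low-level by global context
        return F.interpolate(high, scale_factor=2) + low      # upsample and fuse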

Multi-Context Attention for Human Pose Estimation

Tell Me Where to Look: Guided Attention Inference Network

Hard Attention

A Study Introducing Hard Attention Mechanisms to Guide Learning Visual Question-Answering Tasks

Soft attention mechanisms have been widely applied, with great success, in the field of computer vision. By contrast, research on hard attention mechanisms for computer vision tasks remains relatively sparse. A hard attention mechanism selects only the important features from the input, making it a more efficient and direct method than soft attention. In this article, we introduce a study that guides the learning of visual question-answering tasks by introducing a hard attention mechanism. In addition, filtering feature vectors by their L2 norm can efficiently drive the selection process and achieve better overall performance without a dedicated learning procedure.

1. Diversified Visual Attention Networks for Fine-Grained Object Classification—2016

2. Deep Networks with Internal Selective Attention Through Feedback Connections (Channel Domain)—NIPS 2014

A deep attention selective network (dasNet) is proposed. After supervised training, attention is dynamically adjusted via reinforcement learning (Separable Natural Evolution Strategies). Specifically, the attention adjusts the weight of each conv filter (similar to SENet, no?). The policy is a neural network; see the paper for the pseudocode of the RL part of the algorithm.

3. Fully Convolutional Attention Networks for Fine-Grained Recognition

This paper uses a reinforcement-learning-based visual attention model to learn to localize object parts and classify objects. The framework mimics the recognition process of the human visual system, learning a task-driven policy that localizes object parts through a sequence of glimpses. What is a glimpse? Each glimpse corresponds to one part of the object. The original image and the previous glimpse's location are the input, and the next glimpse's location is the output, serving as the next object part. Each glimpse location is treated as an action, the image and the previous glimpse's location as the state, and the reward measures classification accuracy. This method can localize multiple parts simultaneously, whereas previous methods could only localize one part at a time.

4. Temporal Attention (RNN)

This concept is relatively broad: static computer vision recognizes single images, with no notion of a time domain. This line of work, however, builds the attention mechanism on a recurrent neural network (RNN).

RNN models are particularly suitable for scenarios where data has temporal characteristics, such as using RNNs to produce attention mechanisms performing well in natural language processing problems. This is because natural language processing involves text analysis, which inherently has a temporal correlation; for example, one word is often followed by another, indicating a temporal dependency.

However, image data does not naturally possess temporal characteristics; a single image is often a sample at a single time point. But in video data, RNNs are a suitable data model, thus enabling the use of RNNs to generate attention.

RNN models are specifically referred to as temporal attention because this model adds a time dimension on top of previously introduced spatial, channel, and mixed domains. This dimension arises from the temporal characteristics of sampling points.

In the Recurrent Attention Model, attention is viewed as sampling a point in a region of the image, and this sampled point is the point to be attended. The attention in this model is no longer differentiable, making it a hard attention model; training it requires reinforcement learning, which lengthens training.

This model requires understanding not just the RNN attention model, but more importantly, how this model converts image information into temporal sampling signals:

A key component of the model is the Glimpse Sensor. The sensor first determines the point (pixel) in the image that needs attention, then collects three kinds of information of equal volume around it: one very detailed (the innermost box), one of medium local extent, and one rough overview.

All three samples are taken around a location l in the image, and as time t increases the sampling location changes. How l changes with t is exactly what reinforcement learning is trained to decide.
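A minimal sketch of the glimpse idea (the function name, crop size, and number of scales are assumptions): extract concentric crops of growing extent around the attended location and resize them all to a common resolution, so each carries an equal volume of data at decreasing levels of detail.

import torch
import torch.nn.functional as F

def glimpse(image, loc, size=8, n_scales=3):
    # image: B x C x H x W; loc: (y, x) center of attention in pixel coordinates.
    y, x = loc
    patches = []
    for s in range(n_scales):
        half = size * (2 ** s) // 2                  # growing extent, coarser detail
        crop = image[:, :, max(0, y - half):y + half, max(0, x - half):x + half]
        patches.append(F.interpolate(crop, size=(size, size),
                                     mode="bilinear", align_corners=False))
    return torch.cat(patches, dim=1)                 # B x (C * n_scales) x size x size

# usage: g = glimpse(torch.randn(1, 1, 64, 64), loc=(32, 32))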

Self-Attention

The application of self-attention mechanisms in computer vision: https://cloud.tencent.com/developer/news/374449
Using attention to master CV, an overview of self-attention progress in semantic segmentation: https://mp.weixin.qq.com/s/09_rc9J-4cEs7GrYQaL16Q

Self-attention mechanisms improve upon attention mechanisms by reducing reliance on external information and are better at capturing internal correlations of data or features.

In neural networks, we know that convolutional layers obtain output features through the linear combination of convolution kernels and original features. Since convolution kernels are usually local, to increase the receptive field, convolution layers are often stacked, which is not efficient. Additionally, many computer vision tasks are affected by insufficient semantic information, impacting final performance. Self-attention mechanisms capture global information to obtain a larger receptive field and contextual information.

Self-attention mechanisms (self-attention) have made significant progress in sequence models; on the other hand, contextual information (context information) is critical for many visual tasks, such as semantic segmentation and object detection. Self-attention mechanisms provide an effective modeling method for capturing global contextual information through the triplet of (key, query, value). Next, we will introduce several corresponding works, analyze their advantages and disadvantages, and discuss improvement directions.

Related Works

Attention Is All You Need [1] is the first work to propose replacing recurrent neural networks with self-attention in sequence models, and it achieved great success. One important module is scaled dot-product attention, which captures long-distance dependencies through the (key, query, value) triplet: the keys and queries are multiplied to obtain the attention weights, and those weights are then applied to the values to produce the final output, i.e., Attention(Q, K, V) = softmax(QKᵀ/√d_k)V.
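In code, the module is just a few lines (shapes assumed as batch × length × depth):

import torch

def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V, the scaled dot-product attention of [1].
    d_k = q.shape[-1]
    weights = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ v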

Non-local neural networks [2] inherit the (key, query, value) modeling and propose an efficient non-local module. After adding the non-local module to a ResNet, both object detection and instance segmentation improve by over one point of mAP, demonstrating the importance of contextual modeling.

DANet [3] is a work from the Institute of Automation, Chinese Academy of Sciences, whose core idea is to supervise semantic segmentation tasks through contextual information. The authors use two forms of attention, spatial and channel-wise, followed by feature fusion feeding the semantic segmentation head network. The approach is simple and achieves good results.

OCNet [4] is a work from Microsoft Research Asia. It also uses the (key, query, value) triplet to capture global contextual information and better supervise semantic segmentation. Unlike DANet [3], it uses only spatial information. It, too, achieves good results.

DFF [5] is a work from the Visual Computing Group at Microsoft Research Asia. As shown in the diagram below, it models the motion information between video frames through optical flow, proposing a very elegant video detection framework, DFF. One important operation is warp, which aligns features from one frame to another. Many later works on video detection, such as FGFA [6] and Towards High Performance [7], are likewise based on warped features for alignment. Generally, we assume motion does not travel far, which suggests finding each point's post-motion correspondence within its neighborhood. We omit the specific method here and invite the reader to think about the related operations and their connection to self-attention mechanisms.

Disadvantages and Improvement Strategies of Self-Attention

The previous sections gave a brief introduction to the uses of self-attention mechanisms. Next, we analyze their disadvantages and corresponding improvement strategies. Because every point must capture global contextual information, the self-attention module has high computational complexity and a large memory footprint. If some prior information were available, for example that feature alignment typically stays within a certain neighborhood (as noted above), the computation could be restricted to that neighborhood. In addition, how to sparsify the information efficiently, and its relationship with graph convolutions, are all open questions, and we encourage everyone to think actively about them.

Next, we introduce other improvement strategies. SENet [9] teaches us that channel information is very important, as shown in the diagram below.

CBAM [10] proposes a module combining spatial and channel information, achieving good results across various tasks, as shown in the diagram below.

Finally, we introduce a work from Baidu IDL that folds channel information into the spatial modeling [11]. Essentially, it incorporates channel information directly when reshaping the (key, query, value) triplet, but this greatly increases the computational complexity. We know that grouped convolution is an effective way to reduce parameter counts, and this method also employs grouping; even so, grouping alone cannot fundamentally resolve the high computational complexity and large parameter count. The authors cleverly reorder the computation of keys, queries, and values after a Taylor-series expansion, effectively reducing the corresponding complexity. The paper's table shows the optimized computational load and complexity analysis, and its figure illustrates the overall framework of the CGNL module.

Summary of Self-Attention

As an effective method for modeling context, self-attention mechanisms have achieved good results in many visual tasks. However, the disadvantages of this modeling method are also evident: one is the lack of consideration for information on channels, and the other is the still high computational complexity. Corresponding improvement strategies include how to effectively combine information on spatial and channel dimensions, and how to sparsify captured information. The benefits of sparsification are to be more robust while maintaining smaller computational loads and memory usage. Lastly, graph convolutions, as a hot research direction in recent years, and how to connect self-attention mechanisms with graph convolutions, as well as a deeper understanding of self-attention mechanisms, are crucial directions for the future.

Recommended Articles on Attention Mechanisms

Microsoft Research: An Empirical Study of Spatial Attention Mechanisms in Deep Networks http://www.myzaker.com/article/5cb40c5a8e9f09734a62fab1/

Paper: An Empirical Study of Spatial Attention Mechanisms in Deep Networks https://arxiv.org/abs/1904.05873

Paper Reading: Attention Mechanisms in Image Classification https://blog.csdn.net/Wayne2019/article/details/78488142 Introduces Spatial Transformer Networks, Residual Attention Networks, Two-level Attention, SENet, Deep Attention Selective Network

Attention Mechanisms in Computer Vision (Visual Attention) https://blog.csdn.net/paper_reader/article/details/81082351

Quick Understanding of the Application of Attention Mechanisms in Image Processing https://blog.csdn.net/weixin_41977512/article/details/83243160 Mainly explains the 2017 CVPR article: Residual Attention Network for Image Classification

Summary of Attention Mechanism Articles https://blog.csdn.net/humanpose/article/details/85332392

Latest Progress on Self-Attention Mechanisms in Computer Vision https://blog.csdn.net/SIGAI_CSDN/article/details/82664511

ECCV2018—Attention Model CBAM https://blog.csdn.net/qq_14845119/article/details/81393127

[Paper Reproduction] CBAM: Convolutional Block Attention Module https://blog.csdn.net/u013738531/article/details/82731257

BAM: Bottleneck Attention Module Algorithm Notes https://blog.csdn.net/qq_32768091/article/details/86612132

Fine-grained Image Classification Based on Attention Mechanisms https://blog.csdn.net/step_forward_ml/article/details/80630682 Discusses RACNN

Additionally, there are methods like MACNN.
