Understanding Attention Mechanisms in Computer Vision

This article introduces the mechanism of visual attention in computer vision. To reach a broad audience, a progressive writing style is adopted: Part 1 and most of Part 2 require no specialist background, while the later sections dig deeper into attention mechanisms in computer vision.

1 Introduction

Attention is actually a very common yet often overlooked phenomenon. For example, when a bird flies across the sky, your attention often follows the bird, making the sky background information in your visual system.


Typically, neural networks recognize objects by training on large amounts of data [1]. A neural network contains many neurons; for instance, if a neural network has seen a lot of handwritten digits, it can recognize a new handwritten digit’s value.


However, the neural network treats all image features equally. Although the neural network learns the features of the image for classification, these features have no distinction in the neural network’s ‘eyes’; it does not focus on any particular ‘region’. In contrast, human attention tends to concentrate on a specific area of the image, while the attention given to other information decreases accordingly.


The basic idea of attention mechanisms in computer vision is to enable the system to learn attention—to ignore irrelevant information while focusing on key information. Why ignore irrelevant information?

For example, when we sit in a coffee shop playing on our phones, if our attention is on our phone, we are generally unaware of what is being said around us. However, if we happen to want to hear someone speak, we look away from our phone and focus our attention on that person’s voice, allowing us to clearly hear the conversation.

Visual attention is similar; when you glance quickly, it is difficult to notice some information, but if you focus your attention, the details of the object will form an impression in your mind.

A neural network behaves similarly: if you do not tell it to focus on the bird, the image is still dominated by the sky, so it will conclude this is a photo of the sky rather than of a bird.

This article focuses on attention mechanisms in computer vision, but similar attention mechanisms also exist in natural language processing (NLP) and visual question answering (VQA) systems. For related articles, see the Review of Attention Models.

2 Overview of Attention Research Progress

Having roughly understood the underlying concept of attention mechanisms, we should take time to study how to implement attention mechanisms in visual systems. Early research on attention focused on analyzing brain imaging mechanisms, employing a winner-take-all [2] mechanism to study how to model attention, which will not be analyzed further here.

In today’s deep learning era, building neural networks with attention mechanisms has become increasingly important. On one hand, such networks can learn attention mechanisms autonomously; on the other hand, attention mechanisms can help us understand the world as seen by neural networks [3].

In recent years, most research combining deep learning with visual attention mechanisms has focused on using masks to form attention mechanisms. The principle of masks is to identify key features in image data through another layer of new weights, allowing deep neural networks to learn the regions that need attention in each new image, thus forming attention.
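To make the mask idea concrete, here is a minimal sketch (in PyTorch, a framework choice of mine, not tied to any particular paper): the mask is simply a second set of weights in [0, 1] that rescales the features element-wise.

```python
import torch

features = torch.randn(1, 64, 32, 32)             # output of some layer
# Stand-in for learned mask weights; in a real network these come
# from a small sub-network trained end to end.
mask = torch.sigmoid(torch.randn_like(features))  # weights in (0, 1)
attended = mask * features                        # key features kept, rest suppressed
```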

This idea has evolved into two different types of attention: soft attention and hard attention.

The key point of soft attention is that it focuses more on regions [4] or channels [5], and soft attention is deterministic; after learning, it can be generated directly through the network. Most critically, soft attention is differentiable, which is very important: differentiable attention can compute gradients through the neural network and learn the attention weights via forward propagation and backpropagation [6].

Hard attention [7] differs from soft attention in that it focuses more on points, meaning that every point in the image can extend attention, and hard attention is a random prediction process, emphasizing dynamic changes. Most importantly, hard attention is non-differentiable, and the training process is often completed using reinforcement learning.

To clearly introduce the attention mechanism in computer vision, this article analyzes several implementation methods from the perspective of attention domains. There are three main types of attention domains: the spatial domain, the channel domain, and the mixed domain.

There is also a relatively distinct implementation of hard ("strong") attention in the temporal domain; because it is trained with reinforcement learning, its training process differs, so it is analyzed in detail later.

3 Soft Attention in the Spatial Domain

This section will quickly present the issues by introducing three articles, showcasing how to implement deep learning models with attention mechanisms through the different attention domains in these articles. Each article’s introduction will be divided into two parts: first, an introduction to the model’s design concept, and then a deep dive into the model structure (architecture).

If you are reading this article for the first time or only skimming, you can skip the model structure parts and focus on the design concepts; return to the model structures on a careful second read.

3.1 Spatial Domain

Design Concept:

The Spatial Transformer Networks (STN) model [4] is an article from NIPS 2015. This article uses attention mechanisms to transform spatial information from the original image into another space while preserving key information.

This article’s idea is very clever because pooling layers in convolutional neural networks (CNNs) directly use methods like max pooling or average pooling to compress image information, reducing computational load and improving accuracy.

However, this article argues that previous pooling methods are too brute-force; directly merging information can lead to the loss of key information. Therefore, it proposes a module called a spatial transformer, which performs corresponding spatial transformations on the spatial domain information of the image, allowing key information to be extracted.

Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample.

Spatial Transformer Model Experimental Results

For example, this intuitive experimental diagram:

  1. Column (a) shows the original inputs: the first handwritten digit 7 is unchanged, the second handwritten digit 5 has been rotated, and the third handwritten digit 6 has added noise;

  2. The colored borders in column (b) are the bounding boxes learned by the spatial transformer; each box corresponds to a spatial transformation learned for that image;

  3. Column (c) shows the feature maps after the spatial transformation: the key region of the 7 has been selected, the 5 has been rotated upright, and the noise around the 6 has been excluded;

  4. Finally, the transformed feature maps are used to predict the handwritten digit values in column (d).

The spatial transformer is essentially an implementation of the attention mechanism, as the trained spatial transformer can identify the areas in the image that need attention, and it can also perform rotation and scaling transformations, allowing important local information to be extracted through transformations.

Model Structure:

Figure: the spatial transformer module

This is the most important spatial transformation module in the spatial transformer network, which can be added as a new layer directly to existing network structures, such as ResNet. Let’s carefully study the input of this model:

$$U \in \mathbb{R}^{H \times W \times C}$$

Neural networks train on tensors, where H is the height of the tensor from the previous layer, W its width, and C its number of channels: for example, the three basic channels of an image (RGB), or the channels produced by different convolution kernels in a convolutional layer.

This input then enters two routes: one route goes to the localization network (localisation net), and the other route sends the raw signal directly to the sampling layer (sampler).

The localization network learns a set of parameters θ, which serve as the parameters of the grid generator; the grid generator produces a sampling grid $T_\theta(G)$, effectively a transformation matrix which, applied to the original feature map, yields the transformed output V.


V is the transformed image feature; the size and shape of the output can be adjusted through the transformation matrix.

Figure: sampling the input with (a) the identity transformation and (b) an affine transformation

This figure shows that the sampling grid generated by the spatial transformer can extract the key signal from the original image: the sampling matrix in (a) is the identity, applying no transformation, while the sampling matrix in (b) applies scaling and rotation.

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

The 2×3 matrix of parameters θ in this equation is the affine sampling matrix, mapping each target coordinate $(x_i^t, y_i^t)$ to a source coordinate $(x_i^s, y_i^s)$.
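As a worked example (my own, not from the paper), an affine matrix that rotates by an angle α, scales by s, and translates by $(t_x, t_y)$ is

$$A_\theta = \begin{bmatrix} s\cos\alpha & -s\sin\alpha & t_x \\ s\sin\alpha & s\cos\alpha & t_y \end{bmatrix},$$

with the identity transform recovered at $s = 1$, $\alpha = 0$, $t_x = t_y = 0$.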

The main benefit of adding this module is that it can pick out the key information in the previous layer's signal (attention), and the transformation is differentiable: each target point's value is a combination of source points' values. This combination can be written as a linear combination, and more complex transformations can be expressed through a kernel function:

$$V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, k(x_i^s - m;\, \Phi_x)\, k(y_i^s - n;\, \Phi_y)$$

Here V is the transformed feature, U is the feature before transformation, and k is a sampling kernel (for example, bilinear interpolation).

Theoretically, such a module can be added to any layer since it can process both channel and matrix information simultaneously.
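As a concrete illustration, here is a minimal sketch of such a module in PyTorch (the framework choice and layer sizes are my own assumptions, not from the paper): `affine_grid` plays the role of the grid generator and `grid_sample` the differentiable sampler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Minimal spatial transformer layer: a localisation net predicts a
    2x3 affine matrix theta, which is used to resample the input."""
    def __init__(self, in_channels):
        super().__init__()
        # Localisation network: regresses the 6 affine parameters theta.
        self.loc = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # Start at the identity transform so early training is stable.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, u):                                  # u: (N, C, H, W)
        theta = self.loc(u).view(-1, 2, 3)                 # per-sample sampling matrix
        grid = F.affine_grid(theta, u.size(), align_corners=False)  # grid generator
        return F.grid_sample(u, grid, align_corners=False)          # sampler: outputs V
```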

However, since the article applies the same transformation to all channels, I believe this module is better suited to the raw image input layer: after a convolutional layer, the channels produced by different convolution kernels carry different amounts and kinds of information, so applying one shared transformer to all of them offers weak interpretability. This motivates the second type of attention domain: the channel domain attention mechanism.

3.2 Channel Domain

Design Concept:

The principle of channel domain [5] attention mechanisms is quite simple; we can understand it from the perspective of basic signal decomposition. In signal analysis, any signal can be expressed as a linear combination of sine waves; after a time-frequency transform (note 4), a continuous time-domain signal can be represented by its frequency components instead.

Signal Time-Frequency Decomposition Diagram

In convolutional neural networks, each image is initially represented by three channels (R, G, B), and then, after passing through different convolution kernels, each channel generates new signals. For example, if the image features are convolved with 64 kernels, it will produce a matrix of 64 new channels (H, W, 64), where H and W represent the height and width of the image features.

Each channel's features represent that image's response to the corresponding convolution kernel, much like a time-frequency transform: convolving with the kernels is akin to taking a Fourier transform of the signal, decomposing the image information into components over the 64 kernels.
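In code, this is just the shape bookkeeping of a convolutional layer (a PyTorch sketch; the sizes are chosen purely for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                    # one RGB image: (N, C, H, W)
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 64 convolution kernels
u = conv(x)                                        # 64 new channels
print(u.shape)                                     # torch.Size([1, 64, 224, 224])
```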

Different Convolution Kernels

Since each signal can be decomposed into components based on kernel functions, the new 64 channels produced will contribute to key information to varying degrees. If we assign a weight to the signal on each channel to represent its relevance to key information, a larger weight indicates a higher correlation, meaning that channel is more deserving of our attention.

Model Structure:

Paper [5] proposes the SENet model, which won the ImageNet 2017 classification competition; it is a highly creative design.

Figure: the SENet (Squeeze-and-Excitation) module

First, on the far left is the input feature map X, which undergoes a transformation $F_{tr}$ (for example, a convolution), producing a new feature map U. U has C channels, and we hope to learn a weight for each channel through the attention module, forming channel domain attention.

The middle module is the innovative part of SENet: the attention mechanism module, which consists of three operations: squeeze, excitation, and scale.

Squeeze function $F_{sq}$:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

This function performs a global average, summing and averaging all feature values within each channel; it is the mathematical expression of global average pooling.

Excitation function $F_{ex}$:

$$s = F_{ex}(z, W) = \sigma(W_2\, \delta(W_1 z))$$

The δ function is ReLU, while σ is the sigmoid activation. The weight dimensions are $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, where r is a reduction ratio. By learning these two weight matrices, a one-dimensional excitation weight is obtained to activate each channel.

Scale function $F_{scale}$:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$

This step is essentially a scaling process, where the values of different channels are multiplied by different weights to enhance attention to key channel domains.
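Putting the three steps together, here is a minimal SE block sketch in PyTorch (my own rendering of the equations above; the reduction ratio r = 16 follows the paper's default):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: global average pooling (squeeze),
    a two-layer bottleneck with ReLU and sigmoid (excitation), then
    channel-wise rescaling (scale)."""
    def __init__(self, channels, r=16):                 # r: reduction ratio
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # F_sq: (N,C,H,W) -> (N,C,1,1)
        self.excite = nn.Sequential(                    # F_ex: s = sigmoid(W2 relu(W1 z))
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, u):                               # u: (N, C, H, W)
        n, c, _, _ = u.shape
        z = self.squeeze(u).view(n, c)                  # per-channel statistic z
        s = self.excite(z).view(n, c, 1, 1)             # per-channel weight s
        return u * s                                    # F_scale: reweighted channels
```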

3.3 Mixed Domain

Having understood the design concepts of the first two attention domains, a simple comparison can be made. Spatial domain attention ignores channel information and treats the image features in every channel identically; this restricts spatial transformation methods to the raw-image feature extraction stage and gives them weak interpretability in deeper layers of the network.

On the other hand, channel domain attention directly performs global average pooling on the information within a channel, neglecting the local information within each channel, which is also a rather brute-force approach. Therefore, by combining the two ideas, a mixed domain attention mechanism model can be designed [8].

Design Concept:

The attention mechanism proposed in article [8] is related to deep residual networks (Deep Residual Network). The basic idea is to apply the attention mechanism to ResNet, allowing the network to train deeper.

The attention mechanism in the article is based on the soft attention mask mechanism. However, unlike the previous methods, this attention mechanism’s mask draws from the idea of residual networks, not only adding the mask based on the current network layer’s information but also passing down the information from the previous layer. This prevents the issue of too little information after masking, which can hinder the stacking of network layers.

As previously mentioned, the attention mask proposed in [8] is not limited to spatial or channel domain attention; this mask can be seen as the weight of each feature element. By finding the corresponding attention weight for each feature element, both spatial and channel domain attention mechanisms can be formed simultaneously.

Many readers may wonder why this approach, which seems like a natural transition from spatial or channel domain attention, hasn’t been considered by those who have focused on single-domain attention. The reasons are:

  • If you assign a mask weight to each feature element, the information after masking will be very sparse, potentially damaging the deep-layer feature information of the network;

  • Furthermore, naively multiplying in a mask disrupts the identity mapping property of residual units, making deep networks difficult to train.

In the paper's own words: “First, dot production with mask ranging from zero to one repeatedly will degrade the value of features in deep layers. Second, soft mask can potentially break the good property of trunk branch, for example, the identical mapping of Residual Unit.”

Thus, the innovation of the attention mechanism in this article lies in the introduction of residual attention learning, which not only uses the masked feature tensor as input for the next layer but also includes the feature tensor before masking as input for the next layer. This results in richer features, thereby better focusing on key features.

Model Structure:

Figure: Residual Attention Network architecture

The model structure in the article is very clear; overall, it consists of a three-stage attention module (3-stage attention module). Each attention module can be divided into two branches (see stage 2); the upper branch is called the trunk branch, which is based on the standard residual network (ResNet) structure. The lower branch is the soft mask branch, which contains the main part of the residual attention learning mechanism. Through down-sampling and up-sampling, along with residual units, the attention mechanism is formed.

The innovative residual attention mechanism in the model structure is:

$$H_{i,c}(x) = (1 + M_{i,c}(x)) \cdot F_{i,c}(x)$$

Here i indexes spatial positions and c indexes channels; H is the output of the attention module, F is the feature tensor from the trunk branch, and M is the soft mask, with values in [0, 1]. This constitutes the residual attention module, which passes both the original features and the attention-enhanced features to the next module (a code sketch of this combination follows the list below). The activation function f used to produce the mask can be chosen to yield different attention domains:

$$f_1(x_{i,c}) = \frac{1}{1 + e^{-x_{i,c}}}, \qquad f_2(x_{i,c}) = \frac{x_{i,c}}{\|x_i\|}, \qquad f_3(x_{i,c}) = \frac{1}{1 + e^{-(x_{i,c} - \text{mean}_c)/\text{std}_c}}$$

  1. f_{1} applies a sigmoid directly to each element of the feature tensor, yielding mixed domain attention;

  2. f_{2} performs L2 normalization over the channels at each spatial position, yielding channel domain attention (in spirit similar to SENet [5]);

  3. f_{3} normalizes each channel's feature map by its mean and standard deviation before the sigmoid, effectively discarding channel information and yielding spatial domain attention.
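Putting the residual attention formula above into code, a minimal sketch (PyTorch; the shapes and the random stand-in mask are purely illustrative):

```python
import torch

def residual_attention(trunk, mask):
    """H = (1 + M) * F from [8]: the soft mask M (values in [0, 1])
    reweights the trunk features F, while the identity term preserves
    the original signal so stacked modules remain trainable."""
    return (1 + mask) * trunk

# Illustrative shapes; in the real network both branches are learned.
f = torch.randn(2, 64, 32, 32)           # trunk branch output F(x)
m = torch.sigmoid(torch.randn_like(f))   # soft mask branch output M(x)
h = residual_attention(f, m)             # H(x), passed to the next stage
```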

4 Temporal Domain Attention

This concept is relatively broad, as computer vision typically recognizes static images with no notion of a temporal domain. However, article [7] proposes an attention model built on recurrent neural networks (RNN).

RNN models are particularly suitable for scenarios where data has temporal features; for instance, RNNs perform well in natural language processing tasks. This is because natural language processing involves text analysis, where there is a temporal correlation behind the text, such as one word being followed by another, creating a temporal dependency.

In contrast, image data does not inherently possess temporal features; an image often represents a sample at a single point in time. However, in video data, RNNs serve as a suitable data model, allowing the use of RNNs to generate recognition attention.

The RNN model is referred to as temporal domain attention because it introduces a new temporal dimension above the previously discussed spatial, channel, and mixed domains. This dimension arises from the temporal features of sampling points.

The Recurrent Attention Model [7] treats the attention mechanism as sampling a point in a region of the image, where the sampled point is the focus of attention. In this model, attention is no longer differentiable, making it a hard attention model; training therefore requires reinforcement learning and takes more time.

What matters here is not the RNN attention model itself, which has been covered in more detail in natural language processing, particularly machine translation, but how this model converts image information into sequential sampling signals:

Figure: the Glimpse Sensor

This is the key component of the model, called the Glimpse Sensor. The sensor first determines a point (pixel location) in the image that needs attention, then collects three patches of equal information volume: a very detailed crop (the innermost frame), a medium-scale local crop, and a coarse thumbnail covering a wider region.

These three pieces of sampled information arise from the image information at position $l_{t-1}$, and as time progresses, the sampling position changes with increasing t. How l changes with t is what needs to be trained using reinforcement learning.
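The multi-resolution sampling can be sketched as follows (a PyTorch sketch under my own assumptions about coordinates and scales; the paper's retina extracts patches of increasing size around $l_{t-1}$ and resizes them to a common resolution):

```python
import torch
import torch.nn.functional as F

def glimpse(image, center, size=8, scales=3):
    """Sketch of a glimpse sensor: crop `scales` patches centred on
    `center` (row, col in pixels), each twice as wide as the last,
    and resize them all to `size` x `size`, so the detailed crop and
    the coarse thumbnail carry equal information volume."""
    cy, cx = center
    patches = []
    for s in range(scales):
        half = (size * 2 ** s) // 2          # patch half-width at this scale
        padded = F.pad(image, (half, half, half, half))  # keep border crops valid
        crop = padded[:, :, cy:cy + 2 * half, cx:cx + 2 * half]
        patches.append(F.interpolate(crop, size=(size, size),
                                     mode='bilinear', align_corners=False))
    return torch.cat(patches, dim=1)         # (N, C * scales, size, size)

# Hypothetical usage: three patches (8x8, 16x16, 32x32) around pixel (50, 70).
img = torch.randn(1, 1, 100, 100)
g = glimpse(img, center=(50, 70))            # shape: (1, 3, 8, 8)
```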

For those interested in RNN attention, it is recommended to explore natural language processing, as this area has been discussed in more detail. For further insights, refer to the Review of Attention Models.

5 Conclusion

This article introduces the attention mechanism in computer vision, first illustrating what attention mechanisms are and why they are necessary. Next, it discusses the latest research progress on attention mechanisms in computer vision from the perspectives of soft and hard attention. The article then analyzes the design concepts and model structures of three soft attention mechanisms from the perspective of attention domains, and finally introduces the temporal domain attention model.

Notes

Note 1: No attention mechanism

Note 2: Regions refer not only to a specific area of the image but may also include channel information.

Note 3: The term strong attention is specifically used to correspond to reinforcement learning.

Note 4: Generally, Fourier transforms are used, which are also convolution changes.

References

[1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.

[2] Itti, Laurent, and Christof Koch. “Computational modelling of visual attention.” Nature reviews neuroscience 2.3 (2001): 194.

[3] Zhang, Quanshi, Ying Nian Wu, and Song-Chun Zhu. “Interpretable Convolutional Neural Networks.” arXiv preprint arXiv:1710.00935 (2017).

[4] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. “Spatial transformer networks.” Advances in neural information processing systems. 2015.

[5] Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-excitation networks.” arXiv preprint arXiv:1709.01507 (2017).

[6] Zhao, Bo, et al. “Diversified visual attention networks for fine-grained object classification.” IEEE Transactions on Multimedia 19.6 (2017): 1245-1256.

[7] Mnih, Volodymyr, Nicolas Heess, and Alex Graves. “Recurrent models of visual attention.” Advances in neural information processing systems. 2014.

[8] Wang, Fei, et al. “Residual attention network for image classification.” arXiv preprint arXiv:1704.06904 (2017).
