Summary and Code Implementation of Attention Mechanism in Deep Learning (2017-2021)


Author: mayiwei1998
Source: GiantPandaCV
Editor: 极市平台

Abstract

Because the network structures proposed in many papers are embedded in larger code frameworks, the code tends to be redundant. The author of this article has organized and reproduced the core code of attention networks from recent years.

Author Information: First-year graduate student in Computer Science at Xiamen University. Follow the author on GitHub: xmu-xiaoma666 and Zhihu: 努力努力再努力 (Keep working hard).

In recent years, Attention-based methods have gained popularity in academia and industry due to their interpretability and effectiveness. However, since the network structures proposed in papers are often embedded into classification, detection, segmentation, and other code frameworks, the code tends to be redundant, which makes it hard for beginners (the author included) to locate the core code of a network and understand the ideas behind the papers. Therefore, I have organized and reproduced the core code of the Attention, MLP, and re-parameterization papers I have recently read, to make them easier to understand.

This article mainly provides a brief introduction to the Attention part of the project. The project will be continuously updated with the latest research work, and everyone is welcome to follow and star it. If you find any issues during the reproduction and organization, please feel free to raise them in the Issues section, and I will respond promptly.

Project Address

https://github.com/xmu-xiaoma666/External-Attention-pytorch

1. External Attention

1.1. Citation

Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks.—arXiv 2021.05.05

Paper Link: https://arxiv.org/abs/2105.02358

1.2. Model Structure

[Figure: External Attention model structure]

1.3. Introduction

This is a paper published on arXiv in May, mainly addressing two pain points of Self-Attention (SA): (1) O(n^2) computational complexity; (2) SA computes attention between positions within the same sample, ignoring the relationships between different samples. Therefore, this paper uses two cascaded linear layers as external memory units, reducing the computational complexity to O(n); additionally, these two memory units are learned from all training data, so they implicitly take the relationships between different samples into account.
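
To make the mechanism concrete, below is a minimal PyTorch sketch of external attention, assuming two linear layers as the shared memory units M_k and M_v with the double normalization from the paper; the class name ExternalAttentionSketch and the default S=64 are illustrative choices, not the project's exact ExternalAttention implementation.

import torch
from torch import nn

class ExternalAttentionSketch(nn.Module):
    def __init__(self, d_model=512, S=64):
        super().__init__()
        self.mk = nn.Linear(d_model, S, bias=False)   # external memory unit M_k
        self.mv = nn.Linear(S, d_model, bias=False)   # external memory unit M_v

    def forward(self, x):                              # x: (B, N, d_model)
        attn = torch.softmax(self.mk(x), dim=1)        # attention over the N tokens
        attn = attn / (1e-9 + attn.sum(dim=2, keepdim=True))  # double normalization
        return self.mv(attn)                           # (B, N, d_model)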

1.4. Usage

from attention.ExternalAttention import ExternalAttention
import torch

input=torch.randn(50,49,512)   # (batch, tokens, d_model)
ea = ExternalAttention(d_model=512,S=8)
output=ea(input)
print(output.shape)

2. Self Attention

2.1. Citation

Attention Is All You Need—NeurIPS2017

Paper Link: https://arxiv.org/abs/1706.03762

2.2. Model Structure

[Figure: Self-Attention (Transformer) model structure]

2.3. Introduction

This is a paper published by Google at NeurIPS 2017, which has had a significant impact in CV, NLP, multimodal learning, and other fields, and has been cited over 22,000 times. The Self-Attention proposed in the Transformer is a form of attention that computes weights between different positions in a feature and uses them to update the feature. First, the input feature is mapped into three features Q, K, and V through FC layers; then the attention map is obtained from the scaled dot product of Q and K followed by a Softmax; finally, the weighted features are obtained by multiplying the attention map with V, and an FC layer maps the result to the new feature. (There are many excellent explanations of the Transformer and Self-Attention online, so I will not go into detail here.)
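
For reference, here is a compact multi-head scaled dot-product attention in PyTorch; it is a simplified stand-in for the ScaledDotProductAttention class used below (no masking or dropout), and the class name is illustrative.

import torch
from torch import nn

class MultiHeadSelfAttentionSketch(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.fc_q = nn.Linear(d_model, d_model)   # projections for Q, K, V
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)   # output projection

    def forward(self, q, k, v):                   # each: (B, N, d_model)
        B, N, _ = q.shape
        split = lambda t: t.view(B, -1, self.h, self.d_k).transpose(1, 2)  # (B, h, N, d_k)
        q, k, v = split(self.fc_q(q)), split(self.fc_k(k)), split(self.fc_v(v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)  # merge the heads
        return self.fc_o(out)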

2.4. Usage

from attention.SelfAttention import ScaledDotProductAttention
import torch

input=torch.randn(50,49,512)   # (batch, tokens, d_model)
sa = ScaledDotProductAttention(d_model=512, d_k=512, d_v=512, h=8)
output=sa(input,input,input)
print(output.shape)

3. Squeeze-and-Excitation (SE) Attention

3.1. Citation

Squeeze-and-Excitation Networks—CVPR2018

Paper Link: https://arxiv.org/abs/1709.01507

3.2. Model Structure

[Figure: SE module structure]

3.3. Introduction

This is a paper from CVPR2018, which is also very influential, with over 7,000 citations. It focuses on channel attention, and thanks to its simple and effective structure it sparked a small wave of interest in channel attention. The idea is very simple: first, global average pooling (AdaptiveAvgPool) is applied over the spatial dimensions; then channel attention is learned through two FC layers and normalized with a Sigmoid to obtain the channel attention map, which is multiplied with the original features to produce the weighted features.
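
The whole block fits in a few lines; the sketch below follows the description above (the class name and the bias-free FC layers are illustrative choices, not the repo's SEAttention).

import torch
from torch import nn

class SEBlockSketch(nn.Module):
    def __init__(self, channel=512, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(                           # excitation: bottleneck MLP
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                             # squeeze: global average pool -> (B, C)
        w = self.fc(s).view(b, c, 1, 1)                    # excitation: per-channel weights
        return x * w                                       # reweight the channels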

3.4. Usage

from attention.SEAttention import SEAttention
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
se = SEAttention(channel=512,reduction=8)
output=se(input)
print(output.shape)

4. Selective Kernel (SK) Attention

4.1. Citation

Selective Kernel Networks—CVPR2019

Paper Link: https://arxiv.org/pdf/1903.06586.pdf

4.2. Model Structure

[Figure: SK module structure]

4.3. Introduction

This is a CVPR2019 paper that builds on the ideas of SENet. In traditional CNNs, each convolutional layer uses kernels of the same size, which limits the model's expressive power; the Inception family has also shown that learning with multiple kernels of different sizes can indeed improve expressiveness. The authors borrow the idea from SENet and dynamically compute a weight for each convolution kernel in order to fuse the results of the different kernels.

In my opinion, the reason this paper can also be called lightweight is that the channel attention applied to the features from the different kernels shares parameters (i.e., because the features are fused before the attention is computed, the results from the different convolution kernels share one SE module).

This method consists of three parts: Split, Fuse, and Select. Split is a multi-branch operation that convolves the input with kernels of different sizes to obtain different features; Fuse uses an SE-style structure to obtain channel attention matrices (N convolution kernels yield N attention matrices, and this step shares parameters across the branch features), giving SE-weighted features for each kernel; the Select operation then sums these weighted features.
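
As a rough illustration of Split/Fuse/Select, here is a simplified SK-style block; the kernel sizes (3, 5), the grouped convolutions with groups=32, and the single shared bottleneck FC are assumptions for the sketch rather than the exact SKAttention implementation.

import torch
from torch import nn

class SKSketch(nn.Module):
    def __init__(self, channel=512, kernels=(3, 5), reduction=8):
        super().__init__()
        d = channel // reduction
        # Split: one grouped-conv branch per kernel size (different receptive fields)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channel, channel, k, padding=k // 2, groups=32, bias=False),
                nn.BatchNorm2d(channel),
                nn.ReLU(inplace=True),
            ) for k in kernels
        ])
        self.fc_z = nn.Linear(channel, d)                                   # Fuse: shared bottleneck
        self.fcs = nn.ModuleList([nn.Linear(d, channel) for _ in kernels])  # per-branch channel logits

    def forward(self, x):                                          # x: (B, C, H, W)
        feats = torch.stack([conv(x) for conv in self.convs], dim=1)   # (B, K, C, H, W)
        u = feats.sum(dim=1)                                       # fuse the branches
        z = self.fc_z(u.mean(dim=(2, 3)))                          # global pool -> (B, d)
        logits = torch.stack([fc(z) for fc in self.fcs], dim=1)    # (B, K, C)
        attn = torch.softmax(logits, dim=1).unsqueeze(-1).unsqueeze(-1)  # softmax across branches
        return (feats * attn).sum(dim=1)                           # Select: weighted sum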

4.4. Usage

from attention.SKAttention import SKAttention
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
sk = SKAttention(channel=512,reduction=8)
output=sk(input)
print(output.shape)

5. CBAM Attention

5.1. Citation

CBAM: Convolutional Block Attention Module—ECCV2018

Paper Link: https://openaccess.thecvf.com/content_ECCV_2018/papers/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.pdf

5.2. Model Structure

[Figures: CBAM channel and spatial attention module structure]

5.3. Introduction

This is an ECCV2018 paper that uses Channel Attention and Spatial Attention simultaneously, connecting the two in series (the paper also ran ablations on a parallel arrangement and the two serial orders). For channel attention, the structure is similar to SE, but the authors argue that AvgPool and MaxPool capture different information, so both are applied over the spatial dimensions, the pooled vectors are passed through a shared SE-style MLP, and the two results are added and passed through a Sigmoid to obtain the channel attention map.

Spatial attention is analogous: the two kinds of pooling are first performed along the channel dimension, the two resulting maps are concatenated, and a 7×7 convolution extracts the spatial attention (the kernel must be large enough to capture spatial context); after a Sigmoid, the spatial attention map is obtained.
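
Putting the two parts together, a bare-bones CBAM-style sketch could look like the following; the class names are illustrative, while the shared MLP for channel attention and the 7×7 convolution for spatial attention follow the description above.

import torch
from torch import nn

class ChannelAttentionSketch(nn.Module):
    def __init__(self, channel=512, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for both pooled vectors
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
        )

    def forward(self, x):                               # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))              # spatial average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))               # spatial max pooling
        return torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)

class SpatialAttentionSketch(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # channel-wise average pooling
        mx = x.amax(dim=1, keepdim=True)                # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (B, 1, H, W)

class CBAMSketch(nn.Module):
    def __init__(self, channel=512, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttentionSketch(channel, reduction)
        self.sa = SpatialAttentionSketch(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)                              # channel attention first
        return x * self.sa(x)                           # then spatial attention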

5.4. Usage

from attention.CBAM import CBAMBlock
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
kernel_size=input.shape[2]      # spatial-attention kernel size (7, matching the feature map here)
cbam = CBAMBlock(channel=512,reduction=16,kernel_size=kernel_size)
output=cbam(input)
print(output.shape)

6. BAM Attention

6.1. Citation

BAM: Bottleneck Attention Module—BMVC2018

Paper Link: https://arxiv.org/pdf/1807.06514.pdf

6.2. Model Structure

[Figure: BAM module structure]

6.3. Introduction

This work was published around the same time as CBAM, by the same authors, and is very similar: it also uses dual attention. The difference is that CBAM applies its two attentions sequentially (in series), whereas BAM adds the two attention matrices directly.

The channel attention is basically the same as SE. For spatial attention, the channel dimension is first reduced (with a 1×1 convolution), then two 3×3 dilated convolutions are applied, and finally a 1×1 convolution produces the spatial attention matrix.

Finally, the Channel Attention and Spatial Attention matrices are summed (using broadcasting), resulting in the combined attention matrix.
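A minimal BAM-style sketch of this description is given below: the channel and spatial branches are computed separately, broadcast-added, passed through a Sigmoid, and used in the residual form x * (1 + attention) as in the paper; omitting BatchNorm and other details is a simplification.

import torch
from torch import nn

class BAMSketch(nn.Module):
    def __init__(self, channel=512, reduction=16, dilation=2):
        super().__init__()
        # Channel branch: global pool + SE-style bottleneck MLP (no Sigmoid yet)
        self.channel = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
        )
        # Spatial branch: 1x1 reduce, two dilated 3x3 convs, 1x1 down to a single map
        self.spatial = nn.Sequential(
            nn.Conv2d(channel, channel // reduction, 1),
            nn.Conv2d(channel // reduction, channel // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.Conv2d(channel // reduction, channel // reduction, 3,
                      padding=dilation, dilation=dilation),
            nn.Conv2d(channel // reduction, 1, 1),
        )

    def forward(self, x):                                         # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ca = self.channel(x.mean(dim=(2, 3))).view(b, c, 1, 1)    # (B, C, 1, 1)
        sa = self.spatial(x)                                      # (B, 1, H, W)
        attn = torch.sigmoid(ca + sa)                             # broadcast add, then Sigmoid
        return x * (1 + attn)                                     # residual-style reweighting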

6.4. Usage

from attention.BAM import BAMBlock
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
bam = BAMBlock(channel=512,reduction=16,dia_val=2)   # dia_val: dilation of the spatial-branch convolutions
output=bam(input)
print(output.shape)

7. ECA Attention

7.1. Citation

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks—CVPR2020

Paper Link: https://arxiv.org/pdf/1910.03151.pdf

7.2. Model Structure

[Figure: ECA module structure]

7.3. Introduction

This is a paper from CVPR2020.

As shown in the figure, SE implements channel attention with two fully connected layers, while ECA only needs a single 1D convolution. The authors' reasoning is that computing attention between all pairs of channels is unnecessary, and that the two fully connected layers introduce too many parameters and too much computation.

Thus, after global average pooling, only a one-dimensional convolution with receptive field k is applied (equivalent to each channel attending only to its k neighbouring channels), which greatly reduces the parameters and computation (i.e., SE is a global channel attention, while ECA is a local one).
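
The whole module is essentially one Conv1d over the pooled channel vector; a minimal sketch (class name illustrative):

import torch
from torch import nn

class ECASketch(nn.Module):
    def __init__(self, kernel_size=3):
        super().__init__()
        # 1D conv across the channel dimension: each channel attends to k neighbours
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                 # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                            # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)          # treat the channels as a 1D sequence
        return x * torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)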

7.4. Usage

from attention.ECAAttention import ECAAttention
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
eca = ECAAttention(kernel_size=3)   # k: how many neighbouring channels each channel attends to
output=eca(input)
print(output.shape)

8. DANet Attention

8.1. Citation

Dual Attention Network for Scene Segmentation—CVPR2019

Paper Link: https://arxiv.org/pdf/1809.02983.pdf

8.2. Model Structure

[Figures: DANet model structure and attention modules]

8.3. Introduction

This is a CVPR2019 paper that applies self-attention to the scene segmentation task. The difference is that, besides the usual position (spatial) attention, the paper adds a channel attention branch that operates like self-attention but without the three Linear layers that generate Q, K, and V. Finally, the features from the two attention branches are summed element-wise.
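
As an illustration of the channel branch, the simplified sketch below computes channel-to-channel affinities directly from the feature map without Q/K/V projections; the real DAModule also contains a position-attention branch and a few extra details omitted here.

import torch
from torch import nn

class ChannelAttentionBranchSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))         # learnable residual weight

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        feat = x.reshape(b, c, -1)                        # (B, C, HW)
        energy = feat @ feat.transpose(1, 2)              # (B, C, C) channel affinities
        attn = torch.softmax(energy, dim=-1)
        out = (attn @ feat).reshape(b, c, h, w)           # reweight channels by their affinities
        return self.gamma * out + x                       # residual connection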

8.4. Usage

from attention.DANet import DAModule
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
danet=DAModule(d_model=512,kernel_size=3,H=7,W=7)   # H, W must match the input's spatial size
print(danet(input).shape)

9. Pyramid Split Attention (PSA)

9.1. Citation

EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network—arXiv 2021.05.30

Paper Link: https://arxiv.org/pdf/2105.14447.pdf

9.2. Model Structure

[Figures: EPSANet / Pyramid Split Attention module structure]

9.3. Introduction

This is a paper uploaded to arXiv on May 30 by Shenzhen University, whose goal is to obtain and exploit spatial information at different scales to enrich the feature space. The network structure is relatively simple and consists of four steps: first, split the original features into n groups along the channel dimension and convolve each group with a different kernel size to obtain the multi-scale features W1; second, apply SE to these multi-scale features to obtain per-group channel attention; third, apply a Softmax across the different groups; fourth, multiply the resulting attention with the corresponding multi-scale features W1.
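
A simplified sketch of these four steps is given below; the group count of 4, the kernel sizes 3/5/7/9, and the per-group SE built from 1×1 convolutions are illustrative assumptions, not the exact PSA module.

import torch
from torch import nn

class PSASketch(nn.Module):
    def __init__(self, channel=512, groups=4, reduction=8):
        super().__init__()
        self.groups = groups
        gc = channel // groups                            # channels per group
        kernels = [3, 5, 7, 9][:groups]
        # Step 1: one conv per group, each with a different kernel size
        self.convs = nn.ModuleList([nn.Conv2d(gc, gc, k, padding=k // 2) for k in kernels])
        # Step 2: per-group SE implemented with 1x1 convs
        self.se = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(gc, gc // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(gc // reduction, gc, 1),
                nn.Sigmoid(),
            ) for _ in kernels
        ])

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        chunks = x.chunk(self.groups, dim=1)              # split the channels into groups
        feats = [conv(ch) for conv, ch in zip(self.convs, chunks)]   # multi-scale features W1
        attns = [se(f) for se, f in zip(self.se, feats)]             # per-group channel attention
        feats = torch.stack(feats, dim=1)                 # (B, G, C/G, H, W)
        attns = torch.softmax(torch.stack(attns, dim=1), dim=1)      # Step 3: Softmax across groups
        return (feats * attns).reshape(b, c, h, w)        # Step 4: reweight and merge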

9.4. Usage

from attention.PSA import PSA
import torch

input=torch.randn(50,512,7,7)   # (batch, channels, H, W)
psa = PSA(channel=512,reduction=8)
output=psa(input)
print(output.shape)

10. Efficient Multi-Head Self-Attention (EMSA)

10.1. Citation

ResT: An Efficient Transformer for Visual Recognition—arXiv 2021.05.28

Paper Link: https://arxiv.org/abs/2105.13677

10.2. Model Structure

[Figure: ResT / EMSA module structure]

10.3. Introduction

This is a paper uploaded to arXiv by Nanjing University on May 28. It addresses two pain points of SA: (1) the computational complexity of Self-Attention is quadratic in n (where n is the size of the spatial dimension); (2) each head only sees part of q, k, and v, and if the per-head dimensions of q, k, and v are too small, continuity of information is lost and performance suffers. The idea is quite simple: before the linear projections that produce K and V, a convolution is used to reduce the spatial dimension, so that K and V are smaller along the spatial dimension and the attention cost drops.
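
A stripped-down sketch of this spatial-reduction idea follows (not the full EMSA, which adds further components such as a transform across heads): a strided convolution shrinks the token map before K and V are projected.

import torch
from torch import nn

class SpatialReductionAttentionSketch(nn.Module):
    def __init__(self, d_model=512, h=8, ratio=2):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.sr = nn.Conv2d(d_model, d_model, ratio, stride=ratio)   # shrink the token map
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x, H, W):                           # x: (B, N, d_model), N = H * W
        B, N, C = x.shape
        q = self.q(x).view(B, N, self.h, self.d_k).transpose(1, 2)   # (B, h, N, d_k)
        xr = self.sr(x.transpose(1, 2).reshape(B, C, H, W))          # spatially reduced map
        xr = xr.flatten(2).transpose(1, 2)                           # (B, N/ratio^2, C)
        k, v = self.kv(xr).chunk(2, dim=-1)                          # smaller K and V
        k = k.view(B, -1, self.h, self.d_k).transpose(1, 2)
        v = v.view(B, -1, self.h, self.d_k).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# e.g. SpatialReductionAttentionSketch()(torch.randn(50, 64, 512), H=8, W=8) -> (50, 64, 512)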

10.4. Usage

from attention.EMSA import EMSA
import torch
from torch import nn
from torch.nn import functional as F

input=torch.randn(50,64,512)   # (batch, tokens, d_model); 64 tokens = H*W = 8*8
emsa = EMSA(d_model=512, d_k=512, d_v=512, h=8,H=8,W=8,ratio=2,apply_transform=True)
output=emsa(input,input,input)
print(output.shape)

Conclusion

Currently, the Attention work organized in this project is not comprehensive enough. As the readership increases, this project will be continuously improved. Everyone is welcome to star and support. If there are any inappropriate expressions or errors in the code implementation in the article, please feel free to point them out~
