
Abstract
Because the network structures proposed in papers are usually embedded in larger code frameworks, the accompanying code tends to be quite redundant. The author of this article has organized and reproduced the core code of the Attention networks published in recent years.
Author information: first-year graduate student in Computer Science at Xiamen University. Feel free to follow me on GitHub: xmu-xiaoma666, and on Zhihu: Work Harder.
In recent years, Attention-based methods have gained popularity in both academia and industry thanks to their interpretability and effectiveness. However, the network structures proposed in papers are usually embedded in classification, detection, segmentation, and other code frameworks, so the code tends to be redundant, and beginners like me often struggle to locate the core code of a network, which makes the papers and their underlying ideas harder to understand. I have therefore organized and reproduced the core code of recent papers on Attention, MLP, and Re-parameterization to help readers understand them.
This article gives a brief introduction to the Attention part of the project. The project will keep being updated with the latest papers, so readers are welcome to follow and star it. If you run into any problems with the reproduced or organized code, please raise them in the issues section and I will respond promptly~
Project Address
https://github.com/xmu-xiaoma666/External-Attention-pytorch
1. External Attention
1.1. Citation
Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks—arXiv 2021.05.05
Paper link: https://arxiv.org/abs/2105.02358
1.2. Model Structure

1.3. Introduction
This paper was uploaded to arXiv in May 2021. It addresses two pain points of Self-Attention (SA): (1) the O(n^2) computational complexity; (2) SA computes attention between positions within the same sample, ignoring relationships between different samples. The paper therefore uses two cascaded linear layers as external memory units, reducing the computational complexity to O(n); because these memory units are learned from all training data, the relationships between different samples are taken into account implicitly.
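The core of the mechanism can be summarized in a few lines. The sketch below is a minimal restatement of the idea, not the project's implementation; the class name is hypothetical, and the double-normalization layout follows the paper's description.

import torch
from torch import nn

class ExternalAttentionSketch(nn.Module):  # illustrative name, not the project's ExternalAttention
    def __init__(self, d_model=512, S=64):
        super().__init__()
        self.mk = nn.Linear(d_model, S, bias=False)  # external "key" memory
        self.mv = nn.Linear(S, d_model, bias=False)  # external "value" memory

    def forward(self, x):                                      # x: (B, N, d_model)
        attn = torch.softmax(self.mk(x), dim=1)                # (B, N, S), softmax over positions
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)   # double normalization
        return self.mv(attn)                                   # (B, N, d_model), linear in N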
1.4. Usage
from attention.ExternalAttention import ExternalAttention
import torch
input=torch.randn(50,49,512)
ea = ExternalAttention(d_model=512,S=8)
output=ea(input)
print(output.shape)
2. Self Attention
2.1. Citation
Attention Is All You Need—NeurIPS2017
Paper link: https://arxiv.org/abs/1706.03762
2.2. Model Structure

2.3. Introduction
This paper was published by Google at NeurIPS 2017 and has had a huge impact on CV, NLP, multi-modal tasks, and other fields, currently with over 22,000 citations. The Self-Attention proposed in the Transformer computes attention weights between different positions of a feature and uses them to update the feature. First, the input feature is mapped through FC layers to obtain Q, K, and V; then the attention map is obtained from the scaled dot product of Q and K followed by a softmax; the attention map is multiplied by V to obtain the weighted features; finally, an FC layer maps the result to a new feature. (There are many excellent explanations of the Transformer and Self-Attention online, so I won't go into detail here.)
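For reference, here is a minimal single-head version of scaled dot-product self-attention. It is a sketch of the computation described above, not the module used in this project (the project's ScaledDotProductAttention is multi-head, as the h=8 argument in the usage below shows).

import torch
from torch import nn

class SingleHeadSelfAttention(nn.Module):  # illustrative, single head for clarity
    def __init__(self, d_model=512):
        super().__init__()
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                                    # x: (B, N, d_model)
        q, k, v = self.fc_q(x), self.fc_k(x), self.fc_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N)
        return self.fc_o(attn @ v)                           # weighted sum over positions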
2.4. Usage
from attention.SelfAttention import ScaledDotProductAttention
import torch
input=torch.randn(50,49,512)
sa = ScaledDotProductAttention(d_model=512, d_k=512, d_v=512, h=8)
output=sa(input,input,input)
print(output.shape)
3. Squeeze-and-Excitation (SE) Attention
3.1. Citation
Squeeze-and-Excitation Networks—CVPR2018
Paper link: https://arxiv.org/abs/1709.01507
3.2. Model Structure

3.3. Introduction
This CVPR 2018 paper is also very influential, currently with over 7,000 citations. It focuses on channel attention, and because its structure is simple yet effective, it set off a small wave of interest in channel attention. The idea is straightforward: first apply AdaptiveAvgPool over the spatial dimensions, then learn channel attention through two FC layers, normalize with a Sigmoid to obtain the Channel Attention Map, and finally multiply the Channel Attention Map with the original features to obtain the reweighted features.
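The whole block fits in a handful of lines. The following is a minimal sketch assuming channel=512 and reduction=8 as in the usage example below; the class name is hypothetical and the project's SEAttention may differ in detail.

import torch
from torch import nn

class SESketch(nn.Module):  # illustrative, not the project's SEAttention
    def __init__(self, channel=512, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                 # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction), nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel), nn.Sigmoid(),
        )

    def forward(self, x):                                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                                           # reweight the channels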
3.4. Usage
from attention.SEAttention import SEAttention
import torch
input=torch.randn(50,512,7,7)
se = SEAttention(channel=512,reduction=8)
output=se(input)
print(output.shape)
4. Selective Kernel (SK) Attention
4.1. Citation
Selective Kernel Networks—CVPR2019
Paper link: https://arxiv.org/pdf/1903.06586.pdf
4.2. Model Structure

4.3. Introduction
This is a CVPR2019 paper that pays homage to the ideas of SENet. In traditional CNNs, each convolutional layer uses convolution kernels of the same size, which limits the model’s expressive power; the Inception model structure has also validated that using multiple convolution kernels of different sizes can indeed enhance the model’s expressive power. The author draws on the ideas of SENet, dynamically calculating the weights of each convolution kernel to fuse the results from different convolution kernels.
In my opinion, this method can also be called lightweight because, when channel attention is applied to the features from the different kernels, the parameters are shared (i.e., the features are fused before the attention is computed, so the results of the different convolution kernels go through a single shared SE module).
The method has three parts: Split, Fuse, Select. Split is a multi-branch operation that convolves the input with kernels of different sizes to obtain different features; Fuse uses an SE-style structure to obtain channel attention for each branch (N convolution kernels yield N attention vectors, computed with shared parameters), producing the reweighted features of the different kernels; Select then sums these reweighted features, as sketched below.
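Below is a rough sketch of the Split/Fuse/Select pipeline for two kernel sizes (3 and 5). It is a simplified illustration under my own layout assumptions (plain convolutions, a shared bottleneck FC), not the project's SKAttention.

import torch
from torch import nn

class SKSketch(nn.Module):  # illustrative, not the project's SKAttention
    def __init__(self, channel=512, reduction=8, kernels=(3, 5)):
        super().__init__()
        # Split: one branch per kernel size
        self.convs = nn.ModuleList([nn.Conv2d(channel, channel, k, padding=k // 2) for k in kernels])
        # Fuse: shared bottleneck, then one expansion FC per branch
        self.fc = nn.Linear(channel, channel // reduction)
        self.fcs = nn.ModuleList([nn.Linear(channel // reduction, channel) for _ in kernels])

    def forward(self, x):                                              # x: (B, C, H, W)
        feats = torch.stack([conv(x) for conv in self.convs], dim=0)   # (K, B, C, H, W)
        z = self.fc(feats.sum(0).mean(dim=(2, 3)))                     # fuse branches, global pool: (B, C/r)
        weights = torch.stack([fc(z) for fc in self.fcs], dim=0)       # (K, B, C)
        weights = torch.softmax(weights, dim=0)[..., None, None]       # softmax across branches
        return (weights * feats).sum(0)                                # Select: weighted sum of branches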
4.4. Usage
from attention.SKAttention import SKAttention
import torch
input=torch.randn(50,512,7,7)
se = SKAttention(channel=512,reduction=8)
output=se(input)
print(output.shape)
5. CBAM Attention
5.1. Citation
CBAM: Convolutional Block Attention Module—ECCV2018
Paper link: https://openaccess.thecvf.com/content_ECCV_2018/papers/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.pdf
5.2. Model Structure


5.3. Introduction
This ECCV 2018 paper uses Channel Attention and Spatial Attention at the same time and connects the two in series (the paper also ran ablation experiments on a parallel arrangement and on the two serial orders). The Channel Attention part is structurally similar to SE, but the author argues that AvgPool and MaxPool capture different representations. The original features are therefore average-pooled and max-pooled over the spatial dimensions, channel attention is extracted from each descriptor with an SE-style MLP whose parameters are shared, and the two results are summed and normalized to obtain the channel attention map.
Spatial Attention is computed in a similar way: the two kinds of pooling are applied along the channel dimension, the resulting maps are concatenated, and a 7×7 convolution extracts the spatial attention (a large kernel is needed because spatial attention requires a large receptive field). Normalization then yields the spatial attention map.
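A condensed version of both steps, written under the assumptions above (a shared MLP for the two pooled descriptors and a single 7×7 spatial convolution); the class name is hypothetical and the project's CBAMBlock may differ in detail.

import torch
from torch import nn

class CBAMSketch(nn.Module):  # illustrative, not the project's CBAMBlock
    def __init__(self, channel=512, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP for avg- and max-pooled descriptors
            nn.Linear(channel, channel // reduction), nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                             # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ca.view(b, c, 1, 1)                   # channel attention first
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)  # (B, 2, H, W)
        return x * torch.sigmoid(self.spatial(s))     # then spatial attention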
5.4. Usage
from attention.CBAM import CBAMBlock
import torch
input=torch.randn(50,512,7,7)
kernel_size=input.shape[2]
cbam = CBAMBlock(channel=512,reduction=16,kernel_size=kernel_size)
output=cbam(input)
print(output.shape)
6. BAM Attention
6.1. Citation
BAM: Bottleneck Attention Module—BMVC2018
Paper link: https://arxiv.org/pdf/1807.06514.pdf
6.2. Model Structure

6.3. Introduction
This is work by the same authors as CBAM from the same period and is very similar to CBAM, also using two kinds of attention. The difference is that CBAM applies the two attentions one after the other, while BAM adds the two attention maps directly.
The Channel Attention branch is basically the same as the SE structure. In the Spatial Attention branch, the channel dimension is first reduced with a 1×1 convolution, then two 3×3 dilated convolutions are applied, and finally a 1×1 convolution produces the Spatial Attention map.
Finally, the Channel Attention and Spatial Attention maps are added together (with broadcasting), giving an attention map that combines spatial and channel information.
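A simplified sketch of the two branches and the broadcast addition, with dilation=2 as in the dia_val argument below; BAM then uses the combined map as a residual gate, x * (1 + M(x)). The layout here is my own simplification, not the project's BAMBlock.

import torch
from torch import nn

class BAMSketch(nn.Module):  # illustrative, not the project's BAMBlock
    def __init__(self, channel=512, reduction=16, dia_val=2):
        super().__init__()
        mid = channel // reduction
        self.channel_fc = nn.Sequential(              # channel branch: SE-style two-layer MLP
            nn.Linear(channel, mid), nn.ReLU(inplace=True), nn.Linear(mid, channel),
        )
        self.spatial = nn.Sequential(                 # spatial branch: reduce, two dilated 3x3, project to 1 map
            nn.Conv2d(channel, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=dia_val, dilation=dia_val), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dia_val, dilation=dia_val), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),
        )

    def forward(self, x):                                             # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ca = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)     # (B, C, 1, 1)
        sa = self.spatial(x)                                          # (B, 1, H, W)
        att = torch.sigmoid(ca + sa)                                  # broadcast addition of the two maps
        return x * (1 + att)                                          # residual gating as in the BAM paper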
6.4. Usage
from attention.BAM import BAMBlock
import torch
input=torch.randn(50,512,7,7)
bam = BAMBlock(channel=512,reduction=16,dia_val=2)
output=bam(input)
print(output.shape)
7. ECA Attention
7.1. Citation
ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks—CVPR2020
Paper link: https://arxiv.org/pdf/1910.03151.pdf
7.2. Model Structure

7.3. Introduction
This is a CVPR 2020 paper.
As the figure above shows, SE implements channel attention with two fully connected layers, whereas ECA needs only a single convolution. The author's reasoning is that computing attention between every pair of channels is unnecessary, and the two fully connected layers do introduce too many parameters and too much computation.
So, after the AvgPool, only a 1D convolution with a receptive field of k is applied (equivalent to computing attention only among k adjacent channels), which greatly reduces the parameters and computation. (In other words, SE is a global channel attention, while ECA is a local one.)
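A minimal sketch of that idea with kernel_size=3, matching the usage below: the pooled channels are treated as a 1-D sequence, so a single Conv1d mixes each channel with its k neighbours. Names here are illustrative, not the project's ECAAttention.

import torch
from torch import nn

class ECASketch(nn.Module):  # illustrative, not the project's ECAAttention
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                  # x: (B, C, H, W)
        y = x.mean(dim=(2, 3)).unsqueeze(1)                # global average pool -> (B, 1, C)
        w = torch.sigmoid(self.conv(y))                    # each channel sees only k neighbours
        return x * w.transpose(1, 2).unsqueeze(-1)         # (B, C, 1, 1) broadcast over H, W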
7.4. Usage
from attention.ECAAttention import ECAAttention
import torch
input=torch.randn(50,512,7,7)
eca = ECAAttention(kernel_size=3)
output=eca(input)
print(output.shape)
8. DANet Attention
8.1. Citation
Dual Attention Network for Scene Segmentation—CVPR2019
Paper link: https://arxiv.org/pdf/1809.02983.pdf
8.2. Model Structure


8.3. Introduction
This is a CVPR 2019 paper with a very simple idea: apply self-attention to scene segmentation. The difference is that ordinary self-attention only attends between spatial positions, while this paper adds a channel attention branch alongside it. The channel branch follows the same operations as self-attention, except that the three Linear layers that generate Q, K, and V are removed. Finally, the outputs of the two attention branches are summed element-wise.
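The two branches can be sketched as follows on a (B, C, H, W) feature map. This is a simplified illustration: the real position-attention branch also uses convolutional projections and a learnable residual weight, which are omitted here.

import torch

def position_attention(x):                                    # attention between spatial positions
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)                                # (B, C, N)
    attn = torch.softmax(flat.transpose(1, 2) @ flat, dim=-1) # (B, N, N)
    return (flat @ attn.transpose(1, 2)).view(b, c, h, w)

def channel_attention(x):                                     # attention between channels, no Q/K/V Linears
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)                                # (B, C, N)
    attn = torch.softmax(flat @ flat.transpose(1, 2), dim=-1) # (B, C, C)
    return (attn @ flat).view(b, c, h, w)

x = torch.randn(2, 64, 7, 7)
print((position_attention(x) + channel_attention(x)).shape)   # element-wise sum of the two branches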
8.4. Usage
from attention.DANet import DAModule
import torch
input=torch.randn(50,512,7,7)
danet=DAModule(d_model=512,kernel_size=3,H=7,W=7)
print(danet(input).shape)
9. Pyramid Split Attention (PSA)
9.1. Citation
EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network—arXiv 2021.05.30
Paper link: https://arxiv.org/pdf/2105.14447.pdf
9.2. Model Structure


9.3. Introduction
This paper was uploaded to arXiv by Shenzhen University on May 30; its goal is to obtain and exploit spatial information at different scales to enrich the feature space. The network structure is relatively simple and consists of four steps: first, split the original feature into n groups along the channel dimension and apply convolutions of different scales to the different groups, obtaining new features W1; second, apply SE to each group to obtain per-group attention maps; third, re-calibrate the attention across groups with a Softmax; fourth, multiply the resulting attention with the corresponding features W1, as sketched below.
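A rough sketch of those four steps for four channel groups with kernel sizes 3/5/7/9. This is a simplification under my own assumptions (a plain convolution per branch and a small conv-based SE per group), not the project's PSA module.

import torch
from torch import nn

class PSASketch(nn.Module):  # illustrative, not the project's PSA
    def __init__(self, channel=512, reduction=8, kernels=(3, 5, 7, 9)):
        super().__init__()
        self.groups = len(kernels)
        gc = channel // self.groups                                    # channels per group
        self.convs = nn.ModuleList([nn.Conv2d(gc, gc, k, padding=k // 2) for k in kernels])
        self.se = nn.ModuleList([nn.Sequential(                        # one small SE per group
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(gc, gc // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(gc // reduction, gc, 1), nn.Sigmoid(),
        ) for _ in kernels])

    def forward(self, x):                                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        splits = x.chunk(self.groups, dim=1)                           # step 1: split channels into groups
        feats = torch.stack([conv(s) for conv, s in zip(self.convs, splits)], dim=1)      # (B, G, C/G, H, W)
        attn = torch.stack([se(f) for se, f in zip(self.se, feats.unbind(1))], dim=1)     # step 2: per-group SE
        attn = torch.softmax(attn, dim=1)                              # step 3: softmax across groups
        return (feats * attn).reshape(b, c, h, w)                      # step 4: reweight and merge back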
9.4. Usage
from attention.PSA import PSA
import torch
input=torch.randn(50,512,7,7)
psa = PSA(channel=512,reduction=8)
output=psa(input)
print(output.shape)
10. Efficient Multi-Head Self-Attention(EMSA)
10.1. Citation
ResT: An Efficient Transformer for Visual Recognition—arXiv 2021.05.28
Paper link: https://arxiv.org/abs/2105.13677
10.2. Model Structure

10.3. Introduction
This paper was uploaded to arXiv by Nanjing University on May 28. It addresses two pain points of SA: (1) the computational complexity of Self-Attention is quadratic in n (where n is the size of the spatial dimension); (2) each head only sees part of q, k, and v, and if the per-head dimensions are too small, continuous information is lost and performance degrades. The solution proposed here is quite simple: before the FC layers in SA, a convolution is used to shrink the spatial dimension, so that K and V become smaller.
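The spatial-reduction idea can be illustrated with a single-head sketch: a strided convolution shrinks the token map before K and V are computed, so the attention matrix is N x N/r^2 instead of N x N. This is my own simplified illustration, not the project's EMSA.

import torch
from torch import nn

class SpatialReductionAttention(nn.Module):  # illustrative single-head sketch, not the project's EMSA
    def __init__(self, d_model=512, ratio=2):
        super().__init__()
        self.sr = nn.Conv2d(d_model, d_model, ratio, stride=ratio)   # shrink H and W by `ratio`
        self.fc_q = nn.Linear(d_model, d_model)
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, H, W):                                      # x: (B, N, d_model), N = H * W
        b, n, c = x.shape
        q = self.fc_q(x)                                             # (B, N, C)
        xr = self.sr(x.transpose(1, 2).reshape(b, c, H, W))          # (B, C, H/r, W/r)
        xr = xr.flatten(2).transpose(1, 2)                           # (B, N/r^2, C)
        k, v = self.fc_k(xr), self.fc_v(xr)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N/r^2)
        return attn @ v                                              # (B, N, C)

x = torch.randn(50, 64, 512)                                         # 64 tokens = an 8 x 8 map
print(SpatialReductionAttention()(x, 8, 8).shape)                    # torch.Size([50, 64, 512])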
10.4. Usage
from attention.EMSA import EMSA
import torch
from torch import nn
from torch.nn import functional as F
input=torch.randn(50,64,512)
emsa = EMSA(d_model=512, d_k=512, d_v=512, h=8,H=8,W=8,ratio=2,apply_transform=True)
output=emsa(input,input,input)
print(output.shape)
Conclusion
The Attention work collected in this project is admittedly not yet comprehensive. As the readership grows, the project will keep improving; stars are welcome as support. If anything in the article is phrased inappropriately or any code is implemented incorrectly, please feel free to point it out~