[GiantPandaCV Guide] In recent years, Attention-based methods have become popular in both academia and industry thanks to their interpretability and effectiveness. However, the network structures proposed in papers are usually embedded inside larger code frameworks for classification, detection, segmentation, and so on, which makes the code redundant. For beginners like me, it can be hard to locate the core code of a network, which in turn makes it harder to understand the ideas in the papers and in online write-ups. I have therefore organized and reproduced the core code of recent papers on Attention, MLP, and Re-parameterization for the convenience of readers. This article briefly introduces the Attention part of the project. The project will keep being updated with the latest papers, and everyone is welcome to follow and star it. If you find any problems in the reproduced or organized code, please raise them in the issues section, and I will respond promptly.~
Project Address
https://github.com/xmu-xiaoma666/External-Attention-pytorch
11. Shuffle Attention
11.1. Citation
SA-NET: Shuffle Attention For Deep Convolutional Neural Networks[1]
Paper Address: https://arxiv.org/pdf/2102.00240.pdf
11.2. Model Structure

11.3. Introduction
This is a paper published by Nanjing University at ICASSP 2021, which captures two types of attention: channel attention and spatial attention. The ShuffleAttention proposed in this paper mainly consists of three steps:
1. First, the input features are divided into groups, and the features of each group are then split into two branches that compute channel attention and spatial attention, respectively. Both branches compute attention with trainable parameters followed by a sigmoid (from the structure diagram I first thought an FC layer was used here, but after reading the source code I found that a separate set of learnable parameters is created for each channel).
2. Next, the outputs of the two branches are concatenated and merged back into a feature map with the same size as the input.
3. Finally, a shuffle layer is used for channel shuffle (similar to ShuffleNet[2]).
The authors conducted experiments on the classification dataset ImageNet-1K and the object detection dataset MS COCO, as well as instance segmentation tasks, showing that the performance of SA surpasses the current SOTA methods, achieving higher accuracy with lower model complexity.
11.4. Core Code
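Below is a minimal PyTorch sketch of the module as described above, not the repository's exact code: the class and parameter names (ShuffleAttention, cweight, sweight, …) are illustrative, and the input channel count is assumed to be divisible by 2 × groups.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of Shuffle Attention: per-group channel + spatial branches, then channel shuffle."""
    def __init__(self, channels=512, groups=8):
        super().__init__()
        self.groups = groups
        mid = channels // (2 * groups)          # channels of each half-branch inside a group
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # learnable per-channel scale/offset for the channel branch
        self.cweight = nn.Parameter(torch.zeros(1, mid, 1, 1))
        self.cbias = nn.Parameter(torch.ones(1, mid, 1, 1))
        # learnable per-channel scale/offset for the spatial branch
        self.sweight = nn.Parameter(torch.zeros(1, mid, 1, 1))
        self.sbias = nn.Parameter(torch.ones(1, mid, 1, 1))
        self.gn = nn.GroupNorm(mid, mid)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.reshape(b, groups, c // groups, h, w).permute(0, 2, 1, 3, 4)
        return x.reshape(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        # step 1: split into groups, then split each group into a channel half and a spatial half
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        x_ch, x_sp = x.chunk(2, dim=1)
        # channel attention: global context -> learnable scale/offset -> sigmoid
        ch = self.cweight * self.avg_pool(x_ch) + self.cbias
        x_ch = x_ch * self.sigmoid(ch)
        # spatial attention: group norm -> learnable scale/offset -> sigmoid
        sp = self.sweight * self.gn(x_sp) + self.sbias
        x_sp = x_sp * self.sigmoid(sp)
        # step 2: concatenate the two halves and restore the input shape
        out = torch.cat([x_ch, x_sp], dim=1).reshape(b, c, h, w)
        # step 3: channel shuffle so information flows across groups
        return self.channel_shuffle(out, 2)
```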
11.5. Usage
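Continuing from the sketch above, a possible call (the tensor shape is purely for illustration):

```python
x = torch.randn(2, 512, 7, 7)
sa = ShuffleAttention(channels=512, groups=8)
print(sa(x).shape)  # torch.Size([2, 512, 7, 7])
```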
12. MUSE Attention
12.1. Citation
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning[3]
Paper Address: https://arxiv.org/abs/1911.09483
12.2. Model Structure

12.3. Introduction
This is a 2019 arXiv paper from a Peking University team, which mainly addresses a drawback of Self-Attention (SA): it only has a global capture capability. As shown in the figure below, as sentence length increases, SA's global capture ability weakens and the final model performance declines. The authors therefore introduce 1D convolutions with several different receptive fields to capture multi-scale local attention, compensating for SA's shortcomings in modeling long sentences.

The implementation, as shown in the model structure, adds the results of SA and multiple convolutions, allowing for both global and local perception (this is quite similar to the motivation of recent works such as VOLO[4] and CoAtNet[5]). Ultimately, by introducing multi-scale local perception, the model’s performance in translation tasks has been improved.
12.4. Core Code
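Below is a simplified PyTorch sketch of the idea rather than the repository's exact code: it uses nn.MultiheadAttention for the global branch and depthwise 1D convolutions with kernel sizes 1/3/5 for the multi-scale local branch, and simply sums the two; the class name MUSEAttention and the fusion details are assumptions.

```python
import torch
import torch.nn as nn

class MUSEAttention(nn.Module):
    """Sketch of MUSE: global self-attention plus parallel multi-scale 1D convolutions."""
    def __init__(self, d_model=512, n_heads=8, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # depthwise 1D convolutions with different receptive fields over the value sequence
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
             for k in kernel_sizes]
        )
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        # global branch: standard scaled dot-product self-attention
        global_out, _ = self.attn(q, k, v)
        # local branch: multi-scale convolutions over v, (B, N, D) -> (B, D, N) and back
        v_t = v.transpose(1, 2)
        local_out = sum(conv(v_t) for conv in self.convs).transpose(1, 2)
        # fuse global and local perception by summation
        return self.proj(global_out + local_out)
```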
12.5. Usage
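Continuing from the sketch above, a possible call on a batch of token sequences:

```python
x = torch.randn(2, 100, 512)            # (batch, sequence length, d_model)
muse = MUSEAttention(d_model=512, n_heads=8)
print(muse(x, x, x).shape)              # torch.Size([2, 100, 512])
```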
13. SGE Attention
13.1. Citation
Spatial Group-wise Enhance: Improving Semantic Feature Learning in Convolutional Networks[6]
Paper Address: https://arxiv.org/pdf/1905.09646.pdf
13.2. Model Structure

13.3. Introduction
This paper is a lightweight attention work published on arXiv in 2019 by the authors of SKNet[7]. As the core code below shows, the number of introduced parameters is indeed very small: self.weight and self.bias are on the order of the number of groups, i.e. essentially constant.
The core idea of this paper is to use the similarity of local and global information to guide the enhancement of semantic features. The overall operation can be divided into the following steps:
1) Group the features, and for each group, perform a dot product with the feature after global pooling to get the initial attention mask (similarity).
2) Normalize the attention mask by subtracting its mean and dividing by its standard deviation, while learning a scale and an offset parameter for each group so that the normalization can be undone.
3) Finally, apply sigmoid to obtain the final attention mask and scale the features at each position in the original feature group.
In the experimental section, the authors also conducted experiments on classification tasks (ImageNet) and detection tasks (COCO), achieving better performance with fewer parameters and computational load compared to networks like SK[7], CBAM[8], and BAM[9], demonstrating the efficiency of the proposed method.
13.4. Core Code
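Below is a minimal PyTorch sketch following the three steps above; it is not the repository's exact code, and the class name SpatialGroupEnhance is illustrative. Note that only 2 × groups parameters are learned in total.

```python
import torch
import torch.nn as nn

class SpatialGroupEnhance(nn.Module):
    """Sketch of SGE: similarity with the group's global descriptor used as a spatial mask."""
    def __init__(self, groups=8):
        super().__init__()
        self.groups = groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # one scale and one offset per group -> 2 * groups parameters in total
        self.weight = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, groups, 1, 1))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.reshape(b * self.groups, c // self.groups, h, w)
        # 1) similarity between each position and the group's globally pooled descriptor
        attn = (x * self.avg_pool(x)).sum(dim=1, keepdim=True)          # (b*g, 1, h, w)
        # 2) normalize over spatial positions, then apply per-group scale and offset
        attn = attn.reshape(b * self.groups, -1)
        attn = (attn - attn.mean(dim=1, keepdim=True)) / (attn.std(dim=1, keepdim=True) + 1e-5)
        attn = attn.reshape(b, self.groups, h, w) * self.weight + self.bias
        # 3) sigmoid mask scales every position of the original group features
        attn = attn.reshape(b * self.groups, 1, h, w)
        out = x * self.sigmoid(attn)
        return out.reshape(b, c, h, w)
```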
13.5. Usage
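Continuing from the sketch above, a possible call on a CNN feature map:

```python
x = torch.randn(2, 512, 7, 7)
sge = SpatialGroupEnhance(groups=8)
print(sge(x).shape)  # torch.Size([2, 512, 7, 7])
```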
14. A2 Attention
14.1. Citation
A2-Nets: Double Attention Networks[10]
Paper Address: https://arxiv.org/pdf/1810.11579.pdf
14.2. Model Structure

14.3. Introduction
This is a paper presented at NeurIPS 2018, which mainly focuses on spatial attention. The method in this paper is quite similar to self-attention, but the packaging is more elaborate.
The input is transformed into A, B, and V using 1×1 convolutions (similar to Q, K, and V in self-attention). The method is divided into two steps. In the first step, feature gathering, A and B are multiplied to obtain an attention map G that aggregates global information. In the second step, feature distribution, G is multiplied with V to obtain second-order attention. (Personally, I think this is somewhat similar to Attention on Attention (AoA)[11], an ICCV 2019 paper.)
According to the experimental results, this structure performs quite well, with the authors achieving excellent results in classification (ImageNet) and action recognition (Kinetics, UCF-101) tasks, showing significant improvements compared to models like Non-Local[12] and SENet[13].
14.4. Core Code
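Below is a minimal PyTorch sketch of double attention as described above; it is not the repository's exact code, and the names DoubleAttention, c_m, and c_n (the sizes of the intermediate feature and descriptor spaces) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Sketch of A2-Nets: gather global descriptors, then distribute them back."""
    def __init__(self, in_channels, c_m, c_n):
        super().__init__()
        self.conv_a = nn.Conv2d(in_channels, c_m, 1)   # feature maps A
        self.conv_b = nn.Conv2d(in_channels, c_n, 1)   # attention maps B
        self.conv_v = nn.Conv2d(in_channels, c_n, 1)   # attention vectors V
        self.proj = nn.Conv2d(c_m, in_channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        A = self.conv_a(x).reshape(b, -1, h * w)                          # (b, c_m, hw)
        B = F.softmax(self.conv_b(x).reshape(b, -1, h * w), dim=-1)       # (b, c_n, hw)
        V = F.softmax(self.conv_v(x).reshape(b, -1, h * w), dim=1)        # (b, c_n, hw)
        # step 1: feature gathering -> global descriptors G
        G = torch.bmm(A, B.transpose(1, 2))                               # (b, c_m, c_n)
        # step 2: feature distribution -> second-order attention at every position
        Z = torch.bmm(G, V).reshape(b, -1, h, w)                          # (b, c_m, h, w)
        return self.proj(Z)
```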
14.5. Usage
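Continuing from the sketch above, a possible call:

```python
x = torch.randn(2, 512, 7, 7)
a2 = DoubleAttention(in_channels=512, c_m=128, c_n=128)
print(a2(x).shape)  # torch.Size([2, 512, 7, 7])
```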
15. AFT Attention
15.1. Citation
An Attention Free Transformer[14]
Paper Address: https://arxiv.org/pdf/2105.14103v1.pdf
15.2. Model Structure

15.3. Introduction
This is a work released by the Apple team on June 16, 2021, on arXiv, which mainly focuses on simplifying Self-Attention.
In recent years, Transformers have been applied to all kinds of tasks, but because the time and space complexity of Self-Attention is quadratic in the input size, it cannot scale to very large inputs. Many works have therefore been proposed to reduce SA's complexity: sparse attention, locality-sensitive hashing, low-rank decomposition, and so on.
This paper proposes the Attention Free Transformer (AFT), which still has the three parts Q, K, and V, but unlike traditional methods it does not compute a dot product between Q and K. Instead, K and V are fused directly so that information interacts at corresponding positions, and Q is then multiplied element-wise with the fused features, which reduces the computational cost.
Overall, the principle is similar to Self-Attention, but instead of using dot products, it uses element-wise multiplication, significantly reducing the computational load.
15.4. Core Code
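Below is a minimal PyTorch sketch of AFT-full as described above, not the repository's exact code: it materializes the pairwise position bias explicitly (which costs O(n²·d) memory and is only meant to make the formula readable), and the names AFTFull, max_len, and pos_bias are assumptions.

```python
import torch
import torch.nn as nn

class AFTFull(nn.Module):
    """Sketch of AFT-full: K and V are fused with a learned position bias,
    then gated element-wise by sigmoid(Q); no Q-K dot product is computed."""
    def __init__(self, d_model=512, max_len=49):
        super().__init__()
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        # learned pairwise position bias w[t, t']
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, d = x.shape
        q = torch.sigmoid(self.to_q(x))                       # (b, n, d), the gate
        k = self.to_k(x)
        v = self.to_v(x)
        w = self.pos_bias[:n, :n].unsqueeze(0)                # (1, n, n)
        # weight of source position t' for every target position t (a numerically
        # stable implementation would subtract the max before exponentiating)
        weights = torch.exp(k).unsqueeze(1) * torch.exp(w).unsqueeze(-1)   # (b, n, n, d)
        num = (weights * v.unsqueeze(1)).sum(dim=2)           # weighted sum of values
        den = weights.sum(dim=2)                              # normalization term
        return self.proj(q * num / den)
```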
15.5. Usage
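Continuing from the sketch above, a possible call on a flattened 7×7 feature map treated as a sequence of 49 tokens:

```python
x = torch.randn(2, 49, 512)
aft = AFTFull(d_model=512, max_len=49)
print(aft(x).shape)  # torch.Size([2, 49, 512])
```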
[Final Note]
Currently, the Attention works organized by this project are indeed not comprehensive enough. As the readership increases, this project will be continuously improved. Everyone is welcome to star and support. If there are any inappropriate expressions in the article or errors in the code implementation, please feel free to point them out~
[References]
[1]. Zhang, Qing-Long, and Yu-Bin Yang. “SA-NET: Shuffle Attention for Deep Convolutional Neural Networks.” ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.
[2]. Zhang, Xiangyu, et al. “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[3]. Zhao, Guangxiang, et al. “MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning.” arXiv Preprint arXiv:1911.09483 (2019).
[4]. Yuan, Li, et al. “VOLO: Vision Outlooker for Visual Recognition.” arXiv Preprint arXiv:2106.13112 (2021).
[5]. Dai, Zihang, et al. “CoAtNet: Marrying Convolution and Attention for All Data Sizes.” arXiv Preprint arXiv:2106.04803 (2021).
[6]. Li, Xiang, Xiaolin Hu, and Jian Yang. “Spatial Group-Wise Enhance: Improving Semantic Feature Learning in Convolutional Networks.” arXiv Preprint arXiv:1905.09646 (2019).
[7]. Li, Xiang, et al. “Selective Kernel Networks.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[8]. Woo, Sanghyun, et al. “CBAM: Convolutional Block Attention Module.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[9]. Park, Jongchan, et al. “BAM: Bottleneck Attention Module.” arXiv Preprint arXiv:1807.06514 (2018).
[10]. Chen, Yunpeng, et al. “A2-Nets: Double Attention Networks.” Advances in Neural Information Processing Systems. 2018.
[11]. Huang, Lun, et al. “Attention on Attention for Image Captioning.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[12]. Wang, Xiaolong, et al. “Non-Local Neural Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[13]. Hu, Jie, Li Shen, and Gang Sun. “Squeeze-and-Excitation Networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[14]. Zhai, Shuangfei, et al. “An Attention Free Transformer.” arXiv Preprint arXiv:2105.14103 (2021).
If you have any questions about the article, please feel free to leave a comment or add the author’s WeChat: xmu_xiaoma
