New Ideas on Attention Mechanisms: Frequency Domain + Attention

Frequency domain + attention breaks away from the traditional ways of modifying attention mechanisms and has become a hot research topic. Those looking to publish papers are encouraged to keep a close eye on it.

On one hand, combining the frequency domain with attention is very effective for improving model performance, efficiency, and interpretability. For example, the FEDformer model not only reduces computational complexity to linear in the sequence length but also cuts forecasting error by up to 22.6% relative to the previous state of the art.

On the other hand, the frequency domain has received little attention in this context before, making it a fresh perspective. It is a concept borrowed from signal processing, and it brings with it many tools with great application potential.

To give everyone a comprehensive, in-depth understanding of this direction and to spark more ideas, I have prepared six representative fusion methods, mainly covering adaptive frequency-domain feature extraction + attention and multi-scale frequency domain + attention.

Scan the QR code below and reply with 「Frequency」 to get all the papers and project code for free.

1. FcaNet: Frequency Channel Attention Networks

Paper Summary: Attention mechanisms, especially channel attention, have achieved great success in computer vision. Many works focus on designing more effective channel attention while overlooking a fundamental issue: Global Average Pooling (GAP) is used as the preprocessing step without question. In this work, the authors rethink channel attention from the perspective of frequency analysis and prove mathematically that conventional GAP is a special case of frequency-domain feature decomposition: it corresponds to the lowest-frequency component of the 2D discrete cosine transform (DCT). On this basis, they naturally generalize the channel attention preprocessing step to the frequency domain and propose FcaNet with a novel multi-spectral channel attention. The method is simple and effective, and it can be implemented within existing channel attention methods by changing a single line of code. It achieves strong results on image classification, object detection, and instance segmentation.
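
To make the idea concrete, here is a minimal PyTorch sketch of multi-spectral channel attention as described above: each group of channels is pooled against a different 2D-DCT basis instead of GAP. The frequency selection, class names, and defaults here are illustrative, not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

def dct_basis(px, py, fx, fy, height, width):
    """Value of the 2D DCT basis (fx, fy) at spatial position (px, py)."""
    return (math.cos(math.pi * fx * (px + 0.5) / height) *
            math.cos(math.pi * fy * (py + 0.5) / width))

class MultiSpectralChannelAttention(nn.Module):
    """Channel attention whose descriptor mixes several 2D-DCT frequency
    components; with only the (0, 0) frequency it reduces to GAP up to scale."""
    def __init__(self, channels, height, width,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=16):
        super().__init__()
        assert channels % len(freqs) == 0  # channels split evenly across frequencies
        weight = torch.zeros(channels, height, width)
        c_per_f = channels // len(freqs)
        for i, (fx, fy) in enumerate(freqs):
            for px in range(height):
                for py in range(width):
                    weight[i * c_per_f:(i + 1) * c_per_f, px, py] = \
                        dct_basis(px, py, fx, fy, height, width)
        self.register_buffer("dct_weight", weight)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, channels, height, width)
        b, c, _, _ = x.shape
        # Frequency-domain "pooling": project each channel onto its DCT basis.
        desc = (x * self.dct_weight).sum(dim=(2, 3))
        return x * self.fc(desc).view(b, c, 1, 1)
```

The "one line of code" claim maps to the pooling step: in an SE-style block, swapping the GAP descriptor for the DCT-weighted sum above is the only change.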

Model Structure

[Figure: FcaNet model structure]

Experimental Results

[Figures: FcaNet experimental results]

2. SpectFormer: Frequency and Attention is What You Need in a Vision Transformer

Paper Summary: This work studies the hypothesis that combining spectral layers with multi-head attention layers yields a better transformer architecture, and the observations support it. The authors therefore propose SpectFormer, a new transformer structure that mixes spectral layers and multi-head attention layers, arguing that the resulting representations let the transformer capture features more appropriately and outperform other transformer representations. For example, it improves top-1 accuracy on ImageNet by 2% over GFNet-H and LiT. SpectFormer-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art among small-sized models), and SpectFormer-L reaches 85.7%, state of the art among comparable base-sized transformers. The authors further validate the method in other scenarios, such as transfer learning on standard datasets (CIFAR-10, CIFAR-100, Oxford-IIIT Flower, and Stanford Cars). They then study downstream tasks (object detection and instance segmentation) on the MS-COCO dataset, observing that SpectFormer performs consistently on par with the best backbone networks and can be further optimized and improved.
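
A minimal sketch of the architectural idea follows, assuming GFNet-style spectral gating (FFT, learnable elementwise complex filter, inverse FFT) for the spectral layers and standard encoder layers for attention. Block counts, dimensions, and the `TinySpectFormer` name are illustrative only.

```python
import torch
import torch.nn as nn

class SpectralBlock(nn.Module):
    """Token mixing in the frequency domain: FFT over the token grid, an
    elementwise learnable complex filter, then inverse FFT."""
    def __init__(self, dim, h, w):
        super().__init__()
        self.h, self.w = h, w
        # rfft2 keeps w // 2 + 1 columns; the filter is stored as (real, imag).
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, h*w tokens, dim)
        b, n, d = x.shape
        y = self.norm(x).view(b, self.h, self.w, d)
        y = torch.fft.rfft2(y, dim=(1, 2), norm="ortho")
        y = y * torch.view_as_complex(self.filter)
        y = torch.fft.irfft2(y, s=(self.h, self.w), dim=(1, 2), norm="ortho")
        return x + y.reshape(b, n, d)

class TinySpectFormer(nn.Module):
    """First `alpha` blocks are spectral, the rest are multi-head attention."""
    def __init__(self, dim=64, depth=4, alpha=2, heads=4, h=8, w=8):
        super().__init__()
        attn = lambda: nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.blocks = nn.ModuleList(
            [SpectralBlock(dim, h, w) for _ in range(alpha)] +
            [attn() for _ in range(depth - alpha)])

    def forward(self, x):                  # x: (batch, h*w, dim) patch tokens
        for blk in self.blocks:
            x = blk(x)
        return x
```

Placing spectral blocks first mirrors the paper's intuition that early layers benefit from cheap global frequency mixing, while later layers need content-dependent attention.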

Model Structure

[Figure: SpectFormer model structure]

Experimental Results

[Figures: SpectFormer experimental results]

3. FCMNet: Frequency-aware Cross-modality Attention Networks for RGB-D Salient Object Detection

Paper Summary: RGB-D salient object detection (SOD) aims to exploit RGB images and depth maps jointly to detect salient objects. The field still faces two challenges: 1) how to extract representative multi-modal features, and 2) how to fuse them effectively. Most previous methods treat RGB and depth as two separate modalities without considering how they differ in the frequency domain, which can discard complementary information. This paper introduces a frequency channel attention mechanism into the fusion process, proposing the frequency-aware cross-modality attention network (FCMNet). First, the authors design a frequency-aware cross-modality attention (FACMA) module to interweave channel features between the modalities and select representative ones; within FACMA, they further propose a spatial frequency channel attention (SFCA) module to inject more complementary information across channels. Second, a weighted cross-modality fusion (WCMF) module adaptively fuses the multi-modal features by learning content-dependent weight maps. The method is evaluated on eight widely used datasets against seventeen state-of-the-art methods.
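
Here is a rough sketch of the cross-modality gating idea: each modality's channel descriptor gates the other modality's features, and a learned weight map then fuses the two streams. For brevity the descriptor below is plain GAP (the lowest-frequency DCT component); the paper's SFCA uses richer frequency-domain pooling, and all module names here are illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

class FrequencyChannelGate(nn.Module):
    """Per-channel gate from a pooled descriptor (GAP stands in here for the
    multi-spectral DCT pooling used by SFCA)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (batch, channels, h, w)
        b, c, _, _ = x.shape
        return self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)

class CrossModalityAttention(nn.Module):
    """FACMA-style interweaving: each modality's gate reweights the other
    modality, then a WCMF-style learned weight map fuses the two outputs."""
    def __init__(self, channels):
        super().__init__()
        self.gate_rgb = FrequencyChannelGate(channels)
        self.gate_depth = FrequencyChannelGate(channels)
        self.weight = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, rgb, depth):
        rgb_out = rgb * self.gate_depth(depth)    # depth guides RGB channels
        depth_out = depth * self.gate_rgb(rgb)    # RGB guides depth channels
        # Content-dependent fusion weights over the two streams.
        w = torch.softmax(self.weight(torch.cat([rgb_out, depth_out], dim=1)), dim=1)
        return w[:, :1] * rgb_out + w[:, 1:] * depth_out
```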

Model Structure

[Figures: FCMNet model structure]

Experimental Results

[Figure: FCMNet experimental results]


4. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting

Paper Summary: Although transformer-based methods have significantly improved the state of the art in long-term series forecasting, they are not only computationally expensive but also, more importantly, unable to capture a global view of the time series (e.g., the overall trend). To address these issues, the authors propose combining transformers with a seasonal-trend decomposition, in which the decomposition captures the global profile of the series while the transformer captures the finer detailed structures. To further improve long-term forecasting performance, they exploit the fact that most time series tend to have sparse representations in well-known bases such as the Fourier transform, and develop a frequency-enhanced transformer. The proposed method, called the Frequency Enhanced Decomposed Transformer (FEDformer), is more efficient than a standard transformer, with complexity linear in the sequence length.
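
A minimal sketch of the frequency-enhanced block idea (in the spirit of the paper's Fourier-basis variant): FFT along time, keep a fixed random subset of frequency modes, mix them with learnable complex weights, and invert. Keeping only a constant number of modes is what makes the cost linear in sequence length. The mode count and initialization below are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyEnhancedBlock(nn.Module):
    """FFT along time, randomly select a fixed set of modes, apply a learnable
    complex channel-mixing weight per mode, then inverse FFT."""
    def __init__(self, dim, seq_len, modes=16):
        super().__init__()
        n_freq = seq_len // 2 + 1
        modes = min(modes, n_freq)
        # Randomly selected (then fixed) frequency indices.
        self.register_buffer("index", torch.randperm(n_freq)[:modes])
        self.weight = nn.Parameter(torch.randn(modes, dim, dim, 2) / dim)

    def forward(self, x):                  # x: (batch, seq_len, dim)
        b, t, d = x.shape
        xf = torch.fft.rfft(x, dim=1)      # (b, t//2 + 1, d), complex
        w = torch.view_as_complex(self.weight)  # (modes, d, d)
        out = torch.zeros_like(xf)
        # Mix channels only at the selected modes; all other modes are dropped.
        out[:, self.index] = torch.einsum("bmd,mde->bme", xf[:, self.index], w)
        return torch.fft.irfft(out, n=t, dim=1)
```

In the full model this block replaces attention inside an encoder-decoder that also carries the seasonal-trend decomposition stream.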

Model Structure

[Figures: FEDformer model structure]

Experimental Results

[Figures: FEDformer experimental results]

5. Dual-domain Strip Attention for Image Restoration

Paper Summary: Image restoration aims to reconstruct a latent high-quality image from a degraded observation. Recently, thanks to their powerful long-range dependency modeling, transformers have significantly improved the state-of-the-art performance on various image restoration tasks; however, the quadratic complexity of self-attention hinders practical application. Moreover, fully exploiting the pronounced spectral differences between clean and degraded image pairs also benefits restoration. This paper proposes a dual-domain strip attention mechanism for image restoration via enhanced representation learning, consisting of spatial and frequency strip attention units. Specifically, the spatial strip attention unit gathers contextual information for each pixel from adjacent positions in the same row or column, guided by weights learned through a simple convolution branch. The frequency strip attention unit refines features in the spectral domain via frequency separation and modulation, using simple pooling techniques. The authors further apply different strip sizes to enhance multi-scale learning, which helps handle image degradations of various sizes. By employing dual-domain attention units along different directions, each pixel can implicitly perceive information from an expanded region.
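
A minimal sketch of the frequency strip attention unit under the pooling interpretation stated above: a strip-shaped average pool acts as the low-pass filter, the residual is the high-frequency band, and learned per-channel weights modulate each band before recombination. The spatial strip unit and the multi-scale strip sizes are omitted; names and defaults are illustrative.

```python
import torch
import torch.nn as nn

class FrequencyStripAttention(nn.Module):
    """Strip-wise frequency separation and modulation with simple pooling."""
    def __init__(self, channels, strip=7, horizontal=True):
        super().__init__()
        k = (1, strip) if horizontal else (strip, 1)
        pad = (0, strip // 2) if horizontal else (strip // 2, 0)
        self.lowpass = nn.AvgPool2d(k, stride=1, padding=pad)
        self.low_w = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.high_w = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):                  # x: (batch, channels, h, w)
        low = self.lowpass(x)              # strip-wise local mean ~ low frequencies
        high = x - low                     # residual ~ high frequencies
        return self.low_w * low + self.high_w * high
```

Stacking horizontal and vertical units (and several strip sizes) gives each pixel an implicitly enlarged receptive field at the cost of a few pooling passes.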

Model Structure

[Figures: dual-domain strip attention model structure]

Experimental Results

[Figure: experimental results]

6. Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring

Paper Summary: The authors propose an effective and efficient method that exploits the properties of transformers in the frequency domain for high-quality image deblurring. The method is motivated by the convolution theorem: the correlation or convolution of two signals in the spatial domain is equivalent to their element-wise product in the frequency domain. This inspires an efficient frequency domain-based self-attention solver (FSAS) that estimates scaled dot-product attention via element-wise products rather than spatial-domain matrix multiplications. In addition, the authors find that simply using a naive feed-forward network (FFN) in the transformer does not produce good deblurring results. To overcome this, they propose a simple yet effective discriminative frequency domain-based FFN (DFFN), which introduces into the FFN a gating mechanism inspired by the JPEG (Joint Photographic Experts Group) compression algorithm to decide which low- and high-frequency information of the features should be retained for recovering a sharp latent image.
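
A minimal sketch of the convolution-theorem trick behind FSAS: the Q-K correlation is computed as an elementwise product in the frequency domain rather than as a spatial matrix multiply, bringing the cost down from quadratic in the number of pixels to the cost of an FFT. The projections and normalization below are simplified stand-ins for the paper's exact design.

```python
import torch
import torch.nn as nn

class FrequencyDomainSelfAttention(nn.Module):
    """Estimate the Q-K correlation map in the frequency domain, then use it
    to modulate V (a simplified FSAS-style block)."""
    def __init__(self, channels):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.GroupNorm(1, channels)  # stand-in for the paper's normalization

    def forward(self, x):                  # x: (batch, channels, h, w)
        q, k, v = self.qkv(x).chunk(3, dim=1)
        fq = torch.fft.rfft2(q, norm="ortho")
        fk = torch.fft.rfft2(k, norm="ortho")
        # conj(F(q)) * F(k) in frequency <=> correlation of q and k in space.
        corr = torch.fft.irfft2(fq.conj() * fk, s=x.shape[-2:], norm="ortho")
        return self.proj(self.norm(corr) * v)
```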

Model Structure

[Figure: FSAS/DFFN model structure]

Experimental Results

[Figures: experimental results]

