Author: Mr. Good Good (Zhihu)
Source: https://zhuanlan.zhihu.com/p/361893386

1 Background Knowledge

2 Principles and Classification of Attention Mechanism
2.1 Principles of Attention Mechanism

2.2 Classification of Attention Mechanism
- Soft Attention Mechanism. Can be divided into item-wise soft attention and location-wise soft attention.
- Hard Attention Mechanism. Can be divided into item-wise hard attention and location-wise hard attention.
- Self-Attention Mechanism. A variant of the attention mechanism that reduces reliance on external information and is better at capturing the internal correlations within data or features. In text applications, it mainly addresses long-range dependencies by computing the mutual influence between words. (A minimal sketch contrasting soft and hard attention follows this list.)
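To make the soft/hard distinction concrete, here is a minimal, illustrative PyTorch sketch (not taken from any of the papers below): soft attention computes a differentiable weighted average over all items, while hard attention samples a single item from the same distribution and therefore typically needs policy-gradient training.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, items):
    # query: (d,), items: (n, d) -> differentiable context vector (d,)
    scores = items @ query                 # (n,) alignment scores
    weights = F.softmax(scores, dim=0)     # (n,) attention distribution
    return weights @ items                 # weighted average over all items

def hard_attention(query, items):
    # Samples one item; non-differentiable, so training usually
    # relies on policy-gradient estimators such as REINFORCE.
    scores = items @ query
    weights = F.softmax(scores, dim=0)
    idx = torch.multinomial(weights, num_samples=1)
    return items[idx.item()], idx

items = torch.randn(5, 8)
query = torch.randn(8)
context = soft_attention(query, items)       # (8,)
picked, idx = hard_attention(query, items)   # one row of `items`
```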

3 Soft Attention Mechanism (soft-attention)
3.1 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Main Idea: Inspired by recent work in machine translation and object detection, the paper introduces an attention-based model that automatically learns to describe the content of images. The paper describes how to train the model deterministically using standard backpropagation and stochastically by maximizing a variational lower bound, and visualizes how the model learns to fix its gaze on salient objects while generating the corresponding words in the output sequence. The use of attention is validated with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO. (A minimal sketch of the soft attention module follows the links below.)
- Paper link: https://arxiv.org/pdf/1502.03044v3.pdf
- Code link: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
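A rough sketch of the paper's soft (deterministic) attention, under the assumption that a CNN feature map is flattened into per-region annotation vectors and scored against the LSTM decoder state; layer names and sizes here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVisualAttention(nn.Module):
    """Illustrative soft attention over CNN region features; the real model
    also feeds the context vector back into the LSTM and uses a gating scalar."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim) from a flattened feature map,
        # e.g. cnn_out.flatten(2).transpose(1, 2); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                  # (batch, num_regions)
        alpha = F.softmax(e, dim=1)                     # where to look
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # expected annotation
        return context, alpha

feats = torch.randn(2, 196, 512)   # e.g. a 14x14 feature map, flattened
context, alpha = SoftVisualAttention(512, 256, 128)(feats, torch.randn(2, 256))
```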
3.2 Action Recognition using Visual Attention
- Main Idea: For action recognition in videos, the paper proposes a soft-attention-based model built on a multi-layer recurrent neural network (RNN) with long short-term memory (LSTM) units that is deep both spatially and temporally. The model learns to focus selectively on parts of the video frames and classifies the video after taking a few glances; it inherently learns which parts of the frames are relevant to the task at hand and assigns them higher importance. The model is evaluated on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 datasets, with an analysis of how it directs its attention depending on the scene and the action being performed.
- Paper link: https://arxiv.org/pdf/1511.04119v3.pdf
- Code link: https://github.com/kracwarlock/action-recognition-visual-attention
3.3 Describing Videos by Exploiting Temporal Structure
- Main Idea: Recent progress in using recurrent neural networks (RNNs) for image description has sparked interest in applying them to video description. However, while images are static, videos require modeling their dynamic temporal structure and then correctly integrating that information into a natural-language description. The paper proposes an approach that successfully accounts for both the local and the global temporal structure of videos to produce descriptions. First, it incorporates a spatiotemporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics; the 3-D CNN is trained on video action recognition tasks so as to produce a representation tuned to human motion and behavior. Second, the paper proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
- Paper link: https://arxiv.org/pdf/1502.08029v5.pdf
- Code link: https://github.com/yaoli/arctic-capgen-vid
3.4 Attention-Gated Networks for Improving Ultrasound Scan Plane Detection
- Main Idea: This work applies attention-gated networks to real-time, automated scan plane detection in fetal ultrasound screening. Scan plane detection in fetal ultrasound is challenging because of poor image quality, which hurts interpretability for both clinicians and automated algorithms. To address this, the paper proposes incorporating a self-gated soft attention mechanism that generates an end-to-end trainable gating signal, allowing the network to contextualize local information that is useful for prediction. The proposed attention mechanism is generic and can be integrated into any existing classification architecture with only a few additional parameters. The paper shows that when the underlying network has high capacity, the integrated attention mechanism improves overall performance while providing effective object localization; when the base network has low capacity, the method substantially outperforms the baseline approach and greatly reduces false positives. (A rough sketch of an attention gate follows the links below.)
- Paper link: https://arxiv.org/pdf/1804.05338v1.pdf
- Code link: https://github.com/ozan-oktay/Attention-Gated-Networks
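A rough, illustrative sketch of an attention gate in this spirit, assuming the gating signal has already been resampled to the resolution of the local feature map; the paper's exact normalization and architecture may differ:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Illustrative grid attention gate for 2-D feature maps: a coarse gating
    signal g decides which locations of the local features x to pass through."""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(in_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: (b, in_ch, H, W) local features; g: (b, gate_ch, H, W) gating signal
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(x) + self.phi(g))))
        return x * attn                   # gate local features by attention map

gate = AttentionGate(in_ch=64, gate_ch=128, inter_ch=32)
out = gate(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32))
```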
3.5 Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism
- Main Idea: The paper proposes a fundamentally new way to train PINNs adaptively, in which the adaptation weights are fully trainable, so the neural network learns by itself which regions of the solution are difficult and focuses on them, reminiscent of the soft multiplicative attention masks used in computer vision. The basic idea behind these self-adaptive PINNs is to make the weights increase where the corresponding loss is high, which is achieved by training the network to simultaneously minimize the loss and maximize the weights, i.e., to find a saddle point of the cost surface. The paper shows that this is formally equivalent to solving PDE-constrained optimization with penalty-based methods, except that the monotonically non-decreasing penalty coefficients are trainable. In numerical experiments with the stiff Allen-Cahn PDE, the self-adaptive PINN outperformed other state-of-the-art PINN algorithms in L2 error while using fewer training iterations; an appendix contains additional results for the Burgers and Helmholtz PDEs, confirming the trends observed for Allen-Cahn. (A minimal sketch of the min-max training loop follows the links below.)
- Paper link: https://arxiv.org/pdf/2009.04544v2.pdf
- Code link: https://github.com/levimcclenny/SA-PINNs
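A minimal sketch of the min-max training idea, with a placeholder residual instead of a real PDE; the actual SA-PINN applies separate self-adaptive weights to residual, boundary, and initial losses and a mask function on the weights:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
pts = torch.rand(1024, 2)                    # collocation points
lam = torch.rand(1024, requires_grad=True)   # self-adaptive weights

opt_net = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(u):
    # Placeholder: a real PINN differentiates u w.r.t. the inputs here.
    return u.squeeze(-1)

for step in range(100):
    opt_net.zero_grad()
    lam.grad = None
    loss = (lam * pde_residual(net(pts)) ** 2).mean()  # weighted residual loss
    loss.backward()
    opt_net.step()                      # gradient descent on network params
    with torch.no_grad():
        lam += 1e-2 * lam.grad          # gradient *ascent* on the weights
```

Because the gradient of the loss with respect to each weight is the (nonnegative) squared residual at that point, ascent automatically up-weights the regions where the PDE is hardest to satisfy.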
3.6 Recurrent Models of Visual Attention
- Main Idea: Applying convolutional neural networks to large images is computationally expensive because the cost scales linearly with the number of pixels. The paper proposes a novel recurrent neural network model that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. Like convolutional neural networks, the proposed model has a built-in degree of translation invariance, but the amount of computation it performs can be controlled independently of the input image size. Although the model is non-differentiable, it can be trained with reinforcement learning methods to learn task-specific policies. On several image classification tasks it significantly outperforms convolutional network baselines on cluttered images, and in a dynamic visual control problem it learns to track a simple object without an explicit training signal for doing so. (A toy sketch of the REINFORCE-style policy update follows the links below.)
- Paper link: https://arxiv.org/pdf/1406.6247v1.pdf
- Code link: https://github.com/kevinzakka/recurrent-visual-attention
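A toy sketch of the REINFORCE-style update that makes a non-differentiable glimpse policy trainable; the glimpse network and classifier of the actual model are collapsed into a placeholder reward here:

```python
import torch

# Toy location policy: a learnable mean for a Gaussian over glimpse locations.
loc_mean = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([loc_mean], lr=1e-2)

def classify_glimpse(loc):
    # Placeholder for "crop a patch at `loc` and classify it": reward 1 if the
    # glimpse lands near the (hypothetical) object at the origin, else 0.
    return (loc.norm() < 0.5).float()

for step in range(200):
    dist = torch.distributions.Normal(loc_mean, torch.ones(2) * 0.1)
    loc = dist.sample()                          # non-differentiable sample
    reward = classify_glimpse(loc)
    loss = -dist.log_prob(loc).sum() * reward    # REINFORCE surrogate loss
    opt.zero_grad(); loss.backward(); opt.step()
```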

4 Hard Attention Mechanism (hard-attention)
4.1 Attention-based Extraction of Structured Information from Street View Imagery
- Main Idea: The paper proposes a model based on CNNs, RNNs, and a novel attention mechanism that achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly surpassing the previous state of the art (Smith '16), which reached 72.46%. The new method is also simpler and more general than previous approaches. To demonstrate this generality, the paper shows that it performs well on a more challenging dataset derived from Google Street View, where the goal is to extract business names from storefronts. Finally, the paper studies the speed/accuracy trade-offs of CNN feature extractors of different depths and, surprisingly, finds that deeper is not always better (in terms of both accuracy and speed). The resulting model is simple, accurate, and fast, and can be used at scale on a variety of challenging real-world text-extraction problems.
- Paper link: https://arxiv.org/pdf/1704.03549v4.pdf
- Code link: https://github.com/tensorflow/models
4.2 Hard Non-Monotonic Attention for Character-Level Transduction
- Main Idea: Character-level string-to-string transduction is a component of many NLP tasks; the goal is to map an input string to an output string, where the two strings may differ in length and be drawn from different alphabets. Recent approaches have used sequence-to-sequence models with attention to learn which parts of the input string the model should focus on while generating the output. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention had only been applied to other sequence modeling tasks such as image captioning, and it required stochastic approximations to compute gradients. This work introduces an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, and shows that the hard attention model can be viewed as a neural reparameterization of the classical IBM Model 1. (A minimal sketch of the per-position marginalization follows the links below.)
- Paper link: https://arxiv.org/pdf/1808.10024v2.pdf
- Code link: https://github.com/shijie-wu/neural-transducer
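A minimal sketch of why the marginalization is polynomial under an IBM-Model-1-style independence assumption: each target position aligns to a source position independently, so the sum over exponentially many alignments factorizes into one logsumexp per target position (the tensors below are toy values, and the neural parameterization of the scores is omitted):

```python
import torch

def log_likelihood(align_logits, emit_logprobs):
    # align_logits:  (tgt_len, src_len) unnormalized alignment scores
    # emit_logprobs: (tgt_len, src_len) log p(y_j | x_i)
    align_logprobs = torch.log_softmax(align_logits, dim=1)
    # logsumexp over source positions marginalizes out the hard alignment
    per_position = torch.logsumexp(align_logprobs + emit_logprobs, dim=1)
    return per_position.sum()            # log p(y | x)

ll = log_likelihood(torch.randn(7, 5), -5 * torch.rand(7, 5))
```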
4.3 Saccader: Improving Accuracy of Hard Attention Models for Vision
- Main Idea: The paper proposes a novel hard attention model called Saccader. Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy-gradient optimization. The best model considerably narrows the gap to common ImageNet baselines, achieving 75% top-1 and 91% top-5 accuracy while attending to less than one-third of the image.
- Paper link: https://arxiv.org/pdf/1908.07644v3.pdf
- Code link: https://github.com/google-research/google-research
4.4 Overcoming Catastrophic Forgetting with Hard Attention to the Task
- Main Idea: Catastrophic forgetting occurs when a neural network loses the information learned in a previous task while training on subsequent tasks, and it remains an obstacle for AI systems with sequential learning capabilities. The authors propose a task-based hard attention mechanism that preserves information from previous tasks without affecting learning of the current task. A hard attention mask is learned concurrently with each task through stochastic gradient descent, and previous masks are exploited to condition this learning. The paper shows that the proposed mechanism is effective at reducing catastrophic forgetting, cutting current forgetting rates by 45% to 80%, and demonstrates robustness to different hyperparameter choices while offering a number of monitoring capabilities. The approach makes it possible to control both the stability and the compactness of the learned knowledge, which the authors believe also makes it attractive for online learning and network compression applications. (A minimal sketch of a task-gated layer follows the links below.)
- Paper link: https://arxiv.org/pdf/1801.01423v3.pdf
- Code link: https://github.com/joansj/hat
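A minimal sketch of a task-gated layer in the spirit of HAT: a per-task embedding passed through a scaled sigmoid approaches a binary mask as the scale s is annealed upward, staying differentiable throughout (names, sizes, and the annealing schedule are illustrative):

```python
import torch
import torch.nn as nn

class HATLayer(nn.Module):
    """Layer whose units are gated by a task-conditioned, near-binary mask."""
    def __init__(self, in_features, out_features, num_tasks):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.task_embed = nn.Embedding(num_tasks, out_features)

    def forward(self, x, task_id, s):
        # s is annealed from ~1 to a large value during training, making the
        # sigmoid gate increasingly binary while remaining differentiable.
        mask = torch.sigmoid(s * self.task_embed(task_id))   # (1, out_features)
        return self.fc(x) * mask

layer = HATLayer(16, 32, num_tasks=3)
out = layer(torch.randn(4, 16), torch.tensor([0]), s=50.0)
```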

5 Self-Attention Mechanism (self-attention)
5.1 Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling
- Main Idea: The paper integrates soft and hard attention into a single context-fusion model, Reinforced Self-Attention (ReSA), so that the two can benefit each other. In ReSA, a hard attention module prunes the sequence for a soft self-attention module to process, while the soft attention provides reward signals back to facilitate training the hard attention. To this end, the authors develop a novel hard attention mechanism, Reinforced Sequence Sampling (RSS), which selects tokens in parallel and is trained via policy gradients. Using two RSS modules, ReSA efficiently extracts sparse dependencies between each pair of selected tokens. Finally, the paper proposes a sentence-encoding model built entirely on ReSA, the Reinforced Self-Attention Network (ReSAN), which achieves state-of-the-art performance on the Stanford Natural Language Inference (SNLI) and Sentences Involving Compositional Knowledge (SICK) datasets.
- Paper link: https://arxiv.org/pdf/1801.10296v2.pdf
- Code link: https://github.com/taoshen58/DiSAN
5.2 Attention Is All You Need
- Main Idea: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration, with the best-performing models also connecting the encoder and decoder through an attention mechanism. The authors propose a new, simple network architecture, the Transformer, based entirely on attention mechanisms, dispensing with recurrence and convolutions altogether. Experiments on two machine translation tasks show that these models are superior in quality while being more parallelizable and requiring significantly less time to train. The model achieves a new state-of-the-art BLEU score of 28.4 on the WMT 2014 English-to-German translation task, improving over the previous best results (including ensembles) by over 2 BLEU; on the WMT 2014 English-to-French translation task, it establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs. (The core scaled dot-product attention is sketched after the links below.)
- Paper link: https://arxiv.org/pdf/1706.03762v5.pdf
- Code link: https://github.com/tensorflow/tensor2tensor
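The core of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal PyTorch version, without the multi-head projections:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)            # self-attention: Q = K = V source
out = scaled_dot_product_attention(q, k, v)   # (2, 10, 64)
```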
5.3 A Neural Attention Model for Abstractive Sentence Summarization
- Main Idea: Summarization based on text extraction is inherently limited, but abstractive methods that generate text from scratch are challenging to build. This work proposes a fully data-driven approach to abstractive sentence summarization. The method uses a local attention-based model that generates each word of the summary conditioned on the input sentence. Despite its structural simplicity, the model can easily be trained end to end and scales to large amounts of training data. It shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.
- Paper link: https://arxiv.org/pdf/1509.00685v2.pdf
- Code link: https://github.com/toru34/rushemnlp2015
5.4 Neural Machine Translation by Jointly Learning to Align and Translate
- Main Idea: Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, it aims to build a single neural network that can be jointly tuned to maximize translation performance. Recently proposed neural machine translation models often belong to the encoder-decoder family, with an encoder that compresses the source sentence into a fixed-length vector and a decoder that generates the translation from that vector. This paper conjectures that the fixed-length vector is a bottleneck for improving the performance of this basic encoder-decoder architecture and proposes to extend it by letting the model automatically (soft-)search for the parts of the source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. (A sketch of the additive alignment model follows the links below.)
- Paper link: https://arxiv.org/pdf/1409.0473v7.pdf
- Code link: https://github.com/graykode/nlp-tutorial
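A sketch of the additive alignment model at the heart of the paper, e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), with the context vector computed as the attention-weighted sum of encoder states; the wiring around the decoder is omitted, and the weight names follow the paper:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: scores each encoder state h_j
    against the previous decoder state s_{i-1} via a small MLP."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        e = self.v_a(torch.tanh(
            self.W_a(dec_state).unsqueeze(1) + self.U_a(enc_states)
        )).squeeze(-1)                          # (batch, src_len)
        alpha = torch.softmax(e, dim=1)         # soft alignment weights
        context = (alpha.unsqueeze(-1) * enc_states).sum(1)
        return context, alpha

attn = AdditiveAttention(dec_dim=256, enc_dim=512, attn_dim=128)
ctx, alpha = attn(torch.randn(2, 256), torch.randn(2, 12, 512))
```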
5.5 Self-Attention with Relative Position Representations
- Main Idea: The Transformer of Vaswani et al. (2017) relies entirely on attention mechanisms and achieves state-of-the-art results in machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure; instead, it requires adding representations of absolute positions to its inputs. This work presents an alternative that extends self-attention to efficiently consider representations of the relative positions, or distances, between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU, respectively, over absolute position representations. Notably, the authors observe that combining relative and absolute position representations yields no further improvement in translation quality. The paper describes an efficient implementation of the method and casts it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs. (A sketch of relative-position self-attention follows the links below.)
- Paper link: https://arxiv.org/pdf/1803.02155v2.pdf
- Code link: https://github.com/tensorflow/tensor2tensor
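A rough sketch of the key modification, loosely following Shaw et al.: learned embeddings a^K_{j-i}, clipped to a maximum distance k, are added to the keys inside the dot product. The usual Q/K/V projections and the analogous value-side relative term are omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    """Self-attention whose scores include a learned relative-position term."""
    def __init__(self, d_model, max_rel_dist=8):
        super().__init__()
        self.k = max_rel_dist
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        n, d = q.size(1), q.size(2)
        pos = torch.arange(n)
        # Relative distance j - i, clipped to [-k, k], shifted to index range.
        rel = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k) + self.k
        a_k = self.rel_k(rel)                         # (n, n, d)
        scores = q @ k.transpose(-2, -1)              # content-content term
        scores = scores + torch.einsum("bid,ijd->bij", q, a_k)  # relative term
        scores = scores / math.sqrt(d)
        return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 16)
out = RelPosSelfAttention(16)(x, x, x)                # (2, 10, 16)
```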
5.6 A Structured Self-attentive Sentence Embedding
- Main Idea: This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of a vector, it uses a 2-D matrix to represent the embedding, with each row of the matrix attending to a different part of the sentence. The authors also propose a self-attention mechanism and a special regularization term for the model. As a side benefit, the embedding comes with an intuitive way of visualizing which specific parts of the sentence are encoded into it. The model is evaluated on three different tasks: author profiling, sentiment classification, and textual entailment, and yields significant performance gains over other sentence-embedding methods on all three. (A minimal sketch of the attention matrix and its penalty term follows the links below.)
- Paper link: https://arxiv.org/pdf/1703.03130v1.pdf
- Code link: https://github.com/facebookresearch/pytext
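A minimal sketch of the paper's structured attention: A = softmax(W_{s2} tanh(W_{s1} H^T)) over sequence positions, sentence embedding M = AH, and the Frobenius-norm penalty ||AA^T − I||_F^2 that pushes the attention hops to focus on different parts of the sentence (dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Multi-hop self-attention producing a matrix sentence embedding M = A H,
    plus the diversity penalty ||A A^T - I||_F^2 from the paper."""
    def __init__(self, hidden_dim, attn_dim, num_hops):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_s2 = nn.Linear(attn_dim, num_hops, bias=False)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim) bi-LSTM outputs
        A = torch.softmax(self.W_s2(torch.tanh(self.W_s1(H))), dim=1)
        A = A.transpose(1, 2)                   # (batch, num_hops, seq_len)
        M = A @ H                               # (batch, num_hops, hidden_dim)
        I = torch.eye(A.size(1)).expand(A.size(0), -1, -1)
        penalty = ((A @ A.transpose(1, 2) - I) ** 2).sum(dim=(1, 2)).mean()
        return M, penalty

H = torch.randn(4, 20, 128)
M, penalty = StructuredSelfAttention(128, 64, num_hops=8)(H)
```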