Author: Mr. Good Good (Zhihu)
Source: https://zhuanlan.zhihu.com/p/361893386

1 Background Knowledge

2 Principles and Classification of Attention Mechanism
2.1 Principles of Attention Mechanism

2.2 Classification of Attention Mechanism
- Soft Attention Mechanism. Can be divided into item-wise soft attention and location-wise soft attention.
- Hard Attention Mechanism. Can be divided into item-wise hard attention and location-wise hard attention.
- Self-Attention Mechanism. A variant of the attention mechanism that reduces reliance on external information and is better at capturing the internal correlations within data or features. In text applications, it mainly addresses long-range dependencies by computing the mutual influence between words. (A minimal sketch contrasting soft and hard attention follows this list.)
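To make the soft/hard distinction concrete, here is a minimal, illustrative PyTorch sketch (not taken from any of the papers below): soft attention computes a differentiable weighted average over all items, while hard attention samples a single item from the same distribution and therefore typically needs policy-gradient training.

```python
import torch
import torch.nn.functional as F

def soft_attention(query, items):
    # query: (d,), items: (n, d) -> differentiable context vector (d,)
    scores = items @ query                 # (n,) alignment scores
    weights = F.softmax(scores, dim=0)     # (n,) attention distribution
    return weights @ items                 # weighted average over all items

def hard_attention(query, items):
    # Samples one item; non-differentiable, so training usually
    # relies on policy-gradient estimators such as REINFORCE.
    scores = items @ query
    weights = F.softmax(scores, dim=0)
    idx = torch.multinomial(weights, num_samples=1)
    return items[idx.item()], idx

items = torch.randn(5, 8)
query = torch.randn(8)
context = soft_attention(query, items)       # (8,)
picked, idx = hard_attention(query, items)   # one row of `items`
```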

3 Soft Attention Mechanism (soft-attention)
3.1 Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Main Idea: Inspired by recent work in machine translation and object detection, the paper introduces an attention-based model that automatically learns to describe the content of images. The paper describes how to train the model deterministically using standard backpropagation and stochastically by maximizing a variational lower bound, and visualizes how the model learns to fix its gaze on salient objects while generating the corresponding words in the output sequence. The use of attention is validated with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k, and MS COCO. (A minimal sketch of the soft attention module follows the links below.)
- Paper link: https://arxiv.org/pdf/1502.03044v3.pdf
- Code link: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
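A rough sketch of the paper's soft (deterministic) attention, under the assumption that a CNN feature map is flattened into per-region annotation vectors and scored against the LSTM decoder state; layer names and sizes here are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftVisualAttention(nn.Module):
    """Illustrative soft attention over CNN region features; the real model
    also feeds the context vector back into the LSTM and uses a gating scalar."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim) from a flattened feature map,
        # e.g. cnn_out.flatten(2).transpose(1, 2); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                  # (batch, num_regions)
        alpha = F.softmax(e, dim=1)                     # where to look
        context = (alpha.unsqueeze(-1) * feats).sum(1)  # expected annotation
        return context, alpha

feats = torch.randn(2, 196, 512)   # e.g. a 14x14 feature map, flattened
context, alpha = SoftVisualAttention(512, 256, 128)(feats, torch.randn(2, 256))
```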
3.2 Action Recognition using Visual Attention
- Main Idea: For action recognition in videos, the paper proposes a soft-attention-based model built on a multi-layer recurrent neural network (RNN) with long short-term memory (LSTM) units that is deep both spatially and temporally. The model learns to focus selectively on parts of the video frames and classifies the video after taking a few glances; it inherently learns which parts of the frames are relevant to the task at hand and assigns them higher importance. The model is evaluated on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 datasets, with an analysis of how it directs its attention depending on the scene and the action being performed.
- Paper link: https://arxiv.org/pdf/1511.04119v3.pdf
- Code link: https://github.com/kracwarlock/action-recognition-visual-attention
3.3 Describing Videos by Exploiting Temporal Structure
- Main Idea: Recent progress in using recurrent neural networks (RNNs) for image description has sparked interest in applying them to video description. However, while images are static, videos require modeling their dynamic temporal structure and then correctly integrating that information into a natural-language description. The paper proposes an approach that successfully accounts for both the local and the global temporal structure of videos to produce descriptions. First, it incorporates a spatiotemporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics; the 3-D CNN is trained on video action recognition tasks so as to produce a representation tuned to human motion and behavior. Second, the paper proposes a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN.
- Paper link: https://arxiv.org/pdf/1502.08029v5.pdf
- Code link: https://github.com/yaoli/arctic-capgen-vid
3.4 Attention-Gated Networks for Improving Ultrasound Scan Plane Detection
- Main Idea: This work applies attention-gated networks to real-time, automated scan plane detection in fetal ultrasound screening. Scan plane detection in fetal ultrasound is challenging because of poor image quality, which hurts interpretability for both clinicians and automated algorithms. To address this, the paper proposes incorporating a self-gated soft attention mechanism that generates an end-to-end trainable gating signal, allowing the network to contextualize local information that is useful for prediction. The proposed attention mechanism is generic and can be integrated into any existing classification architecture with only a few additional parameters. The paper shows that when the underlying network has high capacity, the integrated attention mechanism improves overall performance while providing effective object localization; when the base network has low capacity, the method substantially outperforms the baseline approach and greatly reduces false positives. (A rough sketch of an attention gate follows the links below.)
- Paper link: https://arxiv.org/pdf/1804.05338v1.pdf
- Code link: https://github.com/ozan-oktay/Attention-Gated-Networks
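A rough, illustrative sketch of an attention gate in this spirit, assuming the gating signal has already been resampled to the resolution of the local feature map; the paper's exact normalization and architecture may differ:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Illustrative grid attention gate for 2-D feature maps: a coarse gating
    signal g decides which locations of the local features x to pass through."""
    def __init__(self, in_ch, gate_ch, inter_ch):
        super().__init__()
        self.theta = nn.Conv2d(in_ch, inter_ch, kernel_size=1)
        self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: (b, in_ch, H, W) local features; g: (b, gate_ch, H, W) gating signal
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(x) + self.phi(g))))
        return x * attn                   # gate local features by attention map

gate = AttentionGate(in_ch=64, gate_ch=128, inter_ch=32)
out = gate(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32))
```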
3.5 Self-Adaptive Physics-Informed Neural Networks using a Soft Attention Mechanism
- Main Idea: The paper proposes a fundamentally new way to train PINNs adaptively, in which the adaptation weights are fully trainable, so the neural network learns by itself which regions of the solution are difficult and focuses on them, reminiscent of the soft multiplicative attention masks used in computer vision. The basic idea behind these self-adaptive PINNs is to make the weights increase where the corresponding loss is high, which is achieved by training the network to simultaneously minimize the loss and maximize the weights, i.e., to find a saddle point of the cost surface. The paper shows that this is formally equivalent to solving PDE-constrained optimization with penalty-based methods, except that the monotonically non-decreasing penalty coefficients are trainable. In numerical experiments with the stiff Allen-Cahn PDE, the self-adaptive PINN outperformed other state-of-the-art PINN algorithms in L2 error while using fewer training iterations; an appendix contains additional results for the Burgers and Helmholtz PDEs, confirming the trends observed for Allen-Cahn. (A minimal sketch of the min-max training loop follows the links below.)
- Paper link: https://arxiv.org/pdf/2009.04544v2.pdf
- Code link: https://github.com/levimcclenny/SA-PINNs
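A minimal sketch of the min-max training idea, with a placeholder residual instead of a real PDE; the actual SA-PINN applies separate self-adaptive weights to residual, boundary, and initial losses and a mask function on the weights:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.Tanh(),
                          torch.nn.Linear(64, 1))
pts = torch.rand(1024, 2)                    # collocation points
lam = torch.rand(1024, requires_grad=True)   # self-adaptive weights

opt_net = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(u):
    # Placeholder: a real PINN differentiates u w.r.t. the inputs here.
    return u.squeeze(-1)

for step in range(100):
    opt_net.zero_grad()
    lam.grad = None
    loss = (lam * pde_residual(net(pts)) ** 2).mean()  # weighted residual loss
    loss.backward()
    opt_net.step()                      # gradient descent on network params
    with torch.no_grad():
        lam += 1e-2 * lam.grad          # gradient *ascent* on the weights
```

Because the gradient of the loss with respect to each weight is the (nonnegative) squared residual at that point, ascent automatically up-weights the regions where the PDE is hardest to satisfy.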
3.6 Recurrent Models of Visual Attention
- Main Idea: Applying convolutional neural networks to large images is computationally expensive because the cost scales linearly with the number of pixels. The paper proposes a novel recurrent neural network model that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. Like convolutional neural networks, the proposed model has a built-in degree of translation invariance, but the amount of computation it performs can be controlled independently of the input image size. Although the model is non-differentiable, it can be trained with reinforcement learning methods to learn task-specific policies. On several image classification tasks it significantly outperforms convolutional network baselines on cluttered images, and in a dynamic visual control problem it learns to track a simple object without an explicit training signal for doing so. (A toy sketch of the REINFORCE-style policy update follows the links below.)
- Paper link: https://arxiv.org/pdf/1406.6247v1.pdf
- Code link: https://github.com/kevinzakka/recurrent-visual-attention
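A toy sketch of the REINFORCE-style update that makes a non-differentiable glimpse policy trainable; the glimpse network and classifier of the actual model are collapsed into a placeholder reward here:

```python
import torch

# Toy location policy: a learnable mean for a Gaussian over glimpse locations.
loc_mean = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([loc_mean], lr=1e-2)

def classify_glimpse(loc):
    # Placeholder for "crop a patch at `loc` and classify it": reward 1 if the
    # glimpse lands near the (hypothetical) object at the origin, else 0.
    return (loc.norm() < 0.5).float()

for step in range(200):
    dist = torch.distributions.Normal(loc_mean, torch.ones(2) * 0.1)
    loc = dist.sample()                          # non-differentiable sample
    reward = classify_glimpse(loc)
    loss = -dist.log_prob(loc).sum() * reward    # REINFORCE surrogate loss
    opt.zero_grad(); loss.backward(); opt.step()
```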

4 Hard Attention Mechanism (hard-attention)
4.1 Attention-based Extraction of Structured Information from Street View Imagery
- Main Idea: The paper proposes a model based on CNNs, RNNs, and a novel attention mechanism that achieves 84.2% accuracy on the challenging French Street Name Signs (FSNS) dataset, significantly surpassing the previous state of the art (Smith '16), which reached 72.46%. The new method is also simpler and more general than previous approaches. To demonstrate this generality, the paper shows that it performs well on a more challenging dataset derived from Google Street View, where the goal is to extract business names from storefronts. Finally, the paper studies the speed/accuracy trade-offs of CNN feature extractors of different depths and, surprisingly, finds that deeper is not always better (in terms of both accuracy and speed). The resulting model is simple, accurate, and fast, and can be used at scale on a variety of challenging real-world text-extraction problems.
- Paper link: https://arxiv.org/pdf/1704.03549v4.pdf
- Code link: https://github.com/tensorflow/models
4.2 Hard Non-Monotonic Attention for Character-Level Transduction
- Main Idea: Character-level string-to-string transduction is a component of many NLP tasks; the goal is to map an input string to an output string, where the two strings may differ in length and be drawn from different alphabets. Recent approaches have used sequence-to-sequence models with attention to learn which parts of the input string the model should focus on while generating the output. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention had only been applied to other sequence modeling tasks such as image captioning, and it required stochastic approximations to compute gradients. This work introduces an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, and shows that the hard attention model can be viewed as a neural reparameterization of the classical IBM Model 1. (A minimal sketch of the per-position marginalization follows the links below.)
- Paper link: https://arxiv.org/pdf/1808.10024v2.pdf
- Code link: https://github.com/shijie-wu/neural-transducer
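A minimal sketch of why the marginalization is polynomial under an IBM-Model-1-style independence assumption: each target position aligns to a source position independently, so the sum over exponentially many alignments factorizes into one logsumexp per target position (the tensors below are toy values, and the neural parameterization of the scores is omitted):

```python
import torch

def log_likelihood(align_logits, emit_logprobs):
    # align_logits:  (tgt_len, src_len) unnormalized alignment scores
    # emit_logprobs: (tgt_len, src_len) log p(y_j | x_i)
    align_logprobs = torch.log_softmax(align_logits, dim=1)
    # logsumexp over source positions marginalizes out the hard alignment
    per_position = torch.logsumexp(align_logprobs + emit_logprobs, dim=1)
    return per_position.sum()            # log p(y | x)

ll = log_likelihood(torch.randn(7, 5), -5 * torch.rand(7, 5))
```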
4.3 Saccader: Improving Accuracy of Hard Attention Models for Vision
- Main Idea: The paper proposes a novel hard attention model called Saccader. Key to Saccader is a pretraining step that requires only class labels and provides initial attention locations for policy-gradient optimization. The best model considerably narrows the gap to common ImageNet baselines, achieving 75% top-1 and 91% top-5 accuracy while attending to less than one-third of the image.
- Paper link: https://arxiv.org/pdf/1908.07644v3.pdf
- Code link: https://github.com/google-research/google-research
4.4 Overcoming Catastrophic Forgetting with Hard Attention to the Task
- Main Idea: Catastrophic forgetting occurs when a neural network loses the information learned in a previous task while training on subsequent tasks, and it remains an obstacle for AI systems with sequential learning capabilities. The authors propose a task-based hard attention mechanism that preserves information from previous tasks without affecting learning of the current task. A hard attention mask is learned concurrently with each task through stochastic gradient descent, and previous masks are exploited to condition this learning. The paper shows that the proposed mechanism is effective at reducing catastrophic forgetting, cutting current forgetting rates by 45% to 80%, and demonstrates robustness to different hyperparameter choices while offering a number of monitoring capabilities. The approach makes it possible to control both the stability and the compactness of the learned knowledge, which the authors believe also makes it attractive for online learning and network compression applications. (A minimal sketch of a task-gated layer follows the links below.)
- Paper link: https://arxiv.org/pdf/1801.01423v3.pdf
- Code link: https://github.com/joansj/hat
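A minimal sketch of a task-gated layer in the spirit of HAT: a per-task embedding passed through a scaled sigmoid approaches a binary mask as the scale s is annealed upward, staying differentiable throughout (names, sizes, and the annealing schedule are illustrative):

```python
import torch
import torch.nn as nn

class HATLayer(nn.Module):
    """Layer whose units are gated by a task-conditioned, near-binary mask."""
    def __init__(self, in_features, out_features, num_tasks):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.task_embed = nn.Embedding(num_tasks, out_features)

    def forward(self, x, task_id, s):
        # s is annealed from ~1 to a large value during training, making the
        # sigmoid gate increasingly binary while remaining differentiable.
        mask = torch.sigmoid(s * self.task_embed(task_id))   # (1, out_features)
        return self.fc(x) * mask

layer = HATLayer(16, 32, num_tasks=3)
out = layer(torch.randn(4, 16), torch.tensor([0]), s=50.0)
```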

5 Self-Attention Mechanism (self-attention)
5.1 Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling
- Main Idea: The paper integrates soft and hard attention into a single context-fusion model, Reinforced Self-Attention (ReSA), so that the two can benefit each other. In ReSA, a hard attention module prunes the sequence for a soft self-attention module to process, while the soft attention provides reward signals back to facilitate training the hard attention. To this end, the authors develop a novel hard attention mechanism, Reinforced Sequence Sampling (RSS), which selects tokens in parallel and is trained via policy gradients. Using two RSS modules, ReSA efficiently extracts sparse dependencies between each pair of selected tokens. Finally, the paper proposes a sentence-encoding model built entirely on ReSA, the Reinforced Self-Attention Network (ReSAN), which achieves state-of-the-art performance on the Stanford Natural Language Inference (SNLI) and Sentences Involving Compositional Knowledge (SICK) datasets.
- Paper link: https://arxiv.org/pdf/1801.10296v2.pdf
- Code link: https://github.com/taoshen58/DiSAN
5.2 Attention Is All You Need
- Main Idea: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration, with the best-performing models also connecting the encoder and decoder through an attention mechanism. The authors propose a new, simple network architecture, the Transformer, based entirely on attention mechanisms, dispensing with recurrence and convolutions altogether. Experiments on two machine translation tasks show that these models are superior in quality while being more parallelizable and requiring significantly less time to train. The model achieves a new state-of-the-art BLEU score of 28.4 on the WMT 2014 English-to-German translation task, improving over the previous best results (including ensembles) by over 2 BLEU; on the WMT 2014 English-to-French translation task, it establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs. (The core scaled dot-product attention is sketched after the links below.)
- Paper link: https://arxiv.org/pdf/1706.03762v5.pdf
- Code link: https://github.com/tensorflow/tensor2tensor
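The core of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal PyTorch version, without the multi-head projections:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in the paper.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 10, 64)            # self-attention: Q = K = V source
out = scaled_dot_product_attention(q, k, v)   # (2, 10, 64)
```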
5.3 A Neural Attention Model for Abstractive Sentence Summarization
- Main Idea: Summarization based on text extraction is inherently limited, but abstractive methods that generate text from scratch are challenging to build. This work proposes a fully data-driven approach to abstractive sentence summarization. The method uses a local attention-based model that generates each word of the summary conditioned on the input sentence. Despite its structural simplicity, the model can easily be trained end to end and scales to large amounts of training data. It shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.
- Paper link: https://arxiv.org/pdf/1509.00685v2.pdf
- Code link: https://github.com/toru34/rushemnlp2015
5.4 Neural Machine Translation by Jointly Learning to Align and Translate
- Main Idea: Neural machine translation is a recently proposed approach to machine translation. Unlike traditional statistical machine translation, it aims to build a single neural network that can be jointly tuned to maximize translation performance. Recently proposed neural machine translation models often belong to the encoder-decoder family, with an encoder that compresses the source sentence into a fixed-length vector and a decoder that generates the translation from that vector. This paper conjectures that the fixed-length vector is a bottleneck for improving the performance of this basic encoder-decoder architecture and proposes to extend it by letting the model automatically (soft-)search for the parts of the source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. (A sketch of the additive alignment model follows the links below.)
- Paper link: https://arxiv.org/pdf/1409.0473v7.pdf
- Code link: https://github.com/graykode/nlp-tutorial
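A sketch of the additive alignment model at the heart of the paper, e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), with the context vector computed as the attention-weighted sum of encoder states; the wiring around the decoder is omitted, and the weight names follow the paper:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: scores each encoder state h_j
    against the previous decoder state s_{i-1} via a small MLP."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_a = nn.Linear(dec_dim, attn_dim, bias=False)
        self.U_a = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        e = self.v_a(torch.tanh(
            self.W_a(dec_state).unsqueeze(1) + self.U_a(enc_states)
        )).squeeze(-1)                          # (batch, src_len)
        alpha = torch.softmax(e, dim=1)         # soft alignment weights
        context = (alpha.unsqueeze(-1) * enc_states).sum(1)
        return context, alpha

attn = AdditiveAttention(dec_dim=256, enc_dim=512, attn_dim=128)
ctx, alpha = attn(torch.randn(2, 256), torch.randn(2, 12, 512))
```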
5.5 Self-Attention with Relative Position Representations
- Main Idea: The Transformer of Vaswani et al. (2017) relies entirely on attention mechanisms and achieves state-of-the-art results in machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure; instead, it requires adding representations of absolute positions to its inputs. This work presents an alternative that extends self-attention to efficiently consider representations of the relative positions, or distances, between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU, respectively, over absolute position representations. Notably, the authors observe that combining relative and absolute position representations yields no further improvement in translation quality. The paper describes an efficient implementation of the method and casts it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs. (A sketch of relative-position self-attention follows the links below.)
- Paper link: https://arxiv.org/pdf/1803.02155v2.pdf
- Code link: https://github.com/tensorflow/tensor2tensor
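A rough sketch of the key modification, loosely following Shaw et al.: learned embeddings a^K_{j-i}, clipped to a maximum distance k, are added to the keys inside the dot product. The usual Q/K/V projections and the analogous value-side relative term are omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    """Self-attention whose scores include a learned relative-position term."""
    def __init__(self, d_model, max_rel_dist=8):
        super().__init__()
        self.k = max_rel_dist
        self.rel_k = nn.Embedding(2 * max_rel_dist + 1, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq_len, d_model)
        n, d = q.size(1), q.size(2)
        pos = torch.arange(n)
        # Relative distance j - i, clipped to [-k, k], shifted to index range.
        rel = (pos[None, :] - pos[:, None]).clamp(-self.k, self.k) + self.k
        a_k = self.rel_k(rel)                         # (n, n, d)
        scores = q @ k.transpose(-2, -1)              # content-content term
        scores = scores + torch.einsum("bid,ijd->bij", q, a_k)  # relative term
        scores = scores / math.sqrt(d)
        return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 10, 16)
out = RelPosSelfAttention(16)(x, x, x)                # (2, 10, 16)
```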
5.6 A Structured Self-attentive Sentence Embedding
- Main Idea: This paper proposes a new model for extracting an interpretable sentence embedding by introducing self-attention. Instead of a vector, it uses a 2-D matrix to represent the embedding, with each row of the matrix attending to a different part of the sentence. The authors also propose a self-attention mechanism and a special regularization term for the model. As a side benefit, the embedding comes with an intuitive way of visualizing which specific parts of the sentence are encoded into it. The model is evaluated on three different tasks: author profiling, sentiment classification, and textual entailment, and yields significant performance gains over other sentence-embedding methods on all three. (A minimal sketch of the attention matrix and its penalty term follows the links below.)
- Paper link: https://arxiv.org/pdf/1703.03130v1.pdf
- Code link: https://github.com/facebookresearch/pytext
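A minimal sketch of the paper's structured attention: A = softmax(W_{s2} tanh(W_{s1} H^T)) over sequence positions, sentence embedding M = AH, and the Frobenius-norm penalty ||AA^T − I||_F^2 that pushes the attention hops to focus on different parts of the sentence (dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class StructuredSelfAttention(nn.Module):
    """Multi-hop self-attention producing a matrix sentence embedding M = A H,
    plus the diversity penalty ||A A^T - I||_F^2 from the paper."""
    def __init__(self, hidden_dim, attn_dim, num_hops):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.W_s2 = nn.Linear(attn_dim, num_hops, bias=False)

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim) bi-LSTM outputs
        A = torch.softmax(self.W_s2(torch.tanh(self.W_s1(H))), dim=1)
        A = A.transpose(1, 2)                   # (batch, num_hops, seq_len)
        M = A @ H                               # (batch, num_hops, hidden_dim)
        I = torch.eye(A.size(1)).expand(A.size(0), -1, -1)
        penalty = ((A @ A.transpose(1, 2) - I) ** 2).sum(dim=(1, 2)).mean()
        return M, penalty

H = torch.randn(4, 20, 128)
M, penalty = StructuredSelfAttention(128, 64, num_hops=8)(H)
```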