Latest Review on Attention Mechanism and Related Source Code

Introduction

The left side of the figure below shows the traditional Seq2Seq model, which encodes a sequence and then decodes it back into a sequence. This is a conventional LSTM-based model: the output at a given timestep in the Decoder depends only on that timestep's hidden state, which in turn is computed from the previous timestep's hidden state and output. The right side shows the Attention-based Seq2Seq model, where the Decoder's output additionally depends on a context feature c, obtained as a weighted average of all the Encoder's hidden states; the weights used in this average are the Attention Scores (a) between the current Decoder timestep and each Encoder timestep.

[Figure: traditional Seq2Seq model (left) vs. Attention-based Seq2Seq model (right)]

General Form of Attention

The following formulas give the basic form of Attention (Basic Attention): u is the matching feature vector for the current task, used to interact with the context; v_i is the feature vector at a given timestep of the sequence; e_i is the unnormalized Attention Score; a_i is the normalized Attention Score; and c is the context feature at the current timestep, computed from the Attention Scores and the feature sequence v.

e_i = a(u, v_i)
a_i = exp(e_i) / sum_j exp(e_j)
c = sum_i a_i * v_i
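Computationally, basic attention is just a score, a softmax, and a weighted sum. A minimal NumPy sketch (the function names and the dot-product choice for the score function a(u, v_i) are illustrative assumptions, not from the original):

```python
import numpy as np

def softmax(e):
    # Numerically stable softmax over the score vector.
    e = e - e.max()
    w = np.exp(e)
    return w / w.sum()

def basic_attention(u, V):
    # u: matching vector for the current task, shape (d,)
    # V: sequence feature vectors v_i stacked as rows, shape (n, d)
    e = V @ u          # unnormalized scores e_i = a(u, v_i), here a dot product
    a = softmax(e)     # normalized Attention Scores a_i
    c = a @ V          # context feature c = sum_i a_i * v_i
    return a, c

rng = np.random.default_rng(0)
u = rng.normal(size=4)
V = rng.normal(size=(5, 4))
a, c = basic_attention(u, V)
```

The Attention Scores a form a probability distribution over the sequence positions, and c lives in the same feature space as the v_i.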

In most cases, e_i can be computed in one of several standard ways, for example:

dot product:        e_i = u^T v_i
scaled dot product: e_i = u^T v_i / sqrt(d)
bilinear:           e_i = u^T W v_i
additive (MLP):     e_i = w^T tanh(W1 u + W2 v_i)
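The common scoring functions (dot product, scaled dot product, bilinear, and additive/MLP) can be written out directly in NumPy; all weight shapes below are illustrative assumptions:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
u, v = rng.normal(size=d), rng.normal(size=d)   # matching vector u and one v_i
W = rng.normal(size=(d, d))                     # bilinear weight (assumed shape)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w = rng.normal(size=d)                          # output projection for the MLP score

e_dot      = u @ v                              # dot product: u^T v_i
e_scaled   = (u @ v) / np.sqrt(d)               # scaled dot product
e_bilinear = u @ W @ v                          # bilinear: u^T W v_i
e_additive = w @ np.tanh(W1 @ u + W2 @ v)       # additive / MLP score
```

Each variant produces a single scalar score per position; the choice mainly trades off parameter count against the flexibility of the interaction between u and v_i.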

In practical applications, besides the basic Attention, there are various variants of Attention. Below we introduce some common variants:

Variant – Multi-dimensional Attention

For each u, Basic Attention produces one Attention Score a_i per v_i, so each u corresponds to a one-dimensional vector of Attention Scores. Multi-dimensional Attention instead produces a higher-dimensional Attention matrix, aiming to capture Attention in different feature subspaces. Some forms of 2-D Attention are shown below:

[Figure: some forms of 2-D Attention]
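One way to read the 2-D variant: the score for each v_i becomes a vector with one entry per feature dimension, and the softmax is taken over positions separately for each dimension. A sketch under that assumption (the tanh parametrization and weight shapes are hypothetical):

```python
import numpy as np

def multi_dim_attention(u, V, W1, W2):
    # u: (d,) matching vector; V: (n, d) sequence features.
    # E[i, k] is the score of position i in feature dimension k,
    # so each u yields an (n, d) Attention matrix rather than an (n,) vector.
    E = np.tanh(u @ W1.T + V @ W2.T)          # (n, d) score matrix
    A = np.exp(E - E.max(axis=0))
    A = A / A.sum(axis=0)                     # softmax over positions, per dimension
    c = (A * V).sum(axis=0)                   # dimension-wise weighted context
    return A, c

rng = np.random.default_rng(2)
u, V = rng.normal(size=4), rng.normal(size=(6, 4))
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
A, c = multi_dim_attention(u, V, W1, W2)
```

Each column of A is its own attention distribution, which is what lets different feature dimensions attend to different positions.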

Variant – Hierarchical Attention

Some Attention algorithms consider Attention across different semantic levels. For example, the model below uses Attention at both the word level and the sentence level sequentially to obtain better features:

[Figure: hierarchical Attention applied at the word level and then the sentence level]
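Operationally, this is basic attention applied twice: once over the words of each sentence, then over the resulting sentence vectors. A sketch (the word- and sentence-level matching vectors u_word and u_sent are assumed to be learned parameters; the underlying sequence encoders are omitted):

```python
import numpy as np

def attend(u, V):
    # Basic attention: softmax-weighted average of the rows of V.
    e = V @ u
    a = np.exp(e - e.max())
    a = a / a.sum()
    return a @ V

def hierarchical_attention(doc, u_word, u_sent):
    # doc: list of sentences, each an (n_words, d) array of word features.
    sent_vecs = np.stack([attend(u_word, S) for S in doc])  # word-level attention
    return attend(u_sent, sent_vecs)                        # sentence-level attention

rng = np.random.default_rng(3)
doc = [rng.normal(size=(5, 4)), rng.normal(size=(7, 4))]   # two sentences
doc_vec = hierarchical_attention(doc, rng.normal(size=4), rng.normal(size=4))
```

The word-level step compresses each sentence to one vector, and the sentence-level step compresses the document to one vector, with attention weights at both semantic levels.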

Variant – Self Attention

Replacing u in the above formulas with a v_i from the context sequence itself gives us Self Attention. In NLP, Self Attention can capture dependency relationships between the words of a sentence. Additionally, in some tasks the meaning of a word depends closely on its context. For instance, in the following two sentences, ‘bank’ refers to a financial institution and to the side of a river, respectively; to determine which meaning is intended, we can rely on the rest of the sentence.

I arrived at the bank after crossing the street.

I arrived at the bank after crossing the river.
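In matrix form, Self Attention scores every position of a sequence against every other position, so each word's new representation is a context-weighted mixture of the whole sentence. A minimal sketch using a plain dot-product score (an assumption for illustration; practical models usually add learned projections):

```python
import numpy as np

def self_attention(V):
    # V: (n, d) features of one sequence; every v_i plays the role of u in turn.
    E = V @ V.T                              # pairwise scores e_ij = v_i . v_j
    E = E - E.max(axis=1, keepdims=True)     # stabilize the row-wise softmax
    A = np.exp(E)
    A = A / A.sum(axis=1, keepdims=True)     # row i: attention distribution of position i
    return A, A @ V                          # context-aware features for every position

V = np.random.default_rng(4).normal(size=(6, 4))
A, C = self_attention(V)
```

Row i of A shows which positions word i attends to; for an ambiguous word like ‘bank’, a high weight on ‘river’ versus ‘street’ is exactly the disambiguating signal.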

Variant – Memory-based Attention

The form of Memory-based Attention is as follows, where {(k_i, v_i)} is referred to as the Memory. Here, the Memory is essentially a re-representation of the input. In particular, when k_i and v_i are equal, Memory-based Attention reduces to Basic Attention.

e_i = a(q, k_i)
a_i = exp(e_i) / sum_j exp(e_j)
c = sum_i a_i * v_i
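A sketch of the key-value form (the function names and the dot-product score are illustrative assumptions); setting the keys equal to the values recovers Basic Attention:

```python
import numpy as np

def memory_attention(q, K, V):
    # Memory is the set of pairs (k_i, v_i): K and V stack them as rows.
    # Scores are computed against the keys, the context against the values.
    e = K @ q                                  # e_i = a(q, k_i), dot-product score
    a = np.exp(e - e.max())
    a = a / a.sum()                            # normalized Attention Scores a_i
    return a @ V                               # c = sum_i a_i * v_i

def basic_attention(u, V):
    # Basic Attention is the special case k_i = v_i.
    return memory_attention(u, V, V)

rng = np.random.default_rng(5)
q = rng.normal(size=4)
K, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
c = memory_attention(q, K, V)
```

Separating keys from values is what allows the addressing signal (where to look) to differ from the content that is actually retrieved.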

For example, in QA tasks, Memory-based Attention can iteratively update the Memory to shift attention toward the location of the answer.

[Figure: a memory network iteratively updating the Memory to locate the answer in a QA task]

Evaluation of Attention

Quantitative evaluation of Attention can be done through intrinsic and extrinsic methods. Intrinsic evaluation compares the learned attention weights against labeled alignment data, which requires a large amount of manual annotation. Extrinsic evaluation is simpler: it directly compares model performance on downstream tasks, but it is then difficult to attribute performance gains specifically to the Attention mechanism.

Qualitative evaluation is generally done through visualized heat maps:

[Figure: attention heat map visualization]

Related Attention Code

  • “Neural Machine Translation by Jointly Learning to Align and Translate”: https://github.com/tensorflow/nmt

  • “Hierarchical Attention Networks for Document Classification”: https://github.com/richliao/textClassifier

  • “Coupled Multi-Layer Attentions for Co-Extraction of Aspect and Opinion Terms”: https://github.com/happywwy/Coupled-Multi-layer-Attentions

  • “Attention Is All You Need”: https://github.com/Kyubyong/transformer

  • “End-To-End Memory Networks”: https://github.com/facebook/MemNN

References:

  • “An Introductory Survey on Attention Mechanisms in NLP Problems”: https://arxiv.org/abs/1811.05544


Source: ZHUAN ZHI
