Produced by the Machine Learning Algorithms and Natural Language Processing official account
Original column author: Don.hub
Position: Algorithm Engineer at JD.com
School: Imperial College London
1. Outline
- Intuition
- Analysis
  - Pros
  - Cons
  - From Seq2Seq To Attention Model
    - Seq2Seq is important, but its flaws are obvious
    - Attention was born
    - Write the encoder and decoder model
- Taxonomy of Attention
  - Number of Sequence
    - Distinctive
    - Co-Attention
    - Self
  - Number of Abstraction
    - Single-Level
    - Multi-Level
  - Number of Positions
    - Soft/Global
    - Hard
    - Local
  - Number of Representations
    - Multi-Representational
    - Multi-Dimensional
- Summary
- Networks with Attention
  - Encoder-Decoder
    - CNN/RNN + RNN
    - Pointer Networks
    - Transformer
  - Memory Networks
- Applications
  - NLG
  - Classification
  - Recommendation Systems
- Reference
2. Intuition
The term "attention" is quite descriptive: when we look at an image, we are naturally drawn to its more important or prominent elements, so we focus on those local regions. In computer vision (CV), this means that some local parts of an image carry more weight; for example, when generating a caption for an image, each word of the caption mainly attends to a local region of the image.

In NLP, think of reading comprehension: we look for answers while reading an article, so different parts of the article receive different amounts of attention. In sentiment analysis of reviews, for instance, certain emotion-bearing words such as "amazing" deserve special attention, because such critical emotional words usually determine the reviewer's sentiment, as shown in the figure from HAN (Yang et al., from Professor He's team).

In simple terms, attention is a vector of weights.
3. Analysis
3.1 Pros
The main benefits of attention are good interpretability and a significant improvement in model performance. It has become an essential module in many SOTA models, especially since the emergence of the Transformer (which uses self/global/multi-level/multi-head attention), which has greatly changed the landscape of NLP.
3.2 Cons
Attention itself cannot capture positional information, so positional information has to be added separately. Of course, different attention mechanisms have their own characteristics. As for the drawbacks of the Transformer in particular, the biggest issue is its large memory consumption: the attention scores form an N*N matrix, so the sequence length N cannot be too long, which in turn breaks the dependency between segments of long sequences (see Transformer-XL and XLNet for discussions and solutions).
3.3 From Seq2Seq To Attention Model
Why was attention created? Attention was originally devised for translation tasks (though it is ultimately not limited to translation), so let's look at how it evolved.
3.3.1 Seq2Seq is Important, but Its Flaws are Obvious
The Seq2Seq model consists of an encoder and a decoder and is mainly used to translate input text into target text. Both the encoder and the decoder are RNNs (vanilla RNN, LSTM, GRU, or bidirectional RNN). The model encodes the source text into a fixed-length context vector and then decodes the target output from this encoding. This framework can be applied to translation, speech-to-text, dialogue generation, and other sequence-to-sequence tasks.

However, the drawbacks of this model are also evident:
- All inputs are encoded into a single fixed-length context vector. What is the appropriate length? It is hard to give a precise answer; a fixed-length vector cannot encode all of the contextual information, so many long-distance dependencies are lost.
- When generating output, the decoder has no mechanism for matching against the encoder's inputs, i.e. for placing different weights on different inputs.
- The model is unable to capture alignment between the input and output sequences, which is an essential aspect of structured output tasks such as translation or summarization [Young et al., 2018]. Intuitively, in sequence-to-sequence tasks each output token is expected to be more influenced by some specific parts of the input sequence, yet the decoder lacks any mechanism to selectively focus on the relevant input tokens while generating each output token.
3.3.2 Attention Was Born
NMT [paper] [code] was the first work to propose adding an attention block between the encoder and decoder, primarily to solve the matching problem between the encoder and the decoder.

Concretely, the attention weights and the context vector are computed as a_ij = softmax_j(score(s_(i-1), h_j)) and c_i = sum_j a_ij * h_j, where
- s_0 is the decoder's initial hidden state, which is randomly initialized (whereas Seq2Seq uses the context vector as the decoder's initial hidden state), and s_i denotes the decoder hidden states;
- h_j is the output hidden state at the j-th encoder position;
- a_ij is the weight that the i-th decoder position assigns to the j-th encoder position;
- y_i is the output at the i-th decoder position, obtained by passing the hidden state through a fully connected layer;
- c_i is the context vector at the i-th decoder position, i.e. the weighted sum of the encoder output hidden states;
- the input to the decoder at step i is the concatenation of c_i and the embedded decoder input.
3.3.3 Write the Encoder and Decoder Model
For a detailed implementation, refer to TensorFlow's tf1.x Neural Machine Translation (seq2seq) tutorial; the code here uses the newer 2.x API.
The hidden states produced by the encoder have shape (batch_size, max_length, hidden_size), and the decoder's hidden state has shape (batch_size, hidden_size).
Below are the implemented equations:

This tutorial uses Bahdanau attention for the encoder. Let's decide on notation before writing the simplified form:
- FC = Fully connected (dense) layer
- EO = Encoder output
- H = Hidden state
- X = Input to the decoder

And the pseudo-code:
- score = FC(tanh(FC(EO) + FC(H)))
- attention_weights = softmax(score, axis=1). Softmax is applied on the last axis by default, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.
- context_vector = sum(attention_weights * EO, axis=1). Same reason as above for choosing axis 1.
- embedding_output = the input to the decoder X is passed through an embedding layer.
- merged_vector = concat(embedding_output, context_vector)
- This merged vector is then given to the GRU.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights


class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))


class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size, vocab)
        x = self.fc(output)
        return x, state, attention_weights
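As a quick sanity check, the three classes above can be wired together on random data to verify that the shapes match the comments. The batch size, vocabulary size, and dimensions below are arbitrary illustrative values (not from the tutorial), so treat this as a minimal smoke test rather than part of the original code.

# Hypothetical smoke test with arbitrary sizes; only the shapes matter here.
# Assumes the classes and the tensorflow import above.
batch_sz, max_length, vocab_size, embedding_dim, units = 16, 20, 1000, 256, 512

encoder = Encoder(vocab_size, embedding_dim, units, batch_sz)
decoder = Decoder(vocab_size, embedding_dim, units, batch_sz)

src = tf.random.uniform((batch_sz, max_length), maxval=vocab_size, dtype=tf.int32)
enc_hidden = encoder.initialize_hidden_state()              # (batch_sz, units)
enc_output, enc_hidden = encoder(src, enc_hidden)           # (batch_sz, max_length, units), (batch_sz, units)

dec_input = tf.random.uniform((batch_sz, 1), maxval=vocab_size, dtype=tf.int32)  # e.g. <start> tokens
logits, dec_hidden, attn = decoder(dec_input, enc_hidden, enc_output)
print(logits.shape)  # (batch_sz, vocab_size)
print(attn.shape)    # (batch_sz, max_length, 1)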
4. Taxonomy of Attention
According to different classification criteria, attention can be divided into several categories, but all of them involve the interaction between q (query), k (key), and v (value): a score is computed from q and k (the scoring functions vary, as shown below), the scores are normalized with softmax, and the normalized scores are then used to weight and sum v (or to take an argmax, as in pointer networks).
Below is a summary of several popular attention mechanisms and the corresponding alignment score functions, where s_t is the query (e.g. decoder) state and h_i a key (e.g. encoder) state:
- Content-based attention [Graves et al., 2014]: score(s_t, h_i) = cosine(s_t, h_i)
- Additive (*) [Bahdanau et al., 2015]: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
- Location-based [Luong et al., 2015]: a_(t,i) = softmax(W_a s_t)
- General [Luong et al., 2015]: score(s_t, h_i) = s_t^T W_a h_i
- Dot-product [Luong et al., 2015]: score(s_t, h_i) = s_t^T h_i
- Scaled dot-product (^) [Vaswani et al., 2017]: score(s_t, h_i) = s_t^T h_i / sqrt(n)

(*) Referred to as "concat" in Luong et al., 2015 and as "additive attention" in Vaswani et al., 2017. (^) It adds a scaling factor 1/sqrt(n), motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning hard.
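To make the score functions concrete, here is a minimal TensorFlow sketch of four of them (dot-product, general, additive, and scaled dot-product). The tensor names and shapes are illustrative assumptions, not code from any of the cited papers.

import tensorflow as tf

# s: query/decoder state of shape (batch, d); H: key/encoder states of shape (batch, T, d).
def dot_score(s, H):
    return tf.einsum('bd,btd->bt', s, H)                 # s_t^T h_i

def general_score(s, H, Wa):                             # Wa: (d, d)
    return tf.einsum('be,bte->bt', tf.einsum('bd,de->be', s, Wa), H)  # s_t^T W_a h_i

def additive_score(s, H, W1, W2, v):                     # W1: (d, u), W2: (d, u), v: (u,)
    # v^T tanh(W1 h_i + W2 s_t), with s_t broadcast over the time axis
    hidden = tf.tanh(tf.einsum('btd,du->btu', H, W1) +
                     tf.expand_dims(tf.einsum('bd,du->bu', s, W2), 1))
    return tf.einsum('btu,u->bt', hidden, v)

def scaled_dot_score(s, H):
    n = tf.cast(tf.shape(H)[-1], tf.float32)
    return dot_score(s, H) / tf.sqrt(n)                  # s_t^T h_i / sqrt(n)

def attend(scores, H):
    alpha = tf.nn.softmax(scores, axis=-1)               # normalize over the T key positions
    context = tf.einsum('bt,btd->bd', alpha, H)          # weighted sum of the values
    return context, alpha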
The following classifications are not mutually exclusive; for example, the HAN model is a multi-level, soft attention model (AM).
4.1 Number of Sequence
Classified based on the sequences from which our query and value come.
4.1.1 Distinctive
In distinctive attention, the query and the value come from two different sequences, an input sequence and an output sequence; in NMT, for example, the query comes from the hidden state of the decoder and the value comes from the hidden states of the encoder.
4.1.2 Co-Attention
Co-attention models jointly learn attention weights over multiple input sequences and capture the interactions between these inputs. For example, in visual question answering, the authors argue that attention over the image is important but attention over the question text is equally important, so they learn both jointly, allowing the model to capture the key parts of the question and the corresponding image regions at the same time.
4.1.3 Self
In scenarios such as text classification or recommendation, the input is a sequence but the output is not a sequence. In this case, each word in the text attends to the other words of the same sequence to measure how relevant they are to it. As shown in the figure.

We can look at the docstring of the self-attention implementation in BERT: if from_tensor == to_tensor, then it is self-attention.
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
    """Performs multi-headed attention from `from_tensor` to `to_tensor`.

    This is an implementation of multi-headed attention based on "Attention
    is all you Need". If `from_tensor` and `to_tensor` are the same, then
    this is self-attention. Each timestep in `from_tensor` attends to the
    corresponding sequence in `to_tensor`, and returns a fixed-width vector.
    """
4.2 Number of Abstraction
This is classified based on the hierarchy of attention weight calculations.
4.2.1 Single-Level
In the most common case, attention is calculated on the input sequence, which is the ordinary single-level attention.
4.2.2 Multi-Level
However, many models, such as HAN, have a hierarchical structure. HAN addresses document classification: a document is composed of sentences and a sentence is composed of words, so the model builds a two-level encoder (bidirectional GRUs). The lower encoder encodes words and the upper encoder encodes sentences, with a word-level attention layer in between that aggregates the word encodings into sentence representations. Finally, for the document-classification output, a sentence-level attention is applied as well, followed by a dense layer that classifies the document. Note that the two queries, u_w (word level) and u_s (sentence level), are both randomly initialized and trained along with the model, and the scoring function is also a dense (MLP) layer; unlike NMT, however, this is self-attention.
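A minimal sketch of HAN-style word-level attention, where the query u_w is a trainable vector rather than a decoder state. The class and variable names are mine, not the authors' code.

import tensorflow as tf

class WordAttention(tf.keras.layers.Layer):
    """HAN-style attention: u_it = tanh(W h_it + b), alpha_it = softmax(u_it . u_w)."""
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units, activation='tanh')
        # u_w: randomly initialized query, trained together with the model
        self.u_w = self.add_weight(name='u_w', shape=(units,), initializer='glorot_uniform')

    def call(self, hidden_states):               # hidden_states: (batch, num_words, hidden)
        u = self.W(hidden_states)                # (batch, num_words, units)
        score = tf.einsum('bwu,u->bw', u, self.u_w)
        alpha = tf.nn.softmax(score, axis=-1)    # word importance weights
        sentence_vec = tf.einsum('bw,bwh->bh', alpha, hidden_states)
        return sentence_vec, alpha

The same layer, applied to sentence vectors with a second trainable query u_s, gives the sentence-level attention.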

4.3 Number of Positions
Based on the positions that the attention layer attends to, we can classify attention into three types: global/soft (these two are almost the same), local, and hard attention. Effective Approaches to Attention-based Neural Machine Translation proposes local and global attention, while Show, Attend and Tell: Neural Image Caption Generation with Visual Attention proposes hard and soft attention.
4.3.1 Soft/Global
Global/soft attention refers to attention focusing on all positions in the input sequence. The benefit is that it is smooth and differentiable, but the downside is the large computational load.
4.3.2 Hard
Hard attention calculates the context vector from sampled input sequence hidden states, effectively randomly selecting hidden states to compute attention. This can reduce computational load, but the downside is that the computation is non-differentiable, requiring reinforcement learning or other techniques such as variational learning methods.
4.3.3 Local
The local method strikes a balance between hard and soft: first, it finds a point or position in the input sequence that needs attention; then, it selects a window size to create a local soft attention. The advantage is that the computation is differentiable and reduces computational load.
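Below is a rough TF sketch of Luong-style local attention under simplifying assumptions: the window center p_t is given (monotonic alignment) rather than predicted, and a dot-product score is used. It only illustrates the windowing idea, not the paper's exact formulation.

import tensorflow as tf

def local_attention(s, H, p_t, D=3):
    """s: (batch, d) decoder state; H: (batch, T, d) encoder states; p_t: (batch,) window centers."""
    T = tf.shape(H)[1]
    positions = tf.cast(tf.range(T)[tf.newaxis, :], tf.float32)    # (1, T)
    p = tf.cast(p_t, tf.float32)[:, tf.newaxis]                    # (batch, 1)
    scores = tf.einsum('bd,btd->bt', s, H)                         # dot-product scores over all positions
    window_mask = tf.cast(tf.abs(positions - p) <= D, tf.float32)  # keep only [p_t - D, p_t + D]
    scores += (1.0 - window_mask) * -1e9                           # ignore positions outside the window
    alpha = tf.nn.softmax(scores, axis=-1)
    alpha *= tf.exp(-tf.square(positions - p) / (2.0 * (D / 2.0) ** 2))  # Gaussian centered at p_t
    context = tf.einsum('bt,btd->bd', alpha, H)
    return context, alpha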
4.4 Number of Representations
Generally, single-representation is the most common case, meaning that an input has only one feature representation. However, in other scenarios, an input may have multiple representations, so we classify based on how the input is represented.
4.4.1 Multi-Representational
In some scenarios, a single feature representation is insufficient to capture all the information in the input. The input can then be given multiple feature representations, with attention used to weight and combine them; for instance, a text input can be represented by several different word embeddings, or by embeddings capturing word, grammatical, visual, and categorical features, and the final representation is the attention-weighted sum of these representations.
4.4.2 Multi-Dimensional
As the name suggests, this type of attention is about dimensions: the attention weights capture the relevance of the different dimensions of the input embedding. Since the dimensions of an embedding can be viewed as latent features (unlike a one-hot encoding, which is explicit), computing the relevance of each dimension identifies the feature dimensions with the most influence. This is particularly effective for resolving polysemy, and is therefore very useful for sentence-level embeddings and NLU.
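A small sketch of the multi-dimensional idea: instead of one scalar weight per token, compute a weight per embedding dimension, so each latent feature gets its own attention distribution over the tokens. Shapes and names are illustrative assumptions.

import tensorflow as tf

def multi_dimensional_attention(x, W1, W2):
    """x: (batch, T, d) token embeddings; W1: (d, u); W2: (u, d). Returns one pooled vector per example."""
    scores = tf.einsum('btu,ud->btd', tf.tanh(tf.einsum('btd,du->btu', x, W1)), W2)  # (batch, T, d)
    weights = tf.nn.softmax(scores, axis=1)      # a separate distribution over tokens for every dimension
    return tf.reduce_sum(weights * x, axis=1)    # (batch, d): each dimension pooled with its own weights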
5. Summary

6. Networks with Attention
Having introduced so many categories of attention, in which networks is attention typically used? We summarize two types of networks: encoder-decoder based networks and memory networks.
6.1 Encoder-Decoder
Encoder-decoder networks + attention are the most common networks with attention, where NMT was the first network to propose the idea of attention. The encoder and decoder can be flexibly changed and are not strictly RNN structures.
6.1.1 CNN/RNN + RNN
For tasks like image-to-text, the encoder can be replaced with CNN, while tasks like text-to-text can use RNN + RNN.
6.1.2 Pointer Networks
Not every sequence-input, sequence-output problem can be solved with the encoder-decoder model (e.g., sorting or the traveling salesman problem). For example, in the problem below we want to find the subset of points that encloses all of the points in the figure (the convex hull): the expected behavior is to take all of the points as input and output the indices of the points on the hull.

If we train a plain seq2seq model directly, as shown in the figure, we feed in the coordinates of the 4 points, obtain an encoding vector, feed that vector into the decoder to get a distribution, and then sample from it (e.g., take the argmax to decide to output token 1, ...). The result is that this does not work: if there are 50 points during training, numbered 1-50, but 100 points at test time, the decoder can still only choose among tokens 1-50, so the later points can never be selected.

Improvement: with attention, the network can dynamically decide how large the output set is. A special point (x0, y0) represents the END token; at each decoding step every input position receives an attention weight, and this attention distribution is used directly as the output distribution. The model stops decoding once the END token has the highest probability.
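A minimal sketch of one pointer-network decoding step: the attention weights over the inputs (plus the END slot) are used directly as the output distribution, so the output vocabulary grows and shrinks with the input. This is illustrative code under assumed shapes, not the original paper's implementation.

import tensorflow as tf

def pointer_step(dec_state, enc_outputs, W1, W2, v):
    """dec_state: (batch, d); enc_outputs: (batch, T+1, d), where position 0 encodes the END point (x0, y0)."""
    # additive (Bahdanau-style) scores over every input position, including END
    u = tf.tanh(tf.einsum('btd,du->btu', enc_outputs, W1) +
                tf.expand_dims(tf.einsum('bd,du->bu', dec_state, W2), 1))
    scores = tf.einsum('btu,u->bt', u, v)
    # unlike seq2seq attention, this softmax over input positions IS the output distribution:
    # argmax points at the input to emit next; decoding stops when END gets the highest probability
    return tf.nn.softmax(scores, axis=-1)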

6.1.3 Transformer
The Transformer uses an encoder + decoder network and mainly addresses the slow, sequential computation of RNNs, improving efficiency through parallel self-attention. However, it also brings large computation and memory costs, which limit the sequence length; for solutions, refer to Transformer-XL. (A future article will cover Transformers in detail.) The role of multi-head attention is somewhat similar to CNN kernels: different heads mainly capture different kinds of feature information.
6.2 Memory Networks
Applications such as question answering and chatbots require an input query and a knowledge base. End-to-end memory networks store the knowledge base in an array of memory blocks and use attention to match the query against the memories when producing answers. Memory networks consist of four components: the query (input) vector, a series of trainable mapping matrices, attention weights, and multi-hop reasoning. This allows the model to reason over facts from the knowledge base, key information from the history, and key information in the query, which is crucial in QA and dialogue applications.
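A single memory "hop" from end-to-end memory networks, sketched under assumed shapes; the two memory embeddings correspond to the trainable mapping matrices mentioned above, and multi-hop reasoning simply repeats this step with the updated query.

import tensorflow as tf

def memory_hop(query, memories_in, memories_out):
    """query: (batch, d); memories_in/out: (batch, num_facts, d), the facts embedded by two trainable maps."""
    p = tf.nn.softmax(tf.einsum('bd,bnd->bn', query, memories_in), axis=-1)  # attention over the stored facts
    o = tf.einsum('bn,bnd->bd', p, memories_out)                             # read-out from memory
    return query + o                                                          # updated query for the next hop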
7. Applications
7.1 NLG
-
MT: Machine Translation
-
QA: Question-answering problems have made use of attention to (i) better understand questions by focusing on relevant parts of the question [Hermann et al., 2015], (ii) store large amounts of information using memory networks to help find answers [Sukhbaatar et al., 2015], and (iii) improve performance in visual QA tasks by modeling multi-modality in the input using co-attention [Lu et al., 2016].
-
Multimedia Description (MD) is the task of generating a natural language text description of a multimedia input sequence, which can be speech, image, or video [Cho et al., 2015]. Similar to QA, here attention performs the function of finding relevant acoustic signals in speech input [Chorowski et al., 2015] or relevant parts of the input image [Xu et al., 2015] to predict the next word in the caption. Further, Li et al. [2017] exploit the temporal and spatial structures of videos using multi-level attention for video captioning: the lower abstraction level extracts specific regions within a frame, and the higher abstraction level selectively focuses on a small subset of frames.
7.2 Classification
-
Document Classification: HAN
-
Sentiment Analysis:
-
Similarly, in the sentiment analysis task, self-attention helps to focus on the words that are important for determining the sentiment of input. A couple of approaches for aspect-based sentiment classification by Wang et al. [2016] and Ma et al. [2018] incorporate additional knowledge of aspect-related concepts into the model and use attention to appropriately weigh the concepts apart from the content itself. Sentiment analysis applications have also seen multiple architectures being used with attention, such as memory networks [Tang et al., 2016] and Transformers [Ambartsoumian and Popowich, 2018; Song et al., 2019].
7.3 Recommendation Systems
Multiple papers use self-attention mechanisms for finding the most relevant items in the user’s history to improve item recommendations, either within a collaborative filtering framework [He et al., 2018; Shuai Yu, 2019] or within an encoder-decoder architecture for sequential recommendations [Kang and McAuley, 2018; Zhou et al., 2018].
Recently, attention has been used in novel ways which have opened new avenues for research. Some interesting directions include smoother incorporation of external knowledge bases, pre-training embeddings, and multi-task learning, unsupervised representational learning, sparsity learning, and prototypical learning, i.e., sample selection.
8. Reference
- Well written; its final model section could be supplemented with this article.
- A very good overview: An Attentive Survey of Attention Models
- wildml.com/2016/01/atte
- Graphical explanation of NMT (there is a minor error in the decoder part: the decoder's initial embedding is presumably different, and the initial attention score uses the encoder's hidden outputs as keys, while the decoder's input is actually the concatenation of the context vector and the embedding).
- NMT code
- Pointer Network
- Pointer Network slides
- Attention Is All You Need (notes not yet finished)