Produced by the Machine Learning Algorithms and Natural Language Processing official account
Original column author: Don.hub
Position: Algorithm Engineer at JD.com
School: Imperial College London
1. Outline
- Intuition
- Analysis
  - Pros
  - Cons
  - From Seq2Seq To Attention Model
    - Seq2Seq is important, but its flaws are obvious
    - Attention was born
    - Write the encoder and decoder model
- Taxonomy of Attention
  - Number of Sequence
    - Distinctive
    - Co-Attention
    - Self
  - Number of Abstraction
    - Single-Level
    - Multi-Level
  - Number of Positions
    - Soft/Global
    - Hard
    - Local
  - Number of Representations
    - Multi-Representational
    - Multi-Dimensional
- Summary
- Networks with Attention
  - Encoder-Decoder
    - CNN/RNN + RNN
    - Pointer Networks
    - Transformer
  - Memory Networks
- Applications
  - NLG
  - Classification
  - Recommendation Systems
- Reference
2. Intuition
The term "attention" is quite descriptive: when we look at an image, we are naturally drawn to its more important or prominent elements, so we focus on those local regions. In computer vision (CV), this means that some local parts of an image carry more weight; for example, when generating a caption for an image, each word of the caption mainly attends to a local region of the image.

In NLP, think of reading comprehension: we look for answers while reading an article, so different parts of the article receive different amounts of attention. In sentiment analysis of reviews, for instance, certain emotion-bearing words such as "amazing" deserve special attention, because such critical emotional words usually determine the reviewer's sentiment, as shown in the figure from HAN (Yang et al., from Professor He's team).

In simple terms, attention is a vector of weights.
3. Analysis
3.1 Pros
The main benefits of attention are good interpretability and a significant improvement in model performance. It has become an essential module in many SOTA models, especially since the emergence of the Transformer (which uses self/global/multi-level/multi-head attention), which has greatly changed the landscape of NLP.
3.2 Cons
Attention itself cannot capture positional information, so positional information has to be added separately. Of course, different attention mechanisms have their own characteristics. As for the drawbacks of the Transformer in particular, the biggest issue is its large memory consumption: the attention scores form an N*N matrix, so the sequence length N cannot be too long, which in turn breaks the dependency between segments of long sequences (see Transformer-XL and XLNet for discussions and solutions).
3.3 From Seq2Seq To Attention Model
Why was attention created? Attention was originally devised for translation tasks (though it is ultimately not limited to translation), so let's look at how it evolved.
3.3.1 Seq2Seq is Important, but Its Flaws are Obvious
The Seq2Seq model consists of an encoder and a decoder and is mainly used to translate input text into target text. Both the encoder and the decoder are RNNs (vanilla RNN, LSTM, GRU, or bidirectional RNN). The model encodes the source text into a fixed-length context vector and then decodes the target output from this encoding. This framework can be applied to translation, speech-to-text, dialogue generation, and other sequence-to-sequence tasks.

However, the drawbacks of this model are also evident:
- All inputs are encoded into a single fixed-length context vector. What is the appropriate length? It is hard to give a precise answer; a fixed-length vector cannot encode all of the contextual information, so many long-distance dependencies are lost.
- When generating output, the decoder has no mechanism for matching against the encoder's inputs, i.e. for placing different weights on different inputs.
- The model is unable to capture alignment between the input and output sequences, which is an essential aspect of structured output tasks such as translation or summarization [Young et al., 2018]. Intuitively, in sequence-to-sequence tasks each output token is expected to be more influenced by some specific parts of the input sequence, yet the decoder lacks any mechanism to selectively focus on the relevant input tokens while generating each output token.
3.3.2 Attention Was Born
NMT [paper] [code] was the first work to propose adding an attention block between the encoder and decoder, primarily to solve the matching problem between the encoder and the decoder.

Concretely, the attention weights and the context vector are computed as a_ij = softmax_j(score(s_(i-1), h_j)) and c_i = sum_j a_ij * h_j, where
- s_0 is the decoder's initial hidden state, which is randomly initialized (whereas Seq2Seq uses the context vector as the decoder's initial hidden state), and s_i denotes the decoder hidden states;
- h_j is the output hidden state at the j-th encoder position;
- a_ij is the weight that the i-th decoder position assigns to the j-th encoder position;
- y_i is the output at the i-th decoder position, obtained by passing the hidden state through a fully connected layer;
- c_i is the context vector at the i-th decoder position, i.e. the weighted sum of the encoder output hidden states;
- the input to the decoder at step i is the concatenation of c_i and the embedded decoder input.
3.3.3 Write the Encoder and Decoder Model
For a detailed implementation, refer to TensorFlow's tf1.x Neural Machine Translation (seq2seq) tutorial; the code here uses the newer 2.x API.
The hidden states produced by the encoder have shape (batch_size, max_length, hidden_size), and the decoder's hidden state has shape (batch_size, hidden_size).
Below are the implemented equations:

This tutorial uses Bahdanau attention for the encoder. Let's decide on notation before writing the simplified form:
- FC = Fully connected (dense) layer
- EO = Encoder output
- H = Hidden state
- X = Input to the decoder

And the pseudo-code:
- score = FC(tanh(FC(EO) + FC(H)))
- attention_weights = softmax(score, axis=1). Softmax is applied on the last axis by default, but here we want to apply it on the 1st axis, since the shape of score is (batch_size, max_length, 1) and max_length is the length of our input. Since we are trying to assign a weight to each input position, softmax should be applied on that axis.
- context_vector = sum(attention_weights * EO, axis=1). Same reason as above for choosing axis 1.
- embedding_output = the input to the decoder X is passed through an embedding layer.
- merged_vector = concat(embedding_output, context_vector)
- This merged vector is then given to the GRU.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # hidden shape == (batch_size, hidden size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden size)
        # we are doing this to perform addition to calculate the score
        hidden_with_time_axis = tf.expand_dims(query, 1)
        # score shape == (batch_size, max_length, 1)
        # we get 1 at the last axis because we are applying score to self.V
        # the shape of the tensor before applying self.V is (batch_size, max_length, units)
        score = self.V(tf.nn.tanh(
            self.W1(values) + self.W2(hidden_with_time_axis)))
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights


class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))


class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)

    def call(self, x, hidden, enc_output):
        # enc_output shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(hidden, enc_output)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))
        # output shape == (batch_size, vocab)
        x = self.fc(output)
        return x, state, attention_weights
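As a quick sanity check, the three classes above can be wired together on random data to verify that the shapes match the comments. The batch size, vocabulary size, and dimensions below are arbitrary illustrative values (not from the tutorial), so treat this as a minimal smoke test rather than part of the original code.

# Hypothetical smoke test with arbitrary sizes; only the shapes matter here.
# Assumes the classes and the tensorflow import above.
batch_sz, max_length, vocab_size, embedding_dim, units = 16, 20, 1000, 256, 512

encoder = Encoder(vocab_size, embedding_dim, units, batch_sz)
decoder = Decoder(vocab_size, embedding_dim, units, batch_sz)

src = tf.random.uniform((batch_sz, max_length), maxval=vocab_size, dtype=tf.int32)
enc_hidden = encoder.initialize_hidden_state()              # (batch_sz, units)
enc_output, enc_hidden = encoder(src, enc_hidden)           # (batch_sz, max_length, units), (batch_sz, units)

dec_input = tf.random.uniform((batch_sz, 1), maxval=vocab_size, dtype=tf.int32)  # e.g. <start> tokens
logits, dec_hidden, attn = decoder(dec_input, enc_hidden, enc_output)
print(logits.shape)  # (batch_sz, vocab_size)
print(attn.shape)    # (batch_sz, max_length, 1)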
4. Taxonomy of Attention
According to different classification criteria, attention can be divided into several categories, but all of them involve the interaction between q (query), k (key), and v (value): a score is computed from q and k (the scoring functions vary, as shown below), the scores are normalized with softmax, and the normalized scores are then used to weight and sum v (or to take an argmax, as in pointer networks).
Below is a summary of several popular attention mechanisms and the corresponding alignment score functions, where s_t is the query (e.g. decoder) state and h_i a key (e.g. encoder) state:
- Content-based attention [Graves et al., 2014]: score(s_t, h_i) = cosine(s_t, h_i)
- Additive (*) [Bahdanau et al., 2015]: score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
- Location-based [Luong et al., 2015]: a_(t,i) = softmax(W_a s_t)
- General [Luong et al., 2015]: score(s_t, h_i) = s_t^T W_a h_i
- Dot-product [Luong et al., 2015]: score(s_t, h_i) = s_t^T h_i
- Scaled dot-product (^) [Vaswani et al., 2017]: score(s_t, h_i) = s_t^T h_i / sqrt(n)

(*) Referred to as "concat" in Luong et al., 2015 and as "additive attention" in Vaswani et al., 2017. (^) It adds a scaling factor 1/sqrt(n), motivated by the concern that when the input is large, the softmax function may have an extremely small gradient, making efficient learning hard.
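To make the score functions concrete, here is a minimal TensorFlow sketch of four of them (dot-product, general, additive, and scaled dot-product). The tensor names and shapes are illustrative assumptions, not code from any of the cited papers.

import tensorflow as tf

# s: query/decoder state of shape (batch, d); H: key/encoder states of shape (batch, T, d).
def dot_score(s, H):
    return tf.einsum('bd,btd->bt', s, H)                 # s_t^T h_i

def general_score(s, H, Wa):                             # Wa: (d, d)
    return tf.einsum('be,bte->bt', tf.einsum('bd,de->be', s, Wa), H)  # s_t^T W_a h_i

def additive_score(s, H, W1, W2, v):                     # W1: (d, u), W2: (d, u), v: (u,)
    # v^T tanh(W1 h_i + W2 s_t), with s_t broadcast over the time axis
    hidden = tf.tanh(tf.einsum('btd,du->btu', H, W1) +
                     tf.expand_dims(tf.einsum('bd,du->bu', s, W2), 1))
    return tf.einsum('btu,u->bt', hidden, v)

def scaled_dot_score(s, H):
    n = tf.cast(tf.shape(H)[-1], tf.float32)
    return dot_score(s, H) / tf.sqrt(n)                  # s_t^T h_i / sqrt(n)

def attend(scores, H):
    alpha = tf.nn.softmax(scores, axis=-1)               # normalize over the T key positions
    context = tf.einsum('bt,btd->bd', alpha, H)          # weighted sum of the values
    return context, alpha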
The following classifications are not mutually exclusive; for example, the HAN model is a multi-level, soft attention model (AM).
4.1 Number of Sequence
Classified based on the sequences from which our query and value come.
4.1.1 Distinctive
In distinctive attention, the query and the value come from two different sequences, an input sequence and an output sequence; in NMT, for example, the query comes from the hidden state of the decoder and the value comes from the hidden states of the encoder.
4.1.2 Co-Attention
Co-attention models jointly learn attention weights over multiple input sequences and capture the interactions between these inputs. For example, in visual question answering, the authors argue that attention over the image is important but attention over the question text is equally important, so they learn both jointly, allowing the model to capture the key parts of the question and the corresponding image regions at the same time.
4.1.3 Self
In scenarios such as text classification or recommendation, the input is a sequence but the output is not a sequence. In this case, each word in the text attends to the other words of the same sequence to measure how relevant they are to it. As shown in the figure.

We can look at the docstring of the self-attention implementation in BERT: if from_tensor == to_tensor, then it is self-attention.
def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
    """Performs multi-headed attention from `from_tensor` to `to_tensor`.

    This is an implementation of multi-headed attention based on "Attention
    is all you Need". If `from_tensor` and `to_tensor` are the same, then
    this is self-attention. Each timestep in `from_tensor` attends to the
    corresponding sequence in `to_tensor`, and returns a fixed-width vector.
    """
4.2 Number of Abstraction
This is classified based on the hierarchy of attention weight calculations.
4.2.1 Single-Level
In the most common case, attention is calculated on the input sequence, which is the ordinary single-level attention.
4.2.2 Multi-Level
However, many models, such as HAN, have a hierarchical structure. HAN addresses document classification: a document is composed of sentences and a sentence is composed of words, so the model builds a two-level encoder (bidirectional GRUs). The lower encoder encodes words and the upper encoder encodes sentences, with a word-level attention layer in between that aggregates the word encodings into sentence representations. Finally, for the document-classification output, a sentence-level attention is applied as well, followed by a dense layer that classifies the document. Note that the two queries, u_w (word level) and u_s (sentence level), are both randomly initialized and trained along with the model, and the scoring function is also a dense (MLP) layer; unlike NMT, however, this is self-attention.
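A minimal sketch of HAN-style word-level attention, where the query u_w is a trainable vector rather than a decoder state. The class and variable names are mine, not the authors' code.

import tensorflow as tf

class WordAttention(tf.keras.layers.Layer):
    """HAN-style attention: u_it = tanh(W h_it + b), alpha_it = softmax(u_it . u_w)."""
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units, activation='tanh')
        # u_w: randomly initialized query, trained together with the model
        self.u_w = self.add_weight(name='u_w', shape=(units,), initializer='glorot_uniform')

    def call(self, hidden_states):               # hidden_states: (batch, num_words, hidden)
        u = self.W(hidden_states)                # (batch, num_words, units)
        score = tf.einsum('bwu,u->bw', u, self.u_w)
        alpha = tf.nn.softmax(score, axis=-1)    # word importance weights
        sentence_vec = tf.einsum('bw,bwh->bh', alpha, hidden_states)
        return sentence_vec, alpha

The same layer, applied to sentence vectors with a second trainable query u_s, gives the sentence-level attention.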

4.3 Number of Positions
Based on the positions that the attention layer attends to, we can classify attention into three types: global/soft (these two are almost the same), local, and hard attention. Effective Approaches to Attention-based Neural Machine Translation proposes local and global attention, while Show, Attend and Tell: Neural Image Caption Generation with Visual Attention proposes hard and soft attention.
4.3.1 Soft/Global
Global/soft attention refers to attention focusing on all positions in the input sequence. The benefit is that it is smooth and differentiable, but the downside is the large computational load.
4.3.2 Hard
Hard attention calculates the context vector from sampled input sequence hidden states, effectively randomly selecting hidden states to compute attention. This can reduce computational load, but the downside is that the computation is non-differentiable, requiring reinforcement learning or other techniques such as variational learning methods.
4.3.3 Local
The local method strikes a balance between hard and soft: first, it finds a point or position in the input sequence that needs attention; then, it selects a window size to create a local soft attention. The advantage is that the computation is differentiable and reduces computational load.
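Below is a rough TF sketch of Luong-style local attention under simplifying assumptions: the window center p_t is given (monotonic alignment) rather than predicted, and a dot-product score is used. It only illustrates the windowing idea, not the paper's exact formulation.

import tensorflow as tf

def local_attention(s, H, p_t, D=3):
    """s: (batch, d) decoder state; H: (batch, T, d) encoder states; p_t: (batch,) window centers."""
    T = tf.shape(H)[1]
    positions = tf.cast(tf.range(T)[tf.newaxis, :], tf.float32)    # (1, T)
    p = tf.cast(p_t, tf.float32)[:, tf.newaxis]                    # (batch, 1)
    scores = tf.einsum('bd,btd->bt', s, H)                         # dot-product scores over all positions
    window_mask = tf.cast(tf.abs(positions - p) <= D, tf.float32)  # keep only [p_t - D, p_t + D]
    scores += (1.0 - window_mask) * -1e9                           # ignore positions outside the window
    alpha = tf.nn.softmax(scores, axis=-1)
    alpha *= tf.exp(-tf.square(positions - p) / (2.0 * (D / 2.0) ** 2))  # Gaussian centered at p_t
    context = tf.einsum('bt,btd->bd', alpha, H)
    return context, alpha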
4.4 Number of Representations
Generally, single-representation is the most common case, meaning that an input has only one feature representation. However, in other scenarios, an input may have multiple representations, so we classify based on how the input is represented.
4.4.1 Multi-Representational
In some scenarios, a single feature representation is insufficient to capture all the information in the input. The input can then be given multiple feature representations, with attention used to weight and combine them; for instance, a text input can be represented by several different word embeddings, or by embeddings capturing word, grammatical, visual, and categorical features, and the final representation is the attention-weighted sum of these representations.
4.4.2 Multi-Dimensional
As the name suggests, this type of attention is about dimensions: the attention weights capture the relevance of the different dimensions of the input embedding. Since the dimensions of an embedding can be viewed as latent features (unlike a one-hot encoding, which is explicit), computing the relevance of each dimension identifies the feature dimensions with the most influence. This is particularly effective for resolving polysemy, and is therefore very useful for sentence-level embeddings and NLU.
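A small sketch of the multi-dimensional idea: instead of one scalar weight per token, compute a weight per embedding dimension, so each latent feature gets its own attention distribution over the tokens. Shapes and names are illustrative assumptions.

import tensorflow as tf

def multi_dimensional_attention(x, W1, W2):
    """x: (batch, T, d) token embeddings; W1: (d, u); W2: (u, d). Returns one pooled vector per example."""
    scores = tf.einsum('btu,ud->btd', tf.tanh(tf.einsum('btd,du->btu', x, W1)), W2)  # (batch, T, d)
    weights = tf.nn.softmax(scores, axis=1)      # a separate distribution over tokens for every dimension
    return tf.reduce_sum(weights * x, axis=1)    # (batch, d): each dimension pooled with its own weights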
5. Summary

6. Networks with Attention
Having introduced so many categories of attention, in which networks is attention typically used? We summarize two types of networks: encoder-decoder based networks and memory networks.
6.1 Encoder-Decoder
Encoder-decoder networks + attention are the most common networks with attention, where NMT was the first network to propose the idea of attention. The encoder and decoder can be flexibly changed and are not strictly RNN structures.
6.1.1 CNN/RNN + RNN
For tasks like image-to-text, the encoder can be replaced with CNN, while tasks like text-to-text can use RNN + RNN.
6.1.2 Pointer Networks
Not every sequence-input, sequence-output problem can be solved with the encoder-decoder model (e.g., sorting or the traveling salesman problem). For example, in the problem below we want to find the subset of points that encloses all of the points in the figure (the convex hull): the expected behavior is to take all of the points as input and output the indices of the points on the hull.

If we train a plain seq2seq model directly, as shown in the figure, we feed in the coordinates of the 4 points, obtain an encoding vector, feed that vector into the decoder to get a distribution, and then sample from it (e.g., take the argmax to decide to output token 1, ...). The result is that this does not work: if there are 50 points during training, numbered 1-50, but 100 points at test time, the decoder can still only choose among tokens 1-50, so the later points can never be selected.

Improvement: with attention, the network can dynamically decide how large the output set is. A special point (x0, y0) represents the END token; at each decoding step every input position receives an attention weight, and this attention distribution is used directly as the output distribution. The model stops decoding once the END token has the highest probability.
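A minimal sketch of one pointer-network decoding step: the attention weights over the inputs (plus the END slot) are used directly as the output distribution, so the output vocabulary grows and shrinks with the input. This is illustrative code under assumed shapes, not the original paper's implementation.

import tensorflow as tf

def pointer_step(dec_state, enc_outputs, W1, W2, v):
    """dec_state: (batch, d); enc_outputs: (batch, T+1, d), where position 0 encodes the END point (x0, y0)."""
    # additive (Bahdanau-style) scores over every input position, including END
    u = tf.tanh(tf.einsum('btd,du->btu', enc_outputs, W1) +
                tf.expand_dims(tf.einsum('bd,du->bu', dec_state, W2), 1))
    scores = tf.einsum('btu,u->bt', u, v)
    # unlike seq2seq attention, this softmax over input positions IS the output distribution:
    # argmax points at the input to emit next; decoding stops when END gets the highest probability
    return tf.nn.softmax(scores, axis=-1)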

6.1.3 Transformer
The Transformer uses an encoder + decoder network and mainly addresses the slow, sequential computation of RNNs, improving efficiency through parallel self-attention. However, it also brings large computation and memory costs, which limit the sequence length; for solutions, refer to Transformer-XL. (A future article will cover Transformers in detail.) The role of multi-head attention is somewhat similar to CNN kernels: different heads mainly capture different kinds of feature information.
6.2 Memory Networks
Applications such as question answering and chatbots require an input query and a knowledge base. End-to-end memory networks store the knowledge base in an array of memory blocks and use attention to match the query against the memories when producing answers. Memory networks consist of four components: the query (input) vector, a series of trainable mapping matrices, attention weights, and multi-hop reasoning. This allows the model to reason over facts from the knowledge base, key information from the history, and key information in the query, which is crucial in QA and dialogue applications.
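A single memory "hop" from end-to-end memory networks, sketched under assumed shapes; the two memory embeddings correspond to the trainable mapping matrices mentioned above, and multi-hop reasoning simply repeats this step with the updated query.

import tensorflow as tf

def memory_hop(query, memories_in, memories_out):
    """query: (batch, d); memories_in/out: (batch, num_facts, d), the facts embedded by two trainable maps."""
    p = tf.nn.softmax(tf.einsum('bd,bnd->bn', query, memories_in), axis=-1)  # attention over the stored facts
    o = tf.einsum('bn,bnd->bd', p, memories_out)                             # read-out from memory
    return query + o                                                          # updated query for the next hop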
7. Applications
7.1 NLG
-
MT: Machine Translation
-
QA: Question-answering problems have made use of attention to (i) better understand questions by focusing on relevant parts of the question [Hermann et al., 2015], (ii) store large amounts of information using memory networks to help find answers [Sukhbaatar et al., 2015], and (iii) improve performance in visual QA tasks by modeling multi-modality in the input using co-attention [Lu et al., 2016].
-
Multimedia Description (MD) is the task of generating a natural language text description of a multimedia input sequence, which can be speech, image, or video [Cho et al., 2015]. Similar to QA, here attention performs the function of finding relevant acoustic signals in speech input [Chorowski et al., 2015] or relevant parts of the input image [Xu et al., 2015] to predict the next word in the caption. Further, Li et al. [2017] exploit the temporal and spatial structures of videos using multi-level attention for video captioning: the lower abstraction level extracts specific regions within a frame, and the higher abstraction level selectively focuses on a small subset of frames.
7.2 Classification
-
Document Classification: HAN
-
Sentiment Analysis:
-
Similarly, in the sentiment analysis task, self-attention helps to focus on the words that are important for determining the sentiment of input. A couple of approaches for aspect-based sentiment classification by Wang et al. [2016] and Ma et al. [2018] incorporate additional knowledge of aspect-related concepts into the model and use attention to appropriately weigh the concepts apart from the content itself. Sentiment analysis applications have also seen multiple architectures being used with attention, such as memory networks [Tang et al., 2016] and Transformers [Ambartsoumian and Popowich, 2018; Song et al., 2019].
7.3 Recommendation Systems
Multiple papers use self-attention mechanisms for finding the most relevant items in the user’s history to improve item recommendations, either within a collaborative filtering framework [He et al., 2018; Shuai Yu, 2019] or within an encoder-decoder architecture for sequential recommendations [Kang and McAuley, 2018; Zhou et al., 2018].
Recently, attention has been used in novel ways which have opened new avenues for research. Some interesting directions include smoother incorporation of external knowledge bases, pre-training embeddings, and multi-task learning, unsupervised representational learning, sparsity learning, and prototypical learning, i.e., sample selection.
8. Reference
- Well written; its final model section could be supplemented with this article.
- A very good overview: An Attentive Survey of Attention Models
- wildml.com/2016/01/atte
- Graphical explanation of NMT (there is a minor error in the decoder part: the decoder's initial embedding is presumably different, and the initial attention score uses the encoder's hidden outputs as keys, while the decoder's input is actually the concatenation of the context vector and the embedding).
- NMT code
- Pointer Network
- Pointer Network slides
- Attention Is All You Need (notes not yet finished)