Comprehensive Guide to Seq2Seq Attention Model

Source: Zhihu

Link: https://zhuanlan.zhihu.com/p/40920384

Author: Yuanche.Sh

Editor: Machine Learning Algorithms and Natural Language Processing WeChat account

This article is for academic sharing only. If there is any infringement, please contact us to delete it.

Understand the seq2seq attention model in five minutes.

This article walks through the entire seq2seq + attention pipeline with diagrams, to help you understand this important model for tasks such as machine translation without pain.

seq2seq is a network with an Encoder-Decoder structure, where the input is a sequence, and the output is also a sequence. The Encoder transforms a variable-length signal sequence into a fixed-length vector representation, while the Decoder converts this fixed-length vector into a variable-length target signal sequence. –Jianshu
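Before getting to the images, here is a minimal PyTorch sketch of that plain Encoder-Decoder idea (no attention yet). The layer sizes, names, and random inputs below are illustrative assumptions, not the article's original code.

```python
import torch
import torch.nn as nn

vocab_src, vocab_tgt, emb_dim, hid_dim = 1000, 1200, 64, 128

embed_src = nn.Embedding(vocab_src, emb_dim)
embed_tgt = nn.Embedding(vocab_tgt, emb_dim)
encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
project = nn.Linear(hid_dim, vocab_tgt)

src = torch.randint(0, vocab_src, (1, 8))    # variable-length source sequence (here length 8)
tgt = torch.randint(0, vocab_tgt, (1, 5))    # target sequence (here length 5)

# Encoder: variable-length sequence -> fixed-length representation (h, c)
enc_out, (h, c) = encoder(embed_src(src))     # enc_out: (1, 8, hid_dim)

# Decoder: fixed-length representation -> variable-length output sequence
dec_out, _ = decoder(embed_tgt(tgt), (h, c))  # dec_out: (1, 5, hid_dim)
logits = project(dec_out)                     # (1, 5, vocab_tgt)
```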

Alright, let’s stop talking and start looking at the images.

Main Framework

[Figure: overall framework of the seq2seq + attention model]
Imagine a translation task where the input is a piece of English text, and the output is a piece of Chinese text.

Formulas (feel free to skip straight to the diagrams)

Input: the source sentence $x = (x_1, x_2, \dots, x_n)$, a sequence of English words.

Output: the target sentence $y = (y_1, y_2, \dots, y_m)$, a sequence of Chinese words.

(1) $h_t = \mathrm{LSTM}_{enc}(x_t, h_{t-1})$: the Encoder takes the word embedding of the current word and the hidden state from the previous time step, and outputs the hidden state at this time step.

(2) $s_t = \mathrm{LSTM}_{dec}(y_{t-1}, s_{t-1})$: the Decoder takes the word embedding of the previous target word and the hidden state from the previous time step.

(3) $c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$: the context vector is a weighted average of the hidden states output by the Encoder.

(4) $\alpha_{ti} = \dfrac{\exp(\mathrm{score}(s_t, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(s_t, h_j))}$: each Encoder hidden state corresponds to a weight.

(5) $\mathrm{score}(s_t, h_i)$: a score is computed from the Decoder hidden state and each Encoder hidden state; it is used to compute the weights in (4).

(6) $\tilde{s}_t = \tanh(W_c\,[c_t; s_t])$: the context vector and the Decoder hidden state are concatenated.

(7) $p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s\,\tilde{s}_t)$: the final output probability is computed.
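To make these seven formulas concrete, here is a hedged sketch of a single decoder time step in PyTorch, using the dot score for (5). The sizes, layer names, and random stand-ins for the hidden states are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, vocab_tgt, src_len = 128, 1200, 8

W_c = nn.Linear(2 * hid_dim, hid_dim)   # assumed projection used in (6)
W_s = nn.Linear(hid_dim, vocab_tgt)     # assumed output projection used in (7)

# (1) Encoder hidden states h_1..h_n, here a random stand-in: (src_len, hid_dim)
H = torch.randn(src_len, hid_dim)

# (2) Decoder hidden state s_t at the current step: (hid_dim,)
s_t = torch.randn(hid_dim)

# (5) score(s_t, h_i) -- the "dot" variant: (src_len,)
scores = H @ s_t

# (4) softmax over the scores gives the attention weights: (src_len,)
alpha = F.softmax(scores, dim=0)

# (3) context vector = weighted average of the encoder hidden states: (hid_dim,)
c_t = alpha @ H

# (6) concatenate context vector and decoder hidden state, then a nonlinearity
s_tilde = torch.tanh(W_c(torch.cat([c_t, s_t], dim=0)))

# (7) output probability over the target vocabulary: (vocab_tgt,)
p_y = F.softmax(W_s(s_tilde), dim=0)
```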

Detailed Diagram

[Figure: detailed diagram of the full model]
On the left is the Encoder + Input, on the right is the Decoder + Output. In the middle is the Attention.

(1) $h_t = \mathrm{LSTM}_{enc}(x_t, h_{t-1})$: the Encoder takes the word embedding of the current word and the hidden state from the previous time step, and outputs the hidden state at this time step.

[Figure: the Encoder unrolled over the input, producing h1 through h8]
Starting from the Encoder on the left, the input words are converted into word embeddings and fed into the LSTM, which outputs a hidden state at each time step: h1, h2, …, h8.
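A rough sketch of this encoder pass, assuming a single-layer LSTM and placeholder sizes:

```python
import torch
import torch.nn as nn

vocab_src, emb_dim, hid_dim = 1000, 64, 128
embed = nn.Embedding(vocab_src, emb_dim)
encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)

src = torch.randint(0, vocab_src, (1, 8))      # 8 input words, e.g. an English sentence
enc_hiddens, (h_n, c_n) = encoder(embed(src))  # enc_hiddens: (1, 8, hid_dim) -> h1..h8
```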

(2) $s_t = \mathrm{LSTM}_{dec}(y_{t-1}, s_{t-1})$: the Decoder takes the word embedding of the previous target word and the hidden state from the previous time step.

[Figure: the first Decoder time step]
Next, we move to the Decoder on the right. Its inputs are the start-of-sentence symbol, the initial context vector (which is 0), and the last hidden state from the Encoder, h8. The LSTM outputs a hidden state. (There is also a cell state, which is not discussed here.)
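A sketch of this first decoder step, assuming the LSTM input is the start symbol's embedding concatenated with the zero context vector, and that h8 (together with its cell state) initializes the decoder; the ids and sizes are placeholders.

```python
import torch
import torch.nn as nn

vocab_tgt, emb_dim, hid_dim = 1200, 64, 128
embed_tgt = nn.Embedding(vocab_tgt, emb_dim)
# Decoder LSTM input = [target word embedding ; context vector]
decoder = nn.LSTM(emb_dim + hid_dim, hid_dim, batch_first=True)

start_id = torch.tensor([[0]])        # assumed id of the start symbol
context = torch.zeros(1, 1, hid_dim)  # initial context vector is 0
h8 = torch.randn(1, 1, hid_dim)       # last encoder hidden state (stand-in)
c8 = torch.randn(1, 1, hid_dim)       # last encoder cell state (stand-in)

step_in = torch.cat([embed_tgt(start_id), context], dim=-1)  # (1, 1, emb_dim + hid_dim)
dec_out, (s_1, cell_1) = decoder(step_in, (h8, c8))          # s_1: decoder hidden state
```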

(3) $c_t = \sum_{i=1}^{n} \alpha_{ti} h_i$: the context vector is a weighted average of the hidden states output by the Encoder.

(4) $\alpha_{ti} = \dfrac{\exp(\mathrm{score}(s_t, h_i))}{\sum_{j=1}^{n} \exp(\mathrm{score}(s_t, h_j))}$: each Encoder hidden state corresponds to a weight.

(5) $\mathrm{score}(s_t, h_i)$: a score is computed from the Decoder hidden state and each Encoder hidden state; it is used to compute the weights in (4).

[Figure: the Attention module combining the Decoder hidden state with all Encoder hidden states]
The Decoder's hidden state and all of the Encoder's hidden states are fed into the Attention module to compute a context vector. How attention is calculated is introduced later.

Next Time Step

[Figure: the second Decoder time step]
At time step 2, the context vector from the previous step is concatenated with the embedding of the target word to form the input to the LSTM, which again returns a hidden state. This process then repeats.
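Continuing the decoder sketch above (same assumed names: embed_tgt, decoder, s_1, cell_1), time step 2 looks roughly like this:

```python
import torch

# Stand-in for the context vector produced by attention at step 1: (1, 1, hid_dim)
context_1 = torch.randn(1, 1, 128)
y1_id = torch.tensor([[42]])   # assumed id of the first target word

# Concatenate the previous context vector with the target word embedding,
# then run one more LSTM step from the previous hidden/cell state.
step_in = torch.cat([embed_tgt(y1_id), context_1], dim=-1)  # (1, 1, emb_dim + hid_dim)
dec_out, (s_2, cell_2) = decoder(step_in, (s_1, cell_1))
# ...and so on for every following time step.
```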

(6) $\tilde{s}_t = \tanh(W_c\,[c_t; s_t])$: the context vector and the Decoder hidden state are concatenated.

(7) $p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s\,\tilde{s}_t)$: the final output probability is computed.

[Figure: computing the output probabilities from the context vector and the Decoder hidden state]
Meanwhile, the context vector and the Decoder's hidden state are combined, passed through a nonlinear transformation and a softmax, and the output probabilities are finally computed.
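A minimal sketch of this output step, assuming the Luong-style combination from formulas (6) and (7); the layer names and sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, vocab_tgt = 128, 1200
W_c = nn.Linear(2 * hid_dim, hid_dim)   # assumed projection for the concatenation in (6)
W_s = nn.Linear(hid_dim, vocab_tgt)     # assumed output projection in (7)

c_t = torch.randn(1, hid_dim)           # context vector from the attention module
s_t = torch.randn(1, hid_dim)           # decoder hidden state at this step

s_tilde = torch.tanh(W_c(torch.cat([c_t, s_t], dim=-1)))  # nonlinear combination, (1, hid_dim)
probs = F.softmax(W_s(s_tilde), dim=-1)                   # probabilities over the target vocabulary
```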

Luong mentioned three methods for calculating the score. Here are the first two illustrated:

$\mathrm{score}(s_t, h_i) = \begin{cases} s_t^{\top} h_i & \text{dot} \\ s_t^{\top} W_a h_i & \text{general} \\ v_a^{\top} \tanh(W_a\,[s_t; h_i]) & \text{concat} \end{cases}$

Attention score function: dot

[Figure: computing dot attention step by step]

The inputs are all of the Encoder's hidden states, H, of size (hid dim, sequence length), and the Decoder's hidden state at one time step, s, of size (hid dim, 1).

Step One: Transpose H to (sequence length, hid dim) and take the dot product with s to obtain a score of size (sequence length, 1).

Step Two: Apply softmax to the score to get weights that sum to 1.

Step Three: Multiply H by the weights obtained in Step Two to get a context vector of size (hid dim, 1).
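A small NumPy sketch of these three steps (sizes are placeholders):

```python
import numpy as np

hid_dim, seq_len = 128, 8
H = np.random.randn(hid_dim, seq_len)   # all encoder hidden states, (hid dim, sequence length)
s = np.random.randn(hid_dim, 1)         # one decoder hidden state, (hid dim, 1)

# Step One: transpose H and dot with s -> score of size (sequence length, 1)
score = H.T @ s

# Step Two: softmax over the scores -> weights that sum to 1
weights = np.exp(score - score.max())
weights = weights / weights.sum()

# Step Three: weight the encoder hidden states -> context vector of size (hid dim, 1)
context = H @ weights
```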

Attention score function: general

[Figure: computing general attention step by step]

The inputs are all of the Encoder's hidden states, H, of size (hid dim1, sequence length), and the Decoder's hidden state at one time step, s, of size (hid dim2, 1). Here the two hidden states have different dimensions.

Step One: Transpose H to (sequence length, hid dim1), multiply it by Wa of size (hid dim1, hid dim2), and then multiply by s to obtain a score of size (sequence length, 1).

Step Two: Apply softmax to the score to get weights that sum to 1.

Step Three: Multiply H by the weights obtained in Step Two to get a context vector of size (hid dim1, 1).
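The same sketch for the general score, again with placeholder sizes:

```python
import numpy as np

hid_dim1, hid_dim2, seq_len = 128, 96, 8
H = np.random.randn(hid_dim1, seq_len)    # encoder hidden states, (hid dim1, sequence length)
s = np.random.randn(hid_dim2, 1)          # decoder hidden state, (hid dim2, 1)
Wa = np.random.randn(hid_dim1, hid_dim2)  # learned matrix bridging the two dimensions

# Step One: H^T @ Wa @ s -> score of size (sequence length, 1)
score = H.T @ Wa @ s

# Step Two: softmax over the scores -> weights that sum to 1
weights = np.exp(score - score.max())
weights = weights / weights.sum()

# Step Three: weight the encoder hidden states -> context vector of size (hid dim1, 1)
context = H @ weights
```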

Conclusion

The best way to understand a model is to mentally walk through each step from input to model to output, observing how tensors flow. I hope this helps everyone ~ please give it a thumbs up!

Postscript

It has been almost a year since I wrote this illustration, and I want to respond to some detail questions in the comments here:

This article mainly introduces the idea behind the Attention mechanism, which is essentially a weighted average of hidden states. As for exactly how c and s are used and computed, there are many variants. If the basic concept is unclear, you can refer back to this article; for questions about the details, the right approach is to read open-source implementations. In the end, it comes down to hands-on practice.
