Attention Mechanism in Machine Translation


In the previous article, we learned about the basic seq2seq model, which processes the input sequence through an encoder, passes the calculated hidden state to a decoder, and then decodes it to obtain the output sequence. The block diagram is shown again below:

[Figure: block diagram of the basic seq2seq encoder-decoder model]

The basic seq2seq model is quite effective for short and medium-length sentences in machine translation. However, for long sentences, the fixed-length encoding used between the encoder and decoder may become an information bottleneck, affecting translation quality. This is similar to how a person might only remember the general idea after hearing a long passage, missing out on some details.

This article further explores the Attention mechanism, or the “attention” model, which was first introduced in 【1】 and further refined in 【2】.

The core idea of the Attention mechanism is to establish direct connections (shortcut connections) between the output sequence and the encoder’s historical states, so that during translation the “attention” is focused on the inputs most relevant to the current output. This is intuitive: when translating longer and more complex sentences (for example, those containing relative or adverbial clauses), a person typically concentrates on the key parts of the sentence. A particular output word is often closely related to only a few local words in the input sentence and has little relation to the rest, meaning the “attention” is concentrated on those few local words. For example, in English-French translation, the alignment matrix between the input and output can be visualized as in the figure below:

[Figure: English-French alignment matrix between source (input) words and target (output) words]

In the figure, the brighter the pixel block, the stronger the contextual relevance. Conversely, areas with low brightness indicate weak contextual relevance, and this part of the input can be ignored during translation.
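Such a figure is simply an image of the attention-weight matrix. As a minimal illustration (using a small made-up weight matrix rather than data from an actual model), it can be rendered like this:

import numpy as np
import matplotlib.pyplot as plt

# Made-up attention weights for a 4-word source and 4-word target sentence;
# rows correspond to target (French) words, columns to source (English) words.
source = ["I", "love", "you", "."]
target = ["Je", "t'", "aime", "."]
alignment = np.array([[0.90, 0.05, 0.03, 0.02],
                      [0.10, 0.10, 0.70, 0.10],
                      [0.05, 0.80, 0.10, 0.05],
                      [0.02, 0.03, 0.05, 0.90]])

plt.imshow(alignment, cmap="gray")  # brighter pixel = stronger relevance
plt.xticks(range(len(source)), source)
plt.yticks(range(len(target)), target)
plt.xlabel("input (English)")
plt.ylabel("output (French)")
plt.show()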

The Attention mechanism allows the decoder to select hidden states from the encoder at any time step, rather than only decoding based on the last hidden state of the encoder. The structure of the seq2seq model with the introduction of the Attention mechanism is shown in the figure below:

[Figure: seq2seq model with the Attention mechanism, following 【2】]

The above figure describes the Attention-based NMT system from 【2】. Observant readers will notice that the embedding layer is omitted in the figure for the sake of clarity in presenting the Attention details.
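For readers curious how this wiring looks in code, TensorFlow 1.3 ships a ready-made implementation in tf.contrib.seq2seq. The sketch below builds a Luong-attention decoder cell; the sizes and placeholder names are illustrative and are not taken from the model distributed with this article:

import tensorflow as tf  # written against the TensorFlow 1.3 API

# Illustrative dimensions.
num_units, batch_size, src_len = 128, 32, 20

# All encoder hidden states (one per source position) and the true source lengths.
encoder_outputs = tf.placeholder(tf.float32, [batch_size, src_len, num_units])
source_lengths = tf.placeholder(tf.int32, [batch_size])

# The comparison of the decoder state with every source state and the weighted
# average (context vector) are handled internally by the attention mechanism;
# "memory" holds the encoder's source states.
attention_mechanism = tf.contrib.seq2seq.LuongAttention(
    num_units, memory=encoder_outputs, memory_sequence_length=source_lengths)

# AttentionWrapper combines the context vector with the decoder cell's hidden
# state into the attention vector and carries it to the next time step.
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    tf.contrib.rnn.BasicLSTMCell(num_units),
    attention_mechanism,
    attention_layer_size=num_units)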

As shown in the figure, attention calculation occurs at each decoder time step, including the following stages:

  1. The current target hidden state is compared with all source states to generate attention weights (i.e., the previously mentioned input-output alignment matrix);

  2. Based on the attention weights, the context vector (weighted average of source states) is calculated;

  3. The context vector and the current target hidden state are combined to produce the final attention vector;

  4. The attention vector is fed into the next time step as input.

The first three steps can be expressed in formulas:

$$\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'=1}^{S} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \qquad \text{[attention weights]} \qquad (1)$$

$$c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s \qquad \text{[context vector]} \qquad (2)$$

$$a_t = f(c_t, h_t) = \tanh\big(W_c [c_t; h_t]\big) \qquad \text{[attention vector]} \qquad (3)$$

In the formulas above, the score() function compares the decoder’s current hidden state $h_t$ with each of the encoder’s hidden states $\bar{h}_s$, and the results are normalized (softmax) to produce the attention weights. The score() function can be implemented in various ways; a popular choice is shown in formula (4):

$$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W \bar{h}_s \qquad (4)$$

The attention vector $a_t$ obtained from formula (3) is then used to compute the softmax logits and the loss, following the same steps as in the previous article “Implementing Machine Translation with TensorFlow”.
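To make formulas (1)-(4) concrete, here is a minimal NumPy sketch of a single decoder time step; the function and variable names are illustrative and this is not the code distributed with this article:

import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(h_t, h_bar, W, W_c):
    """One decoder time step of Luong-style attention.

    h_t   : (d,)     current target (decoder) hidden state
    h_bar : (S, d)   all source (encoder) hidden states
    W     : (d, d)   weight matrix of the score function, formula (4)
    W_c   : (d, 2d)  weight matrix of the attention vector, formula (3)
    """
    scores = h_bar @ W.T @ h_t                        # formula (4): h_t^T W h_bar_s for every s
    alpha = softmax(scores)                           # formula (1): attention weights
    c_t = alpha @ h_bar                               # formula (2): context vector (weighted average)
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # formula (3): attention vector
    return a_t, alpha

# Toy usage with random states (dimensions are illustrative).
d, S = 4, 6
rng = np.random.RandomState(0)
a_t, alpha = attention_step(rng.randn(d), rng.randn(S, d),
                            rng.randn(d, d), rng.randn(d, 2 * d))
print(alpha.round(3), alpha.sum())  # the weights are non-negative and sum to 1

In the actual model the same computation is carried out batch-wise inside the decoder loop at every output time step.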

Hands-On Practice

Follow the public account and reply with “20171022” or “attention” to automatically obtain the source code and pre-trained model for this article.

Before running, ensure you have TensorFlow 1.3.0 installed.

$ cd 20171022/
$ python translate.py
……
Reading model parameters from ././translate.ckpt-294000
> 

At this point, we can input a sentence in English, press enter, and receive the corresponding French translation.

> I am not familiar with you.
Je ne sais pas avec vous .
> I don't need your help, thanks!
Je ne peux pas vous aider à aider , mais à vous ! ! !
> My dad told me to be a man.
Mon père m’a dit être un homme .
> What can I do for you?
Que puis-je faire pour vous ?
> I'm very glad to meet you!
Je suis très heureux de vous rencontrer .
> Chinese people always spit on roads.
Les gens chinois sont toujours _UNK sur les routes .

References

【1】 Neural Machine Translation by Jointly Learning to Align and Translate, https://arxiv.org/abs/1409.0473

【2】 Effective Approaches to Attention-based Neural Machine Translation, https://arxiv.org/abs/1508.04025

【3】 Sequence to Sequence Learning with Neural Networks, https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
