Applications of Attention Mechanism in Natural Language Processing

In recent years, deep learning research has gone deeper and deeper and produced breakthroughs in many fields. Neural networks built on the attention mechanism have become one of the hottest topics in recent research. I have recently been reading papers on attention-based neural networks in Natural Language Processing (NLP), so I would like to summarize how attention is applied in NLP and share it with everyone.

1. Research Progress on Attention

The attention mechanism was first proposed for visual images, probably as early as the 1990s, but it only really took off with the Google DeepMind paper “Recurrent Models of Visual Attention” [14], which used attention inside an RNN model for image classification. Subsequently, Bahdanau et al. introduced a similar attention mechanism in “Neural Machine Translation by Jointly Learning to Align and Translate” [1], performing translation and alignment simultaneously in machine translation; their work is considered the first application of attention in NLP. After that, attention-based RNN models spread to all kinds of NLP tasks, and more recently how to use attention in CNNs has also become a research hotspot. The following figure illustrates the general trend of research progress on attention.

[Figure: general trend of research progress on the attention mechanism]

2. Recurrent Models of Visual Attention

Before introducing attention in NLP, I would like to briefly discuss how attention is used on images. The representative paper “Recurrent Models of Visual Attention” [14] is likewise motivated by the human attention mechanism: when looking at an image, a person does not take in every pixel at once, but focuses on specific parts of the image according to need, and learns from what was observed earlier where to focus when looking at images later. The following figure shows a schematic diagram of the paper’s core model.

[Figure: schematic of the core model in Recurrent Models of Visual Attention]

This model integrates the attention mechanism into a conventional RNN (the part circled in red) and uses attention to learn which part of an image to look at. At each step, given the current input image, the model only processes the pixels around the location learned from the previous state, rather than every pixel of the image. The benefit is that far fewer pixels need to be processed, which reduces the complexity of the task; this use of attention on images is quite close to the human attention mechanism. A rough sketch of a single glimpse step is given below, after which we turn to attention in NLP.
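The following is a deliberately simplified, minimal sketch of one glimpse step under my own assumptions (all weight names are hypothetical; the real model uses multi-resolution glimpses, a separate location network, and REINFORCE training, all omitted here):

```python
import numpy as np

def glimpse_step(image, loc, h_prev, params, patch=8):
    """One toy glimpse step: crop a small patch, encode it, update the RNN
    state, and predict where to look next. Weight names are hypothetical."""
    r, c = loc
    g = image[r:r + patch, c:c + patch].ravel()           # only a small patch, not the whole image
    g_feat = np.tanh(params["W_g"] @ g + params["b_g"])   # encode the glimpse
    h = np.tanh(params["W_h"] @ h_prev + params["W_x"] @ g_feat)  # recurrent state update
    next_loc = params["W_l"] @ h                           # where to attend at the next step
    return h, next_loc
```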

3. Attention-based RNN in NLP

3.1 Neural Machine Translation by Jointly Learning to Align and Translate [1]

This paper is considered the first work to use attention in NLP. The authors apply the attention mechanism to neural machine translation (NMT), a typical sequence-to-sequence (encoder-decoder) task. Traditional NMT uses two RNNs: one encodes the source sentence into a fixed-length intermediate vector, and the other decodes that vector into the target language. The traditional model is shown in the figure below:

[Figure: traditional encoder-decoder NMT model]

This paper proposed an NMT model based on the attention mechanism, as shown in the figure below:

[Figure: attention-based NMT model]

In the figure I have only drawn the attention connections for the first two target words; the remaining words are analogous. As the figure shows, attention-based NMT connects the representation learned for every word on the source side (the traditional model uses only the representation of the last word) to the target word currently being predicted, and this connection is made through the attention mechanism they designed. Once the model is trained, the attention matrix gives an alignment matrix between the source and target sentences. The specific design of the attention in the paper is as follows:

[Figure: the attention (alignment) computation proposed in the paper]

As can be seen, a small perceptron is used to score each target word against each word in the source sentence, and the scores are normalized with a softmax into a probability distribution; these distributions form the attention matrix.
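As a concrete illustration of this “perceptron score plus softmax” design, here is a minimal NumPy sketch of one decoding step; the variable names are mine rather than the paper’s notation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One row of the attention (alignment) matrix and the resulting context.

    s_prev : previous decoder state, shape (n,)
    H      : encoder annotations h_1..h_Tx, shape (Tx, m)
    Each source word j gets the score v_a . tanh(W_a s_prev + U_a h_j);
    a softmax turns the scores into attention weights, and the context
    vector is the weighted sum of the annotations.
    """
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = softmax(scores)   # attention weights over the source words
    context = alpha @ H       # context vector fed to the decoder at this step
    return alpha, context
```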

[Figure: translation results of RNNsearch (attention-based NMT) vs. RNNenc (traditional NMT)]

The results show a clear improvement over traditional NMT (RNNsearch is the attention-based model, RNNenc the traditional one); the most notable features are that the alignments can be visualized and that the model handles long sentences much better.

3.2 Effective Approaches to Attention-based Neural Machine Translation [2]

This paper is a highly representative follow-up to the previous one: it shows how attention can be extended within RNN models and did a great deal to popularize attention-based models in NLP. The authors propose two kinds of attention mechanism: global and local.

First, global attention. It is essentially similar to the attention of the previous paper in that it attends over all words on the source side; the difference is that the authors propose several simple alternative score functions for computing the attention values.

[Figure: the global attention model]

[Figure: candidate score functions for computing the attention values]

In their final experiments, the “general” score function gave the best results.
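For concreteness, the three score functions discussed in the paper (dot, general, concat) can be sketched as follows; this is my own paraphrase in NumPy, with my own variable names:

```python
import numpy as np

def score(h_t, h_s, method, W_a=None, v_a=None):
    """Score a target hidden state h_t against a source hidden state h_s.

    dot:     h_t . h_s
    general: h_t . (W_a h_s)
    concat:  v_a . tanh(W_a [h_t; h_s])
    """
    if method == "dot":
        return h_t @ h_s
    if method == "general":
        return h_t @ (W_a @ h_s)
    if method == "concat":
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
    raise ValueError(f"unknown score function: {method}")
```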

Now for the local version. The main idea is to reduce the cost of attention: instead of attending to all words on the source side, at each decoding step the model uses a small prediction function to predict a source position p_t to align to, and then only considers the words inside a window around that position.

[Figure: the local attention model]

They give two ways of obtaining the position, local-m and local-p. When computing the final attention weights, the alignment scores are additionally multiplied by a Gaussian centred at p_t. In their experiments, local attention performed better than global attention.
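Below is a rough sketch of the local-p variant as I understand it (parameter names are mine, and for brevity the scores are computed over all positions rather than only inside the window [p_t - D, p_t + D] used in the paper):

```python
import numpy as np

def local_p_weights(h_t, H_src, D, W_p, v_p, align_scores):
    """Predict the aligned position p_t and reweight the alignment scores.

    h_t          : current target hidden state
    H_src        : source hidden states (one per source word)
    D            : half-width of the attention window
    align_scores : precomputed alignment scores (e.g. from the 'general' score)
    """
    S = len(H_src)
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))   # p_t = S * sigmoid(...)
    sigma = D / 2.0
    positions = np.arange(S)
    gauss = np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    weights = align_scores * gauss                            # favour words near p_t
    return weights / weights.sum(), p_t
```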

I think the greatest contribution of this paper is that it was the first to show that the attention score function can be varied, and that it introduced the idea of local attention.

4. Attention-based CNN in NLP

Subsequently, attention-based RNN models began to be widely applied in NLP, not only for sequence-to-sequence models but also for various classification problems. So, can attention mechanisms also be used in Convolutional Neural Networks (CNNs), which are as popular as RNNs in deep learning? The paper “ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs” [13] proposed three methods for using attention in CNNs, representing an early exploratory work on attention in CNNs.

[Figure: a conventional CNN model for sentence pairs, without attention]

In a traditional CNN model for sentence pairs, as shown in the figure above, each channel processes one sentence and learns a sentence representation, and the two representations are then fed together into the classifier. In such a model, the two sentences do not interact before they reach the classifier. The authors therefore set out to design attention mechanisms that connect the CNN channels of the two sentences.

The first method, ABCNN-1, applies attention before the convolution: an attention matrix between the two sentences is computed and used to derive an attention feature map for each sentence, which is then fed into the convolution layer together with the original feature map. The specific computation is as follows.

[Figure: ABCNN-1, attention applied before the convolution layer]
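The following sketch shows the ABCNN-1 computation as I read the paper; the match score 1 / (1 + Euclidean distance) and the use of two trainable matrices to turn the attention matrix into attention feature maps follow the paper, but shapes and names are my own:

```python
import numpy as np

def abcnn1_attention_maps(F0, F1, W0, W1):
    """Attention feature maps for ABCNN-1.

    F0, F1 : representation feature maps of the two sentences, shapes (d, s0), (d, s1)
    W0     : trainable matrix of shape (d, s1); W1 of shape (d, s0)
    A[i, j] is a match score between word i of sentence 0 and word j of sentence 1;
    the attention feature maps are stacked with F0/F1 as extra input channels.
    """
    s0, s1 = F0.shape[1], F1.shape[1]
    A = np.zeros((s0, s1))
    for i in range(s0):
        for j in range(s1):
            A[i, j] = 1.0 / (1.0 + np.linalg.norm(F0[:, i] - F1[:, j]))
    F0_att = W0 @ A.T    # attention feature map for sentence 0, shape (d, s0)
    F1_att = W1 @ A      # attention feature map for sentence 1, shape (d, s1)
    return A, F0_att, F1_att
```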

The second method, ABCNN-2, applies attention in the pooling layer: before pooling, the convolution outputs are re-weighted using the attention matrix, as illustrated in the following figure.

[Figure: ABCNN-2, attention applied in the pooling layer]
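A rough sketch of the ABCNN-2 idea, attention-weighted pooling, is given below; as a simplification I reuse the same match score as above and a plain sliding window, so this should be read as an illustration rather than the exact model:

```python
import numpy as np

def abcnn2_pooling(C0, C1, w):
    """Attention-weighted average pooling over the convolution outputs.

    C0, C1 : convolution outputs for the two sentences, shapes (d, n0), (d, n1)
    The attention matrix between the convolution outputs is summed along its
    rows/columns to give one weight per unit, and each pooling window of width
    w takes a weighted sum instead of a plain average.
    """
    n0, n1 = C0.shape[1], C1.shape[1]
    A = np.array([[1.0 / (1.0 + np.linalg.norm(C0[:, i] - C1[:, j]))
                   for j in range(n1)] for i in range(n0)])
    a0, a1 = A.sum(axis=1), A.sum(axis=0)     # per-unit attention weights
    pool0 = np.stack([(C0[:, k:k + w] * a0[k:k + w]).sum(axis=1)
                      for k in range(n0 - w + 1)], axis=1)
    pool1 = np.stack([(C1[:, k:k + w] * a1[k:k + w]).sum(axis=1)
                      for k in range(n1 - w + 1)], axis=1)
    return pool0, pool1
```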

The third method, ABCNN-3, simply combines the first two in one network, as shown in the figure below:

[Figure: ABCNN-3, combining ABCNN-1 and ABCNN-2]

This paper provides us with ideas on using attention in CNNs. There are now many works using attention-based CNNs that have achieved good results.

5. Conclusion

To sum up, I think attention in NLP can be viewed as a form of automatic weighting: it connects any two components you want to relate by putting weights between them. The mainstream way of computing attention at the moment is roughly the following:

[Figure: general formula for computing attention]

A score function f relates the target module m_t to the source module m_s, and the scores are normalized with a softmax to obtain a probability distribution, i.e. the attention weights.
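Put as a tiny generic sketch (not the formula of any particular paper), the pattern looks like this:

```python
import numpy as np

def attention(m_t, M_s, score):
    """Generic attention: score the target vector against each source vector,
    softmax-normalize the scores, and take the weighted sum of the sources.

    m_t   : target-side vector
    M_s   : matrix of source-side vectors, one per row
    score : any function relating m_t to a source vector (dot product, MLP, ...)
    """
    scores = np.array([score(m_t, m_s) for m_s in M_s])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # the attention distribution
    return weights, weights @ M_s       # weights and the resulting context vector

# Example with dot-product scoring:
# weights, context = attention(query, keys, score=lambda a, b: a @ b)
```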

Attention is now used very widely in NLP. One of its great advantages is that the attention matrix can be visualized, showing which parts of the input the network focused on while performing the task.

[Figure: example visualization of an attention matrix]

However, the attention mechanism used in NLP still differs from human attention: it generally has to compute a score for every candidate item and store the weights in an extra matrix, which actually adds overhead, whereas a human can simply ignore the unimportant parts and process only what they attend to.

References

[1] Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR (2015).

[2] Luong, M.-T., Pham, H. & Manning, C. D. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 1412–1421 (2015).

[3] Rush, A. M. & Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. EMNLP (2015).

[4] Allamanis, M., Peng, H. & Sutton, C. A Convolutional Attention Network for Extreme Summarization of Source Code. Arxiv (2016).

[5] Hermann, K. M. et al. Teaching Machines to Read and Comprehend. arXiv 1–13 (2015).

[6] Yin, W., Ebert, S. & Schütze, H. Attention-Based Convolutional Neural Network for Machine Comprehension. 7 (2016).

[7] Kadlec, R., Schmid, M., Bajgar, O. & Kleindienst, J. Text Understanding with the Attention Sum Reader Network. arXiv:1603.01547v1 [cs.CL] (2016).

[8] Dhingra, B., Liu, H., Cohen, W. W. & Salakhutdinov, R. Gated-Attention Readers for Text Comprehension. (2016).

[9] Vinyals, O. et al. Grammar as a Foreign Language. arXiv 1–10 (2015).

[10] Wang, L., Cao, Z., De Melo, G. & Liu, Z. Relation Classification via Multi-Level Attention CNNs. ACL 1298–1307 (2016).

[11] Zhou, P. et al. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. ACL 207–212 (2016).

[12] Yang, Z. et al. Hierarchical Attention Networks for Document Classification. NAACL (2016).

[13] Yin, W., Schütze, H., Xiang, B. et al. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. arXiv:1512.05193 (2015).

[14] Mnih, V., Heess, N., Graves, A. et al. Recurrent Models of Visual Attention. NIPS 2204–2212 (2014).

Source: Blog Garden
