Understanding Attention Mechanisms in Depth


I have recently been organizing material on applications of Attention in deep recommendation systems, so I wrote this introductory article on Attention. Since it was proposed in the ICLR 2015 paper "Neural Machine Translation by Jointly Learning to Align and Translate", Attention has flourished in NLP and computer vision. What makes Attention special? What are its principles and its essence? Can the various forms of Attention be understood within a unified framework? And how should we understand its many variants?

What is the essence of Attention?

The Attention mechanism is inspired by human visual attention: concentrating limited attention on key information, shifting "from focusing on everything to focusing on the key points", thereby saving resources and quickly extracting the most useful information. Concretely, Attention is a mechanism for distributing weights, designed to help the model capture important information. "Weighted summation" is a good one-line summary: the model focuses on different information in different contexts. In many applications, Attention takes over part of the job of feature selection and feature representation.

The principle of Attention

The Attention mechanism can be viewed as a query mechanism, in which a query retrieves from a memory area. We denote the query as key_q; the memory is a set of key-value pairs with M items, the i-th item being <key_m[i], value_m[i]>. By computing the relevance between the query and key_m[i], we determine the weight of value_m[i] in the query result. Note that key_q, key_m, and value_m are all vectors. The essence of the Attention function can therefore be described as a mapping from a query to a series of (key, value) pairs, as shown in the figure below.

[Figure: Attention as a mapping from a query to a set of key-value pairs]

It mainly includes the following three steps:
  1. Score function: measures the similarity between the memory (environment) vectors and the current query vector, identifying which input information should be focused on in the current context:

    e_i = score(key_q, key_m[i])

  2. Alignment function: computes the attention weights, usually normalizing with softmax:

    a_i = softmax(e_i) = exp(e_i) / Σ_j exp(e_j)

  3. Context vector function: obtains the output vector as the weighted average of all values, using the attention weights:

    c = Σ_i a_i · value_m[i]
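The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular paper's implementation; the dot-product score and the shapes are chosen for simplicity:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values):
    """Single-query attention over a memory of (key, value) pairs."""
    scores = keys @ query       # step 1: score function (dot product here)
    weights = softmax(scores)   # step 2: alignment function (softmax)
    context = weights @ values  # step 3: weighted average of the values
    return context, weights

rng = np.random.default_rng(0)
q = rng.normal(size=4)       # the query vector key_q
K = rng.normal(size=(5, 4))  # 5 memory keys key_m[i]
V = rng.normal(size=(5, 3))  # 5 memory values value_m[i]
c, w = attention(q, K, V)    # c: context vector, w: attention weights
```

The attention weights `w` form a probability distribution over the 5 memory items, and the context vector `c` lives in the value space.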

Common types of Attention


The types of Attention usually revolve around the score function, alignment function, and context vector function, with variations among these three:

[Figure: common variants of the score function]
1. Score function
How the output vector is generated was covered by the transformations above; the score function is where most of the variety lies. The most common score functions are those shown in the figure above (nearly all of them). Essentially, a score function measures the similarity between two vectors. If the two vectors live in the same space, the dot product can be used directly (or the scaled dot product, where scaling keeps the values small so that softmax retains larger gradients and learns faster). If they are not in the same space, some transformation is needed first (a transformation can also be applied within the same space): additive attention applies linear transformations to the inputs and then adds them, while multiplicative attention transforms them directly via matrix multiplication (which is where the names "additive" and "multiplicative" attention come from).
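As a sketch, the score functions mentioned above might look like this in NumPy. The parameter matrices `W`, `W1`, `W2` and the vector `v` are random stand-ins for learned weights:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
q = rng.normal(size=d)        # query vector
k = rng.normal(size=d)        # one memory key
W = rng.normal(size=(d, d))   # matrix for multiplicative ("general") attention
W1 = rng.normal(size=(d, d))  # matrices and vector for additive attention
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

dot = q @ k                              # dot product
scaled_dot = q @ k / np.sqrt(d)          # scaled dot product
general = q @ W @ k                      # multiplicative: transform, then multiply
additive = v @ np.tanh(W1 @ q + W2 @ k)  # additive: transform, add, squash
```

All four produce a single scalar score per (query, key) pair; softmax over those scores then gives the alignment weights.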
2. Alignment function — global/local attention
Intuitively, these differ in the set of vectors being weighted and summed: global attention uses all input vectors as the weighted set, with softmax as the alignment function, while local attention lets only a subset of the input vectors into this pool. The motivation for local attention is to reduce noise and narrow the focus region further. The next question is how to determine this local range. The paper proposes two schemes: local-m and local-p. Local-m relies on a simple monotonic assumption, taking the focus position directly from the current target position. Local-p adds a prediction step: it estimates which position p_t in the input sequence (of total length S) should be attended to at the current step, using two parameter vectors v_p and W_p, and then slightly adjusts the alignment by multiplying the softmax attention weights with a Gaussian distribution centered at p_t.
The authors conclude that local-p combined with the "general" score function (the multiplicative version in the figure) gives the best results. Nevertheless, from the global/local perspective, global attention remains the more common choice, since the performance gain from the more complex local attention does not appear significant.
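The Gaussian re-weighting of local-p can be illustrated as follows. This is a toy sketch: `p_t` is hard-coded rather than predicted from v_p and W_p, and `D` is an assumed window half-width, with sigma = D/2 as in the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

S = 10        # source sequence length
D = 2         # assumed window half-width; sigma = D / 2
p_t = 6.3     # predicted focus position (would come from v_p, W_p)

scores = np.random.default_rng(2).normal(size=S)
base = softmax(scores)                 # global softmax weights

positions = np.arange(S)
gaussian = np.exp(-(positions - p_t) ** 2 / (2 * (D / 2) ** 2))
local_weights = base * gaussian        # weights decay away from p_t
```

Positions far from `p_t` are suppressed toward zero, so attention is effectively confined to a window around the predicted position.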
3. Context vector function — hard/soft attention
Hard attention performs random sampling: the sampling set is the collection of input vectors, and the sampling distribution is given by the attention weights from the alignment function. The output of hard attention is therefore a single input vector. Soft attention performs a weighted sum: the summation set is the collection of input vectors, with the attention weights from the alignment function as coefficients. Of the two, soft attention is far more common (all attention discussed below is of this kind), because it is differentiable and can be embedded directly into a model for end-to-end training.
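The difference between the two can be shown in a couple of lines (the weights here are hypothetical; in practice they come from the alignment function):

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.normal(size=(5, 3))                     # 5 input vectors
weights = np.array([0.1, 0.4, 0.2, 0.2, 0.1])   # attention weights (sum to 1)

soft_output = weights @ V             # soft: differentiable weighted sum
idx = rng.choice(len(V), p=weights)   # hard: sample one index from the weights
hard_output = V[idx]                  # hard: a single input vector
```

`soft_output` blends all inputs and admits ordinary backpropagation; `hard_output` is exactly one of the rows of `V`, so training hard attention requires techniques such as REINFORCE.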

Examples of Attention: Bahdanau Attention & Luong Attention

Below is a comparison chart of the two:

[Figure: side-by-side comparison of Bahdanau attention and Luong attention]

Example of Attention: self-Attention

Given the common types of Attention above, this section focuses on self-attention. In general, Q and K come from different sources, whereas in self-attention each word uses its own embedding as the query and queries a memory composed of all the words' embeddings, producing a new representation for that word. If the sentence length is n, querying the memory with all the words still yields a sequence of length n.
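A minimal sketch of this idea, with no learned projection matrices (real implementations first project X into separate Q, K, V):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 6, 8                        # sentence length, embedding size
rng = np.random.default_rng(4)
X = rng.normal(size=(n, d))        # one embedding per word

# Each word's embedding serves as query, key, and value at once.
weights = softmax(X @ X.T)         # (n, n): how much word i attends to word j
out = weights @ X                  # (n, d): one new representation per word
```

The output has the same length n as the input sentence, as described above; row i of `weights` is the attention distribution of word i over all words, including itself.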

[Figure 8: visualization of self-attention]

Visualization of Self-Attention Example

Figure 8 visualizes the Self-Attention mechanism, which can capture syntactic or semantic relations among the words of a sentence. Here it captures a semantic relation: "it" refers to "the animal".
With Self-Attention, every word in a sentence can build a direct relation with any other word, regardless of distance. This greatly shortens the path between long-distance dependencies and makes such features much easier to exploit. In addition, Self-Attention directly increases computational parallelism. These are the main reasons Self-Attention is so widely used.

Self-attention In Transformer

First, let’s discuss the self-attention mechanism in Transformers. The basic form of self-attention has been covered, but the self-attention mechanism in Transformers is a new variant, reflected in two aspects: one is the addition of a scaling factor, and the other is the introduction of a multi-head mechanism.

The scaling factor appears in the attention formula: the dot products are divided by the square root of the key dimension, to avoid overly large values that push softmax into saturation and yield very small gradients. The self-attention formula in the Transformer is:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
The multi-head mechanism introduces multiple sets of parameter matrices that linearly transform Q, K, and V before self-attention is computed; all of the results are then concatenated to form the final self-attention output. The description alone may not be clear; the formula and illustration below make it concrete:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

[Figure: multi-head attention]
This method enables the model to have multiple relatively independent attention parameters, theoretically enhancing the model’s capabilities.
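Putting the scaling factor and the multi-head mechanism together, here is a minimal NumPy sketch; random matrices stand in for the learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V   # scaling by sqrt(d_k)

def multi_head(X, Wq, Wk, Wv, Wo):
    # One independent attention head per (wq, wk, wv) parameter set.
    heads = [scaled_dot_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo   # concat, then output projection

rng = np.random.default_rng(5)
n, d_model, h = 6, 16, 4
d_head = d_model // h
X = rng.normal(size=(n, d_model))             # input sequence
Wq = rng.normal(size=(h, d_model, d_head))    # h sets of projection matrices
Wk = rng.normal(size=(h, d_model, d_head))
Wv = rng.normal(size=(h, d_model, d_head))
Wo = rng.normal(size=(d_model, d_model))      # output projection W^O
out = multi_head(X, Wq, Wk, Wv, Wo)
```

Each head attends over the full sequence with its own projections, which is what gives the model several relatively independent sets of attention parameters.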

Advantages of Attention

The attention mechanism has the following advantages:

  • Captures global relations directly while also attending to local ones; when computing attention values, each element of the sequence is compared with every other element, so the effective distance between any two elements is one. By contrast, an RNN processes the sequence step by step, so long-range dependencies must survive many recursive steps, and the longer the sequence, the weaker its ability to capture them.
  • Parallel computation reduces model training time; the Attention mechanism allows each step’s computation to be independent of the previous step’s results, enabling parallel processing.
  • Low model complexity with fewer parameters.
However, the drawbacks of the attention mechanism are equally obvious. Because it processes all elements of the sequence in parallel, plain attention cannot take the order of the input elements into account (which is why the Transformer adds positional encodings).

Summary

In short, the Attention mechanism assigns a different weight to each input element, attending more to the parts relevant to the current query while suppressing irrelevant information. Its greatest advantages are the ability to model global and local relations directly and to parallelize computation, which is especially important with large-scale data. Note also that Attention is a general idea, not limited to the Encoder-Decoder framework; it can be combined with various models as the situation requires.

