How Attention Mechanism Learns Regions to Focus On


Link|https://www.zhihu.com/question/444712435

Editor|Deep Learning and Computer Vision

This article is shared for academic purposes only; please contact us for deletion if it infringes any rights.

How Attention Mechanism Learns Regions to Focus On

In simple terms, the attention mechanism learns which regions to focus on using only the model itself, on a dataset with no region annotations. Is this method reliable?

Author: Zhihu User
https://www.zhihu.com/question/444712435/answer/1755066079

I'll try to explain, with a very small toy model, why attention concentrates on particular regions.

Consider a sample $x$, weights $w$, and zero-mean noise $\epsilon$ whose components are uncorrelated with variances $\sigma_i^2$. The toy model is

$$\min_w \; \mathbb{E}_\epsilon\!\left[\big(w^\top(x+\epsilon) - y\big)^2\right].$$

What needs to be shown is that the magnitude of each component of $w$ is inversely proportional to the variance $\sigma_i^2$ of the corresponding component. In plain terms: the less certain a position is, the smaller its weight; the more certain it is, the larger its weight, which matches basic intuition.

The proof is not difficult; just expand the square:

$$\mathbb{E}_\epsilon\!\left[\big(w^\top(x+\epsilon) - y\big)^2\right] = \big(w^\top x - y\big)^2 + \sum_i w_i^2 \sigma_i^2 + 2\,\mathbb{E}_\epsilon\!\left[\big(w^\top x - y\big)\, w^\top \epsilon\right].$$

The last term averages to zero, so only the second term remains, and it acts exactly like an $\ell_2$ constraint: the larger the variance of a component, the stronger the penalty on the corresponding weight. From ridge regression we know the optimal $w_i$ is inversely proportional to $\sigma_i^2$.

Now the connection to attention. Assume the representations are almost fixed, and think of each component of $x$ as the dot product of certain representations, $x_{IJ} = \langle h_I, h_J\rangle$, where $I$ and $J$ index positions. Suppose positions $I$ and $J$ are close to each other, such as two neighbouring pixels in an image or adjacent words in a sentence. The representations of two adjacent positions generally have a strong correlation; if the representations themselves vary little in numerical range (for example, after many strong normalization steps they lie essentially on the surface of an ellipsoid), then $\langle h_I, h_J\rangle$ will not fluctuate much, i.e. the noise term in $x_{IJ}$ is very small. By the toy model above, the stronger the correlation, the lower the co-occurrence uncertainty, and the larger the weight naturally becomes.

This also explains why, in many pretrained language models, the attention matrix is strongest for adjacent tokens first and only then for distant ones.

Of course, this perspective assumes a lot, for example that the representations are fixed; in reality the representations are also being trained, and it is less clear why attention weights still concentrate on specific regions while everything is moving. Still, it gives some guidance: the model automatically down-weights attention on token pairs that co-occur rarely and emphasizes pairs that co-occur frequently.
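A quick numerical check of this claim (my own sketch, not from the original answer; the values and variable names are made up): setting the gradient of the toy objective to zero gives $w^* = y\,(xx^\top + \mathrm{diag}(\sigma^2))^{-1}x$, so components with larger noise variance receive smaller weights.

```python
import numpy as np

# Sketch: minimize E_eps[(w^T (x + eps) - y)^2] for zero-mean noise with
# per-component variances sigma2. Setting the gradient to zero gives
# (x x^T + diag(sigma2)) w = y x, a ridge-like closed-form solution.
d = 5
x = np.ones(d)                                       # same signal in every component
y = 1.0                                              # scalar target
sigma2 = np.array([0.01, 0.1, 1.0, 10.0, 100.0])     # per-component noise variance

w_star = y * np.linalg.solve(np.outer(x, x) + np.diag(sigma2), x)
print(w_star)   # weights shrink monotonically as the noise variance grows
```

With $x$ set to all ones, $w^*_i$ is exactly proportional to $1/\sigma_i^2$, matching the ridge-regression argument above.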
Author: Zhihu User
https://www.zhihu.com/question/444712435/answer/1733403787

Well, if you want to talk about this, I'm not sleepy any more. Here is a quick, informal description of top-down attention in computer vision; it touches a lot of knowledge points and I don't have time to elaborate on each one.

From the perspective of how the attention is implemented: the top-down attention commonly used in computer vision (CAM, GradCAM, and the like) first lets a classifier produce a classification conclusion, then backtracks to highlight the regions that contributed strongly to that conclusion. Methods in the CAM family highlight key areas by taking a weighted sum of the feature maps, where the weights come from the classifier weights after global pooling (CAM) or from the magnitudes of back-propagated gradients (GradCAM), and so on. The methods differ in quality, but they all reveal which regions the model relied on when reaching its conclusion.

From the perspective of how the attention is learned: I personally explain it mainly as an iterative multi-instance learning (MIL) process for solving weakly supervised learning. MIL packages samples for weakly supervised learning: several samples form a bag, a positive bag contains at least one positive sample, and a negative bag contains none. In weakly supervised learning, an image can be regarded as a bag, in which the patches containing the target are the true positive samples. Solving MIL requires a pseudo-labelling step, i.e. a strategy for assigning pseudo positive labels to some strongly responding samples in each positive bag (common strategies give positive labels to the Top-1, the Top-N, or all samples), followed by iterative, chicken-and-egg optimization. These pseudo-label strategies show up in CNN training as the usual final pooling modules: Global Max Pooling (GMP), Log-Sum-Exp (LSE), and Global Average Pooling (GAP). Since the CAM can be computed before GAP (readers familiar with the details can verify a very simple fact: the FC-layer weights can be moved in front of the global pooling, so computing the CAM first and then global-average-pooling it is mathematically equivalent to computing the CAM afterwards), training a CNN classifier is equivalent to an MIL process in which the CAM attention plays the role of patch scores and the global pooling provides patch-level pseudo labels for learning the CAM.
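A minimal numerical check of that equivalence (my own sketch; the shapes and names are made up): for a bias-free linear classifier on top of global average pooling, spatially averaging the CAM reproduces the class logits exactly.

```python
import numpy as np

# For a bias-free linear classifier on GAP features, the class logit equals
# the spatial average of that class's CAM, so the CAM can be "computed
# before GAP" without changing the classifier output.
rng = np.random.default_rng(0)
C, H, W, n_classes = 8, 5, 5, 3
feat = rng.normal(size=(C, H, W))          # last conv feature maps
fc_w = rng.normal(size=(n_classes, C))     # classifier weights (no bias)

gap = feat.mean(axis=(1, 2))               # global average pooling, shape (C,)
logits = fc_w @ gap                        # usual forward pass: GAP then FC

cam = np.einsum('kc,chw->khw', fc_w, feat)         # CAM per class, (n_classes, H, W)
assert np.allclose(cam.mean(axis=(1, 2)), logits)  # GAP(CAM) reproduces the logits
```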
From the way this attention is formed, it is also clear that pseudo-labelling is not a fully reliable mechanism in principle: if it starts out wrong, it may well stay wrong. In practice, top-down attention does show perceptible problems, which can be described in terms of correctness and quality.

The correctness issue is that the positions emphasized by the attention do not match human expectations. A representative paper here is the Guided Attention Inference Network (GAIN), which introduces an easy-to-understand example, the ship-water problem. When the model learns the "ship" category, the high statistical correlation between ships and water in some datasets means the model sometimes thoroughly confuses the two concepts, which shows up as the CAM for the "ship" category falling entirely on the background water. Note that the ship-water problem is only an extreme illustration; in real applications, the complexity of the task means concept confusion and correct reasoning are usually mixed together, which makes the problem hard to diagnose intuitively and hard to fix by reshaping the data distribution (adding ships without water, water without ships, and so on). The solution proposed in the GAIN paper is therefore to treat the CAM as a learnable output and add extra supervision (pixel-level masks) as a surgical correction, enforcing consistency between the classifier's decision and the evidence it uses, which naturally raises the labelling cost.

The quality issue refers to cases where the position is right but the shape and extent are poorly captured: on large targets or multiple targets, the CAM output is often incomplete and highlights only parts of them. This stems mainly from the mismatch of objectives that comes from using "classification" to learn a localization task. Intuitively, just as light always takes the shortest path, the greedy nature of optimization drives the model to stop exploring the remaining evidence once the classification task is solved, so the classifier reaches its conclusion from the most discriminative parts of the target alone. A representative line of work that addresses this without extra supervision is the Adversarial Erasing series, which improves the completeness of attention with a cascade of models: each model erases its highlighted CAM region from the image and feeds the result to the next model, forcing it to mine the remaining discriminative evidence, so that together they cover the target area more completely.
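To make the erase-and-mine step concrete, here is a rough sketch (my own illustration, not code from the Adversarial Erasing papers; the threshold and fill value are arbitrary choices): the current model's normalized CAM is thresholded and the highlighted pixels are blanked out before the image is passed to the next model in the cascade.

```python
import numpy as np

def erase_highlighted_region(image, cam, threshold=0.5):
    """Zero out pixels whose normalized CAM response exceeds `threshold`.

    Simplified sketch of one adversarial-erasing step; real implementations
    differ in how they threshold, what fill value they use, and where in the
    training pipeline the erasing happens.
    """
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    keep = cam < threshold                                    # keep low-response pixels
    return image * keep[..., None]                            # broadcast over channels

# Each model in the cascade is then trained on the image erased according to
# the previous model's CAM.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
cam = rng.random((224, 224))
erased = erase_highlighted_region(image, cam)
```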

On the correctness issue, my own work, "Rectifying Supporting Regions With Mixed and Active Supervision for Rib Fracture Recognition", also makes some modest contributions: (1) a simple and practical adversarial-sample method relaxes the pixel-level supervision required by the GAIN paper to bounding-box labels, which works better when the target boundaries are fuzzy or non-existent and hard to describe with pixel-level labels (or simply when the labelling budget is low); (2) an attention-driven active learning method further reduces the number of bounding-box labels needed. The end result is that 20% precisely labelled data plus 80% coarsely labelled data reaches localization accuracy close to fully supervised training on 100% precisely labelled data.

Author: Howard
https://www.zhihu.com/question/444712435/answer/1734304145

@XinZhaoBi mentioned the top-down attention mechanism (CAM) that is mainly used in computer vision. The attention mechanism, however, was first proposed in NLP. Initially, attention in NLP simply meant setting up a vector, computing its inner product with each text token's vector, normalizing through a softmax to obtain a weight for each token, and then aggregating the tokens into a single vector for downstream tasks.

@XinZhaoBi, I really like the MIL explanation you gave. The simplest text classification model uses average pooling, which amounts to assuming that every word in the text is a positive example. Attention is introduced precisely to pick out the truly useful tokens in the sequence as positive examples. Below is an example from Hierarchical Attention Networks for Document Classification, showing which words the model focused on when making its judgement.

[Figure: attention-weight visualization from Hierarchical Attention Networks for Document Classification, highlighting the words the model attended to.]
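A minimal sketch of that early form of NLP attention (my own illustration; the dimensions and variable names are made up): a single learned query vector scores each token by inner product, a softmax turns the scores into weights, and the weighted sum becomes the vector fed to the downstream task.

```python
import numpy as np

# One trainable "attention vector" scores every token; the softmax of the
# scores gives per-token weights; the weighted sum is the aggregated vector.
rng = np.random.default_rng(0)
d, n_tokens = 64, 12
token_vecs = rng.normal(size=(n_tokens, d))   # token representations
query = rng.normal(size=d)                    # the learned attention vector

scores = token_vecs @ query                   # one inner product per token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                      # softmax over the tokens
doc_vec = weights @ token_vecs                # aggregated vector, shape (d,)
```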

How does the attention mechanism learn which words the model should focus on? A geometric intuition: during training, the model keeps pulling the vectors of useful tokens closer to the attention vector in the high-dimensional space. What is eventually learned is a huge high-dimensional ball centred on the attention vector, in which the token vectors closer to the centre are more relevant to the task. In other words, the attention mechanism learns the feature words related to the task.

Author: Anonymous User
https://www.zhihu.com/question/444712435/answer/1756285423

The attention mechanism can be seen as a special message-passing mechanism built from linear operations. The basic idea:

(1) The system is divided into multiple subsystems that need to exchange messages.

(2) In the attention mechanism, the features of the different subsystems are linearly embedded, and the correlation between the embedded vectors is computed through inner products.

(3) The correlations between subsystems then serve as weights on the messages (another linearly embedded vector).

The core quantity is therefore the correlation between subsystems, computed here through an embedding operation (which feels a bit like a kernel trick). This embedding could clearly be generalized to non-linear operations, but linear operations have lower complexity and are usually sufficient for the task, which is why attention uses them; the embedding itself has to be learned through training (see the sketch below).

If we further restrict the general linear map to rotations and translations, the setting can be likened to a classical registration problem in image processing: high correlation corresponds to a fixed coordinate transformation between subsystems, i.e. an essentially fixed positional relationship, while weak correlation means the positional relationship between subsystems is not fixed. The attention task is basically to ask whether the positions of the other subsystems can be estimated from one subsystem's position. Under the constraint that the weights are normalized, we want to maximize the reliability of the overall position estimate, or equivalently minimize the weighted error. Intuitively, the parameter estimate must place more trust in the parts with fixed positional relationships. So the outcome of the data-driven optimization of attention is that, on the one hand, it finds relatively fixed coordinate transformations between subsystems and assigns them larger weights, and on the other hand, it down-weights the positional information of subsystems that have no fixed transformation relationship.
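To tie points (1)-(3) above to the familiar scaled dot-product form, here is a minimal single-head sketch (my own illustration; all dimensions and names are made up): the linear embeddings are the query/key/value projections, the inner products give the correlations, and the softmax-normalized correlations weight the value "messages".

```python
import numpy as np

# Single-head dot-product attention read as message passing:
# subsystems = positions, linear embeddings = Q/K/V projections,
# correlations = softmaxed inner products, messages = value vectors.
rng = np.random.default_rng(0)
n, d_model, d_k = 6, 32, 16
X = rng.normal(size=(n, d_model))            # features of the n subsystems
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # linear embeddings
corr = Q @ K.T / np.sqrt(d_k)                # inner-product correlations
corr = np.exp(corr - corr.max(axis=-1, keepdims=True))
weights = corr / corr.sum(axis=-1, keepdims=True)   # normalized weights per subsystem
messages = weights @ V                        # weighted message passing, shape (n, d_k)
```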

