Click the "Little White Learns Visual" above, select "Star" or "Top"
Important content, delivered at the first time
In simple terms, on a dataset without region annotations, the attention mechanism learns which regions it should focus on by relying only on the model itself. Is this method reliable?
Author: Zhihu User
https://www.zhihu.com/question/444712435/answer/1755066079

I plan to use a very simple toy model to illustrate why attention tends to concentrate on certain regions as the model is trained.

Consider a sample $x$ with target $y$, weights $w$, and noise $\varepsilon$ with zero mean, independent components, and per-component variance $\sigma_i^2$. Our toy model is the following objective:

$$\min_w \; \mathbb{E}_\varepsilon\big[\big(y - w^\top (x + \varepsilon)\big)^2\big].$$

The aim is to show that the size of each component of $w$ is inversely related to the variance of the corresponding component. In layman's terms, the less certain a place is, the smaller its weight; the more certain a place is, the larger its weight, which matches basic intuition.

The proof is not difficult; just expand the square and take the expectation over the noise:

$$\mathbb{E}_\varepsilon\big[\big(y - w^\top x - w^\top \varepsilon\big)^2\big]
= (y - w^\top x)^2 + \sum_i \sigma_i^2 w_i^2 - 2\,(y - w^\top x)\,\mathbb{E}_\varepsilon\big[w^\top \varepsilon\big].$$

The last term averages to zero because the noise has zero mean, leaving the second term, which acts as an $\ell_2$ constraint: the larger the variance $\sigma_i^2$, the stronger the penalty on $w_i$. From ridge regression we can see that the optimal $w$ is, roughly speaking, inversely proportional to $\sigma_i^2$ in each component.
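To make this concrete, here is a minimal numerical sketch (my own illustration, not code from the original answer): it minimizes the objective above both in closed form and by averaging many noise draws. The input values, target, and variance grid are arbitrary choices; the weights on noisier components indeed come out smaller.

```python
# A minimal numerical check of the toy model above (my own sketch, not the answer's code).
# Minimizing E_eps[(y - w.(x + eps))^2] is ridge regression with a per-component penalty
# sigma_i^2, so components with noisier inputs receive smaller weights.
import numpy as np

rng = np.random.default_rng(0)

d = 5
x = np.ones(d)                                   # identical signal in every component
y = 1.0                                          # arbitrary scalar target
sigma2 = np.array([0.01, 0.1, 0.5, 1.0, 5.0])    # per-component noise variance

# Closed-form minimizer of (y - w.x)^2 + sum_i sigma2_i * w_i^2:
#   (x x^T + diag(sigma2)) w = y x
w_closed = np.linalg.solve(np.outer(x, x) + np.diag(sigma2), y * x)

# Monte-Carlo check: minimize the empirical average of the noisy objective directly.
eps = rng.normal(scale=np.sqrt(sigma2), size=(200_000, d))
A = x + eps                                      # noisy inputs, one row per draw
w_mc = np.linalg.solve(A.T @ A / len(A), y * A.mean(axis=0))

print("noise variance:", sigma2)
print("closed-form w :", np.round(w_closed, 3))
print("monte-carlo w :", np.round(w_mc, 3))
# Both weight vectors shrink as sigma_i^2 grows, matching the claim above.
```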
Regarding the relationship with attention: assume the representations are almost fixed, and consider the attention logits, each component of which is a pairwise dot product $z_i^\top z_j$ of representations, where $i$ and $j$ index positions. Suppose positions $i$ and $j$ are close to each other, for example two neighboring pixels in an image, or adjacent words in a language task; the representations of adjacent positions usually have a strong correlation. If $z$ itself has little variation in its numerical range, for instance after many strong normalization steps the representations basically lie on the surface of an ellipsoid, then we can expect that $z_i^\top z_j$ will not fluctuate much.
In other words, that component of the logits carries very little noise. According to the previous toy model, we can roughly conclude that the stronger the correlation, the lower the co-occurrence uncertainty, and the larger the weight will naturally be. This also explains why, in many pre-trained language models, the attention matrix is strongest for nearby tokens first, and only then for distant ones.

Of course, this perspective assumes many things, for example that the representations are fixed. In reality the representations are also trained, so it remains unclear why the attention weights still exhibit regional concentration when everything is moving at once. Still, it offers some guidance: low-frequency co-occurring "tokens" will be automatically down-weighted by the model, which emphasizes only high-frequency co-occurring pairs.

Author: Zhihu User
https://www.zhihu.com/question/444712435/answer/1733403787

Well, now that you put it that way, I'm not sleepy anymore. Here is a quick, informal description of top-down attention in computer vision; it touches on many topics and I don't have time to elaborate on each one:

From the perspective of how attention is implemented, the top-down attention methods commonly used in computer vision (CAM, GradCAM, etc.) first let the classifier produce a classification decision and then backtrack to highlight the regions that contributed strongly to it. Methods like CAM highlight key areas by weighting and summing feature maps, with the weights coming either from the classifier weights after global pooling (CAM) or from the magnitude of the gradients during backpropagation (GradCAM). Different methods vary in quality, but all of them reveal which areas the model relied on when reaching its conclusion.

From the perspective of how attention is learned, I personally interpret this mainly as an iterative multiple-instance learning (MIL) process for solving weakly supervised learning. MIL formulates weakly supervised learning at the level of bags of samples: several instances form a bag, a positive bag contains at least one positive instance, and a negative bag contains none. In weakly supervised learning, an image can be viewed as a bag, in which the patches covering the target are the actual positive instances. Solving MIL requires a pseudo-labeling step, i.e., a strategy for assigning pseudo positive labels to some strongly responding instances in a positive bag (common strategies include Top-1, Top-N, assigning positive labels to everything, etc.), followed by iterative, chicken-and-egg optimization. In CNN training, these pseudo-label strategies show up as the usual pooling modules at the end of the network, such as Global Max Pooling (GMP), Log-Sum-Exp (LSE), and Global Average Pooling (GAP). Since CAM can be computed before GAP (readers familiar with the details can try a simple proof: the FC-layer weights can be moved in front of the global pooling, so computing the CAM first and then global-average-pooling it is mathematically equivalent to the usual order of pooling first and applying the classifier afterwards; a small numerical check is sketched below), training a CNN classifier is equivalent to an MIL process that treats the CAM attention as patch scores, with global pooling providing the patch-level pseudo-labels for CAM learning.

From this account of how attention forms, it is also clear that pseudo-labels are not an inherently reliable mechanism; it is entirely possible to start from incorrect labels and keep propagating the errors.
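Here is that numerical check of the CAM-before-GAP equivalence (my own sketch with arbitrary toy sizes, not code from the answer): pooling the feature maps and then applying the classifier weights gives the same class scores as forming the CAM first and average-pooling it.

```python
# A small numerical check of the equivalence mentioned above (my own sketch, arbitrary toy
# sizes): applying the classifier weights after global average pooling gives the same class
# scores as forming the CAM first and then average-pooling it.
import numpy as np

rng = np.random.default_rng(0)
C, K, H, W = 3, 8, 7, 7                 # classes, channels, spatial size

feat = rng.normal(size=(K, H, W))       # final conv feature maps for one image
fc_w = rng.normal(size=(C, K))          # weights of the linear classifier that follows GAP
fc_b = rng.normal(size=C)

# Standard forward pass: GAP over space, then the linear classifier.
logits_pool_first = fc_w @ feat.mean(axis=(1, 2)) + fc_b

# CAM for class c: weighted sum of feature maps, CAM_c = sum_k w_{c,k} * F_k.
cam = np.tensordot(fc_w, feat, axes=([1], [0]))          # shape (C, H, W)

# Average-pooling the CAM recovers the same logits, since both steps are linear.
logits_cam_first = cam.mean(axis=(1, 2)) + fc_b

print(np.allclose(logits_pool_first, logits_cam_first))  # True
```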
In practice, backtracked attention does have perceptible problems, which can be described in terms of correctness and quality.

The so-called correctness issue of attention is the phenomenon where the positions emphasized by attention do not match human expectations. A representative paper here is the Guided Attention Inference Network (GAIN), which gives an easy-to-understand example, the boat-water problem: when the model learns the "boat" category, the high statistical correlation between boat and water semantics in some datasets can make the model completely confuse the two concepts, so that the CAM for "boat" falls entirely on the background water. Note that the boat-water problem is only an extreme example; in real applications, because tasks are complex, concept confusion and correctness are often intertwined, and it is hard to see intuitively how to fix the problem by reshaping the data distribution (adding boats without water, water without boats, and so on). The solution proposed in GAIN is therefore to treat the CAM as a learnable output and add extra supervision (pixel-level masks) as a surgical correction, enforcing consistency between the classifier's decision and the correctness of its supporting evidence, which naturally increases annotation cost.

The so-called quality issue of attention is the phenomenon where, even when the positions are right, the shapes and extents are poorly captured; for example, CAM outputs are often incomplete on large targets or multi-target localization, highlighting only parts of the target. This issue stems mainly from the mismatch between learning through classification and using the result for localization. Intuitively, much like light taking the shortest path, the greedy nature of optimization drives the model to stop exploring the remaining information once the classification task is solved, so the classifier draws its conclusion directly from the most discriminative regions of the target. A representative line of work that addresses this without extra supervision is the Adversarial Erasing series, which uses a cascade of models to improve the completeness of attention: each model erases the regions highlighted by its CAM from the image and passes the result to the next model, forcing it to mine the remaining discriminative evidence, so that the target region is covered more completely (one erasing step is sketched below).
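A rough sketch of a single erasing step (my own illustration, not the papers' code; the threshold and the zero fill value are arbitrary choices):

```python
# One adversarial-erasing step: blank out the pixels where the current model's CAM responds
# most strongly, so the next model in the cascade must find the remaining evidence.
import numpy as np

def erase_top_regions(image, cam, keep_ratio=0.8):
    """image: (H, W, 3) array; cam: (H, W) CAM already resized to the image resolution."""
    threshold = np.quantile(cam, keep_ratio)   # response level of the top (1 - keep_ratio)
    keep = cam < threshold                     # True where the pixel is kept
    return image * keep[..., None]             # erased pixels are set to zero

# Toy usage: a random "image" and a CAM that peaks at the centre.
rng = np.random.default_rng(0)
img = rng.uniform(size=(56, 56, 3))
yy, xx = np.mgrid[:56, :56]
cam = np.exp(-((yy - 28) ** 2 + (xx - 28) ** 2) / (2 * 10.0 ** 2))
erased = erase_top_regions(img, cam)
print(erased.shape, float((erased == 0).all(axis=-1).mean()))  # ~20% of pixels erased
```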
Regarding the correctness issue of attention, my own work, "Rectifying Supporting Regions With Mixed and Active Supervision for Rib Fracture Recognition," also makes some modest contributions: (1) it uses a simple, easy-to-implement adversarial-sample method to relax the pixel-level extra supervision required by GAIN to bounding-box-level labels, which works better when target boundaries are fuzzy or absent, or when the annotation budget is low; and (2) it combines attention-driven active learning to further reduce the number of bounding-box labels required. The final effect is that 20% precise annotations plus 80% coarse annotations achieve localization accuracy close to fully supervised learning with 100% precise annotations.

Author: Howard
https://www.zhihu.com/question/444712435/answer/1734304145

@Xin Zhao Bi discussed the backtracking attention mechanism CAM, which is mainly used in CV. The attention mechanism, however, was first proposed in the NLP field.

Initially, attention in NLP simply meant setting up a vector, computing its inner product with the token vectors of the text, normalizing through a softmax to obtain a weight for each token, and then using those weights to aggregate the sequence of vectors into a single vector for the downstream task.

I really like @Xin Zhao Bi's MIL explanation. The simplest text classification model uses average pooling, which in effect assumes that every word in the text is a positive instance. The attention mechanism was proposed to pick out the tokens that actually matter in the sequence as the positive instances. Below is an example from Hierarchical Attention Networks for Document Classification, showing which words the model focused on when making its judgment.
[Figure: attention-weight visualization from Hierarchical Attention Networks for Document Classification.]
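A minimal sketch of the attention pooling described above (my own illustration; the HAN paper additionally passes each token through a small MLP before scoring, which is omitted here): one learned query vector scores every token, a softmax turns the scores into weights, and the weighted sum replaces average pooling.

```python
# Attention pooling over token vectors, as a drop-in replacement for average pooling.
import numpy as np

def attention_pool(token_vecs, query):
    """token_vecs: (T, d) token embeddings; query: (d,) learned attention vector."""
    scores = token_vecs @ query                        # one inner product per token
    scores -= scores.max()                             # numerical stability for the softmax
    weights = np.exp(scores) / np.exp(scores).sum()    # normalized token weights
    return weights @ token_vecs, weights               # pooled (d,) vector and the weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))    # a toy "sentence": 6 tokens, 16-dimensional embeddings
query = rng.normal(size=16)          # in a real model this vector is learned with the task
pooled, w = attention_pool(tokens, query)
print(pooled.shape, np.round(w, 3), round(float(w.sum()), 3))   # (16,), ..., 1.0
```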
How does the attention mechanism learn which words the model should pay attention to? A geometric intuition: in high-dimensional space, the learning process keeps pulling the vectors of tokens that are useful for the task closer to the attention vector. What is ultimately learned is a huge high-dimensional ball centered on the attention vector, in which the token vectors closest to the center are the ones most relevant to the task. In other words, the attention mechanism learns the feature words related to the task.

Author: Anonymous User
https://www.zhihu.com/question/444712435/answer/1756285423

The attention mechanism can be seen as a special message-passing mechanism built from linear operations. The basic idea is:

(1) The system is divided into multiple subsystems that need to communicate with each other.

(2) The features of the different subsystems are linearly embedded, and inner products in the embedded space are used to compute their relevance to each other.

(3) The relevance between subsystems serves as the weights used to aggregate the messages (another linear embedding of the features).

The core is therefore the correlation between subsystems, but computed after an embedding operation (which feels a bit like a kernel trick). Clearly this embedding could be extended to general non-linear forms, but linear operations have lower complexity and are basically sufficient for the task, which is why attention uses them. The embedding itself has to be learned through training.

If we further restrict the general linear operation to rotations and translations, the setup can be likened to a classical registration problem in image processing: high relevance corresponds to a fixed coordinate transformation between subsystems, i.e., a basically fixed positional relationship, while weak relevance means the positional relationship between subsystems is not fixed. The attention task is essentially asking whether we can estimate the positions of the other subsystems from the position of one subsystem. Under the constraint that the weights are normalized, we want to maximize the reliability of the overall position estimate, or minimize the weighted error, and intuitively we will place more trust in the parts with fixed positional relationships. So the data-driven optimization of the attention mechanism ends up finding relatively fixed coordinate transformations between subsystems and assigning them larger weights, while down-weighting the positional information of subsystems that have no fixed transformation relationship.
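The three steps above are exactly the structure of standard scaled dot-product attention; here is a compact sketch (my own illustration, with random matrices standing in for the trained linear embeddings):

```python
# Message passing between "subsystems" via scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, dk = 4, 8, 8                    # 4 "subsystems", feature dim 8, embedding dim 8
feats = rng.normal(size=(n, d))       # features of each subsystem

Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))    # linear embeddings (trained)
Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv

relevance = softmax(Q @ K.T / np.sqrt(dk), axis=-1)   # pairwise relevance between subsystems
messages = relevance @ V                               # relevance-weighted messages

print(relevance.round(2))   # each row sums to 1: how much subsystem i listens to each j
print(messages.shape)       # (4, 8): the aggregated message received by each subsystem
```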