MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP master’s and doctoral students, university teachers, and industry researchers. The vision of the community is to promote communication and progress between the academic and industrial circles of natural language processing and machine learning, especially for beginners. Reproduced from | PaperWeekly
This post summarizes the main content of Professor Li Hongyi's Spring 2022 machine learning course on various self-attention mechanisms, which also supplements the 2021 course. Reference video:
https://www.bilibili.com/video/BV1Wv411h7kN/?p=51&vd_source=eaef25ec79a284858ac2a990307e06ae
In the transformer video of the 2021 course, Professor Li explained part of the self-attention material in detail, but self-attention actually comes in many variant forms:
First, let's briefly review the original self-attention. Suppose the input sequence (and hence the set of queries) has length N. To capture the relationship between every pair of tokens, N keys are generated correspondingly, and the dot products between queries and keys produce an Attention Matrix of size N*N. The biggest problem with this approach is that when the sequence is very long, the Attention Matrix becomes huge and the computation becomes expensive.
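To make the quadratic cost concrete, here is a minimal NumPy sketch of plain self-attention; the projection matrices, shapes, and toy sizes are illustrative, not taken from the course:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Plain self-attention over one sequence.
    X: (N, d_model) inputs; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    A = Q @ K.T / np.sqrt(Q.shape[-1])          # (N, N) attention matrix -- the bottleneck
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)          # row-wise softmax
    return A @ V                                # (N, d_v) outputs

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                     # N = 6 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)             # the attention matrix here is 6 x 6 and grows as N^2
```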
For a transformer, self-attention is only one module in a larger network architecture. From the analysis above, the computation of self-attention grows with the square of N. When N is small, simply making self-attention more efficient may have little effect on the overall network's speed. Therefore, improving the computational efficiency of self-attention noticeably speeds up the whole network only when N is particularly large, such as in image processing, where sequences are very long.
How can we speed up the computation of self-attention? From the analysis above, the main bottleneck is computing the Attention Matrix. If, based on human knowledge or experience, we can decide in advance that certain entries do not need to be computed, or only compute a selected subset of entries, we can in principle reduce the computation and improve efficiency.
For instance, in text translation, translating the current token sometimes does not require looking at the whole sequence; knowing just the neighboring tokens on both sides is enough to translate correctly. This is local attention. It greatly improves computational efficiency, but the downside is that it only looks at nearby positions, which makes it not much different from a CNN.
If local attention alone is not satisfactory, another option is to also let the current token attend to neighbors a fixed interval (stride) away on both sides, so that it captures relationships with the more distant past and future. The value of the stride can be chosen freely.
Another method is global attention: select certain tokens of the sequence as special tokens (for example, punctuation marks), or add special tokens to the original sequence. The special tokens attend to the whole sequence and the whole sequence attends to them, while the remaining non-special tokens do not attend to each other. For example, two special tokens are added at the front of the original sequence:
So which attention pattern is best? Only children make choices... In a multi-head network, some heads can perform local attention while others perform global attention, so there is no need to choose. See the following examples:
- Longformer combines the three types of attention described above.
- Big Bird builds on Longformer by additionally attending to randomly selected positions (a combined mask is sketched below).
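As a rough illustration (not the actual Longformer/Big Bird code), the hand-crafted patterns above, plus Big Bird's random entries, can all be expressed as a boolean mask over the N*N attention matrix; entries marked False are simply never computed (or set to -inf before the softmax). The window, stride, and number of global tokens below are illustrative values:

```python
import numpy as np

def sparse_attention_mask(N, window=1, stride=3, n_global=2, random_frac=0.0, seed=0):
    """Boolean (N, N) mask: True = compute this attention entry, False = skip it."""
    idx = np.arange(N)
    rows, cols = np.meshgrid(idx, idx, indexing="ij")
    mask = np.abs(rows - cols) <= window          # local attention: nearby neighbours
    mask |= np.abs(rows - cols) == stride         # strided attention: neighbours `stride` away
    mask[:n_global, :] = True                     # global: special tokens attend everywhere
    mask[:, :n_global] = True                     # ...and every token attends to them
    if random_frac > 0:                           # Big-Bird-style random extra entries
        mask |= np.random.default_rng(seed).random((N, N)) < random_frac
    return mask

print(sparse_attention_mask(8, random_frac=0.1).astype(int))
```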
The methods above all rely on manually deciding where attention needs to be computed and where it does not. But is that the best approach? Not necessarily. Looking at the Attention Matrix, if some positions have very small values, we can simply set them to 0 without noticeably affecting the prediction; in other words, we only need to find the entries of the Attention Matrix that are relatively large. But how do we identify which positions have very small or very large values?
The following two papers present a clustering scheme: queries and keys are first clustered; queries and keys that fall in the same cluster compute attention with each other, while those in different clusters do not, which speeds up the computation of the Attention Matrix. In the example below there are 4 clusters: 1 (red box), 2 (purple box), 3 (green box), 4 (yellow box). Both papers introduce fast, rough clustering methods.
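A crude sketch of the idea (the actual papers use more careful hashing/clustering schemes; the random-projection clustering below is just a stand-in):

```python
import numpy as np

def clustered_attention(Q, K, V, n_clusters=4, seed=0):
    """Attention is computed only between queries and keys in the same (rough) cluster."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(Q.shape[1], n_clusters))   # cheap random projection for rough clustering
    q_cls, k_cls = (Q @ R).argmax(axis=1), (K @ R).argmax(axis=1)
    out = np.zeros((Q.shape[0], V.shape[1]))
    for c in range(n_clusters):
        qi, ki = np.where(q_cls == c)[0], np.where(k_cls == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        A = Q[qi] @ K[ki].T / np.sqrt(Q.shape[1])   # small block instead of the full N x N matrix
        A = np.exp(A - A.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)
        out[qi] = A @ V[ki]
    return out
```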
Is there a way to learn whether or not to compute attention? Yes. We can train another network that takes the input sequence and outputs a weight sequence of the same length for each token; concatenating these weight sequences and transforming them yields a matrix indicating where attention needs to be computed and where it does not. One detail is that different inputs may produce the same weight sequence through this network, which allows the computation to be shared and significantly reduced.
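One way to picture this (heavily simplified; the actual papers take care to keep the whole procedure differentiable, which is glossed over here) is a small network that maps each token to an N-dimensional weight vector, with the stacked weights then thresholded into a compute/skip mask. W1 and W2 below are hypothetical learned parameters:

```python
import numpy as np

def learned_attention_mask(X, W1, W2, threshold=0.5):
    """X: (N, d); W1: (d, H); W2: (H, N). Returns a boolean (N, N) compute/skip mask."""
    H = np.maximum(X @ W1, 0.0)                    # per-token hidden features
    weights = 1.0 / (1.0 + np.exp(-(H @ W2)))      # (N, N) weights squashed into (0, 1)
    return weights > threshold                     # True where attention should be computed
```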
What we discussed above is an N*N matrix, but in practice such a matrix is usually not full rank. This means we can remove redundant columns and reduce the original N*N matrix to a smaller one.
Specifically, select K representative keys from the N keys (each key has a corresponding value) and compute dot products with the queries, producing an N*K attention matrix; the K corresponding representative values are then combined with these weights to produce the output. Why select representative keys rather than representative queries? Because each query corresponds to an output position: reducing the number of queries would shorten the output sequence and lose information.
How do we select representative keys? Here are two methods: one is to convolve (conv) the keys down to a smaller number, and the other is to multiply the key matrix by a learned matrix that reduces the N keys to K.
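Using the matrix-multiplication option, a Linformer-style sketch might look like the following; P is an assumed learned k*N projection (in the convolution option it would be replaced by a 1-D conv over the sequence):

```python
import numpy as np

def compressed_attention(Q, K, V, P):
    """Q: (N, d) queries -- kept at full length so the output still has N positions.
    K: (N, d) keys; V: (N, d_v) values; P: (k, N) learned projection to k representatives."""
    K_small = P @ K                              # (k, d)   k representative keys
    V_small = P @ V                              # (k, d_v) the values that go with them
    A = Q @ K_small.T / np.sqrt(Q.shape[-1])     # (N, k) instead of (N, N)
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V_small                           # (N, d_v)

rng = np.random.default_rng(0)
N, k, d = 100, 10, 16
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, d))
P = rng.normal(size=(k, N))                      # in practice this would be a trained parameter
out = compressed_attention(Q, K, V, P)           # the attention matrix is now N x k
```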
Review the computation process of the attention mechanism, where I is the input matrix and O is the output matrix.
Ignoring softmax for now, the computation can be expressed as follows:
The above process can be accelerated: if we compute V*K^T first and then multiply by Q, the result is exactly the same as computing K^T*Q first and then multiplying by V, but the amount of computation drops significantly.
Appendix: a linear-algebra note on this part
Still using the example above: computing K^T*Q takes N*d*N multiplications, and multiplying V by the resulting A takes d'*N*N multiplications, so the total is (d+d')*N^2. If instead we compute V*K^T first, that takes d'*N*d multiplications, and multiplying the result by Q takes another d'*d*N, so the total is only 2*d'*d*N. Since (d+d')*N^2 >> 2*d'*d*N whenever N is much larger than d and d', simply changing the order of the matrix multiplications dramatically improves computational efficiency.
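This is easy to verify numerically; a quick check with the same shapes as in the text (the concrete sizes are illustrative):

```python
import numpy as np

# Shapes as in the text: Q, K are (d, N), V is (d', N).
rng = np.random.default_rng(0)
d, d_prime, N = 16, 16, 1000
Q = rng.normal(size=(d, N))
K = rng.normal(size=(d, N))
V = rng.normal(size=(d_prime, N))

slow = V @ (K.T @ Q)        # N*d*N + d'*N*N multiplications ~ (d + d') * N^2
fast = (V @ K.T) @ Q        # d'*N*d + d'*d*N multiplications = 2 * d' * d * N

print(np.allclose(slow, fast))                      # True: same result, different cost
print((d + d_prime) * N**2, 2 * d * d_prime * N)    # 32,000,000 vs 512,000 multiplications
```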
Now let's bring softmax back. The original self-attention looks like this; taking b1 as an example:
If we can convert exp(q*k) into the product of two mappings, φ(q)*φ(k), we can further simplify the above expression:
▲ Simplification of the denominator
▲ Simplification of the numerator
Treating the contents inside the parentheses as vectors, the M such vectors together form a matrix; multiplying this matrix by φ(q1) gives the numerator.
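Written out explicitly (this is a reconstruction of the two figures above, with φ the mapping satisfying exp(q·k) ≈ φ(q)·φ(k), and M the dimension of φ's output):

```latex
b_1 \;=\; \sum_i \frac{\exp(q_1 \cdot k_i)}{\sum_j \exp(q_1 \cdot k_j)}\, v_i
\;\approx\;
\frac{\Bigl(\sum_i v_i\, \varphi(k_i)^{\top}\Bigr)\, \varphi(q_1)}
     {\Bigl(\sum_j \varphi(k_j)\Bigr) \cdot \varphi(q_1)}
```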
Graphically represented as follows:
From the above, we can see that the blue vector and the yellow vector do not actually depend on q1 (the subscript 1 in b1). This means that when we compute b2, b3, ..., the blue and yellow vectors do not need to be recomputed.
Self-attention can thus be viewed in another way. This computation yields results almost identical to the original self-attention but with a much smaller computational load. In short: first choose a transformation φ(), apply it to the keys, and combine with the values to obtain M vectors; then transform q with φ() and multiply it against the corresponding dimensions. The M vectors only need to be computed once.
The calculation of b1 is as follows:
The calculation of b2 is as follows:
This can be understood as treating the vectors computed from φ(k) and v as templates, and then using φ(q) to determine which templates matter most, with matrix operations producing the output b. So how do we choose φ? Different papers propose different methods:
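As one concrete instance, the sketch below uses the elu(x)+1 feature map from the "Transformers are RNNs" linear-attention paper; other papers (e.g., Performer) use random-feature maps instead, so treat this only as an example of the general recipe:

```python
import numpy as np

def elu_plus_one(x):
    """One possible feature map phi: elu(x) + 1 (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, phi=elu_plus_one):
    """Q, K: (N, d); V: (N, d_v). Replaces exp(q.k) with phi(q).phi(k)."""
    Qp, Kp = phi(Q), phi(K)                  # (N, M); here M = d
    S = Kp.T @ V                             # (M, d_v) "templates": computed once, reused for every b_i
    z = Kp.sum(axis=0)                       # (M,)     normaliser, also computed once
    return (Qp @ S) / (Qp @ z)[:, None]      # (N, d_v) outputs b_1 ... b_N

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 16)) for _ in range(3))
out = linear_attention(Q, K, V)              # cost is linear in N rather than quadratic
```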
Do we always need q and k to compute self-attention? Not necessarily. In the Synthesizer paper, the attention matrix is not derived from q and k but is learned directly as network parameters. Although different input sequences then share the same attention weights, performance does not degrade much. This raises a question: what, exactly, is the value of attention?
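A minimal sketch of that idea (closest to the Synthesizer variant in which the attention weights are a shared learned parameter; A_params and Wv are hypothetical trained parameters):

```python
import numpy as np

def synthesizer_attention(X, Wv, A_params):
    """X: (N, d); Wv: (d, d_v); A_params: (N_max, N_max) learned attention logits.
    The attention weights come from parameters, not from q.k dot products."""
    N = X.shape[0]
    V = X @ Wv
    A = A_params[:N, :N]                          # the same weights for every input sequence
    A = np.exp(A - A.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V
```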
Is it necessary to use attention to process sequences? Can we try to eliminate attention? Are there any attention-free methods? Below are several MLP-based methods used to replace attention for sequence processing.
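For intuition only, here is a generic token-mixing sketch in the spirit of such MLP models (shapes and names are illustrative, not the specific methods from the lecture): the input-dependent attention matrix is replaced by fixed learned weights that mix the token positions.

```python
import numpy as np

def token_mixing_mlp(X, W1, W2):
    """X: (N, d); W1: (N, H); W2: (H, N). Mixes information across positions
    with fixed learned weights instead of an attention matrix."""
    mixed = np.maximum(X.T @ W1, 0.0)   # (d, H): combine across the N token positions
    return (mixed @ W2).T               # (N, d)
```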
The final diagram summarizes all the methods discussed today. The vertical axis is the LRA score (higher means better performance); the horizontal axis is how many sequences can be processed per second (further right means faster); larger circles indicate more memory usage (a larger computational footprint).
This concludes the summary of Professor Li's lecture on "Various Transformations of Self-Attention Mechanisms".