Question: I have searched various materials and read the original papers, which detail how Q, K, and V are combined through certain operations to produce the output. But I have not found any explanation of where Q, K, and V come from in the first place. Isn’t the input to a layer just a single tensor? Why are there three tensors Q, K, and V?
IIIItdaf’s Response:
I work in CV, and while learning about Transformers I had the same question about Q, K, and V in self-attention: why do we need to define three tensors? Many existing explanations are good, but I think the idea can be stated more plainly. What follows is my rough understanding; drawing diagrams is cumbersome and I’m not experienced at it, so I’ll just write it out. It may not be entirely accurate, so please bear with me.
The attention mechanism is fundamentally about learning a set of weights through training. Self-attention uses a weight matrix to capture the relationships between inputs, so each input needs an associated tensor, and multiplying these tensors against one another yields the relationships between inputs. Would a single tensor per input suffice? Not really. Suppose each input had only one tensor Q. After multiplying Q1 by Q2 to obtain the relationship between inputs A1 and A2, how would we store and use the result? Worse, the relationship would be forced to be symmetric: Q1·Q2 equals Q2·Q1, yet the attention A1 pays to A2 need not equal the attention A2 pays to A1. Defining just one tensor is too simplistic for this model.
One tensor is not enough, so we define two: Q and K. You can think of Q (the query) as what an input uses to ask about others, and K (the key) as what it exposes for others to ask about. By multiplying your Q with everyone else’s K (and your own K as well), you obtain the discovered relationships as weights α.
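The Q/K step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Transformer implementation: the projection matrices are randomly initialized stand-ins for learned parameters, and the shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                              # embedding dimension (arbitrary for this sketch)
A = rng.standard_normal((3, d))    # three inputs A1, A2, A3 stacked as rows

# Learnable projection matrices; in a real model these are trained, here random.
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))

Q = A @ W_q   # each input's "query": what it uses to ask about others
K = A @ W_k   # each input's "key": what it exposes to be asked about

# Raw relationship scores: entry (i, j) is Q_i . K_j.
# Note the asymmetry: scores[0, 1] need not equal scores[1, 0].
scores = Q @ K.T

# Normalize each row with softmax to get the weights alpha.
alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```

Because W_q and W_k are different matrices, the score A1 assigns to A2 and the score A2 assigns to A1 can differ, which is exactly what a single shared tensor could not express.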
Is defining only Q and K sufficient? Probably not. The discovered relationships have to be put to use, otherwise they are pointless: the weights α should weight the input information, which is where their value shows up. Could we weight the raw inputs directly? We could, but that would be rather rigid. So we define a V which, like Q and K, is obtained by multiplying the input A with its own coefficient matrix. Defining V thus adds another layer of learnable parameters on top of A, and weighting it applies the relationships learned through the attention mechanism. Finally, we combine α and V in a weighted sum to obtain the output O.
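Putting the three tensors together, the whole self-attention step is a short computation. Again this is a toy sketch with random stand-ins for learned weights; the 1/√d scaling follows the standard scaled dot-product formulation and is not essential to the point being made.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4                              # embedding dimension (arbitrary for this sketch)
A = rng.standard_normal((3, d))    # three inputs stacked as rows

# Three separate projections of the same input A.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = A @ W_q, A @ W_k, A @ W_v

# Relationship weights alpha: scaled dot-product scores, softmaxed per row.
scores = (Q @ K.T) / np.sqrt(d)
alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Output O: each row is a weighted mix of the value vectors V.
O = alpha @ V
```

Note that V is just another learnable view of A, so O is still built from the inputs, but through an extra trainable layer rather than by weighting the raw A directly.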

In summary, my feeling is that defining these three tensors serves two purposes: first, to learn the relationships between inputs and to find and record the weights of those relationships; second, to introduce learnable parameters within a reasonable structure, increasing the network’s capacity to learn. The diagram below illustrates their relationships well; it is taken from “Understanding Vision Transformer Principles and Code, This Technical Review is Enough” on the Extreme City platform, copyright reserved.

This is my rudimentary interpretation. If there are inaccuracies, please point them out. I will keep studying attention…
This article is reproduced from Zhihu, with copyright belonging to the original author, please delete if infringing.

