“I found a bug in the attention formula, and no one has noticed it for eight years. All Transformer models, including GPT and LLaMA, are affected.”
Recently, a statistical engineer named Evan Miller has stirred up a storm in the AI field with his statement.
We know that the attention formula used in machine learning is as follows:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Since the introduction of the Transformer in 2017, this formula has been used everywhere. However, Evan Miller argues that it contains a bug!
Miller's blog explains how today's popular AI models have an error in a critical place, one that makes every Transformer model harder to compress and deploy.
In summary, Evan Miller introduced a new function called Quiet Attention, also known as Softmax_1, which is an innovative adjustment to the traditional softmax function.
Some netizens summarized a “TL;DR” version of the blog: the author proposes adding 1 to the denominator of the softmax used in the attention mechanism (not the final output softmax). The softmax inside an attention unit lets it treat key/query matches as probabilities, and those probabilities support a continuous-valued version of a key-value lookup (the weights are not a hard 0/1 lookup; a high weight means the desired key-value pair is retrieved).
Adding 1 to the denominator changes the attention unit: it no longer works with a true probability vector, but with weights that sum to less than 1. The motivation is that the network can still learn to supply large logits, so the adjusted softmax stays very close to a probability vector, while gaining a new option of emitting uniformly low weights (and therefore uniformly low output weights), i.e. the ability to decline to be confident about anything.
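As a rough illustration of this idea, here is a minimal, numerically stable sketch of a softmax with 1 added to the denominator, written in PyTorch from the description above (the function name and the stabilization details are this article's assumptions, not Miller's reference code):

import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Shift by max(x, 0) for numerical stability; the implicit extra exp(0) = 1
    # term in the denominator becomes exp(-m) after the shift.
    m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

logits = torch.tensor([-8.0, -9.0, -10.0])
print(torch.softmax(logits, dim=0).sum())   # 1.0: ordinary softmax must hand out all of its weight
print(softmax_one(logits, dim=0).sum())     # ~5e-4: the unit is free to stay nearly silent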
Some even speculate, “Is this the reason why Microsoft’s RetNet performs better than transformers?”
Others have indicated that this research could promote improvements in LLMs, significantly compressing weights so that smaller models can rival larger ones:
Miller stated: You can use the Softmax_1 function just like the traditional softmax function, as shown below.
import torch
from softmax_one.softmax_one import softmax_one

x = torch.randn(5)          # example logits
y = softmax_one(x, dim=0)   # used exactly like torch.softmax; the outputs sum to at most 1
Based on this modification, Miller also conducted experiments, and the results are as follows:
Next, let’s see what error Miller discovered.
Outliers
Evan Miller discovered this bug while reading papers on quantization. Currently, memory and storage have become important factors limiting the development of artificial intelligence. People have been striving to compress models and trying to run large language models (LLMs) on the cloud and edge devices.
In computers, information is stored as streams of binary data. If a data stream is highly predictable, for example always falling within a limited range, we can store it with relatively few bits. Conversely, if a string of numbers is unpredictable and may occasionally be extraordinarily large, we need more binary digits to encode it. Transformer models, unfortunately, contain some outlier weights of exactly this kind.
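To see why a single outlier matters, here is a hedged toy sketch (not from the Qualcomm paper) of naive symmetric int8 quantization: one large value stretches the quantization scale so far that all the ordinary weights collapse onto a handful of integer levels.

import torch

def quantize_int8(w: torch.Tensor):
    # Naive symmetric quantization: map [-max|w|, max|w|] onto [-127, 127].
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

weights = torch.randn(1000) * 0.02                   # well-behaved weights
weights = torch.cat([weights, torch.tensor([8.0])])  # plus one outlier

q, scale = quantize_int8(weights)
print(scale)                       # the outlier alone determines the scale
print(q[:-1].unique().numel())     # the other 1000 weights land on only a few distinct levels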
In a paper published by Qualcomm AI Research in June titled “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing,” the research team traced the existence of these outliers back to the softmax function in the attention mechanism.
Qualcomm paper link: https://arxiv.org/abs/2306.12929
This sounds surprising, but Evan Miller believes it is correct and further discovered an error in the softmax function.
Let’s see how Evan Miller illustrates that the softmax function is not a suitable tool in the attention mechanism.
Problems Introduced by Softmax
Why is softmax unsuitable for the attention mechanism? We need to start from what the attention mechanism can do.
Generally speaking, numerical anomalies are caused by bugs in the code. But when the code contains no bugs, you have to spend a great deal of time poring over complicated mathematical formulas to find the problem.
Evan Miller likely read about 50 arXiv papers before getting a clue. Miller started with “input embeddings,” which we can understand as a floating-point vector representing a word in the input string.
For example, Meta’s recently released LLaMA 2 model uses an embedding vector of length 3204, represented as half-precision floating-point numbers, just to represent one word in a vocabulary that typically contains 30,000 to 50,000 entries. This means that the embedding vector for one word occupies over 6KB of storage. As technology advances, the length of “input embeddings” has gradually increased, and the storage space occupied has also increased.
If you are a C programmer who is sensitive to storage, you might find this number hard to accept: why use 6KB for something that could be stored in just 2 bytes? Assuming the vocabulary has fewer than 2^16 = 65,536 entries, 16 bits are enough to identify any one of them.
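A quick back-of-the-envelope check of those numbers (the embedding length of 3204 is the figure quoted above):

embedding_dim = 3204                     # length of one LLaMA 2 embedding vector, per the article
bytes_per_half = 2                       # half-precision float
print(embedding_dim * bytes_per_half)    # 6408 bytes, i.e. a bit over 6 KB per token

vocab_size = 50_000                      # typical vocabulary size
print(vocab_size < 2**16)                # True: a 2-byte index could name every entry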
However, here is how the Transformer actually works: it transforms the input vector into an output vector of the same size, and that final ~6KB output vector is used to predict the next token. While running, every layer of the Transformer adds information to the original word vector. Residual connections are used throughout, so each attention mechanism only adds supplementary material to the original two bytes' worth of information, which is what lets LLMs analyze longer contexts.
The Transformer's last step is to multiply this output vector by a rectangular matrix and squeeze the resulting vocabulary-length vector through a softmax, treating the exponentiated outputs as probabilities for the next token. This is reasonable, but it is also well known to be not entirely correct: we cannot be sure those output probabilities are right, so every Transformer implementation and its descendants rely on sampling mechanisms to hide the fact that softmax over-represents low probabilities.
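A hedged sketch of that final step, with toy sizes and illustrative names (lm_head, the top-k value) that are not taken from any particular implementation:

import torch

d_model, vocab_size = 256, 1000            # toy sizes; a real model uses thousands of dimensions and a 30k-50k vocabulary
hidden = torch.randn(d_model)              # final output vector for the current position
lm_head = torch.randn(vocab_size, d_model) # the "rectangular matrix" projecting onto the vocabulary

logits = lm_head @ hidden                  # one score per vocabulary entry
probs = torch.softmax(logits, dim=0)       # exponentiated scores treated as next-token probabilities

# In practice the next token is not simply the argmax; implementations sample,
# e.g. from the renormalized top-k probabilities.
topk = torch.topk(probs, k=40)
next_token = topk.indices[torch.multinomial(topk.values, 1)]
print(next_token.item())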
Next, Miller traces the history of softmax. Softmax first appeared in statistical mechanics, originally as a method for predicting the distribution over states based on their energy levels, in the following (Boltzmann) form:

P(state i) = exp(-E_i / kT) / Σ_j exp(-E_j / kT)
Later, economists modified it into

P(choice i) = exp(u_i) / Σ_j exp(u_j)

where u_i is the estimated utility of choice i.
With this modification, softmax took on the form of the multinomial logit function. Because Miller has studied the softmax function in depth, he is able to recognize the places where it is used inappropriately.
Softmax is widely used. In physics it is very effective; in economics it may be less accurate; but in machine learning, wherever discrete choices are involved, it always seems to work:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)
Miller further states that the key point about softmax is that it forces a choice among the available items; if you want the option of retaining none of them, you have to modify softmax, otherwise the results will be distorted.
For example, in the context of LLMs, the distortion arises from heavily weighting non-semantic tokens (such as commas), which become difficult-to-compress outliers, making research more challenging. AI researchers from Qualcomm also observed this phenomenon, noting that over 97% of outlier activations occur at the positions of spaces and punctuation marks in LLMs.
Next, Miller explains how softmax is used in attention, and thereby pinpoints where the problem arises. Recall the attention formula:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Breaking down the above formula: in a decoder-only model, Q, K, and V all originate from the same input sequence. They are not identical, because they are projected differently, but in every layer they all start from the same annotated embedding vector.
The QKᵀ term finds correlations between token vectors at different positions, essentially building a correlation matrix (of scaled dot products) in which each row and each column corresponds to a token position. A softmax is then applied to every row of this square matrix, and the resulting probabilities are used as a mixing function over the value vectors in the V matrix. The mixed V is added to the input vector, and the sum is passed on to the neural network for further processing.
Multi-head attention executes the above process multiple times in parallel at each layer. Essentially, this method partitions the embedding vector, with each head using information from the entire vector to annotate a (non-overlapping) segment of the output vector. This is the concatenation operation in the original Transformer paper.
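For concreteness, here is a hedged single-head sketch of the computation just described; the shapes and weight names are illustrative:

import math
import torch

def single_head_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_model); Q, K, V are different projections of the same embeddings.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # scaled correlation matrix, one row/column per position
    weights = torch.softmax(scores, dim=-1)     # row-wise softmax: the step Miller objects to
    return weights @ V                          # probabilities mix the value vectors

seq_len, d_model, d_head = 4, 8, 8
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
out = single_head_attention(x, W_q, W_k, W_v)   # in a real layer this is added back to x via the residual connection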
The problem with using softmax is that it forces each attention head to annotate, even when there is no information to add to the output vector.
Softmax_1 and Quiet Attention
Here comes the Softmax Super-Mod that has ignited the LLM hacker channel:

softmax_1(x)_i = exp(x_i) / (1 + Σ_j exp(x_j))
Feeling a bit disappointed? All Miller did was add 1 to the denominator. If the model wants it to, this allows the whole output vector to tend toward 0; otherwise it only shrinks the values slightly, and the shrinkage is compensated for by the normalization that follows attention.
The real difference shows up in the negative limit, when the entries of x are all strongly negative and the model wants to avoid annotating anything at all. Here is the limiting behavior of the original softmax:

no matter how negative the entries of x become, the outputs of softmax always sum to 1; if all entries are pushed down together, the result tends to the uniform vector (1/n, ..., 1/n)
Compared to the new and improved softmax_1:

as every entry of x goes to -∞, softmax_1(x) → (0, ..., 0), so the total attention weight escapes to 0
Vanilla softmax always hands out the same total weight of 1; softmax_1 looks mostly identical but has an “escape hatch” in the negative orthant. To be clear, the core issue here is mathematical rather than numerical: extra precision cannot save softmax, and every Transformer is affected.
You can also observe a few other properties of softmax_1. Its derivative is positive, so there is always a non-zero gradient, and its outputs sum to a value between 0 and 1, so they cannot blow up. The function also preserves the following property:

softmax_1(x)_i / softmax_1(x)_j = exp(x_i) / exp(x_j) = softmax(x)_i / softmax(x)_j
That is, the relative values in the output vector remain unchanged.
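Both claims are easy to check numerically, reusing the softmax_one function sketched earlier in this article:

import torch

logits = torch.tensor([2.0, 1.0, -3.0])
p = torch.softmax(logits, dim=0)
q = softmax_one(logits, dim=0)            # softmax_one as sketched earlier

print(p.sum().item(), q.sum().item())               # 1.0 vs. roughly 0.91: the total weight now stays below 1
print((p[0] / p[1]).item(), (q[0] / q[1]).item())   # identical ratios: relative values are unchanged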
Initially, Miller intended to call this function ghostmax because you can think of it as having an additional zero value entry, and there is a zero vector in the V matrix that can attenuate the results.
Although softmax_1 looks rather mundane, Miller is 99.44% confident that it will resolve the outlier feedback loop that has made quantization the subject of cascades of research. He says that if you want to run experiments to prove him right, you can contact him, and he will write a paper.
The improved mechanism can be referred to as Quiet Attention, which allows attention heads to remain “silent”.
Miller believes a test could be put together quite quickly: if you prepend a zero vector to every input context and make sure your chosen network adds no bias to it (including positional encoding), then the zero passes through unchanged and adds unity to the denominator of every subsequent softmax, so you never have to touch the gradient code. Miller believes this could be done with a fixed embedding and a special prefix token in a LLaMA model.
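A toy sketch of why that trick is equivalent to softmax_1, assuming the prepended zero position always scores 0 against every query and carries a zero value vector:

import torch

scores = torch.tensor([-1.0, 0.5, 2.0])             # attention scores for the real tokens

# Prepending a position whose score is always 0 contributes exp(0) = 1 to the denominator...
padded = torch.cat([torch.zeros(1), scores])
weights = torch.softmax(padded, dim=0)[1:]          # drop the weight assigned to the dummy position

# ...which is exactly softmax_1 on the original scores: exp(s_i) / (1 + sum_j exp(s_j)).
reference = torch.exp(scores) / (1 + torch.exp(scores).sum())
print(torch.allclose(weights, reference))           # True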
You would still need to retrain the model, so do not try this on a Raspberry Pi just yet. But Miller is curious what the weight kurtosis and the activation infinity norms look like after a few training runs. He believes this will turn out to be influential research, whether it ends up coming from the Qualcomm AI Research team's paper or from whoever in the LLM hacker channel figures out the biblatex citation first.
Project link: https://github.com/kyegomez/AttentionIsOFFByOne
Blog link: https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4
-- End --
Source: Machine Heart