The following article is sourced from WeChat public account: Xiao Bai Learning Vision.
Author: Xiao Bai Learning Vision
Editor: Machine Heart
Link: https://mp.weixin.qq.com/s/qaAnLOaopuXKptgFmpAKPA
This article is for academic sharing only. If there is any infringement, please contact the backend for deletion.
This article introduces a bug that an engineer claims to have found in the attention formula used in machine learning, then digs into the problems with softmax through a paper published by Qualcomm AI Research in June, clarifying the issue for readers.
“Developers of large models, you are wrong.”
“I found a bug in the attention formula that no one has discovered for eight years. All Transformer models, including GPT and LLaMA, are affected.”
Yesterday, a statistician named Evan Miller stirred up a storm in the AI field.

We know that the attention formula in machine learning is as follows:
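In standard notation (from the original "Attention Is All You Need" paper), with d_k the dimension of the key vectors:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V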
Since the advent of the Transformer in 2017, this formula has been used everywhere, but now Evan Miller has discovered that it has a bug!
Evan Miller's blog post explains how today's popular AI models get a critical detail wrong, which makes every Transformer model hard to compress and deploy.
In summary, Evan Miller introduces a new function called Quiet Attention, also known as Softmax_1, which is an innovative adjustment to the traditional softmax function.
Some netizens summarized a "TL;DR" version of the blog post. The author suggests adding 1 to the denominator of the softmax used in the attention mechanism (not the final output softmax). The softmax inside an attention unit turns key/query matches into probabilities; those probabilities drive a continuous-valued version of a key-value lookup (the weights are not a hard 1/0 lookup result, but rather "high weight = this is the key-value pair we want").
Adding 1 to the denominator changes the attention unit so that it no longer works with a true probability vector, but with weights that sum to less than 1. The motivation is that the network can still learn to produce large weights, in which case the adjusted softmax stays very close to a probability vector; but it also gains a new option of producing uniformly low weights (and therefore a uniformly low output), meaning it can choose not to be confident about anything.
Some even speculate, “Is this why Microsoft’s RetNet performs better than transformers?”
Others suggested this research could drive improvements to LLMs, compressing the weights far enough that smaller models could rival much larger ones:
Miller states: You can use the Softmax_1 function just like the traditional softmax function, as shown below.
import torch
from softmax_one.softmax_one import softmax_one

x = torch.randn(5)           # a random 5-element input vector
y = softmax_one(x, dim=0)    # apply softmax_1 along dimension 0
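For readers curious about what such a function looks like inside, here is a minimal sketch written directly from the "add 1 to the denominator" description above. It is an illustrative reimplementation, not the actual source of the softmax_one package:

import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j))
    # Subtract a running max for numerical stability; the implicit extra logit
    # is fixed at 0, so it is included when taking the max (hence the clamp).
    m = torch.clamp(x.max(dim=dim, keepdim=True).values, min=0.0)
    exp_x = torch.exp(x - m)
    return exp_x / (torch.exp(-m) + exp_x.sum(dim=dim, keepdim=True))

Feeding this function a vector whose entries are all strongly negative yields outputs that sum to nearly 0 rather than 1, which is exactly the "opt out" behavior described in the TL;DR above.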
Based on such modifications, Miller also conducted experiments, and the results are as follows:

Next, let’s see what errors Miller discovered.
Outliers
Evan Miller discovered this bug while reading papers on quantization. Currently, memory and storage have become important factors limiting the development of artificial intelligence. People have been striving to compress models and trying to run large language models (LLMs) on the cloud and edge devices.
In computers, information is stored as streams of binary data. If a data stream is highly predictable, for example always falling within a limited range, it can be stored with relatively few bits. Conversely, if a string of numbers is unpredictable and may occasionally contain very large values, more binary digits are needed to encode and store it. Transformer models, unfortunately, contain exactly this kind of outlier weight.
In a paper published by Qualcomm AI Research in June titled “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing,” the research team traced the existence of these outliers back to the softmax function in the attention mechanism.
Qualcomm paper link: https://arxiv.org/abs/2306.12929
This sounds surprising, but Evan Miller believes it is correct and further discovers an error in the softmax function.
Let’s see how Evan Miller explains that the softmax function is not a suitable tool in the attention mechanism.
Problems Raised by Softmax
Why is softmax unsuitable for the attention mechanism? This starts with what the attention mechanism can do.
Generally speaking, numerical errors are caused by programming mistakes. But when the program has no bugs, tracking the problem down to a flaw in the complicated math itself takes a great deal of time.
Evan Miller probably read about 50 arXiv papers before getting some clues. Miller started with “input embeddings,” which we can understand as a floating-point vector representing a word in the input string.
For example, Meta’s recently launched LLaMA 2 model uses an embedding vector of length 3204, represented as half-precision floating-point numbers, just to represent one word in the vocabulary, which typically contains 30,000 to 50,000 entries. This means an embedding vector for one word occupies over 6KB of storage space. With the development of technology, the length of “input embeddings” has gradually increased, and the storage space occupied has also increased.
If you are a C programmer who is sensitive to storage usage, you might find it unacceptable that something which could fit in 2 bytes takes up over 6KB: assuming the vocabulary has fewer than 2^16 = 65536 entries, we only need 16 bits to represent one entry.
However, this is how the Transformer actually works: it converts the input vector into an output vector of the same size, and the final 6KB output vector is used to predict the next token. As it runs, each Transformer layer adds information to the original word vector, and residual connections are used throughout: every attention mechanism is adding supplementary material to that original two-byte piece of information, annotating it with what the LLM has learned by analyzing the larger context.
The last step of the Transformer is to multiply this output vector by a rectangular matrix and squeeze the resulting vocabulary-length vector through a softmax, treating the exponentiated outputs as probabilities for the next token. This is reasonable, but it is also well known to be not entirely correct, since we cannot be sure those output probabilities are accurate; instead, every Transformer implementation and its derivatives rely on sampling mechanisms to hide the fact that softmax over-represents low probabilities.
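As a rough illustration of that last step (the shapes and variable names below are made up for the example, not taken from any particular model):

import torch

hidden = torch.randn(4096)                # final output vector of the Transformer (hypothetical size)
W_out = torch.randn(32000, 4096)          # "rectangular" unembedding matrix: vocab_size x hidden_size
logits = W_out @ hidden                   # one score per vocabulary entry
probs = torch.softmax(logits, dim=0)      # exponentiate and normalize into next-token "probabilities"
next_token = torch.multinomial(probs, num_samples=1)  # sample rather than trust the probabilities directly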
Next, Miller introduces the history of the development of softmax. Softmax first appeared in statistics, initially as a method for predicting state distributions based on energy levels, in the following form:
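This is the Boltzmann (Gibbs) distribution from statistical mechanics, in which the probability of a state with energy E_i at temperature T is

p_i = \frac{e^{-E_i/(kT)}}{\sum_j e^{-E_j/(kT)}},

where k is the Boltzmann constant.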
Later, economists modified it to
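the now-familiar form in which arbitrary scores x_i are exponentiated and normalized, i.e. the standard softmax used throughout machine learning:

\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}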
This modification gave softmax the form of the multinomial logistic function. Because Miller has studied the softmax function in depth, he can recognize when it is being used inappropriately.
Softmax is widely used: in physics it is extremely effective; in economics it may be less accurate; but in machine learning, wherever discrete choices are involved, it always seems to work.
Miller further points out the key caveat of softmax: if you want the option of keeping none of the items, you have to modify softmax, otherwise the results get distorted.
In the context of LLMs, for example, the distortion shows up as heavy attention weights on non-semantic tokens (such as commas), which become difficult-to-compress outliers and make this line of research more challenging. Researchers at Qualcomm observed the same phenomenon, noting that over 97% of outlier activations occur at the positions of spaces and punctuation marks.
Next, Miller explains how softmax is used in attention, thereby discovering where the problem lies:
Breaking down the above formula: in a decoder-only model, Q, K, and V all come from the same input sequence. They are not identical, since they are projected differently, but at each layer they start from the same annotated embedding vector.
The QK^T term finds the correlations between token vectors at different positions, essentially building a correlation matrix (a scaled dot product), with one row and one column per token position. A softmax is then applied to each row of this square matrix, and the resulting probabilities are used as mixing weights for the value vectors in the V matrix. The probability-mixed V is added back to the input vector, and the sum is passed on to the neural network for further processing.
Multi-head attention executes the above process in parallel multiple times. Essentially, this method partitions the embedding vector, with each head using information from the entire vector to annotate a (non-overlapping) segment of the output vector. This is the concatenation operation in the original Transformer paper.
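To make that data flow concrete, here is a minimal single-head sketch of the computation just described (made-up shapes, with masking, dropout, and the output projection omitted):

import math
import torch

def single_head_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) projections of the same input sequence.
    d_k = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)         # correlation matrix: one row/column per token position
    weights = torch.softmax(scores, dim=-1)   # row-wise softmax -- the step Miller wants to change
    return weights @ V                        # probability-weighted mix of the value vectors

# For example, an 8-token context with a 64-dimensional head:
Q, K, V = (torch.randn(8, 64) for _ in range(3))
out = single_head_attention(Q, K, V)          # shape (8, 64); added back via the residual connection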
The problem with using softmax is that it forces each attention head to annotate even when there is no information to add to the output vector.
Softmax_1 and QuietAttention
Here comes Softmax Super-Mod, igniting the LLM hacker channel.
Feeling a bit disappointed? All Miller did was add 1 to the denominator. This lets the vector as a whole tend toward zero if it wants to; otherwise it only shrinks the values slightly, and that shrinkage is compensated for during normalization, which happens after attention.
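Written out, the only change relative to the ordinary softmax is the extra 1 in the denominator:

\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}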
The main difference appears in the negative limit, when the entries of x are all significantly less than zero and the model wants to avoid annotating anything at all. The limiting behavior of the original softmax is as follows:
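For an input whose n entries all go to negative infinity (say every entry equals c with c → -∞), ordinary softmax still hands out the full unit of weight:

\lim_{c \to -\infty} \mathrm{softmax}(c, c, \dots, c)_i = \frac{1}{n}, \qquad \sum_i \mathrm{softmax}(x)_i = 1 \text{ for every } x.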
Compared to the new and improved softmax_1.
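In the same limit, softmax_1 lets every output go to zero:

\lim_{c \to -\infty} \mathrm{softmax}_1(c, c, \dots, c)_i = \lim_{c \to -\infty} \frac{e^{c}}{1 + n\,e^{c}} = 0.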
Vanilla softmax always hands out the same total weight of 1; softmax_1 looks mostly the same but has an "escape hatch" in the negative quadrant. It is important to be clear that the core problem here is mathematical rather than numerical: extra precision cannot save softmax, and every Transformer is affected.
You can also observe some other aspects regarding softmax_1. The derivative is positive, so there is always a non-zero gradient, and its sum is between 0 and 1, so the output does not go out of control. This function maintains the following properties:
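In particular, the ratio between any two outputs is the same as for the ordinary softmax:

\frac{\mathrm{softmax}_1(x)_i}{\mathrm{softmax}_1(x)_j} = e^{x_i - x_j} = \frac{\mathrm{softmax}(x)_i}{\mathrm{softmax}(x)_j}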
That is, the relative values in the output vector remain unchanged.
Initially, Miller intended to name the function ghostmax, because you can think of it as a softmax with one extra zero-valued entry in the input and a corresponding zero vector appended to the V matrix, which dilutes the result.
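The name fits because appending a literal zero logit to the input of an ordinary softmax produces exactly the same weights:

\mathrm{softmax}_1(x)_i = \mathrm{softmax}([x_1, \dots, x_n, 0])_i = \frac{e^{x_i}}{e^{0} + \sum_j e^{x_j}}, \qquad i \le n,

and the leftover weight lands on the appended zero row of V, contributing nothing to the output.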
Although softmax_1 looks unexciting on the surface, Miller is 99.44% confident that it will fix the outlier feedback loop that has made quantization the subject of cascades of research. He says that if you want to run experiments to prove him right, you can contact him, and he will write a paper.
The improved mechanism can be called QuietAttention, which allows attention heads to remain “silent”.
Miller believes a test could be put together fairly soon: if you prepend a zero vector to every input context and make sure the chosen neural network adds no bias (including positional encoding), then the zero passes through unchanged, which has the effect of adding unity to the denominator of every subsequent softmax. That way you won't lose your mind wrangling gradient code. Miller believes this could be done with fixed embeddings and a special prefix token in a LLaMA model.
You would still need to retrain the model, so do not try this on a Raspberry Pi just yet. But Miller is curious what the weight kurtosis and the activation infinity norm look like after a few training runs. He believes this will become influential research, whether it comes from the Qualcomm AI Research team or from whoever in the LLM hacker channel figures out biblatex first.
