“The stone from other hills can serve to polish jade.” Only by standing on the shoulders of giants can we see further and go farther. On the path of scientific research, we need to leverage favorable conditions to move forward faster. Therefore, we have specially collected and organized some practical code links, datasets, software, programming techniques, etc., to open the “Stone from Other Hills” column, helping you ride the winds and waves, bravely moving forward. Stay tuned.
I discovered a bug in the attention formula that no one has noticed for eight years. All Transformer models, including GPT and LLaMA, are affected.
A statement by a statistician named Evan Miller has caused a stir in the AI field.
We know that the attention formula in machine learning is as follows:
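\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension, exactly as in the original Transformer paper.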
Since the introduction of the Transformer in 2017, this formula has been used everywhere. However, Evan Miller has discovered that it contains a bug!
Evan Miller’s blog explains how today’s popular AI models have an error in a critical place, one that makes every Transformer model harder to compress and deploy.
In summary, Evan Miller introduced a new function called Quiet Attention, also known as Softmax_1, which is an innovative adjustment to the traditional softmax function.
Some netizens summarized a “TL;DR” version of the blog: the author suggests adding 1 to the denominator of the softmax used in the attention mechanism (not the final output softmax). The softmax inside an attention unit lets it treat key/query matches as probabilities, and those probabilities back a continuous-valued version of a key-value lookup (the weights we obtain are not a binary 0/1 output; rather, a high weight means the desired key-value lookup).
Adding 1 to the denominator changes the attention unit: it no longer uses a true probability vector as weights, but weights that sum to less than 1. The motivation is that the network can still learn to produce large logits, in which case the adjusted softmax stays very close to a probability vector; at the same time it gains a new option of producing uniformly small weights (and therefore uniformly small output contributions), meaning it can choose not to be confident about anything.
Some even speculate, “Is this the reason why Microsoft’s RetNet outperforms transformers?”
Others have mentioned that this research could promote improvements in LLMs, greatly compressing weights, allowing smaller models to rival larger ones:
Miller states: You can use the Softmax_1 function just like the traditional softmax function, as shown below.
import torch
from softmax_one.softmax_one import softmax_one
x = torch.randn(5)
y = softmax_one(x, dim=0)
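The softmax_one above is the function exported by the linked project. Purely to illustrate the definition, a minimal, numerically stable sketch could look like this (softmax_one_sketch is an illustrative name, not the project’s actual implementation):

import torch

def softmax_one_sketch(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # softmax_1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), computed stably.
    # Shift every logit by m = max(x, 0); the implicit extra zero logit is
    # shifted too, so the shift cancels exactly and the result is unchanged.
    m = x.amax(dim=dim, keepdim=True).clamp(min=0)
    e = torch.exp(x - m)
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

y = softmax_one_sketch(torch.randn(5), dim=0)  # entries sum to strictly less than 1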
Based on such modifications, Miller also conducted experiments, and the results are as follows:
Next, let’s see what errors Miller discovered.
Outliers
Evan Miller discovered this bug while reading papers on quantization. Currently, memory and storage have become significant constraints on the development of artificial intelligence. People have been striving to compress models and try to run large language models (LLMs) on the cloud and edge devices.
In computers, information is stored as binary data streams. If a data stream is highly predictable, for example always falling within a limited range, we can store it with relatively few bits. Conversely, if a string of numbers is unpredictable and may contain occasional very large values, we need more binary digits to encode and store it. Transformer models contain some of these outlier weights.
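As a toy illustration of why a single outlier hurts (simple symmetric int8 quantization with made-up numbers, not the paper’s method): one extreme value stretches the quantization grid, so every ordinary value gets stored more coarsely.

import torch

def int8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 quantization: the largest magnitude maps to 127.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale  # dequantized values

torch.manual_seed(0)
x = torch.randn(1000)                          # well-behaved activations
x_out = torch.cat([x, torch.tensor([60.0])])   # the same values plus one extreme outlier

err_plain = (int8_roundtrip(x) - x).abs().mean()
err_outlier = (int8_roundtrip(x_out)[:-1] - x).abs().mean()
print(err_plain.item(), err_outlier.item())    # the second error is far larger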
In a paper published by Qualcomm AI Research in June titled “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing,” the research team traced the existence of these outliers back to the softmax function in the attention mechanism.
Qualcomm paper link: https://arxiv.org/abs/2306.12929
This sounds surprising, but Evan Miller believes it is correct and further discovered that the softmax function has an error.
Let’s see how Evan Miller explains that the softmax function is not a suitable tool in the attention mechanism.
Problems Introduced by Softmax
Why is softmax unsuitable for the attention mechanism? This starts with what the attention mechanism can do.
Generally speaking, numerical errors come from programming mistakes; but when the program itself is error-free, fixing them means digging into complicated mathematical formulas, which takes far more time.
Evan Miller read about 50 arXiv papers before he got a clue. Miller started with “input embeddings,” which we can understand as a floating-point vector representing a word in the input string.
For example, Meta’s recently released LLaMA 2 model uses an embedding vector of length 3204, represented as half-precision floating-point numbers, just to represent one word in a vocabulary that typically contains 30,000 to 50,000 entries. This means that an embedding vector for a single word occupies over 6KB of storage. As technology advances, the length of “input embeddings” has gradually increased, and the storage space they occupy has also increased.
If you are a C programmer who cares about storage, you might find it unacceptable that something that could be stored in 2 bytes takes up more than 6KB: since a vocabulary of 30,000 to 50,000 entries is well under 2^16 = 65,536, 16 bits are enough to identify an entry.
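The arithmetic behind that complaint, spelled out with the figures quoted above (illustrative only):

# Storage for one token, using the numbers cited in this article
embedding_dim = 3204            # LLaMA 2 embedding length quoted above
bytes_per_fp16 = 2
print(embedding_dim * bytes_per_fp16)  # 6408 bytes, i.e. a bit over 6 KB per token
print(2 ** 16)                         # 65536 ids fit in 2 bytes -- more than a 30k-50k vocabulary needs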
However, Transformers work by converting the input vector into an output vector of the same size, and the final 6KB output vector is what encodes the prediction of the next token. As it runs, each Transformer layer adds information to the original word vector, and residual connections make this possible: every attention mechanism layers annotative material on top of that original two-byte piece of information, which is how an LLM ends up able to use long context.
The final step of the Transformer is to multiply this output vector by a rectangular matrix and squeeze the resulting vocabulary-length vector through a softmax, treating the exponentiated outputs as probabilities for the next token. This is reasonable, but it is also well known to be not quite right, because we cannot fully trust these output probabilities: every Transformer implementation and its derivatives rely on sampling mechanisms to paper over the fact that softmax over-represents low-probability tokens.
Next, Miller introduces the history of softmax. It first appeared in statistical mechanics, as a way of predicting the distribution over states from their energy levels; economists later adapted the same functional form to model discrete choices.
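Roughly, these are the Boltzmann distribution over states with energies E_i at temperature T, and the multinomial-logit form in which utilities u_i take the place of the negative energies:

p_i = \frac{e^{-E_i / kT}}{\sum_j e^{-E_j / kT}} \qquad\longrightarrow\qquad p_i = \frac{e^{u_i}}{\sum_j e^{u_j}}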
It was this economists’ modification that gave softmax its multinomial-choice interpretation. Having studied the softmax function in depth, Miller can spot places where it is being used inappropriately.
Softmax is widely used: in physics it is very effective; in economics it may be less accurate; but in machine learning, wherever a discrete choice is involved, it always seems to get the job done.
Miller further states that the key point about softmax is this: if you want the option of not keeping any of the terms, you have to modify softmax; otherwise the results will be distorted.
In LLMs, for example, the distortion comes from heavy weights being placed on non-semantic tokens (such as commas), which then become hard-to-compress outliers and make the research harder. The Qualcomm researchers observed the same phenomenon, finding that over 97% of outlier activations in LLMs occur at space and punctuation positions.
Next, Miller introduces how softmax is used in attention, revealing where the problem lies:
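\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V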
Breaking down the above formula: in a decoder-only model, 𝑄, 𝐾, and 𝑉 all come from the same input sequence. They are not identical, because they are projected in different ways, but in each layer they all start from the same annotated embedding vector.
The 𝑄𝐾^𝑇 term finds the correlation between token vectors at different positions, effectively building a correlation matrix (with the dot products scaled) in which each row and each column corresponds to a token position. A softmax is then applied to each row of this matrix, and the resulting probabilities are used as mixing weights for the value vectors in the 𝑉 matrix. The mixed 𝑉 is added back to the input vector, and the sum is passed on to the neural network for further processing.
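As a minimal sketch of that computation for a single head (illustrative names and toy sizes; no causal mask or dropout):

import math
import torch

def single_head_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Correlation matrix between token positions, scaled by sqrt(d_head)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)   # row-wise softmax; each row sums to 1
    # In a full layer the heads' outputs are concatenated, projected back to
    # d_model, and added to x through the residual connection.
    return weights @ v                        # probability-weighted mix of the value rows

x = torch.randn(10, 64)                                 # toy sizes
w_q, w_k, w_v = (torch.randn(64, 16) for _ in range(3))
out = single_head_attention(x, w_q, w_k, w_v)           # shape (10, 16)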
Multi-head attention executes the above process in parallel multiple times per layer. Essentially, this method partitions the embedding vector, with each head using information from the entire vector to annotate a (non-overlapping) segment of the output vector. This is the concatenation operation in the original Transformer paper.
The problem with using softmax is that it forces each attention head to annotate, even when there is no information to add to the output vector.
Softmax_1 and QuietAttention
Here comes Softmax Super-Mod, igniting the LLM hacker channel.
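Written out, the proposal is:

(\mathrm{softmax}_1(x))_i = \frac{e^{x_i}}{1 + \sum_j e^{x_j}}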
Disappointing, right? All Miller did was add 1 to the denominator. If the model wants to, this allows the output vector to tend to 0 as a whole; otherwise it only shrinks the values slightly, and that shrinkage is compensated for during the normalization that follows attention.
The main difference shows up in the limit of negative values, when the entries of 𝑥 are far below zero and the model is trying to avoid making an annotation altogether. Compare the limiting behavior of the original softmax with that of the new, improved softmax_1:
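Concretely, push every entry of an n-dimensional input toward minus infinity, say x = (c, \dots, c) with c \to -\infty:

\lim_{c \to -\infty} \mathrm{softmax}(x)_i = \frac{1}{n}, \qquad \lim_{c \to -\infty} (\mathrm{softmax}_1(x))_i = \lim_{c \to -\infty} \frac{e^{c}}{1 + n\,e^{c}} = 0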
Vanilla softmax always hands out the same total weight; softmax_1 looks mostly the same but has an “escape hatch” in the negative quadrant. To be clear, the core issue here is mathematical rather than numerical: extra precision cannot save softmax, and every Transformer is affected.
A few other things can be observed about softmax_1. Its derivative is positive, so there is always a non-zero gradient, and its outputs sum to a value between 0 and 1, so the output does not run out of control. The function also maintains the following property:
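For any two positions i and j (both functions have the same denominator, so it cancels in the ratio):

\frac{(\mathrm{softmax}_1(x))_i}{(\mathrm{softmax}_1(x))_j} = e^{x_i - x_j} = \frac{\mathrm{softmax}(x)_i}{\mathrm{softmax}(x)_j}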
That is, the relative values in the output vector remain unchanged.
Initially, Miller intended to call this function ghostmax: you can think of it as a softmax over the input plus one extra zero-valued entry, together with a zero vector in the 𝑉 matrix that dilutes the result.
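That “extra zero-valued entry” view is easy to check numerically: appending a 0 logit before a vanilla softmax and then dropping that slot reproduces softmax_1 exactly.

import torch

x = torch.randn(7)
# Vanilla softmax over x with a phantom 0 logit appended, phantom slot dropped
ghost = torch.softmax(torch.cat([x, torch.zeros(1)]), dim=0)[:-1]
# softmax_1 computed directly from the definition
direct = torch.exp(x) / (1 + torch.exp(x).sum())
print(torch.allclose(ghost, direct))   # True: the two views coincide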
Although softmax_1 looks rather mundane, Miller is 99.44% confident that it will resolve the outlier feedback loop that has made quantization the subject of cascading research. Miller says that if you want to run some experiments to prove him right, you can contact him, and he will write a paper.
The improved mechanism could be called QuietAttention, allowing attention heads to remain “silent.”
Miller believes a test could be put together fairly soon: if you prefix every input context with a zero vector and make sure your chosen network adds no bias terms (positional encodings included), then the zero passes through unchanged and has the effect of adding one to every subsequent softmax denominator. That way you won’t lose your mind dealing with gradient code. Miller believes this could be done with fixed embeddings and a special prefix token in a LLaMA-style model.
You still need to retrain the model, so don’t try this on a Raspberry Pi just yet. But Miller is curious what the weight kurtosis and activation infinity norms look like after a few runs. He believes this will become influential research, whether the credit ends up with the Qualcomm AI Research paper or with someone in the LLM hacker channel working out the BibLaTeX citation, but he found it first.
- Project link: https://github.com/kyegomez/AttentionIsOFFByOne
- Blog link: https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4
This article is shared for the purpose of academic exchange; it does not imply that this account endorses its views or takes responsibility for the accuracy of its content. Copyright belongs to the original author; if there is any infringement, please let us know and it will be removed.
“The stone from other hills” historical articles
27 Python Data Science Library Practical Cases
Reviewing the official FasterRCNN code from PyTorch
Segment Anything project summary
Comparison of several commonly used frameworks for RLHF (trlx, deepspeedchat, colossalaichat)
Large Model Thinking Chain (Chain-of-Thought) Technical Principles
Understanding LangChain in one article
Notes on CUDA SGEMM Matrix Multiplication Optimization – From Beginner to cublas
Three lines of code to call the large model referee PandaLM: privacy-preserving, reliable, reproducible
Tips for upgrading to PyTorch 2.0
Understanding PyTorch Memory Management Mechanism in One Article
YOLOv5 Practical PCB Defect Detection
Time Series Prediction with Neural Networks PyTorch-Forecasting
Summary of Python Feature Selection
Simple Implementation of a ChatGPT Plugin
Using BERT for NER? Teaching you to easily get started with Roberta using PyTorch!
NSFW Image Classification
PyTorch-Forecasting, a new time series prediction library (with code)
Recording an understanding of the use of past_key_values
Overview of Few-Shot Learning: Algorithms, Models, and Applications
For more articles from the “Stone from Other Hills” column, please click the “Read Original” at the bottom of the article.
Share, like, and give a three-way boost!