Attention Mechanism Bug: Softmax is the Culprit Affecting All Transformers

Source: Machine Heart

Jishi Guide

“Big model developers, you are wrong.”

“I found a bug in the attention formula that no one has discovered for eight years. All Transformer models, including GPT and LLaMA, are affected.”

Yesterday, a statistician named Evan Miller made waves in the AI field with his statement.

We know that the attention formula in machine learning is as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Since the introduction of the Transformer in 2017, this formula has been widely used. However, now Evan Miller has discovered that this formula is wrong and contains a bug!

Evan Miller’s blog explains how today’s popular AI models get something wrong in a critical place, which makes every Transformer model harder to compress and deploy.

In summary, Evan Miller introduced a new function called Quiet Attention, also known as Softmax_1, which is an innovative adjustment to the traditional softmax function.

Some netizens summarized a “TL;DR” version of the blog: the author suggests adding 1 to the denominator of the softmax used in the attention mechanism (not the final output softmax). The softmax in an attention unit lets it treat key/query matches as probabilities; those probabilities back a continuous-valued version of key-value lookup (instead of a hard 1/0 lookup result, a high weight means “this is the key-value pair I want”).

Adding 1 to the denominator changes the attention unit: it no longer uses a true probability vector as its weights, but weights that sum to less than 1. The motivation is that the network can still learn to produce large logits, in which case the adjusted softmax stays very close to a probability vector; at the same time, it now has the option of producing uniformly small weights (and hence a near-zero output), i.e. of choosing not to be confident about anything.
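
For intuition, a small worked example (not from the blog, purely illustrative): take three logits x = (−5, −5, −5), i.e. a query that matches every key poorly. Ordinary softmax gives (1/3, 1/3, 1/3), a total weight of 1 spread over tokens the head does not actually want. Softmax with 1 added to the denominator gives roughly (0.0066, 0.0066, 0.0066), a total weight of about 0.02, so the head can effectively abstain.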

Some even speculate, “Is this the reason why Microsoft’s RetNet outperforms transformers?”

Others have stated that this research can promote improvements in LLMs, greatly compressing weights so that smaller models can compete with larger ones:

Miller stated: you can use the Softmax_1 function just like the traditional softmax function, as shown below.

import torch
from softmax_one.softmax_one import softmax_one  # provided by the AttentionIsOFFByOne repository

x = torch.randn(5)          # a vector of random logits
y = softmax_one(x, dim=0)   # drop-in replacement for torch.softmax(x, dim=0)
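
For reference, here is a minimal sketch of what such a softmax_one function could look like: an illustrative, numerically stabilized implementation of exp(x_i) / (1 + Σ_j exp(x_j)), not necessarily identical to the code in the linked repository.

import torch

def softmax_one(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Shift by the (clamped, non-negative) max so exp() cannot overflow.
    m = x.max(dim=dim, keepdim=True).values.clamp(min=0)
    e = torch.exp(x - m)
    # After the shift, the "+1" in the denominator becomes exp(-m):
    # exp(x_i) / (1 + sum_j exp(x_j)) == exp(x_i - m) / (exp(-m) + sum_j exp(x_j - m))
    return e / (torch.exp(-m) + e.sum(dim=dim, keepdim=True))

When every logit is negative, the clamp makes m = 0 and the expression reduces exactly to the unshifted formula, so the output weights can sum to well below 1.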

Based on such modifications, Miller also conducted experiments, and the results are as follows:

(Figure: experimental results)

Next, let’s see what error Miller discovered.

Outliers

Evan Miller discovered this bug while reading papers on quantization. Currently, memory and storage have become critical factors limiting the development of AI. People have been working hard to compress models and trying to run large language models (LLMs) in the cloud and on edge devices.

In computers, information is stored as a binary data stream. If the data stream is highly predictable, such as always being within a limited range, we can store it with relatively few bits. Conversely, if a string of numbers is unpredictable, possibly containing huge numbers, we need more binary digits to encode and store it. Transformer models contain some outlier weights.

In a paper published by Qualcomm AI Research in June titled “Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing,” the research team traced the existence of these outliers back to the softmax function of the attention mechanism.

Qualcomm paper link: https://arxiv.org/abs/2306.12929

This may sound surprising, but Evan Miller believes it is correct and further discovered an error in the softmax function.

Let’s see how Evan Miller explains that the softmax function is not a suitable tool for the attention mechanism.

Problems Introduced by Softmax

Why is softmax not suitable for the attention mechanism? It starts with what the attention mechanism can do.

Generally speaking, numerical anomalies like these are caused by programming errors. But when there is no programming error, fixing them means spending a lot of time on the complex mathematical formulas themselves.

Evan Miller read about 50 arXiv papers before getting some insights. Miller started with “input embeddings,” which we can understand as a floating-point vector representing a word in the input string.

For example, Meta’s recently released LLaMA 2 model uses an embedding vector of length 3204, stored as half-precision floating-point numbers, just to represent one word out of a vocabulary that typically contains 30,000 to 50,000 entries. This means the embedding vector for a single word occupies over 6KB of storage. As technology has advanced, the length of input embeddings has steadily grown, and so has the storage they occupy.

If you are a C programmer sensitive to storage usage, you might find it unacceptable that something which could be stored in 2 bytes takes up 6KB: a vocabulary of fewer than 2^16 = 65,536 entries needs only 16 bits per entry.
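
Concretely, using the numbers above: 3204 dimensions × 2 bytes per half-precision float = 6408 bytes ≈ 6.3 KB per token embedding, versus the 2 bytes a 16-bit token ID would need.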

However, in reality, the Transformer works like this: it converts the input vector into an output vector of the same size, and the final 6KB output vector is used to predict the next token. During operation, each layer of the Transformer adds information to the original word vector. In this process, residual connections are also used: all attention mechanisms are adding supplementary material to the original two-byte information, allowing LLMs to analyze longer contexts.

The last step of the Transformer is to multiply this output vector by a rectangular matrix and squash the resulting vocabulary-length vector through a softmax, treating the exponentiated outputs as probabilities for the next token. This is reasonable, but it is well known to be not entirely correct, since we cannot be sure these output probabilities are right. Instead, every Transformer implementation and its derivatives use sampling mechanisms to hide the fact that softmax over-represents low probabilities.
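
As a rough sketch of that last step (illustrative PyTorch only; the shapes, the name W_unembed, and the temperature value are assumptions, not details from the blog):

import torch

d_model, vocab_size = 3204, 32000              # illustrative sizes
h = torch.randn(d_model)                       # final output vector for the last position
W_unembed = torch.randn(vocab_size, d_model)   # the "rectangular matrix"
logits = W_unembed @ h                         # vocabulary-length vector of scores
probs = torch.softmax(logits / 0.8, dim=0)     # temperature < 1 sharpens the distribution
next_token = torch.multinomial(probs, num_samples=1)  # sample rather than take the raw probabilities at face value

Temperature scaling and sampling schemes (top-k, nucleus sampling, etc.) are examples of the mechanisms the text alludes to for damping softmax’s over-weighted low-probability tail.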

Next, Miller introduces the history of softmax. Softmax first appeared in statistical mechanics, as a way of predicting the distribution of states based on their energy levels, in the following form:

P_i = exp(−ε_i / (kT)) / Σ_j exp(−ε_j / (kT))   (ε_i: energy of state i; T: temperature; k: Boltzmann’s constant)

Later, economists modified it to

P_i = exp(u_i) / Σ_j exp(u_j)   (u_i: the utility of choice i)

With this modification, softmax became the multinomial logistic function. Because Miller has studied the softmax function deeply, he can recognize when it is being used inappropriately.

Softmax is widely used. It works very well in physics; in economics it may be less exact; but once applied to machine learning, it seems to always be effective whenever discrete choices are involved:

softmax(x)_i = exp(x_i) / Σ_j exp(x_j)

Miller further states that the key issue with softmax is that it always allocates its full weight among the options: if you want the option of retaining none of them, softmax must be modified, otherwise the results will be distorted.

For example, in the context of LLMs, the reason for distortion is the excessive weighting of non-semantic tokens (commas, etc.), which leads to these higher weights becoming difficult-to-compress outliers, complicating research. Qualcomm AI researchers have also observed this phenomenon, noting that over 97% of outlier activations in LLMs occur at the positions of spaces and punctuation.

Next, Miller explains how softmax is used in attention mechanisms, revealing where the problems arise:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Breaking down the above formula, in a decoder-only model, 𝑄, 𝐾, and 𝑉 originate from the same input sequence. They are not exactly the same, as they are projected differently. However, in each layer, they all start with the same annotated embedding vector.

The 𝑄𝐾^𝑇 term is used to find the correlations between token vectors at different positions, essentially constructing a correlation matrix (with scaling), where each column and row corresponds to a token position. Then, softmax is applied to each row of this matrix, and the resulting probabilities are used as a mixing function for the value vectors in the 𝑉 matrix. The mixed 𝑉 is then added to the input vector, and the summed result is passed to the neural network for further processing.

Multi-head attention executes the above process multiple times in parallel at each layer. Essentially, this approach partitions the embedding vector, with each head using information from the entire vector to annotate a (non-overlapping) segment of the output vector. This is the concatenation operation in the original Transformer paper.
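
To make this concrete, here is a minimal single-head sketch in PyTorch (illustrative only: no masking, no multi-head splitting, and it assumes the value dimension equals the model dimension so the residual addition works without an output projection):

import math
import torch

def single_head_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model); Q, K, V are different projections of the same input.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / math.sqrt(d_k)        # (seq_len, seq_len) correlation matrix, with scaling
    weights = torch.softmax(scores, dim=-1)  # softmax over each row: every position must emit total weight 1
    mixed = weights @ V                      # probability-weighted mix of the value vectors
    return X + mixed                         # residual connection: add the annotation back to the input

The torch.softmax call on each row is exactly the step the rest of the post is about.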

The problem with using softmax is that it forces each attention head to annotate, even when there is no information to add to the output vector.

Softmax_1 and Quiet Attention

Here comes Softmax Super-Mod, igniting the LLM hacker channel.

A bit disappointing, right? All Miller did was add 1 to the denominator. This allows the output vector to go to 0 as a whole if the model wants it to; otherwise it only shrinks the values slightly, and the shrinkage is compensated for by the normalization that follows attention.

softmax_1(x)_i = exp(x_i) / (1 + Σ_j exp(x_j))

The main difference appears in the limit of negative values, when the entries of 𝑥 are significantly less than zero and the model wants to avoid annotating anything at all. The limiting behavior of the original softmax is:

As every entry of 𝑥 tends to −∞:  softmax(𝑥)_i → 1/n

Compare this with the new and improved softmax_1:

As every entry of 𝑥 tends to −∞:  softmax_1(𝑥)_i → 0

Vanilla softmax always emits the same total weight; softmax_1 looks mostly the same, but has an “escape hatch” in the negative quadrant. It is important to be clear that the core issue here is mathematical rather than numerical: extra precision cannot save softmax; all Transformers are affected.

You can also observe some other aspects of softmax_1. The derivative is positive, so there is always a non-zero gradient, and its sum is between 0 and 1, so the output does not run wild. The function maintains the following properties:

softmax_1(x)_i / softmax_1(x)_j = exp(x_i) / exp(x_j) = exp(x_i − x_j)

That is, the relative values in the output vector remain unchanged.

Initially, Miller intended to name this function ghostmax: you can think of it as a standard softmax over the vector x with an extra zero entry appended (the exp(0) = 1 supplies the extra 1 in the denominator), together with a zero vector in the V matrix that attenuates the result.
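
This “extra zero entry” view is easy to check numerically (a quick illustrative verification, not code from the blog):

import torch

x = torch.randn(5)
s1 = torch.exp(x) / (1 + torch.exp(x).sum())                        # softmax_1(x)
ghost = torch.softmax(torch.cat([x, torch.zeros(1)]), dim=0)[:-1]   # softmax over [x; 0], ghost entry dropped
print(torch.allclose(s1, ghost))                                    # True: the appended zero contributes exp(0) = 1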

Although softmax_1 may look boring on the surface, Miller is 99.44% confident that it will fix the outlier feedback loop that has made quantization the subject of cascading research. Miller says that if you want to run experiments to prove him right, you can contact him; he will write a paper.

The improved mechanism can be called Quiet Attention, which allows attention heads to “remain silent”.

Attention(Q, K, V) = softmax_1(QK^T / √d_k) V

Miller believes a test could be put together fairly quickly: if you prepend a zero vector to every input context and make sure the network adds no bias to it (including positional encoding), then the zero stays zero as it passes through the layers and contributes the extra unity to the denominator of every subsequent softmax. That way you don’t have to lose your mind over gradient code. Miller believes this could be accomplished with fixed embeddings and a special prefix token in the LLaMA model.
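
Under the assumptions Miller describes (no biases, no positional encoding applied to the prefix), a quick illustrative check shows that prepending an all-zero key/value pair reproduces softmax_1 attention exactly; the names and shapes below are made up for the demonstration.

import math
import torch

seq_len, d = 4, 8
Q, K, V = torch.randn(seq_len, d), torch.randn(seq_len, d), torch.randn(seq_len, d)
scores = Q @ K.T / math.sqrt(d)

# Quiet Attention: softmax_1 applied to each row.
w1 = torch.exp(scores) / (1 + torch.exp(scores).sum(dim=-1, keepdim=True))
out_quiet = w1 @ V

# Ordinary softmax, but with a zero key and zero value prepended: the zero key scores 0
# against every query (contributing exp(0) = 1 to each denominator), and the zero value
# adds nothing to the output.
K0 = torch.cat([torch.zeros(1, d), K], dim=0)
V0 = torch.cat([torch.zeros(1, d), V], dim=0)
w0 = torch.softmax(Q @ K0.T / math.sqrt(d), dim=-1)
out_prefix = w0 @ V0

print(torch.allclose(out_quiet, out_prefix, atol=1e-5))  # True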

You would still need to retrain the model, so don’t try this on a Raspberry Pi just yet. But Miller wonders what the weight kurtosis and activation infinity norms look like after a few runs. He believes this will become influential research, whether the credit goes to the Qualcomm AI Research team’s paper or to whoever in the LLM hacker channel figures out biblatex first.

  • Project address: https://github.com/kyegomez/AttentionIsOFFByOne

  • Blog link: https://www.evanmiller.org/attention-is-off-by-one.html?continueFlag=5d0e431f4edf1d8cccea47871e82fbc4
