“Don’t let Yann LeCun see this.”
Yann LeCun said it’s too late; he has already seen it. Today, we will introduce a paper that “LeCun must see,” exploring the question: Is the Transformer a far-sighted language model? When it performs inference at a certain position, does it consider future positions in advance?
This study concludes that while Transformers have the capability to do so, they do not do it in practice.
We all know that humans think before they speak. Decades of linguistic research have shown that when humans use language, they internally predict upcoming linguistic input, whether individual words or whole sentences.
Unlike humans, current language models allocate a fixed amount of computation for each token when “speaking.” So we can’t help but ask: Do language models think ahead like humans?
Recent work has shown that it is possible to predict tokens beyond the immediate next one by probing a language model's hidden states. Interestingly, linear probes applied to those hidden states can predict the model's outputs at future positions to some extent, and intervening on the hidden states produces predictable changes in future outputs.
In other words, the model's activations at a given time step already encode some information about its future outputs.
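Concretely, such a probe is just a linear map from the hidden state at position t to a guess about the token at position t + τ. Below is a minimal sketch of that idea using GPT-2 from Hugging Face transformers; the layer choice, the value of τ, and the single toy sentence are our own illustrative assumptions, not the setup used in the probing studies themselves.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tau = 2  # how many steps ahead the probe tries to predict
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "The quick brown fox jumps over the lazy dog because it was bored."
ids = tokenizer(text, return_tensors="pt").input_ids  # (1, seq_len)

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
hidden = out.hidden_states[-1][0]   # last-layer hidden states, (seq_len, d_model)

# Training pairs: hidden state at position t -> token id at position t + tau.
X = hidden[:-tau]                   # probe inputs
y = ids[0, tau:]                    # future-token labels

# Linear probe: a single linear layer trained with cross-entropy.
probe = torch.nn.Linear(hidden.size(-1), model.config.vocab_size)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):                # in practice: many sequences, not one sentence
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(probe(X), y)
    loss.backward()
    optimizer.step()
```

In practice one would of course fit the probe on hidden states collected from many sequences and evaluate it on held-out text; the point here is only that the probe itself is nothing more than a single linear layer.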
However, the reason remains unclear: is this merely an incidental property of the data, or does the model deliberately prepare information for future time steps (which could come at the cost of its performance at the current position)?
To answer this question, three researchers from the University of Colorado Boulder and Cornell University recently published a paper titled “Do Language Models Plan for Future Tokens?”
Paper Title: Do Language Models Plan for Future Tokens?
Paper Link: https://arxiv.org/pdf/2404.00859.pdf
Research Overview
They observed that during training, gradients optimize the loss not only at the current token position but also for the tokens that follow in the sequence. They then asked: How do the transformer weights that result from such training allocate resources between the current token and future tokens?
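In sketch form (the notation here is our own choice, not necessarily the paper's): under the standard next-token objective, the hidden state at position t receives gradient both from its own loss term and, through attention, from every later loss term, so optimizing the total loss implicitly pressures that hidden state to serve future positions as well.

```latex
% \ell_t: next-token loss at position t; h_t: a hidden state at position t.
% Standard training shapes h_t through the current loss and all future losses:
\mathcal{L}(\theta) = \sum_{t} \ell_t(\theta), \qquad
\frac{\partial \mathcal{L}}{\partial h_t}
  = \underbrace{\frac{\partial \ell_t}{\partial h_t}}_{\text{current token}}
    + \underbrace{\sum_{\tau \ge 1} \frac{\partial \ell_{t+\tau}}{\partial h_t}}_{\text{future tokens, via attention}}
```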
They considered two possibilities: the pre-caching hypothesis and the breadcrumbs hypothesis.
The pre-caching hypothesis says that at time step t the transformer computes features that are irrelevant to the inference task at t but will be useful at a future time step t + τ. The breadcrumbs hypothesis says that the features most relevant at time step t are already the ones that will be most useful at time step t + τ.
To evaluate which hypothesis is correct, the team proposed a myopic training scheme that does not propagate the gradient of the loss at the current position to the hidden states of previous positions.
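A rough way to picture the myopic idea (a simplified sketch of our own, not the authors' exact implementation): in every attention layer, detach the keys and values computed from the hidden states, so that the loss at position t can no longer shape what earlier positions write into their hidden states. The version below is even stricter than that, since it also cuts the gradient through a position's own key and value.

```python
import math
import torch
import torch.nn.functional as F

def myopic_causal_attention(hidden, w_q, w_k, w_v):
    """Single-head causal attention where the loss at a position cannot
    back-propagate into the hidden states of earlier positions.

    hidden: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Illustrative only: detaching every key/value source removes any
    gradient incentive for earlier positions to pre-cache information.
    """
    q = hidden @ w_q
    k = hidden.detach() @ w_k   # stop-gradient: earlier hidden states get no signal
    v = hidden.detach() @ w_v
    seq_len, d_head = q.shape
    scores = (q @ k.T) / math.sqrt(d_head)
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

Comparing a model trained this way against an ordinarily trained one then reveals how much of the ordinary model's performance actually depended on gradients flowing back from future losses to earlier positions.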
For mathematical definitions and theoretical descriptions of the above hypotheses and schemes, please refer to the original paper.
Experimental Results
To understand whether language models can implement pre-caching at all, they designed a synthetic scenario in which the task can only be solved through explicit pre-caching: the model must pre-compute information for the next token, because otherwise the correct answer cannot be computed within a single forward pass.
Definition of the synthetic dataset constructed by the team.
In this synthetic scenario, the team found clear evidence that transformers can learn to pre-cache. When transformer-based sequence models must pre-compute information to minimize loss, they will do so.
Subsequently, they explored whether natural language models (pre-trained variants of GPT-2) behave more in line with the breadcrumbs hypothesis or the pre-caching hypothesis. Their myopic-training experiments indicated that little pre-caching occurs in this setting, so the results favor the breadcrumbs hypothesis.
Cross-entropy loss of the original GPT-2 model and the myopic GPT-2 model, and the difference between them, as a function of token position.
Validation cross-entropy loss obtained by GPT-2 through original and myopic training.
Thus, the team argues that on real language data, language models do not prepare information for the future to any significant extent. Instead, they compute features that are useful for predicting the next token, and these features turn out to be useful at future steps as well.
The team stated, “In language data, we observe that optimizing greedily for the next token loss does not significantly trade-off with ensuring future prediction performance.”
Therefore, we can see that the question of whether the Transformer can think ahead seems to be fundamentally a data issue.
One can imagine that, in the future, suitable ways of organizing training data might give language models the human-like ability to think ahead.