Introducing ∞-former: Infinite Long-Term Memory for Any Length Context

Reported by Machine Heart

Machine Heart Editorial Team

Can a model retain context of arbitrary length? A new model called ∞-former aims to do exactly that.
Over the past few years, the Transformer has dominated NLP and has also spread into other areas such as computer vision. But it has weaknesses: it handles long contexts poorly, because its computational cost grows quadratically with the context length, which makes long-term memory hard to model effectively. Various Transformer variants have been proposed to mitigate this, but their memory capacity remains bounded, forcing them to discard earlier information.
In a recent paper, researchers from DeepMind and other institutions proposed ∞-former, a Transformer model with unbounded long-term memory (LTM) that can handle contexts of any length.
Paper link: https://arxiv.org/pdf/2109.00301.pdf
By using a continuous-space attention mechanism to attend over its long-term memory, ∞-former's attention complexity becomes independent of the context length. It can therefore model contexts of any length at a fixed computational cost while keeping "sticky memories".
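To give a rough sense of why the cost stops depending on the context length, here is a minimal NumPy sketch of a continuous attention readout. The Gaussian basis functions, fixed centres, grid quadrature, and all shapes are illustrative assumptions rather than the paper's exact implementation; the point is that the query only ever touches the N basis coefficients, never the original tokens.

```python
import numpy as np

def continuous_attention_readout(B, mu, sigma, grid_size=1000):
    """Read one value vector out of a continuous memory.

    B     : (N, d) coefficients of the memory signal over N basis functions
    mu    : centre of the Gaussian attention density on [0, 1]
    sigma : width of the Gaussian attention density
    """
    N, d = B.shape
    t = np.linspace(0.0, 1.0, grid_size)                  # positions in [0, 1]
    centres = np.linspace(0.0, 1.0, N)                    # fixed RBF centres (an assumption here)
    psi = np.exp(-0.5 * ((t[:, None] - centres[None, :]) / 0.02) ** 2)  # (grid, N)
    p = np.exp(-0.5 * ((t - mu) / sigma) ** 2)            # attention density chosen by the query
    p /= p.sum()                                          # normalise on the grid
    expected_psi = p @ psi                                # E_p[psi(t)], shape (N,)
    return expected_psi @ B                               # value vector, shape (d,)

# The cost scales with N and the quadrature grid, not with how many tokens
# were ever written into the memory.
value = continuous_attention_readout(B=np.random.randn(64, 128), mu=0.3, sigma=0.05)
```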
Experiments on a synthetic sorting task show that ∞-former can retain information from long sequences. The researchers also ran language modeling experiments, both training a model from scratch and fine-tuning a pre-trained language model, which demonstrate the benefits of unbounded long-term memory.
However, like many other Transformer-variant papers, this one's title has also drawn some complaints.
∞-former: A Transformer with Infinite Memory
To enable the model to handle long-range context, the researchers propose extending the original Transformer with a continuous LTM that stores the input embeddings and hidden states of previous steps. They also consider keeping two kinds of memory, an LTM and a short-term memory (STM), similar to the memory of Transformer-XL. The overall architecture of ∞-former is shown in the figure below.
(Figure: overall architecture of ∞-former.)
To make the LTM unbounded, the researchers adopt a continuous-space attention framework (see "Sparse and Continuous Attention Mechanisms"), which trades off the number of information units that fit in memory (basis functions) against the granularity of their representations. In this framework, the input sequence is represented as a continuous signal, expressed as a linear combination of radial basis functions. This representation has two significant advantages: 1) the context can be represented with N basis functions, where N is smaller than the number of tokens in the context, which reduces the attention cost; 2) N can be kept fixed, making it possible to represent unbounded context in memory (as shown in the figure below), at the cost of resolution but without increasing the attention complexity, which stays at O(L² + L × N), where L is the Transformer sequence length.
(Figure: representing an unbounded context with a fixed number of basis functions.)
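To make the basis-function representation concrete, below is a minimal sketch of compressing L token embeddings into N coefficient vectors. Ridge regression onto Gaussian RBFs is used here as one standard fitting choice; the shapes and hyperparameters are hypothetical, not taken from the paper.

```python
import numpy as np

def fit_memory(X, N=64, width=0.02, ridge=1e-4):
    """Compress an (L, d) sequence of embeddings X into an (N, d) coefficient matrix.

    Token positions are mapped to [0, 1] and X is regressed onto N Gaussian
    radial basis functions; only the coefficients are stored, so the memory
    footprint does not grow with L.
    """
    L, d = X.shape
    t = np.linspace(0.0, 1.0, L)                          # token positions in [0, 1]
    centres = np.linspace(0.0, 1.0, N)
    Psi = np.exp(-0.5 * ((t[:, None] - centres[None, :]) / width) ** 2)  # (L, N)
    # Ridge regression: B = (Psi^T Psi + ridge * I)^(-1) Psi^T X
    B = np.linalg.solve(Psi.T @ Psi + ridge * np.eye(N), Psi.T @ X)      # (N, d)
    return B

# Whether the context has 100 tokens or 10,000, the stored memory is (N, d).
B = fit_memory(np.random.randn(10_000, 128), N=64)
print(B.shape)  # (64, 128)
```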
To mitigate the loss of resolution for older memories, the researchers introduce "sticky memories": when the LTM signal is rebuilt, larger portions of the new signal are allocated to the regions of the previous memory signal that were most relevant. This forces important information to persist in the LTM, letting the model capture long contexts without losing relevant information, loosely analogous to long-term potentiation and synaptic plasticity in the brain.
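A minimal sketch of the sticky-memory idea follows; it assumes the per-region attention mass is already available and reuses Gaussian RBFs as above. Regions that received more attention are sampled more densely before the memory is re-fitted, so they survive consolidation at higher resolution. This is an illustrative reading of the mechanism, not the paper's exact procedure.

```python
import numpy as np

def sticky_resample(B, attention_mass, num_samples=512, width=0.02, rng=None):
    """Sample the memory signal non-uniformly, favouring well-attended regions.

    B              : (N, d) current basis coefficients of the memory signal
    attention_mass : (R,) attention received by each of R equal regions of [0, 1]
    """
    if rng is None:
        rng = np.random.default_rng(0)
    N, d = B.shape
    R = len(attention_mass)
    probs = attention_mass / attention_mass.sum()
    # Pick a region in proportion to its attention mass, then a uniform
    # position inside that region.
    regions = rng.choice(R, size=num_samples, p=probs)
    t = (regions + rng.random(num_samples)) / R           # positions in [0, 1)
    # Evaluate the current memory signal at the sampled positions.
    centres = np.linspace(0.0, 1.0, N)
    psi = np.exp(-0.5 * ((t[:, None] - centres[None, :]) / width) ** 2)  # (S, N)
    samples = psi @ B                                      # (S, d)
    return t, samples
```

The sampled points, together with newly arrived tokens, would then be regressed onto a fresh set of basis functions (as in the fitting sketch above), so frequently attended regions keep a finer-grained representation.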
Experimental Results
To test whether ∞-former can model long contexts, the researchers first ran a synthetic task: sorting tokens by their frequency of occurrence in a long sequence. The results are shown below:
(Figure: sorting-task accuracy versus sequence length for Transformer-XL, Compressive Transformer, and ∞-former.)
As the figure shows, at a sequence length of 4,000 Transformer-XL is slightly more accurate than the Compressive Transformer and ∞-former, because it can keep nearly the entire sequence in memory. But as the sequence length grows, Transformer-XL's accuracy drops quickly, while the Compressive Transformer and ∞-former change little. This indicates that ∞-former is better at modeling long sequences.
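For intuition about the task itself, here is a hypothetical sketch of how such a frequency-sorting example could be generated. Vocabulary size, sequence length, and the sampling scheme are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def make_sorting_example(vocab_size=20, seq_len=4000, rng=None):
    """Build one (sequence, target) pair for a frequency-sorting task.

    The input is a long stream of tokens drawn from a random distribution;
    the target is the vocabulary ordered by how often each token occurred,
    most frequent first.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(vocab_size))        # random token distribution
    sequence = rng.choice(vocab_size, size=seq_len, p=probs)
    counts = np.bincount(sequence, minlength=vocab_size)
    target = np.argsort(-counts)                      # tokens by descending frequency
    return sequence, target

seq, tgt = make_sorting_example()
print(len(seq), tgt[:5])
```

Solving this requires remembering token statistics from the entire stream, which is why accuracy at long sequence lengths reflects memory capacity.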
Next, they conducted language modeling experiments, including: 1) training a model from scratch; 2) fine-tuning a pre-trained language model.
The results of the first language modeling experiment are shown in the following table. Extending the model with long-term memory indeed yields better perplexity, and sticky memories reduce perplexity a little further.
(Table: perplexity when training from scratch, with and without long-term memory and sticky memories.)
The results of the second language modeling experiment are shown in the following table. Simply adding long-term memory to GPT-2 and fine-tuning lowers the model's perplexity on WikiText-103 and PG19. This shows that ∞-former is versatile: it can be trained from scratch or used to improve a pre-trained model.
(Table: perplexity of GPT-2 fine-tuned with long-term memory on WikiText-103 and PG19.)

Discussing the Future of ML with Andrew Ng at the 2021 Amazon Cloud Technology China Summit: Play and Learn

The "second stop" of the 2021 Amazon Cloud Technology China Summit will be held online from September 9 to September 14. For AI developers, the "Artificial Intelligence and Machine Learning Summit" on September 14 is the most noteworthy event.

On that day, Dr. Swami Sivasubramanian, Vice President of AI and Machine Learning at Amazon Web Services, will hold a "fireside chat" with Dr. Andrew Ng, the renowned AI scholar and founder of Landing AI.

Moreover, the “Artificial Intelligence and Machine Learning Summit” will also feature four major sub-forums, namely “Machine Learning Science”, “Impact of Machine Learning”, “Practical Machine Learning Without Expertise”, and “How Machine Learning is Implemented”, elaborating on the development of machine learning from multiple aspects such as technical principles, application scenarios, and impacts on industry sectors.



© THE END

For reprints, please contact this public account for authorization

For submissions or inquiries: [email protected]
