Google Proposes New Titans Architecture Beyond Transformers

Titans: Learning to Memorize at Test Time

Ali Behrouz†, Peilin Zhong†, and Vahab Mirrokni†

Google Research

Abstract

For more than a decade, extensive research has been devoted to how to effectively use recurrent models and attention mechanisms. While recurrent models aim to compress the data into a fixed-size memory (known as the hidden state), attention allows the model to attend to the entire context window and capture the direct dependencies of all tokens. This more accurate dependency modeling, however, comes with a quadratic cost, limiting the model to a fixed-length context. We propose a new neural long-term memory module that learns to memorize historical context and helps attention focus on the current context while leveraging information from long-term memory. We show that this neural memory has the advantage of fast, parallelizable training while maintaining fast inference. From a memory perspective, we view attention, due to its limited context but accurate dependency modeling, as a short-term memory, while the neural memory, due to its ability to memorize data, acts as a long-term, more persistent memory. Based on these two modules, we introduce a new family of architectures called Titans and present three variants that address how memory can be effectively incorporated into the architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. Moreover, compared to the baselines, they can effectively scale to context window sizes larger than 2M tokens while achieving higher accuracy.

"The true art of memory is the art of attention!" (Samuel Johnson, 1787)

Table of Contents

1 Introduction

1.1 Memory Perspective

1.2 Contributions and Roadmap

2 Preliminary Work

2.1 Background

3 Learning to Memorize at Test Time

3.1 Long-term Memory

3.2 How to Parallelize Long-term Memory Training

3.3 Persistent Memory

4 How to Integrate Memory?

4.1 Memory as Context

4.2 Gated Memory

4.3 Memory as a Layer

4.4 Architecture Details

5 Experiments

5.1 Experimental Setup

5.2 Language Modeling

5.3 Needle in a Haystack

5.4 BABILong Benchmark

5.5 Effects of Memory Depth

5.6 Time Series Prediction and Training Throughput

5.7 DNA Modeling

5.8 Efficiency

5.9 Ablation Study

6 Conclusion

In this paper, we presented a neural long-term memory that, as a meta in-context learner, learns to memorize at test time. The neural memory module is essentially a recurrent model that adaptively memorizes tokens that are more surprising, or close to surprising ones. Compared to modern recurrent models, it has more expressive memory update and storage mechanisms. Leveraging this memory, we introduced the Titans architecture and its three variants, in which we suggest incorporating the memory module as (1) a context, (2) a gating mechanism, and (3) a layer. Our experimental evaluations across various tasks confirm that Titans is more effective than Transformers and recent modern linear recurrent models, especially for long contexts; that is, Titans can achieve better accuracy than baseline models when scaling beyond a 2M context window size. Titans is implemented in PyTorch and JAX, and we intend to release the code we used to train and evaluate the models soon.

References

Omitted…

A Related Work

There are various independent perspectives that can lead to the design of Titans or its components. Thus, to further place our work in a broader context, we review three lines of research.

A.1 Linear Recurrent Models

Recently, to address the computational cost of Transformers during training and inference, linear recurrent models have attracted much attention (Tiezzi et al., 2024), primarily due to their fast training and inference. The first generation of such models, including RetNet (Y. Sun et al., 2023), LRU (Orvieto et al., 2023), RWKV (Peng, Alcaide et al., 2023), S5 (J. T. Smith, Warrington, and Linderman, 2023), and S4 (Gu, Goel, and Re, 2022), uses data-independent transition matrices / decay mechanisms. The second generation incorporates gating, a widely used technique in traditional RNNs (Gers, Jürgen Schmidhuber, and Cummins, 2000; Greff et al., 2016; Van Der Westhuizen and Lasenby, 2018), into these linear architectures, e.g., Griffin (De et al., 2024), SSMs (Behrouz, Santacatterina, and Zabih, 2024; Dao and Gu, 2024; Gu and Dao, 2024; Hasani et al., 2023), and RWKV6 (Peng, Goldstein et al., 2024). The third generation of linear recurrent models relies on more complex memory update rules grounded in meta-learning, online learning, and/or delta (incremental) rules, producing more expressive and effective models, such as Longhorn (B. Liu et al., 2024), Gated DeltaNet (S. Yang, Kautz, and Hatamizadeh, 2024), TTT (Yu Sun et al., 2024), and DeltaNet (S. Yang, B. Wang, Yu Zhang et al., 2024). Our LMM can be seen as the next generation of these models, in which we incorporate the flow of tokens into the memory update mechanism, yielding a more powerful memory update process. See Appendix C for a detailed discussion of different recurrent models and Titans.

A.2 Transformer-based Architectures

Transformers (Vaswani et al., 2017), built on attention mechanisms (Bahdanau, 2014), serve as the de facto standard for many deep learning models. However, they suffer from a quadratic computational cost, limiting their ability to scale to long context windows. To improve the memory consumption and throughput of softmax attention for longer sequences, various studies have focused on I/O-aware implementations of attention (Dao, 2024; Dao, D. Fu et al., 2022), designing more efficient attention mechanisms by sparsifying the attention matrix (B. Chen et al., 2021; Choromanski et al., 2021; Dai et al., 2019; J. Dong et al., 2024; Roy et al., 2021), approximating the softmax (Arora et al., 2024), or developing kernel-based (linear) attention (Aksenov et al., 2024; Kacham, Mirrokni, and P. Zhong, 2024; Schlag, Irie, and Jürgen Schmidhuber, 2021; S. Yang, B. Wang, Shen et al., 2024).

Segment-based Transformers represent another research direction for improving the efficiency of Transformers (Dai et al., 2019). Their main drawback is that segments are completely isolated from one another, so the effective context window is limited to the length of a segment. Various studies have discussed the importance of a memory that helps the model pass information across segments (Bulatov, Yuri Kuratov et al., 2023; Bulatov, Yury Kuratov, and Burtsev, 2022; Feng et al., 2022; Hutchins et al., 2022; Rodkin et al., 2024; Z. Wang et al., 2019; Q. Wu et al., 2020; Zancato et al., 2024). The key differences of Titans from these models are that (1) the memory in such models is a simple small-size vector, lacking the expressive power to compress complex information; (2) their memory modules lack a forgetting mechanism, leading to rapid memory overflow; and (3) they focus only on instantaneous surprise, lacking the flow of information across the sequence. More specifically, recalling recurrent memory Transformers (RMT) (Bulatov, Yuri Kuratov et al., 2023; Bulatov, Yury Kuratov, and Burtsev, 2022; Rodkin et al., 2024), Titans (MAC) can be viewed as a generalization of RMT in which we use a neural memory module instead of a small vector-valued memory.

Memory of Large Language Models. Another interesting research direction combines external memory modules with LLMs during post-training (Z. He et al., 2024; Khandelwal et al., 2020). These models differ from our approach in that we incorporate memory as part of the architecture from the start and train it end-to-end. Furthermore, most of these explicit memory modules face the same limitations as segment-based Transformers (discussed above). For a detailed discussion of these models, we refer to the recent work by Y. Wang, Han et al. (2024).

A.3 Test Time Training and Fast Weight Programs

Memory design and memory enhancement. In the literature, a substantial amount of research has been dedicated to designing memory modules that either abstract knowledge (e.g., persistent memory) (Sukhbaatar, Grave et al., 2019) or memorize data-dependent information (also known as contextual memory), by means of recurrence (Bulatov, Yury Kuratov, and Burtsev, 2022; Rodkin et al., 2024; Zancato et al., 2024), Transformers (Berges et al., 2024; Cetin et al., 2024; Feng et al., 2022; Le, Tran, and Venkatesh, 2020; Munkhdalai, Sordoni et al., 2019), or gradients (Irie, Csordás, and Jürgen Schmidhuber, 2021; Munkhdalai, Sordoni et al., 2019; Munkhdalai and H. Yu, 2017; Schlag, Irie, and Jürgen Schmidhuber, 2021; JH Schmidhuber, 1992; S. Yang, Kautz, and Hatamizadeh, 2024; S. Yang, B. Wang, Yu Zhang et al., 2024). However, all of these memory models (1) are based on instantaneous surprise, lacking the flow of data across the sequence; (2) lack a forgetting mechanism to erase memories, leading to rapid memory overflow; (3) are shallow (matrix-valued) memories of fixed size, resulting in poor performance in long contexts; and/or (4) have fixed parameters at test time, lacking test-time adaptability.

Fast Weight Programs. This family of models has been extensively studied in the literature (Irie, Schlag et al., 2021; Munkhdalai, Sordoni et al., 2019; Munkhdalai and H. Yu, 2017; Schlag, Irie, and Jürgen Schmidhuber, 2021; JH Schmidhuber, 1992; S. Yang, Kautz, and Hatamizadeh, 2024; S. Yang, B. Wang, Yu Zhang et al., 2024). However, all of these models are based on instantaneous surprise, lacking the flow of tokens across the sequence (see Section 3.1), and most lack a forgetting gate, leading to poor memory management.

Test Time Training. The key idea of learning at test time, or learning to learn (see, e.g., Andrychowicz et al., 2016), can be traced back to early work on local learning by Bottou and Vapnik (1992), in which a model is trained on the neighbors of each test sample before making a prediction (Gandelsman et al., 2022; H. Zhang et al., 2006), an approach that has shown promising performance in vision tasks (Jain and Learned-Miller, 2011; Mullapudi et al., 2019). The studies closest to ours in this direction are MNM (Munkhdalai, Sordoni et al., 2019) and TTT layers (Yu Sun et al., 2024); we discuss their key differences from our approach in Appendix C.

B Language Models and Common Sense Reasoning Datasets

Recent studies on linear recurrent models (Dao and Gu, 2024; S. Yang, Kautz, and Hatamizadeh, 2024; S. Yang, B. Wang, Yu Zhang et al., 2024) have utilized Wikitext (Merity et al., 2017), LMB (Paperno et al., 2016), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (ARC-e) and ARC-challenge (ARC-c) (P. Clark et al., 2018), SIQA (Sap et al., 2019) and BoolQ (C. Clark et al., 2019). Additionally, the benchmark results for the 400M model were reported by S. Yang, Kautz, and Hatamizadeh (2024).

C Long-term Memory Module (LMM) as a Sequence Model

In this section, we discuss how the LMM connects with modern linear recurrent models. For simplicity, we start with a linear memory, where M_t = W_t ∈ ℝ^(d_in × d_in). In this case, our objective function becomes ℓ(M; x_t) = ½ ‖M_t k_t − v_t‖₂², which we optimize using gradient descent with momentum and weight decay. Thus, revisiting the recurrent formula in Equation 13:

M_t = (1 − α_t) M_{t−1} + S_t,        (32)

S_t = η_t S_{t−1} − θ_t (M_{t−1} k_t − v_t) k_tᵀ.        (33)
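To make the recurrence concrete, here is a minimal PyTorch sketch of Equations 32–33 for the linear-memory case. It is an illustration only, not the authors' implementation (which, per the conclusion, exists in PyTorch and JAX but has not yet been released); the function name lmm_linear_step, the scalar gates alpha, eta, theta, and the toy dimensions are our own assumptions.

```python
# Minimal sketch (not the official Titans code) of the linear-memory LMM
# recurrence in Equations 32-33, assuming per-token scalar gates.
import torch

def lmm_linear_step(M, S, k, v, alpha, eta, theta):
    """One test-time update of a matrix-valued memory M with momentum S.

    M, S: (d, d) memory and momentum ("surprise") states
    k, v: (d,) key and value projections of the current token
    alpha, eta, theta: forgetting gate, momentum gate, and step size
    """
    # Gradient of 0.5 * ||M k - v||^2 with respect to M is (M k - v) k^T.
    grad = torch.outer(M @ k - v, k)
    # Eq. 33: momentum combines past surprise with instantaneous surprise.
    S = eta * S - theta * grad
    # Eq. 32: weight decay (forgetting) plus the new surprise.
    M = (1.0 - alpha) * M + S
    return M, S

# Usage: scan a toy sequence of (k_t, v_t) pairs.
d, T = 8, 16
M, S = torch.zeros(d, d), torch.zeros(d, d)
keys, values = torch.randn(T, d), torch.randn(T, d)
for k_t, v_t in zip(keys, values):
    M, S = lmm_linear_step(M, S, k_t, v_t, alpha=0.01, eta=0.9, theta=0.1)
y = M @ keys[-1]  # read-out: query the memory with a key
```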

LMM generalizes Gated DeltaNet. Following the discussion by S. Yang, Kautz, and Hatamizadeh (2024), DeltaNet (S. Yang, B. Wang, Yu Zhang et al., 2024) can be interpreted as optimizing the online learning problem L = ½ ‖S_t k_t − v_t‖₂², leading to:

S_t = S_{t−1} (I − θ_t k_t k_tᵀ) + θ_t v_t k_tᵀ.        (34)

Gated DeltaNet uses the same update but adds an extra weight-decay term (S. Yang, Kautz, and Hatamizadeh, 2024). Comparing Equation 32 and Equation 34, we can see that setting η_t = 0 makes the two updates equivalent. Therefore, LMM generalizes the recent line of work on Gated DeltaNet in three aspects:
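This equivalence can be checked numerically. The sketch below (again illustrative; deltanet_step and the toy sizes are our own naming) applies one LMM step with η_t = 0 and α_t = 0 and compares it against one step of Equation 34 started from the same state.

```python
# Small numerical check that the LMM recurrence reduces to the DeltaNet-style
# update of Equation 34 when the momentum gate eta_t is zero (and, for the
# non-gated variant compared here, the decay alpha_t is also zero).
import torch

def deltanet_step(S, k, v, theta):
    # Eq. 34: S_t = S_{t-1} (I - theta_t k_t k_t^T) + theta_t v_t k_t^T
    d = S.shape[0]
    return S @ (torch.eye(d) - theta * torch.outer(k, k)) + theta * torch.outer(v, k)

d, theta = 8, 0.1
k, v = torch.randn(d), torch.randn(d)
M = torch.randn(d, d)

# LMM step with eta = 0 and alpha = 0: M_t = M_{t-1} - theta * (M_{t-1} k - v) k^T
M_lmm = M - theta * torch.outer(M @ k - v, k)

# DeltaNet step starting from the same state.
M_delta = deltanet_step(M, k, v, theta)

print(torch.allclose(M_lmm, M_delta))  # expected: True (up to float tolerance)
```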

1. Momentum-based Rules: The Delta rule is based on instantaneous surprise, meaning that the flow of tokens cannot influence the memory update rules. However, LMM is based on a momentum rule, considering both past and instantaneous surprises.

2. Deep Memory: While Gated DeltaNet is limited to a linear (matrix-valued) memory, since it requires a closed recurrent form, LMM achieves greater expressiveness by using a deep memory module (see the sketch after this list).

3. Non-linear Recurrence: While DeltaNet and Gated DeltaNet are based on linear recurrence, our LMM uses non-linear recurrence across chunks and linear recurrence within each chunk. This design gives LMM higher expressive power.
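To illustrate points 1 and 2 above, here is a hedged sketch of a deep memory: a small MLP whose weights are updated at test time by momentum-based gradient descent with weight decay on the associative loss ½‖M(k_t) − v_t‖². The two-layer shape, the SiLU activation, the scalar gate values, and the helper deep_memory_step are illustrative assumptions; the actual Titans module additionally uses data-dependent gates and a chunk-wise parallel training scheme.

```python
# Hedged sketch of a deep (MLP-valued) memory updated at test time with
# momentum and weight decay; sizes and gate values are illustrative only.
import torch
import torch.nn as nn

d = 8
memory = nn.Sequential(nn.Linear(d, d, bias=False), nn.SiLU(), nn.Linear(d, d, bias=False))
momentum = [torch.zeros_like(p) for p in memory.parameters()]
alpha, eta, theta = 0.01, 0.9, 0.1  # forgetting, momentum, step size (scalars here)

def deep_memory_step(k, v):
    # Instantaneous surprise: gradient of the associative loss w.r.t. memory weights.
    loss = 0.5 * (memory(k) - v).pow(2).sum()
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, s, g in zip(memory.parameters(), momentum, grads):
            s.mul_(eta).add_(g, alpha=-theta)   # past surprise + instantaneous surprise
            p.mul_(1.0 - alpha).add_(s)         # weight decay acts as a forgetting gate

keys, values = torch.randn(16, d), torch.randn(16, d)
for k_t, v_t in zip(keys, values):
    deep_memory_step(k_t, v_t)
retrieved = memory(keys[0])  # query the (now updated) deep memory
```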

Here, we discussed Gated DeltaNet as a representative of the latest generation of recurrent models. Similar approaches, such as RWKV-7 (Peng, 2021), use similar formulations and loss functions, and so LMM generalizes such models as well.

Finally, a key distinction between the above (and other recent linear recurrent studies) and our work is that the hybrid variants of modern linear models, such as Griffin (De et al., 2024), DeltaNet (S. Yang, B. Wang, Yu Zhang et al., 2024), Gated DeltaNet (S. Yang, Kautz, and Hatamizadeh, 2024), H3 (Fu et al., 2023), Mamba2 (Dao and Gu, 2024), and Samba (Ren et al., 2024), are all based on sequential layer-wise designs. We propose Titans to show how these memory modules can be effectively incorporated into an architecture.
