
The MLNLP community is a well-known machine learning and natural language processing community, covering NLP graduate students, university teachers, and corporate researchers at home and abroad.
The vision of the community is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners.
Reprinted from | Machine Heart
Memory3 with 2.4B parameters outperforms larger LLMs and RAG models.
In recent years, large language models (LLMs) have received unprecedented attention due to their extraordinary performance. However, the training and inference costs of LLMs are high, and various optimization methods have been attempted to reduce these costs.
This paper comes from researchers at the Shanghai Algorithm Innovation Research Institute, Peking University, and other institutions. Inspired by the memory hierarchy of the human brain, they reduce these costs by equipping LLMs with explicit memory, a memory format cheaper than model parameters and RAG. Conceptually, because most of their knowledge is externalized as explicit memory, LLMs can enjoy a smaller parameter size and lower training and inference costs.
- Paper link: https://arxiv.org/pdf/2407.01178
- Paper title: Memory3: Language Modeling with Explicit Memory
As a preliminary proof of concept, the researchers trained a 2.4B LLM from scratch, which achieved better performance than larger LLMs and RAG models, and realized a higher decoding speed than RAG. This model is named Memory3 because explicit memory in LLMs is the third form of memory following implicit memory (model parameters) and working memory (context key-value).
Specifically, this paper introduces a new memory format, namely explicit memory, characterized by relatively low write and read costs. As shown in Figure 1, the model first transforms the knowledge base (or any text dataset) into explicit memory realized as sparse attention key-values, and then calls these memories during inference and integrates them into the self-attention layer.
The new memory format also defines a new memory hierarchy, in which explicit memory sits between model parameters and retrieved plain text (as in RAG) in terms of read and write costs.
In addition, the paper introduces a memory circuit theory that supports knowledge externalization, a memory sparsification mechanism that keeps storage tractable, and a two-stage pre-training scheme that facilitates memory formation. In brief:
- Memory3 utilizes explicit memory during inference, alleviating the burden on model parameters to memorize specific knowledge;
- Explicit memory is encoded from the constructed knowledge base, and the sparse memory format keeps the actual storage size manageable;
- The researchers trained a Memory3 model with 2.4B non-embedding parameters from scratch, which outperformed larger SOTA models and achieved both better performance and faster inference than RAG;
- Memory3 improves factuality, reduces hallucination, and can quickly adapt to specialized tasks.
Method Introduction
The memory circuit theory helps to determine which knowledge can be stored as explicit memory and which model architecture is suitable for reading and writing explicit memory.
The researchers treat circuits as the internal mechanisms behind input-output relations and define a piece of knowledge as an input-output relation together with the circuit that realizes it. By manipulating these circuits, many pieces of knowledge can be extracted from the LLM while keeping its functionality intact.
Memory3: In terms of architecture, the goal of this paper is to design an explicit memory mechanism for Transformer LLMs, keeping both write and read costs relatively low. Moreover, this paper aims to limit modifications to the Transformer architecture as much as possible, without adding any new trainable parameters, so that most existing Transformer LLMs can be converted to Memory3 models with minimal fine-tuning. The simple design process is as follows:
Write cost: Before inference, the LLM writes each reference into explicit memory, stored on the drive. The memory is selected from the key-value vectors of the self-attention layer, so the write process does not involve training. Each reference is processed independently, avoiding the cost of long-context attention.
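To make the write path concrete, here is a minimal sketch, assuming a Hugging Face-style model whose forward pass exposes a per-layer key-value cache; the function name, the head list, and the token-selection rule are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of the write path (illustrative only, not the authors' code).
# Assumes an HF-style model whose forward pass returns a per-layer KV cache.
import torch

def write_explicit_memory(model, tokenizer, reference, memory_heads,
                          tokens_per_head=8, path="memory.pt"):
    """Encode one reference into sparse key-value memory and save it to disk."""
    ids = tokenizer(reference, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)          # no training: a single forward pass
    memory = {}
    for layer, head in memory_heads:              # only selected "memory heads" are kept
        k, v = out.past_key_values[layer]         # (batch, n_heads, seq_len, head_dim)
        k_h, v_h = k[0, head], v[0, head]         # this head's keys/values: (seq_len, head_dim)
        # Keep only a few tokens per head (largest key norm is a stand-in for the
        # paper's token-selection rule); this is what makes the memory sparse.
        keep = torch.topk(k_h.norm(dim=-1), k=min(tokens_per_head, k_h.size(0))).indices
        memory[(layer, head)] = (k_h[keep].half(), v_h[keep].half())
    torch.save(memory, path)                      # written to the drive before inference
```

Because each reference is encoded independently, the write pass never attends over long concatenated contexts, which is what keeps the write cost low.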
Read cost: During inference, explicit memories are retrieved from the drive and read by self-attention alongside the usual context key-values. Each memory consists of only a small number of key-value vectors from a small subset of attention heads, which greatly reduces the extra computation, GPU memory, drive storage, and loading time. This allows the LLM to frequently retrieve many references with limited impact on decoding speed.
The inference process is shown in Figure 9. Whenever the LLM generates 64 tokens, it discards the current memory, uses these 64 tokens as query text to retrieve 5 new memories, and continues decoding with these memories. Similarly, when processing prompts, the LLM retrieves 5 memories for each block of 64 tokens. Each block focuses on its own memory, and the memories may differ between different blocks.
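The retrieval schedule described above can be sketched as a simple decoding loop; `retriever` and `decode_step` below are hypothetical helpers standing in for vector retrieval and a single memory-augmented decoding step, so treat this as a sketch rather than the paper's pipeline.

```python
# Hedged sketch of the retrieval schedule: drop the current memories and fetch 5 new
# ones for every 64 generated tokens. `retriever` and `decode_step` are hypothetical.
import torch

CHUNK, NUM_MEMORIES = 64, 5

def generate_with_memory(model, tokenizer, prompt, retriever, decode_step,
                         max_new_tokens=256):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # memories for the last prompt chunk (each 64-token block gets its own memories)
    memories = retriever(tokenizer.decode(ids[0, -CHUNK:]), top_k=NUM_MEMORIES)
    for step in range(max_new_tokens):
        next_id = decode_step(model, ids, memories)    # one memory-augmented decoding step
        ids = torch.cat([ids, next_id], dim=-1)
        if (step + 1) % CHUNK == 0:                    # every 64 tokens: refresh memories
            query = tokenizer.decode(ids[0, -CHUNK:])  # the latest 64 tokens as query text
            memories = retriever(query, top_k=NUM_MEMORIES)
    return tokenizer.decode(ids[0])
```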
Writing and reading memory: During inference, the LLM reads the retrieved explicit memories directly through its self-attention layers by concatenating them with the context key-values (Figure 9). Specifically, for each attention head h at layer l that is selected as a memory head, its output Y^(l,h) is computed by attending over the concatenation of the retrieved memory key-values and the usual context key-values.
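The following is a minimal single-head sketch of what this "reading" looks like, using the generic concatenated-KV attention pattern; masking and positional details are omitted, and the exact formula is given in the paper.

```python
# Minimal single-head sketch: retrieved memory key-values are concatenated with the
# context key-values before the usual scaled dot-product attention. Masking and
# positional encoding are omitted for brevity.
import torch
import torch.nn.functional as F

def memory_head_attention(q, k_ctx, v_ctx, k_mem, v_mem):
    """q: (q_len, d); k_ctx, v_ctx: (ctx_len, d); k_mem, v_mem: (mem_len, d)."""
    k = torch.cat([k_mem, k_ctx], dim=0)          # memory keys come before context keys
    v = torch.cat([v_mem, v_ctx], dim=0)
    scores = q @ k.T / (q.size(-1) ** 0.5)        # scaled dot-product over memory + context
    return F.softmax(scores, dim=-1) @ v          # (q_len, d)
```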
In addition, the study adopts parallel positional encoding for all explicit memories, meaning that all key positions are located within the same interval of length 128, as shown in Figure 9.
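One plausible way to realize this parallel positional encoding is to assign the same position ids to every retrieved memory, so that all memories share one interval of length 128; the exact layout in the paper may differ, and the helper below only illustrates the "parallel" idea.

```python
# Illustrative position-id layout for parallel positional encoding: every retrieved
# memory reuses the same interval [0, 128), and the context follows after it.
# This is an assumption about the layout, shown only to convey the idea.
MEM_INTERVAL = 128

def build_position_ids(memory_lengths, ctx_len):
    mem_positions = [list(range(n)) for n in memory_lengths]   # each memory starts at 0
    ctx_positions = list(range(MEM_INTERVAL, MEM_INTERVAL + ctx_len))
    return mem_positions, ctx_positions
```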
Two-stage pre-training: The pre-training consists of two stages, warmup and continual training. Only the continual training stage involves explicit memory; the warmup stage uses the same format as regular pre-training.
Figure 13 shows the training loss and learning-rate schedule during the warmup stage.
Figure 14 shows the training loss and learning-rate schedule during the continual training stage.
Experimental Results
The researchers evaluated the general capabilities (benchmark tasks), conversational abilities, specialized capabilities (in law and medicine), and hallucination of the Memory3 model. Additionally, they measured the decoding speed of Memory3 and compared it with similar and larger SOTA LLMs and RAG models.
The evaluation results of general capabilities are shown below, indicating that explicit memory improved the average score by 2.51%. In contrast, the score gap between Llama2-7B and 13B is 4.91%. Explicit memory can increase the “effective model size” by 2.51/4.91 ≈ 51.1%.
The authors then evaluated the conversational skills of Memory3, with results listed in Table 18, showing that the model outperformed Vicuna-7B, Falcon-40B-Instruct, and ChatGLM2-6B with fewer parameters.
Currently, LLMs still face hallucination issues. Conceptually, Memory3 should be less susceptible to hallucination since its explicit memory directly corresponds to reference texts. To evaluate hallucination, the researchers selected two English datasets for assessment. The results are shown in Table 19, where Memory3 achieved the highest scores in most tasks.
One advantage of using explicit memory is that the LLM can easily adapt to new domains and tasks by updating its knowledge base: simply import task-related references into Memory3's knowledge base and, optionally, convert them into explicit memory ahead of time for a warm start. The model can then use this new knowledge for inference, bypassing the more costly and potentially detrimental fine-tuning process, and it runs faster than RAG. Figure 4 illustrates this cost reduction, which can facilitate the rapid deployment of LLMs across various industries.
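In code, this adaptation workflow amounts to updating the knowledge base rather than the weights. The sketch below assumes a hypothetical `knowledge_base` object and reuses the illustrative `write_explicit_memory` from the earlier sketch; it is not the authors' tooling.

```python
# Rough sketch of task adaptation without fine-tuning: new references are added to the
# retrieval index and optionally pre-encoded into explicit memory for a warm start.
# `knowledge_base` and its methods are hypothetical; write_explicit_memory is the
# illustrative helper sketched earlier.
def add_domain_references(knowledge_base, references, model=None, tokenizer=None,
                          memory_heads=None, precompute=False):
    for i, ref in enumerate(references):
        knowledge_base.index(ref)                      # make the reference retrievable
        if precompute:                                 # warm start: encode the memory now
            write_explicit_memory(model, tokenizer, ref, memory_heads,
                                  path=f"memory_{i}.pt")
```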
The table below shows that Memory3 outperforms most models on these specialized tasks.
Finally, the researchers evaluated the decoding speed, or throughput, of Memory3, measured as the number of tokens generated per second.
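For reference, throughput in this sense can be measured with a simple timer around a decoding loop; `generate_with_memory` here is the illustrative loop sketched earlier, not the authors' benchmark harness.

```python
# Simple tokens-per-second measurement around the (illustrative) decoding loop above.
import time

def measure_throughput(model, tokenizer, prompt, retriever, decode_step,
                       max_new_tokens=256):
    start = time.perf_counter()
    generate_with_memory(model, tokenizer, prompt, retriever, decode_step,
                         max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return max_new_tokens / elapsed                    # decoded tokens per second
```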
For more information, please refer to the original paper.
