A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions
This article delves into the development of Retrieval-Augmented Generation (RAG), from basic concepts to the latest technologies. RAG effectively enhances output accuracy by combining retrieval and generation models, overcoming the limitations of LLMs. The study details the architecture of RAG, demonstrating how retrieval and generation work together to handle knowledge-intensive tasks. Additionally, this article reviews key technological advancements in RAG in areas such as question answering and summarization, and discusses new methods to improve retrieval efficiency. At the same time, the article points out the challenges RAG faces in scalability, bias, and ethics, proposing future research directions to enhance model robustness, expand application scope, and focus on social impacts. This survey aims to provide a foundational guide for researchers and practitioners in the NLP field to better understand the potential of RAG and its developmental path.
https://arxiv.org/abs/2410.12837
1. Introduction
1.1 Overview of Retrieval-Augmented Generation (RAG)
RAG (Retrieval-Augmented Generation) integrates two core components:
- (i) The retrieval module, responsible for retrieving relevant documents or information from external knowledge bases, using dense vector representations to identify relevant documents from large datasets such as Wikipedia or private databases.
- (ii) The generation module, which processes this information to produce human-like text. The retrieved documents are subsequently sent to the generation module, which is typically built on a transformer architecture.
RAG helps to reduce the phenomenon of “hallucination” in generated content, ensuring that the text is more factual and contextually appropriate. RAG has been widely applied in various fields, including:
- Open-domain question answering
- Conversational agents
- Personalized recommendations
1.2 New Systems of Hybrid Retrieval and Generation
Before the emergence of RAG, natural language processing (NLP) primarily relied on either retrieval or generation methods.
- Retrieval-based systems: For example, traditional information retrieval engines that efficiently provide relevant documents or snippets based on queries but cannot synthesize new information or present results in a coherent narrative form.
- Generation-based systems: With the rise of transformer architectures, pure generative models have gained popularity for their fluency and creativity but often lack factual accuracy.
The complementarity of these two methods has led to attempts at hybrid systems that combine retrieval and generation. The earliest hybrid systems can be traced back to DrQA, which used retrieval techniques to obtain relevant documents for question-answering tasks.
1.3 Limitations of RAG
- Retrieval errors may still occur with vague queries or specialized knowledge domains. Relying on dense vector representations, such as those used by DPR (Dense Passage Retrieval), can sometimes retrieve irrelevant or off-topic documents. More refined query expansion and context disambiguation techniques are therefore needed to improve retrieval precision.
- In theory the combination of retrieval and generation should be seamless, but in practice the generation module sometimes struggles to integrate the retrieved information into its responses, leading to inconsistencies or incoherence between the retrieved facts and the generated text.
- Computational cost is also a significant concern, because both retrieval and generation steps must be executed for each query, which is especially resource-intensive for large-scale applications. Techniques such as model pruning or knowledge distillation may help reduce the computational burden without sacrificing performance.
- Ethical issues, particularly bias and transparency, are a further concern. Biases in AI and LLMs are a widely researched and evolving area, with researchers identifying various types of bias, including those related to gender, socioeconomic status, and educational background. While RAG has the potential to reduce bias by retrieving more balanced information, there remains the risk of amplifying biases present in the retrieval sources. Moreover, transparency in how retrieval results are selected and used is crucial for maintaining trust in these systems.
2. Overview of Core Components and Architecture of RAG Systems
2.1 Overview of RAG Models
RAG models consist of two core components:
- The retriever: Utilizes techniques like Dense Passage Retrieval (DPR) or the traditional BM25 algorithm to retrieve the most relevant documents from the corpus.
- The generator: Integrates the retrieved documents into coherent, contextually relevant answers.
The strength of RAG lies in its ability to draw on external knowledge dynamically, which lets it outperform generative models such as GPT-3 that rely only on static training data.
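The division of labour between the two components can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's implementation: the word-overlap scorer is a toy stand-in for a real retriever such as BM25 or DPR, `build_prompt` only assembles the text a generator would receive, and all function names are hypothetical.

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: number of distinct query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words)


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents (stand-in for BM25 or DPR)."""
    ranked = sorted(corpus, key=lambda doc: overlap_score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Concatenate retrieved passages and the question into one generator input."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


corpus = [
    "Paris is the capital of France.",
    "BM25 is a ranking function used in search engines.",
    "The Eiffel Tower is located in Paris.",
]
query = "What is the capital of France?"
top_passages = retrieve(query, corpus, k=2)
prompt = build_prompt(query, top_passages)
```

In a full system, `prompt` would be passed to a generator model; that call is omitted here because it depends on the model and serving stack in use.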
2.2 Retriever in RAG Systems
2.2.1 BM25
BM25 is a widely used information retrieval algorithm that ranks documents by relevance. It builds on term frequency-inverse document frequency (TF-IDF) weighting, adding term-frequency saturation and document-length normalization. Despite being a classic method, it remains a standard algorithm in many modern retrieval systems, including those used in RAG models.
BM25 calculates the relevance score of documents based on the frequency of query terms in the documents while considering document length and term frequency across the entire corpus. Although BM25 performs excellently in keyword matching, it has limitations in understanding semantic meanings. For instance, BM25 cannot capture relationships between words and performs poorly when handling complex natural language queries that require contextual understanding.
However, BM25 is widely adopted due to its simplicity and efficiency. It is suitable for keyword-based simple query tasks, although modern retrieval models like DPR often outperform it in handling semantically complex tasks.
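The scoring described above can be made concrete with a small sketch of the Okapi BM25 formula. The whitespace tokenizer, the smoothed IDF variant, and the parameter values `k1 = 1.5` and `b = 0.75` are assumptions chosen for illustration (they are common defaults, not values given in the survey).

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative for common terms).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply today",
]
scores = bm25_scores("cat mat", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

The first document matches both query terms and ranks highest, while the third shares no terms and scores zero. A query for "feline" would also score zero against every document, which is the semantic limitation noted above.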
2.2.2 Dense Passage Retrieval (DPR)
Dense Passage Retrieval (DPR) is a more recent information retrieval method. It encodes both queries and documents as dense, high-dimensional vectors in a shared vector space.
Employing a dual-encoder architecture, it encodes queries and documents separately, enabling efficient nearest neighbor search.
Unlike BM25, DPR excels at capturing semantic similarity between queries and documents, making it highly effective in open-domain question answering tasks.
The advantage of DPR lies in its ability to retrieve relevant information based on semantic meaning rather than keyword matching. By training the retriever on a large corpus of question-answer pairs, DPR can find documents relevant to the query context, even if the query and documents do not use the exact same vocabulary. Recent research has further optimized DPR by combining it with pre-trained language models.
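The retrieval step of a dual-encoder system can be sketched as follows. The sketch assumes the two encoders have already produced the vectors: the three-dimensional embeddings and passage identifiers are invented for illustration, and the exhaustive search stands in for the approximate nearest neighbor index a production system would typically use.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Inner product, used here as the similarity between query and passage vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: list[float], passage_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Exhaustive maximum-inner-product search over precomputed passage vectors."""
    ranked = sorted(passage_vecs, key=lambda pid: dot(query_vec, passage_vecs[pid]), reverse=True)
    return ranked[:k]


# Hypothetical 3-dimensional embeddings; a real system would obtain them from
# the passage encoder offline and from the query encoder at request time.
passage_vecs = {
    "p_heart_attack": [0.9, 0.1, 0.0],
    "p_stock_market": [0.0, 0.2, 0.9],
    "p_pasta_recipe": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # e.g. an encoding of "myocardial infarction symptoms"

result = top_k(query_vec, passage_vecs, k=2)
```

Because passages are encoded independently of the query, their vectors can be computed and indexed once, which is what makes the nearest neighbor search efficient at query time.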
2.2.3 REALM (Retrieval-Augmented Language Model)
REALM integrates the retrieval process into the pre-training of language models, ensuring that the retriever and generator are co-optimized for subsequent tasks.
The innovation of REALM lies in its ability to learn to retrieve documents that enhance the model’s performance on specific tasks, such as question answering or document summarization.
During training, REALM synchronously updates the retriever and generator, optimizing the retrieval process to better serve text generation tasks.
The retriever in REALM is trained to identify documents that are both relevant to the query and assist in generating accurate, coherent answers. Therefore, REALM significantly improves the quality of generated answers, especially in tasks that require reliance on external knowledge.
Recent studies show that in certain knowledge-intensive tasks, REALM outperforms BM25 and DPR, particularly in scenarios where retrieval and generation are closely integrated.
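The joint optimization can be written compactly. The following is a sketch of the usual retrieve-then-predict formulation; the symbols $x$ (input), $y$ (output), $z$ (a retrieved document), $\mathcal{Z}$ (the corpus), and the embedding functions are notation introduced here for illustration, not taken from this survey.

```latex
% The output distribution marginalizes over retrieved documents z:
p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x)\, p(z \mid x)

% The retrieval distribution is a softmax over inner-product relevance scores:
p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z' \in \mathcal{Z}} \exp f(x, z')},
\qquad
f(x, z) = \mathrm{Embed}_{\mathrm{input}}(x)^{\top}\, \mathrm{Embed}_{\mathrm{doc}}(z)
```

Since $p(z \mid x)$ appears inside the training objective, gradients of the log-likelihood flow into the retriever's embeddings, which is how retrieval is rewarded for documents that improve the prediction. In practice the sum is approximated over the top-scoring documents rather than the whole corpus.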
The essence of RAG lies in the quality of retrieved passages, but many existing methods rely on similarity-based retrieval (Mallen et al., 2022).
Self-RAG and REPLUG enhance retrieval capabilities by leveraging large language models (LLMs), achieving more flexible retrieval.
After initial retrieval, cross-encoder models re-rank the results by jointly encoding queries and retrieved documents to compute relevance scores. Although these models provide richer context-aware retrieval, they come with higher computational costs.
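The two-stage pattern can be sketched as follows. The `toy_pair_scorer` function is a hypothetical stand-in for a trained cross-encoder: it sees the query and document together, as a cross-encoder does, but it only counts shared adjacent word pairs and is not a neural model.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           pair_scorer: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    """Re-order first-stage candidates with a scorer that sees query and document together."""
    scored = [(pair_scorer(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_pair_scorer(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of adjacent query word pairs found in the document."""
    q = query.lower().split()
    d = doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    query_bigrams = list(zip(q, q[1:]))
    if not query_bigrams:
        return 0.0
    return sum(bg in doc_bigrams for bg in query_bigrams) / len(query_bigrams)


# Candidates as a fast first-stage retriever might return them (order is illustrative).
candidates = [
    "york is a new city guide for tourists",
    "a guide to new york city museums",
    "city planning in the new century",
]
reranked = rerank("new york city", candidates, toy_pair_scorer, top_n=2)
```

The scorer runs once per (query, document) pair, so its cost grows with the number of candidates. This is why such models are usually applied only to the short list produced by a cheaper first-stage retriever.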
RAG systems utilize the self-attention mechanism in LLMs to manage the context and relevance of different parts of the input and retrieved text. When integrating retrieval information into the generation model, a cross-attention mechanism is employed to ensure that the most relevant information segments are highlighted during generation.
2.3 Generator in RAG Systems
In RAG, the generator is the key link that integrates the retrieved information with the input query to generate the final output.
Once the retrieval component extracts relevant knowledge from external resources, the generator weaves this information into coherent, contextually appropriate responses. Large language models (LLMs) form the core of the generator, ensuring that the generated text is fluent, accurate, and consistent with the original query.
2.3.1 T5
T5 (Text-to-Text Transfer Transformer) is one of the commonly used models for generation tasks in RAG systems.
The flexibility of T5 lies in its treatment of all NLP tasks as text-to-text tasks. This unified framework allows T5 to be fine-tuned for a wide range of tasks, including question answering, summarization, and dialogue generation.
By integrating retrieval and generation, T5-based RAG models have surpassed traditional generative models like GPT-3 and BART in multiple benchmark tests, particularly on the Natural Questions and TriviaQA datasets.
Furthermore, T5’s capability in handling complex multi-task learning makes it the preferred choice for RAG systems that need to tackle diverse knowledge-intensive tasks.
2.3.2 BART
BART (Bidirectional and Auto-Regressive Transformers) is particularly suited for tasks that involve generating text from noisy inputs, such as summarization and open-domain question answering.
As a denoising autoencoder, BART can reconstruct corrupted text sequences, making it excel in tasks that require generating coherent, factual outputs from incomplete or noisy data.
When combined with the retriever in RAG systems, BART has been shown to enhance the factual accuracy of generated text through external knowledge.
3. Cross-Modal Retrieval-Augmented Generation Models
3.1 Text-Based RAG Models
Text-based RAG models are currently the most mature and widely researched type.
Relying on textual data, they perform retrieval and generation tasks, driving the development of applications such as question answering, summarization, and conversational agents.
Transformers such as BERT and T5 form the foundation of text RAG models, utilizing self-attention mechanisms to capture contextual relationships within the text, thereby enhancing retrieval accuracy and generation fluency.
3.2 Audio-Based RAG Models
Audio-based RAG models extend the concept of retrieval-augmented generation into the audio domain, paving the way for applications in speech recognition, audio summarization, and conversational agents in voice interfaces. Audio data is often represented through embeddings derived from pre-trained models like Wav2Vec 2.0. These embeddings serve as inputs for the retrieval and generation components, enabling the model to effectively process audio data.
3.3 Video-Based RAG Models
Video-based RAG models integrate visual and textual information, enhancing performance in tasks such as video understanding, caption generation, and retrieval. Video data is represented through embeddings from models like I3D and TimeSformer. These embeddings capture temporal and spatial features, which are critical for effective retrieval and generation.
3.4 Multimodal RAG Models
Multimodal RAG models combine data from various modalities, including text, audio, video, and images, providing a more comprehensive approach to retrieval and generation tasks.
Models like Flamingo integrate different modalities into a unified framework, achieving simultaneous processing of text, images, and video. Cross-modal retrieval techniques involve retrieving relevant information across different modalities.
“Retrieval as generation” extends the retrieval-augmented generation (RAG) framework to multimodal applications by combining text-to-image and image-to-text retrieval, allowing for rapid image generation when user queries match stored textual descriptions.
4. Overview of Existing RAG Frameworks
Agent-Based RAG
A new agent-based retrieval-augmented generation (RAG) framework adopts a hierarchical multi-agent structure, in which sub-agents use small pre-trained language models (SLMs) fine-tuned for specific time-series tasks. The main agent assigns tasks to these sub-agents, which retrieve relevant prompts from a shared knowledge base. This modular multi-agent approach achieves high performance in time-series analysis and is more flexible and efficient than task-specific methods.
RULE
RULE is a multimodal RAG framework aimed at enhancing the factual accuracy of medical large vision-language models (Med-LVLMs). It introduces a calibrated selection strategy to control factuality risk and a preference optimization strategy to balance the model’s intrinsic knowledge against the retrieved context. Both are reported to improve the factual accuracy of Med-LVLM systems.
METRAG
METRAG, a multi-level, thoughts-enhanced retrieval-augmented generation framework, combines document similarity and utility to enhance performance. It includes a task-adaptive summarizer that produces distilled content summaries. By reflecting across these stages, LLMs generate knowledge-enhanced content, demonstrating superior performance in knowledge-intensive tasks compared to traditional methods.
RAFT (Retrieval Augmented Fine-Tuning)
Distractor documents are a key feature of retrieval-augmented fine-tuning (RAFT) (Zhang et al., 2024): the model is trained to ignore documents that do not help answer the question while citing directly from the relevant ones. Combining this with chain-of-thought reasoning further strengthens the model’s reasoning capabilities. As a post-training enhancement for LLMs, RAFT shows consistent performance improvements in domain-specific RAG tasks, including the PubMed, HotpotQA, and Gorilla datasets.
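How such a training example might be assembled can be sketched as follows. The function name, the prompt layout, the number of distractors, and the probability `p_oracle` of including the relevant ("oracle") document are all assumptions for illustration; the RAFT paper should be consulted for the actual data recipe.

```python
import random


def build_raft_example(question, answer, oracle_doc, distractor_pool,
                       num_distractors=2, p_oracle=0.8, rng=None):
    """Assemble one fine-tuning example: distractors plus, with probability p_oracle, the oracle document."""
    rng = rng or random.Random()
    docs = rng.sample(distractor_pool, num_distractors)
    has_oracle = rng.random() < p_oracle
    if has_oracle:
        docs.append(oracle_doc)
    rng.shuffle(docs)  # avoid teaching the model that position signals relevance
    context = "\n".join(f"Document {i + 1}: {d}" for i, d in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "target": answer,  # in practice, a reasoning chain that quotes the relevant document
        "has_oracle": has_oracle,
    }


pool = [
    "Bananas are rich in potassium.",
    "The Nile flows north.",
    "Jazz originated in New Orleans.",
]
oracle = "The Eiffel Tower was completed in 1889."
example = build_raft_example(
    "When was the Eiffel Tower completed?", "1889", oracle, pool,
    p_oracle=1.0, rng=random.Random(0),
)
```

Setting `p_oracle` below 1.0 leaves some examples with distractors only, so the model cannot assume that the context always contains the answer.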
FILCO
FILCO aims to improve the quality of the context provided to generative models in tasks such as open-domain question answering and fact verification. It addresses over-reliance or under-reliance on retrieved passages, which can lead to hallucinations in generated outputs. The approach identifies useful context with lexical and information-theoretic measures, then trains context filtering models that refine the retrieved context at test time.
Self-RAG
Reflection tokens are a key attribute of self-reflective retrieval-augmented generation (Self-RAG) (Asai et al., 2023), which improves the factual accuracy of large language models (LLMs) by combining retrieval with self-reflection. Unlike traditional methods, Self-RAG retrieves passages adaptively and uses reflection tokens to evaluate and refine its responses. This allows the model to adjust its behavior to specific task needs, and it demonstrates superior performance in open-domain question answering, reasoning, fact verification, and long-form generation tasks. The intelligence and effectiveness of RAG largely depend on the quality of retrieval, and a more nuanced understanding of the knowledge base will enhance the effectiveness of RAG systems.
MK Summary
MK Summary is a data-centric retrieval-augmented generation (RAG) workflow that goes beyond the traditional retrieve-then-read model, adopting a prepare-rewrite-retrieve-read framework to enhance LLMs by integrating contextually relevant, time-critical, or domain-specific information. Its innovations include generating metadata, synthesizing question-answer (QA) pairs, and introducing Meta Knowledge Summaries (MK Summaries) for document clusters.
CommunityKG-RAG
CommunityKG-RAG is a zero-shot framework that integrates community structures from knowledge graphs (KGs) into retrieval-augmented generation (RAG) systems. By leveraging multi-hop connections within KGs, it enhances the accuracy and contextual relevance of fact-checking, surpassing traditional methods without requiring additional domain-specific training.
RAPTOR
RAPTOR introduces a hierarchical approach to enhance retrieval-augmented language models, addressing the limitations of traditional methods that only retrieve short, contiguous text blocks. RAPTOR recursively embeds, clusters, and summarizes text, forming a summary tree from which information can be retrieved at varying levels of abstraction. Experiments show that RAPTOR performs strongly on question answering tasks that require complex reasoning. When paired with GPT-4, RAPTOR improved accuracy by 20% on the QuALITY benchmark.
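The idea of a summary tree can be sketched in simplified form. Two substitutions are made here and should be read as assumptions: adjacent fixed-size groups replace the embedding-based clustering described above, and `toy_summarize` is a hypothetical stand-in for an LLM summarizer.

```python
from typing import Callable


def build_summary_tree(chunks: list[str], summarize: Callable[[list[str]], str],
                       group_size: int = 2) -> list[list[str]]:
    """Return tree levels, from leaf chunks (level 0) up to a single root summary."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    if not chunks:
        return []
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Assumption: adjacent fixed-size groups replace embedding-based clustering.
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels


def toy_summarize(texts: list[str]) -> str:
    """Stand-in for an LLM summarizer: keep the first three words of each text."""
    return " / ".join(" ".join(t.split()[:3]) for t in texts)


chunks = [
    "Cinderella lives with her stepmother.",
    "A fairy godmother helps her attend the ball.",
    "She loses a glass slipper at midnight.",
    "The prince finds her using the slipper.",
]
levels = build_summary_tree(chunks, toy_summarize)
all_nodes = [node for level in levels for node in level]  # candidates at every abstraction level
```

Retrieval can then search `all_nodes` rather than the leaf chunks alone, so a broad question can match a high-level summary while a detailed question still matches an original chunk.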
4.1 Long-Context Based RAG Frameworks
Recently launched large language models (LLMs) that support long contexts, such as Gemini-1.5 and GPT-4, have significantly enhanced RAG performance.
Self-Route
Self-Route dynamically routes each query to either RAG or a long-context (LC) model based on the model’s own self-reflection, balancing computational cost against performance. It also provides insight into when RAG and LC are each the better choice for long-context tasks.
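The routing logic can be sketched as follows. The prompt wording, the `UNANSWERABLE` marker, and the `fake_llm` stub are assumptions made for illustration; a real deployment would replace the stub with an actual model call.

```python
UNANSWERABLE = "unanswerable"


def self_route(query, retrieved_chunks, full_document, llm):
    """Try the cheap RAG prompt first; fall back to the full long context only if the model declines."""
    rag_prompt = (
        "Answer using only the passages below. "
        f"If they are insufficient, reply '{UNANSWERABLE}'.\n\n"
        + "\n".join(retrieved_chunks)
        + f"\n\nQuestion: {query}"
    )
    rag_answer = llm(rag_prompt)
    if rag_answer.strip().lower() != UNANSWERABLE:
        return rag_answer, "rag"
    lc_prompt = f"{full_document}\n\nQuestion: {query}"
    return llm(lc_prompt), "long_context"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call: it 'knows' the answer only if the key fact is in the prompt."""
    return "1889" if "completed in 1889" in prompt else UNANSWERABLE


document = "The tower stands in Paris. It was completed in 1889. It is made of iron."
question = "When was the tower completed?"
hit = self_route(question, ["It was completed in 1889."], document, fake_llm)
miss = self_route(question, ["It is made of iron."], document, fake_llm)
```

Only queries that the model declines to answer from the retrieved chunks incur the cost of the long-context call, which is where the savings come from.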
SFR-RAG
SFR-RAG is a compact and efficient RAG model designed to enhance LLMs’ integration of external contextual information while reducing hallucination phenomena.
LA-RAG
LA-RAG is a new RAG paradigm aimed at enhancing automatic speech recognition (ASR) capabilities within LLMs. Its highlight is the use of a fine-grained, token-level speech datastore and a speech-to-speech retrieval mechanism to improve ASR accuracy through the LLM’s in-context learning.
HyPA-RAG
LLMs face challenges in AI law and policy contexts due to outdated knowledge and hallucinations. HyPA-RAG is a hybrid parameter adaptive retrieval-augmented generation system that improves accuracy through adaptive parameter adjustments and hybrid retrieval strategies. In tests on NYC Local Law 144, HyPA-RAG demonstrated higher correctness and contextual accuracy, effectively addressing the complexities of legal texts.
MemoRAG
MemoRAG introduces a new RAG paradigm that overcomes the limitations of traditional RAG systems in handling ambiguous or unstructured knowledge. Its dual-system architecture uses a lightweight long-range LLM to generate draft answers and guide the retrieval tools, while a more powerful LLM refines the final output. The framework is optimized for better clue generation and memory capacity, significantly outperforming traditional RAG models on both complex and simple tasks.
NLLB-E5
NLLB-E5 introduces a scalable multilingual retrieval model that addresses the challenges of supporting multiple languages, especially low-resource languages like Hindi. By combining the NLLB encoder with a distillation approach based on the multilingual E5 retriever, NLLB-E5 achieves zero-shot retrieval across languages without multilingual training data. Evaluations on benchmarks like Hindi-BEIR demonstrate its strong performance, highlighting task-specific challenges and advancing global inclusivity in multilingual information retrieval.
5. Challenges and Limitations of RAG
• Scalability and Efficiency: A major challenge for RAG is its scalability. Given that the retrieval component relies on external databases, efficiently handling large and growing datasets requires effective retrieval algorithms. High computational and memory demands also make it difficult to deploy RAG models in real-time or resource-constrained environments.
• Quality and Relevance of Retrieval: Ensuring the quality and relevance of retrieved documents is an important issue. Retrieval models may sometimes return irrelevant or outdated information, which can lower the accuracy of generated content. Particularly in long-form content generation, improving retrieval precision remains a hot research topic.
• Bias and Fairness: Like other machine learning models, RAG systems may exhibit biases due to biases in their retrieval datasets. Retrieval-based models may amplify harmful biases in the knowledge retrieved, leading to biased outputs. Developing bias mitigation techniques for retrieval and generation is an ongoing challenge.
• Coherence: RAG models often encounter difficulties in integrating retrieved knowledge into coherent and contextually relevant text. The connection between retrieved content and generated model outputs is not always seamless, potentially leading to inconsistencies or factual hallucinations in the final answers.
• Interpretability and Transparency: Like many AI systems, RAG models are often viewed as opaque black-box operations.
6. Future Directions
6.1 Strengthening Multimodal Fusion
Integrating text, image, audio, and video data in RAG models requires focusing on enhancing multimodal fusion technologies to achieve seamless interaction between different data types, including:
• Developing more advanced methods for aligning and synthesizing cross-modal information.
• Enhancing the coherence and contextual adaptability of multimodal outputs, which still calls for further innovation.
• Improving the ability of RAG systems to retrieve relevant information across different modalities. For example, combining text-based queries with image or video content retrieval can enhance applications like visual question answering and multimedia search.
6.2 Scalability and Efficiency
As RAG models are deployed in broader large-scale applications, scalability becomes crucial. Research should focus on developing efficient methods for scaling retrieval and generation processes without sacrificing performance. Distributed computing and efficient indexing techniques are vital for handling large datasets. Enhancing the efficiency of RAG models requires optimizing retrieval and generation components to reduce computational resources and latency.
6.3 Personalization and Adaptability
Future RAG models should focus on personalizing the retrieval process based on individual user preferences and contexts. This includes developing techniques to adjust retrieval strategies based on user history, behavior, and preferences. Deepening the understanding of the context and sentiment of queries and document libraries is crucial for enhancing the contextual adaptability of RAG models to improve the relevance of generated responses. Research should explore methods for dynamically adjusting retrieval and generation processes based on interactive context, including integrating user feedback and contextual cues into the RAG workflow.
6.4 Ethical and Privacy Considerations
Addressing bias in RAG models is a key area for future research. As RAG systems are deployed in diverse applications, ensuring fairness and reducing biases in retrieved and generated content is essential. Future RAG research should also focus on privacy-preserving technologies to protect sensitive information during retrieval and generation. This includes developing secure data handling methods and privacy-aware retrieval strategies. The interpretability of models is another critical area for ongoing improvement in RAG research.
6.5 Cross-Language and Low-Resource Language Support
Expanding RAG technologies to support multiple languages, especially low-resource languages, is a promising direction for development.
Efforts should be made to enhance cross-language retrieval and generation capabilities, ensuring accurate and relevant results across different languages. Supporting low-resource languages effectively requires methods for content retrieval and generation that work with limited training data. Research should focus on transfer learning and data augmentation techniques to improve performance in underrepresented languages.
6.6 Advanced Retrieval Mechanisms
Future RAG research should explore dynamic retrieval mechanisms that adapt to changing query patterns and content needs. This includes building models that can dynamically adjust retrieval strategies based on new information and user demands.
Researching hybrid retrieval methods that combine dense and sparse retrieval strategies holds promise for enhancing RAG system performance. Future studies should focus on how to integrate diverse retrieval approaches to adapt to various tasks and achieve optimal performance.
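One simple way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). It is shown here only as an example of a hybrid strategy, not as a method the survey prescribes; the constant `k = 60` is a commonly used default and the document identifiers are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists that contain it."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


sparse_ranking = ["d1", "d2", "d3"]  # e.g. from BM25
dense_ranking = ["d3", "d1", "d4"]   # e.g. from a dense retriever
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF uses only rank positions, so it needs no calibration between BM25 scores and embedding similarities, which live on different scales. Documents ranked well by both retrievers, such as `d1` and `d3` here, rise to the top.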
6.7 Integration with Emerging Technologies
Combining RAG models with brain-computer interfaces (BCIs) could open new applications in human-computer interaction and assistive technologies. Research should investigate how RAG systems can leverage BCI data to enhance user experience and generate context-aware responses. The integration of RAG with augmented reality (AR) and virtual reality (VR) technologies presents opportunities for creating immersive interactive experiences. Future research should explore how RAG models can be utilized to enhance AR and VR applications by providing contextually relevant information and interactions to improve user experiences.
Original paper: https://arxiv.org/abs/2410.12837