Key Module Analysis of the Full RAG Pipeline

Original: https://zhuanlan.zhihu.com/p/682253496

Organizer: Qingke AI

1. Background Introduction

RAG (Retrieval-Augmented Generation) combines retrieval-based models with generative models to improve the quality and relevance of generated text. The method was proposed by Meta in the 2020 paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”[1]; it lets language models (LMs) draw on information beyond their internalized knowledge and answer questions more accurately from specialized knowledge bases. In the era of large models, it has also become a necessary technology for addressing their limitations, such as hallucination, stale knowledge, and long-context constraints.

2. Challenges of RAG

RAG mainly faces three challenges: retrieval quality, enhancement process, and generation quality.

2.1 Retrieval Quality

  • Semantic Ambiguity: Vector representations (e.g., word embeddings) may fail to capture subtle differences between concepts. For example, the term “apple” may refer to either a fruit or a technology company. Embeddings may confuse these meanings, leading to irrelevant results.

  • User Input Complexity: Unlike traditional keyword or phrase search, user input is no longer a few words or short phrases; it has shifted to natural, multi-turn conversational questions whose forms are more diverse, closely tied to context, and phrased more colloquially.

  • Document Segmentation: Documents are mainly segmented in two ways: by form, e.g., at punctuation and paragraph breaks, or by the meaning of the content. How these chunks are then converted into a form that computers can understand and compare (i.e., embeddings) affects how well they match user search content.

  • Extraction and Representation of Multi-modal Content (e.g., tables, charts, formulas, etc.): How to extract and dynamically represent multi-modal content is a practical problem currently faced, especially when dealing with ambiguous or negative queries, which significantly affects the performance of the RAG system.

2.2 Enhancement Process

  • Context Integration: The challenge here is to smoothly integrate the context of retrieved paragraphs with the current generation task. If not done well, the output may seem disjointed or lack coherence.

  • Redundancy and Repetition: If multiple retrieved paragraphs contain similar information, the generation step may produce repetitive content.

  • Ranking and Prioritization: Determining the importance or relevance of multiple retrieved paragraphs for the generation task can be challenging. The enhancement process must appropriately weigh the value of each paragraph.

2.3 Generation Quality

  • Over-reliance on Retrieved Content: Generative models may overly depend on augmented information, exacerbating hallucination issues rather than adding value or providing synthesis.

  • Irrelevance: The model-generated answer fails to address the query.

  • Toxicity or Bias: The model-generated answer is harmful or offensive.

3. Overall Architecture

3.1 Product Architecture

[Figure: Product architecture]

As the figure shows, the product architecture consists of the following four layers:

  • Model Layer (bottom). The model layer masks the differences between models, supporting not only self-developed sequence models but also open-source and third-party large models. Additionally, to optimize embedding effectiveness, a cross-language embedding model is proposed, which effectively addresses cross-language retrieval while improving model performance.

  • Offline Understanding Layer. This layer is designed around two modules: the intelligent knowledge base and search enhancement. The intelligent knowledge base is responsible for turning unstructured text into a retrievable knowledge base, including text parsing, table recognition, OCR, etc. Search enhancement ensures retrieval accuracy through modules such as query rewriting and re-ranking.

  • Online Q&A Layer, which supports multi-document, multi-turn, multi-modal, and security and rejection features, enhancing product competitiveness while meeting diverse user needs in different scenarios.

  • Scenario Layer, which pre-configures various scenario roles based on the characteristics of different industries, lowering the product usage threshold.

3.2 Technical Architecture

[Figure: Technical architecture]

To understand the retrieval-augmented generation framework, we divide it into three main components: query understanding, retrieval models, and generation models.

  • Query Understanding: This module aims to comprehend user queries or generate structured queries from user input, allowing both structured databases and unstructured data to be queried and thereby improving recall. It includes four parts: intent recognition, query rewriting, query expansion, and query reconstruction, each introduced in detail in later chapters.

  • Retrieval Model: This model aims to retrieve relevant information from a given document set or knowledge base. It typically uses information retrieval or semantic search techniques to identify the most relevant information for a given query. Retrieval-based models excel at finding accurate and specific information but cannot generate creative or novel content. Technically, the retrieval model mainly covers document loading, text transformation, embedding, etc., detailed in later chapters.

  • Generation Model: This model aims to generate new content from a given prompt or context. Current generative models can produce creative and coherent text but may struggle with factual accuracy or relevance to specific contexts. In the RAG framework, the generation model mainly covers the chat system (long-term and short-term memory), prompt optimization, etc., also introduced in later chapters.

In summary, retrieval-augmented generation combines the advantages of retrieval models and generation models, overcoming their respective limitations. In this framework, retrieval-based models are used to retrieve relevant information from a knowledge base or a set of documents based on a given query or context. The retrieved information then serves as input or additional context for the generation model. By integrating the retrieved information, the generation model can leverage the accuracy and specificity of the retrieval-based model to generate more relevant and accurate text. This helps the generation model to base itself on existing knowledge and produce text consistent with the retrieved information.

4. Query Understanding

Currently, RAG systems may retrieve content from the knowledge base that is irrelevant to the user's query. This stems from the following issues: (1) the phrasing of the user's question may hinder retrieval, and (2) it may be necessary to generate structured queries from the user's question. To address these issues, we introduce the query understanding module.

4.1 Intent Recognition

Intent recognition takes a user query and a set of “choices” (defined by metadata) and returns one or more of the selected choices. It can be used on its own (as a “selector module”) or as part of a query engine or retriever (e.g., on top of other query engines/retrievers). It is a simple yet powerful module that currently relies mainly on an LLM to make the decision.

It can be applied in the following scenarios:

  • Selecting the correct data source from various data sources;

  • Deciding whether to summarize (e.g., using summary index query engines) or conduct semantic searches (e.g., using vector index query engines);

  • Deciding whether to “try” multiple choices at once and merge the results (using multi-routing functionality).

The core module has the following forms:

  • An LLM selector dumps the choices as text into the prompt and lets the LLM make the decision (a minimal sketch follows this list);

  • Traditional classification models, including semantic-matching-based classifiers, BERT intent classifiers, etc.
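As a concrete illustration of the LLM-selector form, here is a minimal sketch assuming an OpenAI-style chat client; the model name, choice list, and prompt wording are all illustrative, not from the original article:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

CHOICES = {
    "summary": "Questions asking for an overview or summary of a whole document.",
    "semantic_search": "Questions about specific facts answerable from individual passages.",
}

def select_route(query: str) -> str:
    """Dump the choices into the prompt and let the LLM pick one."""
    prompt = (
        "You are a router. Pick the single best choice for the user query.\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in CHOICES.items())
        + f'\n\nUser query: "{query}"\nAnswer with JSON like {{"choice": "<name>"}}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)["choice"]

# select_route("Summarize the 2023 annual report")  # -> "summary"
```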

4.2 Query Rewriting

This module uses an LLM to rephrase the user query instead of retrieving with the original query directly, because for RAG systems the original real-world query is not always the best input for retrieval.

4.2.1 HyDE[2]

Hypothetical Document Embeddings (HyDE) is a technique for generating document embeddings to retrieve relevant documents without the need for actual training data. First, LLM creates a hypothetical answer in response to the query. Although this answer reflects patterns related to the query, the information it contains may not be factually accurate. Next, the query and the generated answer are both transformed into embeddings. The system then identifies and retrieves the actual documents closest to these embeddings in the predefined database.

[Figure: HyDE workflow]
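A minimal sketch of the HyDE flow, assuming an OpenAI-style client; `vector_store.search` is a hypothetical stand-in for whatever vector database API you use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def hyde_retrieve(query: str, vector_store, k: int = 4):
    # 1. Ask the LLM to write a hypothetical answer to the query.
    hypo = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer (the query can optionally be embedded as well).
    emb = client.embeddings.create(model="text-embedding-3-small", input=hypo).data[0].embedding

    # 3. Retrieve the real documents closest to that embedding.
    return vector_store.search(emb, k=k)   # `search` is a stand-in for your vector DB's query API
```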

4.2.2 Rewrite-Retrieve-Read[3]

This work introduces a new framework, Rewrite-Retrieve-Read, which improves retrieval augmentation from the perspective of query rewriting. Previous research mainly focused on adapting the retriever or the LLM; this method instead emphasizes adapting the query itself, since the original query, especially in the real world, is not always the best input for retrieval. The pipeline first uses an LLM to rewrite the query and then performs retrieval-augmented reading. To further improve rewriting, a small language model (T5) is used as a trainable rewriter that rewrites search queries to suit the frozen retriever and LLM. The rewriter is first warmed up with supervised training on pseudo-data; the “retrieve-then-generate” pipeline is then modeled as a reinforcement learning environment, and the rewriter is further trained as a policy model by maximizing the reward of pipeline performance.

[Figure: Rewrite-Retrieve-Read framework]

4.3 Query Expansion

This module breaks complex questions down into sub-questions, using a divide-and-conquer approach: it first analyzes the question and decomposes it into simpler sub-questions, each of which retrieves relevant documents to produce a partial answer; it then collects these intermediate results and combines all partial answers into a final response.
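A minimal sketch of this divide-and-conquer flow, assuming an OpenAI-style chat client; `retrieve_and_answer` is a hypothetical helper standing in for a single-question RAG call:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def decompose(query: str) -> list[str]:
    """Ask the LLM to split a complex question into simpler sub-questions, one per line."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Break this question into simple, independent sub-questions, one per line:\n{query}"}],
    )
    return [line.strip("- ").strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

def answer_complex(query: str, retrieve_and_answer) -> str:
    """retrieve_and_answer(sub_question) -> partial answer; hypothetical helper over your RAG pipeline."""
    partials = [retrieve_and_answer(sub) for sub in decompose(query)]
    merge_prompt = (f"Question: {query}\nPartial answers to its sub-questions:\n"
                    + "\n".join(partials) + "\nCombine these into one final answer.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": merge_prompt}]
    )
    return resp.choices[0].message.content
```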

4.3.1 Step-Back Prompting[4]

This work explores how LLM can handle complex tasks involving many low-level details by using abstraction and reasoning in two steps. The first step is to use LLM to “step back” and generate high-level abstract concepts, establishing reasoning based on these abstract concepts to reduce the probability of errors in intermediate reasoning steps. This method can be used both with and without retrieval. When used with retrieval, both the abstract concepts and the original problems are used for retrieval, and then both results are used as the basis for LLM responses.

[Figure: Step-Back Prompting]

4.3.2 CoVe[5]

Chain of Verification (CoVe) aims to improve the reliability of answers from large language models by systematically verifying and refining responses to minimize inaccuracies, especially in factual question-answering scenarios. The underlying idea is that responses generated by an LLM can be used to verify themselves; this self-verification process assesses the accuracy of the initial response and makes it more precise.

In RAG systems, CoVe is borrowed to handle the increasingly complex questions of real user scenarios: a complex prompt is broken down into multiple independent, search-friendly sub-queries that can be retrieved in parallel, so the LLM can run a targeted knowledge-base search for each sub-query and ultimately give a more accurate and detailed answer while reducing hallucinated output.

[Figure: Chain of Verification (CoVe)]

4.3.3 RAG-Fusion[6]

In this method, the original query is passed to an LLM to generate multiple queries. These search queries are executed in parallel, and the retrieved results are passed on together. This is particularly useful when one question may depend on multiple sub-questions. RAG-Fusion is representative of this approach: a search method aimed at bridging the gap between traditional search paradigms and the multifaceted nature of human queries. It first uses an LLM to generate multiple queries and then uses Reciprocal Rank Fusion (RRF) to re-rank the retrieved results.

[Figure: RAG-Fusion]
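A minimal sketch of the RAG-Fusion flow: generate query variants, retrieve for each in parallel, then fuse the ranked lists with Reciprocal Rank Fusion. `generate_queries` and `retrieve` are hypothetical callables standing in for the LLM and the retriever.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    """Fuse several ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rag_fusion(query: str, generate_queries, retrieve, top_k: int = 5):
    queries = [query] + generate_queries(query)           # LLM-generated query variants
    with ThreadPoolExecutor() as pool:                    # run the retrievals in parallel
        ranked_lists = list(pool.map(retrieve, queries))  # each call returns a ranked list of doc ids
    return reciprocal_rank_fusion(ranked_lists)[:top_k]
```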

4.3.4 ReAct[7]

Recently, in RAG systems, the ReAct concept has been used to break down complex queries into simpler “sub-queries,” where different parts of the knowledge base may answer different “sub-queries”. This is particularly useful in compositional graphs. In a compositional graph, a query can be routed to multiple sub-indexes, each representing a subset of the entire knowledge corpus. By decomposing queries, we can transform queries into more suitable questions within any given index.

[Figure: ReAct]

The ReAct paradigm is illustrated in the figure above: it combines Chain-of-Thought (CoT) prompting with action-plan generation, the two complementing each other to enhance the problem-solving capability of large models. CoT's reasoning traces help the model induce, track, and update action plans and handle exceptions, while the action steps let it interface with external sources such as knowledge bases or environments to gather additional information.

4.4 Query Reconstruction

Considering the overall pipeline efficiency of the query understanding module, we developed a Query Reconstruction module that rewrites, decomposes, and expands the original complex user input in a single request, surfacing the deeper sub-questions behind the query. Exploiting the characteristics of these sub-questions improves retrieval effectiveness and addresses the uneven retrieval quality on complex questions, with the aim of enhancing both the accuracy and the efficiency of queries.

5. Retrieval Model

5.1 Challenges of Retrieval Models

  • Dependence on the accuracy of the embedding model's vectorization.

  • Dependence on whether external data is reasonably segmented (not all knowledge can be converted into a single vector; data needs to be segmented and then converted before being stored in the vector database).

  • Dependence on prompt concatenation: the most similar returned documents are ranked and then sent to the large model together with the user's question, the real goal being for the large model to accurately identify and locate the content it needs within a long context.

The article “Lost in the Middle: How Language Models Use Long Contexts”[8] points out that performance is often highest when relevant information appears at the beginning or end of the input context, while performance significantly drops when the model must retrieve relevant information from the middle of a long context, even for explicit long-context models.

5.2 Architecture

[Figure: Retrieval pipeline (image source: langchain)]

5.3 Document Loader

The document loader provides a “load” method for loading document data from a configured source. A document consists of a piece of text and its associated metadata. Document loaders can load documents from many kinds of sources: some load simple .txt files, others load the text content of any web page or even the transcripts of YouTube videos. Loaders can also optionally implement “lazy loading” to defer reading data into memory.
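For illustration, a short loading example in the langchain style; import paths differ between langchain versions, so treat them as approximate, and the file name and URL are placeholders:

```python
from langchain_community.document_loaders import TextLoader, WebBaseLoader

docs = TextLoader("handbook.txt", encoding="utf-8").load()   # plain .txt file
web_docs = WebBaseLoader("https://example.com/faq").load()   # text content of a web page

# Each Document carries the text plus metadata (e.g. the source path or URL).
print(docs[0].metadata, docs[0].page_content[:100])

# Loaders can also expose lazy_load() to stream documents instead of reading everything into memory.
```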

5.4 Text Converter

A key part of retrieval is fetching only the relevant parts of documents. After documents are loaded, they typically need to be transformed to better fit the application. This involves several transformation steps to prepare the documents for retrieval, the main one being splitting (or chunking) large documents into smaller pieces; this is the job of the text converter. The simplest example is long text, which must be broken into pieces that fit into the model's context window. Ideally, semantically related segments should end up grouped together. This may sound simple, but the potential complexity is significant.

5.4.1 How It Works

  • Split the text into small, semantically meaningful pieces (usually sentences).

  • Combine these small pieces into a larger chunk until a certain size is reached (as measured by some function).

  • Once that size is reached, treat the chunk as its own piece of text and start a new chunk with some overlap (to preserve context between chunks). A short example follows this list.
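The example below sketches these steps with langchain's recursive character splitter; the chunk size, overlap, and separators are illustrative values, the import path varies by langchain version, and `long_document_text` is a placeholder for your own text:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # target size of each chunk (characters with this splitter)
    chunk_overlap=50,    # overlap between neighbouring chunks to preserve context
    separators=["\n\n", "\n", "。", ".", " ", ""],  # try paragraph, line, sentence breaks first
)
chunks = splitter.split_text(long_document_text)   # returns a list of strings
# or: splitter.split_documents(docs) to keep metadata attached to each chunk
```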

5.4.2 Common Types of Text Converters

| Name | Splits On | Adds Metadata | Description |
| --- | --- | --- | --- |
| Recursive | A list of user-defined characters | | Recursively splits text, trying to keep related pieces of text next to each other. This is the recommended way to start splitting text. |
| HTML | HTML-specific characters | ✅ | Splits text based on HTML-specific characters. Notably, this adds relevant information about where each chunk came from (based on the HTML). |
| Markdown | Markdown-specific characters | ✅ | Splits text based on Markdown-specific characters. Notably, this adds relevant information about where each chunk came from (based on the Markdown). |
| Code | Code (Python, JS) specific characters | | Splits text based on characters specific to programming languages; 15 different languages are available to choose from. |
| Token | Tokens | | Splits text on tokens; there are a few different ways to measure tokens. |
| Character | A user-defined character | | Splits text based on a user-defined character. One of the simpler methods. |
| [Experimental] Semantic Chunker | Sentences | | First splits on sentences, then combines adjacent ones that are semantically similar enough. Taken from Greg Kamradt. |

5.4.3 Evaluating Text Converters

You can use the Chunkviz tool created by Greg Kamradt to evaluate text converters. Chunkviz is a tool for visualizing how text converters work. It can show you how the text is segmented and help you adjust the segmentation parameters.

5.5 Text Embedding Model

Another key part of retrieval is the text embedding model, which creates a vector representation of a piece of text. It captures the semantics of the text, allowing you to quickly and effectively find other pieces of text that are similar. This is very useful because it lets us reason about text in vector space and perform operations such as semantic search.

Ideally, the retriever should be able to associate translated texts across languages (cross-language retrieval), long originals with their short summaries, different phrasings with the same semantics, different questions with the same intent, and questions with the texts that may answer them. In addition, to give the large model the highest-quality knowledge fragments possible, the retriever should return as many relevant fragments as possible and rank the truly useful ones first. Finally, we want the model to cover as many domains and scenarios as possible, so that a single model spans multiple business scenarios and users get an out-of-the-box model without further fine-tuning.
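A minimal sketch of working with text in vector space, assuming an OpenAI-style embeddings endpoint (the model name is illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec, doc_vec = embed(["How do I reset my password?", "Steps to change your account password"])
print(cosine_sim(query_vec, doc_vec))   # semantically close texts score higher
```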

5.6 Vector Database

With the rise of embeddings, there has been a need for vector databases to support the efficient storage and search of these embeddings. One of the most common methods for storing and searching unstructured data is to embed the data and store the resulting embedding vectors, then during querying, embed the unstructured query and retrieve the embedding vectors most similar to the embedded query. The vector database is responsible for storing the embedded data and performing vector searches.

[Figure: Vector database workflow]
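A toy example of the store-then-search loop, using FAISS as one possible backend (the article does not prescribe a specific vector database; the dimensions and data here are stand-ins for real embeddings):

```python
import faiss
import numpy as np

dim = 1536                                   # must match the embedding model's output size
index = faiss.IndexFlatIP(dim)               # inner product; normalize vectors for cosine similarity

doc_vectors = np.random.rand(100, dim).astype("float32")   # stand-in for real document embeddings
faiss.normalize_L2(doc_vectors)
index.add(doc_vectors)                                     # store the embedded chunks

query_vector = np.random.rand(1, dim).astype("float32")    # stand-in for the embedded user query
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, k=5)              # ids of the 5 most similar chunks
```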

5.7 Indexing

After the previous data reading and text chunking operations, the processed data needs to be indexed. Indexing is a data structure used for quickly retrieving text content related to user queries. It is one of the core foundational components of retrieval-augmented LLM.

Below are several common indexing structures. To illustrate different indexing structures, we introduce the concept of nodes (Node). Here, a node is the text chunk generated from the document segmentation in the previous steps. The following index structure diagram is sourced from LlamaIndex’s “How Each Index Works”[9].

5.7.1 Summary Index (formerly known as Chain Index)

The summary index simply stores nodes as a sequential chain. In subsequent retrieval and generation phases, all nodes can be simply traversed sequentially or filtered based on keywords.

[Figure: Summary index]

5.7.2 Tree Index

The tree index builds a hierarchical tree index structure from a set of nodes (text chunks), constructed from the leaf nodes (original text chunks) upwards, with each parent node being a summary of its child nodes. During the retrieval phase, traversal can either go down from the root node or directly utilize information from the root node. The tree index provides a more efficient way to query long text chunks and can also be used to extract information from different parts of the text. Unlike chain indexes, tree indexes do not require sequential querying.

[Figure: Tree index]

5.7.3 Keyword Table Index

The keyword table index extracts keywords from each node, constructing a many-to-many mapping from each keyword to the corresponding nodes, meaning each keyword may point to multiple nodes, and each node may contain multiple keywords. During the retrieval phase, nodes can be filtered based on keywords in the user query.

[Figure: Keyword table index]

5.7.4 Vector Index

The vector index is currently one of the most popular indexing methods. This method typically uses text embedding models to map text chunks into fixed-length vectors, which are then stored in the vector database. During retrieval, the user query text is mapped into a vector using the same embedding model, and the most similar one or more nodes are retrieved based on vector similarity calculations.

[Figure: Vector index]

5.8 Ranking and Post-processing

The previous retrieval process may yield many relevant documents, necessitating filtering and ranking. Common filtering and ranking strategies include:

  • Filtering and ranking based on similarity scores.

  • Filtering based on keywords, such as requiring or excluding certain keywords.

  • Letting an LLM re-rank the returned relevant documents based on their relevance scores.

  • Filtering and ranking based on time, such as selecting only the most recent relevant documents.

  • Weighting similarity by time, then sorting and filtering (see the sketch after this list).
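A minimal sketch combining two of the strategies above, a similarity-score threshold plus time-decay weighting; the threshold, half-life, and field names are illustrative assumptions about the retriever's output format:

```python
import time

def filter_and_rerank(hits, score_threshold=0.75, half_life_days=30.0, top_k=5):
    """hits: list of dicts like {"text": ..., "score": ..., "timestamp": ...} returned by the retriever."""
    now = time.time()
    kept = []
    for hit in hits:
        if hit["score"] < score_threshold:              # drop weakly similar chunks
            continue
        age_days = (now - hit["timestamp"]) / 86400
        decay = 0.5 ** (age_days / half_life_days)      # newer documents keep more of their score
        kept.append({**hit, "final_score": hit["score"] * decay})
    return sorted(kept, key=lambda h: h["final_score"], reverse=True)[:top_k]
```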

6. Generation Model

6.1 Response Generation Strategies

The retrieval module retrieves relevant text chunks based on user queries, and the response generation module allows LLM to utilize the retrieved relevant information to generate responses to the original query. Here are some different response generation strategies.

  • One strategy is to go through the retrieved text chunks sequentially, refining the generated response with each iteration. In this case, the number of distinct relevant chunks determines the number of LLM calls.

  • Another strategy is to stuff as many text chunks as possible into each LLM call. If one prompt cannot hold all the chunks, multiple prompts are constructed in the same way, and the successive calls can use the same refinement strategy as above. A sketch of both strategies follows this list.
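A minimal sketch of both strategies, assuming an OpenAI-style chat client; the prompt wording and character budget are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def refine_answer(query: str, chunks: list[str]) -> str:
    """Strategy 1: one LLM call per chunk, refining the previous draft each time."""
    answer = ""
    for chunk in chunks:
        answer = ask(f"Question: {query}\nCurrent draft: {answer or '(none yet)'}\n"
                     f"New context:\n{chunk}\nRefine the draft using the new context.")
    return answer

def stuffed_answer(query: str, chunks: list[str], max_chars: int = 8000) -> str:
    """Strategy 2: pack as many chunks as fit into each prompt, refining across batches if needed."""
    answer, batch = "", ""
    for chunk in chunks:
        if batch and len(batch) + len(chunk) > max_chars:   # current prompt is full: flush it
            answer = ask(f"Question: {query}\nDraft: {answer}\nContext:\n{batch}\nAnswer or refine the draft.")
            batch = ""
        batch += chunk + "\n\n"
    if batch:
        answer = ask(f"Question: {query}\nDraft: {answer}\nContext:\n{batch}\nAnswer or refine the draft.")
    return answer
```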

6.2 Prompt Concatenation Strategies

Prompt concatenation is used to combine different parts of a prompt, using either string prompts or chat prompts. Building prompts this way makes it easy to reuse components.

6.2.1 String Prompts

With string prompts, the templates are concatenated together. You can use prompt templates directly or plain strings (the first element in the list must be a prompt); see, for example, the prompt composition utilities provided by langchain[10].

6.2.2 Chat Prompts

Chat prompts consist of a list of messages. Purely for developer experience, langchain provides a convenient way to build these prompts: each new element in the pipeline becomes a new message in the final prompt, for example the AIMessage, HumanMessage, and SystemMessage classes provided by langchain[11].
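A short example using langchain's ChatPromptTemplate (the import path varies by version); the roles and wording are illustrative:

```python
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that answers strictly from the provided context."),
    ("human", "Context:\n{context}\n\nQuestion: {question}"),
])

messages = prompt.format_messages(context="...retrieved chunks...", question="What is RAG?")
# `messages` is a list of SystemMessage / HumanMessage objects ready to send to a chat model.
```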

7. Plugins (Demonstration Retriever for In-Context Learning)

In-context learning is an emerging paradigm that refers to making predictions for given test inputs by providing some demonstrative input-output pairs (examples) without updating model parameters. Its unique ability to operate without parameter updates allows in-context learning methods to unify various natural language processing tasks through the reasoning of a language model, making it a promising alternative to supervised fine-tuning.

While this method performs well across various natural language processing tasks, its performance depends on the demonstrative input-output pairs provided. Specifying demonstrations for every task in the model's prompt lengthens the prompt, increasing the cost of each call to the large language model and sometimes even exceeding its input length. This opens a new avenue of research: in-context learning based on demonstration retrieval. Such methods retrieve candidate demonstrations that are textually or semantically similar to the test input and add them to the prompt along with the user's input, so that the model can produce correct predictions. However, a single retrieval strategy yields low recall, so demonstrations are recalled imprecisely and model performance suffers.

To address these issues, we propose an in-context learning method based on hybrid demonstration retrieval. It weighs the strengths and weaknesses of different retrieval models and proposes a fusion algorithm that performs multi-path recall with both text retrieval (e.g., BM25 and TF-IDF) and semantic retrieval (e.g., OpenAI embedding-ada and sentence-bert), solving the low recall of single-path retrieval. This, however, raises a new challenge: how to fuse the different recall results, because the score ranges of different retrieval algorithms do not match. For example, text retrieval scores typically range from 0 up to some maximum value (depending on query relevance), while semantic search scores (e.g., cosine similarity) fall between 0 and 1, which makes directly sorting the merged results tricky. We therefore propose a re-ranking method based on reciprocal rank fusion that combines the results of different retrieval algorithms without tuning any model, and, considering that large models tend to perform best when relevant information appears at the beginning or end of the input context, we design a re-ranking algorithm to obtain a high-quality fused ordering.

The specific model architecture is shown in the figure below, which includes retrieval, re-ranking, and generation modules.

[Figure: Model architecture (retrieval, re-ranking, and generation modules)]

7.1 Retrieval Module

The retrieval module is further divided into text retrieval and semantic retrieval.

Semantic retrieval employs a dual-tower model, using OpenAI's embedding-ada as the representation model: both the user's input and each task's candidate examples are embedded into semantic vectors, the k-nearest neighbor algorithm (KNN) computes semantic similarity, and candidates are ranked by similarity. Each task's candidate examples are embedded offline and stored in the vector database, because once a task is confirmed its candidates are fixed; user inputs, being diverse and variable, are embedded in real time to keep computation efficient.

For text retrieval, we first preprocess each task’s candidate examples, removing stop words, special symbols, etc. Meanwhile, in text retrieval, we employ inverted index technology to accelerate queries and use BM25 to calculate text similarity, finally ranking based on similarity.
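A minimal sketch of the text-retrieval path using the rank_bm25 package; the toy corpus, whitespace tokenizer, and stop-word list stand in for the preprocessing described above:

```python
from rank_bm25 import BM25Okapi

STOP_WORDS = {"the", "a", "an", "of", "to", "how"}     # illustrative stop-word list

def tokenize(text: str) -> list[str]:
    return [t for t in text.lower().split() if t not in STOP_WORDS]

candidates = [
    "Return policy for damaged goods",
    "How to reset a password",
    "Shipping times by region",
]
bm25 = BM25Okapi([tokenize(c) for c in candidates])    # BM25 model over the candidate examples

scores = bm25.get_scores(tokenize("reset my password"))
ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
print([candidates[i] for i in ranked])                 # most similar candidate first
```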

7.2 Re-ranking Module

For the re-ranking module, we propose a re-ranking algorithm based on reciprocal rank fusion. It addresses the mismatched score ranges of the different recall algorithms by using reciprocal rank fusion to merge and rank the multi-path recall results. Although reciprocal rank fusion merges and ranks the recalls effectively, the resulting order does not exploit the fact that large models perform best when relevant information appears at the beginning or end of the input context, so using the fused order directly is not ideal. We therefore re-rank the fused results once more, filling positions from both ends according to the fusion ranking. For example, the original order [1,2,3,4,5] becomes [1,3,5,4,2] after re-ranking.
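A minimal sketch of this final "fill from both ends" pass, applied to an already-fused ranking (e.g., the output of reciprocal rank fusion as sketched in the RAG-Fusion section):

```python
def ends_first(ranked_items):
    """Re-order a fused ranking so the strongest items sit at the start and end of the prompt.
    Example: [1, 2, 3, 4, 5] -> [1, 3, 5, 4, 2]."""
    front, back = [], []
    for i, item in enumerate(ranked_items):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

print(ends_first([1, 2, 3, 4, 5]))   # [1, 3, 5, 4, 2]
```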

7.3 Generation Module

The generation module can generate creative and coherent text, aiming to produce new content based on a given prompt or context. Here, we designed a prompt assembly module that combines system prompts with retrieved relevant information. Additionally, the assembly module integrates long-term and short-term dialogue records for prompt encapsulation.

8. Citation or Attribution Generation

8.1 What is Citation or Attribution

In RAG knowledge Q&A scenarios, as ever more documents, web pages, and other information are injected into applications, more developers are realizing the importance of information sources. Attribution aligns the generated content with the reference information, provides evidence for the answer, and helps ensure accuracy, making the large model's responses more trustworthy.

8.2 The Role of Attribution

  • From the user perspective: It allows verification of whether the model's response is reliable.

  • From the model perspective: It improves accuracy and reduces hallucinations.

8.3 How to Achieve Attribution

8.3.1 Model Generation

Let the model generate attribution information directly, for instance by adding a prompt such as “every generated statement must cite the reference information”. This is the simplest method but demands strong instruction-following, so it typically relies on GPT-4 or on fine-tuning the model to include citations in its responses. The drawbacks are also evident: it depends heavily on the model's capabilities (citations may even be fabricated), and the repair cycle for bad cases in real scenarios is long, with weak means of intervention.

8.3.2 Dynamic Calculation

Citation information is added during generation. Specifically, in streaming generation, the output is segmented into semantic units (at periods, paragraph breaks, etc.); whenever a complete semantic unit appears, it is matched against each reference source (by keywords, embeddings, etc.), and the top-N sources above a threshold are attached to the returned unit. This is considerably simpler to implement than the first method and has a shorter bad-case repair cycle, but it is limited by the matching method and threshold and rests on the assumption that the generated text actually comes from the reference information.
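A minimal sketch of this dynamic-calculation idea; `similarity` is a hypothetical scoring function (keyword overlap, embedding cosine, etc.), and the sentence-boundary rule is deliberately simplistic:

```python
import re

def attach_citations(stream, sources, similarity, threshold=0.7, top_n=2):
    """stream: iterable of generated text fragments; sources: list of {"id": ..., "text": ...} dicts."""
    buffer = ""
    for fragment in stream:
        buffer += fragment
        # Treat a period or newline as the end of a semantic unit in this toy example.
        while re.search(r"[。.\n]", buffer):
            unit, buffer = re.split(r"(?<=[。.\n])", buffer, maxsplit=1)
            scored = sorted(((similarity(unit, s["text"]), s["id"]) for s in sources), reverse=True)
            citations = [sid for score, sid in scored[:top_n] if score >= threshold]
            yield unit, citations          # emit the unit together with its reference sources
    if buffer:
        yield buffer, []                   # trailing text without a closing punctuation mark
```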

More research on attribution can be found in: A Survey of Large Language Models Attribution[12]

9. Evaluation

For developers, whether or not they have practiced TDD (Test-Driven Development), they have almost certainly heard of it; similarly, when building large model applications there should be a corresponding notion of MDD (Metrics-Driven Development). The most comfortable path is to define the business scenario, the data, the metrics, and the target scores up front, then work systematically toward those targets, which keeps the team motivated and the boss happy.

Reality, however, rarely matches the ideal. For most large model development work, the scenario is poorly defined, data cleaning has already exhausted the team, and the metrics and targets have not even been considered. Here we address what should have been considered from the start: how to quantify the business metrics of a large model application. Concretely, what quantitative metrics prove that your RAG system really is better than the one built by the team next door, and will convince your peers?

Several aspects affecting RAG system performance:

  • Position Bias: LLMs may give higher attention to content at specific positions in the text, such as content located at the beginning and end of paragraphs being more readily accepted.

  • Relevance of Retrieved Content: Due to the diversity of query expressions, RAG systems may retrieve irrelevant content, which adds noise to the LLM's understanding and increases the model's burden.

9.1 Evaluation Metrics

  • Faithfulness: whether the generated response is faithful to the retrieved contexts; this is crucial for avoiding hallucinations and ensuring that the retrieved context can serve as justification for the generated answer.

  • Answer Relevance: the generated answer should address the question actually asked.

  • Context Relevance: the retrieved context should be focused and contain as little irrelevant information as possible; ideally it covers only what is essential to answer the query. The less redundant information included, the higher the context relevance.

9.2 Evaluation Methods

9.2.1 RGB (Benchmarking Large Language Models in Retrieval-Augmented Generation)[13]

This work systematically studies the impact of retrieval-augmented generation on large language models. It analyzes the performance of different large language models on the four fundamental capabilities required for RAG, namely noise robustness, negative rejection, information integration, and counterfactual robustness, and it establishes a benchmark for retrieval-augmented generation. In addition, current RAG implementations are pipelines involving stages such as chunking, relevance recall, and rejection, each of which can be evaluated separately; the four capabilities above map onto these stages.

9.2.2 RAGAS (RAGAS: Automated Evaluation of Retrieval Augmented Generation)[14]

This work proposes a framework for reference-free evaluation of retrieval-augmented generation (RAG) pipelines. The framework considers the retrieval system's ability to identify relevant and key context passages, the LLM's ability to use those passages faithfully, and the quality of the generation itself. The method is open source; details are at https://github.com/explodinggradients/ragas (an evaluation framework for Retrieval Augmented Generation pipelines).
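For illustration, a minimal sketch of running the open-source ragas metrics; the exact API may differ between ragas versions, so check the repository, and the sample record below is made up:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One toy record: the user question, the RAG answer, and the retrieved contexts.
data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase with a receipt."]],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)   # per-metric scores for the pipeline
```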

9.2.3 LlamaIndex Evaluating[15]

LlamaIndex provides key modules for measuring both the quality of generated results and the quality of retrieval.

10. Conclusion

The wave of LLMs has spawned many techniques, and doing every part well enough for enterprise applications requires extensive research and continuous practice. This article summarizes the key modules from a year of RAG practice; we hope the summary is of some help.

Reference Links

[1]: https://arxiv.org/abs/2005.11401v4
[2]: https://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf
[3]: https://arxiv.org/pdf/2305.14283.pdf?ref=blog.langchain.dev
[4]: https://arxiv.org/pdf/2310.06117.pdf?ref=blog.langchain.dev
[5]: https://arxiv.org/pdf/2309.11495.pdf
[6]: https://github.com/Raudaschl/rag-fusion
[7]: https://arxiv.org/pdf/2210.03629.pdf
[8]: https://arxiv.org/pdf/2307.03172.pdf
[9]: https://docs.llamaindex.ai/en/latest/module_guides/indexing/index_guide.html
[10]: https://python.langchain.com/docs/modules/model_io/prompts/composition
[11]: https://python.langchain.com/docs/modules/model_io/prompts/composition
[12]: https://arxiv.org/pdf/2311.03731.pdf
[13]: https://arxiv.org/pdf/2309.01431.pdf
[14]: https://arxiv.org/pdf/2309.15217.pdf
[15]: https://docs.llamaindex.ai/en/latest/module_guides/evaluating/root.html
