Advanced Self-Reflective RAG

Overview

As most LLMs are trained only periodically on large amounts of public data, they cannot access the latest information and/or private data. Retrieval-Augmented Generation (RAG) is a core paradigm for building applications with LLMs that addresses this issue by connecting them to external data sources. A basic RAG pipeline embeds the user query, retrieves documents relevant to the query, and passes those documents to the LLM so it can generate an answer grounded in the retrieval context.

Figure: Basic RAG Process

In simple terms, the pipeline retrieves relevant content from a knowledge base and passes it to the LLM, which generates the final result based on the retrieved content and returns it to the user.
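
A minimal sketch of this basic pipeline. The `vector_store`, `llm`, and their `similarity_search`/`invoke` methods are placeholders in the style of common RAG libraries, not APIs prescribed by this article:

```python
def basic_rag(question: str, vector_store, llm, k: int = 4) -> str:
    # 1. Embed the query and retrieve the k most similar documents
    #    (similarity_search is assumed to handle the embedding internally).
    docs = vector_store.similarity_search(question, k=k)

    # 2. Pack the retrieved documents into the prompt as context.
    context = "\n\n".join(d.page_content for d in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Ask the LLM to generate an answer grounded in the retrieval context.
    return llm.invoke(prompt).content
```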

The downside is that the retrieved content is often irrelevant or not very useful, which leads to generated answers that do not meet expectations.

In the process shown above, nothing can be adjusted when the answer is unsatisfactory. In practice, we need to iterate: optimize the question, filter or re-rank the retrieved content, return to an earlier node, and run it again until some judgment condition tells us the result is satisfactory. This feedback loop produces a better output, as shown in the figure below:

Figure: RAG with an iterative feedback loop
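
A plain-Python sketch of that loop; `retrieve`, `generate`, `grade_answer`, and `rewrite_question` are hypothetical callables that stand in for the corresponding steps:

```python
def rag_with_feedback(question, retrieve, generate, grade_answer,
                      rewrite_question, max_iters: int = 3) -> str:
    """Retry retrieval/generation until the answer is judged satisfactory."""
    answer = ""
    for _ in range(max_iters):
        docs = retrieve(question)            # retrieve candidate context
        answer = generate(question, docs)    # generate from that context
        if grade_answer(question, answer):   # judgment condition: good enough?
            return answer
        # Otherwise loop back: optimize the question and try again.
        question = rewrite_question(question, answer)
    return answer  # best effort after max_iters attempts
```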

Cognitive Architecture of RAG

In practice, implementing RAG requires logical reasoning around these steps: for example, when should we retrieve (based on the composition of the question and the index)? When should we rewrite the question for better retrieval? When should we discard irrelevant retrieved documents and try retrieving again? Self-Reflective RAG was introduced for exactly this, proposing to use the LLM to self-correct poor retrieval and/or poor generation.

The basic RAG process uses only one chain: the LLM determines what content to generate based on the retrieved documents.

Some RAG flows use routing, where the LLM chooses among different retrievers based on the question. Self-Reflective RAG, however, usually requires some feedback: regenerating the question and/or re-retrieving documents.

A state machine is a third cognitive architecture, one that supports loops, which makes it very suitable here: a state machine simply lets us define a set of steps (for example, retrieve, grade documents, rewrite the query) and set the transitions between them; for example, if the retrieved documents are irrelevant, rewrite the query and retrieve new documents again.
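
A minimal sketch of such a state machine using LangGraph (referenced at the end of this article). The `StateGraph` wiring below follows LangGraph's API, but the node functions are trivial stubs; in a real system they would call the retriever, a grading LLM, and so on:

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    documents: List[str]
    generation: str

# Stub nodes: each takes the state and returns the keys it updates.
def retrieve(state):        # fetch candidate documents for the question
    return {"documents": ["doc about " + state["question"]]}

def grade_documents(state): # keep only documents judged relevant
    return {"documents": [d for d in state["documents"] if state["question"] in d]}

def rewrite_query(state):   # rewrite the question for better retrieval
    return {"question": state["question"] + " (rephrased)"}

def generate(state):        # generate the answer from the surviving context
    return {"generation": "answer based on: " + "; ".join(state["documents"])}

def decide_next(state) -> str:
    # If no relevant documents survived grading, rewrite the query and loop back.
    return "generate" if state["documents"] else "rewrite_query"

builder = StateGraph(RAGState)
builder.add_node("retrieve", retrieve)
builder.add_node("grade_documents", grade_documents)
builder.add_node("rewrite_query", rewrite_query)
builder.add_node("generate", generate)

builder.set_entry_point("retrieve")
builder.add_edge("retrieve", "grade_documents")
builder.add_conditional_edges("grade_documents", decide_next,
                              {"generate": "generate",
                               "rewrite_query": "rewrite_query"})
builder.add_edge("rewrite_query", "retrieve")  # the loop a plain chain cannot express
builder.add_edge("generate", END)

graph = builder.compile()
# graph.invoke({"question": "what is CRAG?", "documents": [], "generation": ""})
```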

Figure: Cognitive Architecture of RAG

Corrective RAG (CRAG)

The core of CRAG is to evaluate the retrieved documents with an LLM and classify the retrieval into three categories: Correct, Incorrect, and Ambiguous:

  • Evaluate the overall quality of the retrieved documents with an LLM (the retrieval evaluator), returning a confidence score for each document.
  • If the vector-store retrieval is deemed ambiguous or irrelevant to the user query, perform web-based document retrieval to supplement the context.
  • Refine the retrieved knowledge by dividing the documents into “knowledge strips,” grading each strip, and filtering out the irrelevant ones (see the sketch below).
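
A sketch of that knowledge-refinement step, assuming a hypothetical `grade_strip(question, strip)` LLM call that returns a relevance score between 0 and 1:

```python
def refine_knowledge(question: str, documents: list, grade_strip,
                     threshold: float = 0.5) -> list:
    """Split documents into 'knowledge strips', grade each strip against the
    question, and keep only the strips judged relevant, best first."""
    strips = []
    for doc in documents:
        # Simplest possible stripping: one strip per paragraph.
        strips.extend(p.strip() for p in doc.split("\n\n") if p.strip())
    scored = [(grade_strip(question, s), s) for s in strips]
    scored.sort(reverse=True)                      # rank strips by score
    return [s for score, s in scored if score >= threshold]
```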

The CRAG process can be represented as follows:

Figure: CRAG Process
  • Correct: Pass the correct documents to the LLM to generate an answer based on the retrieval context.
  • Incorrect: Perform web-based document retrieval to supplement the context and pass it to the LLM to generate the answer.
  • Ambiguous: Pass both the retrieved documents and the web results to the LLM as context to generate the answer (see the routing sketch below).
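
A sketch of this three-way routing; `retrieval_evaluator`, `web_search`, and `generate` are hypothetical helpers standing in for the components described above:

```python
def crag_answer(question: str, vector_docs: list, retrieval_evaluator,
                web_search, generate) -> str:
    # Classify the vector-store retrieval as a whole: "correct", "incorrect",
    # or "ambiguous" (hypothetical evaluator call).
    verdict = retrieval_evaluator(question, vector_docs)

    if verdict == "correct":
        context = vector_docs                         # trust the retrieved documents
    elif verdict == "incorrect":
        context = web_search(question)                # supplement with web results instead
    else:  # "ambiguous"
        context = vector_docs + web_search(question)  # pass both sources as context

    return generate(question, context)
```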

Simplified steps are shown in the figure below:

Figure: Simplified CRAG steps

Self-Reflective RAG

The core idea is autonomous judgment: when the model judges that the result is not satisfactory, it forms a loop that goes back and optimizes the earlier steps again. Self-RAG does this with four types of reflection tokens:

  • Retrieve token: decides whether to retrieve a set of document chunks D. Input is x (question), or x (question) plus y (generation). Output is yes, no, or continue.
  • ISREL token: decides whether passage d is relevant to x. Input is (x (question), d (chunk)) for each d in D. Output is relevant or irrelevant.
  • ISSUP token: decides whether the LLM generation is supported by each chunk. Input is (x, d, y) for each d in D. It verifies that all verifiable statements in y (generation) are supported by d. Output is fully supported, partially supported, or no support.
  • ISUSE token: decides whether the generation is a useful response to x. Input is (x, y) for d in D. Output is {5, 4, 3, 2, 1} (see the sketch after this list).
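
A schematic of how these four tokens could drive the flow. The `Reflection` container and the `generate_with_critique` helper are hypothetical; only the token names and values come from the Self-RAG paper:

```python
from dataclasses import dataclass

@dataclass
class Reflection:
    """One Self-RAG critique of a candidate generation y for passage d."""
    retrieve: str  # "yes" | "no" | "continue"
    is_rel: str    # "relevant" | "irrelevant"
    is_sup: str    # "fully supported" | "partially supported" | "no support"
    is_use: int    # 5 (most useful) .. 1 (least useful)

def self_rag_step(x: str, retrieve_docs, generate_with_critique) -> str:
    """Generate one candidate per retrieved passage and keep the best-rated one."""
    candidates = []
    for d in retrieve_docs(x):
        # Hypothetical call returning (generation y, Reflection) for passage d.
        y, r = generate_with_critique(x, d)
        # Drop candidates built on irrelevant or unsupported context.
        if r.is_rel == "relevant" and r.is_sup != "no support":
            candidates.append((r.is_use, y))
    if not candidates:
        return "no supported answer found"
    return max(candidates)[1]  # prefer the generation rated most useful
```
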
Figure: Four Types of Tokens Used in Self-RAG

We can summarize it as a simplified graphic to understand the information flow:

Figure: Flowchart Used in Self-RAG

The detailed content is shown in the figure below:

Figure: Overview of Self-RAG

Conclusion

Self-reflection can greatly improve RAG by enabling it to correct poor-quality retrieval or generation.

Recommended Reading

  • With Code | Making Retrieval-Augmented Generation (RAG) Faster
  • Embeddings-Based Search for Q&A
  • LLMs: Using Search API for Q&A and Re-ranking

References:

Self-Reflective RAG with LangGraph (https://blog.langchain.dev/agentic-rag-with-langgraph/)

Corrective RAG (CRAG) (https://arxiv.org/pdf/2401.15884.pdf)

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (https://arxiv.org/abs/2310.11511)
