Overview
As most LLMs are only trained periodically on large amounts of public data, they cannot access the latest information and/or private data. Retrieval-Augmented Generation (RAG) is a core paradigm for developing LLM applications that addresses this issue by connecting the model to external data sources. A basic RAG pipeline embeds the user query, retrieves documents relevant to that query, and passes the documents to the LLM so it can generate an answer grounded in the retrieved context.
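As a concrete illustration, here is a minimal sketch of that basic pipeline in Python. It uses a toy bag-of-words "embedding" and an in-memory document list as stand-ins for a real embedding model and vector store, and `call_llm` is a hypothetical placeholder for a chat-model call:

```python
import math
from collections import Counter

# Toy in-memory "knowledge base"; a real pipeline would use a vector store.
DOCS = [
    "RAG retrieves documents from a knowledge base and passes them to an LLM as context.",
    "Self-RAG adds reflection tokens so the model can critique its own retrieval and generation.",
]

def embed(text):
    # Toy bag-of-words "embedding"; a real pipeline would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, k=1):
    # Embed the query and return the k most similar documents.
    q = embed(question)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def call_llm(prompt):
    # Hypothetical LLM call; replace with a real chat-model client.
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

def basic_rag(question):
    # Embed the query, retrieve relevant documents, and generate an answer from that context.
    context = "\n".join(retrieve(question))
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(basic_rag("What does RAG pass to the LLM?"))
```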

In simple terms, RAG retrieves relevant content from a knowledge base, passes it to an LLM, and the LLM generates the final answer for the user based on that retrieved content.
The downside is that the retrieved content is often irrelevant or unhelpful, which leads to generated answers that fall short of expectations.
In the process shown above, there is no way to make adjustments when the answer is unsatisfactory. In practice we need to iterate: optimize the question, or filter and re-rank the retrieved content, return to an earlier step, and run it again until the result satisfies some judgment condition, giving a better final output. A sketch of this loop follows, and the flow is shown in the figure below:
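A plain-Python sketch of that corrective loop; the retriever, relevance filter, generator, quality check, and query rewriter are passed in as callables because they are hypothetical placeholders here:

```python
def answer_with_retries(question, retrieve, is_relevant, generate, is_satisfactory, rewrite_query, max_iters=3):
    """Retrieve, filter, generate, and loop back with an optimized question until satisfied."""
    answer = None
    for _ in range(max_iters):
        docs = retrieve(question)                             # retrieve candidate documents
        docs = [d for d in docs if is_relevant(question, d)]  # filter / re-rank the retrieved content
        answer = generate(question, docs)                     # generate from the surviving context
        if is_satisfactory(question, answer):                 # the judgment condition
            return answer
        question = rewrite_query(question)                    # optimize the question and try again
    return answer                                             # best effort after max_iters attempts
```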

Cognitive Architecture of RAG
In practice, implementing RAG requires reasoning about these steps: for example, when should we retrieve (based on the composition of the question and the index), when should we rewrite the question for better retrieval, and when should we discard irrelevant retrieved documents and try retrieving again? Self-Reflective RAG was introduced for this, proposing the use of LLMs to self-correct poor retrievals and/or generations.
The basic RAG process uses only one chain: the LLM determines what content to generate based on the retrieved documents.
Some RAG flows use routing, where the LLM chooses among different retrievers based on the question. However, Self-Reflective RAG usually requires some feedback: regenerating the question and/or re-retrieving documents.
A state machine is a third cognitive architecture, one that supports loops, and it is well suited to this: a state machine simply lets us define a set of steps (for example, retrieve, grade documents, rewrite the query) and the transitions between them; for example, if the retrieved documents are irrelevant, rewrite the query and retrieve again.
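A minimal sketch of such a state machine with LangGraph's `StateGraph` (the API as used in the referenced LangChain blog post; the search, grading, rewriting, and answering functions are toy stand-ins):

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str
    documents: List[str]
    generation: str

# Toy stand-ins for a real retriever, relevance grader, query rewriter, and LLM.
def search(q): return [f"document about {q}"]
def relevant(q, d): return q.lower() in d.lower()
def rephrase(q): return q + " (rephrased)"
def llm_answer(q, docs): return f"Answer to {q!r} using {len(docs)} document(s)."

def retrieve(state: GraphState):
    return {"documents": search(state["question"])}

def grade_documents(state: GraphState):
    kept = [d for d in state["documents"] if relevant(state["question"], d)]
    return {"documents": kept}

def rewrite_query(state: GraphState):
    return {"question": rephrase(state["question"])}

def generate(state: GraphState):
    return {"generation": llm_answer(state["question"], state["documents"])}

def decide_next(state: GraphState) -> str:
    # If nothing relevant survived grading, rewrite the query and retrieve again.
    return "generate" if state["documents"] else "rewrite_query"

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("rewrite_query", rewrite_query)
workflow.add_node("generate", generate)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents", decide_next,
    {"generate": "generate", "rewrite_query": "rewrite_query"},
)
workflow.add_edge("rewrite_query", "retrieve")  # the loop back to retrieval
workflow.add_edge("generate", END)

app = workflow.compile()
print(app.invoke({"question": "state machines", "documents": [], "generation": ""}))
```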

Corrective RAG (CRAG)
The core of CRAG is to evaluate the retrieved documents with an LLM and classify them into three categories: Correct, Incorrect, and Ambiguous:
- Evaluate the overall quality of the retrieved documents with an LLM (a retrieval evaluator) and return a confidence score for each document.
- If the vector-store retrieval is deemed ambiguous or irrelevant to the user query, perform web search to supplement the context.
- Refine the knowledge in the retrieved documents by dividing them into "knowledge strips," grading each strip, and filtering out the irrelevant ones.
This can be represented as follows:

- Correct: pass the correct documents to the LLM to generate an answer grounded in the retrieved context.
- Incorrect: perform web search to supplement the context and pass the results to the LLM to generate an answer.
- Ambiguous: pass both the retrieved documents and the web-search results to the LLM as context to generate an answer.
Simplified steps are shown in the figure below:
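A minimal sketch of this routing logic in plain Python; `grade`, `refine_to_strips`, `web_search`, and `llm_generate` are hypothetical callables standing in for the retrieval evaluator, knowledge-strip refinement, web search, and generation steps:

```python
def corrective_rag(question, retrieved_docs, grade, refine_to_strips, web_search, llm_generate):
    """Route on the retrieval evaluator's verdict: correct, incorrect, or ambiguous."""
    verdict = grade(question, retrieved_docs)  # one of "correct", "incorrect", "ambiguous"

    if verdict == "correct":
        # Keep only the relevant knowledge strips from the retrieved documents.
        context = refine_to_strips(question, retrieved_docs)
    elif verdict == "incorrect":
        # Discard the vector-store results and fall back to web search.
        context = web_search(question)
    else:
        # Ambiguous: combine the refined strips with web-search results.
        context = refine_to_strips(question, retrieved_docs) + web_search(question)

    return llm_generate(question, context)
```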

Self-Reflective RAG
The core idea: give the model autonomous judgment, so that when it decides a result is not satisfactory, it forms a loop that goes back and optimizes the earlier steps again.
- Retrieve token: decides whether to retrieve document chunks D. Input is x (question) OR x (question), y (generation). Output is one of yes, no, continue.
- ISREL token: decides whether a passage d is relevant to x. Input is (x (question), d (chunk)) for each d in D. Output is relevant or irrelevant.
- ISSUP token: decides whether the LLM generation from each chunk in D is supported by that chunk. Input is x, d, y for each d in D. It verifies that all verifiable statements in y (generation) are supported by d. Output is fully supported, partially supported, or no support.
- ISUSE token: decides whether the generation from each chunk in D is a useful response to x. Input is x, y for each d in D. Output is a score in {5, 4, 3, 2, 1}.
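A minimal sketch of how these four tokens can gate the flow; `retrieve_token`, `isrel`, `issup`, and `isuse` are hypothetical grader callables (in the actual Self-RAG model these judgments are emitted as special tokens during decoding), and `retriever` and `generate` are likewise placeholders:

```python
def self_rag(x, retriever, generate, retrieve_token, isrel, issup, isuse):
    """Retrieve only when needed, then keep the best supported, most useful generation."""
    if retrieve_token(x) == "no":              # Retrieve: is retrieval needed at all?
        return generate(x, None)

    candidates = []
    for d in retriever(x):
        if isrel(x, d) != "relevant":          # ISREL: drop irrelevant chunks
            continue
        y = generate(x, d)                     # generate an answer from this chunk
        if issup(x, d, y) == "no support":     # ISSUP: require y to be supported by d
            continue
        candidates.append((isuse(x, y), y))    # ISUSE: usefulness score in {5, 4, 3, 2, 1}

    if not candidates:
        return generate(x, None)               # nothing relevant and supported: answer without context
    best_score, best_y = max(candidates, key=lambda c: c[0])
    return best_y
```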

We can summarize it as a simplified graphic to understand the information flow:

The detailed content is shown in the figure below:

Conclusion
Self-reflection can greatly improve RAG, making it possible to correct poor-quality retrieval or generation.
Recommended Reading
- With Code | Making Retrieval-Augmented Generation (RAG) Faster
- Embeddings-Based Search for Q&A
- LLMs: Using Search API for Q&A and Re-ranking
References:
- Self-Reflective RAG with LangGraph: https://blog.langchain.dev/agentic-rag-with-langgraph/
- Corrective RAG (CRAG): https://arxiv.org/pdf/2401.15884.pdf
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection: https://arxiv.org/abs/2310.11511