Detailed Explanation of RAG 2.0 Architecture: Building End-to-End Retrieval-Augmented Generation Systems


There have been many articles about Retrieval-Augmented Generation (RAG). If we could make the retriever trainable, or customize the entire RAG pipeline the way we fine-tune a large language model (LLM), we would certainly achieve better results. The issue with current RAG, however, is that its sub-modules are not jointly optimized; the whole resembles a patchwork monster that works, but whose parts do not fit together harmoniously. The concept of RAG 2.0 is introduced to address this problem.

What is RAG?

In simple terms, RAG can provide additional context to our large language models (LLMs) to generate better, more specific responses. LLMs are trained on publicly available data; they are very intelligent systems, but they cannot answer specific questions because they lack the context necessary for those answers.

Thus, RAG can inject new knowledge or capabilities into LLMs, although this knowledge injection is not permanent. Another common way to add new knowledge or capabilities to LLMs is through fine-tuning the LLM on our specific data.

Adding new knowledge through fine-tuning is quite difficult and expensive, but it is permanent. Adding new capabilities through fine-tuning can even affect the knowledge that was previously held. During fine-tuning, we cannot control which weights will be changed, and therefore we cannot know which capabilities will be increased or decreased.

The choice between fine-tuning, RAG, or a combination of both depends entirely on the task at hand. There is no one-size-fits-all approach.

The classic steps of RAG are as follows:

  • Divide documents into uniform chunks.

  • Each chunk is a segment of raw text.

  • Use an encoder to generate embeddings for each chunk (e.g., OpenAI embeddings, sentence_transformer, etc.) and store them in a database.

  • Find the most similar encoded chunks, retrieve the original text of these chunks, and provide it as context along with the prompt to the generator.
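To make the steps above concrete, here is a minimal sketch of this classic pipeline in Python. It assumes the sentence-transformers and numpy packages; the model name, chunk size, and prompt template are illustrative choices, not recommendations.

```python
# Minimal sketch of the classic RAG pipeline: chunk, embed, retrieve, prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

documents = ["RAG supplies an LLM with external context retrieved at query time."]

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a document into uniform character-based chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Steps 1-3: chunk every document and encode each chunk.
chunks = [c for doc in documents for c in chunk(doc)]
chunk_vectors = encoder.encode(chunks, normalize_embeddings=True)

# Step 4: find the most similar chunks and hand them to the generator as context.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q              # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

question = "What does RAG add to an LLM?"
context = "\n\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
# `prompt` is then sent to the generator LLM.
```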


RAG 2.0

Today’s typical RAG systems use off-the-shelf frozen models for embeddings, vector databases for retrieval, and black-box language models for generation, stitching them together through prompts or orchestration frameworks. Each component is technically feasible, but the overall performance is far from optimal. These systems are fragile, lack any machine learning or specialization for their deployment domain, require extensive prompting, and are prone to cascading errors. As a result, RAG systems rarely meet production standards.

The concept of RAG 2.0, which we propose, aims to maximize performance by pre-training, fine-tuning, and aligning all components as a single integrated system, backpropagating through both the language model and the retriever.

Below, we compare contextual language models (CLMs) with frozen model RAG systems across multiple dimensions.


For open-domain question answering: We test each model’s ability to retrieve relevant knowledge and accurately generate answers using the standard Natural Questions (NQ) and TriviaQA datasets. The HotpotQA (HPQA) dataset is also used to evaluate the models in a single-step retrieval setup. All datasets are evaluated with the exact match (EM) metric.

For faithfulness: We use HaluEvalQA and TruthfulQA to measure each model’s ability to stay grounded in the retrieved evidence and avoid hallucinations.

For freshness: We measure each RAG system’s ability to summarize rapidly changing world knowledge by using a web search index, reporting accuracy on the recent FreshQA benchmark.

These dimensions are crucial for building production-level RAG systems. CLMs significantly outperform strong frozen-model RAG systems built with GPT-4 or state-of-the-art open-source models such as Mixtral.

How Does RAG Solve Intelligence Issues?

RAG is a semi-parametric system: the large language model (LLM) is the parametric part, and the retrieval index is the non-parametric part. The LLM stores information implicitly, encoded in its weights or parameters, while the index stores knowledge explicitly, outside the model’s parameters.

But how does this solve the problem?

  • Swapping the index (the specific information) gives us customization: we are not stuck with outdated knowledge baked into the weights, because the content of the index can be revised at any time.

  • Grounding the LLM in these indices reduces hallucinations and enables citation and attribution through pointers to the sources.

Thus, RAG provides LLMs with better contextual capabilities, enabling them to perform well. But is it really that simple?

Not really, as we have many questions to answer in order to create a modern, scalable RAG pipeline.


Current RAG systems are not that intelligent, and they are quite simplistic, unable to handle complex tasks that require extensive customized context.


We see that currently the only trainable parameter part is the LLM. Can we add more parameters?

Better Retrieval Strategies

1. Sparse Retrieval

TF-IDF: TF-IDF, or Term Frequency-Inverse Document Frequency, is a metric that measures the importance of a word to a document set or corpus, adjusting for the fact that some words appear more frequently in general. [1] It is commonly used as a weighting factor in information retrieval, text mining, and user modeling.
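As a brief illustration, TF-IDF-based sparse retrieval can be sketched with scikit-learn’s TfidfVectorizer; the toy corpus and query below are made up for the example.

```python
# TF-IDF sparse retrieval: rank documents by cosine similarity of TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Machine learning models learn patterns from data.",
    "Spider plants need indirect sunlight and weekly watering.",
    "Retrieval-augmented generation adds external context to an LLM.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)          # sparse TF-IDF matrix

query_vector = vectorizer.transform(["machine learning"])
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(corpus[best], scores[best])
```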


BM25: This can be seen as an improvement over TF-IDF.

For the query “machine learning”, the calculation of BM25 would be the sum of BM25Score(machine) + BM25Score(learning).
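For reference, the commonly used form of the BM25 score for a single query term q(i) against a document D is:

$$\mathrm{BM25}(q_i, D) = \mathrm{IDF}(q_i)\cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\cdot \frac{|D|}{\mathrm{avgdl}}\right)}$$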

The first part of the formula is the inverse document frequency (IDF) of the term. The second part represents term frequency (TF), which is normalized by document length.

f(q(i), D) is the term frequency of term q(i) in document D.

k1 and b are tunable parameters. |D| is the length of the document, and avgdl is the average document length across the collection.


These are some early steps in sparse retrieval.

2. Dense Retrieval

The need for dense retrieval arises because language is not always straightforward: if a query uses synonyms, for example, sparse retrieval can fail completely. We want to retrieve information based not just on exact keyword matches but on the semantics of sentences. BERT sentence embeddings are one example of dense retrieval: after converting sentences into vectors, we use the dot product or cosine similarity to retrieve information.

One advantage of dense retrieval is that it parallelizes easily; with GPUs, similarity search can run at the billion-vector scale, which is what motivated Meta to develop FAISS and what underlies the systems we commonly call vector databases.
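As a small sketch (assuming the sentence-transformers and faiss packages, with an illustrative model name), dense retrieval over a FAISS index looks roughly like this:

```python
# Dense retrieval: embed passages, index them in FAISS, search semantically.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "Spider plants prefer moderate, indirect sunlight.",
    "Overwatering is a common cause of brown leaf tips.",
    "BM25 is a sparse, keyword-based retrieval function.",
]

# With normalized vectors, inner product equals cosine similarity.
embeddings = encoder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = encoder.encode(["why are my plant's leaves turning brown?"],
                       normalize_embeddings=True)
scores, ids = index.search(query, 2)   # semantic match, no keyword overlap needed
print([passages[i] for i in ids[0]])
```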


Dense retrieval is also what we commonly refer to as vector search, which typically uses the dot product to measure similarity; it is a standard step in typical RAG pipelines. How do we go beyond simple dot products?


In addition to the simple dot product, there are many ways for documents and queries to interact, such as Siamese (twin) networks, ColBERT-style late interaction, and so on.
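To show what going beyond a single dot product can mean, here is a toy sketch of ColBERT-style late interaction (MaxSim) scoring; the token embeddings are random stand-ins rather than real model outputs.

```python
# Late interaction (MaxSim): score = sum over query tokens of the best-matching
# document token similarity, instead of one query vector vs. one doc vector.
import torch

torch.manual_seed(0)
query_tokens = torch.randn(5, 128)    # 5 query token embeddings (toy values)
doc_tokens = torch.randn(80, 128)     # 80 document token embeddings (toy values)

# Normalize so dot products are cosine similarities.
q = torch.nn.functional.normalize(query_tokens, dim=-1)
d = torch.nn.functional.normalize(doc_tokens, dim=-1)

similarity = q @ d.T                  # (5, 80) token-level similarities
score = similarity.max(dim=-1).values.sum()   # MaxSim per query token, then sum
print(score.item())
```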

Model-Based Retrieval Algorithms

ColBERT is a very good retrieval strategy, but it is not the SOTA in information retrieval. We have other more advanced algorithms and strategies, such as SPLADE, DRAGON, and Hybrid search.

1. SPLADE: A combination of sparse and dense query expansion.


Through query expansion, more context is covered, which helps in better retrieval.

2. DRAGON: Promoting dense retrievers through progressive data augmentation.


Let’s understand how DRAGON works through an example:

  • Initial Inquiry: “How to care for a spider plant?”

  • DRAGON’s Action: After identifying the theme of plant care, DRAGON formulates a targeted retrieval query specifically to gather general care information about spider plants.

  • Initial Retrieval: DRAGON digs into its database, retrieving documents about the sunlight needs, watering schedules, and suitable fertilizers for these green plants. It then generates the response: “Spider plants need moderate indirect sunlight and should be watered once a week. Fertilizing once a month during the growing season is beneficial for them.”

  • User Update: As the user inquires, “What to do if the leaves turn brown?” the conversation shifts.

  • DRAGON Adapts: DRAGON refines the retrieval query, focusing on the issue of brown leaves in spider plants.

  • Dynamic Retrieval Action: DRAGON retrieves information about common causes for leaves turning brown, such as overwatering or excessive direct sunlight.

  • Knowledge Transfer: By utilizing the newly retrieved data, DRAGON customizes its response based on the development of the conversation: “Brown leaves in spider plants may indicate overwatering or too much direct sunlight. Try reducing the watering frequency and moving the plant to a cooler place.”

DRAGON dynamically adjusts its retrieval queries based on the user’s evolving interests in the conversation. Every user input updates the retrieval process in real-time, ensuring that the information provided is both relevant and detailed, matching the latest context.

3. Hybrid Search: We interpolate between dense and sparse retrieval scores. This is a direction the RAG community has been exploring, for example combining a BM25-style sparse retriever with SPLADE or DRAGON.
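A minimal sketch of this interpolation idea, with made-up scores and an illustrative weighting parameter alpha:

```python
# Hybrid search as a score interpolation between a sparse retriever (e.g. BM25)
# and a dense retriever; scores are min-max normalized before mixing.
def hybrid_scores(sparse: dict[str, float],
                  dense: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Interpolate normalized sparse and dense scores per document id."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    sparse, dense = normalize(sparse), normalize(dense)
    docs = set(sparse) | set(dense)
    return {doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
            for doc in docs}

bm25_scores = {"doc1": 12.3, "doc2": 8.1, "doc3": 2.4}
dense_scores = {"doc1": 0.62, "doc2": 0.78, "doc4": 0.55}
print(sorted(hybrid_scores(bm25_scores, dense_scores).items(), key=lambda x: -x[1]))
```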

However, regardless of the method used, the retriever remains fixed or non-customizable (non-fine-tunable).

Retrievers That Can Provide Context

1. RePlug

This is a very interesting retrieval paper: for a given query, we retrieve the top-k documents and normalize their retrieval scores into a probability distribution. Each retrieved document is then fed, together with the query, into the generator separately, and we measure the language model’s perplexity on the correct answer, which yields a second distribution over the documents. We then minimize the KL divergence between these two distributions, training the retriever to put more probability mass on the documents that give the lowest perplexity on the correct answer.
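Below is a conceptual sketch of this training signal. The tensors are toy stand-ins for real retriever scores and LM likelihoods, and the KL term is written in one natural direction; treat it as an illustration rather than the paper’s exact implementation.

```python
# RePlug-style training signal: align the retriever's distribution over the
# top-k documents with the distribution implied by how much each document
# helps the frozen LM predict the correct answer.
import torch
import torch.nn.functional as F

# Similarity scores between the query and k retrieved documents (trainable side).
retriever_scores = torch.tensor([2.1, 1.4, 0.3, -0.5], requires_grad=True)

# Log-likelihood of the correct answer when each document is given as context
# (computed with the frozen generator; lower perplexity = higher log-likelihood).
answer_log_likelihoods = torch.tensor([-1.2, -2.5, -0.8, -3.0])

retrieval_probs = F.softmax(retriever_scores, dim=-1)      # P_retriever(d | q)
lm_probs = F.softmax(answer_log_likelihoods, dim=-1)       # Q_LM(d | q, answer)

# Minimize KL(P_retriever || Q_LM); gradients flow only into the retriever scores.
loss = (retrieval_probs * (retrieval_probs.log() - lm_probs.log())).sum()
loss.backward()
print(loss.item(), retriever_scores.grad)
```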


2. In-Context RALM

It uses a frozen language model with BM25 retrieval, and specializes the retrieval side through reranking: a zero-shot (frozen) language model is paired with a trained reranking model.


The language model is fixed; we only backpropagate through, and train, the reranking part. This is not very advanced, but it performs reasonably well compared to the simple RAG described earlier.
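As a small sketch of the reranking step at inference time, the following uses the sentence-transformers CrossEncoder class with an off-the-shelf MS MARCO reranker; the query and candidates are illustrative.

```python
# Reranking: the frozen LM stays untouched, while a trainable cross-encoder
# reorders the BM25 candidates before they are placed in the prompt.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how often should spider plants be watered?"
bm25_candidates = [
    "Spider plants should be watered about once a week.",
    "BM25 ranks documents by term frequency and document length.",
    "Brown leaf tips often indicate overwatering.",
]

scores = reranker.predict([(query, doc) for doc in bm25_candidates])
best_doc = bm25_candidates[scores.argmax()]
prompt = f"{best_doc}\n\nQuestion: {query}\nAnswer:"   # prepended to the frozen LM's input
```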


But the question is, if we cannot access the parameters of the LLM, how do we backpropagate or update the parameters of the retriever?

So it uses a reinforcement-style loss to train the retriever. The retriever’s effectiveness is judged by how much the information it retrieves improves the language model’s output, and the retriever is updated to maximize this improvement. This may involve adjusting the retrieval strategy (what is retrieved and how) based on performance metrics derived from the language model’s output, such as the coherence, relevance, and factual accuracy of the generated text.

3. Combined Contextual Retriever and Generator


Instead of optimizing the LLM or retriever separately, why not optimize the entire process at once?


When retrieving documents, there are many opportunities to optimize: retrieval can happen once per query, or be repeated every n tokens during generation.

In the RAG-token model, different documents can be retrieved for different target tokens compared to a single retrieval in the RAG-Sequence model.
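Concretely, the two variants of the original RAG formulation differ in where the marginalization over the retrieved documents z takes place:

$$p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}k} p_{\eta}(z \mid x)\, \prod_{i} p_{\theta}(y_i \mid x, z, y_{1:i-1})$$

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i} \sum_{z \in \text{top-}k} p_{\eta}(z \mid x)\, p_{\theta}(y_i \mid x, z, y_{1:i-1})$$

where p_eta is the retriever and p_theta is the generator.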

An encoder encodes all k retrieved documents, their representations are combined, and the decoder then generates the output conditioned on them together with the input prompt.


4. k-NN LM

Another interesting idea in RAG systems is to include k-NN LM:
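The core idea of a k-NN LM is to interpolate the base model’s next-token distribution with a distribution built from the nearest neighbors of the current context in an external datastore. Below is a toy sketch of that interpolation; all tensors are random stand-ins, and the interpolation weight is an illustrative choice.

```python
# kNN-LM sketch: p_final = lam * p_kNN + (1 - lam) * p_LM
import torch
import torch.nn.functional as F

vocab_size, hidden = 6, 8
torch.manual_seed(0)

# Datastore of (context representation -> observed next token) pairs, built offline.
datastore_keys = torch.randn(100, hidden)
datastore_values = torch.randint(0, vocab_size, (100,))

def knn_distribution(context_vec: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Turn the k nearest datastore entries into a next-token distribution."""
    dists = torch.cdist(context_vec[None], datastore_keys)[0]     # L2 distances
    knn = dists.topk(k, largest=False)
    weights = F.softmax(-knn.values, dim=-1)                      # closer = heavier
    probs = torch.zeros(vocab_size)
    probs.index_add_(0, datastore_values[knn.indices], weights)
    return probs

context_vec = torch.randn(hidden)                    # current context representation
p_lm = F.softmax(torch.randn(vocab_size), dim=-1)    # base LM next-token distribution
lam = 0.25                                           # illustrative interpolation weight
p_final = lam * knn_distribution(context_vec) + (1 - lam) * p_lm
print(p_final)
```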


Researchers have shown that models trained in a retrieval-augmented setup can be made roughly 25 times smaller.

Current SOTA Summary

Contextualizing large language models (LLMs) is both complex and expensive, because updating an entire LLM is not easy: it can mean retraining on billions or even trillions of tokens.

Thus, Meta’s FAIR has released the ATLAS paper, which discusses how to train the entire RAG pipeline and uses different types of loss functions for different parts, comparing their performance.

Below is a performance comparison of all the different losses in the ATLAS paper:


ATLAS is a carefully designed and pre-trained retrieval-augmented language model capable of learning knowledge-intensive tasks with very few training examples. ATLAS integrates these loss functions into a coherent training process, allowing for fine-tuning of the retriever directly based on its impact on language model performance, rather than relying on external annotations or predefined relevance scores. This integration enables the system to improve over time by adapting to the specific needs of its training tasks.

  • It uses a dual-encoder framework as its retrieval system, with one encoder dedicated to encoding queries and the other to documents.

  • Then the retrieved documents are fed, together with the query, into a powerful sequence-to-sequence language model based on the T5 architecture, whose decoder generates the final text output.

  • It adopts a fusion-in-decoder approach, integrating information from the retrieved documents directly in the decoder of the sequence-to-sequence model. This lets the language model dynamically draw on the retrieved information during generation, improving the relevance and accuracy of its output.
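The following is a rough, simplified sketch of the fusion-in-decoder pattern (not the actual ATLAS code), using the Hugging Face T5 classes; the model name, prompt format, and concatenation details are illustrative assumptions.

```python
# Fusion-in-decoder sketch: encode each (query, passage) pair independently,
# concatenate the encoder outputs, and let the decoder attend over all of them.
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

query = "question: how often should spider plants be watered?"
passages = [
    "Spider plants should be watered about once a week.",
    "Overwatering causes brown tips on spider plant leaves.",
]

# Encode each (query, passage) pair separately with the T5 encoder.
encoder_states = []
for passage in passages:
    inputs = tokenizer(f"{query} context: {passage}", return_tensors="pt")
    encoder_states.append(model.encoder(**inputs).last_hidden_state)

# Fuse: concatenate along the sequence dimension so the decoder sees every passage.
fused = BaseModelOutput(last_hidden_state=torch.cat(encoder_states, dim=1))
output_ids = model.generate(encoder_outputs=fused, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```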

Conclusion

RAG (Retrieval-Augmented Generation) has three types:

  1. Frozen model RAG: These are ubiquitous in the industry; they are merely proofs of concept (POC).

  2. Semi-frozen model RAG: These apply intelligent retrievers and try to adapt them in some way. The LLM itself is not modified; instead, the retriever is trained or adjusted and combined with the frozen generator to produce the final output.

  3. Fully trainable RAG: End-to-end training is quite difficult, but if done correctly, it can provide optimal performance. However, it is certainly very resource-intensive.

What we commonly use as RAG is still just the first type, the frozen model RAG, so the technology is still in its infancy. Scaling language model parameters remains a challenge, and so do questions such as how to scale the retriever effectively, how to divide memory between parameters and retrieved data chunks, and how to separate knowledge retrieval from the generation process; all of these require further research.

Author: Vishal Rajput
