RAG stands for Retrieval-Augmented Generation: generation that is enhanced by retrieval.
LLMs are trained on a vast amount of data, but this training data does not include your data. RAG solves this problem by adding your data to the data that the LLM already has access to.
In RAG, your data is loaded and indexed so that it can be queried. A user query acts on the index, which filters your data down to the most relevant context. That context and your query are then passed to the LLM along with a prompt, and the LLM generates a response.
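A minimal sketch of this end-to-end flow with LlamaIndex might look like the following (assuming a `./data` directory of source files and an OpenAI API key in the environment; both are illustrative choices):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load: read every file under ./data into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Index: embed the documents into an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Query: the engine retrieves relevant context, then asks the LLM
query_engine = index.as_query_engine()
response = query_engine.query("What do these documents say about RAG?")
print(response)
```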
Stages of RAG
There are five key stages in RAG that will be part of any large application you build. These are:
- Loading: This refers to getting your data from where it lives (text files, PDFs, websites, databases, or APIs) into your pipeline. LlamaHub offers hundreds of connectors to choose from.
- Indexing: This means creating a data structure that allows the data to be queried. For LLMs, this almost always means creating vector embeddings (numerical representations of a text's meaning), along with other metadata strategies to make it easy and accurate to find contextually relevant data.
- Storing: Once your data is indexed, you will almost always want to store the index and other metadata to avoid re-indexing it (see the sketch after this list).
- Querying: For any given indexing strategy, there are many ways to query using LLMs and LlamaIndex data structures, including sub-queries, multi-step queries, and hybrid strategies.
- Evaluating: A key step in any pipeline is checking how effective it is relative to other strategies, or after you make changes. Evaluation provides objective measures of how accurate, faithful, and fast your responses to queries are.
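As a concrete sketch of how the Storing and Querying stages fit around Loading and Indexing (the `./data` and `./storage` paths are illustrative):

```python
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./storage"  # illustrative path for the saved index

if not os.path.exists(PERSIST_DIR):
    # Loading + Indexing: build the index on the first run
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # Storing: persist the index so later runs skip re-indexing
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Storing pays off: reload the index instead of rebuilding it
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

# Querying: ask a question against the (re)loaded index
print(index.as_query_engine().query("Summarize the documents."))
```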
Important Concepts in Each Stage
Loading Stage
Node and Document: A document is a container for any data source – such as a PDF, API output, or data retrieved from a database. A node is an atomic unit of data in LlamaIndex, representing a “chunk” of the source document. Nodes have metadata that associates them with the document they belong to and other nodes.
Connectors: Data connectors (often called readers) extract data from different data sources and formats into documents and nodes.
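A short sketch of connectors, documents, and nodes in practice (the file path and metadata are illustrative, and `SentenceSplitter` is just one of several node parsers):

```python
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Connector (reader): turn files under ./data into Document objects
documents = SimpleDirectoryReader("data").load_data()

# Documents can also be constructed by hand, with arbitrary metadata
note = Document(
    text="LlamaIndex is a data framework for LLM applications.",
    metadata={"source": "handwritten-note"},  # illustrative metadata
)

# Node parser: split Documents into Node chunks; each Node carries
# metadata that ties it back to its source Document
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents + [note])
print(len(nodes), nodes[0].metadata)
```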
Indexing Stage
Indexing: After you have ingested your data, LlamaIndex helps you index it into a structure that is easy to retrieve from. This usually involves generating vector embeddings and storing them in a specialized database called a vector store. The index can also store a variety of metadata about your data.
Embedding: An embedding is a numerical representation of data, generated by an embedding model. When filtering your data for relevance, LlamaIndex converts queries into embeddings, and your vector store looks for data that is numerically similar to the embedding of your query.
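Continuing from the nodes above, a sketch of indexing and embedding (assuming an embedding model is configured in `Settings`, which defaults to OpenAI):

```python
from llama_index.core import Settings, VectorStoreIndex

# Index the nodes from the previous sketch; embeddings are computed by
# the embedding model configured in Settings
index = VectorStoreIndex(nodes)

# Queries are embedded with the same model, so query and data can be
# compared in the same vector space
query_vec = Settings.embed_model.get_query_embedding("How does indexing work?")
print(len(query_vec))  # dimensionality of the embedding vector
```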
Querying Stage
Retriever: A retriever defines how to efficiently fetch relevant context from the index for a given query. Your retrieval strategy is key to both the relevance of the retrieved data and the efficiency with which it is retrieved.
Router: A router determines which retriever will be used to fetch relevant context from the knowledge base. More specifically, the RouterRetriever class is responsible for selecting one or more candidate retrievers to execute a query. It uses a selector to choose the best candidate based on each retriever's metadata and the query.
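A hedged sketch of routing between two candidate retrievers, reusing the index from above (the top-k values and tool descriptions are illustrative):

```python
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import RetrieverTool

# Two candidate retrievers over the same index, tuned differently
broad_retriever = index.as_retriever(similarity_top_k=10)
narrow_retriever = index.as_retriever(similarity_top_k=2)

router = RouterRetriever(
    selector=LLMSingleSelector.from_defaults(),  # an LLM picks one candidate
    retriever_tools=[
        RetrieverTool.from_defaults(
            retriever=broad_retriever,
            description="Broad search: use for open-ended or summary questions.",
        ),
        RetrieverTool.from_defaults(
            retriever=narrow_retriever,
            description="Narrow search: use for precise factual lookups.",
        ),
    ],
)
nodes = router.retrieve("What is a router?")
```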
Node Post-Processor: The node post-processor receives a set of retrieved nodes and applies transformation, filtering, or reordering logic to them.
Response Synthesizer: The response synthesizer generates a response from the LLM, using the user query and a given set of retrieved text chunks.
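Composing these querying components explicitly, rather than relying on `index.as_query_engine()`, might look like this sketch (the similarity cutoff is an illustrative value):

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine

# Retriever: fetch the top-5 most similar nodes for a query
retriever = index.as_retriever(similarity_top_k=5)

# Node post-processor: drop retrieved nodes below a similarity threshold
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.7)

# Response synthesizer: combine query + surviving nodes into an LLM answer
synthesizer = get_response_synthesizer()

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor],
    response_synthesizer=synthesizer,
)
response = query_engine.query("What does a node post-processor do?")
print(response)
```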
Putting It All Together
Data-backed LLM applications have countless use cases, but they can be broadly grouped into three types:
- Query Engine: A query engine is an end-to-end pipeline that allows you to ask questions of your data. It accepts a natural-language query and returns a response, along with the reference context retrieved and passed to the LLM. (See the sketch after this list.)
- Chat Engine: A chat engine is an end-to-end pipeline for conversing with your data (multiple back-and-forth exchanges instead of a single question and answer).
- Agent: An agent is an automated decision-maker powered by an LLM that interacts with the world through a set of tools. Agents can take any number of steps to complete a given task, dynamically deciding on the best course of action rather than following a predetermined set of steps. This gives them added flexibility to handle more complex tasks.
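A sketch of all three, reusing the index built earlier (this uses the classic `ReActAgent` API, which has moved in newer LlamaIndex releases; the tool name and descriptions are illustrative):

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Query engine: single-shot Q&A over the index
query_engine = index.as_query_engine()
print(query_engine.query("What is RAG?"))

# Chat engine: multi-turn conversation that keeps chat history
chat_engine = index.as_chat_engine()
print(chat_engine.chat("What is RAG?"))
print(chat_engine.chat("And how does it use an index?"))  # follow-up sees history

# Agent: expose the query engine as a tool the LLM may decide to call
docs_tool = QueryEngineTool.from_defaults(
    query_engine=query_engine,
    name="docs",  # illustrative tool name
    description="Answers questions about the loaded documents.",
)
agent = ReActAgent.from_tools([docs_tool], verbose=True)
print(agent.chat("Use the docs tool to explain what an agent is."))
```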