In 2023, Retrieval-Augmented Generation (RAG) technology dominated the landscape, and in 2024, agentic workflows are driving significant advancements. The use of AI agents opens up new possibilities for building more powerful, robust, and versatile applications powered by large language models (LLMs). One potential use case is using AI agents to enhance the RAG pipeline, an approach known as agentic RAG.

Basics of Agentic RAG
What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a technology for building LLM-driven applications. It uses external knowledge sources to provide relevant context to LLMs, reducing hallucinations.
A simple RAG process includes a retrieval component (typically consisting of an embedding model and a vector database) and a generation component (an LLM). During inference, user queries are used to run similarity searches on indexed documents to retrieve the most similar documents and provide additional context to the LLM.
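To make these two components concrete, here is a minimal sketch of such a pipeline. The `embed`, `vector_search`, and `generate` functions are illustrative stand-ins for a real embedding model, vector database, and LLM; none of this is code from the original post.

```python
# Minimal sketch of a naive RAG pipeline; all three helpers are placeholders.

def embed(text: str) -> list[float]:
    # A real pipeline would call an embedding model here.
    return [float(ord(c)) for c in text[:8]]

def vector_search(query_vector: list[float], top_k: int = 3) -> list[str]:
    # A real pipeline would run a similarity search against a vector database here.
    documents = [
        "HNSW is a graph-based approximate nearest neighbor index.",
        "DiskANN keeps most of the index on disk to scale beyond RAM.",
        "Hybrid search combines keyword and vector search scores.",
    ]
    return documents[:top_k]

def generate(prompt: str) -> str:
    # A real pipeline would call an LLM here.
    return f"(LLM answer based on a prompt of {len(prompt)} characters)"

def naive_rag(query: str) -> str:
    # 1. Retrieval: embed the query and fetch the most similar documents.
    context = "\n".join(vector_search(embed(query)))
    # 2. Generation: pass the retrieved context plus the query to the LLM.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

print(naive_rag("How is HNSW different from DiskANN?"))
```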
Typical RAG applications have two significant limitations:
- Simple RAG processes consider only one external knowledge source. However, some solutions may require two external knowledge sources, and some may require external tools and APIs, such as web searches.
- They are one-off solutions, meaning context is retrieved only once. There is no reasoning about or validation of the quality of the retrieved context.
What are Agents in AI Systems?
With the popularity of LLMs, a new paradigm of AI agents and multi-agent systems has emerged. AI agents are LLMs with roles and tasks that can access memory and external tools. The reasoning capabilities of LLMs help agents plan the necessary steps and take actions to complete the tasks at hand.
Thus, the core components of AI agents include the following (a minimal sketch in code follows the list):
- LLM (with roles and tasks)
- Memory (short-term and long-term)
- Planning (e.g., reflection, self-critique, query routing)
- Tools (e.g., calculators, web search)
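A minimal sketch of how these components might fit together in code; all names and the `Callable` stand-ins are illustrative rather than taken from any particular framework.

```python
# Illustrative sketch of the four agent components as a plain data structure.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    role: str                           # role/task given to the LLM
    llm: Callable[[str], str]           # the language model itself
    memory: List[str] = field(default_factory=list)                      # short-term memory (chat history)
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # external tools by name

    def plan(self, task: str) -> str:
        # Planning step: ask the LLM to reason about the next action given the history.
        history = "\n".join(self.memory)
        return self.llm(f"Role: {self.role}\nHistory: {history}\nTask: {task}\nNext step:")
```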
A popular framework is the ReAct framework. ReAct agents can process sequential multi-part queries while maintaining state (in memory) by combining routing, query planning, and tool usage into a single entity.
ReAct = Reasoning + Action (using LLM)
The process includes the following steps:
- Think: Upon receiving a user query, the agent reasons about the next action to take.
- Act: The agent decides on an action and executes it (e.g., using a tool).
- Observe: The agent observes the feedback from the action.
- This process iterates until the agent completes the task and responds to the user (a minimal sketch of the loop follows below).
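A minimal sketch of this think-act-observe loop, assuming a placeholder `llm` callable and a dictionary of tool functions; the prompt format and parsing are illustrative assumptions, not a specific framework's implementation.

```python
# Minimal think-act-observe loop in the ReAct style.
# `llm` and `tools` are placeholder callables, not tied to any specific framework.
from typing import Callable, Dict

def react_loop(question: str,
               llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    scratchpad = f"Question: {question}"
    for _ in range(max_steps):
        # Think: the LLM reasons about the next step. It is prompted to reply either with
        # "ACTION: <tool name> | <tool input>" or with "FINAL ANSWER: <answer>".
        thought = llm(scratchpad + "\nThought:")
        scratchpad += f"\nThought: {thought}"
        if "FINAL ANSWER:" in thought:
            return thought.split("FINAL ANSWER:", 1)[1].strip()
        if "ACTION:" in thought:
            # Act: parse the chosen tool and its input, then execute the tool.
            tool_name, _, tool_input = thought.split("ACTION:", 1)[1].partition("|")
            tool = tools.get(tool_name.strip(), lambda _: "Unknown tool")
            observation = tool(tool_input.strip())
            # Observe: feed the tool output back for the next iteration.
            scratchpad += f"\nObservation: {observation}"
    return "No final answer within the step limit."
```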
What is Agentic RAG?
Agentic RAG describes RAG implemented based on AI agents. Specifically, it incorporates AI agents into the RAG process to coordinate its components and perform additional actions beyond simple information retrieval and generation to overcome the limitations of non-agentic processes.
How Does Agentic RAG Work?
Although agents can be incorporated into different stages of the RAG process, agentic RAG is most commonly used in the retrieval component.
Specifically, the retrieval component becomes agentified by using retrieval agents that have access to various retrieval tools, such as:
- Vector search engines (also known as query engines), which perform vector searches over a vector index (as in typical RAG processes)
- Web search
- Calculator
- Any API used to programmatically access software, such as email or chat programs
The RAG agent can then reason and act in the following example retrieval scenarios:
- Decide whether to retrieve information
- Decide which tool to use to retrieve relevant information
- Formulate the query itself
- Evaluate the retrieved context and decide whether to re-retrieve (a sketch of this step follows below)
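For example, the last step of grading the retrieved context and re-retrieving if it falls short could look roughly like this; the `llm` and `retrieve` callables are placeholders, not code from the original post.

```python
# Illustrative sketch of an agent grading retrieved context and re-retrieving if needed.
# `llm` and `retrieve` are placeholder callables, not code from the original post.
from typing import Callable, List

def retrieve_with_validation(query: str,
                             retrieve: Callable[[str], List[str]],
                             llm: Callable[[str], str],
                             max_attempts: int = 2) -> List[str]:
    context: List[str] = []
    for _ in range(max_attempts):
        context = retrieve(query)
        # Ask the LLM to grade whether the retrieved chunks can answer the question.
        verdict = llm(
            f"Question: {query}\nContext: {context}\n"
            "Reply 'yes' if the context is sufficient to answer the question, otherwise 'no'."
        )
        if verdict.strip().lower().startswith("yes"):
            return context
        # Otherwise, let the LLM reformulate the query and retrieve again.
        query = llm(f"Rewrite this search query to find better context: {query}")
    return context  # fall back to the last retrieval attempt
```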
Compared to the sequential simple RAG architecture, the core of the agentic RAG architecture is the agent. The agentic RAG architecture can have varying degrees of complexity. In its simplest form, a single-agent RAG architecture is a simple router. However, you can also add multiple agents to a multi-agent RAG architecture. This section discusses two fundamental RAG architectures.
Single-Agent RAG (Router)
In its simplest form, agentic RAG is a router. This means you have at least two external knowledge sources, and the agent decides from which one to retrieve additional context. However, external knowledge sources are not limited to (vector) databases; you can also retrieve further information from tools. For example, you might perform a web search, or you could use an API to retrieve additional information from a Slack channel or your email account.
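A rough sketch of such a router, assuming a placeholder `llm` callable and a dictionary of retrieval functions; the source names are illustrative assumptions.

```python
# Illustrative single-agent router: one LLM call picks the knowledge source to query.
# The `llm` callable and the source names are placeholders, not from the original post.
from typing import Callable, Dict

def route_and_retrieve(query: str,
                       llm: Callable[[str], str],
                       sources: Dict[str, Callable[[str], str]]) -> str:
    # Ask the LLM to route the query to exactly one source.
    choice = llm(
        f"Choose exactly one of {list(sources)} to answer the question below. "
        f"Reply with the source name only.\nQuestion: {query}"
    ).strip()
    # Fall back to the first source if the LLM replies with an unknown name.
    retrieve = sources.get(choice) or next(iter(sources.values()))
    return retrieve(query)

# Example wiring (all placeholder functions):
# sources = {"vector_db": weaviate_search, "web_search": web_search, "slack": slack_search}
```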
Multi-Agent RAG Systems
As you might guess, single-agent systems have their limitations, since one agent has to handle reasoning, retrieval, and answer generation on its own. It is therefore beneficial to chain multiple agents into multi-agent RAG applications.
For example, you might have a primary agent that coordinates information retrieval among multiple specialized retrieval agents: one agent could retrieve information from proprietary internal data sources, another could specialize in your personal accounts (like email or chat), and yet another might specialize in retrieving public information from web searches.
The examples above illustrate the use of different retrieval agents. However, you can also use agents for purposes beyond retrieval. The possibilities for agents in RAG systems are diverse.
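A rough sketch of this coordination pattern, again with placeholder callables standing in for the primary agent's LLM and the specialized retrieval agents.

```python
# Illustrative multi-agent coordination: a primary agent delegates to specialized
# retrieval agents and then synthesizes their findings. All callables are placeholders.
from typing import Callable, Dict

def primary_agent(query: str,
                  llm: Callable[[str], str],
                  retrieval_agents: Dict[str, Callable[[str], str]]) -> str:
    # Ask the coordinating LLM which specialists (internal docs, email, web, ...) are relevant.
    plan = llm(
        f"Which of these retrieval agents are relevant: {list(retrieval_agents)}? "
        f"Reply with a comma-separated list of names.\nQuestion: {query}"
    )
    selected = [name for name in retrieval_agents if name in plan]
    # Each selected specialist retrieves context from its own source.
    findings = {name: retrieval_agents[name](query) for name in selected}
    # The coordinator synthesizes the final answer from the combined findings.
    return llm(f"Question: {query}\nFindings: {findings}\nWrite the final answer.")
```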
Agentic RAG vs (Regular) RAG
While the fundamental concept of RAG (sending queries, retrieving information, generating responses) remains unchanged, tool usage expands it, making it more flexible and powerful.
Think of it this way: Regular (standard) RAG is like answering a specific question in a library (before smartphones existed). In contrast, agentic RAG is like having a smartphone with a web browser, calculator, email, etc.
Unlike regular RAG, agentic RAG can access external tools, pre-process queries, perform multi-step retrieval, and validate the retrieved information.
Implementing Agentic RAG
As mentioned earlier, agents consist of multiple components. To build an agentic RAG process, there are two options: language models with function calls or agent frameworks. Both approaches can achieve the same result; which one to choose depends on how much control and flexibility you want.
Language Models with Function Calls
Language models are the primary component of an agentic RAG system. Another component is tools that enable the language model to access external services. Language models with function calls provide a way to build agent systems, allowing the model to interact with predefined tools. Language model providers have added this feature to their clients.
In June 2023, OpenAI released function calling for gpt-3.5-turbo and gpt-4, enabling these models to reliably connect GPT's capabilities with external tools and APIs. Developers quickly began building applications that plugged gpt-4 into code executors, databases, calculators, and more.
Cohere later launched its Connectors API, adding tools to the Command R model family, and Anthropic and Google released function calling for Claude and Gemini. By connecting these models to external services, they can access and reference web resources, execute code, and more.
Function calling is not limited to proprietary models: Ollama introduced tool support for popular open-source models such as Llama 3.2 and nemotron-mini.
To build a tool, you first need to define a function. In this snippet, we are writing a function that retrieves objects from a database using Weaviate’s hybrid search:
def get_search_results(query: str) -> str:
    """Sends a query to Weaviate's Hybrid Search. Parses the response into a {k}:{v} string."""
    # `blogs` is assumed to be a Weaviate collection object created elsewhere,
    # e.g. blogs = client.collections.get("Blogs").
    response = blogs.query.hybrid(query, limit=5)
    stringified_response = ""
    for idx, o in enumerate(response.objects):
        stringified_response += f"Search Result: {idx+1}:\n"
        for prop in o.properties:
            stringified_response += f"{prop}:{o.properties[prop]}"
        stringified_response += "\n"
    return stringified_response
Then we describe the function to the language model via a `tools_schema`. This schema is then used in the prompt for the language model:
tools_schema = [{
    'type': 'function',
    'function': {
        'name': 'get_search_results',
        'description': 'Get search results for a provided query.',
        'parameters': {
            'type': 'object',
            'properties': {
                'query': {
                    'type': 'string',
                    'description': 'The search query.',
                },
            },
            'required': ['query'],
        },
    },
}]
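The routing loop shown next also expects a `tool_mapping` from the tool name in the schema to the actual Python function. The original snippet does not define it, so a minimal assumed version would be:

```python
# Maps the tool name announced in tools_schema to the Python function that implements it.
# (Assumed definition; the original snippet does not show it.)
tool_mapping = {"get_search_results": get_search_results}
```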
Since you are directly connecting to the language model API, you need to write a loop that routes between the language model and the tools:
from typing import Dict, List

import ollama

def ollama_generation_with_tools(user_message: str,
                                 tools_schema: List, tool_mapping: Dict,
                                 model_name: str = "llama3.1") -> str:
    messages = [{
        "role": "user",
        "content": user_message
    }]
    response = ollama.chat(
        model=model_name,
        messages=messages,
        tools=tools_schema
    )
    if not response["message"].get("tool_calls"):
        return response["message"]["content"]
    else:
        for tool in response["message"]["tool_calls"]:
            function_to_call = tool_mapping[tool["function"]["name"]]
            print(f"Calling function {function_to_call}...")
            function_response = function_to_call(tool["function"]["arguments"]["query"])
            messages.append({
                "role": "tool",
                "content": function_response,
            })
    final_response = ollama.chat(model=model_name, messages=messages)
    return final_response["message"]["content"]
Your query will look like this:
ollama_generation_with_tools("How is HNSW different from DiskANN?", tools_schema=tools_schema, tool_mapping=tool_mapping)
Agent Frameworks
The emergence of agent frameworks like DSPy, LangChain, CrewAI, LlamaIndex, and Letta has made it easier to build applications using language models. These frameworks simplify the process of building agentic RAG systems by combining pre-built templates.
- DSPy supports ReAct agents and Avatar optimization. Avatar optimization uses automated prompt engineering to optimize the description of each tool.
- LangChain offers many services for working with tools. LangChain's LCEL and LangGraph frameworks also provide built-in tools.
- LlamaIndex provides the QueryEngineTool, a collection of templates for retrieval tools (a sketch in this style follows after the list).
- CrewAI is one of the leading frameworks for developing multi-agent systems. One of its key concepts for tool usage is sharing tools among agents.
- Swarm is a multi-agent orchestration framework built by OpenAI. Swarm likewise focuses on how agents share tools.
- Letta exposes reflecting on and refining an internal world model as functions. This means that, in addition to answering the question, search results can be used to update the agent's memory of the chatbot user.
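To illustrate the framework approach, a sketch in the style of LlamaIndex's tool and agent abstractions might look like the following; exact import paths and signatures vary by version, so treat them as assumptions rather than a definitive implementation.

```python
# Sketch of the framework approach, in the style of LlamaIndex's tool and agent abstractions.
# Import paths and signatures vary by version; treat the exact API as an assumption.
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# Reuse the Weaviate hybrid-search function defined earlier as an agent tool.
search_tool = FunctionTool.from_defaults(fn=get_search_results)

# The ReAct agent plans, decides when to call the tool, and keeps chat state.
# Assumes a default LLM is configured (e.g., via Settings.llm or an OpenAI API key).
agent = ReActAgent.from_tools([search_tool], verbose=True)
print(agent.chat("How is HNSW different from DiskANN?"))
```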
https://weaviate.io/blog/what-is-agentic-rag