
This article introduces the seven most common RAG architectures, including the latest: Agentic RAG.

Most Popular RAG Architectures
- Naive RAG: The most basic architecture, which includes a simple document retrieval, processing, and response generation process.
- Retrieve-and-rerank: Adds a reranking step on top of the basic RAG, which can optimize the relevance of retrieval results.
- Multimodal RAG: Capable of handling multimodal data such as images, not limited to text.
- Graph RAG: Enhances knowledge connections using graph databases, which can better understand the relationships between documents.
- Hybrid RAG: Combines the advantages of various technologies, including graph structures and traditional retrieval methods.
- Agentic RAG Router: Uses AI Agents to route and process queries, allowing selection of the most suitable processing path.
- Agentic RAG Multi-Agent: Uses multiple specialized AI Agents working collaboratively, capable of invoking different tools (such as vector search, web search, Slack, Gmail, etc.).
Core Components
- Embedding model: Converts text into vector representations.
- Generative model: Responsible for the final content generation.
- Reranker model: Optimizes the relevance of retrieval results.
- Vector database: Stores and retrieves vectorized content.
- Prompt template: A standardized query processing template.
- AI Agent: Intelligent decision-making and task coordination.
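To make the roles of these components concrete, here is a minimal sketch of how they fit together in a naive RAG pipeline. The `embed()` and `generate()` functions are hypothetical stand-ins for real embedding and generative models, not actual APIs:

```python
# Minimal sketch of core RAG components wired into a naive pipeline.
# embed() and generate() are hypothetical stand-ins for real models.

def embed(text: str) -> list:
    # Stand-in: a real system calls an embedding model here.
    return [float(ord(c)) for c in text[:8]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Toy vector database: stores (vector, text) pairs, searches by similarity."""
    def __init__(self):
        self.docs = []

    def add(self, text: str):
        self.docs.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list:
        q = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(q, d[0]), reverse=True)
        return [text for _, text in ranked[:k]]

# Prompt template: standardizes how context and question reach the model.
PROMPT_TEMPLATE = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def generate(prompt: str) -> str:
    # Stand-in for a generative model call.
    return f"(answer generated from prompt of {len(prompt)} chars)"

def naive_rag(store: VectorStore, question: str) -> str:
    context = "\n".join(store.search(question))
    return generate(PROMPT_TEMPLATE.format(context=context, question=question))
```

A reranker would slot in between `store.search()` and prompt construction, reordering the retrieved texts before they enter the template.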
Typical RAG Limitations:
- Naive RAG pipelines only consider one external knowledge source. However, some solutions may require two external knowledge sources, while others may need external tools and APIs, such as web searches.
- They are one-shot solutions: the context is retrieved once, with no reasoning about or validation of the quality of the retrieved context.
What are Agents in AI Systems
An AI Agent is an LLM with roles and tasks, capable of accessing memory and external tools. The reasoning ability of the LLM helps the agent plan the necessary steps and take action to complete the task at hand.
AI Agent Core Components:
- LLM (with a role and a task)
- Memory (short-term and long-term)
- Planning (e.g., reflection, self-critique, query routing, etc.)
- Tools (e.g., calculator, web search, etc.)
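The four components above can be sketched as a small class. The `plan()` method is a hypothetical stand-in: a real agent would ask the LLM, via reflection or query routing, which tool fits the query:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Agent:
    role: str                                             # the LLM's role and task
    tools: Dict[str, Callable] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)       # short-term memory

    def plan(self, query: str) -> str:
        # Stand-in planning step: a real agent would have the LLM
        # reason about which tool to use; here we route on a keyword rule.
        return "calculator" if any(ch.isdigit() for ch in query) else "web_search"

    def act(self, query: str) -> str:
        tool_name = self.plan(query)
        self.memory.append(f"{query} -> {tool_name}")      # remember the step
        return self.tools[tool_name](query)
```

For example, an agent built with a `calculator` and a `web_search` tool routes "2+2" to the former and a factual question to the latter, recording each decision in memory.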

ReAct Framework
ReAct agents can handle sequential multi-part queries by combining routing, query planning, and tool usage into a single entity while maintaining state (in memory).
ReAct = Reason + Act (with LLMs)
The process involves the following steps:
- Thought: Upon receiving the user query, the agent reasons about the next action to take.
- Action: The agent decides on an action and executes it (e.g., tool use).
- Observation: The agent observes the feedback from the action.
This process iterates until the agent completes the task and responds to the user.
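The Thought, Action, Observation loop above can be sketched as follows. The `llm_decide()` function is a hypothetical stand-in for the reasoning LLM; the accumulated observations are the state kept in memory:

```python
# Minimal sketch of a ReAct-style loop; llm_decide() stands in for the LLM.

def llm_decide(query, observations):
    """Thought step: return ('act', tool_name) or ('finish', answer)."""
    if not observations:
        return ("act", "search")      # no context yet: retrieve first
    return ("finish", f"Answer using {len(observations)} observation(s)")

def react_loop(query: str, tools: dict, max_steps: int = 5) -> str:
    observations = []                                      # state kept in memory
    for _ in range(max_steps):
        decision, payload = llm_decide(query, observations)  # Thought
        if decision == "finish":
            return payload
        result = tools[payload](query)                       # Action
        observations.append(result)                          # Observation
    return "Gave up after max_steps."
```

The `max_steps` cap is a common safeguard: without it, a model that never decides to finish would loop indefinitely.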
Agentic RAG
What is Agentic RAG?

Agentic RAG describes an AI agent-based implementation of RAG. Specifically, it integrates AI agents into the RAG pipeline to orchestrate its components and perform additional operations beyond simple information retrieval and generation, overcoming the limitations of non-agentic pipelines.
How does Agentic RAG work?
Agentic RAG typically refers to the use of agents in the retrieval component.
Specifically, the retrieval component becomes an agentic component by using retrieval agents that can access different retrieval tools, such as:
- Vector search engine (also called a query engine), which performs vector search over a vector index (as in typical RAG pipelines).
- Web search.
- Calculator.
- Any API to access software programmatically, such as email or chat programs.
- And many more.
Then, the RAG agent can reason and operate on the following example retrieval scenarios:
- Decide whether to retrieve information or not.
- Decide which tool to use to retrieve relevant information.
- Formulate the query itself.
- Evaluate the retrieved context and decide whether re-retrieval is needed.
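These four decisions can be sketched as a small retrieval loop. `route_llm()` and `grade_llm()` are hypothetical stand-ins for LLM calls that would, respectively, pick a tool (or decide no retrieval is needed) and judge the retrieved context:

```python
from typing import Optional

def route_llm(query: str) -> Optional[str]:
    # Decide whether to retrieve, and with which tool (stand-in rule-based router).
    if "today" in query or "latest" in query:
        return "web_search"                  # fresh information: go to the web
    if any(ch.isdigit() for ch in query) and "+" in query:
        return None                          # simple arithmetic: no retrieval needed
    return "vector_search"

def grade_llm(query: str, context: str) -> bool:
    # Decide whether the retrieved context is good enough (stand-in grader).
    return bool(context.strip())

def agentic_retrieve(query: str, tools: dict, max_retries: int = 2) -> str:
    tool_name = route_llm(query)             # 1) whether + 2) which tool
    if tool_name is None:
        return ""
    rewritten = query.strip().rstrip("?")    # 3) (trivially) reformulate the query
    context = ""
    for _ in range(max_retries):
        context = tools[tool_name](rewritten)
        if grade_llm(query, context):        # 4) evaluate; re-retrieve if weak
            return context
        rewritten += " (more detail)"
    return context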
Agentic RAG Architecture
Agentic RAG architecture can have varying degrees of complexity. In its simplest form, the single-agent RAG architecture is a simple router. Similarly, multiple agents can be added to a multi-agent RAG architecture. This section discusses two basic RAG architectures.
Single-Agent RAG (Router)
The simplest form of Agentic RAG is the Router. This means there are at least two external knowledge sources, and the agent decides from which source to retrieve additional context. However, the external knowledge sources are not limited to (vector) databases. You can also retrieve more information from tools. For example, you can perform web searches or use APIs to retrieve additional information from Slack channels or your email account.

Multi-agent RAG Systems
A single-agent system has its limitations: one agent must handle reasoning, retrieval, and answer generation all by itself.
To address this, multiple agents can be chained into a multi-agent RAG system.
For example, a main agent can coordinate information retrieval among multiple dedicated retrieval agents: one agent may retrieve information from proprietary internal data sources; another may specialize in your personal accounts, such as email or chat; yet another may specialize in retrieving public information from web searches.
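That division of labor can be sketched as a coordinator delegating to specialized sub-agents. The three sub-agent functions below are hypothetical stand-ins, and the keyword routing stands in for an LLM deciding which agents to involve:

```python
# Sketch of a multi-agent RAG coordinator; sub-agents are stand-ins.

def internal_docs_agent(query: str) -> str:
    return f"[internal] results for {query}"

def personal_accounts_agent(query: str) -> str:
    return f"[email/chat] results for {query}"

def web_agent(query: str) -> str:
    return f"[web] results for {query}"

def coordinator(query: str) -> list:
    # A real coordinator would ask an LLM which agents to involve;
    # here we route on simple keywords for illustration.
    agents = []
    if "email" in query.lower():
        agents.append(personal_accounts_agent)
    if "policy" in query.lower():
        agents.append(internal_docs_agent)
    if not agents:
        agents.append(web_agent)
    return [agent(query) for agent in agents]
```

A single query can thus fan out to several agents at once, and the coordinator merges their results before answer generation.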

Implementing Agentic RAG
To build an Agentic RAG pipeline, there are two options:
- A language model with function calling.
- An agent framework.
Language Models with Function Calling
Tool use is a crucial component of Agentic RAG, enabling language models to access external services. Language models with function calling capabilities provide a way to build agentic systems by allowing the model to interact with predefined tools.
- In June 2023, OpenAI released function calling for gpt-3.5-turbo and gpt-4, allowing these models to reliably connect GPT's capabilities with external tools and APIs. Developers quickly began building applications that integrate gpt-4 with code execution engines, databases, calculators, and more.
- Cohere later launched its Connector API to add tools to the Command R model suite. Additionally, Anthropic and Google have introduced function calling for Claude and Gemini. Backed by external services, these models can access and reference web resources, execute code, and more.
- Function calling is not limited to proprietary models. Ollama introduced tool support for popular open-source models like Llama 3.2, nemotron-mini, etc.
To build a tool, you first need to define a function. In this code snippet, we will write a function that retrieves objects from the database using Weaviate’s hybrid search:
```python
def get_search_results(query: str) -> str:
    """Sends a query to Weaviate's Hybrid Search. Parses the response into a {k}:{v} string."""
    # "blogs" is assumed to be a Weaviate collection, e.g. client.collections.get("Blogs")
    response = blogs.query.hybrid(query, limit=5)
    stringified_response = ""
    for idx, o in enumerate(response.objects):
        stringified_response += f"Search Result {idx + 1}:\n"
        for prop in o.properties:
            stringified_response += f"{prop}: {o.properties[prop]}"
        stringified_response += "\n"
    return stringified_response
```
Next, we describe the function to the language model through a tools_schema, which is then passed along with the prompt in the model call:
```python
tools_schema = [{
    'type': 'function',
    'function': {
        'name': 'get_search_results',
        'description': 'Get search results for a provided query.',
        'parameters': {
            'type': 'object',
            'properties': {
                'query': {
                    'type': 'string',
                    'description': 'The search query.',
                },
            },
            'required': ['query'],
        },
    },
}]
```
Since you are directly connected to the language model API, you need to write a loop that routes between the language model and the tools:
```python
from typing import Dict, List

import ollama

def ollama_generation_with_tools(user_message: str,
                                 tools_schema: List, tool_mapping: Dict,
                                 model_name: str = "llama3.1") -> str:
    messages = [{
        "role": "user",
        "content": user_message
    }]
    response = ollama.chat(
        model=model_name,
        messages=messages,
        tools=tools_schema
    )
    # If the model did not request a tool, return its answer directly.
    if not response["message"].get("tool_calls"):
        return response["message"]["content"]
    # Otherwise, execute each requested tool and feed the results back.
    for tool in response["message"]["tool_calls"]:
        function_to_call = tool_mapping[tool["function"]["name"]]
        print(f"Calling function {function_to_call}...")
        function_response = function_to_call(tool["function"]["arguments"]["query"])
        messages.append({
            "role": "tool",
            "content": function_response,
        })
    final_response = ollama.chat(model=model_name, messages=messages)
    return final_response["message"]["content"]
```
Then, a query looks like this (tool_mapping maps tool names to Python functions, e.g. {"get_search_results": get_search_results}):
```python
ollama_generation_with_tools("How is HNSW different from DiskANN?",
                             tools_schema=tools_schema, tool_mapping=tool_mapping)
```
Agent Frameworks
If you find the code-based Function Calling approach too complex, you can use easy-to-use framework solutions.
DSPy, LangChain, CrewAI, LlamaIndex, and Letta are among the agent frameworks that have emerged to facilitate the use of language models to build applications. These frameworks simplify the construction of agentic RAG systems by stitching together pre-built templates.
- DSPy supports ReAct agents and Avatar optimization. Avatar optimization describes the use of automatic prompt engineering to describe each tool.
- LangChain provides many services for working with tools. LangChain's LCEL and LangGraph frameworks further provide built-in tools.
- LlamaIndex introduces the QueryEngineTool, a collection of templates for retrieval tools.
- CrewAI is one of the leading frameworks for developing multi-agent systems. One of its key concepts for tool usage is sharing tools among agents.
- Swarm is a framework built by OpenAI for multi-agent orchestration. Swarm likewise focuses on how tools are shared among agents.
- Letta exposes reflecting on and refining an internal world model as functions. Beyond answering a question, this can mean using search results to update the agent's memory of the chatbot user.
Demo
Here, we also recommend Verba 2.1, the open-source RAG demo launched by Weaviate:
- https://www.youtube.com/watch?v=swKKRdLBhas
- https://verba.weaviate.io/
- https://github.com/weaviate/Verba
Reference
- Vanilla RAG: https://weaviate.io/blog/introduction-to-rag
- Advanced RAG: https://weaviate.io/blog/advanced-rag
- Multimodal RAG: https://weaviate.io/blog/multimodal-rag
- Agentic RAG: https://weaviate.io/blog/what-is-agentic-rag
- Modular RAG: Transforming RAG Systems into LEGO-like Reconfigurable Frameworks: https://arxiv.org/pdf/2407.21059