
Source: DeepHub IMBA
This article is about 3,000 words; estimated reading time 6 minutes.
This article introduces you to building multi-agent RAG using LlamaIndex.
Retrieval-Augmented Generation (RAG) has become a powerful technique for enhancing the capabilities of Large Language Models (LLMs). By retrieving relevant information from knowledge sources and incorporating it into prompts, RAG provides LLMs with useful context to generate fact-based outputs.
However, existing single-agent RAG systems face challenges of low retrieval efficiency, high latency, and suboptimal prompts. These issues limit the performance of RAG in real-world applications. Multi-agent architecture offers an ideal framework to overcome these challenges and unlock the full potential of RAG. By dividing responsibilities, multi-agent systems allow for specialized roles, parallel execution, and optimized collaboration.
Single-Agent RAG
Current RAG systems use a single agent to handle the complete workflow—query analysis, paragraph retrieval, ranking, summarization, and prompt enhancement.
This monolithic approach is simple and integrated, but relying on one agent for every task creates bottlenecks. The agent wastes time retrieving irrelevant passages from a large corpus, summarizes long contexts poorly, and cannot optimally integrate the original question with the retrieved information in the prompt.
These inefficiencies severely limit the scalability and speed of RAG for real-time applications.
Multi-Agent RAG
The multi-agent architecture can overcome the limitations of single agents. By modularizing RAG into concurrently executed roles, the following can be achieved:
Retrieval: Dedicated retrieval agents focus on efficient passage retrieval using optimized search techniques. This minimizes latency.
Search: With retrieval decoupled into its own role, searches can be parallelized across retrieval agents to reduce wait times.
Ranking: Separate ranking agents evaluate the richness, specificity, and other relevant signals of the retrieved data. This filters for maximum relevance.
Summarization: Condenses lengthy contexts into concise segments that contain only the most important facts.
Optimized prompts: Dynamically adjust the integration of original prompts and retrieved information.
Flexible architecture: Agents can be replaced and added to customize the system. Visualization tool agents can provide insights into workflows.
By dividing RAG into specialized collaborative roles, multi-agent systems enhance relevance, reduce latency, and optimize prompts. This unlocks scalable high-performance RAG.
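As an illustration, the role decomposition above can be sketched as a small pipeline of plain Python functions. This is a toy sketch: the corpora, overlap-based scoring, and truncation-based "summarization" are hypothetical stand-ins for real LLM-backed agents, and only the parallel-retrieval structure mirrors the architecture described here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical mini-corpora; real retrieval agents would query vector stores.
CORPORA = {
    "wiki": ["Boston has a vibrant arts scene.", "Tokyo is in Japan."],
    "news": ["The Boston Ballet opened its season.", "Rain expected today."],
}

def retrieve(source, query):
    # Retrieval agent: return passages sharing any word with the query.
    q = set(query.lower().split())
    return [p for p in CORPORA[source] if q & set(p.lower().split())]

def rank(passages, query):
    # Ranking agent: order passages by word overlap with the query (a toy signal).
    q = set(query.lower().split())
    return sorted(passages, key=lambda p: len(q & set(p.lower().split())), reverse=True)

def summarize(passages, limit=2):
    # Summarization agent: condense to the top passages only.
    return " ".join(passages[:limit])

def build_prompt(query, context):
    # Prompt agent: integrate the question with the retrieved context.
    return f"Context: {context}\nQuestion: {query}"

query = "arts in Boston"
# Retrieval agents run in parallel, one per source, then hand off downstream.
with ThreadPoolExecutor() as pool:
    results = pool.map(lambda s: retrieve(s, query), CORPORA)
passages = [p for r in results for p in r]
prompt = build_prompt(query, summarize(rank(passages, query)))
```

Each stage here could be swapped for a specialized agent without touching the others, which is the flexibility the architecture aims for.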
Dividing responsibilities allows retrieval agents to combine complementary techniques, such as vector similarity, knowledge graphs, and internet scraping. This multi-signal approach allows retrieval to capture different aspects of relevance.
By collaboratively decomposing retrieval and ranking among agents, relevance can be optimized from different perspectives. Combined with reading and orchestration agents, it supports scalable multi-perspective RAG.
The modular architecture allows engineers to combine different retrieval techniques across specialized agents.
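One common way to merge the ranked lists that complementary retrievers (vector similarity, knowledge graphs, web search) produce is reciprocal rank fusion. This is a generic illustration of multi-signal fusion, not part of the LlamaIndex example that follows:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k = 60 is the constant suggested in the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

# e.g. a vector-similarity ranking and a knowledge-graph ranking
vector_hits = ["d1", "d2", "d3"]
graph_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

Documents endorsed by several signals ("d1", "d3") rise to the top, capturing the multi-perspective relevance the text describes.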
Multi-Agent RAG with LlamaIndex
LlamaIndex outlines a concrete example of multi-agent RAG:
Document agents—perform QA and summarization within a single document.
Vector index—enables semantic search for each document agent.
Summary index—allows summarization for each document agent.
Top-level agents—orchestrate document agents to retrieve answers to cross-document questions.
For multi-document QA, this setup shows real advantages over single-agent RAG baselines: dedicated document agents, coordinated by a top-level agent, produce more focused and relevant responses grounded in the specific documents.
Let’s see how this is implemented with LlamaIndex.
We will download Wikipedia articles about different cities, storing each article separately. We use only 18 cities, which, while not a large corpus, is enough to demonstrate advanced document retrieval.
from llama_index import (
    VectorStoreIndex,
    SummaryIndex,
    SimpleKeywordTableIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.schema import IndexNode
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.llms import OpenAI
Here is the list of cities:
wiki_titles = [
    "Toronto", "Seattle", "Chicago", "Boston", "Houston",
    "Tokyo", "Berlin", "Lisbon", "Paris", "London",
    "Atlanta", "Munich", "Shanghai", "Beijing", "Copenhagen",
    "Moscow", "Cairo", "Karachi",
]
Next is the code to download each city document:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)
Load the downloaded documents.
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
Define the LLM and service context.
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)
We define a “document agent” for each document: a vector index (for semantic search) and a summary index (for summarization) over that document. Both query engines are then wrapped as tools for the OpenAI function-calling API.
Document agents can dynamically choose to perform semantic search or summarization in a given document. We create a separate document agent for each city.
from llama_index.agent import OpenAIAgent
from llama_index import load_index_from_storage, StorageContext
from llama_index.node_parser import SimpleNodeParser
import os
node_parser = SimpleNodeParser.from_defaults()
# Build agents dictionary
agents = {}
query_engines = {}

# this is for the baseline
all_nodes = []
for idx, wiki_title in enumerate(wiki_titles):
    nodes = node_parser.get_nodes_from_documents(city_docs[wiki_title])
    all_nodes.extend(nodes)

    if not os.path.exists(f"./data/{wiki_title}"):
        # build vector index
        vector_index = VectorStoreIndex(nodes, service_context=service_context)
        vector_index.storage_context.persist(
            persist_dir=f"./data/{wiki_title}"
        )
    else:
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(persist_dir=f"./data/{wiki_title}"),
            service_context=service_context,
        )

    # build summary index
    summary_index = SummaryIndex(nodes, service_context=service_context)

    # define query engines
    vector_query_engine = vector_index.as_query_engine()
    summary_query_engine = summary_index.as_query_engine()

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    "Useful for questions related to specific aspects of"
                    f" {wiki_title} (e.g. the history, arts and culture,"
                    " sports, demographics, or more)."
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=summary_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for any requests that require a holistic summary"
                    f" of EVERYTHING about {wiki_title}. For questions about"
                    " more specific sections, please use the vector_tool."
                ),
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-4")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
        system_prompt=f"""\
You are a specialized agent designed to answer queries about {wiki_title}.
You must ALWAYS use at least one of the tools provided when answering a question; do NOT rely on prior knowledge.\
""",
    )

    agents[wiki_title] = agent
    query_engines[wiki_title] = vector_index.as_query_engine(
        similarity_top_k=2
    )
Here is the top-level agent, which can orchestrate across different document agents to answer any user query.
The top-level agent uses all document agents as tools to perform retrieval. Here we use simple top-k retrieval over the tools, but in practice it is best to customize the retrieval strategy to our needs.
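To build intuition for what top-k tool retrieval does under the hood, here is a rough sketch using cosine similarity over toy 2-d "embeddings" of tool descriptions. The vectors and tool names are hypothetical illustrations, not LlamaIndex's actual implementation:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_tools(query_vec, tool_vecs, top_k=3):
    # Rank tools by similarity of their description embedding to the query.
    ranked = sorted(
        tool_vecs,
        key=lambda name: cosine(query_vec, tool_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy embeddings for three document tools (real ones would come from an
# embedding model applied to each tool's description).
tool_vecs = {
    "tool_Boston": [1.0, 0.1],
    "tool_Tokyo": [0.1, 1.0],
    "tool_Paris": [0.7, 0.7],
}
query_vec = [0.9, 0.2]  # a query mostly "about Boston"
selected = retrieve_tools(query_vec, tool_vecs, top_k=2)
```

Only the selected tools are exposed to the top-level agent for a given query, which keeps its function-calling context small even as the number of documents grows.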
# define tool for each document agent
all_tools = []
for wiki_title in wiki_titles:
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        f" this tool if you want to answer any questions about {wiki_title}."
    )
    doc_tool = QueryEngineTool(
        query_engine=agents[wiki_title],
        metadata=ToolMetadata(
            name=f"tool_{wiki_title}",
            description=wiki_summary,
        ),
    )
    all_tools.append(doc_tool)
# define an "object" index and retriever over these tools
from llama_index import VectorStoreIndex
from llama_index.objects import ObjectIndex, SimpleToolNodeMapping

tool_mapping = SimpleToolNodeMapping.from_objects(all_tools)
obj_index = ObjectIndex.from_objects(
    all_tools,
    tool_mapping,
    VectorStoreIndex,
)
from llama_index.agent import FnRetrieverOpenAIAgent
top_agent = FnRetrieverOpenAIAgent.from_retriever(
    obj_index.as_retriever(similarity_top_k=3),
    system_prompt="""\
You are an agent designed to answer queries about a set of given cities.
Please always use the tools provided to answer a question. Do not rely on prior knowledge.\
""",
    verbose=True,
)
For comparison, we define a “simple” RAG pipeline that dumps all documents into a single vector index collection. Set top_k = 4.
base_index = VectorStoreIndex(all_nodes)
base_query_engine = base_index.as_query_engine(similarity_top_k=4)
Let’s run some sample queries, comparing single-document QA/summarization to multi-document QA/summarization.
response = top_agent.query("Tell me about the arts and culture in Boston")
The results are as follows:
=== Calling Function ===
Calling function: tool_Boston with args: {
  "input": "arts and culture"
}
=== Calling Function ===
Calling function: vector_tool with args: {
  "input": "arts and culture"
}
Got output: Boston is known for its vibrant arts and culture scene. The city is home to a number of performing arts organizations, including the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. There are also several theaters in or near the Theater District, such as the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston is a center for contemporary classical music, with groups like the Boston Modern Orchestra Project and Boston Musica Viva. The city also hosts major annual events, such as First Night, the Boston Early Music Festival, and the Boston Arts Festival. In addition, Boston has several art museums and galleries, including the Museum of Fine Arts, the Isabella Stewart Gardner Museum, and the Institute of Contemporary Art.
========================
Got output: Boston is renowned for its vibrant arts and culture scene. It is home to numerous performing arts organizations, including the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. The city's Theater District houses several theaters, such as the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston is also a hub for contemporary classical music, with groups like the Boston Modern Orchestra Project and Boston Musica Viva. The city hosts major annual events, such as First Night, the Boston Early Music Festival, and the Boston Arts Festival, which contribute to its cultural richness. In terms of visual arts, Boston boasts several art museums and galleries. The Museum of Fine Arts, the Isabella Stewart Gardner Museum, and the Institute of Contemporary Art are among the most notable. These institutions offer a wide range of art collections, from ancient to contemporary, attracting art enthusiasts from around the world.
========================
Now let’s take a look at the results from the simple RAG pipeline above:
# baseline
response = base_query_engine.query(
    "Tell me about the arts and culture in Boston"
)
print(str(response))
Boston has a rich arts and culture scene. The city is home to a variety of performing arts organizations, such as the Boston Ballet, Boston Lyric Opera Company, Opera Boston, Boston Baroque, and the Handel and Haydn Society. Additionally, there are numerous contemporary classical music groups associated with the city's conservatories and universities, like the Boston Modern Orchestra Project and Boston Musica Viva. The Theater District in Boston is a hub for theater, with notable venues including the Cutler Majestic Theatre, Citi Performing Arts Center, the Colonial Theater, and the Orpheum Theatre. Boston also hosts several significant annual events, including First Night, the Boston Early Music Festival, the Boston Arts Festival, and the Boston gay pride parade and festival. The city is renowned for its historic sites connected to the American Revolution, as well as its art museums and galleries, such as the Museum of Fine Arts, Isabella Stewart Gardner Museum, and the Institute of Contemporary Art.
As we can see, the multi-agent system we built produces a noticeably more focused and detailed response than the baseline.
Conclusion
RAG systems must evolve to a multi-agent architecture to achieve enterprise-level performance. As this example illustrates, dividing responsibilities yields benefits in relevance, speed, summary quality, and prompt optimization. By decomposing RAG into specialized collaborative roles, multi-agent systems can overcome the limitations of single agents and enable scalable, high-performance RAG.
Editor: Yu Tengkai
