
If you are building a retrieval-augmented generation (RAG) application, you know how powerful it can be when it works well. But semantic embedding models are not magic. Most RAG implementations rely on semantic similarity as the sole retrieval mechanism, throwing every document into a vector database and applying the same retrieval logic to every query. This approach works for simple questions, but it often retrieves contextually irrelevant (yet semantically similar) documents. When detailed queries require precise answers, relying solely on semantic similarity can lead to confusion or incorrect responses.
The problem lies not with your model—but with your retrieval process.
Here, we will introduce a better approach: agentic hybrid search. By utilizing structured metadata and allowing the large language model (LLM) to choose the best retrieval operation for each query, you can transform your RAG application into a truly intelligent assistant. We will first discuss the core concepts and then illustrate how to transform a simple “credit card policy Q&A bot” into an agentic system that can dynamically adapt to user needs.
Say goodbye to one-size-fits-all retrieval and welcome a smarter RAG experience.
Why Your RAG Application Can’t Deliver
At its core, RAG connects LLMs to external knowledge. You index your documents, use vector search to retrieve semantically similar documents, and let the LLM generate responses based on those results. Sounds simple, right?
But simplicity can be a double-edged sword. While many developers focus on improving the knowledge base—enriching it with more documents or better embeddings—or fine-tuning their LLM prompts, the real bottleneck often lies in the retrieval process itself. Most RAG implementations rely on semantic similarity as a catch-all strategy. This approach often retrieves the wrong documents: either semantic similarity is simply the wrong method for the query, pulling in contextually irrelevant results, or it returns too many overlapping, redundant documents, diluting the quality of the responses. Without a smarter way to filter and prioritize results, detailed queries that hinge on subtle distinctions will continue to fail.
Imagine a QA bot tasked with answering specific questions, such as, “If I submit my Premium Card bill 10 days late, what will happen?” or “Does Bank A’s Basic Card offer purchase protection?” These queries require precise answers that depend on subtle differences between policies. Similarly, consider a support bot, such as one from Samsung, which offers a wide range of products from smartphones to refrigerators. Questions like “How do I reset my Galaxy S23?” need to retrieve model-specific instructions, while inquiries about refrigerator warranties require entirely different documents. Using simple vector search, the bot might pull in semantically relevant but contextually irrelevant documents, muddling responses or leading to hallucinations by mixing completely different products or use cases.
This problem exists regardless of how advanced your LLM or embeddings are. Developers often respond by fine-tuning models or adjusting prompts, but the real fix lies in improving retrieval before generation ever happens. Simple retrieval systems either retrieve too much—forcing the LLM to sift through irrelevant information, which clever prompting can only partially alleviate—or retrieve too little, leaving the LLM “flying blind” without the context needed to generate meaningful responses. By making retrieval smarter and more context-aware, hybrid search addresses both issues: it reduces irrelevant noise by limiting searches to relevant topics, and it ensures that retrieved documents contain the precise information the LLM needs. This greatly enhances the accuracy and reliability of your RAG application.
Solution: Agentic Hybrid Search
The solution is surprisingly simple yet transformative: combine structured, metadata-based hybrid search with the autonomous decision-making capabilities of large language models to achieve agentic hybrid search. This approach does not require a complete overhaul of your architecture or abandoning your existing investments; it builds on what you already have to unlock new intelligence and flexibility.
From Simple to Smart: A More Flexible Flow
A typical RAG application follows a simple flow: question → search → generate. The user’s question is passed to the retrieval engine—usually vector search—which retrieves the most semantically similar documents. These documents are then passed to the LLM to generate a response. This works well for simple queries but struggles when detailed retrieval strategies are needed.
Agentic hybrid search replaces this rigid process with a smarter, more flexible flow: question → analyze → search → generate. The LLM does not jump straight to retrieval; instead, it first analyzes the question to determine the best retrieval strategy. This flexibility allows the system to handle a more diverse range of use cases with higher accuracy.
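To make the “analyze” step concrete, here is a minimal sketch of one way to ask the LLM for a retrieval plan before any search runs. It is not the implementation used in the full example later in this post; the RetrievalPlan model, its fields, and the strategy names are illustrative assumptions.
from typing import List, Literal
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
### A hypothetical plan the LLM fills in before any search is executed
class RetrievalPlan(BaseModel):
    strategy: Literal["semantic", "metadata_filter", "multi_query"] = Field(
        description="Which retrieval strategy best fits the question."
    )
    queries: List[str] = Field(description="One or more search queries to run.")
planner = ChatOpenAI().with_structured_output(RetrievalPlan)
plan = planner.invoke("Which cards from Bank A offer purchase protection?")
### e.g., strategy='metadata_filter', queries=['purchase protection']
The rest of the system then executes whichever searches the plan calls for and hands the results to the LLM for generation.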
Unlocking Capabilities
With agentic hybrid search, your RAG applications become more powerful:
• Multiple Knowledge Bases — The LLM can dynamically decide which knowledge base to query based on the question. For example, a Q&A bot might pull general policy information from one database while extracting bank-specific FAQs from another (see the sketch after this list).
• Custom Search Queries — The LLM can craft custom search queries rather than relying solely on semantic similarity. For instance, a question like “Which cards from Bank A offer purchase protection?” might trigger a metadata-filtered search for cards tagged with “purchase protection.”
• Metadata Filters — By enriching documents with structured metadata (e.g., card name, bank name, sections, dates), you can achieve precise, targeted searches that avoid irrelevant results.
• Multiple Search Operations — Some questions may require breaking the query into sub-parts. For example, “What are the eligibility requirements and benefits of the Premium Card?” may involve one search about eligibility standards and another about benefits.
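As a rough sketch of the first capability (the collection names, tool names, and k value here are hypothetical, and Astra DB credentials are assumed to be configured in the environment), the agent can simply be handed one retrieval tool per knowledge base and left to decide which one to call:
from langchain_core.tools import tool
from langchain_openai import OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore
embeddings = OpenAIEmbeddings()
### Hypothetical collections: one for general policies, one for bank-specific FAQs
policy_store = AstraDBVectorStore(collection_name="card_policies", embedding=embeddings)
faq_store = AstraDBVectorStore(collection_name="bank_faqs", embedding=embeddings)
@tool
def search_policies(query: str) -> str:
    """Search general credit card policy documents."""
    return "\n".join(d.page_content for d in policy_store.similarity_search(query, k=4))
@tool
def search_faqs(query: str) -> str:
    """Search bank-specific FAQ documents."""
    return "\n".join(d.page_content for d in faq_store.similarity_search(query, k=4))
### Both tools can be passed to a tool-calling agent (as shown later in this post),
### and the LLM picks which knowledge base to query for each question.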
These capabilities expand the types of queries your application can handle. Your RAG application can now manage exploratory research, multi-step reasoning, and domain-specific tasks, not just simple fact-finding—while maintaining accuracy.
How It Works: Transforming a Credit Card Policy Q&A Bot
Let’s understand this through an example. Suppose you are building a bot to answer questions about credit card policies from multiple banks. Here’s what a simple implementation looks like:
The Naive Approach
Documents are indexed in a vector database, and the bot performs simple semantic searches to retrieve the most similar documents. Regardless of whether the user inquires about eligibility requirements, fees, or cancellation policies, the retrieval logic remains the same.
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_astradb import AstraDBVectorStore
llm = ChatOpenAI()
embeddings = OpenAIEmbeddings()
### Assumes Astra DB credentials are configured (e.g., via environment variables)
vectordb = AstraDBVectorStore(
    collection_name="knowledge_store",
    embedding=embeddings,
)
ANSWER_PROMPT = (
    "Use the information in the results to provide a concise answer to the original question.\n\n"
    "Original Question: {question}\n\n"
    "Vector Store Results:\n{context}\n\n"
)
retriever = vectordb.as_retriever()
### Construct the LLM execution chain, joining retrieved documents into a single context string
chain = (
    {
        "context": retriever | (lambda docs: "\n".join(c.page_content for c in docs)),
        "question": RunnablePassthrough(),
    }
    | ChatPromptTemplate.from_messages([ANSWER_PROMPT])
    | llm
)
What’s the result? For questions like *“How much is my annual membership fee?”*, the system might retrieve irrelevant card policies because embeddings prioritize broad similarity over specificity.
chain.invoke("How much is my annual membership fee?",)
### > Response: Your annual membership fee could be $250, $95, $695, or $325, depending on the specific plan or card you have chosen. Please refer to your specific card member agreement or plan details to confirm the exact amount of your annual membership fee.
The Agentic Approach
In the agentic hybrid search approach, we improve the system by:
Enriching Documents with Metadata — When indexing policies, we add structured metadata such as (a minimal indexing sketch follows this list):
• Card Name (“Premium Card”)
• Bank Name (“Bank A”)
• Policy Sections (“Fees”, “Rewards”, “Eligibility”)
• Effective Dates
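Below is a minimal sketch of what this enrichment could look like at indexing time, reusing the vectordb store from the naive example. The “card-type” key matches the metadata filter used by the retrieval tool below; the other field names and the sample text are illustrative assumptions.
from langchain_core.documents import Document
### Index policy chunks with structured metadata (field values are illustrative)
policy_chunks = [
    Document(
        page_content="The annual membership fee for the Gold Card is $325...",
        metadata={
            "card-type": "gold",
            "bank": "Bank A",
            "section": "Fees",
            "effective-date": "2024-01-01",
        },
    ),
]
vectordb.add_documents(policy_chunks)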
Using the LLM to Select Retrieval Operations — The bot does not blindly execute vector searches; it uses the context of the query to decide:
• Should it search for semantically similar policies?
• Should it filter by card or bank metadata?
• Should it issue multiple queries for specific policy sections?
Combining Responses from Multiple Searches — The bot intelligently combines results to generate precise and reliable answers.
Here’s an example of how it works:
Example Code
from typing import List, Literal
from pydantic import BaseModel, Field
from langchain_core.documents.base import Document
from langchain_core.tools import StructuredTool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import MessagesPlaceholder
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the following question concisely, using information retrieved from tools and provided user information."),
    ("system", "The following card types are relevant to the user: {cards}"),
    ("system", "Always use the provided tools to retrieve information needed to answer policy-related questions."),
    ("human", "{question}"),
    MessagesPlaceholder("agent_scratchpad"),
])
### First, we define the parameters for the search operation
class RetrieveInput(BaseModel):
    question: str = Field(description="The question to retrieve content for. It should be a simple question describing the starting point of the retrieval.")
    # `pages` (defined in the notebook) maps card type names such as "gold" and "platinum" to their source documents
    card_type: str = Field(description=f"Search for documents related to this card type. The value must be one of {list(pages.keys())}")
### Next, create a 'tool' that implements the search logic
def retrieve_policy(question: str, card_type: str) -> List[Document]:
    print(f"retrieve_policy(card_type: {card_type}, question: {question})")
    # Restrict the similarity search to documents whose metadata matches the requested card type
    retriever = vectordb.as_retriever(
        search_type="similarity",
        search_kwargs={"filter": {"card-type": card_type}},
    )
    return list(retriever.invoke(question))
policy_tool = StructuredTool.from_function(
    func=retrieve_policy,
    name="RetrievePolicy",
    description="Retrieve information about specific card policies.",
    args_schema=RetrieveInput,
    return_direct=False,
)
### Finally, build an agent to use the tool we created
agent = create_tool_calling_agent(llm, [policy_tool], prompt)
agent_executor = AgentExecutor(agent=agent, tools=[policy_tool], verbose=True)
In this example, the bot recognizes the query is very specific and uses a metadata filter to retrieve the exact policies based on the provided user profile. Additionally, the LLM rewrites the user’s question to focus on the information needed to retrieve relevant documents.
agent_executor.invoke({
    "question": "How much is my annual fee?",
    "cards": ["gold"],
})
### > Agent: Invoking: `RetrievePolicy` with `{'question': 'annual membership fee', 'card_type': 'gold'}`
### > Response: Your annual fee could be $250, $95, $695, or $325, depending on the specific plan or card you have chosen. Please refer to your specific card member agreement or plan details to confirm the exact amount of your annual fee.
Since the LLM decides how to use the search tools, we are not limited to using the same filter for every question. For example, the LLM can dynamically identify that the user is asking about a policy different from their own and create the appropriate filter.
agent_executor.invoke({
    "question": "How much is the annual fee for the platinum card?",
    "cards": ["gold"],
})
### > Agent: Invoking: `RetrievePolicy` with `{'question': 'annual membership fee for platinum cards', 'card_type': 'platinum'}`
### > Response: The annual fee for the platinum card is $695. Additionally, each additional platinum card has an annual fee of $195, but the companion platinum card has no annual fee.
The LLM may even choose to use a tool multiple times. For example, the following question requires the LLM to understand both the user’s current policy and the policy mentioned in the question.
agent_executor.invoke({
    "question": "How much will my fee change if I upgrade to the platinum card?",
    "cards": ["gold"],
})
### > Agent: Invoking: `RetrievePolicy` with `{'question': 'membership fee for gold card', 'card_type': 'gold'}`
### > Agent: Invoking: `RetrievePolicy` with `{'question': 'membership fee for platinum card', 'card_type': 'platinum'}`
### > Response: Your current American Express® Gold Card annual fee is $325. If you upgrade to the platinum card, the fee will be $695. Therefore, upgrading from the gold card to the platinum card will increase your fee by $370.
Try the code in this notebook: Agentic_Retrieval.ipynb.
Why This Works
The magic lies in leveraging the LLM as the decision-maker. You are not hard-coding retrieval logic; instead, you let the LLM analyze the query and dynamically select the best approach. This flexibility makes your system smarter and more adaptable without needing large-scale changes to the infrastructure.
The Payoff: Smarter Retrieval, Better Responses
Implementing agentic hybrid search transforms your RAG application into a system capable of handling complex, nuanced queries. By introducing smarter retrieval, you can provide several key benefits:
• Increased Accuracy — Smarter retrieval ensures that each query surfaces the correct documents, reducing hallucinations and irrelevant results. This directly enhances the quality of LLM responses.
• Enhanced Trust — By retrieving only contextually appropriate information and avoiding awkward errors like confusing key details, you give users confidence in your system.
• Broader Use Cases — Dynamic search strategies enable your application to handle more complex queries, integrate multiple knowledge sources, and serve a wider range of users and scenarios.
• Simplified Maintenance — Instead of hard-coding retrieval rules or manually curating filters, you let the LLM dynamically adjust retrieval strategies, reducing the need for ongoing manual intervention.
• Future Scalability — As your datasets grow or your knowledge bases diversify, the agentic approach can scale to meet new challenges without fundamental changes to the system.
By making retrieval smarter and more adaptive, you enhance the overall performance of the system without needing major overhauls.
Trade-offs: Balancing Flexibility and Cost
Adding an autonomous layer to your retrieval process does come with some trade-offs:
• Increased Latency — Each query analysis involves additional LLM calls, and issuing multiple custom searches may take longer than a single operation. This could slightly delay response times, especially for latency-sensitive applications.
• Higher Inference Costs — Query analysis and coordinating multiple searches increase computational overhead, which could raise costs for systems with high query volumes.
• Orchestration Complexity — While implementation is relatively straightforward, maintaining a system that dynamically selects retrieval strategies may introduce additional debugging or testing considerations.
Despite these trade-offs, the benefits of agentic hybrid search often outweigh the costs. For most applications, the added flexibility and accuracy significantly enhance user satisfaction and system reliability, making the investment worthwhile. Additionally, latency and cost issues can often be mitigated through caching, pre-computed filters, or limiting agentic analysis to complex queries, as sketched below.
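Here is one rough sketch of that last mitigation, assuming the chain and agent_executor objects defined earlier in this post; the word-count heuristic is a deliberately naive placeholder for whatever routing logic fits your application:
### Route cheap, simple questions straight to the plain retrieval chain and
### reserve the agent for questions that need analysis (heuristic is a placeholder)
def answer(question: str, cards: List[str]) -> str:
    is_simple = len(question.split()) < 8 and "compare" not in question.lower()
    if is_simple:
        return chain.invoke(question).content
    return agent_executor.invoke({"question": question, "cards": cards})["output"]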
By understanding and managing these trade-offs, you can fully leverage the potential of agentic hybrid search to build smarter, more powerful RAG applications.
Conclusion
Agentic hybrid search is the key to unlocking the full potential of your RAG applications. By enriching your documents with structured metadata and allowing the LLM to intelligently decide on retrieval strategies, you can move beyond simple semantic similarity to build an assistant that users can truly rely on.
This is an easy-to-adopt change that brings unexpectedly rich rewards. Why not try it in your next project? Your users—and your future self—will thank you.
