Overview of Querying Process in LlamaIndex

Explanation

Querying is the most important part of an LLM application. In LlamaIndex, once you have completed data loading, index building, and index storage, you can move on to this most crucial stage: querying.

At its simplest, a query is just a prompt call to a large language model: it can be a question to answer, a request for a summary, or a more complex instruction. Complex queries may involve repeated or chained prompt-plus-LLM calls, or even reasoning loops across multiple components.

Simple Example

The foundation of all queries is the QueryEngine. The simplest way to obtain a QueryEngine is to create it through the index, as shown below:

query_engine = index.as_query_engine()
response = query_engine.query(
    "Write an email to the user given their background information."
)
print(response)
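
In addition to the answer text, the response object carries the source nodes that were retrieved to produce it. A minimal sketch of inspecting them, reusing the response from above (the 100-character preview length is arbitrary):

# Each source node pairs a retrieved chunk with its similarity score
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:100])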

Query Stages

There is more to a query than it initially appears. A query consists of three or four distinct stages:

  • Retrieval: finding and returning the documents from the index that are most relevant to your query. As discussed in the indexing section, the most common type of retrieval is “top-k” semantic retrieval, but there are many other retrieval strategies.

  • Postprocessing: optionally reranking, transforming, or filtering the retrieved nodes, for example requiring that they carry specific metadata such as keywords.

  • Response synthesis: combining your query, the most relevant retrieved data, and your prompt, and sending them to the large language model to generate a response.

  • Structured outputs: LlamaIndex provides various modules that let the large language model generate output in a structured format, which is crucial for downstream applications; a minimal sketch follows this list.
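
As a sketch of structured outputs, the snippet below asks the query engine to return a Pydantic object instead of free-form text. The Biography class and its fields are illustrative, and the example assumes an index built as shown earlier; the output_cls argument is forwarded to the response synthesizer:

from pydantic import BaseModel


class Biography(BaseModel):
    # Illustrative schema; define fields that match your use case
    name: str
    best_known_for: list[str]
    extra_info: str


query_engine = index.as_query_engine(
    response_mode="compact", output_cls=Biography
)
response = query_engine.query("Who is Paul Graham?")
print(response)  # prints the structured Biography object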

Custom Queries

LlamaIndex has a low-level composable API that gives you fine-grained control over queries. In this example, we customize the retriever to use a different top_k value and add a postprocessing step that requires retrieved nodes to meet a minimum similarity score in order to be included. This gives you plenty of data when there are relevant results, but may return no data at all when nothing is relevant.

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    get_response_synthesizer,
)
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Load the documents (the "data" directory is a placeholder path)
documents = SimpleDirectoryReader("data").load_data()

# Build the index
index = VectorStoreIndex.from_documents(documents)

# Configure the retriever to return the 10 most similar nodes
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# Configure the response synthesizer
response_synthesizer = get_response_synthesizer()

# Integrate the query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# Start querying
response = query_engine.query("What did the author do growing up?")
print(response)
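
To see the effect of the similarity cutoff on its own, you can run the retriever and the postprocessor as separate steps; a quick sketch reusing the objects defined above:

# Retrieve the raw top-10 nodes, then apply the cutoff separately
nodes = retriever.retrieve("What did the author do growing up?")
filtered_nodes = SimilarityPostprocessor(
    similarity_cutoff=0.7
).postprocess_nodes(nodes)

# With a high cutoff this may keep every node or drop them all
print(f"retrieved {len(nodes)}, kept {len(filtered_nodes)}")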

Summary

When you build a query engine from an index and query it, a series of operations runs under the hood: retrieval, postprocessing, response synthesis, and (optionally) structured output generation.
