Explanation
Querying is the most important part of an LLM application. In LlamaIndex, once you have loaded your data, built the index, and stored it, you can move on to this crucial step: querying.
At its simplest, a query is just a prompt call to the large language model: it can be a question that expects an answer, a request for a summary, or a more complex instruction. More complex queries may involve repeated or chained prompt and LLM calls, or even a reasoning loop across multiple components.
Simple Example
The foundation of all queries is the QueryEngine. The simplest way to obtain a QueryEngine is to create it through the index, as shown below:
query_engine = index.as_query_engine()
response = query_engine.query(
    "Write an email to the user given their background information."
)
print(response)
Query Stages
There is more to a query than first meets the eye. Querying consists of three or four distinct stages:
- Retrieval: finding and returning the documents from the index that are most relevant to your query. As discussed in the section on indexing, the most common kind of retrieval is "top-k" semantic retrieval, but there are many other retrieval strategies.
- Postprocessing: optionally reranking, transforming, or filtering the retrieved nodes, for example requiring that they carry specific metadata such as certain keywords.
- Response synthesis: combining your query, the most relevant retrieved data, and your prompt, and sending them to the large language model to produce a response.
- Structured outputs: LlamaIndex provides several modules that let the model produce output in a structured format, which is crucial for downstream applications; a minimal sketch follows this list.
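The first three stages are demonstrated in the custom query example further below. For structured outputs, here is a minimal sketch using a Pydantic class. The Biography class and its fields, and the local "data" directory, are hypothetical, and it assumes your configured LLM can produce structured output; passing output_cls to as_query_engine asks the response synthesizer to return an instance of that class rather than free-form text.
from pydantic import BaseModel
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Hypothetical schema describing the shape of the answer we want back
class Biography(BaseModel):
    name: str
    best_known_for: list[str]
    extra_info: str

# Assumes documents about a person live in a local "data" directory
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# output_cls asks the response synthesizer to return a Biography object
query_engine = index.as_query_engine(
    output_cls=Biography, response_mode="compact"
)
response = query_engine.query("Who is the author and what are they known for?")
print(response)  # structured Biography content instead of free-form text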
Custom Queries
LlamaIndex has a low-level, composable API that gives you fine-grained control over queries. In this example, we customize the retriever to use a different top_k value and add a postprocessing step that requires retrieved nodes to reach a minimum similarity score before they are included. This gives you a lot of data when you have relevant results, but potentially no data at all when nothing relevant is found; a short sketch after the example shows how to check this.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

# Load the documents (assumed to live in a local "data" directory) and build the index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Configure the retriever to return the 10 most similar nodes
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

# Configure the response synthesizer
response_synthesizer = get_response_synthesizer()

# Assemble the query engine, dropping nodes below a 0.7 similarity score
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

# Start querying
response = query_engine.query("What did the author do growing up?")
print(response)
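To see the effect of the similarity cutoff, you can inspect the source nodes attached to the response object; this sketch only reads standard response attributes (source_nodes and each node's score) and prints them.
# Nodes that survived the 0.7 cutoff are attached to the response with their scores
for node_with_score in response.source_nodes:
    print(node_with_score.score)

# If nothing scored above the cutoff, the list is simply empty
print(len(response.source_nodes))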
Summary
When you build a query engine from the index, each query runs through this series of stages: retrieval, postprocessing, response synthesis, and optionally structured outputs.