LlamaIndex Practical Guide – Overview of Query Engine Usage

Overview

The Query Engine is a generic interface that allows you to query data. It accepts natural language queries and returns rich responses. It is typically (but not always) built on one or more indexes through a retriever. You can combine multiple query engines to achieve more advanced functionality.

Note: If you want to have a conversation with the data (multiple back-and-forths rather than a single Q&A), please check the ChatEngine related content.
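
For reference, a chat engine can be built from the same index; a minimal sketch using as_chat_engine and chat, where the follow-up question relies on the conversation memory:

chat_engine = index.as_chat_engine()
response = chat_engine.chat("Who is Paul Graham?")
follow_up = chat_engine.chat("What did he do growing up?")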

Usage Paradigm

Regular Usage

query_engine = index.as_query_engine()
response = query_engine.query("Who is Paul Graham?")

Streaming Output

With streaming enabled, the LLM returns content incrementally as it is generated, which improves the user experience.

query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("Who is Paul Graham?")
streaming_response.print_response_stream()

Usage Process

Getting Started

Build a query engine from the index:

query_engine = index.as_query_engine()

Ask questions about your data:

response = query_engine.query("Who is Paul Graham?")

Configuring the Query Engine

High-Level API

You can build and configure the query engine directly from the index in one line of code:

query_engine = index.as_query_engine(
    response_mode="tree_summarize",
    verbose=True,
)

The response mode (response_mode) options are explained later in this article.

Low-Level Composition API

If you need finer control, you can use the low-level composition API. Specifically, you explicitly construct a QueryEngine object instead of calling index.as_query_engine(…).

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

# Load documents (here from a local "data" directory) and build the index
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Configure Retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=2,
)

# Configure Response Synthesizer
response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize",
)

# Combine into Query Engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

# Query through Query Engine
response = query_engine.query("What did the author do growing up?")
print(response)

Streaming Response

To enable streaming, simply pass the streaming=True flag:

query_engine = index.as_query_engine(
    streaming=True,
)
streaming_response = query_engine.query(
    "What did the author do growing up?",
)
streaming_response.print_response_stream()
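
Alternatively, instead of print_response_stream(), the streamed text can be consumed manually via the streaming response's response_gen generator (note that the generator can only be consumed once); a minimal sketch:

streaming_response = query_engine.query(
    "What did the author do growing up?",
)
for text in streaming_response.response_gen:
    # handle each text delta as it arrives, e.g. forward it to a UI
    print(text, end="", flush=True)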

Response Mode (response_mode) Description

Currently, the following options are supported:

  1. refine

Creates and refines an answer by sequentially working through each retrieved text chunk. This makes a separate LLM call per node/retrieved chunk.

The detailed process is as follows: the first chunk is used in a query with the text_qa_template prompt. The resulting answer, the next chunk, and the original question are then used in another query with the refine_template prompt. This continues until all chunks have been processed.

If a chunk is too large to fit within the context window (accounting for the prompt size), it is split using a TokenTextSplitter (allowing some text overlap between splits), and the (new) additional chunks are treated as chunks of the original collection (and are therefore also queried with the refine_template).

This option is suitable for generating detailed answers.
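
To use this mode, select it when building the query engine; a minimal sketch (similarity_top_k here is an illustrative retriever setting, so with five retrieved chunks refine makes roughly five LLM calls):

# one text_qa_template call for the first chunk,
# then one refine_template call per remaining chunk
query_engine = index.as_query_engine(
    response_mode="refine",
    similarity_top_k=5,
)
response = query_engine.query("What did the author do growing up?")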

  2. compact

The compact response mode is similar to refine, but it first packs (concatenates) the chunks, resulting in fewer LLM calls.

The detailed process is as follows: as much text as possible is stuffed (concatenating/packing the retrieved chunks) into the context window, considering the maximum prompt sizes of text_qa_template and refine_template. If the text is too long to fit in one prompt, it is split into as many parts as needed (using a TokenTextSplitter, allowing some overlap between splits).

Each part is treated as a chunk and sent to the refine synthesizer. In short, it behaves like refine, but with fewer LLM calls.
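
Because compact packs chunks against the sizes of text_qa_template and refine_template, those prompts can also be overridden when building the synthesizer. A minimal sketch, assuming get_response_synthesizer accepts a text_qa_template override (the prompt wording itself is illustrative):

from llama_index.core import PromptTemplate, get_response_synthesizer

# illustrative QA prompt; {context_str} and {query_str} are the standard template variables
qa_prompt = PromptTemplate(
    "Context information is below.\n{context_str}\n"
    "Answer the question concisely: {query_str}\n"
)

response_synthesizer = get_response_synthesizer(
    response_mode="compact",
    text_qa_template=qa_prompt,
)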

  3. tree_summarize

Queries the LLM as many times as needed with the summary_template prompt so that every concatenated chunk has been queried; the resulting answers are themselves recursively treated as chunks in further tree_summarize LLM calls, and so on, until only one chunk remains, yielding a single final answer.

The detailed process is as follows: as many chunks as possible are concatenated to fit the context window under the summary_template prompt, splitting them where needed (again using a TokenTextSplitter with some text overlap). Each resulting chunk/split is then queried with summary_template (there are no refine queries!), producing one answer per chunk.

If there is only one answer (because there was only one chunk), it is the final answer.

If there are multiple answers, they are treated as chunks and recursively fed back into the tree_summarize process (concatenate / split to fit / query).

Overall, this option is suitable for summarization purposes.
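
This recursion can be captured in a small sketch. It is illustrative pseudologic rather than the library's internals: llm_summarize is a hypothetical callable standing in for one summary_template LLM call, and the packing loop stands in for the concatenate/split-to-fit step (max_chars is a toy window budget):

def tree_summarize(chunks, query, llm_summarize, max_chars=2000):
    # concatenate/split the chunks so each group fits the (toy) window budget
    groups, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > max_chars:
            groups.append(current)
            current = chunk
        else:
            current = f"{current}\n{chunk}" if current else chunk
    if current:
        groups.append(current)
    # one summary_template-style call per group; no refine queries
    answers = [llm_summarize(query, group) for group in groups]
    if len(answers) == 1:
        return answers[0]  # a single answer is the final answer
    # otherwise, treat the answers as new chunks and recurse
    return tree_summarize(answers, query, llm_summarize, max_chars)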

  4. simple_summarize: Truncates all text chunks so that they fit into a single LLM prompt. Good for quick summarization, but may lose detail due to truncation.

  5. no_text: Only runs the retriever to fetch the nodes that would have been sent to the LLM, without actually sending them. The retrieved nodes can then be inspected via response.source_nodes (see the sketch after this list).

  6. accumulate: Given a set of text chunks and a query, applies the query to each chunk separately while accumulating the responses into an array, then returns a concatenated string of all responses. Useful when you need to run the same query against each text chunk individually.

  7. compact_accumulate: The same as accumulate, but "compacts" each LLM prompt in the same way as compact, then runs the same query against each text chunk.
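
As an example of the retrieval-only mode, the sketch below builds a no_text engine and inspects the retrieved nodes; response.source_nodes carries the nodes (with similarity scores) that would otherwise have been sent to the LLM:

query_engine = index.as_query_engine(response_mode="no_text")
response = query_engine.query("What did the author do growing up?")

# no LLM call is made; inspect what would have been sent
for node_with_score in response.source_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:100])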

Conclusion

This article illustrates the usage paradigm of the Query Engine and explains the implementation logic of different response modes.
