Author: Tomaz Bratanic
Compiled by: HuoShui Intelligence
Retrieval-augmented generation (RAG) has become a mainstream technology, with ample reasons supporting its widespread application. It is a powerful framework that combines advanced large language models with targeted information retrieval techniques to achieve faster access to relevant data and generate more accurate, context-aware responses. Although RAG applications typically focus on unstructured data, I personally advocate for integrating structured data—a crucial yet often overlooked strategy. One of my favorite ways to do this is by utilizing graph databases, such as Neo4j.
Typically, the preferred method for retrieving data from a graph database is Text2Cypher, which automatically converts natural language queries into Cypher statements for querying the graph database. This technique relies on language models (or rule-based systems) to interpret user queries, infer their underlying intent, and translate them into valid Cypher queries, enabling RAG applications to retrieve relevant information from knowledge graphs and generate accurate answers.
Text2Cypher offers significant flexibility as it allows users to ask questions in natural language without needing to understand the underlying graph database schema or Cypher syntax. However, due to the nuances of language interpretation and the need for precise schema-specific details, its accuracy may still fall short, as demonstrated in the following Text2Cypher article.
The following visualization showcases the most important results from the benchmark tests:
At a high level, the benchmark tests compared three groups of models:
• Models fine-tuned for the Text2Cypher task
• Open foundational models
• Closed foundational models
The benchmark tests evaluated these models’ performance in generating correct Cypher queries using two metrics: Google BLEU (top image) and ExactMatch (bottom image).
• Google BLEU measures the n-gram overlap between the generated query and the reference query. A higher score generally indicates closer alignment with the reference query, but it does not guarantee that the query executes correctly against the database.
• ExactMatch represents the percentage of generated queries whose text exactly matches the reference query and which therefore produce the same results when executed. It is a stricter measure of correctness, directly tied to a query's practical utility in real-world scenarios.
Despite some encouraging results from fine-tuned models, the overall accuracy indicates that Text2Cypher remains an evolving technology. Some models still struggle to generate completely correct queries in every case, highlighting the need for further improvements.
In this article, we will attempt to implement a more intelligent Text2Cypher strategy using the LlamaIndex workflow. Unlike the typical single-query generation (the approach used by most benchmarks), we will try a multi-step approach that allows for retries or alternative query forms. By introducing these additional steps and fallback options, we aim to enhance overall accuracy and reduce the occurrence of erroneous Cypher generation.
The code is available on GitHub (https://github.com/tomasonjo-labs/text2cypher_llama_agent). We also provide a hosted version of the application. Thanks to Anej Gorkic for his contributions to the application and debugging assistance. 🙂

LlamaIndex Workflow
The LlamaIndex workflow is a practical approach that connects different operations through an event-driven system, organizing a multi-step AI processing procedure. It helps to break complex tasks into smaller, more manageable parts that can communicate with each other in a structured way. Each step in the workflow processes specific events and generates new events, creating a chain of operations to accomplish tasks such as document processing, Q&A, or content generation. The system automatically handles the coordination between steps, making it easier to build and maintain complex AI applications.
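To make this concrete, here is a minimal, self-contained sketch of a two-step workflow; the event and class names are illustrative, not taken from the article's repository:
import asyncio

from llama_index.core.workflow import (
    Context,
    Event,
    StartEvent,
    StopEvent,
    Workflow,
    step,
)

class QueryGeneratedEvent(Event):
    """Carries the intermediate result from the first step to the second."""
    query: str

class ToyWorkflow(Workflow):
    @step
    async def generate(self, ctx: Context, ev: StartEvent) -> QueryGeneratedEvent:
        # First step: turn the raw input into an intermediate artifact.
        return QueryGeneratedEvent(query=f"QUERY({ev.input})")

    @step
    async def execute(self, ctx: Context, ev: QueryGeneratedEvent) -> StopEvent:
        # Second step: consume the intermediate event and finish the run.
        return StopEvent(result=f"executed {ev.query}")

async def main():
    print(await ToyWorkflow(timeout=30).run(input="Who directed The Matrix?"))

asyncio.run(main())
Each step declares which event type it consumes and which it emits, and the workflow engine wires the steps together from those signatures.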
Simple Text2Cypher Process
The simple Text2Cypher architecture is a streamlined method for converting natural language questions into Cypher queries for the Neo4j graph database. It operates through the following three-stage workflow:
1. Generate a Cypher query for the input question using few-shot learning, with similar examples retrieved from a vector database.
2. Execute the generated Cypher query against the graph database.
3. Process the database results with a language model to produce a natural-language response that directly answers the original question.
This architecture maintains a simple yet efficient pipeline, utilizing vector similarity search (e.g., few-shot retrieval) and large language models for Cypher query generation and response formatting.
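For illustration, here is one way the few-shot retriever used in the steps below could be built with a LlamaIndex vector index; the example question/Cypher pairs are made up, and a configured embedding model (for example, an OpenAI key) is assumed:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# Hypothetical examples; a real setup would load many curated question/Cypher pairs.
examples = [
    ("Which movies did Tom Hanks act in?",
     "MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN m.title"),
    ("Who directed Inception?",
     "MATCH (p:Person)-[:DIRECTED]->(m:Movie {title: 'Inception'}) RETURN p.name"),
]
nodes = [
    TextNode(text=question, metadata={"cypher": cypher})
    for question, cypher in examples
]
index = VectorStoreIndex(nodes)
few_shot_retriever = index.as_retriever(similarity_top_k=5)
# At query time, the most similar pairs become few-shot examples in the prompt.
for n in few_shot_retriever.retrieve("What did Tom Hanks star in?"):
    print(n.node.text, "->", n.node.metadata["cypher"])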
Below is a visualization of the simple Text2Cypher workflow:

It is noteworthy that most Neo4j schema generation methods perform poorly when handling multi-label nodes. This is not only due to the added complexity but also because the combinatorial explosion of label combinations can overload the prompt. To mitigate this issue, we excluded the Actor and Director labels during the schema generation process.
The pipeline begins with the generate_cypher step:
@step
async def generate_cypher(self, ctx: Context, ev: StartEvent) -> ExecuteCypherEvent:
    question = ev.input
    # Cypher query generation using an LLM
    cypher_query = await generate_cypher_step(
        self.llm, question, self.few_shot_retriever
    )
    # Streaming event information to the web UI.
    ctx.write_event_to_stream(
        SseEvent(
            label="Cypher generation",
            message=f"Generated Cypher: {cypher_query}",
        )
    )
    # Return for the next step
    return ExecuteCypherEvent(question=question, cypher=cypher_query)
The generate_cypher step converts the natural language question into a Cypher query using a language model together with similar examples retrieved from vector storage. The step also streams the generated Cypher query back to the user interface in real time, giving instant feedback on the query generation process. You can view the complete code and prompts.
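The generate_cypher_step helper itself is not shown in the article. Below is a hedged sketch of what it might look like; the prompt wording, the GRAPH_SCHEMA constant, and the retriever interface are assumptions rather than the repository's exact implementation:
from llama_index.core.llms import LLM

# Placeholder; in the real app this would be derived from the Neo4j schema.
GRAPH_SCHEMA = "Node labels: Movie, Person, Genre. Relationships: ACTED_IN, DIRECTED, IN_GENRE."

GENERATE_CYPHER_PROMPT = """Task: Write a Cypher query that answers the question.
Schema:
{schema}
Examples:
{examples}
Question: {question}
Cypher:"""

async def generate_cypher_step(llm: LLM, question: str, few_shot_retriever) -> str:
    # Retrieve semantically similar question/Cypher pairs to use as examples.
    nodes = few_shot_retriever.retrieve(question)
    examples = "\n".join(
        f"Q: {n.node.text}\nCypher: {n.node.metadata['cypher']}" for n in nodes
    )
    prompt = GENERATE_CYPHER_PROMPT.format(
        schema=GRAPH_SCHEMA, examples=examples, question=question
    )
    response = await llm.acomplete(prompt)
    # Strip Markdown fences the model sometimes wraps around the query.
    return response.text.replace("```cypher", "").replace("```", "").strip()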
Simple Text2Cypher Process with Retry Mechanism
This enhanced version of the Text2Cypher process adds a self-correction mechanism to the original architecture. When a generated Cypher query fails to execute, the system does not immediately report an error; instead, it feeds the error information back to the language model via the CorrectCypherEvent step so the query can be corrected. This makes the system more resilient, able to recover from initial errors much as humans adjust their approach after receiving feedback.
Below is a visualization of the simple Text2Cypher workflow with a retry mechanism:

Below is an example of the execute_query step, which handles the ExecuteCypherEvent:
@step
async def execute_query(
    self, ctx: Context, ev: ExecuteCypherEvent
) -> SummarizeEvent | CorrectCypherEvent:
    # Get global var
    retries = await ctx.get("retries")
    try:
        database_output = str(graph_store.structured_query(ev.cypher))
    except Exception as e:
        database_output = str(e)
        # Retry
        if retries < self.max_retries:
            await ctx.set("retries", retries + 1)
            return CorrectCypherEvent(
                question=ev.question, cypher=ev.cypher, error=database_output
            )
    return SummarizeEvent(
        question=ev.question, cypher=ev.cypher, context=database_output
    )
The execute_query step first attempts to run the query and, if it succeeds, passes the results to the subsequent summarization step. If a problem arises, it does not give up immediately but checks whether any retries remain. If so, it sends the query along with the error information back to the correction step. This creates a more fault-tolerant system that can learn from its errors, much like how we adjust our approach after receiving feedback. You can view the complete code and prompts (https://github.com/tomasonjo-labs/text2cypher_llama_agent/blob/main/app/workflows/naive_text2cypher_retry.py).
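The graph_store object used by execute_query is assumed to be a LlamaIndex Neo4j property graph store, set up roughly along these lines (connection details are placeholders):
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
)
# structured_query runs an arbitrary Cypher statement and returns its records.
records = graph_store.structured_query(
    "MATCH (m:Movie) RETURN m.title AS title LIMIT 5"
)
print(records)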
Simple Text2Cypher Process with Retry and Evaluation Mechanism
Building on the simple Text2Cypher process with a retry mechanism, this enhanced version adds an evaluation stage to check whether the query results are sufficient to answer the user’s question. If the results are deemed inadequate, the system returns the query to the correction step with improvement suggestions. If the results are satisfactory, the process continues to the final summarization step. This additional validation layer further enhances the pipeline’s resilience, ensuring that users ultimately receive the most accurate and complete answers.
Below is a visualization of the simple Text2Cypher workflow with retry and evaluation mechanisms:

The additional evaluation step is implemented as follows:
@step
async def evaluate_context(
    self, ctx: Context, ev: EvaluateEvent
) -> SummarizeEvent | CorrectCypherEvent:
    # Get global var
    retries = await ctx.get("retries")
    evaluation = await evaluate_database_output_step(
        self.llm, ev.question, ev.cypher, ev.context
    )
    if retries < self.max_retries and not evaluation == "Ok":
        await ctx.set("retries", retries + 1)
        return CorrectCypherEvent(
            question=ev.question, cypher=ev.cypher, error=evaluation
        )
    return SummarizeEvent(
        question=ev.question, cypher=ev.cypher, context=ev.context
    )
The evaluate_context step is a simple check that determines whether the query results sufficiently answer the user's question. If the evaluation indicates that the results are inadequate and retries remain, it returns a CorrectCypherEvent to trigger further query improvement. Otherwise, it emits a SummarizeEvent, indicating that the results are suitable for final summarization.
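The evaluate_database_output_step helper is not shown in the article. Below is a hedged sketch of the idea, an LLM judge that either replies "Ok" or describes how the query should change; the prompt wording is an assumption:
from llama_index.core.llms import LLM

EVALUATE_PROMPT = """You are evaluating a Cypher query and its database output.
Question: {question}
Cypher: {cypher}
Database output: {context}
If the output is sufficient to answer the question, reply with exactly "Ok".
Otherwise, briefly describe how the Cypher query should be changed."""

async def evaluate_database_output_step(
    llm: LLM, question: str, cypher: str, context: str
) -> str:
    response = await llm.acomplete(
        EVALUATE_PROMPT.format(question=question, cypher=cypher, context=context)
    )
    return response.text.strip()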
Later, I realized that capturing instances of successfully self-correcting invalid Cypher statements is an excellent idea. These instances could serve as dynamic few-shot prompts for future Cypher generation. This approach not only enables the agent to self-repair but also allows for continuous self-learning and improvement over time.
@step
async def summarize_answer(self, ctx: Context, ev: SummarizeEvent) -> StopEvent:
    retries = await ctx.get("retries")
    # If retry was successful:
    if retries > 0 and check_ok(ev.evaluation):
        # print(f"Learned new example: {ev.question}, {ev.cypher}")
        # Store success retries to be used as fewshots!
        store_fewshot_example(ev.question, ev.cypher, self.llm.model)
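A store_fewshot_example helper could be as simple as inserting the corrected pair into the same vector index that backs the few-shot retriever; the index handling below is an assumption, not the repository's code:
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode

# Stand-in for the index behind few_shot_retriever; assumed, not the repo's code.
fewshot_index = VectorStoreIndex([])

def store_fewshot_example(question: str, cypher: str, model_name: str) -> None:
    node = TextNode(
        text=question,
        metadata={"cypher": cypher, "model": model_name},
    )
    # Newly learned pairs become retrievable as future few-shot examples.
    fewshot_index.insert_nodes([node])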
Iterative Planning Workflow
The final workflow is the most complex and, coincidentally, the one I originally set out to build most ambitiously. I kept the code so that you can learn from my explorations.
The iterative planning workflow takes a more sophisticated approach by introducing an iterative planning system. It does not generate a Cypher query directly; instead, it first creates a plan of sub-queries, validates each sub-query's Cypher statement before execution, and includes an information-checking mechanism. If the initial results are insufficient, it can modify the plan. The system can perform up to three iterations of information gathering, each time refining its approach based on previous results. This method creates a more comprehensive question-answering system that can handle complex queries by breaking them into manageable steps and validating information at each stage.
Below is a visualization of the iterative planning workflow:

Let’s take a look at the query planning prompts. At the beginning, I was very ambitious, expecting the language model to generate responses like:
class SubqueriesOutput(BaseModel):
    """Defines the output format for transforming a question into parallel-optimized retrieval steps."""

    plan: List[List[str]] = Field(
        description=(
            """A list of query groups where:
            - Each group (inner list) contains queries that can be executed in parallel
            - Groups are ordered by dependency (earlier groups must be executed before later ones)
            - Each query must be a specific information retrieval request
            - Split into multiple steps only if intermediate results return ≤25 values
            - No reasoning or comparison tasks, only data fetching queries"""
        )
    )
The output represents a structured plan for transforming complex questions into sequential and parallel query steps. Each step includes a group of queries that can be executed in parallel, with subsequent steps dependent on the results of previous ones. Queries are strictly for information retrieval, avoiding reasoning tasks, and split into smaller steps when necessary to manage result size. For example, the following plan first lists the movies of two actors in parallel, then finds the highest-grossing movie from the results of the first step.
plan = [
    # 2 steps in parallel
    [
        "List all movies made by Tom Hanks in the 2000s.",
        "List all movies made by Tom Cruise in the 2000s.",
    ],
    # Second step
    ["Find the highest profiting movie among winner of step 1"],
]
This idea is undoubtedly cool. It is a clever way to break down complex questions into smaller, actionable steps, even using parallelization to optimize retrieval. This sounds like a strategy that could genuinely accelerate the process. However, in practice, expecting language models to reliably execute this strategy is somewhat overly ambitious. While parallelization is theoretically efficient, it introduces a lot of complexity. Dependencies between steps, intermediate results, and maintaining logical consistency between parallel steps can easily lead even advanced models to make mistakes. Sequential execution, while less flashy, is currently more reliable and significantly reduces the cognitive load on the model.
Moreover, language models often perform poorly when dealing with structured tool outputs, such as list nesting, especially when dependencies between reasoning steps are involved. Here, I would be interested to see how much improvement could be achieved in these tasks by using prompts alone (without relying on tool outputs).
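For reference, one way to request such a plan as structured output in LlamaIndex is structured_predict; the prompt text below is an assumption, and SubqueriesOutput is the Pydantic model defined above:
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
plan_prompt = PromptTemplate(
    "Break the question into groups of information-retrieval sub-queries "
    "that can be executed against a movie graph.\nQuestion: {question}"
)
# SubqueriesOutput is the Pydantic model defined above.
output = llm.structured_predict(
    SubqueriesOutput,
    plan_prompt,
    question="Who made more movies in the 2000s, Tom Hanks or Tom Cruise?",
)
print(output.plan)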
View the code for the iterative planning workflow: https://github.com/tomasonjo-labs/text2cypher_llama_agent/blob/main/app/workflows/iterative_planner.py
Benchmark Testing
Creating a benchmark dataset to evaluate the Text2Cypher agents within the LlamaIndex workflow architecture feels like an exciting step forward.
We sought an alternative to traditional single Cypher execution metrics (such as ExactMatch), as these metrics often do not fully reflect the potential of workflows like iterative planning. In these workflows, optimizing queries and retrieving relevant information through multi-step processes renders single-step execution metrics insufficient.
Therefore, we chose Ragas' **Answer Relevancy** metric, which is more aligned with what we want to measure: an LLM judge compares each generated answer against the ground-truth answer. We prepared a custom dataset of about 50 samples, designed to avoid producing excessively large or overly detailed database outputs, since very large outputs can hinder the LLM judge's ability to assess relevancy effectively. Keeping the results concise ensures a fair comparison between single-step and multi-step workflows.
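As a rough illustration, scoring answer relevancy with Ragas looks roughly like this; the exact dataset fields vary between Ragas versions, and an LLM/embedding API key is assumed:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy

data = {
    "question": ["Who directed The Matrix?"],
    "answer": ["The Matrix was directed by Lana and Lilly Wachowski."],
    "contexts": [["The Wachowskis directed The Matrix (1999)."]],
}
# Each generated answer is judged for relevancy against its question.
result = evaluate(Dataset.from_dict(data), metrics=[answer_relevancy])
print(result)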
The results are as follows:

Claude 3.5 Sonnet, Deepseek-V3, and GPT-4o emerged as the top three models in terms of answer relevancy, each scoring over 0.80. NaiveText2CypherRetryCheckFlow generally produced the highest relevancy, while IterativePlanningFlow consistently ranked lower (dropping to a low of 0.163).
Although the OpenAI o1 model is quite accurate, it may not have made it to the top due to multiple timeouts (set at 90 seconds). Deepseek-V3, in particular, is promising, scoring high with relatively low latency. Overall, these results emphasize the importance of not only focusing on accuracy but also on stability and speed in real-world deployment scenarios.
Another table is provided for a clearer view of the improvements across different workflows:

Sonnet 3.5’s score steadily rose from NaiveText2CypherFlow’s 0.596 to NaiveText2CypherRetryFlow’s 0.616, then jumped significantly to NaiveText2CypherRetryCheckFlow’s 0.843. GPT-4o showed a similar overall pattern, slightly dropping from NaiveText2CypherFlow’s 0.622 to NaiveText2CypherRetryFlow’s 0.603, but then significantly rising to NaiveText2CypherRetryCheckFlow’s 0.837. These improvements indicate that adding retry mechanisms and final validation steps significantly enhances answer relevancy.
View the benchmark testing code: https://github.com/tomasonjo-labs/text2cypher_llama_agent/blob/main/benchmark/benchmark_gridsearch.ipynb.
Please note that benchmark results can vary by 5% or more between runs, so you may observe slightly different scores and a different best-performing model across runs.
Lessons Learned and Production Deployment
This has been a two-month project, and I have learned a lot during this process. One highlight is achieving an 84% relevancy in the testing benchmark, which is a significant accomplishment. However, does this mean you can achieve 84% accuracy in production? Probably not.
Production environments come with their own challenges: real-world data is often noisier, more diverse, and less structured than benchmark datasets. One point we have not yet discussed is that real applications and real users demand production-ready steps. It is not enough to achieve high accuracy on a controlled benchmark; the system must also be reliable, adaptable, and able to deliver consistent results under real-world conditions.
In these scenarios, you need to implement some type of guardrails to prevent irrelevant questions from passing through the Text2Cypher pipeline.

We have an example implementation of guardrails (https://github.com/tomasonjo-labs/text2cypher_llama_agent/blob/main/app/workflows/steps/iterative_planner/guardrails.py). Besides simply rerouting irrelevant questions, initial guard steps can also help educate users by guiding them on the types of questions they can ask, showcasing available tools, and demonstrating how to use them effectively.
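A guardrails step can be as simple as an LLM classification that decides whether the question belongs to the movie domain before any Cypher is generated. The sketch below, including the prompt and the routing labels, is an assumption rather than the repository's implementation:
from llama_index.core.llms import LLM

GUARDRAILS_PROMPT = """Decide whether the question below can be answered from a movie
database (movies, actors, directors, genres, ratings). Answer "movies" or "end".
Question: {question}"""

async def guardrails_step(llm: LLM, question: str) -> str:
    response = await llm.acomplete(GUARDRAILS_PROMPT.format(question=question))
    decision = response.text.strip().lower()
    # Anything that is not clearly about movies is routed out of the pipeline.
    return "movies" if "movies" in decision else "end"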
In the following example, we emphasize the importance of adding a process to map user input values to the database. This step is crucial for ensuring that the information users provide aligns with the database schema, allowing for accurate query execution and minimizing errors due to data mismatches or ambiguities.

This is an example where a user asks for "sci-fi movies." The problem is that the genre is stored in the database as "Sci-Fi," so the query returns no results.
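One way to handle this is to map user-supplied values to the values actually stored in the database before generating the final query. The Genre label and name property in this sketch are assumptions about the schema:
from typing import Optional

def map_genre(graph_store, user_value: str) -> Optional[str]:
    # Case-insensitive lookup of the canonical genre name stored in Neo4j.
    records = graph_store.structured_query(
        "MATCH (g:Genre) WHERE toLower(g.name) = toLower($value) RETURN g.name AS name",
        param_map={"value": user_value},
    )
    return records[0]["name"] if records else None

# map_genre(graph_store, "sci-fi") would then return "Sci-Fi".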
Often overlooked is the presence of null values. Null values are common in real-world data and must be considered, especially when performing sorting or similar operations. Failing to handle them properly can lead to unexpected results or errors.

In this example, we got back a random movie whose rating is null. To resolve this, the query needs an additional WHERE m.imdbRating IS NOT NULL clause.
There are also cases where missing information is not just a data issue but a schema limitation. For example, if we request Oscar-winning movies but the schema does not include any information about awards, the query will not return the desired results.

Since large language models are trained to please users, they may still produce a plausible-sounding answer that the data does not actually support. I am still unsure how best to handle such cases.
Finally, I want to mention the query planning aspect. I used the following query plan to answer the question:
Who made more movies in the 2000s, Tom Hanks or Tom Cruise? For the winner, find their highest-grossing movie.
The plan is as follows:
plan = [
    # 2 steps in parallel
    [
        "List all movies made by Tom Hanks in the 2000s.",
        "List all movies made by Tom Cruise in the 2000s.",
    ],
    # Second step
    ["Find the highest profiting movie among winner of step 1"],
]
It looks impressive, but the reality is that Cypher is very flexible, and GPT-4o can handle this problem in a single query.

I believe parallelization is redundant in this case. If you are dealing with truly complex types of questions, you can include a query planner, but keep in mind that many multi-hop questions can be efficiently handled with a single Cypher statement.
This example highlights a different issue: the final answer is ambiguous because the language model only received limited information, specifically Tom Cruise’s War of the Worlds. In this case, the reasoning has already been done in the database, so the language model does not need to handle that logic. However, language models tend to operate by default in this way, so providing complete context to ensure accurate and clear responses is crucial.
Finally, you also need to consider how to handle the issue of returning a large number of results.

In our implementation, we enforced a hard limit of 100 records on the results. While this helps manage the amount of data, it may still be excessive in some cases and could mislead the language model during reasoning.
Moreover, not all agents discussed in this article have conversational capabilities. You may need to add a question rewriting step at the beginning to make it conversational or include it as part of the guardrails. If you have a large graph schema that cannot be fully conveyed in the prompt, you need to design a dynamic system to retrieve relevant graph schemas.
Many considerations need to be taken into account in production environments!
Conclusion
Agents are very useful, but it is best to start simple and avoid getting bogged down in overly complex implementations from the outset. Focus on building a reliable benchmark to effectively evaluate and compare different architectures. In terms of tool outputs, consider minimizing usage or sticking to the simplest tools, as many agents struggle to handle tool outputs effectively and often require manual parsing.
Learning Resources
To learn more about knowledge graphs or graph databases, you can check out other articles from the public account:
- Neo4j + Milvus: A Powerful Combination for Building Stronger Graph RAG Knowledge Graphs
- Neo4j Graph RAG: One Python Package to Easily Handle RAG + Knowledge Graphs!
- Neo4j + LangChain: How to Build the Strongest RAG System Based on Knowledge Graphs?
- Using AI Large Models to Transform Any Text Corpus into Knowledge Graphs, Locally Operated
- Interpreting Graph RAG: Discovering Patterns from Large-Scale Documents, Finding Relationships, Faster and More Comprehensive Information!
- Using LLMs to Build Knowledge Graphs from Unstructured Text