

DeepSeek has surged in popularity, particularly its R1 model, which benchmarks against OpenAI's o1; its lower barrier to use and deep reasoning capabilities have accelerated the adoption of AI applications. However, compared with the enthusiasm on the consumer side, its productivity applications in the commercial sector deserve a more objective and scientific examination.
In the previous article, we discussed the application of DeepSeek-R1 in RAG and explored an experimental architecture: using the CoT reasoning chain output by R1 to implement RAT (Retrieval-Augmented Thinking), generating deeper and more complete RAG outputs (DeepSeek-R1 + LlamaIndex: A New Way to Apply Reasoning Models in RAG/RAT).
However, you might now be curious: since DeepSeek-R1 excels at deep "thinking", does it perform better in Agent applications that demand more autonomous planning ability? This article explores that question using the most common ReAct-paradigm Agent, and the results may overturn your expectations.
- Review: What is a ReAct Agent?
- What reasoning capabilities does a ReAct Agent require?
- Building a ReAct Agent that supports DeepSeek-R1 (based on LangGraph)
- Evaluation of ReAct Agents with Different Models
- Thoughts and Summary
What is a ReAct Agent?
ReAct (Reasoning + Acting) is a paradigm of AI agents that combines reasoning and action. The core idea is to let AI think and reason first, then take action based on that, such as calling external tools; this process loops multiple times until the task is completed.
In simple terms:
Equip the Agent with tools and a brain (LLM), allowing it to think for itself and complete tasks using those tools.
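Conceptually, the loop looks roughly like the pseudocode below (illustrative only; `llm.reason` and the tool lookup are hypothetical helpers, not the implementation used later in this article):

```python
def react_loop(task, llm, tools, max_steps=5):
    history = [task]
    for _ in range(max_steps):
        # reason: ask the LLM for a "Thought / Action / Action Input" step
        thought, action, action_input = llm.reason(history)
        if action is None:                          # enough information to answer
            return thought
        observation = tools[action](action_input)   # act: call the chosen tool
        history.append(observation)                 # observe, then reason again
    return "Task not completed within the step limit"
```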
Thus, the characteristics of a ReAct Agent are:
- Autonomous Planning Ability: Independently reason through multiple sub-task steps and dynamically adjust during the process.
- Tool Usage: Support interaction with APIs, databases, search engines, etc., to complete tasks.
- Higher Flexibility: Compared to fixed-process workflows, tasks are handled more flexibly.
- Inherent Unpredictability: The uncertainty of LLM outputs may introduce a degree of unpredictability.


In "Building Effective Agents", Anthropic divides agent systems into workflows with predetermined processes and autonomously planning agents. ReAct Agents clearly belong to the latter; see: AI Agent Workflow's 5 Basic Patterns and Their General Implementations (PydanticAI) [previous article].

What Reasoning Capabilities Does a ReAct Agent Require?
A ReAct Agent is a highly LLM-dependent agent, requiring reasoning capabilities including:
* Planning and Reasoning for Multi-step Tasks
- Reasoning out sub-tasks based on the input task and the available tool information, and arranging them in a logical order.
- Understanding the reasoning history and observed tool results, planning dynamically, adjusting decisions, and avoiding redundant steps.
For example, consider this task:
"Understand the online reviews of the company's new product; if they are mostly negative, send an email to the product manager."
This task requires the LLM to plan out steps like the following dynamically, since the online review results cannot be known in advance:
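A plausible plan, sketched here with the simulated tools built later in this article (the exact wording and parameters are illustrative):

```
1. network_search: search for online reviews of the new product
2. document_summarizer: summarize the overall sentiment of the reviews
3. (only if the reviews are mostly negative) email: notify the product manager
```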
* Reasoning for Tool Usage
Being able to reason about which tools are needed for each sub-task and their structured input parameters. This is typically achieved through the LLM's Function Calling feature or through prompt engineering.
* Assisting in Sub-task Execution
Besides reasoning task steps, LLM can also assist in completing specific tasks, such as generating outlines for papers in a research agent.
We focus on the first two reasoning capabilities, as they are the core of the Agent’s autonomy.
Building a ReAct Agent Supporting DeepSeek-R1
A typical graph representation of a ReAct Agent is as follows:
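Roughly, the model node and the tool node call each other in a loop until no further tool call is needed:

```
__start__ → call_model ─(tool call needed)→ tools → call_model → ...
            call_model ─(no tool call)────→ __end__
```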
The quickest way to build a ReAct Agent in LangGraph is to use the officially provided prebuilt function:
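For reference, a minimal sketch of that prebuilt approach (assuming a model that supports Function Calling, with one example tool):

```python
# Minimal sketch of LangGraph's prebuilt ReAct agent (works only with function-calling models)
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def calculator(query: str) -> str:
    """Evaluate a basic arithmetic expression such as 1+2*3."""
    return str(eval(query))  # simulated tool, for illustration only

llm = ChatOpenAI(model="gpt-4o-mini")
graph = create_react_agent(llm, tools=[calculator])
```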
However, this method is not suitable for us because it has a strict requirement:
The large model must support Function Calling, because the internal implementation relies on the LLM's Function Calling feature to reason out the next step and the tool to call. Currently, reasoning models do not support Function Calling.
Therefore, we must implement this Agent ourselves with LangGraph, using prompt engineering to ask the LLM to reason about task steps. Here we only show the core process:
1. Prompt Design
The prompt for the ReAct paradigm is as follows:
# ReAct Prompt Format
REACT_PROMPT = '''
You are designed to help complete various input tasks, including answering questions, content creation, automation, etc.
## Tools
You can use various tools and need to decide the order of using them to complete the current task.
This may require breaking the task into multiple sub-tasks and using different tools to complete each sub-task.
You can use the following tools:
{tools_desc}
## Output Format
If you need to use a tool to complete a sub-task, please output according to the following format:
```
Thought: I need to use a tool to help answer this question.
Action: Tool Name (choose one from {tool_names}), if a tool is needed.
Action Input: Pass the input to the tool, using JSON format to represent parameters (e.g.: {{"query": "Hello"}}).
```
Note:
* Always start with "Thought".
* Do not wrap your whole response in Markdown code marks; if needed, you may use code marks appropriately within your answer.
* Please use valid JSON format as Action Input. Do not use incorrect formats like {{'input': 'Hello World', 'num_beams': 5}}.
* A response can only use one tool to complete one sub-task at most. Do not have multiple Actions in one response.
If you have enough information to output the final answer, you must use one of the following two formats:
```
Thought: I can answer the question without using more tools.
Answer: [Your answer (in the same language as the user's question)]
```
```
Thought: I cannot answer the question using the provided tools.
Answer: [Your answer (in the same language as the user's question)]
```
Note:
* Ensure that all sub-tasks are completed before outputting the final answer.
* The final answer should cover all aspects of the user's task as much as possible.
## Current Dialogue
Below is the current dialogue history, alternating between human and AI messages.
'''
2. Preparing Tools
Prepare several simple simulated tools for the Agent to use:
- Network search (network_search)
- Calculator (calculator)
- Document summarizer (document_summarizer)
- Email sending (email)
The code is roughly as follows (internal logic omitted):
# Define 4 simulated tools
from langchain_core.tools import tool
from langchain_community.tools.tavily_search import TavilySearchResults

@tool
def network_search(query: str) -> str:
    """
    Used for performing a network search and returning search results
    Args:
        query (str): The keyword to search on the web
    """
    print_colored(f'Calling tool: network_search, Input: {query}', 'green')
    try:
        # Call TavilySearchResults to perform the search, simulating the return of results
        results = TavilySearchResults(max_results=2).invoke({"query": query})
        return f"Search results: {results}"
    except Exception as e:
        return f"Search error: {str(e)}"

@tool
def calculator(query: str) -> str:
    """
    Used to perform basic arithmetic calculations
    Args:
        query (str): The expression to calculate, such as 1+2*3
    """
    ...

@tool
def document_summarizer(text: str) -> str:
    """
    Used to extract and summarize the core content of a text, forming a summary
    Args:
        text (str): The text content to summarize
    """
    ...

@tool
def email(recipient: str, subject: str, body: str) -> str:
    """
    Used for sending emails
    Args:
        recipient (str): The recipient's email address
        subject (str): The email subject
        body (str): The email body
    """
    ...
3. Implementing Core Reasoning Logic (Using LLM)
This is the core part of the entire process, primarily responsible for calling the LLM for task reasoning:
- Use the prompt and dialogue history to let the LLM reason about the next action.
- If a tool needs to be called, produce structured tool-calling information.
- If no further tool call is needed, output the answer given by the LLM.
- If the maximum iteration count is reached and the task is still not completed, report an error.
Implementation is as follows (the llm in the code will be swapped for different models such as DeepSeek-R1 in the later tests):
...
llm = ChatOpenAI(model='gpt-4o-mini')
tools = [network_search, calculator, document_summarizer, email]

async def call_model(
    state: State, config: RunnableConfig
) -> Dict[str, List[AIMessage]]:
    # Generate tool descriptions
    tools_desc = "\n".join([
        f"- {tool.name}: {tool.description}\n"
        for tool in tools
    ])
    tool_names = [tool.name for tool in tools]

    # Generate the ReAct prompt
    system_prompt = REACT_PROMPT.format(
        tools_desc=tools_desc,
        tool_names=tool_names
    )

    # This step needs to be adjusted or commented out depending on the model
    state.messages = [
        HumanMessage(content=msg.content)
        if "ToolMessage" in msg.__class__.__name__ else msg
        for msg in state.messages
    ]

    # Call the LLM with the prompt (system_prompt) and the dialogue history (messages)
    response = await llm.ainvoke([SystemMessage(content=system_prompt)] + state.messages)
    content = response.content
    print_colored("\n===========================Reasoning================================", 'magenta')
    print_colored(f'{content}', 'magenta')
    print_colored("=========================Reasoning End================================\n", 'magenta')

    if "Action:" in content and "Action Input:" in content:
        # Extract tool-calling information
        action_lines = [line for line in content.split('\n')
                        if line.startswith('Action:') or line.startswith('Action Input:')]
        tool_name = action_lines[0].replace('Action:', '').strip()
        tool_input = action_lines[1].replace('Action Input:', '').replace("'", '"').strip()
        tool_input = tool_input.replace('\\', '\\\\')
        response.tool_calls = [{
            "id": "call_1",
            "type": "function",
            "name": tool_name,
            "args": json.loads(tool_input)
        }]

    if state.is_last_step and response.tool_calls:
        return {
            "messages": [
                AIMessage(
                    content="Sorry, I couldn't find the answer to the question within the specified number of steps."
                )
            ]
        }

    return {"messages": [response]}
4. Creating the Graph
Once the core reasoning nodes are completed, we can create this Graph (tool calls can use the official ToolNode component):
@dataclass
class InputState:
    messages: Annotated[Sequence[AnyMessage], add_messages] = field(
        default_factory=list
    )

@dataclass
class State(InputState):
    is_last_step: IsLastStep = field(default=False)

builder = StateGraph(State, input=InputState)
builder.add_node(call_model)
builder.add_node("tools", ToolNode(tools))
builder.add_edge("__start__", "call_model")

# Decide the next node based on the model's output; if there is a tool call, route to "tools"
def route_model_output(state: State) -> Literal["__end__", "tools"]:
    last_message = state.messages[-1]
    if not isinstance(last_message, AIMessage):
        raise ValueError(
            f"Expected AIMessage in output edges, but got {type(last_message).__name__}"
        )
    if not last_message.tool_calls:
        return "__end__"
    return "tools"

# Add conditional edges
builder.add_conditional_edges(
    "call_model",
    route_model_output,
)
builder.add_edge("tools", "call_model")

# Compile
graph = builder.compile()
graph.name = "ReAct Agent"
5. Validating Usability
Now you can use the graph.invoke method to feed tasks to this ReAct Agent and test it. As expected, it should be able to understand your natural-language task and use the four existing tools (search, calculator, summarizer, email) to complete it. For better observation, we add some tracing output to show the LLM's reasoning process.
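For reference, a minimal invocation might look like this (the task text is just an example):

```python
# Minimal invocation sketch: feed a task to the compiled graph and print the final answer
import asyncio
from langchain_core.messages import HumanMessage

async def main():
    result = await graph.ainvoke(
        {"messages": [HumanMessage(content="Search for the latest box office data of 'Nezha 2'")]}
    )
    # The last AIMessage carries the final "Answer: ..." produced by the ReAct loop
    print(result["messages"][-1].content)

asyncio.run(main())
```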
Let's first look at a simple task (model: gpt-4o-mini). In the first reasoning round, the LLM decides it needs to use the search tool; in the second round, having observed the tool's result (not printed here), it answers the question correctly.
Evaluation of ReAct Agents with Different Models
Now we use tasks of varying difficulty to examine the performance of reasoning models like DeepSeek-R1 in multi-task step reasoning within ReAct Agents, comparing them with general models.
Test Tasks (from Simple to Complex):
- "Search for the latest box office data of 'Nezha 2'"
- "Search for detailed plot information of 'Nezha 2' and send it to [email protected]"
- "Search for the latest box office of 'Nezha 2'; if it exceeds 5 billion, then help me search for the release date of 'Nezha 3'; if it does not exceed 5 billion, then write an encouraging text. Finally, send the result to [email protected]"
Comparison Models:
- gpt-4o-mini
- doubao-1.5-32k
- deepseek-v3 (official)
- deepseek-r1 (official)
- deepseek-r1:7b (open source)
- qwen-2.5:7b (open source)
Testing Method:
- Switch between models without changing any other code or the prompts (see the sketch after this list).
- Input the three test tasks in turn, run each task multiple times, and observe the results.
- Keep each task independent, with no memory of previous task history.
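Swapping the model only means pointing llm at a different backend; a sketch follows (the endpoints and model names are examples, substitute your own deployments):

```python
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

llm = ChatOpenAI(model="gpt-4o-mini")                                                # OpenAI
# llm = ChatOpenAI(model="deepseek-chat", base_url="https://api.deepseek.com")       # deepseek-v3
# llm = ChatOpenAI(model="deepseek-reasoner", base_url="https://api.deepseek.com")   # deepseek-r1
# llm = ChatOllama(model="deepseek-r1:7b")                                           # local open-source model via Ollama
```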
Test Results:
| Model | Task One | Task Two | Task Three |
| --- | --- | --- | --- |
| gpt-4o-mini | yes | yes | yes |
| doubao-1.5-32k | yes | yes | yes |
| deepseek-v3 | yes | yes | yes |
| deepseek-r1 | yes | yes* | no |
| deepseek-r1:7b | yes | no* | no |
| qwen-2.5:7b | yes | yes | no |
- yes: all runs succeeded, and the reasoning output met expectations.
- no: all runs failed; the expected results were not obtained, or exceptions occurred.
- yes*: mostly successful, occasionally failed; or the results were correct but the reasoning process did not meet expectations.
- no*: mostly failed, occasionally succeeded.
Main Issues Behind the DeepSeek-R1 Failures:
1. It does not follow the step-by-step reasoning instructions: it returns a complete task process containing multiple Thought-Action steps in a single response (even if that process may itself be correct).
2. Hypothetical (hallucinated) reasoning: after reasoning about the first sub-task, it sometimes "assumes" a tool-call result on its own and reasons about subsequent steps based on that assumption (observed with the r1:7b model).
3. It omits task steps: for instance, in one run of task 3, it never reasoned about or executed the "send email" step, yet still produced a seemingly perfect final output.
4. Its output format does not consistently follow the instructions: for example, it sometimes adds a "json" tag on its own; such formatting issues can often be tolerated in code, but they may also cause outright exceptions.
By contrast, in a successful run of the most complex task (task 3), the reasoning proceeds step by step: search for the box office, judge the condition, then either search for the release date or write the encouraging text, and finally send the email.
Thoughts and Summary
Although intuition suggests that reasoning models like DeepSeek-R1, with their built-in CoT (chain of thought), should excel at autonomous multi-step task planning and reasoning, the reality is quite the opposite:
Current reasoning models (DeepSeek-R1, OpenAI's o1, etc.) are still not sufficiently usable for the autonomous, multi-step, dynamic task planning that Agents require.
Below are some specific thoughts and summaries:
1. Models like o1 and R1 do excel at step-by-step deep reasoning, but the input task should be direct: the complexity should lie in the problem itself, not in the prompts and context. For example, they are well suited to tasks such as:
- Generating complex code based on requirements.
- Solving a complex math problem.
- Conducting scientific calculations and research.
- Generating in-depth papers based on topics.
However, they currently do not excel at the following types of tasks:
- Reasoning tasks based on complex contexts and dialogue history.
- Tasks with strict instruction-following and structured-output requirements.
- Agent tasks that rely on Function Calling capabilities.
Structured output can be improved with the help of an auxiliary LLM, and Function Calling can be approximated through prompts, but the first issue requires the model itself to evolve.
2. Specifically for AI Agent applications, especially ReAct Agents, which rely heavily on the LLM's multi-step task reasoning capability, we recommend:
- General models are currently better suited to dynamic planning and reasoning of task steps, such as GPT-4o and DeepSeek-V3; and the larger the model, the more stable the behavior.
- Reasoning models, including the full-scale DeepSeek-R1 and distilled versions such as 1.5b, are not recommended for task-step reasoning in Agents; they can, however, serve as a tool that completes a specific sub-task, for example writing a detailed report based on search results (a sketch follows below).
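As a sketch of that second recommendation (the model name and endpoint are examples), a reasoning model can be wrapped as a tool that the general planner model calls for a single sub-task:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

# Example endpoint/model name; adjust to your own deployment
reasoning_llm = ChatOpenAI(model="deepseek-reasoner", base_url="https://api.deepseek.com")

@tool
def deep_report(material: str) -> str:
    """Write a detailed, well-structured report based on the collected material."""
    return reasoning_llm.invoke(
        f"Write a detailed report based on the following material:\n{material}"
    ).content
```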
3. Another issue with applying reasoning models in Agents is response latency: because of their deep-thinking nature, their response times are generally longer, which limits their use in enterprise settings.
Some of the conclusions above also corroborate the "capability deficiencies" acknowledged in DeepSeek-R1's official technical report, including the lack of Function Calling support, weaker multi-turn dialogue, weak structured output, and sensitivity to prompts and complex contexts. We look forward to breakthroughs on these fronts in the next version of R1!
* The full source code in the article can be requested later