In this tutorial, we will learn how to build an intelligent document retrieval system using LangGraph. The system extracts information from web pages, segments it intelligently, and answers questions precisely through query analysis and vector retrieval.
1. Install Dependencies
pip install beautifulsoup4 langchain langchain-openai langchain-community langchain-text-splitters langgraph
2. Import Necessary Libraries
import bs4
from typing import Literal
from typing_extensions import List, TypedDict, Annotated
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from langchain_core.prompts import PromptTemplate
3. Load Web Content
WebBaseLoader is a powerful web content loader provided by LangChain. Its workflow is as follows:

1. URL Fetching: use the urllib library to fetch the raw HTML from the specified URL
2. HTML Parsing: use the BeautifulSoup4 library to parse the HTML content
3. Content Filtering: customize the parsing rules through the bs_kwargs parameter
   - In our example, we use SoupStrainer("li") to extract only the list item content
   - This effectively filters out irrelevant content such as navigation bars and footers from the web page
loader = WebBaseLoader(
    web_paths=("https://github.com/jobbole/awesome-python-cn/blob/master/README.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer("li")
    ),
)
docs = loader.load()
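Before moving on, it helps to confirm that the filter captured what we expect. A minimal sanity check (output will vary with the live page content):
# WebBaseLoader returns one Document per URL
print(len(docs))
# Preview the extracted list-item text
print(docs[0].page_content[:300])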
4. Intelligent Document Segmentation
The text splitter uses a recursive strategy, with the following steps:

1. Initial Splitting: first attempt to split on the highest-level delimiters (such as newline characters and paragraph markers)
2. Recursive Processing: if the resulting chunks are still too large, continue splitting on secondary delimiters (such as periods and semicolons)
3. Overlap Processing:
   - chunk_overlap=200 means that each pair of adjacent chunks shares 200 characters
   - This overlap ensures continuity of context, preventing sentences from being abruptly cut off
   - For example, if an important concept spans two chunks, the overlap allows it to be fully captured during retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
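A quick look at the split results (actual counts depend on the page at load time):
# Inspect the chunking results
print(f"Split into {len(all_splits)} chunks")
print(f"First chunk is {len(all_splits[0].page_content)} characters long")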
5. Metadata Enhancement
To enable smarter retrieval, we add position-based metadata to the documents; this metadata lets us filter more precisely at query time:
total_documents = len(all_splits)
third = total_documents // 3
for i, document in enumerate(all_splits):
    if i < third:
        document.metadata["section"] = "beginning"
    elif i < 2 * third:
        document.metadata["section"] = "middle"
    else:
        document.metadata["section"] = "end"
By adding section metadata, we can:
- Perform targeted searches during retrieval
- Restrict a search to the beginning, middle, or end of the document
- Improve retrieval accuracy
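To verify the labeling, a quick tally of the sections (a minimal sketch; the counts shown are hypothetical):
from collections import Counter
# Count how many chunks landed in each section
print(Counter(doc.metadata["section"] for doc in all_splits))
# e.g. Counter({'end': 35, 'beginning': 34, 'middle': 34})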
6. Define Query Structure
Use TypedDict to define the data structure for queries to ensure standardization and maintainability:
class Search(TypedDict):
    """Search query."""

    query: Annotated[str, ..., "Search query to run."]
    section: Annotated[
        Literal["beginning", "middle", "end"],
        ...,
        "Section to query.",
    ]
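For a question such as "What libraries are recommended at the end?", a conforming query value would look like this (a hypothetical example, not produced by the model):
# A hypothetical structured query matching the Search schema
example_query: Search = {
    "query": "recommended Python libraries",
    "section": "end",
}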
7. Vector Store Setup
InMemoryVectorStore provides efficient vector storage and retrieval capabilities:
- Uses OpenAI's text-embedding-3-large model to convert text into high-dimensional vectors
- Each document chunk is converted into a unique vector representation
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = InMemoryVectorStore(embeddings)
_ = vector_store.add_documents(documents=all_splits)
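A quick retrieval check against the store (the query string is illustrative):
# Retrieve the two chunks most similar to an example query
for doc in vector_store.similarity_search("web frameworks", k=2):
    print(doc.metadata["section"], "->", doc.page_content[:80])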
8. Set Up Language Model and Prompt Template
The design of the prompt template considers the following key points:

1. Context Injection:
   - Provide the retrieved document content as context to the language model
   - Use {context} and {question} placeholders to dynamically insert content
2. Answer Constraints:
   - Limit answers to a maximum of three sentences to keep them concise
   - Explicitly instruct the model to say so when it does not know the answer, avoiding fabricated answers
   - Add a fixed closing phrase "thanks for asking!" to maintain a consistent interaction style
llm = ChatOpenAI(model="gpt-4o-mini")
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
prompt = PromptTemplate.from_template(template)
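To see exactly what the model will receive, the template can be rendered with placeholder values (a minimal sketch):
# Render the template to inspect the final prompt text
preview = prompt.invoke({"context": "<retrieved chunks>", "question": "<user question>"})
print(preview.to_string())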
9. Build Processing Flow
LangGraph provides a flexible workflow management system that allows us to break complex processing flows into multiple independent steps and coordinate the data flow between these steps through state management.
9.1 State Management
First, we define a TypedDict to manage the state throughout the processing flow:
class State(TypedDict):
    question: str            # User's original question
    query: Search            # Structured query information
    context: List[Document]  # Retrieved relevant documents
    answer: str              # Final answer
This state dictionary contains all the key data in the processing flow:
- question: stores the user's original input question
- query: stores the structured query after analysis (using the previously defined Search type)
- context: stores the relevant documents retrieved from the vector store
- answer: stores the final generated answer
9.2 Processing Steps
The processing flow is broken down into three main steps. Each step is an independent function that receives the current state and returns only the portion of the state it updates:
1. Query Analysis (analyze_query):
def analyze_query(state: State):
    # Use the LLM to convert the natural language question into a structured query
    structured_llm = llm.with_structured_output(Search)
    # Invoke the LLM for structured output
    query = structured_llm.invoke(state["question"])
    # Return the updated portion of the state
    return {"query": query}
The purpose of this step is:
- Receive the user's natural language question
- Use the LLM to analyze the question and generate a structured query
- Determine which part of the document (beginning, middle, or end) the query should search
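Assuming a valid OpenAI API key is configured, this step can be exercised on its own (the printed output is illustrative):
# Run the analysis step in isolation
print(analyze_query({"question": "What is recommended at the end of the article?"}))
# e.g. {'query': {'query': 'recommended libraries', 'section': 'end'}}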
2. Document Retrieval (retrieve):
def retrieve(state: State):
    query = state["query"]
    # Perform a similarity search against the vector store
    retrieved_docs = vector_store.similarity_search(
        query["query"],  # Use the query text from the structured query
        filter=lambda doc: doc.metadata.get("section") == query["section"],  # Metadata filtering
    )
    # Return the retrieved documents
    return {"context": retrieved_docs}
This step performs the following:

- Retrieves the structured query from the state
- Uses the query text to search for similar documents in the vector store
- Filters documents using the section metadata
- Returns the list of most relevant documents
3. Answer Generation (generate):
def generate(state: State):
    # Merge the content of the retrieved documents
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    # Use the prompt template to construct the input message
    messages = prompt.invoke({
        "question": state["question"],  # Original question
        "context": docs_content,        # Merged document content
    })
    # Use the LLM to generate the answer
    response = llm.invoke(messages)
    # Return the generated answer
    return {"answer": response.content}
The processing flow of this step is:
- Merge the content of all retrieved documents into a single text
- Use the prompt template to construct an input containing the context and the question
- Call the LLM to generate the final answer
- Return the generated answer text
10. Assemble Processing Graph
Use LangGraph to chain the various processing steps into a directed acyclic graph:
- The output of each step automatically updates the state for use by the next step
- Conditional branching and parallel processing are supported (this example uses a simple linear flow)
graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()
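Besides streaming, the compiled graph can also be invoked synchronously, which returns the final state as a plain dict (a minimal sketch):
# Synchronous invocation: the result contains every State field
result = graph.invoke({"question": "What libraries are recommended at the end of the article?"})
print(result["answer"])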
11. Usage Example
for message, metadata in graph.stream(
    {"question": "Please list the recommended Python libraries at the end of the article?"},
    stream_mode="messages",
):
    print(message.content, end="")
# The recommended Python libraries at the end of the article include: Pyro, PyUserInput, scapy, wifi, Pingo, keyboard, mouse, Python-Future, Six, and modernize. Thanks for asking!
12. Summary
This project demonstrates how to build a complete intelligent document retrieval system using LangGraph. The main features of the system include:
1. Intelligent web content extraction
2. Intelligent segmentation and metadata enhancement of documents
3. Vectorized storage and similarity retrieval
4. LLM-based intelligent Q&A
5. Process-oriented architecture
Through this system, we can easily implement intelligent retrieval and Q&A functionality for large documents. This architecture is not only suitable for web content but can also be extended to other types of document processing scenarios.
13. Complete Code
import bs4
from typing import Literal
from typing_extensions import List, TypedDict, Annotated
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, StateGraph
from langchain_core.prompts import PromptTemplate
# Web content loading
# WebBaseLoader uses urllib to load HTML from web URLs and BeautifulSoup to parse it into text.
# We can customize the HTML-to-text parsing by passing parameters to the BeautifulSoup parser;
# here we parse only the "li" HTML tags
loader = WebBaseLoader(
    web_paths=("https://github.com/jobbole/awesome-python-cn/blob/master/README.md",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer("li")
    ),
)
docs = loader.load()
# print("docs: ", docs)
# Document segmentation
# The recursive character text splitter will recursively split the document using common delimiters (like new lines)
# until each chunk is of appropriate size. This is the recommended text splitter for general text use cases.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)
# Query analysis: Add metadata to documents to transform or build optimized search queries from the original user input
total_documents = len(all_splits)
third = total_documents // 3
for i, document in enumerate(all_splits):
    if i < third:
        document.metadata["section"] = "beginning"
    elif i < 2 * third:
        document.metadata["section"] = "middle"
    else:
        document.metadata["section"] = "end"
# Define a pattern for our search queries
class Search(TypedDict):
    """Search query."""

    query: Annotated[str, ..., "Search query to run."]
    section: Annotated[
        Literal["beginning", "middle", "end"],
        ...,
        "Section to query.",
    ]
# Document embedding storage
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = InMemoryVectorStore(embeddings)
_ = vector_store.add_documents(documents=all_splits)
# Large language model
llm = ChatOpenAI(model="gpt-4o-mini")
# Prompt template
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
prompt = PromptTemplate.from_template(template)
# Define the state of the graph, including: question, query pattern, context, and answer
class State(TypedDict):
    question: str
    query: Search
    context: List[Document]
    answer: str
# Query analysis step: Extract input question information into the specified query pattern Search
def analyze_query(state: State):
    structured_llm = llm.with_structured_output(Search)
    query = structured_llm.invoke(state["question"])
    return {"query": query}
# Retrieval step: run a similarity search using the structured query
def retrieve(state: State):
    query = state["query"]
    retrieved_docs = vector_store.similarity_search(
        query["query"],
        filter=lambda doc: doc.metadata.get("section") == query["section"],
    )
    return {"context": retrieved_docs}
# Generation step: Format the retrieved context and original question into the prompt for the chat model
def generate(state: State):
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])
    messages = prompt.invoke({"question": state["question"], "context": docs_content})
    response = llm.invoke(messages)
    return {"answer": response.content}
# Compile the graph
# Connect the analysis, retrieval, and generation steps into a single sequence
graph_builder = StateGraph(State).add_sequence([analyze_query, retrieve, generate])
graph_builder.add_edge(START, "analyze_query")
graph = graph_builder.compile()
for message, metadata in graph.stream(
    {"question": "Please list the recommended Python libraries at the end of the article?"},
    stream_mode="messages",
):
    print(message.content, end="")
# The recommended Python libraries at the end of the article include: Pyro, PyUserInput, scapy, wifi, Pingo, keyboard, mouse, Python-Future, Six, and modernize. Thanks for asking!