Advanced LlamaIndex Documentation: Using LlamaParse for Complex PDFs

In the previous article, we primarily used SimpleDirectoryReader to process documents. Today, we will explore how to use LlamaParse to handle more complex PDF documents, achieving more accurate document parsing and information extraction.

1. Introduction to LlamaParse

LlamaParse is a tool specifically designed for handling complex documents, with the following features:

  • Accurate layout recognition
  • Table and chart extraction
  • Format preservation
  • Support for multiple output formats

Online experience address: https://cloud.llamaindex.ai/

Advanced LlamaIndex Documentation: Using LlamaParse for Complex PDFs

2. Environment Configuration

First, we need to install the necessary dependencies and configure the environment:

import nest_asyncio
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Apply nest_asyncio to support asynchronous operations
nest_asyncio.apply()

Then, apply for an API Key on the LlamaCloud website https://cloud.llamaindex.ai/

Advanced LlamaIndex Documentation: Using LlamaParse for Complex PDFs

3. Basic Implementation

3.1 Initializing LlamaParse

parser = LlamaParse(
    api_key="your-llama-parse-api-key",  # Can be set via environment variables
    result_type="markdown",  # Supports "markdown" and "text"
    verbose=True # Enable detailed output
)

3.2 Configuring Document Reader

# Set file extractor, specify PDF files to be parsed by LlamaParse
file_extractor = {".pdf": parser} 

# Use SimpleDirectoryReader to read documents from the ./pdf directory
documents = SimpleDirectoryReader(
    "./pdf",
    file_extractor=file_extractor
).load_data()

3.3 Creating Index and Querying

# Create vector storage index
index = VectorStoreIndex.from_documents(documents)

# Persist the index to the storage directory
index.storage_context.persist(persist_dir="storage")

# Create query engine
query_engine = index.as_query_engine()

# Execute query
response = query_engine.query("Your question")
print(response)

4. Using Advanced Features

4.1 Custom Parsing Configuration

  1. Performance-related configurations
parser = LlamaParse(
    num_workers=4,                 # Number of worker threads for API requests (between 1-9)
    check_interval=1,              # Interval time to check parsing progress (seconds)
    max_timeout=2000,             # Maximum wait timeout (seconds)
    show_progress=True,           # Show progress when parsing multiple files
    verbose=True,                 # Display detailed parsing process information
)
  1. Parsing control configurations
parser = LlamaParse(
    parsing_instruction="",        # Custom parsing instruction
    skip_diagonal_text=False,      # Whether to ignore diagonal text
    fast_mode=False,              # Fast mode (skip OCR and table/title reconstruction)
    do_not_unroll_columns=False,  # Keep the original column layout of the document
    split_by_page=True,           # Whether to split the document by page
)
  1. Page formatting configurations
parser = LlamaParse(
    page_separator="\n---\n",     # Page separator
    page_prefix=None,             # Page prefix
    page_suffix=None,             # Page suffix
    bounding_box=None,            # Bounding box for text extraction
    target_pages=None,            # Specify target pages to extract
)
  1. Cache control configurations
parser = LlamaParse(
    invalidate_cache=False,       # Whether to ignore cache and reprocess the document
    do_not_cache=False,          # Whether to not cache processing results
)
  1. GPT-4 and Multimodal Support
parser = LlamaParse(
    gpt4o_mode=False,            # Whether to use GPT-4 for text extraction
    gpt4o_api_key=None,          # GPT-4 API key
    use_vendor_multimodal_model=False,  # Whether to use vendor's multimodal model
    vendor_multimodal_api_key=None,     # Multimodal API key
    vendor_multimodal_model_name=None,  # Multimodal model name
    take_screenshot=False,        # Whether to take screenshots of each page
)

5. Document Structure Handling

5.1 Table Extraction

from llama_parse import LlamaParse

# 1. Initialize the parser
parser = LlamaParse(
    api_key="your-api-key",
    verbose=True
)

# 2. Get JSON results
json_results = parser.get_json_result("your_file.pdf")

# 3. Process table data
for result in json_results:
    for page in result['pages']:
        # Get all tables on the page
        tables = page.get('tables', [])
        
        for table in tables:
            print(f"Page number: {page['page']}")
            print(f"Table location: {table.get('bbox')}")  # Table position on the page
            print(f"Table data: {table.get('data')}")  # Table content

5.2 Chart Processing

from llama_parse import LlamaParse

# 1. Initialize the parser
parser = LlamaParse(
    api_key="your-api-key",
    verbose=True
)

# 2. Get JSON results
json_results = parser.get_json_result("your_file.pdf")

# 3. Process table data
for result in json_results:
    for page in result['pages']:
        # Get all tables on the page
        tables = page.get('tables', [])
        
        for table in tables:
            print(f"Page number: {page['page']}")
            print(f"Table location: {table.get('bbox')}")  # Table position on the page
            print(f"Table data: {table.get('data')}")  # Table content

Conclusion

Through this article, we learned:

  • The basic usage of LlamaParse
  • Advanced features and configurations
  • Document structure handling

We hope this series helps you better use LlamaIndex to build powerful AI applications!

Leave a Comment