Building A Local Document Intelligence Stack Using Docling, Ollama, Phi-4


In the current era of LLMs, banks and financial institutions are at a disadvantage: cutting-edge models are difficult to run locally because of their hardware requirements, yet the sensitivity of banking data raises serious privacy concerns when those models are only available as cloud services. To address both problems, organizations can turn to local or small language model (SLM) setups that keep data in-house and avoid leaking sensitive information. This approach lets you leverage capable LLMs (locally or with minimal external calls) while remaining compliant with regulations such as GDPR, HIPAA, and various financial directives.

This article demonstrates how to build a fully on-premise document intelligence solution by combining the following:

  • ExtractThinker — An open-source framework that orchestrates OCR, classification, and structured data extraction with LLMs.

  • Ollama — A local deployment solution for language models like Phi-4 or Llama 3.x.

  • Docling or MarkItDown — Flexible libraries for handling document loading, OCR, and layout parsing.

Whether you are operating under strict confidentiality rules, processing scanned PDFs, or simply want advanced vision-based extraction, this end-to-end stack provides a secure, high-performance pipeline that runs entirely within your own infrastructure.

1. Choose the Right Model (Text vs. Visual)


When building a document intelligence stack, the first step is deciding whether you need a text-only model or a vision-capable one. Text-only models are usually preferred for local deployments because they are more widely available and less demanding. However, vision-capable models become essential for advanced splitting tasks, especially when documents rely on visual cues such as layout, color schemes, or distinctive formatting.

In some cases, you can pair different models for different stages. For example, a small moondream model (0.5B parameters) can handle splitting while Phi-4 (14B) handles classification and extraction. Many large organizations instead deploy a single, more powerful model (such as Llama 3.3 or Qwen 2.5 in the 70B range) to cover every use case. If you only need English-centric IDP, Phi-4 alone handles most tasks, with a lightweight moondream model kept around as a vision fallback. It ultimately depends on your requirements and available infrastructure.
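As a preview of this pairing idea (shown in full in section 5), choosing different models for different stages in ExtractThinker comes down to which model you hand to the splitter versus the extractor. The model tags below come from the later example and assume you have pulled them with Ollama:

from extract_thinker import Extractor, Process, ImageSplitter

# A small vision model handles visual page splitting...
process = Process()
process.load_splitter(ImageSplitter(model="ollama/moondream:v2"))

# ...while a larger text model handles classification and extraction.
extractor = Extractor()
extractor.load_llm("ollama/phi4")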

2. Handling Documents: MarkItDown vs. Docling

For document parsing, two popular libraries stand out:

MarkItDown

  • Simple and lightweight, maintained by Microsoft.

  • Well suited to straightforward, text-based tasks that do not require multiple OCR engines.

  • Easy to install and integrate.

Docling

  • More advanced, with support for multiple OCR engines (Tesseract, AWS Textract, Google Document AI, etc.).

  • Well suited to scanned-document workflows and robust extraction from image-based PDFs.

  • Detailed documentation, adaptable to complex layouts.

ExtractThinker wraps both as DocumentLoaderMarkItDown and DocumentLoaderDocling, so you can switch between them as needed: MarkItDown for simple digital PDFs, Docling when you need multi-engine OCR.

3. Running Local Models

Although Ollama is a popular tool for hosting LLMs locally, there are now several other on-premise solutions that integrate seamlessly with ExtractThinker:

  • LocalAI — An open-source platform that mimics the OpenAI API locally. It can run LLMs like Llama 2 or Mistral on consumer-grade hardware (even just CPU) and provides a simple connection endpoint.

  • OpenLLM — A BentoML project that exposes LLMs through an OpenAI-compatible API. It is optimized for throughput and low latency, works both locally and in the cloud, and supports a wide range of open-source LLMs.

  • Llama.cpp — A low-level method to run Llama models with advanced custom configurations. Very suitable for fine control or HPC settings, though more complex to manage.

Ollama is usually the first choice because it is easy to set up and has a simple CLI. However, for enterprise or HPC scenarios, solutions like a Llama.cpp server deployment, OpenLLM, or LocalAI may be more appropriate. To integrate any of these with ExtractThinker, simply point your code at the local endpoint via an environment variable or base URL.
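For instance, with Ollama the snippets later in this article simply set an API_BASE environment variable before loading the model; swapping in LocalAI or another locally hosted server is mostly a matter of changing that URL. The ports below are common defaults and are only assumptions; adjust them to your deployment:

import os
from extract_thinker import Extractor

# Point ExtractThinker at whichever locally hosted endpoint you run.
os.environ["API_BASE"] = "http://localhost:11434"    # Ollama's default port
# os.environ["API_BASE"] = "http://localhost:8080"   # e.g. a LocalAI instance (assumed default)

extractor = Extractor()
extractor.load_llm("ollama/phi4")  # model name as registered with your local server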

4. Handling Context Windows


When using local models with limited context windows (e.g., ~8K tokens or fewer), it becomes crucial to manage the following two points:

Splitting Documents

To avoid exceeding the model’s input capacity, lazy splitting is the ideal choice. Instead of processing the entire document at once (a rough sketch follows the note below):

  • You can compare pages step by step (for example, pages 1-2, then pages 2-3) to determine if they belong to the same sub-document.

  • If they match, combine them for the next step. If they do not match, start a new segment.

  • This approach saves memory and allows you to load and analyze only a few pages at a time, scaling up to very large PDFs.

Note: When you have a higher token limit, concatenation is ideal; for limited windows, pagination is preferred.
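Here is a rough sketch of the pairwise comparison described above. It is illustrative only, not ExtractThinker's internal implementation; pages_belong_together is a hypothetical stand-in for the model call that compares two neighboring pages:

def lazy_split(pages, pages_belong_together):
    """Group consecutive pages into sub-documents via pairwise comparison."""
    if not pages:
        return []
    segments = [[pages[0]]]
    for prev_page, next_page in zip(pages, pages[1:]):
        if pages_belong_together(prev_page, next_page):
            segments[-1].append(next_page)  # same sub-document: keep accumulating
        else:
            segments.append([next_page])    # boundary found: start a new segment
    return segments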

Handling Partial Responses

With smaller local models, a large prompt can also cause the response itself to be truncated. The PaginationHandler addresses this by (a rough sketch of the idea follows this list):

  • Splitting the pages of the document into separate requests (one page per request).

  • Finally merging page-level results, resolving conflicts if there are discrepancies in certain fields.
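In rough pseudocode, the idea looks like this. extract_page and the merge rule are hypothetical placeholders; in practice ExtractThinker handles this for you when you pass CompletionStrategy.PAGINATE, as in the example in section 5:

def extract_with_pagination(pages, extract_page):
    """Send one request per page, then merge the partial results."""
    partials = [extract_page(page) for page in pages]  # one small request per page
    merged = {}
    for partial in partials:
        for field, value in partial.items():
            if merged.get(field) in (None, "", []):
                merged[field] = value
            # If two pages disagree on an already-populated field, a conflict rule
            # is needed (e.g., prefer the first non-empty value, or flag it for review).
    return merged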

Quick Example Workflow

  1. Lazy splitting of the PDF so that each chunk/page is below the model’s limit.

  2. Cross-page pagination: Results for each chunk are returned separately.

  3. Merge partial page results into final structured data.

This minimalist approach ensures you never exceed the context window of the local model—whether in the way you provide the PDF or in how you handle multi-page responses.

5. ExtractThinker: Building the Stack

Below is a minimal code snippet demonstrating how to integrate these components. First, install ExtractThinker:

pip install extract-thinker

Document Loader

As mentioned, we can use MarkItDown or Docling.

from extract_thinker import DocumentLoaderMarkItDown, DocumentLoaderDocling

# DocumentLoaderDocling or DocumentLoaderMarkItDown
document_loader = DocumentLoaderDocling()

Defining Contracts

We use Pydantic-based contracts to specify the structure of the data to extract. For example, invoices and driver licenses:

from extract_thinker.models.contract import Contract
from pydantic import Field

class InvoiceContract(Contract):
    invoice_number: str = Field(description="Unique invoice identifier")
    invoice_date: str = Field(description="Date of the invoice")
    total_amount: float = Field(description="Overall total amount")

class DriverLicense(Contract):
    name: str = Field(description="Full name on the license")
    age: int = Field(description="Age of the license holder")
    license_number: str = Field(description="License number")

Classification

If you have multiple document types, define a classification object. You can specify:

  • The name of each classification (e.g., “Invoice”).

  • A description.

  • The contract it maps to.

from extract_thinker import Classification

# Map each document type to its contract; the names and descriptions here are illustrative.
# This list is referenced later as TEST_CLASSIFICATIONS.
TEST_CLASSIFICATIONS = [
    Classification(
        name="Invoice",
        description="An invoice document",
        contract=InvoiceContract
    ),
    Classification(
        name="Driver License",
        description="A driver license document",
        contract=DriverLicense
    ),
]

Putting It All Together: The Local Extraction Process

Next, we create an extractor using our chosen local model (Ollama, LocalAI, etc.), then build a workflow that loads, splits, classifies, and extracts document content in a single pipeline.

import os
from dotenv import load_dotenv

from extract_thinker import (
    Extractor,
    Process,
    Classification,
    SplittingStrategy,
    CompletionStrategy,
    ImageSplitter,
    TextSplitter
)

# Load environment variables (if you store LLM endpoints/API_BASE, etc. in .env)
load_dotenv()

# Example path to a multi-page document
MULTI_PAGE_DOC_PATH = "path/to/your/multi_page_doc.pdf"

def setup_local_process():
    """
    Helper function to set up an ExtractThinker process
    using local LLM endpoints (e.g., Ollama, LocalAI, OnPrem.LLM, etc.)
    """

    # 1) Create an Extractor
    extractor = Extractor()

    # 2) Attach our chosen DocumentLoader (Docling or MarkItDown)
    extractor.load_document_loader(document_loader)

    # 3) Configure your local LLM
    #    For Ollama, you might do:
    os.environ["API_BASE"] = "http://localhost:11434"  # Replace with your local endpoint
    extractor.load_llm("ollama/phi4")  # or "ollama/llama3.3" or your local model
    
    # 4) Attach extractor to each classification
    TEST_CLASSIFICATIONS[0].extractor = extractor
    TEST_CLASSIFICATIONS[1].extractor = extractor

    # 5) Build the Process
    process = Process()
    process.load_document_loader(document_loader)
    return process

def run_local_idp_workflow():
    """
    Demonstrates loading, classifying, splitting, and extracting
    a multi-page document with a local LLM.
    """
    # Initialize the process
    process = setup_local_process()

    # (Optional) You can use ImageSplitter(model="ollama/moondream:v2") for the split
    process.load_splitter(TextSplitter(model="ollama/phi4"))

    # 1) Load the file
    # 2) Split into sub-documents with the LAZY strategy
    # 3) Classify each sub-document with our TEST_CLASSIFICATIONS
    # 4) Extract fields based on the matched contract (Invoice or DriverLicense)
    result = (
        process
        .load_file(MULTI_PAGE_DOC_PATH)
        .split(TEST_CLASSIFICATIONS, strategy=SplittingStrategy.LAZY)
        .extract(vision=False, completion_strategy=CompletionStrategy.PAGINATE)
    )

    # 'result' is a list of extracted objects (InvoiceContract or DriverLicense)
    for item in result:
        # Print or store each extracted data model
        if isinstance(item, InvoiceContract):
            print("[Extracted Invoice]")
            print(f"Number: {item.invoice_number}")
            print(f"Date: {item.invoice_date}")
            print(f"Total: {item.total_amount}")
        elif isinstance(item, DriverLicense):
            print("[Extracted Driver License]")
            print(f"Name: {item.name}, Age: {item.age}")
            print(f"License #: {item.license_number}")

# For a quick test, just call run_local_idp_workflow()
if __name__ == "__main__":
    run_local_idp_workflow()

6. Privacy and PII: LLMs in the Cloud

Not every organization can (or wants to) run local hardware. Some organizations prefer cloud-based advanced LLMs. If so, keep in mind:

  • Data privacy risks: Sending sensitive data to the cloud raises potential compliance issues.

  • GDPR/HIPAA: Regulations may completely restrict data from leaving your premises.

  • VPC + Firewall: You can isolate cloud resources in a private network, but this adds complexity.

Note: Many LLM APIs (e.g., OpenAI) do comply with GDPR regulations. But if you are under strict regulation or wish to easily switch providers, consider local or shielded cloud approaches.

PII Masking

A reliable approach is to build a PII-masking pipeline. Tools like Presidio can automatically detect and redact personal identifiers before the text is sent to the LLM, so you can remain provider-agnostic while staying compliant.
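A minimal sketch with Presidio (assuming presidio-analyzer and presidio-anonymizer are installed along with an English NLP model; which entities are detected depends on your configuration):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Invoice issued to John Smith, phone +1-212-555-0101."
findings = analyzer.analyze(text=text, language="en")                 # detect PII entities
masked = anonymizer.anonymize(text=text, analyzer_results=findings)   # replace them with placeholders
print(masked.text)  # e.g. "Invoice issued to <PERSON>, phone <PHONE_NUMBER>."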


7. Conclusion

By combining ExtractThinker with local LLMs (such as Ollama, LocalAI, or OnPrem.LLM) and flexible DocumentLoaders (Docling or MarkItDown), you can build a secure local document intelligence workflow from scratch. If regulatory requirements demand complete confidentiality or minimal external calls, this stack keeps your data internal without sacrificing the capabilities of modern LLMs.
