Project Introduction
- OCR support for over 90 languages, outperforming cloud services in benchmark tests
- Line-level text detection for any language
- Layout analysis (detection of tables, images, headings, etc.)
- Reading order detection
(Example images: Detection, OCR, Layout, and Reading Order results)
Surya is named after the Hindu sun god with universal vision.
Commercial Use
I hope Surya can be used as widely as possible while still funding my development/training costs. Research and personal use are always allowed, but there are some restrictions on commercial use.

The model weights are licensed under cc-by-nc-sa-4.0, but I will waive that restriction for any organization with total revenue below $5 million in the last 12 months and lifetime venture/angel funding below $5 million. If you want to remove the GPL license requirement (dual licensing) and/or use the weights commercially above the revenue limits, please check the options here.
Installation
You need Python 3.9+ and PyTorch. If you are not using a Mac or GPU machine, you may need to install the CPU version of torch first. See here for more details.
Installation:
pip install surya-ocr
On the first run of Surya, the model weights will be downloaded automatically. Note that Surya does not yet work with transformers 4.37+, so you need to stay on 4.36.2, which is installed with Surya.
Usage
- Check the settings in surya/settings.py. You can override any setting with an environment variable.
- The system will automatically detect your torch device, but you can override this, e.g. TORCH_DEVICE=cuda. There is a bug with mps devices (on Apple) for text detection that may prevent it from working properly.
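Both of the settings above are plain environment variables, so one way to set them is from Python before Surya is imported. This is a minimal sketch; TORCH_DEVICE and RECOGNITION_BATCH_SIZE are the settings mentioned in this README, and the exact override behavior depends on surya/settings.py:

```python
import os

# Set overrides before importing surya, so its settings module picks them up.
os.environ["TORCH_DEVICE"] = "cuda"
os.environ["RECOGNITION_BATCH_SIZE"] = "512"
```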
Interactive Application
I have provided a Streamlit application that allows you to interactively try Surya on images or PDF files. Run it with:

pip install streamlit
surya_gui
Pass the --math command line argument to use the math text detection model instead of the default model. This will detect math better, but will perform worse on everything else.
OCR (Optical Character Recognition)
This command will write out a JSON file containing the detected text and bounding boxes:
surya_ocr DATA_PATH --images --langs hi,en
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --langs specifies the languages to use for OCR. You can specify multiple languages separated by commas (I do not recommend using more than 4). Use the language names here or the two-letter ISO codes. Surya supports over 90 languages; see surya/languages.py.
- --lang_file lets you use different languages for different PDFs/images. The format is a JSON dictionary with keys as filenames and values as lists, e.g. {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}.
- --images will save page images and detected text lines (optional)
- --results_dir specifies the directory to save results instead of the default directory
- --max specifies the maximum number of pages to process if you do not want to process everything
- --start_page specifies the starting page number to process
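For example, a language file for two hypothetical inputs could be produced like this (the filenames and the langs.json name are illustrative):

```python
import json

# Map each input file to the OCR languages to use for it.
# Filenames here are illustrative placeholders.
lang_map = {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}

with open("langs.json", "w") as f:
    json.dump(lang_map, f)
```

You would then pass the file via --lang_file instead of --langs.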
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- text_lines – the detected text and bounding boxes for each line
  - text – the text in the line
  - confidence – the model's confidence in the detected text (0-1)
  - polygon – the polygon of the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - bbox – the axis-aligned rectangle of the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
- languages – the languages specified for the page
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
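A sketch of reading this structure back, using a hand-built entry that mirrors the fields above (all values are illustrative, not real model output):

```python
# A hand-built entry mirroring the results.json structure described above.
results = {
    "example": [
        {
            "text_lines": [
                {
                    "text": "Hello world",
                    "confidence": 0.98,
                    "polygon": [[10, 10], [90, 10], [90, 30], [10, 30]],
                    "bbox": [10, 10, 90, 30],
                }
            ],
            "languages": ["en"],
            "page": 1,
            "image_bbox": [0, 0, 612, 792],
        }
    ]
}

# Collect (page, text) pairs for every detected line.
lines = [
    (page["page"], line["text"])
    for page in results["example"]
    for line in page["text_lines"]
]
```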
Performance Tips
When using a GPU, setting the RECOGNITION_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 50MB of VRAM, so very high batch sizes are possible. The default batch size is 256, which uses about 12.8GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 32.
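The arithmetic above (50MB per item, 256 items ≈ 12.8GB) suggests a simple way to size the batch for your VRAM budget. The helper below is an illustration of that arithmetic, not part of Surya:

```python
def pick_batch_size(vram_gb: float, mb_per_item: int = 50) -> int:
    """Largest batch size that fits the given VRAM budget,
    assuming ~50MB per recognition batch item as stated above."""
    return int(vram_gb * 1000) // mb_per_item
```

With the documented 50MB per item, a 12.8GB budget gives back the default of 256.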
From Python
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection import segformer
from surya.model.recognition.model import load_model
from surya.model.recognition.processor import load_processor
image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages
det_processor, det_model = segformer.load_processor(), segformer.load_model()
rec_model, rec_processor = load_model(), load_processor()
predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
Text Line Detection
This command will write out a JSON file containing the detected bounding boxes.
surya_detect DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
- --math uses a specialized math detection model instead of the default model. This will perform better for math detection.
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected text
  - bbox – the axis-aligned rectangle of the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - polygon – the polygon of the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - confidence – the model's confidence in the detected text (0-1)
- vertical_lines – vertical lines detected in the document
  - bbox – the axis-aligned line coordinates
- horizontal_lines – horizontal lines detected in the document
  - bbox – the axis-aligned line coordinates
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
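The bbox field is simply the axis-aligned envelope of the polygon. A small hypothetical helper makes the relationship concrete:

```python
def polygon_to_bbox(polygon):
    # Axis-aligned (x1, y1, x2, y2) envelope of a 4-point polygon.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]
```

For an axis-aligned polygon the two representations coincide; for a rotated line the bbox is the smallest upright rectangle containing it.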
Performance Tips
When using a GPU, setting the DETECTOR_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 9GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 2.
From Python
from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.segformer import load_model, load_processor
image = Image.open(IMAGE_PATH)
model, processor = load_model(), load_processor()
# predictions is a list of dicts, one per image
predictions = batch_text_detection([image], model, processor)
Layout Analysis
This command will write out a JSON file using the detected layout.
surya_layout DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected layout regions
  - bbox – the axis-aligned rectangle of the region in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - polygon – the polygon of the region in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - confidence – the model's confidence in the detected region (0-1). This value is currently not very reliable.
  - label – the label of the bbox. One of Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Figure, Section-header, Table, Text, Title.
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
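A common use of this output is pulling out all regions with a given label, such as tables. The page dict below is a hand-built illustration of the structure described above, not real model output:

```python
# Illustrative page entry following the layout results structure above.
page = {
    "bboxes": [
        {"bbox": [0, 0, 500, 40], "label": "Title", "confidence": 0.9},
        {"bbox": [0, 60, 500, 300], "label": "Table", "confidence": 0.8},
        {"bbox": [0, 320, 500, 700], "label": "Text", "confidence": 0.85},
    ],
    "page": 1,
}

# Keep only the bounding boxes of regions labeled as tables.
tables = [b["bbox"] for b in page["bboxes"] if b["label"] == "Table"]
```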
Performance Tips
When using a GPU, setting the DETECTOR_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 9GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 2.
From Python
from PIL import Image
from surya.detection import batch_text_detection
from surya.layout import batch_layout_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.settings import settings
image = Image.open(IMAGE_PATH)
model = load_model(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
processor = load_processor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
det_model = load_model()
det_processor = load_processor()
# layout_predictions is a list of dicts, one per image
line_predictions = batch_text_detection([image], det_model, det_processor)
layout_predictions = batch_layout_detection([image], model, processor, line_predictions)
Reading Order
This command will write out a JSON file containing the detected reading order and layout.
surya_order DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected regions
  - bbox – the axis-aligned rectangle of the region in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - position – the position of the bbox in the reading order, starting from 0
  - label – the label of the bbox. See the layout section of the documentation for a list of potential labels.
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
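Since every bbox carries a position index, reconstructing the reading order is just a sort. The page dict below is hand-built illustrative data following the structure above:

```python
# Illustrative page entry following the reading order results structure above.
page = {
    "bboxes": [
        {"bbox": [0, 300, 400, 500], "position": 1, "label": "Text"},
        {"bbox": [0, 0, 400, 50], "position": 0, "label": "Title"},
        {"bbox": [0, 520, 400, 560], "position": 2, "label": "Page-footer"},
    ],
    "page": 1,
}

# Sort regions by their reading-order position.
ordered = sorted(page["bboxes"], key=lambda b: b["position"])
ordered_labels = [b["label"] for b in ordered]
```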
Performance Tips
When using a GPU, setting the ORDER_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 360MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 11GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 4.
From Python
from PIL import Image
from surya.ordering import batch_ordering
from surya.model.ordering.processor import load_processor
from surya.model.ordering.model import load_model
image = Image.open(IMAGE_PATH)
# bboxes should be a list of lists with layout bboxes for the image in [x1,y1,x2,y2] format
# You can get this from the layout model, see above for usage
bboxes = [bbox1, bbox2, ...]
model = load_model()
processor = load_processor()
# order_predictions will be a list of dicts, one per image
order_predictions = batch_ordering([image], [bboxes], model, processor)
https://github.com/VikParuchuri/surya