Project Introduction
- OCR support for over 90 languages, outperforming cloud services in benchmark tests
- Line-level text detection for any language
- Layout analysis (detection of tables, images, headings, etc.)
- Reading order detection
(Example images: Detection, OCR, Layout, and Reading Order results)
Surya is named after the Hindu sun god with universal vision.
Commercial Use
I hope Surya can be used as widely as possible while still funding my development/training costs. Research and personal use are always allowed, but there are some restrictions on commercial use.

The model weights are licensed under cc-by-nc-sa-4.0, but I will waive that restriction for any organization with total revenue below $5 million in the last 12 months and lifetime venture/angel funding below $5 million. If you want to remove the GPL license requirement (dual licensing) and/or use the weights commercially above the revenue limits, please check the options here.
Installation
You need Python 3.9+ and PyTorch. If you are not using a Mac or GPU machine, you may need to install the CPU version of torch first. See here for more details.
Installation:
pip install surya-ocr
On the first run of Surya, the model weights will be downloaded automatically. Note that Surya does not yet work with transformers 4.37+, so you need to stay on 4.36.2, which is installed with Surya.
Usage
- Check the settings in surya/settings.py. You can override any setting with an environment variable.
- The system will automatically detect your torch device, but you can override this, e.g. TORCH_DEVICE=cuda. There is a bug with mps devices (on Apple) for text detection that may prevent it from working properly.
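Both of the settings above are plain environment variables, so one way to set them is from Python before Surya is imported. This is a minimal sketch; TORCH_DEVICE and RECOGNITION_BATCH_SIZE are the settings mentioned in this README, and the exact override behavior depends on surya/settings.py:

```python
import os

# Set overrides before importing surya, so its settings module picks them up.
os.environ["TORCH_DEVICE"] = "cuda"
os.environ["RECOGNITION_BATCH_SIZE"] = "512"
```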
Interactive Application
I have provided a Streamlit application that allows you to interactively try Surya on images or PDF files. Run it with:

pip install streamlit
surya_gui
Pass the --math command line argument to use the math text detection model instead of the default model. This will detect math better, but will perform worse on everything else.
OCR (Optical Character Recognition)
This command will write out a JSON file containing the detected text and bounding boxes:
surya_ocr DATA_PATH --images --langs hi,en
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --langs specifies the languages to use for OCR. You can specify multiple languages separated by commas (I do not recommend using more than 4). Use the language names here or the two-letter ISO codes. Surya supports over 90 languages; see surya/languages.py.
- --lang_file lets you use different languages for different PDFs/images. The format is a JSON dictionary with keys as filenames and values as lists, e.g. {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}.
- --images will save page images and detected text lines (optional)
- --results_dir specifies the directory to save results instead of the default directory
- --max specifies the maximum number of pages to process if you do not want to process everything
- --start_page specifies the starting page number to process
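For example, a language file for two hypothetical inputs could be produced like this (the filenames and the langs.json name are illustrative):

```python
import json

# Map each input file to the OCR languages to use for it.
# Filenames here are illustrative placeholders.
lang_map = {"file1.pdf": ["en", "hi"], "file2.pdf": ["en"]}

with open("langs.json", "w") as f:
    json.dump(lang_map, f)
```

You would then pass the file via --lang_file instead of --langs.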
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- text_lines – the detected text and bounding boxes for each line
  - text – the text in the line
  - confidence – the model's confidence in the detected text (0-1)
  - polygon – the polygon of the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - bbox – the axis-aligned rectangle of the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
- languages – the languages specified for the page
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
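A sketch of reading this structure back, using a hand-built entry that mirrors the fields above (all values are illustrative, not real model output):

```python
# A hand-built entry mirroring the results.json structure described above.
results = {
    "example": [
        {
            "text_lines": [
                {
                    "text": "Hello world",
                    "confidence": 0.98,
                    "polygon": [[10, 10], [90, 10], [90, 30], [10, 30]],
                    "bbox": [10, 10, 90, 30],
                }
            ],
            "languages": ["en"],
            "page": 1,
            "image_bbox": [0, 0, 612, 792],
        }
    ]
}

# Collect (page, text) pairs for every detected line.
lines = [
    (page["page"], line["text"])
    for page in results["example"]
    for line in page["text_lines"]
]
```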
Performance Tips
When using a GPU, setting the RECOGNITION_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 50MB of VRAM, so very high batch sizes are possible. The default batch size is 256, which uses about 12.8GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 32.
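The arithmetic above (50MB per item, 256 items ≈ 12.8GB) suggests a simple way to size the batch for your VRAM budget. The helper below is an illustration of that arithmetic, not part of Surya:

```python
def pick_batch_size(vram_gb: float, mb_per_item: int = 50) -> int:
    """Largest batch size that fits the given VRAM budget,
    assuming ~50MB per recognition batch item as stated above."""
    return int(vram_gb * 1000) // mb_per_item
```

With the documented 50MB per item, a 12.8GB budget gives back the default of 256.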
From Python
from PIL import Image
from surya.ocr import run_ocr
from surya.model.detection import segformer
from surya.model.recognition.model import load_model
from surya.model.recognition.processor import load_processor
image = Image.open(IMAGE_PATH)
langs = ["en"] # Replace with your languages
det_processor, det_model = segformer.load_processor(), segformer.load_model()
rec_model, rec_processor = load_model(), load_processor()
predictions = run_ocr([image], [langs], det_model, det_processor, rec_model, rec_processor)
Text Line Detection
This command will write out a JSON file containing the detected bounding boxes.
surya_detect DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
- --math uses a specialized math detection model instead of the default model. This will perform better for math detection.
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected text
  - bbox – the axis-aligned rectangle of the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - polygon – the polygon of the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - confidence – the model's confidence in the detected text (0-1)
- vertical_lines – vertical lines detected in the document
  - bbox – the axis-aligned line coordinates
- horizontal_lines – horizontal lines detected in the document
  - bbox – the axis-aligned line coordinates
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
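The bbox field is simply the axis-aligned envelope of the polygon. A small hypothetical helper makes the relationship concrete:

```python
def polygon_to_bbox(polygon):
    # Axis-aligned (x1, y1, x2, y2) envelope of a 4-point polygon.
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]
```

For an axis-aligned polygon the two representations coincide; for a rotated line the bbox is the smallest upright rectangle containing it.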
Performance Tips
When using a GPU, setting the DETECTOR_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 9GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 2.
From Python
from PIL import Image
from surya.detection import batch_text_detection
from surya.model.detection.segformer import load_model, load_processor
image = Image.open(IMAGE_PATH)
model, processor = load_model(), load_processor()
# predictions is a list of dicts, one per image
predictions = batch_text_detection([image], model, processor)
Layout Analysis
This command will write out a JSON file using the detected layout.
surya_layout DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected layout regions
  - bbox – the axis-aligned rectangle of the region in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - polygon – the polygon of the region in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are arranged in clockwise order starting from the top left corner.
  - confidence – the model's confidence in the detected region (0-1). This value is currently not very reliable.
  - label – the label of the bbox. One of Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Figure, Section-header, Table, Text, Title.
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
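A common use of this output is pulling out all regions with a given label, such as tables. The page dict below is a hand-built illustration of the structure described above, not real model output:

```python
# Illustrative page entry following the layout results structure above.
page = {
    "bboxes": [
        {"bbox": [0, 0, 500, 40], "label": "Title", "confidence": 0.9},
        {"bbox": [0, 60, 500, 300], "label": "Table", "confidence": 0.8},
        {"bbox": [0, 320, 500, 700], "label": "Text", "confidence": 0.85},
    ],
    "page": 1,
}

# Keep only the bounding boxes of regions labeled as tables.
tables = [b["bbox"] for b in page["bboxes"] if b["label"] == "Table"]
```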
Performance Tips
When using a GPU, setting the DETECTOR_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 280MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 9GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 2.
From Python
from PIL import Image
from surya.detection import batch_text_detection
from surya.layout import batch_layout_detection
from surya.model.detection.segformer import load_model, load_processor
from surya.settings import settings
image = Image.open(IMAGE_PATH)
model = load_model(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
processor = load_processor(checkpoint=settings.LAYOUT_MODEL_CHECKPOINT)
det_model = load_model()
det_processor = load_processor()
# layout_predictions is a list of dicts, one per image
line_predictions = batch_text_detection([image], det_model, det_processor)
layout_predictions = batch_layout_detection([image], model, processor, line_predictions)
Reading Order
This command will write out a JSON file containing the detected reading order and layout.
surya_order DATA_PATH --images
- DATA_PATH can be an image, pdf, or a folder of images/pdfs
- --images will save page images and detected text lines (optional)
- --max specifies the maximum number of pages to process if you do not want to process everything
- --results_dir specifies the directory to save results instead of the default directory
The results.json file will contain a JSON dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one for each page of the input document. Each page dictionary contains:
- bboxes – the bounding boxes of detected regions
  - bbox – the axis-aligned rectangle of the region in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - position – the position of the bbox in the reading order, starting from 0
  - label – the label of the bbox. See the layout section of the documentation for a list of potential labels.
- page – the page number in the file
- image_bbox – the bbox of the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
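Since every bbox carries a position index, reconstructing the reading order is just a sort. The page dict below is hand-built illustrative data following the structure above:

```python
# Illustrative page entry following the reading order results structure above.
page = {
    "bboxes": [
        {"bbox": [0, 300, 400, 500], "position": 1, "label": "Text"},
        {"bbox": [0, 0, 400, 50], "position": 0, "label": "Title"},
        {"bbox": [0, 520, 400, 560], "position": 2, "label": "Page-footer"},
    ],
    "page": 1,
}

# Sort regions by their reading-order position.
ordered = sorted(page["bboxes"], key=lambda b: b["position"])
ordered_labels = [b["label"] for b in ordered]
```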
Performance Tips
When using a GPU, setting the ORDER_BATCH_SIZE environment variable properly will make a significant difference. Each batch item uses 360MB of VRAM, so very high batch sizes are possible. The default batch size is 32, which uses about 11GB of VRAM. Depending on your CPU core count, it may also help on CPU – the default CPU batch size is 4.
From Python
from PIL import Image
from surya.ordering import batch_ordering
from surya.model.ordering.processor import load_processor
from surya.model.ordering.model import load_model
image = Image.open(IMAGE_PATH)
# bboxes should be a list of lists with layout bboxes for the image in [x1,y1,x2,y2] format
# You can get this from the layout model, see above for usage
bboxes = [bbox1, bbox2, ...]
model = load_model()
processor = load_processor()
# order_predictions will be a list of dicts, one per image
order_predictions = batch_ordering([image], [bboxes], model, processor)
https://github.com/VikParuchuri/surya