As organizations around the world digitize their operations, converting physical documents into digital formats has become commonplace. This is usually done through Optical Character Recognition (OCR), in which images of text (scanned physical documents) are converted into machine-encoded text by one of several mature text recognition algorithms. Document OCR performs best on printed text against a clean background, with consistent paragraph structure and font sizes.
In practice, this scenario is far from the norm. On invoices, forms, and even identification documents, information is scattered across the page, which makes extracting the relevant data more complex.
In this article, we will explore a simple method for defining document image regions for OCR using Python. Our example is a document whose information is scattered across the page: a passport. The sample passport below is placed on a white background, simulating a photocopied passport.
From this passport image, we hope to obtain the following fields:
First name
Last name (surname)
Chinese name
Passport number
First, we will import all the necessary packages. The most important packages are OpenCV for computer vision operations and PyTesseract, which is a Python wrapper for the powerful Tesseract OCR engine.
import cv2
import pytesseract
import pandas as pd
import numpy as np
import math
from matplotlib import pyplot as plt
Next, we will read our passport image using cv2.imread. Our first task is to extract the actual passport document area from this pseudo-scanned page. We will achieve this by detecting the edges of the passport and cropping it from the image.
img = cv2.imread('images/Passport.png', 0)
img_copy = img.copy()
img_canny = cv2.Canny(img_copy, 50, 100, apertureSize = 3)
The Canny algorithm included in the OpenCV library uses a multi-stage process to detect edges in an image. The three parameters after the image are the lower and upper hysteresis thresholds (minVal and maxVal, respectively) and the aperture size of the Sobel kernel used internally.
Running the Canny algorithm produces the following output. Note that with the thresholds chosen here, only a minimal set of edges is retained.
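To inspect the edge map yourself, a quick matplotlib display works; this is a minimal sketch using the imports above.
# Show the Canny edge map to sanity-check the thresholds
plt.imshow(img_canny, cmap='gray')
plt.axis('off')
plt.show()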
img_hough = cv2.HoughLinesP(img_canny, 1, math.pi / 180, 100, minLineLength = 100, maxLineGap = 10)
Next, we apply another algorithm, the Hough transform, to the edge-detected image to trace the outline of the passport area by detecting lines. The minLineLength parameter sets the minimum length, in pixels, that a segment must have to be reported as a line, while maxLineGap sets the maximum gap between points that can still be bridged into a single line segment.
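To see what HoughLinesP found, you can draw each detected segment on a color copy of the page; a minimal sketch, assuming img_hough is not None:
# Draw every detected line segment in red for inspection
img_lines = cv2.cvtColor(img_copy, cv2.COLOR_GRAY2BGR)
for line in img_hough:
    x1, y1, x2, y2 = line[0]
    cv2.line(img_lines, (x1, y1), (x2, y2), (0, 0, 255), 2)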
# Bounding box that encloses the endpoints of all detected lines
mins = np.amin(img_hough, axis=0)[0]
maxs = np.amax(img_hough, axis=0)[0]
(x, y, w, h) = (mins[0], mins[1], maxs[0] - mins[0], maxs[1] - mins[1])
img_roi = img_copy[y:y+h,x:x+w]
Our passport has straight lines on all four sides—the edges of the document. Thus, with our line information, we can choose to crop our passport area by the outer edges of the detected lines:
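Before cropping, it can help to draw the computed box on the page as a sanity check; a minimal sketch:
# Outline the computed passport bounding box in green
img_check = cv2.cvtColor(img_copy, cv2.COLOR_GRAY2BGR)
cv2.rectangle(img_check, (x, y), (x + w, y + h), (0, 255, 0), 3)
plt.imshow(cv2.cvtColor(img_check, cv2.COLOR_BGR2RGB))
plt.show()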
After rotating the passport upright, we can start selecting the regions of the image from which to capture data. Almost all international passports comply with ICAO standards, which specify the design and layout of passport pages. One of these specifications is the Machine Readable Zone (MRZ): the two lines of machine-readable text at the bottom of the passport page. The MRZ duplicates most of the key information found in the document's Visual Inspection Zone (VIZ) in a form designed to be read by machines. In our exercise, that machine is our trusty Tesseract engine.
img_roi = cv2.rotate(img_roi, cv2.ROTATE_90_COUNTERCLOCKWISE)
(height, width) = img_roi.shape
img_roi_copy = img_roi.copy()
dim_mrz = (x, y, w, h) = (1, round(height*0.9), width-3, round(height-(height*0.9))-2)
img_roi_copy = cv2.rectangle(img_roi_copy, (x, y), (x + w ,y + h),(0,0,0),2)
Let’s define the MRZ area in the passport image using four dimensions: horizontal offset (from the left), vertical offset (from the top), width, and height. For the MRZ, we will assume it is contained within the bottom 10% of our passport. Therefore, using OpenCV’s rectangle function, we can draw a box around the area to validate our size selection.
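Displaying the outlined region confirms (or refutes) the bottom-10% assumption; a quick sketch:
# Show the passport with the candidate MRZ region outlined
plt.imshow(img_roi_copy, cmap='gray')
plt.show()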
img_mrz = img_roi[y:y+h, x:x+w]
img_mrz = cv2.GaussianBlur(img_mrz, (3, 3), 0)
ret, img_mrz = cv2.threshold(img_mrz,127,255,cv2.THRESH_TOZERO)
In the new image, we crop the selected area. We will perform some basic image preprocessing on the cropped image to facilitate better reading—Gaussian blur and simple thresholding.
mrz = pytesseract.image_to_string(img_mrz, config = '--psm 12')
We are now ready to apply OCR. In our image_to_string call, we set the page segmentation mode (--psm 12) to "sparse text with orientation and script detection (OSD)", which is designed to capture all available text in the image.
Comparing the Pytesseract output with our original passport image, we can see some errors in reading special characters. For more accurate reads, Pytesseract's whitelist configuration could be tuned; for our purposes, however, the current reading accuracy is sufficient.
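If you want to try that, one option is to whitelist the MRZ character set (uppercase A to Z, digits, and the '<' filler). A minimal sketch, assuming a Tesseract build that honors the whitelist (the 4.0 LSTM engine ignored it; 4.1+ restored support):
# Restrict recognition to MRZ characters; mrz_config is illustrative
mrz_config = '--psm 12 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789<'
mrz_raw = pytesseract.image_to_string(img_mrz, config=mrz_config)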
# Keep only lines long enough to be MRZ lines
mrz = [line for line in mrz.split('\n') if len(line) > 10]
# MRZ line 1: 'P<' + 3-letter issuing country + SURNAME<<GIVEN<NAMES
if mrz[0][0:2] == 'P<':
    lastname = mrz[0].split('<')[1][3:]  # drop the 3-letter country code
else:
    lastname = mrz[0].split('<')[0][5:]  # fallback if 'P<' was misread
# Given name: second non-empty chunk between '<' separators
firstname = [i for i in mrz[0].split('<') if not i.isspace() and len(i) > 0][1]
pp_no = mrz[1][:9]  # passport number: first 9 chars of MRZ line 2
By applying some string operations based on ICAO guidelines regarding the structure of MRZ codes, we can extract the passport holder’s last name, first name, and passport number:
What if the text is not in English? No problem: Tesseract has trained models for more than 100 languages (although the robustness of OCR performance varies across the supported languages).
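To see which language models your local installation actually has, recent versions of pytesseract can list them (a quick sketch; get_languages wraps tesseract --list-langs):
# List the traineddata languages available to the local Tesseract install
print(pytesseract.get_languages(config=''))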
img_roi_copy = img_roi.copy()
dim_lastname_chi = (x, y, w, h) = (455, 1210, 120, 70)
img_lastname_chi = img_roi[y:y+h, x:x+w]
img_lastname_chi = cv2.GaussianBlur(img_lastname_chi, (3,3), 0)
ret, img_lastname_chi = cv2.threshold(img_lastname_chi,127,255,cv2.THRESH_TOZERO)
dim_firstname_chi = (x, y, w, h) = (455, 1300, 120, 70)
img_firstname_chi = img_roi[y:y+h, x:x+w]
img_firstname_chi = cv2.GaussianBlur(img_firstname_chi, (3,3), 0)
ret, img_firstname_chi = cv2.threshold(img_firstname_chi,127,255,cv2.THRESH_TOZERO)
Using the same area selection method, we define dimensions (x, y, w, h) for the target data fields again and apply blurring and threshold processing to the cropped images.
lastname_chi = pytesseract.image_to_string(img_lastname_chi, lang = 'chi_sim', config = '--psm 7')
firstname_chi = pytesseract.image_to_string(img_firstname_chi, lang = 'chi_sim', config = '--psm 7')
Now, in our image_to_string parameters, we also pass the language of the input text via lang='chi_sim' (Simplified Chinese).
To complete the exercise, we collect all the extracted fields into a dictionary and output them as a table for practical use.
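A minimal sketch of that last step; the field names below are illustrative, not from the original:
# Collect the extracted fields into one table row
fields = {'First Name': firstname,
          'Last Name': lastname,
          'Chinese Last Name': lastname_chi.strip(),
          'Chinese First Name': firstname_chi.strip(),
          'Passport No.': pp_no}
df = pd.DataFrame([fields])
print(df)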
Explicitly defining regions of interest is just one of many ways to pull the desired data out of OCR. Depending on your use case, other methods such as contour analysis or object detection may be more effective. As our passport exercise showed, proper preprocessing of the image before applying OCR is crucial. When dealing with real documents of varying image quality, it is important to experiment with different preprocessing techniques to find the best approach for your document type.