OCR Image Recognition Using Python

When collecting data, we often come across images whose text can be viewed but not copied, and extracting that text by hand takes a lot of work. For example, given the price table for a property development, how can we find the homes with lower unit prices? That is hard to judge by eye. Can we have the computer recognize the text and then analyze the numbers?

First, we need to extract the unit prices from the images.

Then we plot the extracted data as a chart.

Python can do this with OCR (optical character recognition), a widely used image recognition technique. The main idea is to train a recognition model on existing images with machine learning. The steps are as follows:

1. Preprocess the Image for Easier Recognition by the Computer

(1) Convert the Image to Grayscale

Use the OpenCV library to process the image.

Working from a grayscale image removes distracting color information and makes recognition more stable.
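
The original code for this step was not preserved, so the following is only a minimal sketch of a grayscale conversion with OpenCV. The function names `to_grayscale` and `luminance` and the file paths passed in are illustrative assumptions, not the author's code. The `luminance` helper spells out the weighted sum that OpenCV documents for its color-to-gray conversion.

```python
def luminance(b, g, r):
    """Weighted sum that OpenCV documents for color-to-gray conversion."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def to_grayscale(src_path, dst_path):
    """Read a color image, convert it to grayscale, and save the result.

    src_path and dst_path are hypothetical file names chosen by the caller.
    """
    # Imported here so the helper above can be used without OpenCV installed.
    import cv2

    image = cv2.imread(src_path)  # BGR array, or None if the file is unreadable
    if image is None:
        raise FileNotFoundError(f"cannot read image: {src_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(dst_path, gray)
    return gray
```

A call such as `to_grayscale("prices.jpg", "prices_gray.jpg")` would write the grayscale copy next to the original; both file names are placeholders.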

(2) Segment the Image

Split the image into smaller blocks. This improves recognition accuracy and makes it easy to store the results in tabular form. The crop regions are set as pixel coordinates, and each block is saved as a JPG file.
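
One way to sketch the cropping step is shown below. It assumes the table can be cut into a uniform grid of `rows` by `cols` cells, which is a simplification: a real price table usually needs offsets and per-column widths tuned by hand. The names `grid_boxes` and `crop_blocks` and the output file naming are hypothetical.

```python
import os


def grid_boxes(width, height, rows, cols):
    """Return (row, col, x0, y0, x1, y1) pixel boxes for a uniform grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = c * width // cols
            x1 = (c + 1) * width // cols
            y0 = r * height // rows
            y1 = (r + 1) * height // rows
            boxes.append((r, c, x0, y0, x1, y1))
    return boxes


def crop_blocks(gray_path, out_dir, rows, cols):
    """Crop a grayscale image into grid cells and save each cell as a JPG."""
    import cv2  # imported here so grid_boxes works without OpenCV

    image = cv2.imread(gray_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(f"cannot read image: {gray_path}")
    height, width = image.shape[:2]
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for r, c, x0, y0, x1, y1 in grid_boxes(width, height, rows, cols):
        path = os.path.join(out_dir, f"block_r{r}_c{c}.jpg")
        cv2.imwrite(path, image[y0:y1, x0:x1])  # NumPy slicing is rows, then columns
        paths.append(path)
    return paths
```

Keeping the row and column index in each file name makes it straightforward to put the recognized values back into a table later.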

2. Build the Image Recognition Model

(1) Merge the Segmented Small Block Images into a TIFF File

Download <span>jTessBoxEditor</span>, open <span>jTessBoxEditor.jar</span>, and use <span>Merge TIFF</span> under the <span>Tools</span> menu to merge the block images into a single <span>tiff</span> file.

(2) Perform Initial Recognition on the TIFF File Using an Existing Model

Download and install <span>tesseract</span>, then configure the environment variables: add the <span>Tesseract-OCR</span> installation folder to PATH so the command can be found, and make sure Tesseract can locate the <span>tessdata</span> folder (the TESSDATA_PREFIX environment variable is used for this). Tesseract provides ready-made recognition models, such as the simplified Chinese model <span>chi_sim.traineddata</span> and the English model <span>eng.traineddata</span>. These models can be downloaded online and placed in the <span>tessdata</span> folder for use.

Then navigate to the folder containing the TIFF file. In the command window, enter <span>tesseract ***.tif *** -l +++ -psm 7 batch.nochop makebox</span> and press Enter to generate the <span>box</span> file. Here, <span>***</span> is the TIFF file name, and <span>+++</span> is the name of the existing <span>traineddata</span> model used for this first pass (for example <span>eng</span>).
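
If you would rather drive this step from Python, the same command can be assembled and run with `subprocess`. This is a sketch only: `makebox_command` and `run_makebox` are hypothetical helper names, and the sample file name in the usage note is a placeholder. The `-psm` spelling follows the command above and matches Tesseract 3.x; Tesseract 4 and later spell the option `--psm`.

```python
import os
import subprocess


def makebox_command(tif_name, lang, psm=7):
    """Build the makebox command line for one TIFF file."""
    base = os.path.splitext(tif_name)[0]  # output base name: the TIFF name without .tif
    return [
        "tesseract", tif_name, base,
        "-l", lang,          # existing model used for the first recognition pass
        "-psm", str(psm),    # Tesseract 3.x spelling; 4+ uses --psm
        "batch.nochop", "makebox",
    ]


def run_makebox(tif_name, lang, psm=7):
    """Run Tesseract in the current folder to produce the box file."""
    subprocess.run(makebox_command(tif_name, lang, psm), check=True)
```

For example, `run_makebox("num.font.exp0.tif", "eng")` would write `num.font.exp0.box` next to the TIFF, provided `tesseract` is on PATH.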

(3) Adjust the TIFF and Box Files Using <span>jTessBoxEditor</span>

Open <span>jTessBoxEditor.jar</span>, click the open button in the <span>box editor</span> tab, and open the TIFF file. Correct any wrong boxes or characters, then save. The box file is written to the same folder as the TIFF file.

(4) Generate the Model Using TIFF and Box Files

In the folder containing the TIFF and box files, run Tesseract's training commands in the command window to generate the model (the traineddata file).

These commands can also be saved in a bat file and run as a script to generate the traineddata file. Finally, copy the traineddata file into the tessdata folder to use the model.
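
The author's exact training commands were not preserved in this text. As a hedged illustration, the sketch below follows the training sequence commonly documented for Tesseract 3.x (`box.train`, `unicharset_extractor`, `mftraining`, `cntraining`, `combine_tessdata`) and wraps it in Python instead of a bat file. The language name `num`, the font name `font`, and the helper names are assumptions. Newer Tesseract versions train LSTM models with different tooling, so treat this as applicable to the legacy engine only.

```python
import os
import subprocess


def training_commands(lang, font="font", exp=0):
    """Return the legacy (Tesseract 3.x) training commands, in order."""
    base = f"{lang}.{font}.exp{exp}"
    return [
        # Produce <base>.tr feature data from the TIFF and its box file.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # Collect the set of characters that appear in the box file.
        ["unicharset_extractor", f"{base}.box"],
        # Shape and prototype training.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]


def train(lang, font="font", exp=0):
    """Run the full sequence in the current folder; return the model file name."""
    # font_properties line: <font> <italic> <bold> <fixed> <serif> <fraktur>
    with open("font_properties", "w") as handle:
        handle.write(f"{font} 0 0 0 0 0\n")
    for command in training_commands(lang, font, exp):
        subprocess.run(command, check=True)
    # combine_tessdata expects every input file to carry the language prefix.
    for name in ("inttemp", "normproto", "pffmtable", "shapetable"):
        os.replace(name, f"{lang}.{name}")
    subprocess.run(["combine_tessdata", f"{lang}."], check=True)
    return f"{lang}.traineddata"
```

Running `train("num")` in the folder that holds `num.font.exp0.tif` and `num.font.exp0.box` should leave a `num.traineddata` file there, assuming the Tesseract training tools are installed and on PATH.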

3. Apply the Image Recognition Model

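After installation and model training, you can use the model in Python. Install pytesseract and tell it where Tesseract is installed. One way is to find the <span>pytesseract.py</span> file, open it for editing, and change the line <span>tesseract_cmd = 'tesseract'</span> to the installation path of <span>tesseract</span> (e.g., <span>C:\Program Files\Tesseract-OCR\tesseract</span>). Setting <span>pytesseract.pytesseract.tesseract_cmd</span> in your own script has the same effect without modifying the library.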

Since the model is trained using grayscale images, grayscale images should also be used during recognition.
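
The recognition code itself was not preserved, so here is a minimal sketch of how one cropped block might be read with pytesseract. The helper names, the `config` default, and the price-parsing rule are assumptions. The default `--psm 7` (treat the image as a single text line) uses the Tesseract 4+ spelling; pass `-psm 7` instead for Tesseract 3.x.

```python
import re


def parse_price(text):
    """Return the first number found in OCR output as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def recognize_block(image_path, lang, config="--psm 7"):
    """Run the trained model on one grayscale block and return the raw text.

    lang is the name of the traineddata file copied into tessdata,
    without the .traineddata extension.
    """
    # Imported here so parse_price can be used without these packages.
    import cv2
    import pytesseract

    # If tesseract is not on PATH, point pytesseract at it, for example:
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract"

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)  # match the training images
    if gray is None:
        raise FileNotFoundError(f"cannot read image: {image_path}")
    return pytesseract.image_to_string(gray, lang=lang, config=config)
```

Something like `parse_price(recognize_block("blocks/block_r0_c3.jpg", "num"))` would then give one numeric unit price per block; the path and model name are placeholders.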

4. Optimize the Image Recognition Model

If the model makes mistakes during use, save the misrecognized images and add them to the training set to improve the model. In practice, collect misrecognized images over a period of time and then build a new TIFF file from them. Note: for similar errors, keep only a few labeled examples so that the training set stays as small and precise as possible.

Author: Yang Bing, a psychologist who writes code in a bank.
