When collecting data, we often run into images whose text can be viewed but not copied. Transcribing that text by hand takes a lot of work. Take the price table of a property development as an example: how can we find the houses with lower unit prices? That is hard to judge by eye. Can we let the computer recognize the text and then analyze the numbers?
First, we need to extract the unit prices from the images.
Then we generate an image:
This can be done in Python with OCR (optical character recognition), which is widely used today. The main idea is to train a recognition model on existing images using machine learning. The specific steps are as follows:
1. Preprocess the Image for Easier Recognition by the Computer
(1) Convert the Image to Grayscale
Use the OpenCV library to process the image.
Working with a grayscale image removes distracting color information, which makes recognition more stable.
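A minimal sketch of the grayscale step, assuming OpenCV is installed as `cv2`. The function names and file paths are illustrative and do not come from the original article. `gray_value` shows the weighting OpenCV documents for `COLOR_BGR2GRAY`, and `to_grayscale` applies the conversion to a file.

```python
def gray_value(b, g, r):
    """Gray level of one BGR pixel, using the weights OpenCV documents
    for COLOR_BGR2GRAY: Y = 0.299*R + 0.587*G + 0.114*B."""
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def to_grayscale(src_path, dst_path):
    """Read an image, convert it to grayscale, and write the result."""
    import cv2  # imported here so gray_value() works without OpenCV

    image = cv2.imread(src_path)  # BGR array, or None if unreadable
    if image is None:
        raise FileNotFoundError(src_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    cv2.imwrite(dst_path, gray)
    return gray
```

For example, `to_grayscale("prices.png", "prices_gray.png")` would write a single-channel copy of a hypothetical `prices.png`.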
(2) Segment the Image
Split the image into smaller blocks to improve recognition accuracy and to make it easier to store the data in tabular form. The image is cropped into small blocks by pixel coordinates, with the crop parameters set as needed, and each block is saved in JPG format.
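The cropping could look roughly like the sketch below. It assumes the table is a regular grid of `rows` by `cols` cells, which the article does not state. A real price table may need hand-tuned coordinates instead. The function names and the `block_000.jpg` naming scheme are hypothetical.

```python
import os


def grid_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) pixel boxes, row by row."""
    boxes = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def save_blocks(gray_path, out_dir, rows, cols):
    """Crop a grayscale image into grid cells and save each cell as JPG."""
    import cv2  # imported here so grid_boxes() works without OpenCV

    gray = cv2.imread(gray_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(gray_path)
    height, width = gray.shape
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, (left, top, right, bottom) in enumerate(
            grid_boxes(width, height, rows, cols)):
        path = os.path.join(out_dir, f"block_{i:03d}.jpg")
        cv2.imwrite(path, gray[top:bottom, left:right])  # rows first, then columns
        paths.append(path)
    return paths
```

Keeping the box arithmetic separate from the file handling makes it easy to print and inspect the coordinates before any image is written.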
2. Build the Image Recognition Model
(1) Merge the Segmented Small Block Images into a TIFF File
Download <span>jTessBoxEditor</span>, open <span>jTessBoxEditor.jar</span>, and use <span>Merge TIFF</span> under the <span>Tools</span> menu to merge the images into a single <span>tiff</span> file.
(2) Perform Initial Recognition on the TIFF File Using an Existing Model
Download and install <span>tesseract</span>, then configure the environment variables: add the paths of the <span>Tesseract-OCR</span> folder and its <span>tessdata</span> folder to the PATH variable. Pretrained recognition models are available for Tesseract, such as the simplified Chinese model <span>chi_sim.traineddata</span> and the English model <span>eng.traineddata</span>. These models can be downloaded online and placed in the <span>tessdata</span> folder for use.
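Then navigate to the folder containing the TIFF file. In the command window, enter: <span>tesseract ***.tif *** -l +++ -psm 7 batch.nochop makebox</span> and press Enter to generate the <span>box</span> file. Here, <span>***</span> is the TIFF file name without its extension, and <span>+++</span> is the name of the existing <span>traineddata</span> model used for this first pass (for example <span>eng</span>). The option <span>-psm 7</span> tells Tesseract to treat each image as a single line of text; newer Tesseract releases spell it <span>--psm 7</span>.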
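If the same step needs to be repeated for several TIFF files, the command can be assembled in Python. This is a sketch only: the file name `num.font.exp0.tif` and the helper names are hypothetical, and `psm_flag` is exposed as a parameter because the spelling of the option depends on the installed Tesseract version.

```python
import os
import subprocess


def makebox_command(tiff_name, lang, psm=7, psm_flag="-psm"):
    """Build the argument list for the initial box-file recognition."""
    base = os.path.splitext(tiff_name)[0]  # output base name, no extension
    return ["tesseract", tiff_name, base,
            "-l", lang,                    # existing model, e.g. "eng"
            psm_flag, str(psm),            # 7 = treat image as one text line
            "batch.nochop", "makebox"]


def run_makebox(tiff_name, lang, **kwargs):
    """Run the command in the current folder; raises if tesseract fails."""
    subprocess.run(makebox_command(tiff_name, lang, **kwargs), check=True)
```

Calling `makebox_command("num.font.exp0.tif", "eng")` returns the list without running anything, so the command can be checked before it is executed.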
(3) Adjust the TIFF and Box Files Using <span>jTessBoxEditor</span>
Open <span>jTessBoxEditor.jar</span>, click the Open button in the <span>Box Editor</span> tab, and open the TIFF file. Correct the recognized characters and their boxes, then save; the box file is saved in the same folder as the TIFF file.
(4) Generate the Model Using TIFF and Box Files
In the folder containing the TIFF and box files, enter the following commands in the command window to generate the model (a traineddata file).
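The author's exact commands are not reproduced in this copy of the article. As a stand-in, the sketch below lists the legacy training sequence commonly documented for Tesseract 3.x, written as Python argument lists so they can be inspected before running. Treat it as an assumption, not the author's script: the `lang.font.expN` naming and the all-zero font properties are conventions from that documentation, and the helper names are hypothetical.

```python
import os
import subprocess


def font_properties_line(font):
    """One line of the font_properties file: name, then five 0/1 flags
    (italic, bold, fixed, serif, fraktur). All zero is assumed here."""
    return f"{font} 0 0 0 0 0"


def training_steps(lang, font, exp=0):
    """Return (commands, renames, combine) for the legacy training flow."""
    base = f"{lang}.{font}.exp{exp}"
    commands = [
        # Produce base.tr (training features) from the TIFF and box pair.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # Collect the character set from the corrected box file.
        ["unicharset_extractor", f"{base}.box"],
        # Shape and character-normalization training.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
    ]
    # Output files must carry the "lang." prefix before they are combined.
    renames = [(name, f"{lang}.{name}")
               for name in ("inttemp", "normproto", "pffmtable", "shapetable")]
    combine = ["combine_tessdata", f"{lang}."]  # writes lang.traineddata
    return commands, renames, combine


def run_training(lang, font, exp=0):
    """Execute every step in the current folder."""
    commands, renames, combine = training_steps(lang, font, exp)
    with open("font_properties", "w") as handle:
        handle.write(font_properties_line(font) + "\n")
    for command in commands:
        subprocess.run(command, check=True)
    for old, new in renames:
        os.replace(old, new)
    subprocess.run(combine, check=True)
```

Printing the result of `training_steps("num", "font")` gives the individual command lines, which can equally be typed into the command window one at a time.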
These commands can also be saved in a bat file and run as a script to generate the traineddata file. Finally, copy the traineddata file into the tessdata folder to use the model.
3. Apply the Image Recognition Model
After installation and model training, you can use the model in Python. Install pytesseract, find the <span>pytesseract.py</span> file, open it for editing, and in the line <span>tesseract_cmd = 'tesseract'</span> replace <span>'tesseract'</span> with the installation path of <span>tesseract</span> (e.g., <span>C:\Program Files\Tesseract-OCR\tesseract</span>).
Since the model is trained using grayscale images, grayscale images should also be used during recognition.
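A rough sketch of the recognition step follows. Instead of editing `pytesseract.py`, it sets `pytesseract.pytesseract.tesseract_cmd` at run time, which pytesseract also supports. The model name passed as `lang`, the `--psm 7` setting, and the `parse_price` helper are assumptions added for illustration. Older Tesseract 3.x releases expect `-psm 7` instead.

```python
import re


def parse_price(text):
    """Turn raw OCR text into a number, or None if it is not numeric."""
    cleaned = re.sub(r"[^0-9.]", "", text)  # drop spaces, commas, newlines
    try:
        return float(cleaned)
    except ValueError:
        return None


def recognize_block(image_path, lang, tesseract_cmd=None, config="--psm 7"):
    """Run OCR on one block, reading it as grayscale to match training."""
    import cv2          # imported here so parse_price() has no
    import pytesseract  # third-party dependencies

    if tesseract_cmd:
        # Alternative to editing pytesseract.py by hand.
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    return pytesseract.image_to_string(gray, lang=lang, config=config)
```

With a hypothetical trained model named `num`, `parse_price(recognize_block("blocks/block_000.jpg", "num"))` would give a unit price ready for sorting or filtering.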
4. Optimize the Image Recognition Model
If recognition errors occur during use, save the misrecognized images and add them to the training set to improve the model. In practice, accumulate such images over a period of time and then build a new TIFF file from them. Note: for similar errors, keep only a few labeled examples so that the training set stays small and precise.
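One way to keep the retraining set small is to group errors by kind and keep only a few examples of each. The sketch below assumes you have recorded, for each block, the correct text and the text the model produced. The tuple layout and the `per_kind` limit are illustrative choices, not part of the original workflow.

```python
from collections import defaultdict


def select_error_samples(results, per_kind=3):
    """Pick a few misrecognized images per (expected, predicted) pair.

    results: iterable of (image_path, expected_text, predicted_text).
    """
    by_kind = defaultdict(list)
    for path, expected, predicted in results:
        if expected != predicted:  # correctly recognized blocks are skipped
            by_kind[(expected, predicted)].append(path)
    selected = []
    for kind in sorted(by_kind):   # stable order across runs
        selected.extend(by_kind[kind][:per_kind])
    return selected
```

The returned paths are the images to merge into the next TIFF file for another round of box editing and training.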
Author: Yang Bing, a psychologist who writes code in a bank.