Are you still frustrated with manually retyping text from table images in office documents? Don’t worry! Today, I’ll show you how to use Python and optical character recognition (OCR) to extract text from images automatically. Whether it’s meeting notes or report screenshots, a few lines of code are enough to handle it and boost your office efficiency!
1. Core Tool for Image Recognition: Pytesseract
Pytesseract is a Python wrapper around the Tesseract OCR engine, designed for recognizing text in images. Tesseract is an open-source engine that was originally developed at HP and was sponsored by Google for many years; it is known for good accuracy on clean, printed text.
Installing Necessary Tools
- First, ensure that you have installed the Tesseract OCR engine itself:
  - Windows users can download and install it from the Tesseract official installation page.
  - macOS users can install it via Homebrew: brew install tesseract
  - Linux users (Debian/Ubuntu) can install it via the package manager: sudo apt install tesseract-ocr
- Then install the Python libraries:
pip install pytesseract pillow
Simple Example: Recognizing Text in Images
Here’s a simple example demonstrating how to extract text from an image using Python:
from PIL import Image
import pytesseract
# On Windows, point pytesseract at the Tesseract executable (not needed if tesseract is already on your PATH)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load the image
image = Image.open("table_image.png")
# Extract text
text = pytesseract.image_to_string(image)
print("Recognized text is as follows:")
print(text)
After running the code, you will see the text content from the image printed directly!
Tip: If the recognition result is not satisfactory, it may be due to poor image quality. You can try preprocessing the image (which will be discussed later).
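The hard-coded Windows path in the example above only works for a default installation. As a rough sketch (the helper name `find_tesseract` and the fallback list are my own assumptions, not part of pytesseract), you could look for the executable on your PATH first and fall back to known locations:

```python
import shutil
from pathlib import Path

# Default install location used earlier in this post; adjust if yours differs.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(extra_candidates=()):
    """Return a path to the tesseract executable, or None if none is found.

    Hypothetical helper: checks PATH first, then a few explicit file paths.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in (WINDOWS_DEFAULT, *extra_candidates):
        if Path(candidate).is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "Tesseract not found; install it or pass its location explicitly.")
```

If the helper returns a path, you can assign it to pytesseract.pytesseract.tesseract_cmd before calling image_to_string.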
2. Improve Recognition Accuracy: Image Preprocessing Techniques
Sometimes the OCR result is poor when you feed in the raw image. Preprocessing the image before recognition often improves accuracy.
Common Preprocessing Methods
- Convert to Grayscale: Remove color information and focus on the text area.
- Binarization: Convert the image to black and white to enhance contrast.
- Remove Noise: Clean up noise in the image to reduce interference.
Example Code
import cv2
import pytesseract
from PIL import Image
# Use OpenCV to load the image
image = cv2.imread("table_image.png")
# Convert to grayscale
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Binarization
_, binary = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY)
# Save the preprocessed image
cv2.imwrite("processed_image.png", binary)
# Use Pytesseract to recognize
processed_image = Image.open("processed_image.png")
text = pytesseract.image_to_string(processed_image)
print("Optimized recognition result:")
print(text)
Tip: Image preprocessing is especially suitable for scenarios with blurry text or complex backgrounds. Give it a try!
3. Extracting Text from Tables: Using Pytesseract to Recognize Table Structures
If your image contains a table structure (like an Excel screenshot), extracting the content of each cell can be a bit more complex.
Let Pytesseract Output Table Structure
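You can obtain the position information of the recognized text using the image_to_boxes (per character) or image_to_data (per word, with confidence scores) functions. Note that Tesseract reports text positions rather than table cells, so rows and columns have to be inferred from the coordinates.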
# Get each recognized word together with its position
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
# Iterate through all text entries
for i in range(len(data["text"])):
    if data["text"][i].strip():  # Skip empty entries
        print(f"Text: {data['text'][i]}, Position: ({data['left'][i]}, {data['top'][i]})")
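Since image_to_data returns one entry per word, a common next step is to rebuild text lines from it. The sketch below groups words by the block_num, par_num and line_num fields of the dictionary and drops low-confidence words. The helper name `group_words_into_lines` and the cutoff of 40 are assumptions for illustration. It runs on a small hand-written dictionary shaped like the real output, so it does not need Tesseract installed:

```python
from collections import defaultdict


def group_words_into_lines(data, min_conf=40):
    """Rebuild text lines from an image_to_data-style dictionary.

    Returns a list of lines; each line is a list of (left, text) tuples
    sorted from left to right.
    """
    lines = defaultdict(list)
    for i, text in enumerate(data["text"]):
        # Assumption: conf may be an int, a float, or a numeric string
        # depending on the version; non-word entries carry -1.
        if not text.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        lines[key].append((data["left"][i], text))
    return [sorted(words) for _, words in sorted(lines.items())]


if __name__ == "__main__":
    # Hand-made sample mimicking pytesseract.Output.DICT (values are invented).
    sample = {
        "text": ["", "Name", "Price", "Apple", "3.50", "~"],
        "conf": [-1, 96, 95, 91, 88, 12],
        "left": [0, 10, 120, 10, 120, 200],
        "block_num": [1, 1, 1, 1, 1, 1],
        "par_num": [1, 1, 1, 1, 1, 1],
        "line_num": [0, 1, 1, 2, 2, 2],
    }
    for line in group_words_into_lines(sample):
        print(" | ".join(text for _, text in line))
```

On the sample data this prints "Name | Price" and "Apple | 3.50"; the stray "~" is dropped because its confidence is below the cutoff.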
Practical Application: Generating Excel Files
You can combine it with pandas to save the recognized text to Excel (writing .xlsx files also requires the openpyxl package: pip install openpyxl):
import pandas as pd
# Collect every non-empty recognized word (one word per row)
rows = []
for i in range(len(data["text"])):
    if data["text"][i].strip():
        rows.append(data["text"][i])
# Save to Excel
df = pd.DataFrame(rows, columns=["Content"])
df.to_excel("output.xlsx", index=False)
print("Table content has been saved to output.xlsx!")
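The code above writes every word into a single column. To get closer to a real table, words on the same line can be split into cells wherever the horizontal gap between neighbors is large. This is a heuristic sketch under my own assumptions (the name `split_line_into_cells` and the 30-pixel gap are hypothetical and will need tuning per image), not something Pytesseract provides:

```python
def split_line_into_cells(words, max_gap=30):
    """Split one text line into cells based on horizontal gaps.

    `words` is a list of (left, width, text) tuples for a single line.
    Words closer than `max_gap` pixels are joined into the same cell.
    """
    cells = []
    previous_right = None
    for left, width, text in sorted(words):
        if previous_right is not None and left - previous_right <= max_gap:
            cells[-1] += " " + text  # Small gap: continue the current cell
        else:
            cells.append(text)  # First word or large gap: start a new cell
        previous_right = left + width
    return cells


if __name__ == "__main__":
    # Invented coordinates for a two-column row: "Unit price" and "3.50".
    line = [(10, 40, "Unit"), (58, 45, "price"), (220, 35, "3.50")]
    print(split_line_into_cells(line))
```

Applying this to every line gives a list of rows that you can pass to pd.DataFrame(rows) before calling to_excel, so each cell lands in its own column.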
4. Advanced Applications: Multilingual Recognition
Tesseract supports multiple languages; just download the corresponding language pack and specify the language parameter.
Installing Language Packs
For example, for Simplified Chinese on Debian/Ubuntu:
sudo apt install tesseract-ocr-chi-sim
Using Language Packs
text = pytesseract.image_to_string(image, lang="chi_sim")
print("Chinese recognition result:")
print(text)
Note: If processing multiple languages simultaneously, you can specify multiple language packs by joining their codes with a plus sign, in the form of lang="eng+chi_sim".
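A misspelled or uninstalled language code typically makes Tesseract fail at recognition time. As a small sketch, the hypothetical helper below builds the lang string and checks it against a set of installed codes that you supply. Recent pytesseract releases offer pytesseract.get_languages() for obtaining that list, though I am assuming your installed version includes it:

```python
def build_lang_arg(wanted, installed):
    """Join language codes with '+' after checking that each is installed."""
    missing = [code for code in wanted if code not in installed]
    if missing:
        raise ValueError(f"Language data not installed: {', '.join(missing)}")
    return "+".join(wanted)


if __name__ == "__main__":
    # The installed set is hard-coded so the example runs without Tesseract.
    installed = {"eng", "chi_sim", "osd"}
    print(build_lang_arg(["eng", "chi_sim"], installed))  # eng+chi_sim
```

The returned string can be passed directly as the lang argument of image_to_string.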
5. Small Exercise: Give It a Try
- Download an image containing a table, try extracting all the text from the table and saving it as an Excel file.
- Use OpenCV to preprocess the image and compare the recognition effects before and after preprocessing.
- Try using Pytesseract to extract text content from a multilingual image.
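For the second exercise, "compare" can be made concrete by scoring each OCR result against text you typed by hand. One possible sketch uses difflib from the standard library. The helper name `ocr_similarity` and the sample strings are invented for illustration, and a ratio like this is only a rough proxy for OCR accuracy:

```python
from difflib import SequenceMatcher


def _normalize(text):
    """Collapse runs of whitespace so line breaks do not affect the score."""
    return " ".join(text.split())


def ocr_similarity(expected, recognized):
    """Return a similarity score between 0 and 1 for two strings."""
    return SequenceMatcher(None, _normalize(expected), _normalize(recognized)).ratio()


if __name__ == "__main__":
    truth = "Total revenue 2024"
    before = "T0tal rev enue 2O24"  # invented OCR output without preprocessing
    after = "Total revenue 2024\n"  # invented OCR output with preprocessing
    print(f"before: {ocr_similarity(truth, before):.2f}")
    print(f"after:  {ocr_similarity(truth, after):.2f}")
```

A higher score after preprocessing suggests the preprocessing helped on that image; try it on several images before drawing conclusions.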
Conclusion
Friends, that’s it for today’s Python session! Happy learning, and may your Python skills keep improving!