PP-ChatOCRv3: Enhanced Accuracy and Fine-Tuning for Text Image Analysis


Intelligent analysis of text images plays a crucial role in improving the efficiency and accuracy of document processing, making information more accessible and usable, supporting digital transformation across industries, and handling the diversity and complexity of document images. It is widely applied in office automation, financial risk control, healthcare, legal services, and education.

Recently, with the support of Wenxin Yiyan, the PaddleX low-code development tool launched a more feature-rich and powerful model production line for intelligent text image analysis: Document Scene Information Extraction v3 (PP-ChatOCRv3-doc), which helps developers solve document processing challenges.

PP-ChatOCRv3 online experience address: https://aistudio.baidu.com/community/app/182491/webUI

PP-ChatOCRv3 model production line address: https://aistudio.baidu.com/pipeline/mine

PP-ChatOCRv3 effect overview


PP-ChatOCRv3 core highlights

(1) Higher accuracy of the general model: Layout analysis of text images is significantly improved and the language understanding strengths of Wenxin Yiyan are fully leveraged, for an overall improvement of 6% over the previous version;

(2) Stronger fine-tuning capability of the vertical model: Fine-tuning is provided for text recognition models based on large-scale data fusion and for high-precision layout area positioning models, significantly improving the performance of vertical models.

Below, we elaborate on these core highlights.

Higher accuracy of the general model

The system process of PP-ChatOCRv3 is shown in the diagram below. First, the image to be predicted is fed into the general layout analysis system, which performs layout analysis and predicts the text information and table structure in the image. The layout categories, text, and table structures predicted by the layout analysis system are then matched against the Query by vector search to retrieve the relevant text information, which is sent to the Prompt generator to be recombined into a prompt. Drawing on the massive data and knowledge fusion of the Wenxin large language model, this yields high accuracy in information extraction and broad applicability.

The layout analysis system integrates image correction (optional), layout area positioning, regular text detection, seal text detection, text recognition, and table recognition, and supports high-precision real-time prediction on CPU/GPU. The fusion strategy of small and large models lets each part play to its strengths: the small models stand out for high-precision image processing, while the large model demonstrates excellent content understanding.

In this upgrade, new image correction and seal text detection modules have been added.
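As a rough illustration of the retrieval and prompt steps in the system process described above, the following Python sketch ranks layout blocks against a query and assembles a prompt. It is not the PaddleX implementation: the function names (`embed`, `cosine`, `retrieve`, `build_prompt`), the block format, and the toy bag-of-words similarity are all assumptions made for this example, standing in for a real embedding model and the actual Prompt generator.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (token counts); a stand-in for a real text encoder."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(blocks, query, top_k=2):
    """Return the top_k layout blocks whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(blocks, key=lambda b: cosine(embed(b["text"]), q), reverse=True)
    return ranked[:top_k]


def build_prompt(query, hits):
    """Recombine the retrieved blocks into one prompt for the language model."""
    context = "\n".join(f"[{h['category']}] {h['text']}" for h in hits)
    return f"Context:\n{context}\n\nExtract the following field: {query}"


# Hypothetical output of the layout analysis stage (format assumed for this sketch).
blocks = [
    {"category": "text", "text": "Invoice number 20240912"},
    {"category": "table", "text": "Item Quantity Price"},
    {"category": "seal", "text": "Finance Department Seal"},
]
hits = retrieve(blocks, "invoice number", top_k=1)
prompt = build_prompt("invoice number", hits)
print(prompt)
```

The resulting prompt would then be passed to the large language model. Keeping the layout category next to each retrieved block is one way to let the model distinguish body text, table content, and seal text.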
The image correction module addresses the wrinkles, distortion, and tilt that appear in text images captured in complex and variable shooting environments. The module automatically detects and corrects geometric distortions in the image, providing high-quality input for subsequent text recognition. This function is optional; users who need correction can integrate the official correction inference model.

The seal text detection module is an important supplement to the document layout analysis capability. Seals are important components of documents and often carry critical authentication information. Using curved text detection models and fine-grained post-processing, the new module accurately locates seal areas and extracts the text on them, providing important evidence for document verification and contract analysis.

In addition to these new functions, the layout area positioning and table recognition models have also been upgraded. The layout analysis model now supports the positioning of images, tables, and seals, allowing for more granular parsing of document areas than PP-ChatOCRv2. With the help of generated data, the table recognition model performs better on complex tables, such as borderless ("wireless") tables and tables with merged cells.

Based on the enhanced layout analysis capability, combined with Wenxin Yiyan, the overall effect of information extraction has improved by at least 6% compared to the previous version.

Stronger fine-tuning capability of the vertical model

Fine-tuning of text recognition models based on large-scale data fusion

Training on vertical scene data commonly degrades a model's general text recognition capability. To address this, the upgrade integrates data fusion into the fine-tuning of OCR text recognition models. The core of the technique is a fusion mechanism that automatically mixes a certain proportion of general scene text recognition data into the vertical training data. This design optimizes the model's recognition accuracy in the vertical scene while maintaining its general scene recognition capability, balancing specialization and generalization.

When training a text recognition model, users only need to set the data fusion ratio in the parameter configuration interface to access the large-scale general text recognition data preset by the official team. Data fusion fine-tuning therefore improves recognition accuracy on the specific vertical task while preserving the model's general scene text recognition capability.
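The idea of mixing general data into vertical training data can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official fusion mechanism: the function name `fuse_datasets` is hypothetical, and the ratio is assumed here to mean the number of general samples added per vertical sample, which the article does not specify.

```python
import random


def fuse_datasets(vertical, general, fusion_ratio, seed=0):
    """Mix general-scene samples into vertical training data.

    Assumption for this sketch: fusion_ratio is the number of general
    samples added per vertical sample (0.5 adds one general sample for
    every two vertical samples), capped by the size of the general pool.
    """
    if fusion_ratio < 0:
        raise ValueError("fusion_ratio must be non-negative")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_general = min(len(general), round(len(vertical) * fusion_ratio))
    mixed = list(vertical) + rng.sample(general, n_general)
    rng.shuffle(mixed)  # interleave so batches contain both kinds of data
    return mixed


# Hypothetical (image path, label) pairs.
vertical = [(f"receipt_{i}.jpg", f"label_v{i}") for i in range(4)]
general = [(f"general_{i}.jpg", f"label_g{i}") for i in range(10)]

train_set = fuse_datasets(vertical, general, fusion_ratio=0.5)
print(len(train_set))
```

A ratio of 0 reduces to plain vertical fine-tuning, while larger values trade some vertical specialization for better retention of general recognition ability.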

High-precision layout area positioning model fine-tuning

In this upgrade, the layout area positioning model is improved in two ways. First, a higher-precision RT-DETR-H_layout model is provided, which improves accuracy by over 5% compared to SOTA solutions on a public Chinese layout area positioning dataset. Second, a higher-precision pre-trained model is provided, resulting in higher accuracy and faster convergence when fine-tuning on vertical scenes.

Exciting Course Preview

To help you quickly and thoroughly understand the PP-ChatOCRv3 model production line for intelligent text image analysis and master hands-on skills, senior R&D engineers from Baidu will give a detailed walkthrough of text image intelligent analysis scene tasks and the new development paradigm on September 12 (Thursday) at 19:00. We will also open a zero-code hands-on camp for industry scenarios built on PP-ChatOCRv3 tasks, guiding you step by step through the complete development process: data preparation, data validation, model training, performance optimization, and model deployment. Developers who register for the camp get free access, for a limited time, to computing power for PP-ChatOCRv3 zero-code production line training and evaluation. Scan the QR code below to make an appointment!
