Recent Advances in Document Image Rectification: Introducing Transformer Framework and Polar Representation

2025

TextIn.com

In the article “Overview of Document Digital Capture and Intelligent Processing: Image Distortion Correction Technology”, we introduced the development of document image correction technology and its representative approaches. As the requirements on intelligent document processing grow, research on document image dewarping continues to explore new directions.

Today we discuss recent advances in document image rectification and share some of the directions we are following. Discussion and feedback are welcome.

Exploration of Document Correction under Transformer Architecture

Representative Work

DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction [1]

Research Findings

· A new framework, DocTr, is proposed to address geometric and illumination distortion in document images. It consists of a geometric correction Transformer and an illumination correction Transformer. Using a set of learned query embeddings, the geometric correction Transformer captures the global context of the document image and decodes a pixel-level displacement solution that corrects the geometric distortion. After geometric correction, the illumination correction Transformer removes shadow artifacts, improving visual quality and OCR accuracy.
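
To make the idea of a pixel-level displacement concrete, the following is a minimal sketch of how such a displacement could be applied to a distorted image by backward mapping with bilinear sampling. It is an illustration only and is not taken from the DocTr code base: the function names `bilinear_sample` and `unwarp`, the nested-list image format, and the convention that the displacement is stored per output pixel are all assumptions made for this example.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a fractional (x, y) position, clamping to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def unwarp(img: Image, dx: Image, dy: Image) -> Image:
    """Build the rectified image by backward mapping.

    Assumed convention: output pixel (x, y) reads its value from
    (x + dx[y][x], y + dy[y][x]) in the distorted input.
    """
    h, w = len(dx), len(dx[0])
    return [
        [bilinear_sample(img, x + dx[y][x], y + dy[y][x]) for x in range(w)]
        for y in range(h)
    ]


# Toy example: a constant displacement of +1 pixel in x shifts content left.
distorted = [[0.0, 1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0, 7.0]]
dx = [[1.0] * 4 for _ in range(2)]
dy = [[0.0] * 4 for _ in range(2)]
rectified = unwarp(distorted, dx, dy)
```

In a real pipeline the displacement would come from the network and the sampling would run on the GPU; the sketch only shows the geometric step that turns a predicted displacement into a flattened image.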

Innovative Advantages

· This is the first attempt to apply the Transformer architecture in the field of document image correction, providing a new perspective: viewing the correction process as a transformation from a “curved” state to a “flat” state.

· The global context information is captured through self-attention mechanisms, while positional encoding is combined to retain spatial structure, achieving high-quality correction results.

· The advantages of Transformers are successfully extended to the specific task of document correction, demonstrating their ability to handle long-range dependencies.

· Compared to traditional CNN models, DocTr shows stronger robustness and adaptability in certain extreme cases.
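
The combination of attention, positional encoding, and learned queries described above can be sketched in a few lines. This is a deliberately stripped-down illustration, assuming a single attention head with no learned projection matrices; the random `queries` stand in for learned query embeddings, and none of the names below come from the DocTr implementation.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_encoding(position: int, dim: int) -> Vector:
    """Standard sine/cosine positional encoding for one token position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def cross_attention(queries: Matrix, tokens: Matrix) -> Matrix:
    """Scaled dot-product attention: every query attends over all tokens
    and returns an attention-weighted sum of them (one head, no projections)."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / scale for t in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[j] for w, t in zip(weights, tokens))
                    for j in range(dim)])
    return out


rng = random.Random(0)
dim = 4
# Hypothetical image features: 6 flattened patch tokens with 4 channels each.
features = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
# Adding the positional encoding keeps each token's location recoverable.
tokens = [[f + p for f, p in zip(feat, sinusoidal_encoding(i, dim))]
          for i, feat in enumerate(features)]
# Random vectors standing in for learned query embeddings.
queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
decoded = cross_attention(queries, tokens)
```

Because every query sees every token, each decoded vector can draw on the whole image at once, which is the global-context property the bullets above refer to.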

Project Address

https://github.com/fh2019ustc/DocTr

Deep Unrestricted Document Image Rectification [2]

Research Findings

· Introduced DocTr++, a novel unified framework for document image rectification that does not impose any restrictions on the input distorted images.

· A new end-to-end framework is introduced, which considers not only the two-dimensional geometric transformation of document images but also incorporates 3D shape information for more accurate correction. This method can handle more complex non-planar document surfaces, such as book pages.

Model Improvements

Compared to the DocTr framework, the model improvements are mainly reflected in the following aspects:

· Architecture upgrade: DocTr++ adopts a hierarchical encoder-decoder architecture for multi-scale representation extraction and parsing. This structure better captures the features of document images at different scales, leading to more accurate understanding and correction of distortions.

· Redefinition of pixel mapping relationships to accommodate unrestricted document image correction: DocTr++ redefines the pixel mapping relationship between distorted document images and their undistorted counterparts. This means that DocTr++ can handle various input situations, including distorted images with complete document boundaries, partial document boundaries, and no document boundaries.
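
As a rough illustration of what "hierarchical" and "multi-scale" mean here, the toy sketch below builds a feature pyramid and decodes it coarse to fine with skip connections. It assumes plain 2x2 average pooling and nearest-neighbour upsampling in place of the learned Transformer blocks, so it shows only the data flow, not the DocTr++ model; all function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def downsample(g: Grid) -> Grid:
    """Halve resolution with 2x2 average pooling (even dimensions assumed)."""
    return [
        [(g[y][x] + g[y][x + 1] + g[y + 1][x] + g[y + 1][x + 1]) / 4.0
         for x in range(0, len(g[0]), 2)]
        for y in range(0, len(g), 2)
    ]


def upsample(g: Grid) -> Grid:
    """Double resolution with nearest-neighbour repetition."""
    out = []
    for row in g:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode(g: Grid, levels: int) -> List[Grid]:
    """Encoder side: return the input plus progressively coarser versions."""
    pyramid = [g]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def decode(pyramid: List[Grid]) -> Grid:
    """Decoder side: start at the coarsest scale, upsample, and fuse with
    the encoder feature of the next finer scale (here: simple averaging)."""
    result = pyramid[-1]
    for skip in reversed(pyramid[:-1]):
        up = upsample(result)
        result = [[(a + b) / 2.0 for a, b in zip(r_up, r_skip)]
                  for r_up, r_skip in zip(up, skip)]
    return result


image = [[float(x + y) for x in range(4)] for y in range(4)]
pyramid = encode(image, levels=3)   # resolutions 4x4, 2x2, 1x1
restored = decode(pyramid)          # back to 4x4
```

The coarse levels summarize the overall page shape while the fine levels keep local detail, which is the intuition behind using several scales for distortion estimation.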

Project Address

https://github.com/fh2019ustc/DocTr-Plus

Document Image Correction Based on Polar Representation

Representative Work

Polar-Doc: One-Stage Document Dewarping with Multi-Scope Constraints under Polar Representation [3]

Research Findings

· Explored the application of polar representation in document de-distortion, proposing the Polar-Doc model. Unlike the two-stage processes adopted by most current works, polar representation allows the segmentation and de-distortion networks to be unified in a single-stage point regression framework. This unification enables the entire model to learn more efficiently under an end-to-end optimization process, achieving a compact representation.

· Proposed a novel multi-scope Polar-Doc-IOU loss function that acts as a grid-based regularization under polar coordinates. It constrains the relationships between control points, improves learning effectiveness, and achieves better dewarping performance.
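
The sketch below illustrates why an IoU-style loss becomes simple under a polar representation. It converts contour points to (radius, angle) pairs about a center and computes the PolarMask-style approximation, in which the IoU of two contours sampled at the same angles is the sum of the smaller radii divided by the sum of the larger radii. This simpler formulation is used here as an assumption for illustration; it is not the multi-scope Polar-Doc-IOU loss itself, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def to_polar(points: List[Point], center: Point) -> List[Tuple[float, float]]:
    """Convert Cartesian contour points to (radius, angle) about a center."""
    cx, cy = center
    return [(math.hypot(x - cx, y - cy), math.atan2(y - cy, x - cx))
            for x, y in points]


def polar_iou(pred_radii: List[float], gt_radii: List[float]) -> float:
    """PolarMask-style IoU approximation for radii sampled at identical
    angles: sum of per-angle minima over sum of per-angle maxima."""
    num = sum(min(p, g) for p, g in zip(pred_radii, gt_radii))
    den = sum(max(p, g) for p, g in zip(pred_radii, gt_radii))
    return num / den


def polar_iou_loss(pred_radii: List[float], gt_radii: List[float]) -> float:
    """Loss is the negative log of the approximate IoU (0 when identical)."""
    return -math.log(polar_iou(pred_radii, gt_radii))


# Four contour points of a square, expressed relative to its center.
corners = [(1.0, 0.0), (0.0, 2.0), (-1.0, 0.0), (0.0, -2.0)]
polar = to_polar(corners, center=(0.0, 0.0))

gt = [2.0, 2.0, 2.0, 2.0]
pred = [1.0, 2.0, 2.0, 4.0]
iou = polar_iou(pred, gt)
loss = polar_iou_loss(pred, gt)
```

With one radius per angle, the overlap measure reduces to element-wise minima and maxima, so no polygon intersection has to be computed.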

Innovative Advantages

· This is the first exploration of polar representation in document de-distortion, making the representation of document contours more flexible and the calculation of IOU loss more efficient.

· The proposed single-stage model unifies the segmentation and de-distortion tasks within a joint regression framework, achieving advanced model performance with fewer parameters.

Attention Mechanism Enhanced Control Point Prediction

Representative Work

DocReal: Robust Document Dewarping of Real-Life Images via Attention-Enhanced Control Point Prediction [4]

Research Findings

· Designed a dual network (Enet + AECP), where Enet is responsible for preliminary edge detection and rough correction, while AECP accurately locates control points by introducing attention mechanisms, achieving more precise local deformation correction.

· Enhanced the training data by synthesizing 2D images with 3D deformations and additional deformation types. The work also provides a more comprehensive benchmark of 200 distorted Chinese images covering more real-life scenarios.
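
Control-point methods predict a sparse grid of points and then expand it into a dense pixel mapping. The following sketch shows one simple way to do that expansion, assuming a regular grid of control points and plain bilinear interpolation between them. The interpolation scheme and the function name `dense_map_from_control_points` are assumptions for illustration and may differ from what DocReal actually uses (thin-plate splines are a common alternative).

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dense_map_from_control_points(
    ctrl: List[List[Point]], out_h: int, out_w: int
) -> List[List[Point]]:
    """Bilinearly interpolate a regular grid of control points into a
    per-pixel map.

    ctrl[i][j] is the source (x, y) in the distorted image for the grid
    node at row i, column j of the flat output. Requires at least a 2x2
    grid and an output of at least 2x2 pixels.
    """
    rows, cols = len(ctrl), len(ctrl[0])
    dense = []
    for y in range(out_h):
        gy = y * (rows - 1) / (out_h - 1)      # position in grid units
        i0 = min(int(gy), rows - 2)            # top row of the enclosing cell
        fy = gy - i0
        row = []
        for x in range(out_w):
            gx = x * (cols - 1) / (out_w - 1)
            j0 = min(int(gx), cols - 2)        # left column of the cell
            fx = gx - j0
            p00, p01 = ctrl[i0][j0], ctrl[i0][j0 + 1]
            p10, p11 = ctrl[i0 + 1][j0], ctrl[i0 + 1][j0 + 1]
            row.append(tuple(
                (p00[k] * (1 - fx) + p01[k] * fx) * (1 - fy)
                + (p10[k] * (1 - fx) + p11[k] * fx) * fy
                for k in range(2)
            ))
        dense.append(row)
    return dense


# A 2x2 grid describing a sheared page: the bottom edge is shifted right by 1.
ctrl = [[(0.0, 0.0), (4.0, 0.0)],
        [(1.0, 4.0), (5.0, 4.0)]]
dense = dense_map_from_control_points(ctrl, out_h=3, out_w=3)
```

Each entry of `dense` says where the corresponding output pixel should be sampled from in the distorted image, so more accurate control points translate directly into a more accurate local correction.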

Innovative Advantages

· By combining Enet and AECP modules, background noise is effectively removed, and readability is improved under different environmental conditions and text types, maintaining high output stability under various lighting conditions.

· The proposed 3D deformation synthesis method provides realistic and diverse deformations for training data, significantly enhancing the robustness of the model.

Hehe Information’s Image Correction System

Hehe Information has launched a high-performance document image correction system that effectively corrects complex backgrounds and various types of real scene distorted images, providing more easily processed input images for document recognition and analysis.

The system link is: https://www.textin.com/market/detail/crop_enhance_image

Summary

From the early use of purely geometric methods to the current combination of deep learning with geometric priors, illumination modeling, multimodal perception, and various other approaches, document image curvature correction (de-distortion) technology is becoming increasingly mature. The new generation of methods not only continuously improves the accuracy of curvature correction but also pays more attention to deployment efficiency and robustness in real mobile scenarios.

As more public datasets emerge and computer vision technology rapidly iterates, document correction technology will gradually move towards a stage that is more precise, robust, and user-friendly, providing important support for subsequent applications such as document analysis and information extraction.

References

[1] Hao Feng, Yuechen Wang, Wengang Zhou, Jiajun Deng, Houqiang Li. “DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction.” In Proceedings of the 29th ACM International Conference on Multimedia (MM ’21), October 20–24, 2021, Virtual Event, China.

[2] Hao Feng, Shaokai Liu, Jiajun Deng, Wengang Zhou, Houqiang Li. “Deep Unrestricted Document Image Rectification.” arXiv preprint arXiv:2304.08796, 2023.

[3] Weiguang Zhang, Qiufeng Wang, Kaizhu Huang. “Polar-Doc: One-Stage Document Dewarping with Multi-Scope Constraints under Polar Representation.” arXiv preprint arXiv:2312.07925, 2023.

[4] Fangchen Yu, Yina Xie, Lei Wu, Yafei Wen, Guozhi Wang, Shuai Ren, Xiaoxin Chen, Jianfeng Mao, Wenye Li. “DocReal: Robust Document Dewarping of Real-Life Images via Attention-Enhanced Control Point Prediction.” In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024.
