Optical Character Recognition (OCR) is a technology that converts the textual information in images into character data that computers can process. It functions as the “eyes” of computers and serves as an important technological foundation for visual interaction between machines and the real world. The origins of OCR can be traced back to around 1870, when the advent of telegraph technology and of reading devices for the visually impaired heralded its birth. In recent years, as artificial intelligence technology has been put to practical use in OCR, the performance and efficiency of OCR have improved greatly. Today, AI-based OCR is widely applied in fields such as finance, transportation, government, justice, and healthcare, and has become part of every aspect of production and life.
Archive OCR refers to the process of using OCR technology to recognize the character shapes in the image files of digital copies of paper archives, convert them into text, and output and present that text. Applying AI technology to archive OCR is of great significance for improving work efficiency and accuracy, accelerating the realization of automated cataloging, full-text retrieval, data analysis, and other system functions, and promoting the transformation of archive information resource construction from digitization to datafication.
Current Status of Archive OCR Work
Since 2013, as the National Archives Administration has vigorously implemented the national strategy of “stock digitization and incremental electronicization,” a large number of digital copies of paper archives have been produced. The digitization of existing holdings by archives (and archive rooms) at all levels across the country has achieved remarkable results, with the proportion of digitized holdings rising significantly, and many archive departments have completed the digitization of their entire collections. By the end of 2019, the total volume of digital copies of archives in comprehensive archives nationwide had reached 14.078 million GB.
Currently, archive OCR work has been launched in full, and relevant standards and specifications have been introduced in a timely manner. Some regional archive departments started archive OCR work after completing the digitization of their paper archives, while others have carried out archive OCR alongside digitization itself. To standardize this work, the National Archives Administration issued the “Optical Character Recognition (OCR) Work Specification for Digital Copies of Paper Archives” in December 2019, which stipulates the organization, implementation, and management requirements for OCR work on digital copies of paper archives and defines the overall principles, work processes, and quality requirements for archive OCR work. On this basis, archive departments have achieved significant results, and archive OCR will be integrated ever more broadly and deeply into archive work in the future.
Limitations of Traditional OCR
Before the widespread application of artificial intelligence technology, automated text recognition was a daunting and pressing problem. Traditional OCR is based on the basic shapes of characters: it performs statistical analysis of the differences between character shapes, identifies a set of optimal statistical parameters that best represent those differences, and uses these parameters to select and recognize characters.
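As a rough illustration of this shape-statistics approach (a minimal sketch, not any particular engine's actual algorithm), the following Python fragment extracts hand-crafted zoning features from a binarized character image and matches them against stored per-character prototypes; the grid size and the prototype dictionary are assumptions made for the example.

```python
import numpy as np

def zoning_features(glyph: np.ndarray, grid: int = 4) -> np.ndarray:
    """Split a binary glyph image into grid x grid zones and use the
    ink density of each zone as a simple hand-crafted shape feature."""
    h, w = glyph.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            zone = glyph[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            feats.append(zone.mean())
    return np.array(feats)

def recognize(glyph: np.ndarray, prototypes: dict) -> str:
    """Pick the character whose stored prototype feature vector is
    closest to the features of the input glyph (nearest-prototype matching)."""
    f = zoning_features(glyph)
    return min(prototypes, key=lambda ch: np.linalg.norm(f - prototypes[ch]))
```

The point of the sketch is only that every design decision here (the grid size, the density feature, the distance measure) is fixed by hand, which is precisely where the limitations discussed below arise.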
The traditional OCR workflow includes steps such as image import, image preprocessing, layout analysis, text segmentation, and text recognition. Over the years, extensive optimization research has been conducted on this workflow. However, owing to the complexity of the pipeline and the limited expressiveness of manually designed features, traditional text detection and recognition methods often yield unsatisfactory results on more complex images, such as those that are distorted or blurred. The limitations of traditional OCR in recognizing Chinese characters are mainly reflected in the following four aspects.
First, the traditional OCR workflow involves too many steps, most of them serial, so errors are continuously amplified. For instance, even if each step has a seemingly high accuracy of 90%, after five serial steps the accumulated error leaves an end-to-end accuracy of only about 59%, which is usually unacceptable.
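The compounding effect is easy to quantify, assuming for simplicity that the stages fail independently:

```python
stage_accuracy = 0.90   # per-stage accuracy, assumed independent
for stages in range(1, 6):
    print(stages, round(stage_accuracy ** stages, 3))
# 1 0.9
# 2 0.81
# 3 0.729
# 4 0.656
# 5 0.59   -> roughly 59% end-to-end accuracy after five serial steps
```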
Second, the traditional OCR workflow involves a considerable amount of manual design, which does not necessarily capture the essence of the problem. For example, during binarization preprocessing, the binarization threshold can be difficult to tune in some cases: because such a model has low complexity and cannot adequately fit all of the data, much useful information ends up being filtered out during actual processing.
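As a concrete illustration of this thresholding difficulty, the OpenCV sketch below contrasts a single global (Otsu) threshold with a locally adaptive one; the file name and the block-size and offset parameters are placeholders chosen for the example.

```python
import cv2

# Read a scanned page in grayscale (the path is a placeholder).
gray = cv2.imread("archive_page.png", cv2.IMREAD_GRAYSCALE)

# A single global (Otsu) threshold is fast, but on unevenly lit or stained
# scans one cut-off value cannot suit every region, so faint strokes on the
# darker parts of the page are discarded along with the background.
_, global_bin = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# A locally adaptive threshold keeps more of that information by choosing a
# separate cut-off for each neighborhood, at the cost of extra tuning: the
# block size and offset below are themselves hand-picked parameters.
adaptive_bin = cv2.adaptiveThreshold(gray, 255,
                                     cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 31, 10)
```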
Third, traditional OCR generally fails when the background is even slightly complex or when variant character forms appear, reflecting the weak adaptability of its processing models. Its layout-analysis and line-segmentation methods can handle only relatively simple scenarios; once faced with complex layouts, accurate processing becomes difficult.
Fourth, in single-character recognition, traditional OCR cannot take the semantic associations of the surrounding context into account. To compensate, traditional OCR combines various techniques, such as running a dynamic path search over the recognition results. During path optimization, the visual features of the characters often have to be combined with language models, which introduces heavy coupling and causes a large number of algorithms to accumulate in the recognition system. Even so, traditional OCR still leaves many problems unresolved, such as the difficulty of segmenting handwriting in which many strokes are connected.
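A much-simplified sketch of such a path search (a toy beam search, not any production system) shows how per-character visual scores and a language model end up coupled on every candidate path; the candidate format, bigram function, and weighting below are assumptions made for the example.

```python
def beam_search(char_candidates, bigram_logprob, beam_width=3, lm_weight=0.5):
    """Toy path search over per-position character candidates.

    char_candidates: list of dicts, one per character position, mapping a
        candidate character to its visual log-probability.
    bigram_logprob:  function (prev_char, char) -> language-model log-prob.
    The visual score and the language-model score must be combined on every
    path, which is exactly the kind of coupling described above.
    """
    beams = [("", 0.0)]
    for candidates in char_candidates:
        scored = []
        for prefix, score in beams:
            for ch, visual_lp in candidates.items():
                lm_lp = bigram_logprob(prefix[-1], ch) if prefix else 0.0
                scored.append((prefix + ch, score + visual_lp + lm_weight * lm_lp))
        # Keep only the highest-scoring partial paths.
        beams = sorted(scored, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]
```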
These limitations result in a relatively low recognition rate and longer recognition time for traditional OCR.
In recent years, with the practical use of artificial intelligence technologies such as computer vision, natural language understanding, and knowledge graphs in OCR, the performance and efficiency of OCR have improved greatly. Driven by the adaptive learning of deep models, AI-based OCR can better address many of the problems of traditional OCR, simplify the preprocessing and parameter-tuning workflow, achieve end-to-end processing, and improve recognition rates. Currently, the recognition rate of AI-based OCR for simplified printed text exceeds 98%.
AI OCR can also be applied in diverse and complex recognition scenarios. For instance, it can handle text of different sizes, fonts, colors, brightness, and contrast; text arranged and aligned in different ways; and images in which non-text areas share similar textures with text areas, including low-contrast, blurry, broken, and incomplete text. AI OCR can therefore be applied not only to document recognition but also to recognizing text in natural-scene images. In addition, AI OCR can improve work efficiency and save substantial costs.
Based on this, applying AI OCR in archive work is of significant importance and meaning, and it will undoubtedly become an important foundation supporting the digital transformation, intelligent upgrading, and integrated innovation of the archive industry.
AI OCR Workflow
The AI OCR workflow mainly includes image input, text detection, text recognition, manual confirmation, and manual intervention.
First, the digital copies of paper archives that need to be recognized are imported into the OCR system individually or in batches.
Second, text detection is performed. Text detection locates the text in the digital image and annotates its position. Text detection methods mainly include candidate-box-based detection, semantic-segmentation-based detection, and hybrid methods combining the two. Candidate-box-based detection first pre-generates a number of candidate boxes, then regresses their coordinates and classifies them, and finally obtains the detection results through the NMS (Non-Maximum Suppression) algorithm; segmentation-based detection performs pixel-level semantic segmentation directly, using an FPN (Feature Pyramid Network), to obtain the text coordinates.
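The NMS step referred to above can be sketched in a few lines; the greedy variant below keeps the highest-scoring candidate box and discards candidates that overlap it too strongly. It is a generic implementation, not tied to any specific detector, and the IoU threshold is an assumed default.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes:  (N, 4) array of [x1, y1, x2, y2] candidate text boxes.
    scores: (N,)   confidence of each candidate.
    Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the kept box with the remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop candidates that overlap the kept box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```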
Next, text recognition is carried out. Text recognition focuses on the specific content of the located text regions, converting the string of text in the image into the corresponding characters. Text recognition algorithms fall into two main categories: methods based on CTC (Connectionist Temporal Classification) and network models based on attention mechanisms. CTC-based methods can effectively capture the contextual dependencies of the input sequence while solving the alignment problem between the image and the text characters, but they may make recognition errors in freely structured handwritten scenarios. Attention-based models allocate feature weights within the convolutional neural network, enhancing strong features and suppressing weak ones, and naturally capture semantics during decoding from image to text.
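To make the CTC alignment idea concrete, the sketch below implements the standard greedy decoding rule (take the best label per frame, collapse consecutive repeats, then drop blanks); the label ids in the example are invented for illustration.

```python
def ctc_greedy_decode(frame_labels, blank=0, id_to_char=None):
    """Greedy CTC decoding: take the best label per frame, collapse
    consecutive repeats, then drop blank symbols.

    frame_labels: sequence of per-frame argmax label ids from the network.
    blank:        id reserved for the CTC blank symbol.
    id_to_char:   optional mapping from label id to character.
    """
    decoded, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    if id_to_char:
        return "".join(id_to_char[i] for i in decoded)
    return decoded

# Example: frames [1, 1, 0, 1, 2, 2, 0, 3] decode to [1, 1, 2, 3], because
# the blank (0) separates the two genuine occurrences of label 1.
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 3]))  # [1, 1, 2, 3]
```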
Then, manual confirmation is conducted to verify the results after OCR recognition and determine if errors exist. During manual confirmation, flexible methods such as post-processing in batches can be adopted.
Finally, manual intervention is performed to correct any potential errors in the OCR recognition results.
AI OCR can be applied in an independent or an embedded manner within archive digitization systems. The independent mode runs as standalone software or exchanges data through an application programming interface (API), without relying on the archive digitization system. The embedded mode integrates the OCR module into the archive digitization system as part of its functionality, which requires either unified planning during the design and development of the archive management system or modification of existing systems.
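In the independent mode, the data exchange over an API might look like the following sketch; the endpoint URL, request fields, and response format here are hypothetical and would in practice be taken from the chosen OCR product's documentation.

```python
import base64
import requests

# Hypothetical endpoint and field names: the actual API of a given OCR
# engine will differ and must follow its own documentation.
OCR_ENDPOINT = "https://ocr.example.internal/api/v1/recognize"

def recognize_page(image_path: str) -> dict:
    """Send one digitized page to a standalone OCR service and return its
    JSON result, to be written back into the archive digitization system."""
    with open(image_path, "rb") as f:
        payload = {"image": base64.b64encode(f.read()).decode("ascii")}
    response = requests.post(OCR_ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()
```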
Currently, AI OCR has been introduced in various industries, but its application in the archive industry still faces challenges and limitations, mainly reflected in two aspects.
First, the diversity of archive text. Archive types are diverse, encompassing a wide range of text content, with different languages, fonts, sizes, colors, brightness, arrangements, and alignments, as well as issues such as low contrast, blur, and incompleteness in image content. There are even more challenging cases, such as handwritten scripts and various forms of traditional and simplified characters from different periods. These issues present various challenges to archive OCR work, and AI OCR cannot solve all problems, necessitating that staff find optimal solutions based on specific technical conditions.
Second, technical bottlenecks. In recent years, although AI OCR has significantly improved the performance and efficiency of machine text recognition, there remains a substantial gap between the capabilities and levels of machine text recognition and those of staff in understanding the text in images. Overall, there is still a need to continuously enhance the robustness, efficiency, and intelligence of OCR to better apply it in more complex and challenging archive work.