China has a glorious civilization spanning thousands of years, and throughout its history many precious ancient books have been passed down. These texts carry rich information on history, culture, politics, economics, and more, and hold immense value. According to the “National Ancient Book Census Registration Basic Database” project initiated by the National Library in 2016, the census had counted over 7.9 million ancient books in China as of November 2020. Although there may be overlaps, this number shows that a vast body of ancient texts has been transmitted from antiquity to the present. In the digitization of ancient books, China has completed many large-scale projects over the past forty years, such as the “Complete Library in Four Sections (Wenyuan Pavilion Electronic Edition)” and the “Basic Ancient Book Database of China”. However, a considerable number of ancient texts still lack text recognition and transcription because of their complex layouts. While current technology handles simple layouts effectively, automatically and accurately recognizing the complex formats found in family trees and local chronicles remains challenging. Simply photocopying ancient texts for digitization does not support reading, editing, or searching. Optical character recognition (OCR) technology can therefore help us recognize the text in ancient books, analyze their layouts, and produce structured outputs. This is of great importance for the protection and retrieval of ancient texts, and even for information mining and knowledge discovery.
However, ancient text recognition remains a highly challenging problem, mainly in four respects. First, the complex layouts of ancient texts, with dense text and mixed graphics, pose considerable difficulties for analysis. Second, the calligraphic styles of engravers differ significantly across dynasties. Third, some ancient texts suffer severe image degradation, such as missing regions, blurriness, background noise, and stains. Fourth, the number of character categories in ancient texts is vast: the “Kangxi Dictionary” contains nearly fifty thousand characters, and there is currently no complete annotated image dataset covering such a large character set, which makes recognition at this scale quite difficult. In addition, ancient texts contain numerous variant characters, and distinguishing these variants is not easy.
Compared to conventional OCR technology, the difficulties and challenges of ancient book OCR mainly lie in the quality, layout, and style of ancient books. At present, mainstream OCR technology has a high recognition rate for printed text, but it cannot be directly applied to ancient book OCR primarily due to the lack of high-quality, large-scale annotated data, especially the scarcity of publicly available large-scale Chinese datasets, with most layout datasets being focused on Western ancient texts.
Recently, we selected some ancient texts from the “Dunhuang Manuscripts” and tested them using OCR engines from several well-known domestic companies. The recognition results were poor, with frequent errors in text detection even on seemingly simple cases. This does not necessarily mean that their technology is inadequate; I believe a general-purpose OCR engine cannot reach high accuracy on ancient texts without targeted training on ancient text data. Some platforms do perform relatively well, such as “i-Huiyan OCR” from the Shitongwen Company, which shows decent results despite occasional minor errors. Our team has also developed an ancient book OCR engine based on the Tripitaka dataset, which performs well in recognizing the “Dunhuang Manuscripts”.
Today, if we have high-quality annotated data, we can leverage artificial intelligence technology, especially deep learning, to effectively address the ancient book OCR problem. In fact, in the field of text recognition, artificial intelligence technologies represented by neural networks have a long research history. In 1998, Yann LeCun, an international authority in deep learning, developed a handwritten digit recognition system at Bell Labs and formally proposed the convolutional neural network model LeNet-5, which achieved excellent recognition results at the time. The impact of this work has been profound, with over 40,000 citations to date. In the deep learning era, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been applied extensively in text recognition. In 2020, MIT Technology Review interviewed Professor Hinton, an international authority in artificial intelligence, who offered the viewpoint that deep learning would be ubiquitous and all-powerful. This perspective, while somewhat radical, reflects the significant role deep learning plays today in many disciplines, including computer vision, multimedia, and natural language processing, as well as education, finance, medicine, biology, e-commerce, archaeology, and more.
Why has deep learning played such an important role? First, there is an abundance of big data, especially in the internet era, where data is growing exponentially. Second, we have excellent computing capabilities, such as the rapid development of large-scale parallel computing hardware represented by GPUs. Third, there have been many breakthroughs at the algorithmic level, including methods and techniques for regularization to prevent overfitting, with numerous groundbreaking and impactful works in recognition, detection, segmentation, sequence modeling, as well as large-scale pre-training and self-supervised learning, resulting in many impressive advancements each year. Therefore, I believe that utilizing some new artificial intelligence technologies can effectively solve the ancient book OCR problem—from ancient book image processing, layout analysis, text detection to text recognition, and structured output of text.
It is generally believed that artificial intelligence has three major components: data, algorithms, and computing power. Of course, some scholars also indicate that there is another factor, namely knowledge. Knowledge includes domain knowledge, semantic knowledge, physical common sense, world knowledge, and unsupervised pre-training knowledge, which can help solve various artificial intelligence problems. If there is a large-scale pre-trained model with good language semantic knowledge, it can greatly assist us in addressing the ancient book OCR problem. Due to time constraints and the main focus, I will primarily introduce the traditional three aspects.
Currently, there are some publicly available ancient book datasets both domestically and internationally.
Datasets are a crucial factor in artificial intelligence. If vast amounts of high-quality annotated data existed, ancient book OCR would not be a significant challenge. However, when we began researching ancient book OCR in the second half of 2016, we found that publicly available ancient book datasets, particularly Chinese ones, were quite limited. Our team therefore constructed a Goryeo Tripitaka dataset and a diverse-style Tripitaka dataset, referred to as MTH, around 2017. Compared to the dataset proposed by Professor Liu Chenglin’s team in 2019, our single-character scale is somewhat smaller, but these two are document-level ancient book datasets with annotations for text lines, formats, and layouts, totaling over one million characters and covering thousands of Chinese character categories. The latest version of our MTH dataset (MTH v2) covers different layouts, fonts, and typesetting styles, and includes challenges such as double-column interlinear notes. In addition to character-level annotations, it provides annotations for text lines and reading order, making the information quite rich. The second dataset is the CASIA-AHCDB dataset from the Institute of Automation, Chinese Academy of Sciences. It is large in scale, with numerous categories, and is highly representative, covering the “Four Books and Five Classics” and Buddhist scriptures, making it one of the largest publicly available single-character datasets of Chinese ancient texts. The third dataset, HDRC-Chinese, was provided by a Singaporean NGO for the ICDAR 2019 OCR competitions and focuses on Chinese family trees. It was used in three ICDAR 2019 competition tracks: text line recognition of family trees, layout analysis, and text detection and recognition.
The family tree layout is complex, and some layouts are not standardized, so the challenge lies in how to better perform layout analysis. The fourth dataset involves oracle bone inscriptions, with three representative datasets: Oracle-20K, Oracle AYNU, and OBC306.
There are also many foreign datasets. Some representative earlier layout datasets include:
1. IAM-HistDB, a historical manuscript dataset composed of three subsets.
2. DIVA-HisDB, a medieval manuscript dataset.
3. HJDataset, a complex layout analysis dataset for Japanese documents (many characters in it are similar to Chinese characters); its scale is decent, and it contains layout analysis elements.
4. READ-BAD, a manuscript dataset from European historical archives, containing around 2,000 images.
5. REID2019, a small-scale dataset from India, containing only dozens of images.
6. MapSeg2021, a historical document dataset of ancient maps, released at the mainstream international conference ICDAR in 2021 and based on maps of Paris. Ancient maps have archaeological and academic value, and understanding the text within them is crucial, but the text arrangement is often chaotic, making it difficult to interpret and recognize.
Additionally, our team is constructing a complex-layout ancient book dataset, defining ancient book layouts into 27 categories, including headers, frames, volumes, titles, main text, marginal notes, illustrations, etc. This dataset is characterized by diversity in data, encompassing the “Complete Library in Four Sections”, ancient rare books from the National Library, Buddhist scriptures, etc.; diversity in binding formats, including scrolls, bound volumes, photocopies, and other formats; and diversity in fonts, covering handwritten and printed styles. It was officially published at ICFHR 2022 and is available for public download.
Fundamental technologies related to ancient book OCR
The general OCR process mainly includes layout analysis, text line detection, character segmentation, text line recognition, character recognition, and post-processing. Ancient book OCR roughly follows this flow. In recent years, OCR, particularly scene text recognition, has become a research hotspot; yet although general OCR and scene text recognition have been studied extensively, there is relatively little research and reporting on ancient books. I will therefore introduce the technologies involved in ancient book OCR from five aspects: (1) background knowledge on object detection and scene text detection methods; (2) high-precision ancient text segmentation and detection methods; (3) recognition methods for ancient texts; (4) layout analysis and end-to-end recognition of ancient texts; (5) reading order and understanding of ancient book documents.
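To make the staged flow concrete, here is a minimal Python sketch of how these stages compose. Every function is a hypothetical placeholder (real systems replace each with a trained model); only the composition pattern is the point.

```python
# Hypothetical sketch of the generic OCR pipeline described above.
# Each stage is a trivial placeholder standing in for a trained model.

def analyze_layout(page):
    """Split a page into regions (placeholder: one text region per page)."""
    return [{"region": page, "kind": "text"}]

def detect_lines(region):
    """Detect text lines inside a region (placeholder: one line)."""
    return [region["region"]]

def recognize_line(line):
    """Recognize a text line image (placeholder: identity)."""
    return line

def post_process(texts):
    """Join recognized lines in reading order."""
    return "\n".join(texts)

def ocr_pipeline(page):
    texts = []
    for region in analyze_layout(page):
        for line in detect_lines(region):
            texts.append(recognize_line(line))
    return post_process(texts)

print(ocr_pipeline("般若波羅蜜多心經"))  # 般若波羅蜜多心經
```

The value of framing the pipeline this way is that each stage can be swapped or retrained independently, which is how the ancient-book-specific components below slot into the generic flow.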
First, background knowledge. In text detection, representative object detection methods in computer vision can be categorized into one-stage and two-stage methods. Typical one-stage methods include YOLO, SSD, etc., while two-stage methods include Faster R-CNN, Fast R-CNN, etc.
Among one-stage object detection methods, YOLO is widely regarded as fast and effective. Proposed at CVPR in 2016, it directly regresses the bounding box coordinates of objects with a fully convolutional network and obtains detection results through post-processing. YOLO-v2 made a number of improvements, chiefly introducing batch normalization and the anchor mechanism, which enhanced detection accuracy. YOLO-v3 adopted a stronger backbone network and uses clustering to obtain better anchor scale settings. The YOLO series is still being improved: the recently open-sourced YOLO-v4/v5 models introduce stronger backbones, Bag of Freebies (BoF), Bag of Specials (BoS), CIoU loss, and other techniques to further enhance detection accuracy.
Among two-stage object detection methods, a typical example is Faster R-CNN, which runs a Region Proposal Network (RPN) over CNN features to obtain a series of candidate boxes, then applies two CNN heads, one for bounding box regression and the other for object classification, to produce the final detection results. Mask R-CNN improves on Faster R-CNN by replacing RoI Pooling with RoI Align, reducing the quantization error introduced during down-sampling, and by adding an instance segmentation branch, the Mask branch. Although the changes are small, they enable finer object segmentation and detection, making it a simple yet highly effective method that won the Best Paper Award at ICCV 2017.
In the field of text detection specifically, many dedicated methods have been proposed, with recent research hotspots focused on complex scene text (especially curved text). Curved text is relatively rare in ancient books, so I will introduce several representative methods for rectangular text detection. An early representative is the EAST model, proposed at CVPR in 2017, which uses a PVANet backbone to predict, at each position, a text score together with the geometry of the text rectangle (the distances to its sides or the offsets to its vertices) and a rotation angle, effectively achieving multi-oriented text detection. Another simple and practical method proposed in the same year is TextBoxes, a text detection method built on the SSD detection framework. TextBoxes makes some anchor designs specific to text and yields decent results; it was one of the earlier open-source text detection methods.
Our team also proposed a multi-directional quadrilateral text detection framework, DMPNet, in 2017. This method is also based on SSD but applies Monte Carlo method matching calculations and introduces some new optimization functions to achieve precise rectangle regression. Based on this method, we won first place in the multilingual text detection competition (MLT) at ICDAR 2017. Subsequently, in 2019, we proposed a detection method for multi-directional text detection boxes, BDN, which is an improvement based on Mask R-CNN. It adds a key edge design branch that effectively resolves ambiguities in sequential labeling during data annotation, achieving first place in the street scene text detection competition at ICDAR 2019. This method has also been open-sourced and performs well in dense text detection.
The second aspect involves methods for ancient text segmentation and detection. Precisely segmenting text from ancient books is very meaningful. First, it helps us better protect cultural relics and texts. Second, when building OCR data, once recognized results are turned into editable text, it may be necessary to trace back to the original first-hand materials, and high-precision text detection aids such text provenance. Furthermore, accurately segmented characters facilitate research by textual experts on the evolution of ancient scripts. High precision means that the detected box must overlap sufficiently with the ground-truth box. In computer vision, an intersection over union (IoU) of 0.5 is generally counted as a correct detection. However, we found that many ancient characters segmented at IoU = 0.5 remain incomplete; only when the IoU exceeds 0.7 or even reaches 0.8 is the character preserved more completely, which makes high-precision text detection very significant.
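To make the IoU criterion concrete, here is a small self-contained example (the box coordinates are invented for illustration): a detection that scores exactly 0.5 IoU against the ground truth can still cut off half of a character, which is why the 0.7-0.8 regime matters for glyph-level work.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

gt = (0, 0, 100, 100)       # ground-truth character box
loose = (0, 0, 100, 50)     # counted "correct" at the usual IoU=0.5 threshold,
                            # yet the bottom half of the glyph is cut off
tight = (0, 0, 100, 90)     # IoU=0.9: the crop is visually almost complete

print(iou(gt, loose))            # 0.5
print(round(iou(gt, tight), 2))  # 0.9
```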
Around 2017-2018, we made several attempts at text detection and segmentation based on traditional methods. Since annotated data was very limited at the time, we designed a method that, without text box annotations, first used histogram projection to obtain the approximate locations of boxes, followed by precise matching and search. To improve text detection, we proposed a weakly supervised learning approach for precise text segmentation. The characters obtained through projection-based over-segmentation cannot be matched precisely to their transcriptions, so the problem is how to achieve automatic alignment. To this end, we designed a weakly supervised character classifier and, based on it, proposed a recognition-guided ancient text segmentation method. The core idea is to take the over-segmentation results from histogram projection, align them with the ground-truth transcriptions using the Knuth-Morris-Pratt algorithm, and then use a CNN+CTC-based recognizer to guide the precise search for matching bounding boxes. The Goryeo Tripitaka was the publicly available data we could find at the time, and we built the TKH dataset from it to validate the effectiveness of the proposed method. Overall the results were quite good, achieving over 90% segmentation accuracy at high IoU. Compared with other mainstream methods available at the time, our method achieved better segmentation accuracy regardless of image clarity.
Once we accumulate a certain amount of annotated data, particularly character-level annotations, ancient text detection becomes less challenging. For instance, with several thousand annotated images of different layouts, both Faster R-CNN and YOLO-v3 can achieve over 90% detection accuracy at IoU = 0.5, while YOLO-v5 can even reach 98%. In high-precision detection settings especially, YOLO-v5 still performs excellently, overall outperforming Faster R-CNN.
To address high-precision ancient text detection and segmentation, we also proposed a method based on reinforcement learning. The basic idea is to first perform an initial detection using a conventional object detection method and then fine-tune the boxes with deep reinforcement learning, aiming for more precise ancient text detection at high IoU thresholds. The main design factors in reinforcement learning are the agent's actions, the reward and penalty functions, and how to learn feature representations. We used deep learning for feature representation, and the action design is quite simple: a detected box can expand in each direction, or remain stationary if it is already sufficiently precise. For the reward and penalty functions, we made some customized designs for the high-precision detection problem in ancient book OCR. Experimental results show that adding our reinforcement learning network (FCPN) to traditional object detection or text detection methods significantly improves detection accuracy on ancient texts. On the two publicly available datasets MTH and TKH, the improvements are significant at high IoU. Well-known text detection models such as PixelLink and EAST improve by about 5%-15% at IoU = 0.8 after incorporating our reinforcement learning method; that is, after fine-tuning with reinforcement learning, the detected boxes become more precise.
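This box-refinement loop can be sketched in miniature. The sketch below is an illustrative assumption, not the FCPN method itself: the action space is edge adjustments plus a stop action (shrink moves are included alongside expand for generality), and the true IoU gain stands in for the reward signal that a trained agent would have to estimate from image features.

```python
# Greedy stand-in for RL-based box refinement: apply the edge-adjustment
# action with the largest IoU gain until no action helps ("stop").
# In the real setting the agent cannot see the ground truth; here the true
# IoU gain plays the role of the learned reward, purely for illustration.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def refine(box, gt, step=2, max_iters=100):
    """Greedily apply edge-adjustment actions while they improve IoU."""
    actions = [(i, d) for i in range(4) for d in (-step, step)]
    for _ in range(max_iters):
        best, best_gain = None, 0.0
        for i, d in actions:
            cand = list(box)
            cand[i] += d
            gain = iou(cand, gt) - iou(box, gt)
            if gain > best_gain:
                best, best_gain = tuple(cand), gain
        if best is None:          # "stop" action: no move improves the box
            return box
        box = best
    return box

coarse = (4, 6, 96, 88)           # slightly-too-tight initial detection
gt = (0, 0, 100, 100)             # ground-truth box
refined = refine(list(coarse), gt)
print(iou(coarse, gt) < iou(refined, gt))  # True
```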
Methods for ancient text recognition
First, let me introduce the mainstream recognition methods in the general OCR field. Before deep learning, mainstream text recognition methods were primarily based on segmentation or over-segmentation; a typical representative is the method published in TPAMI in 2012 by Professor Liu Chenglin’s team at the Chinese Academy of Sciences, which is still applied in handwriting and ancient text recognition engines today. Our team proposed a new segmentation-based text recognition method this year, yielding better results than the currently mainstream CTC and attention methods.
Since 2015, the majority of mainstream methods in Chinese text recognition have been segmentation-free. In English text recognition, the CTC sequence modeling framework proposed over a decade ago, such as LSTM+CTC, has played a significant role in sequence recognition, particularly handwritten English recognition. The first successful application to scene text recognition was the CRNN model proposed by Professor Bai Xiang’s team at Huazhong University of Science and Technology, which combines CNN+BLSTM+CTC; it is a very effective and influential framework with open-source code publicly available online. With sufficient annotated data, it is easy to train your own text recognition engine, including an ancient text line recognition engine.
The third category of text recognition methods is based on the attention mechanism. A representative method from recent years in the OCR field is ASTER, which combines text image rectification, feature encoding, and recognition decoding in an end-to-end manner. Attention-based text recognition has been studied extensively in recent years, with methods like SVTR achieving excellent results on complex natural scene text. Moreover, incorporating language and semantic knowledge can improve text recognition. In 2021, a team from the University of Science and Technology of China proposed a method that effectively combines language models and visual models for end-to-end training. Language models have been applied in OCR for decades, but combining Transformer-based language models with attention-based deep networks for end-to-end training has proved very effective. In recent years, some researchers have also combined CTC and attention methods to exploit their complementary advantages.
Next, I will briefly introduce the text recognition methods we used on the ancient Tripitaka. In fact, recognizing the ancient Tripitaka is not that difficult if there is a good amount of annotated data, and it does not require particularly complex methods. However, when we began this work five years ago, there were almost no publicly available datasets, and annotating data was costly. At the time we found the Goryeo Tripitaka data, which is quite large, containing 160,000 images and over 50 million characters, but it only has image-to-text alignment at the document level, without any detection or text annotations that an OCR engine could train on directly. We wondered whether we could build a decent text line or character recognition engine through weakly supervised learning based on the existing document-level alignment. The difficulty is that automatically segmented text lines may not align with the text labels; if that misalignment could be handled, the annotation data problem would be resolved as well. We therefore used traditional methods to segment the images into columns of text line images, where some samples may be misaligned, and designed a training strategy with an adaptive gating mechanism to mitigate the impact of misalignment. Experiments showed that if the line segmentation is done well, automatic alignment succeeds over 95% of the time. On this basis, we designed two simple recognition models, CNN+CTC and CNN+LSTM+CTC. Using over 100,000 images from the Goryeo Tripitaka, randomly split 75% for training and 25% for testing, we found the overall results acceptable, with recognition accuracy exceeding 98%.
Our second work in ancient text recognition involves weakly supervised recognition. When there are no character-level annotations but there are text line annotations, can we link character segmentation and recognition to build a recognition engine that does not require precise character annotations? To this end, we proposed a weakly supervised incremental adaptive learning method. The recognition engine itself is simple, a convolutional neural network; its training samples are continuously acquired through weak supervision. The recognition network first uses traditional methods to annotate some data for training; then, based on the confidence provided by the recognition engine, an automatic confidence-based labeling mechanism labels further samples for the training set. This weakly supervised gated label generation mechanism is a significant new contribution of the work. Before applying the method, single-character recognition accuracy was around 91%; afterwards it reached 98%, and detection performance improved from 86% to 91%. Overall, the method allows effective interaction among the recognizer, detector, and segmenter. In document line recognition experiments, we tested eight or nine different editions of the Tripitaka, including the “Pili Tripitaka”, “Sixi Tripitaka”, “Zhaocheng Jin Tripitaka”, “Qianlong Tripitaka”, and “Chinese Tripitaka”, and the results remained accurate overall.
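The confidence-gated labeling idea can be sketched as a simple loop. This is an illustrative assumption about how such a mechanism can be organized, not the paper's exact design: the threshold, the mock recognizer, and all names below are invented.

```python
# Sketch of confidence-gated pseudo-labeling: only predictions above a
# confidence threshold are admitted into the next round's training set.
# The recognizer here is a mock lookup; a real one is a trained network.

def pseudo_label_round(unlabeled, recognizer, threshold=0.9):
    """recognizer(x) -> (label, confidence); keep only confident samples."""
    accepted, rejected = [], []
    for x in unlabeled:
        label, conf = recognizer(x)
        (accepted if conf >= threshold else rejected).append((x, label, conf))
    return accepted, rejected

# Mock recognizer: pretend confidence correlates with image quality.
samples = [("img_clean", "佛", 0.97), ("img_blur", "?", 0.42),
           ("img_ok", "經", 0.93)]
lookup = {x: (lab, conf) for x, lab, conf in samples}

accepted, rejected = pseudo_label_round([s[0] for s in samples], lookup.get)
print([x for x, _, _ in accepted])  # ['img_clean', 'img_ok']
print(len(rejected))                # 1
```

In an incremental scheme, the accepted samples would be added to the training set, the recognizer retrained, and the round repeated, so that the labeled pool grows as the model improves.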
The fourth aspect involves layout analysis and end-to-end recognition of ancient texts.
Text detection or recognition, including single-character detection and text line detection and recognition, is not a problem as long as there is annotated data. Current deep learning technology is mature enough to handle this well. However, layout analysis of ancient texts can sometimes be more challenging. The first step in OCR is to perform layout analysis; otherwise, it is impossible to recognize and output in the correct reading order. Our team has conducted three areas of work in ancient layout analysis:
First, an end-to-end framework for layout analysis, text detection, and recognition. For an ancient book image, can we detect not only the large layout framework but also the fine-grained text boxes and character boxes simultaneously? A single model can accomplish all three tasks. We designed a deep network model that includes branches for layout, text localization detection, and classification, jointly optimizing the three branches in an end-to-end manner. Based on the diverse style Tripitaka, the results were quite good. This work’s paper was published around 2020, and compared to some mainstream commercial systems at that time, our results were significantly better. Whether it involves single-column, double-column, three-column, or four-column layouts, the visual layout analysis results are generally quite good.
The second method, DADSeg, comes from work we are currently undertaking: a new Transformer-based method for ancient layout analysis. It leverages the instance segmentation idea and uses the Transformer self-attention mechanism to learn and predict layout segmentation positions; through a segmentation prediction head and post-processing, it can analyze complex ancient layouts. The method consists of four modules: CNN feature extraction, a deformable-attention feature fusion Transformer block, a decoder, and a post-processing module. Experiments on publicly available ancient layout datasets such as cBAD 2017 and MapSeg2021 show that our method performs comparably to current SOTA methods (on cBAD) or even better (on MapSeg). On the publicly released ICDAR RDCL 2015 dataset, our method significantly outperformed previous methods, and on stroke-level segmentation datasets it nearly achieved SOTA results.
Lastly, I will introduce our method for understanding the reading order of ancient documents. In the ancient book OCR process, producing output in the correct reading order is crucial; otherwise, when recognizing an ancient page whose layout has, say, four stacked frames, text from the upper and lower frames may be merged into one line, yielding completely chaotic output. We therefore proposed reading order methods based on graph neural networks and Transformers, comprising two models: one for character reading order and one for text line reading order. The character reading order model is built on graph convolutional networks: each character forms a node, node features combine visual and geometric features, and edge features represent the geometric relationship between two character nodes. Standard graph convolution is then used for inference, producing the correct character reading order.
For predicting the reading relationships of text lines, we built a Transformer-based text line reading order method, inspired by the LayoutReader method from EMNLP, which is likewise Transformer-based. Evaluation metrics include traditional precision, recall, and their harmonic mean, as well as metrics for text line connection relationships such as ARD. On the ancient datasets from the Chinese Academy of Sciences and on our own datasets, our method significantly outperformed heuristic rule-based methods. Goryeo Tripitaka pages contain both large and small characters and both large and small columns, making the layout complex and difficult to handle, especially when the images carry background noise. Heuristic rules often fail in such cases, but learning-based methods handle them effectively.
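The heuristic baselines referred to above can be as simple as the following sketch (an illustrative assumption, not the actual baseline used in the comparison): classical Chinese columns are read top to bottom, right to left, so boxes are grouped into columns by x-coordinate and sorted. The fixed column tolerance is exactly what breaks on pages mixing large and small columns.

```python
# Sketch of a heuristic reading-order baseline for classical Chinese layout:
# group character boxes into vertical columns (right to left), then sort
# each column top to bottom. A fixed column tolerance fails on pages that
# mix large and small columns, which motivates learning-based methods.

def reading_order(boxes, col_tol=20):
    """boxes: (x_center, y_center) pairs; returns them in reading order."""
    columns = []                                     # each: (anchor_x, [boxes])
    for box in sorted(boxes, key=lambda b: -b[0]):   # right to left
        for col in columns:
            if abs(col[0] - box[0]) < col_tol:       # same column
                col[1].append(box)
                break
        else:
            columns.append((box[0], [box]))          # start a new column
    ordered = []
    for _, col in columns:                           # within a column:
        ordered.extend(sorted(col, key=lambda b: b[1]))  # top to bottom
    return ordered

# Two columns: right column at x≈200, left column at x≈100
boxes = [(100, 30), (200, 60), (200, 30), (100, 60)]
print(reading_order(boxes))
# [(200, 30), (200, 60), (100, 30), (100, 60)]
```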
Our recent work aims to simultaneously output text segmentation, recognition, and reading order understanding without sufficient annotation data. The diverse styles of ancient layouts require consideration of how to address layout segmentation, recognition, and prediction with relatively low costs. Our method has shown good results; one of its advantages is the ability to handle curved and deformed ancient texts, such as images of ancient texts captured with mobile phones or those that cannot be pressed down for scanning due to their precious nature, resulting in potentially curved or deformed images.
Lastly, I want to share some applications of ancient book OCR technology. There are many companies in China specializing in ancient book OCR, such as Airu Life and Shitongwen. Our laboratory has also developed an ancient book OCR engine for the Tripitaka, which is now open for public testing. Although this engine is trained on Tripitaka data, it still shows decent recognition performance for other types of ancient books.
Next, I would like to introduce a professional ancient book digitization tool platform, the “Rushi Ancient Book Digitization Production Platform”, developed by the Beijing Rushi Artificial Intelligence Research Institute. It includes excellent ancient book digitization tools such as OCR, punctuation, text comparison, segmentation, and error correction. The institute collaborates with our laboratory, and the platform’s OCR recognition and detection engines are provided by us. The platform consists of four parts. First, the OCR tool: users upload images, the platform supports layout analysis of different formats and calls the backend recognition engine to display recognition results. The results can then be proofread and corrected, including text correction, detection box correction, and reading order correction, with some AI-assisted error correction technologies and good human-computer interaction for locating problematic areas. If any text detection results are wrong, they can be adjusted, and the platform can also manage ancient books and export recognition results in JSON format. Second, automatic punctuation. Third, punctuation migration. Fourth, intelligent text comparison, such as identifying the textual differences among the many versions of the “Diamond Sutra” handed down from ancient times to the present; automatically identifying the differences between versions would greatly aid research in ancient literature and textual criticism. The platform supports both individual and group user registration, and group accounts can manage multiple users.
Our laboratory has developed the “Chinese Ancient Book Document Analysis and Recognition Demonstration System”, utilizing a single GPU card to support text detection and recognition, as well as reading order and layout analysis functions. Overall, the accuracy of recognition results is acceptable, and we welcome everyone to test and provide feedback.
Recently, we have also developed a classical Chinese to modern Chinese translation engine, the "Classical Chinese Neural Machine Translation Demonstration System", which shares some technical ground with OCR. Technically, recognizing an ancient text line is a sequence-to-sequence modeling problem, and translating classical Chinese into the vernacular is as well, facing similar data and model training challenges. Compared with professional translation software our training set is small, yet many classical passages translate well, and preliminary experimental results appear better than those of Baidu.
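To make the shared sequence-to-sequence formulation concrete, here is a toy sketch: OCR maps a sequence of image-line features to characters, and translation maps classical characters to vernacular ones, but both instantiate the same encode-then-greedy-decode loop. The encoder and decoder below are stand-in callables, not trained networks.

```python
from typing import Callable, List, Sequence

def seq2seq(encode: Callable[[Sequence], list],
            decode_step: Callable[[list, List[str]], str],
            x: Sequence, eos: str = "<eos>", max_len: int = 50) -> List[str]:
    """Generic greedy decoding loop shared by both tasks."""
    memory = encode(x)                    # encoder: summarize the input sequence
    out: List[str] = []
    for _ in range(max_len):
        tok = decode_step(memory, out)    # decoder: next token given history
        if tok == eos:
            break
        out.append(tok)
    return out

# Trivial stand-ins: the "encoder" keeps the input tokens, the "decoder"
# copies them out one by one, then emits end-of-sequence.
result = seq2seq(
    lambda x: list(x),
    lambda mem, out: mem[len(out)] if len(out) < len(mem) else "<eos>",
    "初學",
)  # → ["初", "學"]
```

In a real system the encoder would be a CNN or Transformer over the line image (or source sentence) and `decode_step` an attention-based decoder; the point is only that both problems share this interface.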
For example, a passage from the "Zuo Zhuan": "At first, Duke Wu of Zheng married a woman from Shen named Wu Jiang, who bore Duke Zhuang and Gongshu Duan. When Duke Zhuang was born, Lady Jiang was alarmed, hence he was named 'Wusheng', and she disliked him. She loved Gongshu Duan and wanted to establish him as heir, repeatedly requesting this of Duke Wu, who did not agree." For someone like me with a background in science and engineering, this is not easy to understand. After reading the translated text (which reads: "Initially, Duke Wu of Zheng married a woman from Shen named Wu Jiang, who gave birth to Duke Zhuang and Gongshu Duan. When Duke Zhuang was born, his difficult birth startled Lady Jiang, so he was named 'Wusheng', and she disliked him. Lady Jiang favored Gongshu Duan and wished to establish him as the heir, repeatedly requesting it from Duke Wu, who did not agree."), I roughly understood the meaning. I asked a humanities student, who rated the translation three out of five: barely understandable, but not on par with professional translations. The main issue is the lack of annotated data; future research could explore unsupervised, weakly supervised, or large-scale pre-trained model methods to address classical-to-modern Chinese translation.
Turning to the translation of Buddhist scriptures, consider the "Heart Sutra": "When Avalokiteshvara Bodhisattva practiced deep prajna paramita, he saw that the five aggregates are empty and transcended all suffering and distress." The translation result is: "When Avalokiteshvara Bodhisattva uses profound prajna wisdom for observation, he sees that the five aggregates (form, feeling, perception, formation, and consciousness) are all empty of substance, thus transcending all suffering and disaster." It is quite surprising that "the five aggregates are empty" is translated correctly, with form, feeling, perception, formation, and consciousness all spelled out. For science and engineering students like us, such tools can help us better understand classical texts and gain a rough understanding of the ancient literature that has been passed down in China.
Today, I mainly introduced the technological advancements related to ancient book OCR from the perspectives of data, methods, and applications. Looking ahead, many issues in ancient book OCR remain unresolved. First, recognition of ancient texts with an ultra-large number of character categories is not yet well addressed; our current recognition engine supports at most one or two thousand categories, while the GB18030-2005 character set standard contains over seventy thousand characters, including many variant and rare characters, and to date no annotated dataset fully covers them. Could small-sample or synthetic-sample techniques be used to tackle this ultra-large-category recognition problem? Second, the restoration of ancient book images also requires further research. For instance, the "Fangshan Stone Classics" are rubbings taken from stone, so many characters are damaged or missing; could artificial intelligence be used to restore them? Third, layout analysis of complex ancient texts. Fourth, current artificial intelligence methods, represented by deep learning, are data-driven; beyond data, can we also exploit knowledge? For massive collections of ancient texts, for example, can already organized textual knowledge or knowledge graphs be used to better understand and recognize ancient document images? This is worth researching. Fifth, many teams in China are already developing general ancient book OCR platforms, including the Beijing Rushi Artificial Intelligence Research Institute and Beijing Shitongwen Digital Technology Co., Ltd., which have built excellent digitization tools. In addition, I would like to call on teams in OCR, digital humanities, ancient literature studies, and paleography to collaborate in constructing and opening larger-scale Chinese ancient book datasets.
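As one hedged sketch of the synthetic-sample idea for rare characters, the snippet below generates perturbed variants of a toy binary glyph bitmap (a random horizontal shift plus salt-and-pepper noise). A real system would first render each code point from many fonts and then apply such perturbations; the bitmap and parameters here are purely illustrative.

```python
import random

def perturb(glyph, shift_x=1, noise=0.05, rng=None):
    """Create one synthetic variant: random shift plus salt-and-pepper noise."""
    rng = rng or random.Random(0)
    h, w = len(glyph), len(glyph[0])
    dx = rng.randint(-shift_x, shift_x)       # small horizontal displacement
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = x - dx
            v = glyph[y][sx] if 0 <= sx < w else 0
            if rng.random() < noise:          # flip a few pixels as noise
                v = 1 - v
            out[y][x] = v
    return out

def synthesize(glyph, n=10, seed=0):
    """Generate n synthetic training variants of one glyph."""
    rng = random.Random(seed)
    return [perturb(glyph, rng=rng) for _ in range(n)]

# Toy 3x3 "glyph"; real glyphs would be rendered font bitmaps.
glyph = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]
samples = synthesize(glyph, n=5)
```

With a font-rendering step in place of the toy bitmap, the same loop can manufacture labeled samples for every character a standard encodes, which is one plausible route around the missing annotated data for rare and variant characters.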
To my knowledge, there are currently few publicly available and freely accessible Chinese ancient book datasets. I believe that constructing a large-scale, open Chinese ancient book dataset will positively contribute to the development of the entire field, assist in the protection and restoration of national ancient texts and artifacts, and promote advancements in ancient book OCR technology, the construction of ancient knowledge graphs, and information discovery. I hope that teams from different disciplines can collaborate, and experts and scholars in artificial intelligence, digital humanities, ancient literature studies, and paleography can work together to promote the development of ancient book OCR.