DA/T 77-2019 Optical Character Recognition Standards for Digital Copies of Paper Archives

Preface

This standard is drafted in accordance with the rules given in GB/T 1.1-2009.

This standard is proposed by, and falls under the jurisdiction of, the National Archives Administration.

The drafting units of this standard: National Archives Administration, Qingdao Archives.

Main drafters of this standard: Liu Yun, Ding Desheng, Yang Laiqing, Zou Jie.

1 Scope

This standard specifies the organization, implementation, and management of Optical Character Recognition (OCR) work for digital copies of paper archives.

This standard is applicable to OCR work for digital copies of paper archives with clear handwriting and standardized text.

2 Normative References

The following documents are essential for the application of this document. For reference documents with dates, only the version with the indicated date applies to this document. For reference documents without dates, the latest version (including all amendments) applies to this document.

DA/T 13-1994 Archive Numbering Rules

DA/T 22-2015 Archive Document Arrangement Rules

DA/T 31-2017 Digitalization Standards for Paper Archives

3 Terms and Definitions

The following terms and definitions apply to this document.

3.1 Character

An element in a set of elements used to organize, control, or represent data.

[GB 18030-2005, Definition 4.1]

3.2 Character Set

A collection of multiple characters.

Note: Common character sets include ASCII, GB2312, BIG5, GB18030, Unicode, etc.

3.3 Optical Character Recognition; OCR

The process of using information technology to recognize character shapes in image files, convert them into text, and output or present that text.

3.4 Digital Copy of Paper-Based Record

A digital image formed by digitizing paper archives, stored on carriers such as tapes, disks, and CDs, and recognizable by electronic devices such as computers.

3.5 OCR Outcome of Record

A document that records the text content of digital copies of paper archives obtained through OCR technology.

3.6 Recognition Accuracy

The proportion of characters that OCR technology recognizes correctly.

Note: Recognition accuracy = (Number of correctly recognized characters / Total number of characters that should be recognized) × 100%
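
The ratio in the note can be computed directly. The following is a minimal sketch and not part of the standard: the function name `recognition_accuracy` and the position-by-position comparison are assumptions made for illustration. A real evaluation would normally align the two texts first, because a single inserted or dropped character shifts every later position.

```python
def recognition_accuracy(reference: str, recognized: str) -> float:
    """Return accuracy in percent: correct characters / characters that should be recognized."""
    if not reference:
        raise ValueError("reference text must not be empty")
    # Assumption: both strings are already aligned, so position i of the
    # OCR output corresponds to position i of the reference text.
    correct = sum(1 for ref, rec in zip(reference, recognized) if ref == rec)
    return correct / len(reference) * 100.0


# One wrong character out of four gives 75 percent.
print(recognition_accuracy("档案管理", "档案管埋"))
```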

3.7 Recognition Speed

The number of characters recognized by OCR technology in a unit of time.

4 General Principles

4.1 Archive OCR should be incorporated into the resource construction of digital archives (rooms), planned in a coordinated manner, implemented in an orderly fashion, and gradually normalized.

4.2 Archive OCR should be scientifically conducted, aiming to facilitate the retrieval of archive information and computer-assisted cataloging, research and development, data mining, etc.

4.3 Archive OCR should be based on the digitization of archives, establishing accurate and reliable associations between OCR outcomes and digital copies of paper archives.

4.4 Effective management and technical measures should be taken to strengthen the process management and quality control of archive OCR, ensuring that the OCR process is standardized, the outcomes are reliable, and the data is secure.

4.5 The OCR work for sensitive paper archive digital copies should comply with the management and technical requirements related to sensitive archives.

5 Work Organization

5.1 Organization and Personnel

5.1.1 An archive OCR work organization should be established, equipped with personnel of appropriate quality and technical level, to organize the overall planning, implementation, coordination, technical support, security assurance, supervision, inspection, outcome acceptance, and long-term preservation of the archive OCR work. Archive OCR can be coordinated with the digitalization work of paper archives.

5.1.2 If the archive OCR work is outsourced, the qualifications of the OCR service provider should be strictly examined in terms of the nature of the enterprise, shareholder composition, security and confidentiality, enterprise scale, registered capital, etc. The management capability of the service provider should be assessed based on how well its rules and regulations are established and maintained. A clear responsibility and supervision mechanism covering the entire work process should be established to ensure the safety of archive information. External personnel should undergo security checks and confidentiality education as required.

5.2 Process Control

5.2.1 The archive OCR process includes five business links: image import, image preprocessing, comparison recognition, modification correction, and result sorting output. Effective control should be implemented throughout the entire archive OCR process based on relevant technical standards.

5.2.2 Quality management and safety management should be strengthened throughout the entire process of archive OCR work, establishing a complete mechanism for discovering and correcting quality and safety issues to ensure the quality of OCR outcomes and the safety of archive information.

5.3 Work Documents and Metadata

5.3.1 Work documents such as the archive OCR work plan, technical plan, work approval materials, process control materials, data acceptance materials, project acceptance reports, and result handover materials should be established. For outsourced services, this should also include project bidding documents, bidding files, award notices, project contracts, confidentiality agreements, operating procedures, and supervision records, strengthening the management of archive OCR work.

5.3.2 Relevant metadata design, capture, cataloging, and management requirements in the archive OCR work process should be proposed according to relevant standards, integrated with the metadata implementation of the management process of corresponding digital copies of paper archives, and included in the database of the digital archive (room) application system.

6 Plan Formulation

6.1 Determine Work Strategy

6.1.1 Before starting the OCR work, an assessment should be made of whether the quality of the digital copies of paper archives meets the basic requirements for OCR. The assessment generally includes image resolution, skew, clarity, distortion, brightness, contrast, and gray scale.

6.1.2 After passing the assessment, the work strategy for archive OCR should be formulated based on the following factors:

—— Image resources: recognizable color (24 BITS), grayscale (256 levels), and black-and-white binary images that meet the import standards. Generally, these should be in TIFF, BMP, JPG, PDF (image), OFD (image) formats.

—— OCR engine: Software development package for high-speed and high-accuracy recognition of text contained in images.

—— OCR software: Software equipped with the OCR engine that can output recognition results quickly and accurately, supporting manual comparison and correction. The scope, quality, efficiency, and technical requirements of OCR should be determined based on the cost-risk balance principle of project resources.

—— Infrastructure: Locations, facilities, and equipment supporting system operation, including OCR equipment and workspace, off-site storage for media, backup server rooms, and auxiliary facilities.

—— Professional technical support capability: The ability to provide support and comprehensive assurance for system operation to achieve expected system goals, including the ability to analyze and solve hardware, system software, and application software issues, network system security management, and communication coordination.

—— Operational maintenance management capability: The ability to ensure that related equipment and software operate normally, providing long-term, timely, and comprehensive technical support, including operational environment management, system management, security management, and change management.

—— Disaster recovery plan: Quick and effective response and recovery from system disasters, including emergency disaster response, post-disaster system reconstruction and resumption of operations, and the establishment of related support mechanisms for communications, logistics, and technology.

6.2 Formulate Technical Plan

6.2.1 A technical plan for each work system of archive OCR should be formulated based on the determined work strategy, including the OCR data management system, OCR recognition processing system, and network system. The systems involved in the technical plan should meet the following conditions:

—— Security protection level equivalent to that of the archive management system;

—— Scalability;

—— No significant impact on the usability and performance of the archive management system.

6.2.2 To ensure that the technical plan meets the requirements of the archive OCR work strategy, the technical plan should be confirmed and verified, and the results of verification and confirmation should be recorded and preserved. Development should proceed according to the confirmed OCR software technical plan to achieve the required data management system, OCR recognition processing system, and network system.

6.2.3 Installation and testing plans for each stage of the OCR software system, as well as plans for supporting different key business functions, should be developed according to the confirmed technical plan, and final users should be organized to conduct tests together. The following functions should be confirmed to be correctly implemented:

—— Preprocessing of recognized images;

—— Data recognition and verification;

—— Output of archive OCR results;

—— Data security management.

7 Implementation of Archive OCR

7.1 Image Import

7.1.1 Before implementing archive OCR, an assessment should be conducted to determine whether the quality of the digital copies of paper archives meets the basic requirements for OCR. The assessment generally includes image resolution, skew, clarity, distortion, brightness, contrast, and gray scale.

7.1.2 The image resolution of digital copies of paper archives should not be less than 200 dpi. In special cases, such as small text, dense text, or poor clarity, the resolution can be appropriately increased. File naming should comply with the provisions of DA/T 13-1994, DA/T 22-2015, DA/T 31-2017.
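
The 200 dpi minimum in 7.1.2 can be checked before import. This is a minimal sketch under stated assumptions: the function name and parameters are hypothetical, and it assumes the physical size of the scanned sheet is known in inches, whereas many workflows would read the resolution from the image file's own metadata instead.

```python
MIN_DPI = 200  # minimum resolution stated in 7.1.2


def meets_min_resolution(width_px, height_px, width_in, height_in, min_dpi=MIN_DPI):
    """Return True when both horizontal and vertical resolution reach min_dpi."""
    dpi_x = width_px / width_in    # horizontal pixels per inch
    dpi_y = height_px / height_in  # vertical pixels per inch
    return min(dpi_x, dpi_y) >= min_dpi
```

Images that fail this check would fall under 7.1.3 and be re-digitized rather than imported.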

7.1.3 For digital copies of paper archives that cannot meet the basic requirements for archive OCR work, they should be re-digitized according to the requirements of DA/T 31-2017 before being imported.

7.2 Image Preprocessing

7.2.1 Binarization

7.2.1.1 Before recognition processing, color images should be converted to grayscale and binarized; grayscale images should also undergo binarization. Algorithms such as local adaptive binarization should be adopted, supporting automatic or manual adjustments.
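
As an illustration of local adaptive binarization, the sketch below thresholds each pixel against the mean of its surrounding window. It is one simple member of this family of algorithms, not a method prescribed by the standard. The function name, the `window` and `offset` parameters, and the representation of the image as a list of rows of 0 to 255 integers are all assumptions; the input is assumed to be grayscale already.

```python
def adaptive_binarize(gray, window=15, offset=10):
    """Binarize a grayscale image using a local mean threshold.

    A pixel becomes black (0) when it is darker than the mean of its
    window minus `offset`; otherwise it becomes white (255).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    half = window // 2

    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x].
    integral = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    result = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            threshold = total / area - offset
            out_row.append(0 if gray[y][x] < threshold else 255)
        result.append(out_row)
    return result
```

Exposing `window` and `offset` as parameters corresponds to the manual adjustment mentioned in the clause; an automatic mode would choose them from image statistics.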

7.2.1.2 Automatic and manual adjustment functions for brightness and contrast values should be available. The settings for brightness and contrast should be based on the coherence and clarity of the strokes of Chinese characters in the adjusted image.

7.2.2 Image Denoising

7.2.2.1 Before recognizing printed characters in images, denoising should be performed on the images to enhance the accuracy of recognition processing.

7.2.2.2 Denoising should remove impurities that affect image quality, such as stains, lines, black edges, and spots caused by paper degradation, water stains, and stitching holes.

7.2.3 Skew Correction

7.2.3.1 Image direction detection and automatic horizontal or vertical skew correction should be performed before recognition.

7.2.3.2 Users should be able to specify the angle of skew in the image, and manual skew correction should be performed using appropriate image rotation algorithms.

7.2.4 Image Monitoring

The image quality control program should automatically detect the quality of image processing. Images that do not meet quality requirements should be marked.

7.3 Comparison Recognition

7.3.1 Layout Analysis

7.3.1.1 Before comparison recognition, the character block structure in the image should be analyzed for layout, grouping similar block information together.

Examples of block types include horizontal text, vertical text, tables, and graphics.

7.3.1.2 Layout analysis can adopt various analysis methods to automatically detect layout types, logically classify internal areas of the image, record the positions of each block, and store layout information.

7.3.2 Archive Feature Analysis

7.3.2.1 Archive seal analysis. A seal style library should be established to automatically recognize seals in the image and identify field positions such as archive number, year, organization, retention period, document number, and page number based on seal styles.

7.3.2.2 Document element analysis. A document format library should be established to accurately recognize the header, body, and footer of documents, recognizing areas for official seals, signatures, etc., comparing with document styles to identify confidentiality levels, confidentiality periods, urgency levels, document numbers, issuers, titles, main recipients, text, attachment descriptions, signatures of issuing organizations, dates of creation, annotations, attachments, and recipients. OCR recognition requirements for document elements are detailed in Appendix A.

7.3.2.3 Table analysis. A separate table processing module should be established, along with a dedicated table template definition tool, to customize file processing forms, issuing drafts, and other table templates, recognizing field positions in tables.

7.3.2.4 Seal analysis. The positions of seal images should be recognized, seal images should be stored, and a relationship database between seal names and images should be established for layout recovery.

7.3.3 Recognition and Matching

7.3.3.1 During recognition, font, size, bold, italic, first-line indentation, and other character features should be extracted and compared with the feature database using similarity calculation methods, so that each character is mapped to its internal code in computer text.

7.3.3.2 The feature database should store various printed characters, commonly used signatures, and handwritten characters, with updatable and expandable capabilities. A high-frequency library should be established for frequently used Chinese characters, English characters, numbers, and commonly used symbols, signatures, and handwritten characters. Unrecognized handwritten characters should be filtered out for manual recognition, and the recognition results should be stored in the character library.

7.3.3.3 The recognized text should be corrected or amended based on context to find the most logical word from similar candidate groups, enhancing the accuracy of OCR recognition.
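
The context-based selection in 7.3.3.3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular OCR engine: each character position is assumed to come with a short list of `(character, similarity score)` candidates, and "context" is reduced to membership in a hypothetical vocabulary set.

```python
from itertools import product


def choose_word(candidates, vocabulary):
    """Pick the candidate combination that forms a known word.

    candidates: one list per character position, each holding
                (character, similarity_score) pairs.
    vocabulary: set of known words (hypothetical word list).
    Falls back to the top-scoring character per position when no
    combination appears in the vocabulary.
    """
    best_word, best_score = None, float("-inf")
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = sum(s for _, s in combo)
        if word in vocabulary and score > best_score:
            best_word, best_score = word, score
    if best_word is not None:
        return best_word
    return "".join(max(pos, key=lambda c: c[1])[0] for pos in candidates)


candidates = [[("档", 0.9), ("挡", 0.8)], [("安", 0.6), ("案", 0.55)]]
print(choose_word(candidates, {"档案"}))  # the vocabulary overrides the higher-scoring "安"
```

Enumerating every combination is only practical for short words with few candidates per position; longer text would need a language model or dynamic programming.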

7.4 Modification and Correction

7.4.1 The recognized text should undergo automatic semantic recognition and correction. Using vocabulary and semantic databases, characters, words, and sentences in the recognized text should be analyzed and corrected layer by layer. The vocabulary and semantic databases should have updating and automatic learning functions.

7.4.2 Candidates, rejected characters, and potentially problematic words and sentences should be marked.
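
One way to mark uncertain output for later manual correction is to label each character by its recognition confidence. The sketch below is illustrative only: the thresholds `reject_below` and `review_below`, the status labels, and the assumption that the engine reports a confidence between 0 and 1 per character are all hypothetical and not taken from the standard.

```python
def mark_uncertain(chars, reject_below=0.5, review_below=0.8):
    """Label each (character, confidence) pair as 'ok', 'review', or 'rejected'."""
    marked = []
    for ch, confidence in chars:
        if confidence < reject_below:
            status = "rejected"   # too unreliable; needs manual recognition
        elif confidence < review_below:
            status = "review"     # plausible, but should be checked by a person
        else:
            status = "ok"
        marked.append((ch, status))
    return marked
```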

7.4.3 Manual comparison and correction functions should be supported to meet higher accuracy requirements for OCR results.

7.5 Result Sorting Output

7.5.1 Result Sorting

7.5.1.1 Support should be provided for understanding and reconstructing the layout of paragraphs and tables in OCR results according to the layout of digital copies of paper archives. The layout of the reconstructed OCR results should be consistent with the images of the digital copies of paper archives.

7.5.1.2 The system should automatically analyze and extract various document elements from party and government documents, including confidentiality levels, confidentiality periods, urgency levels, document numbers, issuers, titles, main recipients, text, attachment descriptions, signatures of issuing organizations, dates of creation, annotations, attachments, and recipients. The positions of each document element in the OCR results should be consistent with the images of the digital copies of paper archives.

7.5.1.3 Support for calling, editing, backing up, and exporting OCR results, as well as searching for text and symbols, should be provided.

7.5.2 Result Output

7.5.2.1 Archive OCR results should be saved in both plain text and dual-layer PDF/OFD file formats.

7.5.2.2 The plain text format of archive OCR results should be saved on a per document or page basis. The saving rules for plain text OCR results are detailed in Table 1:

Table 1 Saving Rules for OCR Results

7.5.2.3 The naming of plain text format archive OCR results should be based on the archive number, ensuring the uniqueness of the file names. When a single archive is saved as multiple OCR result files, the naming should be based on the archive number combined with the sequential number of the OCR results.

Example 1: For a digital copy of a paper archive with the archive number A001-001-001-0001, the corresponding OCR result file name should be A00100100010001.txt.

Example 2: For a digital copy of a paper archive with the archive number A001-001-001-0002, which includes a document processing form and the original document, the corresponding OCR result file names should be A00100100010002_01.txt and A00100100010002_02.txt, respectively.

7.5.2.4 Based on the layout file format of the digital copies of paper archives, a dual-layer PDF or OFD file supporting full-text retrieval should be automatically generated for convenient reading after full-text retrieval.

7.5.2.5 The system should support the automatic saving of document elements in party and government documents in the archive OCR results according to archive cataloging rules and electronic archive metadata specifications. The relevant document elements should be saved into the database of the digital archive (room) application system.

7.5.2.6 The system should support the automatic conversion of simplified and traditional Chinese in the archive OCR results.

7.5.3 Result Acceptance

7.5.3.1 Archive OCR results should be accepted and inspected using a combination of computer automated inspection and manual inspection.

7.5.3.2 Acceptance inspection content includes OCR results, extracted document elements from party and government documents, data linkage status, OCR work documents, and storage media.

7.5.3.3 Items that can be automatically inspected by computers should undergo 100% computer automated inspection, while items that cannot be automatically inspected should be manually inspected based on sampling, with a sampling ratio of no less than 5%.
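
The minimum manual sample under 7.5.3.3 can be worked out as below. This is a sketch under stated assumptions: the function names are hypothetical, the sample size is rounded up so the ratio never falls below 5%, and simple random sampling is assumed because the clause does not specify a sampling method.

```python
import random


def manual_sample_size(total_items, min_percent=5):
    """Smallest whole number of items that is at least min_percent of the total."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (total_items * min_percent + 99) // 100


def draw_manual_sample(items, min_percent=5, seed=None):
    """Randomly choose the items to inspect by hand (items must be a list)."""
    rng = random.Random(seed)
    return rng.sample(items, manual_sample_size(len(items), min_percent))


print(manual_sample_size(1000))  # 50
print(manual_sample_size(101))   # 6, because 5 items would be below 5%
```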

8 Quality Requirements for Archive OCR

8.1 Recognition Accuracy

8.1.1 The recognition accuracy of archive OCR for Chinese, numeric, and English printed characters should be above 95%.

8.1.2 The recognition accuracy for commonly used signatures should reach above 90%, and handwritten character recognition accuracy should be above 80%.
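
The thresholds in 8.1.1 and 8.1.2 lend themselves to an automated acceptance check. The sketch below is illustrative: the dictionary keys and function name are hypothetical, and treating a value exactly equal to the threshold as passing is an assumption, since the wording "above" leaves the boundary case open.

```python
# Minimum recognition accuracy in percent, from 8.1.1 and 8.1.2.
MIN_ACCURACY = {
    "printed": 95.0,      # Chinese, numeric, and English printed characters
    "signature": 90.0,    # commonly used signatures
    "handwritten": 80.0,  # handwritten characters
}


def meets_accuracy(kind, accuracy_percent):
    """Return True when the measured accuracy reaches the threshold for `kind`."""
    # Assumption: the boundary value itself counts as passing.
    return accuracy_percent >= MIN_ACCURACY[kind]
```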

8.2 Strong Noise Resistance

8.2.1 Archive OCR should have strong resistance to noise, effectively shielding significant noise interference during the recognition process.

8.2.2 Archive OCR should accurately identify stains, lines, black edges, paper degradation spots, water stains, and stitching holes on digital copies of paper archives to improve recognition accuracy.

8.3 Recognition Speed

8.3.1 Recognition speed indicators should apply simultaneously with recognition accuracy indicators.

8.3.2 Under mainstream computer hardware and software platforms, the recognition speed for Chinese characters on A4 paper should not be less than 1000 characters/second, and the recognition speed for English should not be less than 2000 characters/second.
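
Recognition speed, as defined in 3.7, is the character count divided by elapsed time, so the limits in 8.3.2 can be checked as follows. The function and key names are hypothetical, and how the elapsed time is measured (for example, whether preprocessing is included) is left open here because the standard does not say.

```python
# Minimum recognition speed in characters per second, from 8.3.2.
MIN_SPEED = {"chinese": 1000, "english": 2000}


def recognition_speed(char_count, elapsed_seconds):
    """Characters recognized per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return char_count / elapsed_seconds


def meets_speed(char_count, elapsed_seconds, script):
    """Return True when the measured speed reaches the minimum for `script`."""
    return recognition_speed(char_count, elapsed_seconds) >= MIN_SPEED[script]
```

Because 8.3.1 ties speed to accuracy, a run would only count as passing when the accuracy requirement is met on the same material.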

8.4 Layout Restoration Degree

8.4.1 Accurate restoration of complex layouts should be achieved, using column technology and intelligent analysis of Chinese (simplified, traditional) and English fonts, mixed text, tables, and graphics, with no manual intervention required for layout restoration after recognition.

8.4.2 The restoration degree of the recognized document layout compared to the original imported image should reach above 90%.

9 Management and Application of Archive OCR Results

9.1 Result Management

9.1.1 The logical hierarchy and correlation between each component of archive OCR results, digital copies of paper archives, and metadata should be maintained.

9.1.2 Archive OCR results saved in plain text format should use the archive number as the file name, and can be stored in a hierarchical folder structure based on the archive number or unified with the digital copies of paper archives.

9.1.3 Dual-layer PDF or OFD files that support full-text retrieval can be stored together with the corresponding digital copies of paper archives. The digital archive (room) application system should record and maintain the relationships between different file versions.

9.1.4 The file management permissions for archive OCR results should be the same as those for digital copies of paper archives.

9.1.5 Data backup for OCR results should be conducted simultaneously with that of digital copies of paper archives.

9.2 Result Application

9.2.1 Archive OCR results should be made available for full-text retrieval through the digital archive (room) application system, improving the efficiency of archive information retrieval.

9.2.2 The extracted archiving information and document elements from party and government documents can assist in automatic cataloging of archives, quality verification of catalogs, and accuracy verification of digital copies of paper archives.

9.2.3 Archive OCR results can be utilized in conjunction with data mining technologies for data analysis, knowledge management, and vocabulary construction.

Appendix A

(Normative Appendix)

Table A.1 OCR Recognition Requirements for Document Elements

Source: Digital Archive Management

Editor: Xu Aoxue

Reviewer: Wang Xiaowei
