Fudan PhD Uses OCR and Regex to Verify Nucleic Acid Reports in 2 Minutes

Source: Big Data Digest

This article is about 2000 words long and is recommended for a 5-minute read.
800 images take only 2 minutes, and the program has been packaged.


According to recent reports from Fudan University, Li Xiaokang, a PhD student in the School of Information Science and Engineering, used OCR and regular expressions to help the college verify hundreds of screenshots confirming completed nucleic acid tests in just a few minutes, greatly improving both the efficiency and the accuracy of the verification.
This topic has also sparked a lot of discussion on Zhihu, with over 3 million views so far.
Using OCR and Regular Expressions for Epidemic Prevention
First, we need to briefly introduce OCR.
OCR, short for Optical Character Recognition, is a technique for automatically converting text in images into machine-readable text.
It typically captures images of printed material such as documents, newspapers, books, and manuscripts by optical means such as scanning or photography, analyzes the shapes of the characters with pattern recognition algorithms, and converts the recognized characters into text that a computer can process.
Li Xiaokang stated, “OCR can recognize the text in images and convert it into text information, making it convenient for verification. Moreover, since the nucleic acid screenshots are printed text, the recognition rate is very high, almost achieving 100% accuracy.”
A single screenshot contains a lot of text information, including anonymized names, document types, document numbers, sampling times, organizations, etc., but not all information is useful. Among them, the name, sampling time, and whether sampling has been done are the most critical pieces of information that need to be retrieved.
On this basis, Li Xiaokang thought of using regular expressions in the Python language. Regular expressions use a single string to describe and match a series of strings that conform to a certain syntactic rule, and they are often used in many text editors to search for and replace text that matches a certain pattern.
“Using regular expressions allows us to filter out the desired information from the text recognized by OCR. Finally, after confirming the name, detection time, and whether sampling has been done in each screenshot, we output the results of all individuals into an Excel file for manual confirmation.”
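The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not Li Xiaokang's actual code: the field labels (`Name`, `Sampling time`, `Status`), the sample values, and the function names are all hypothetical, since the article does not show the real screenshot text or the patterns he used. The sketch also writes CSV with Python's standard library instead of a native Excel file; Excel can open the result.

```python
import csv
import re

# Hypothetical field labels for illustration; real screenshot text will differ.
NAME_RE = re.compile(r"Name[:：][ \t]*(\S+)")
TIME_RE = re.compile(
    r"Sampling time[:：][ \t]*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}(?::\d{2})?)"
)
STATUS_RE = re.compile(r"Status[:：][ \t]*(Sampled|Not sampled)")

FIELDNAMES = ["file", "name", "sampling_time", "sampled"]


def extract_fields(ocr_text):
    """Pull the three key fields out of one screenshot's OCR text.

    A value of None means the field was not found, which a human
    reviewer should treat as "needs a manual look".
    """
    name = NAME_RE.search(ocr_text)
    time = TIME_RE.search(ocr_text)
    status = STATUS_RE.search(ocr_text)
    return {
        "name": name.group(1) if name else None,
        "sampling_time": time.group(1) if time else None,
        "sampled": (status.group(1) == "Sampled") if status else None,
    }


def write_summary(rows, out):
    """Write one row per screenshot to an open text file as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)  # None values are written as empty cells
```

Each row passed to `write_summary` would combine a file name with the extracted fields, for example `{"file": "a.png", **extract_fields(text)}`. Leaving missing fields empty rather than guessing keeps the final manual confirmation meaningful.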
On the evening of March 15, Li Xiaokang spent a little over an hour writing the first version of the code, 130 lines in total, and found that it ran smoothly and efficiently. When he validated it on his own class's nucleic acid screenshots, the program processed more than 80 images in just over 20 seconds with high accuracy, and it uncovered problems that earlier manual checks had missed.
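The article does not say how the extracted names were compared with the class list. One possible approach is sketched below, under the assumption that each `*` in an anonymized name hides exactly one character. The roster, the masked names, and the function names are made up for illustration.

```python
import re


def mask_to_regex(masked):
    """Turn a masked name such as '王*明' into a compiled pattern.

    Assumption: each '*' stands for exactly one hidden character.
    """
    return re.compile(
        "".join("." if ch == "*" else re.escape(ch) for ch in masked)
    )


def check_roster(roster, masked_names):
    """Return (missing, ambiguous).

    missing:   roster names that no screenshot could belong to
    ambiguous: masked names that fit more than one roster name
    """
    patterns = {m: mask_to_regex(m) for m in masked_names}
    missing = [
        name
        for name in roster
        if not any(p.fullmatch(name) for p in patterns.values())
    ]
    ambiguous = {}
    for masked, pattern in patterns.items():
        hits = [name for name in roster if pattern.fullmatch(name)]
        if len(hits) > 1:
            ambiguous[masked] = hits
    return missing, ambiguous
```

Because a masked name can fit several students, the sketch reports ambiguous matches separately instead of silently counting them as verified; those cases would still need a manual check.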
800 Images in 2 Minutes, Program Packaged
Since early March, Fudan has been carrying out routine nucleic acid screening, and class counselors must verify that “no one is left out.”
As a PhD student in biomedical engineering whose research focuses on medical imaging and artificial intelligence, Li Xiaokang regularly works with image processing methods. He said that his original intention in developing this program was to reduce the workload for himself and his colleagues.
“The principle is quite simple, and anyone who knows how to code will understand it immediately. But people who do not do this kind of work cannot feel how much time and effort it takes, so they naturally will not think of a solution. I just used what I have learned to solve a practical difficulty in my work.”
After Li Xiaokang shared this on his social media, many colleagues in student affairs expressed great interest. He also shared the code so that teachers in need could use it promptly. “Since the program is written in Python, and the code comments are very complete, anyone who knows how to use Python can quickly get started.”
To help teachers who do not know how to program, Li Xiaokang eventually packaged the program. “When people need to use it, they just enter a single command on the command line to run it, which is very simple.”
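A one-command entry point of the kind described could look roughly like the sketch below. The program name `check_screenshots`, the `--output` option, and the accepted image suffixes are hypothetical; the article does not describe the real tool's interface or how it was packaged.

```python
import argparse
from pathlib import Path

# Assumed set of screenshot formats; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def build_parser():
    parser = argparse.ArgumentParser(
        prog="check_screenshots",  # hypothetical name
        description="Verify a folder of nucleic acid screenshots.",
    )
    parser.add_argument(
        "folder", type=Path, help="folder containing the screenshots"
    )
    parser.add_argument(
        "-o",
        "--output",
        type=Path,
        default=Path("results.csv"),
        help="where to write the summary table",
    )
    return parser


def list_images(folder):
    """Return the image files in the folder, sorted by path."""
    return sorted(
        p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )


def main(argv=None):
    args = build_parser().parse_args(argv)
    images = list_images(args.folder)
    # The OCR, extraction, and export steps would run here.
    print(f"Found {len(images)} screenshots; results go to {args.output}")
    return 0

# In a real script, finish with:
#   if __name__ == "__main__": raise SystemExit(main())
```

With that guard in place, a teacher would run something like `python check_screenshots.py ./screenshots -o results.csv`, where both the script name and the paths are placeholders.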
The program is currently in use at the college, and Li Xiaokang has let other teachers try it for verification. Checking 800 screenshots, which used to take several people more than an hour, now takes only 2 minutes.
Netizens: This Insight and Problem-Solving Ability is Worth Acknowledgment
On Zhihu, many netizens have also expressed considerable admiration for this.
For example, Zhihu user @AimiBritni stated that this product itself does not have particularly outstanding features, but “this insight and problem-solving ability is something we should learn from.”
At the same time, many netizens have contributed their own ideas regarding epidemic prevention and control. For instance, Zhihu user @Daiming wrote:
If the contact tracing personnel had a professional application, they could scan the health code of positive cases, automatically identify the personal information of positive cases, and at the same time, the program would call data from various systems such as public security, industry and information technology, and payment to grasp the movement trajectory of positive cases in various locations. First, generate a semi-finished movement trajectory based on the data already mastered by various departments, which is not visible to the contact tracing personnel. Then, generate a form based on time, location, and other factors at the front end of the application. The contact tracing personnel can fill in the information not available in the big data by asking the positive cases. When filling in, the locations are automatically linked to the standard place names in the national name database, and then a primary contact tracing information report is generated with one click. The contact tracing personnel then complete the confirmation of the contact tracing information using the currently adopted verification method, generating the final contact tracing report.
Link:
https://www.zhihu.com/question/526681561/answer/2431023725
Behind all this, however, one basic fact cannot be ignored: after years of epidemic prevention, there is still no nationally unified public API for nucleic acid data. A small feature that poses no technical difficulty has not been built or offered by any health code system.
Epidemic prevention is indeed important, but how to combine digitalization with epidemic prevention so that volunteers can spend their time on more meaningful work and service is also a question worth considering.
Related Reports:

https://mp.weixin.qq.com/s/l8u9JifKDlRDoz32-jZWQg

https://mp.weixin.qq.com/s/RogQcUAsZszW5HkYwYcV-w

Editor: Wenjing