Applications and Development Trends of Image Recognition Technology

Background of Image Recognition Technology

The growth of the mobile internet, smartphones, and social networks has produced a massive volume of images. According to an article from BI in May, about 60 million images are uploaded to Instagram every day; in February of this year, WhatsApp users sent 500 million images a day; and WeChat Moments in China is likewise driven by image sharing. Images are not bound by region or language, and they are gradually replacing lengthy, nuanced text as the main medium for conveying meaning. There are two main reasons why images have become the main medium of information exchange on the internet:

First, from the perspective of users’ information reading habits, compared to text, images can provide users with more vivid, easily understandable, interesting, and artistic information;

Second, regarding the source of images, smartphones have brought us convenient shooting and screenshotting methods, helping us to collect and record information faster through images.

However, as images become the main information carrier on the internet, challenges arise. When information is recorded in text, we can easily find the required content through keyword searches and edit it at will. But when information is recorded in images, we cannot search the content within the images, which affects our efficiency in finding key content from images. Images provide us with a quick way to record and share information, but they reduce our efficiency in information retrieval. In this environment, computer image recognition technology becomes particularly important.

Image recognition is a technology that processes, analyzes, and understands images in order to recognize targets and objects of various patterns. The recognition process includes image preprocessing, image segmentation, feature extraction, and matching against known classes to reach a decision. In simple terms, image recognition is about getting computers to understand the content of images the way humans do. With the help of image recognition technology, we can not only obtain information more quickly through image searches but also create a new way to interact with the external world, and even make the external world operate more intelligently. Baidu’s Robin Li said in 2011 that “a new era of image reading has arrived.” Now, as image recognition technology keeps advancing, more and more technology companies are entering the field, marking the official arrival of the image reading era and leading us toward a more intelligent future.
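The four stages named above can be sketched on a toy example. This is only a minimal illustration in plain Python, not how any product mentioned in this article works: the function names, the two features (fill ratio and aspect ratio), and the reference table are all assumptions made for the sketch.

```python
# Toy pipeline: preprocessing -> segmentation -> feature extraction -> matching.
# Every name and value here is hypothetical and chosen for illustration.

def preprocess(image):
    """Scale 0-255 gray levels to the range 0.0-1.0."""
    return [[px / 255.0 for px in row] for row in image]

def segment(image, threshold=0.5):
    """Mark pixels brighter than the threshold as foreground (1)."""
    return [[1 if px > threshold else 0 for px in row] for row in image]

def extract_features(mask):
    """Describe the foreground by (fill ratio, aspect ratio) of its bounding box."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return (0.0, 0.0)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return (len(coords) / (height * width), width / height)

def match(features, references):
    """Return the label whose reference features are nearest (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(references, key=lambda label: dist(features, references[label]))

# Hand-written reference features for three simple shapes (assumed values).
REFERENCES = {
    "square": (1.0, 1.0),
    "horizontal bar": (1.0, 4.0),
    "plus sign": (5 / 9, 1.0),
}

def recognize(image):
    return match(extract_features(segment(preprocess(image))), REFERENCES)

bar = [
    [10, 10, 10, 10, 10, 10],
    [10, 240, 240, 240, 240, 10],
    [10, 10, 10, 10, 10, 10],
]
plus = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
print(recognize(bar))   # horizontal bar
print(recognize(plus))  # plus sign
```

Real systems replace each stage with something far more capable, but the division of labor is the same: clean the input, isolate the object, describe it with numbers, and compare those numbers with what is already known.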

Primary Stage of Image Recognition—Entertainment and Tool Utilization

In this stage, users mainly use image recognition technology to meet entertainment needs or as a tool. For example, the “Star Match” feature in Baidu Magic Photo helps users find the celebrity they most resemble; Baidu’s image search finds similar images; Facebook developed DeepFace to match faces across photos; IQ Engines, an image recognition company acquired by Yahoo, developed Glow, which automatically tags photos through image recognition to help users manage the photos on their phones; the Chinese startup Megvii Technology set up the VisionHacker game studio to develop motion-sensing games for mobile devices using image recognition; and Chuangshi New Technology used image recognition to build a machine vision surface inspection system.

Another very important subfield in this stage is OCR (Optical Character Recognition): an optical device scans characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates those shapes into computer text using character recognition methods. In other words, it is how a computer reads text. Language and text are the most basic and important means by which we obtain information. In the digital world, we can easily obtain and process text through the internet and computers. Once text is presented as an image, however, obtaining and processing it becomes much harder. This applies in two cases: text that is stored as images in the digital world for one reason or another, and all the text in physical form that we see in real life. We therefore need OCR technology to extract this text and information. Products in China in this area include Baidu’s Tu Shu Notebook and Baidu Translate; Google, for its part, uses a large distributed neural network trained with DistBelief to recognize house numbers in Google Street View with an accuracy of over 90%, identifying millions of house numbers every day.
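The idea of detecting dark and light patterns and translating shapes into characters can be illustrated with simple template matching. The sketch below is an assumption-laden toy: it knows only three hand-drawn 3x5 letters, assumes fixed-width glyphs separated by one blank column, and bears no relation to the neural networks used by the products above. The `render` helper exists only to produce a test image.

```python
# Toy OCR by template matching. Templates, sizes, and helpers are hypothetical.

TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def binarize(gray, threshold=128):
    """Dark ink on light paper: pixels darker than the threshold become '#'."""
    return ["".join("#" if px < threshold else "." for px in row) for row in gray]

def split_glyphs(rows, width=3, gap=1):
    """Cut a text line into fixed-width glyph cells separated by a gap column."""
    step = width + gap
    return [[row[i:i + width] for row in rows] for i in range(0, len(rows[0]), step)]

def classify(glyph):
    """Pick the template with the fewest mismatching pixels."""
    def mismatches(template):
        return sum(a != b
                   for g_row, t_row in zip(glyph, template)
                   for a, b in zip(g_row, t_row))
    return min(TEMPLATES, key=lambda ch: mismatches(TEMPLATES[ch]))

def read_line(gray):
    """Binarize the image, split it into glyphs, and classify each one."""
    return "".join(classify(g) for g in split_glyphs(binarize(gray)))

def render(text, ink=20, paper=230):
    """Test helper: draw text with the templates to obtain a grayscale image."""
    rows = []
    for r in range(5):
        line = ".".join(TEMPLATES[ch][r] for ch in text)
        rows.append([ink if c == "#" else paper for c in line])
    return rows

scan = render("LIT")
scan[0][0] = 230  # simulate one faded pixel in the top-left corner of the "L"
print(read_line(scan))  # LIT
```

Because classification picks the nearest template rather than demanding an exact match, a single damaged pixel does not change the result here. Text in real photographs varies in font, size, angle, and lighting, which is why practical OCR relies on learned models rather than fixed templates.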

In this stage, image recognition technology is merely an auxiliary tool. It supports and strengthens human vision and offers us a completely new way to interact with the external world. We can search for key information inside images; we can quickly look up all kinds of information about an unfamiliar object simply by taking a picture of it; we can take a photo of a potential romantic interest and check their social network in advance; and we can use facial recognition as a primary means of identity authentication… These applications may seem ordinary, but once image recognition technology permeates every aspect of our habits, we will in effect have outsourced part of our vision to machines, just as we have outsourced part of our memory to search engines.

This will greatly improve our interaction with the external world. Previously, our process of exploring the external world using technological tools was as follows: the human eye captures target information, the brain analyzes the information, converts it into keywords that machines can understand, and interacts with machines to obtain results. However, when image recognition technology gives machines “eyes,” this process can be simplified to: the human eye captures target information with the help of machines, and the machine and the internet directly analyze the information and return results. Image recognition makes cameras the key to decrypting information; we only need to point the camera at an unknown object to get the expected answer. As Baidu scientist Yu Kai said, the camera has become one of the important entrances connecting people and world information.

Advanced Stage of Image Recognition—Machines with Vision

As mentioned above, current image recognition technology serves as a tool that helps us interact with the external world. It only provides auxiliary support for our vision, and all actions still have to be carried out by us. When machines truly possess vision, they may be able to take over these actions entirely. Today’s applications of image recognition are like a guide dog that leads a blind person along the way; future image recognition technology will combine with other artificial intelligence technologies to become a full-time butler who does everything on the blind person’s behalf, without requiring any action from them. To take another example, image recognition as a tool is like wearing Google Glass while driving: the device analyzes external information and conveys it to us, and we then make driving decisions based on it. Image recognition as part of machine vision and artificial intelligence is like Google’s self-driving car: the machine not only acquires and analyzes external information but also takes over all driving activities, freeing us completely.

The book “Artificial Intelligence: A Modern Approach” notes that, in artificial intelligence, perception provides machines with information about the world they inhabit by interpreting the responses of sensors. The modes of perception that machines share with humans include vision, hearing, and touch, and vision is the most important of these because it underlies all action. At a forum, Yu Kai of Baidu IDL asked the audience which sense they considered the most important, and no one could answer quickly. He then rephrased the question: if you had to give up one sense, which would you be least willing to lose? This time everyone answered vision. In “Making up the Mind,” Chris Frith writes that our perception of the world is not direct but relies on “unconscious inference”: before we can perceive an object, the brain must infer what that object might be from the information reaching the senses. This inference underpins our most important ability, that of predicting and handling sudden events. Vision is the most timely and accurate channel for acquiring information in this process, accounting for 80% of the information we sense. The significance of machine vision for artificial intelligence is akin to that of vision for humans, and the technology that determines machine vision is image recognition.

More importantly, in certain application scenarios machine vision has advantages over human vision: it is more accurate, objective, and stable. Human vision has inherent limitations. We seem to perceive the world immediately and effortlessly, and to take in the entire visual scene vividly and in detail, but this is an illusion. We see in detail and in color only the part of the visual scene that falls on the center of the eye. About 10 degrees away from the center, the nerve cells are more sparsely spread and less able to detect light and shade. In other words, the edges of our visual world are colorless and blurry. As a result, we experience “change blindness”: while focusing on one thing, we overlook other things happening at the same time and remain unaware of them. Machines have the advantage here, since they can detect and record everything that happens within their field of view. Take video surveillance, the most widely deployed application. Traditional monitoring requires someone to stay highly vigilant in front of a wall of screens and draw conclusions from their own reading of the video, which is often undermined by fatigue, visual limitations, and distraction. With mature image recognition technology backed by artificial intelligence, computers can analyze and judge video autonomously and raise alerts for anomalies directly, bringing greater efficiency and accuracy. In counter-terrorism, machine facial recognition far outperforms subjective human judgment.
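One of the simplest ways a computer can flag an anomaly in surveillance video is frame differencing: compare each frame with the previous one and raise an alert when too many pixels change. The sketch below assumes grayscale frames stored as nested lists, and its two thresholds (`pixel_delta` and `alert_fraction`) are arbitrary values picked for the example. It stands in for the far more sophisticated analysis described above.

```python
# Toy anomaly alert by frame differencing. Thresholds and data are hypothetical.

def changed_fraction(prev, curr, pixel_delta=30):
    """Fraction of pixels whose gray level changed by more than pixel_delta."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev, curr):
        for a, b in zip(prev_row, curr_row):
            total += 1
            if abs(a - b) > pixel_delta:
                changed += 1
    return changed / total if total else 0.0

def detect_anomalies(frames, alert_fraction=0.1):
    """Return the indices of frames that differ sharply from the frame before."""
    alerts = []
    for i in range(1, len(frames)):
        if changed_fraction(frames[i - 1], frames[i]) > alert_fraction:
            alerts.append(i)
    return alerts

# A 4x4 scene: an empty room, slight lighting flicker, then a bright object appears.
empty = [[50] * 4 for _ in range(4)]
flicker = [[55] * 4 for _ in range(4)]
intruder = [row[:] for row in empty]
for r in range(1, 3):
    for c in range(1, 3):
        intruder[r][c] = 200

frames = [empty, flicker, empty, intruder, intruder]
print(detect_anomalies(frames))  # [3]
```

The small flicker stays below the per-pixel threshold and is ignored, while the frame in which the object appears triggers an alert. Unlike a human operator, this check runs identically on every frame and does not tire, which is the advantage the paragraph above describes.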

Many technology giants have also begun to invest in image recognition and artificial intelligence. Yann LeCun, the artificial intelligence expert hired by Facebook, has made significant contributions to image recognition. The convolutional neural networks he proposed, of which LeNet is representative, have achieved good results in a variety of image recognition tasks and are regarded as a representative general-purpose image recognition system. Google, using a neural network trained with “DistBelief,” learned the key features of cats from millions of YouTube videos, a case of a machine grasping the concept of a cat without human help. It is worth mentioning that Andrew Ng, who led this project, has since moved to Baidu to head the Baidu Research Institute, where one of his main research directions is artificial intelligence and image recognition. This also shows how much importance Chinese technology companies attach to image recognition and artificial intelligence.
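The core operation of a convolutional neural network can be shown in a few lines. The sketch below slides a small kernel over an image, applies a nonlinearity, and pools the result. It is an illustration under assumptions, not LeNet itself: the kernel is set by hand rather than learned, and ReLU with max pooling are common modern choices that may differ from the original design.

```python
# Minimal convolution -> nonlinearity -> pooling step in plain Python.
# The kernel values and the tiny image are hypothetical.

def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(fmap):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; leftover rows and columns are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]

# Dark left half, bright right half: a vertical edge down the middle.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
VERTICAL_EDGE = [[-1, 1], [-1, 1]]  # responds to dark-to-bright transitions

feature_map = relu(conv2d(image, VERTICAL_EDGE))
print(feature_map)            # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
print(max_pool(feature_map))  # [[2]]
```

The feature map lights up only where the edge is. In a trained network the kernel values are learned from data instead of being written by hand, and many such layers are stacked so that later layers respond to increasingly complex patterns.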

Image recognition technology connects machines with this unknown world, helping them to understand this world better, and ultimately replacing us in completing more tasks.
