Understanding Image Recognition and Machine Learning

Introduction: On June 6th at the Tsinghua Artificial Intelligence Forum, Academician Zhang Bo urged us to view the current “AI craze” with calm. Professors Wang Shengjin, Zhang Changshui, and Zheng Fang, Microsoft’s Rui Yong, and Sogou’s Wang Xiaochuan also spoke. The speeches from these academic leaders and industry guests offered a wealth of insights into the past, present, and future of artificial intelligence.

Image recognition is a core topic in artificial intelligence, and machine learning is itself a research direction within artificial intelligence. This topic is therefore likely to resonate with many readers.

What is Image Recognition?

What is image recognition? Take this image as an example. The first question: is there a streetlight in the image? In academic research, we call this question detection. The second question is to find where the streetlight is, which is called localization. The third is the classification and identification of objects: pointing out that this is a mountain, this is a tree, this is a sign, and this is a building. We may also perform scene classification on the entire image, determining the environment in which the photo was taken, for example an outdoor image related to urban life. These, essentially, are the research questions we encounter in image recognition.
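As a purely illustrative sketch of how these three questions surface in a modern toolkit, the snippet below runs a pretrained Faster R-CNN detector from torchvision on a stand-in image. The model choice, the random image, and the confidence threshold are all assumptions for illustration, not anything from the talk.

```python
import torch
import torchvision

# Load a detector pretrained on COCO (classic torchvision API).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
image = torch.rand(3, 480, 640)  # stand-in for a real street photo, values in [0, 1]

with torch.no_grad():
    pred = model([image])[0]  # the model takes a list of images

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:
        # A confident hit answers detection ("is it there?"), "boxes" answers
        # localization ("where is it?"), and "labels" answers classification.
        print(label.item(), box.tolist(), round(score.item(), 2))
```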

What Are the Applications of Image Recognition?

What are the potential uses of this research? Consider autonomous vehicles: if a car has a driver-assistance system whose camera can recognize everything in the scene, including lane lines, traffic signs, and obstacles, it would make driving easier and more convenient.

Additionally, some cameras can find where a person’s face is when the user presses the shutter halfway, focusing the lens on the face to make the image clearer.

Also, our computers often store thousands of photos; how do we organize them so that a user can quickly find a specific one? With such an image recognition system, I could simply tell the computer that I want a photo with two people in it, taken at the Summer Palace.

What Are the Difficulties in Image Recognition?

There are many difficulties in image recognition. The first is viewpoint variation. When we photograph the same object from different viewpoints, its appearance in the image changes, so the same object seen from different angles can look very different, while two different objects can look very similar. This creates a challenge for image recognition.

The second difficulty is the scale issue. Objects appear larger when closer and smaller when farther away in images, which complicates image recognition.

Changes in light and shadow have always been a significant concern in computer vision, and they represent the third difficulty in image recognition. The same person can look completely different under different lighting conditions.

The fourth difficulty is complex backgrounds. It is very challenging to find a person with a cane or a person wearing a hat against a complex background.

The fifth difficulty is occlusion. Occlusion is a significant concern in computer vision. For example, in a crowded image, we might identify a girl with brown hair wearing a short-sleeved shirt. Humans are quite skilled at this, but computers are not yet capable of doing it.

The sixth difficulty is deformation. Non-rigid objects deform while moving. The appearance of the same horse can vary greatly under different conditions.

The History of Image Recognition Development

Image recognition began with single-object recognition. The image above shows results from traditional geometric methods. The objective world is complex and diverse, so how do we approach recognition? As is common in scientific research, we start with very simple problems, for example block recognition, because blocks have standardized shapes. The image above shows a simple razor that has been recognized: recognizing simple geometric shapes such as rectangles, squares, and triangles enables effective detection and recognition of razors and similar tools. Another approach is appearance-based recognition, which ignores the geometric structure of the object and looks only at its appearance; the examples here are face detection.

Research on face recognition has been ongoing for a relatively long time, starting around the 1970s. Even today, there is still a significant amount of research published on face recognition.

Another topic is handwritten digit recognition. Handwritten digit recognition may seem simple, but it has sparked numerous research methods, yielding many results and proving to be an interesting topic. Other topics include vehicle detection. I am only listing a few here; there were also fingerprint recognition, OCR (Optical Character Recognition), and more during that time. Some research had already progressed to the level of commercialization, including OCR and fingerprint recognition.
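To give a sense of how approachable the handwritten-digit task has become, here is a tiny sketch using scikit-learn's bundled 8×8 digit images, a small stand-in for classic datasets such as MNIST. The choice of an SVM and its parameters are illustrative, not a claim about the historical methods the talk mentions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1,797 images of handwritten digits, each 8x8 pixels, flattened to 64 features.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(gamma=0.001).fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")  # typically above 0.98
```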

Before 2000, image recognition employed geometric, structural, and rule-based methods, as well as some relatively simple pattern recognition methods.

What happened in the field of machine learning in the late 1980s and 1990s? During this period, machine learning developed rapidly and produced remarkable research results, including support vector machines, AdaBoost, and computational learning theory. These advances greatly propelled machine learning and recognition forward. After 2002, the Chinese scientist Fei-Fei Li and her collaborators took a new approach to image recognition. Rather than crafting a specialized method for each recognition task, they aimed to design a unified framework, one they hoped could recognize thousands of objects. They also sought to bring the outstanding results from machine learning into image recognition, borrowing methods from text analysis such as the “bag of words” method.

What is the “bag of words” method? For example, to recognize a face, we do not consider the complex structure of the face; we only check for the presence of a nose, eyes, mouth, and chin. If these components are present together, we classify it as a face. You might think this is simple.

This method originated from text research. In natural language processing, there is a task of text classification. In text classification, the “bag of words” method is employed.

For example, if we have an article, we want to know its category. Is it discussing military matters or science? One approach is to understand each sentence, parse it, and grasp its grammatical structure to comprehend its content. However, parsing sentences is challenging, and we often do not do well, so we do not use this method. Instead, we can use a simpler approach: we only look at the frequency of words that appear in the article. High-frequency words in the article might include: vision, perception, brain, nerve, cell, leading us to classify the article as belonging to neuroscience. Another article may have high-frequency words such as: China, trade, export, import, bank, currency, indicating it belongs to economics. This method does not require analyzing and parsing the grammatical structure of sentences and paragraphs; it merely aggregates these high-frequency words into a “bag of words”.
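A toy sketch of the text classification just described, with invented two-class training documents; scikit-learn's CountVectorizer does the word counting and a Naive Bayes classifier does the rest:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy documents for the two categories described above.
train_docs = [
    "vision perception brain nerve cell cortex",
    "brain nerve cell perception memory",
    "china trade export import bank currency",
    "bank currency market trade economy",
]
train_labels = ["neuroscience", "neuroscience", "economics", "economics"]

# CountVectorizer turns each document into a vector of word counts: grammar and
# word order are discarded, exactly as the bag-of-words idea prescribes.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)
print(clf.predict(["export figures and currency policy at the central bank"]))
# -> ['economics']
```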

How can this method be applied to image recognition? When recognizing images, we can likewise aggregate “high-frequency words” from the image. What are the “words” here? Intuitively, they are small image blocks. For instance, a face image may contain blocks resembling skin or eyes, while a bicycle image will contain blocks from the seat and frame. These image blocks are the “words”. In reality, the “words” in images are not as clean as this description suggests; they are very small image patches, 3×3, 5×5, or 7×7 pixels, and such small patches do not convey highly abstract semantics.
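A rough sketch of this visual bag-of-words pipeline: small patches from many images are clustered into a “vocabulary”, and each image then becomes a histogram of patch-cluster counts. The patch size, cluster count, and stand-in images are illustrative choices, not the specific settings used in the research described.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(img, size=5, stride=5):
    """Cut a grayscale image into flattened size x size patches."""
    h, w = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(0, h - size + 1, stride)
                     for j in range(0, w - size + 1, stride)])

# 1) Build the visual vocabulary by clustering patches from many training images.
train_imgs = [np.random.rand(50, 50) for _ in range(20)]  # stand-in images
all_patches = np.vstack([extract_patches(im) for im in train_imgs])
vocab = KMeans(n_clusters=100, n_init=10).fit(all_patches)

# 2) Represent any image as a histogram over the 100 "visual words".
def bow_histogram(img):
    words = vocab.predict(extract_patches(img))
    return np.bincount(words, minlength=100)

# These histograms then feed a standard classifier such as an SVM.
```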

This method has led to many interesting related papers being published. However, it also has drawbacks. In the field of image recognition, there is an object recognition competition where participants are given images to design and train their algorithms. During the competition, new images are provided, and the algorithm must identify the category of each image. If the correct category is among the top five predictions, it is considered accurate; otherwise, it is counted as incorrect. In 2010, the first-place result in this competition was 72%, and in 2011, it was 74%. We know that with so many excellent teams and resources globally, the annual progress is only about 1% to 2%.
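For concreteness, here is a small sketch of that top-5 scoring rule; the array shapes and random data are invented for illustration:

```python
import numpy as np

def top5_accuracy(scores, true_labels):
    # scores: (n_images, n_classes) model confidences; true_labels: (n_images,)
    top5 = np.argsort(scores, axis=1)[:, -5:]           # five highest-scored classes
    hits = [t in row for t, row in zip(true_labels, top5)]
    return float(np.mean(hits))

scores = np.random.rand(1000, 1000)          # fake scores over 1,000 classes
labels = np.random.randint(0, 1000, 1000)    # fake ground truth
print(top5_accuracy(scores, labels))         # random guessing: about 0.005
```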

Since 2000, what has machine learning been working on? It has continued to conduct fundamental research, yielding numerous outstanding results. In 2006, Hinton published an article in Science introducing their deep learning methods. Someone suggested that Hinton try these methods on the object recognition problem. As a result, in the 2012 competition, his team achieved first place with an 85% recognition rate. Once everyone realized how effective this method was, there was a rush to apply it to their respective problems. Why is AI so popular now? This is primarily the reason.

Zhang Changshui’s laboratory also used this method to recognize traffic signs as part of a project funded by the National Natural Science Foundation. They put in a lot of effort, and the results were excellent, reaching a practical level.

Challenges Ahead and Future Research Directions

It appears that image recognition has made great strides, and many people are optimistic and excited. However, image recognition has not yet reached its full potential. What challenges remain? Let us consider a few examples.

For instance, when we conduct image recognition, we typically need labeled data indicating that this is a bird and this is a cat, and then use these images for training. Labeling data is a tedious task that is time-consuming and costly. The first dataset collected by Fei-Fei Li’s group contained 101 object categories. This image library was so well constructed that some algorithms achieved recognition rates above 99% on it; people said the images were too clean and the variety too limited. The group later built a new database with 256 object categories whose images were less carefully aligned, but even these databases were still too small.

In 2009, Fei-Fei Li’s team released a new database called ImageNet, which contains tens of millions of images.

Labeling data is a challenging task. For example, this database requires enclosing each object in a bounding box and providing a category label. These are some typical images, and every object needs to be framed and labeled correctly.
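To make the labeling format concrete, here is one hypothetical annotation record of the kind described, with an invented file name and boxes; the `(x, y, width, height)` convention is an assumption for illustration:

```python
# One hypothetical annotation record; "bbox" is (x, y, width, height) in pixels.
annotation = {
    "image": "street_0001.jpg",
    "objects": [
        {"label": "car",         "bbox": [120, 200, 180, 90]},
        {"label": "streetlight", "bbox": [340, 40, 25, 160]},
        {"label": "person",      "bbox": [60, 210, 45, 120]},
    ],
}
# Producing millions of records like this by hand is exactly the labeling
# cost the talk describes.
```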

There is also an image database called LabelMe. The image above is one of its images, labeled very precisely, including the shapes and outlines of houses, windows, cars, grass, and roads. It contains over 100,000 images, of which about 10,000 are labeled very thoroughly. Professor Zhang once complimented a student from MIT on the impressive database and the effort it must have taken to create. The student replied that another student had built it, and that in reality most of the images were labeled by that student’s mother, who was retired and spent her days labeling data. What a dedicated mother!

Another remarkable Chinese scientist, Zhu Songchun, suggested that we should label images more precisely. For instance, in this image, the chair can be labeled very specifically, detailing the seat, backrest, and leg contours accurately. They labeled various types of chairs. They hired dozens of artists to label data continuously for years, yet the database only contained a few hundred thousand images. Therefore, labeling data is an expensive endeavor. This has led machine learning researchers to consider whether it is possible to improve image recognition without the extensive effort of labeling data. For example, in this image, if you simply tell me there is a motorcycle, I can detect and recognize the motorcycle without needing to specify where it is.
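One family of techniques pointing in this direction is weakly supervised localization. Below is a minimal sketch using class activation maps (CAM), which is one common approach to "the label says motorcycle, the model finds where", not necessarily the method the speaker had in mind; the pretrained ResNet and every detail here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision

# A standard classifier trained only with image-level labels.
model = torchvision.models.resnet18(pretrained=True).eval()

def class_activation_map(image, class_idx):
    # image: (1, 3, H, W). Take the feature maps just before global pooling.
    features = torch.nn.Sequential(*list(model.children())[:-2])(image)  # (1, 512, h, w)
    # The final linear layer's weights say how much each channel supports the class.
    weights = model.fc.weight[class_idx]                                 # (512,)
    cam = F.relu(torch.einsum("c,bchw->bhw", weights, features))
    return F.interpolate(cam[:, None], size=image.shape[-2:], mode="bilinear")

# High values in the map mark the regions the classifier used as evidence,
# giving a rough location even though no box was ever provided in training.
```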

Many problems remain unsolved. For instance, our current technology only performs some analysis on images, identifying parts of the image as a bird or a tree, but it does not provide a deeper understanding of the image. For example, the algorithm does not understand the relationships between these objects. Understanding the relationships between objects is crucial for comprehending an image.

In this regard, I would like to introduce a project we are working on called image captioning. Several organizations, including Microsoft, Baidu, and Google, are engaged in this work; the results shown here are from our laboratory. What is image captioning? Given an image, the goal is to generate a natural language sentence describing it. Crucially, the training data does not specify where the dog is or which region it occupies; it only provides a sentence such as “a dog is holding a frisbee”. We designed a model and trained it on over 80,000 images with corresponding natural language sentences, so that given a new image it generates a descriptive sentence. For example, a generated sentence for one image was: “A train is parked on the tracks next to the train station.” Another: “A group of zebras is standing closely together.” And another: “A dog is holding a frisbee.”

During this work we also obtained some interesting results. The model includes a visual attention mechanism that can automatically identify relevant image blocks. When generating the sentence “A brown cow is standing in the grass,” the words brown, cow, and grass correspond to the correct image blocks. Note that the training data never told the algorithm which block is the cow and which is the grass, which indicates that the algorithm learned these concepts on its own. If that is the case, can it identify other concepts as well? We examined other words and their corresponding image blocks: the blocks for the concepts fire hydrant and black cat were correctly identified. Interestingly, beyond nouns, the program also identified concepts associated with verbs; for instance, “fill with” corresponds to images of containers filled with things.
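Below is a minimal sketch of a captioning decoder with visual attention, in the spirit of the system described above; the architecture, layer sizes, and the GRU-based decoder are illustrative assumptions, not the lab's actual model. The attention weights over image blocks are what let such a model "point at" the cow or the grass while emitting those words.

```python
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hidden=512, embed=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.attn = nn.Linear(feat_dim + hidden, 1)
        self.rnn = nn.GRUCell(embed + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, N, feat_dim) -- CNN features for N image blocks
        # tokens: (B, T) -- caption word ids (teacher forcing)
        B, N, _ = feats.shape
        h = feats.new_zeros(B, self.rnn.hidden_size)
        logits, attentions = [], []
        for t in range(tokens.size(1)):
            # Score each image block against the current decoder state.
            scores = self.attn(torch.cat([feats, h[:, None].expand(-1, N, -1)], -1))
            alpha = scores.softmax(dim=1)              # (B, N, 1): where to look
            context = (alpha * feats).sum(dim=1)       # attended visual summary
            h = self.rnn(torch.cat([self.embed(tokens[:, t]), context], -1), h)
            logits.append(self.out(h))
            attentions.append(alpha.squeeze(-1))
        # Returning the attention maps lets us inspect which blocks each word used.
        return torch.stack(logits, 1), torch.stack(attentions, 1)
```

Visualizing the returned attention maps per generated word is what produces the "brown/cow/grass point at the right blocks" effect described in the talk.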

This result is fascinating and resembles how a child learns as they grow. We teach a toddler, saying, “This is a table,” or “This is a laser pointer.” We do not, and cannot, explain that “a” is a quantifier and “table” is the key word. Yet, the child gradually learns these concepts. Our algorithm learns in a similar manner.

We have listed the achievements made so far. However, many, many problems remain unresolved. For example, while we can now recognize a bouquet of flowers, the algorithm cannot say where the leaves, the flower centers, or the petals are. So although it seems we have achieved a great deal, and in some sense can turn these achievements into products that serve us, we also see many problems that have not been well solved, and these require our continued effort.

This article mainly discusses images. However, we see that many of the methods used are machine learning methods. Therefore, it is the combined efforts of machine learning researchers and computer vision scientists that have led to the current achievements. We can convert these achievements into products, making our lives a little more intelligent.

This content is excerpted from Professor Zhang Changshui’s speech titled “Image Recognition and Machine Learning” at the Tsinghua University Artificial Intelligence Forum.

Source: DataPi (datapi)

Compiled by: Li Kenan · Proofread by: Guo Xinrui · Edited by: Zhang Meng
