Image Recognition and Machine Learning Insights

Introduction: On June 6, at the Tsinghua Artificial Intelligence Forum, Academician Zhang Bo warned us to calmly face the current “artificial intelligence fever”. Professor Wang Shengjin, Professor Zhang Changshui, Professor Zheng Fang, Microsoft’s Rui Yong, and Sogou’s Wang Xiaochuan each gave speeches. The brilliant talks by academic masters and industry guests sparked a wealth of insights about the past, present, and future of artificial intelligence.

This article is based on the speech titled “Image Recognition and Machine Learning” given by Professor Zhang Changshui of the Department of Automation at Tsinghua University at the “Tsinghua Artificial Intelligence” forum.

Compiled by: Li Kenan

Proofread by: Guo Xinrui

Edited by: Zhang Meng

Image recognition is a core topic in the field of artificial intelligence. From a research perspective, machine learning is also a research direction under artificial intelligence. Therefore, this topic is likely to resonate with everyone.

1. What is Image Recognition?

What is image recognition? Taking this image as an example, the first question is: is there a streetlight in this image? In academic research, we call this question detection. The second question is to locate the position of the streetlight, which is called localization. The third question is object classification and recognition: identifying this as a mountain, this as a tree, this as a sign, that as a building. We may also perform scene classification for the entire image, determining what environment the photo was taken in; it could be an outdoor image, related to urban life, and so on. Essentially, these are some of the research questions involved in image recognition.
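To make the distinction between these questions concrete, here is a minimal sketch of the kind of output each task produces; the names, boxes, and scores are hypothetical and are not from the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    label: str                      # object category, e.g. "streetlight"
    box: Tuple[int, int, int, int]  # (x, y, width, height): the localization
    score: float                    # detector confidence

# Detection answers "is there a streetlight, and where is it?"
detections: List[Detection] = [
    Detection(label="streetlight", box=(412, 87, 30, 160), score=0.92),
    Detection(label="sign", box=(120, 200, 45, 45), score=0.81),
]

# Classification/recognition assigns a category to each detected object.
object_labels = [d.label for d in detections]

# Scene classification labels the whole photo rather than individual objects.
scene_label = "outdoor, urban street"
```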

2. What are the Applications of Image Recognition?

What are the potential uses of this research? For example, in self-driving cars: if a car has an auxiliary system equipped with a camera that can recognize all situations in this scene, including lane lines, traffic signs, obstacles, etc., it can make driving easier and more convenient.

Additionally, some cameras can find the location of a person’s face when a user presses the shutter halfway, focusing the lens on the face to make the image clearer.

Moreover, our computers often have thousands of photos; how can we organize them so that users can quickly find a specific photo? If there is such an image recognition system, I might tell the computer that I want to find a photo with two people, taken at the Summer Palace.

3. What are the Difficulties in Image Recognition?

Image recognition has many challenges. The first difficulty is viewpoint variation. Photos of the same object taken from different viewpoints can look very different, while two different objects may appear very similar, and this poses a challenge for image recognition.

The second challenge is the scale issue. Objects appear larger when closer and smaller when farther away; this presents certain difficulties in image recognition.

The third difficulty of image recognition is variation in lighting and shadows, which has always been a significant concern in computer vision. The same person can appear very different under different lighting conditions.

The fourth challenge is the complexity of backgrounds. In a complex background, it is very difficult to find a person with a cane or someone wearing a hat.

The fifth challenge is occlusion. Occlusion is a significant concern in computer vision. For example, in a crowded image, we might recognize that this is likely a girl: she has brown hair and is wearing a short-sleeved shirt. Humans have a strong ability to recognize gender in such cases, but computers cannot do this yet.

The sixth challenge is deformation. Non-rigid bodies undergo deformation during motion. The same horse can appear very different under different conditions.

4. The Development History of Image Recognition

Image recognition began with single object recognition. The image above shows results from traditional geometric methods. Our objective world is complex and diverse, so how should we approach recognition? We start with very simple problems, a common approach in scientific research. For instance, we begin with block recognition, since blocks have a few standardized shapes. The image above shows a simple razor that has been recognized: these artificially standardized geometric combinations can be recognized fairly easily by identifying rectangles, squares, triangles, and so on. Another approach is appearance-based recognition: we do not consider the geometric structure of the object at all, only how it appears. Here are examples of face detection.

The research history of face recognition is relatively long, dating back to the 1970s. Even now, many research works on face recognition are still being published.

Another topic is handwritten digit recognition. Handwritten digit recognition seems simple, but the research on it has led to many methods and significant results, making it an interesting topic. Other topics include vehicle detection. I have only listed a few here. In fact, during the same period, there were also studies on fingerprint recognition, optical character recognition (OCR), etc. Some of these research works had already reached the level of commercialization, including OCR and fingerprint recognition.

Before the year 2000, image recognition relied on geometric methods, structural methods, rule-based methods, and some relatively simple pattern recognition methods.

What happened in the field of machine learning during the late 1980s and 1990s? During this period, machine learning developed rapidly and produced remarkable results, including support vector machines, the AdaBoost method, and computational learning theory. These results greatly advanced machine learning and recognition. In the period after 2002, a Chinese female scientist named Li Fei-Fei began to approach image recognition with a new idea. Her group aimed to design a unified framework for image recognition rather than developing a specialized method for each specific recognition task, hoping that this unified framework could recognize thousands of objects. They also aimed to apply outstanding results from machine learning to image recognition, and they borrowed a method from text analysis, the “bag of words” method, for image recognition.

What is the “bag of words” method? For example, when recognizing a face, we do not consider the complex structure of the face; we only look for the presence of features like a nose, eyes, mouth, and chin. If these components are together, we can say this is a face. You might think this is simple.

This method originated from text research. In natural language processing, there is a task of text classification. The bag of words method is used in text classification.

For example, suppose we have an article and want to know which category it belongs to: is it discussing military issues or scientific topics? One approach is to read and parse each sentence to understand its grammatical structure and content. However, grammatical analysis of sentences is difficult and understanding sentences is hard; this method does not work well, so we do not adopt it. In practice, we can use a simpler method: we only need to look at which words occur most frequently in the article. If the high-frequency words are vision, perception, brain, neurons, and cells, you would say the article belongs to neuroscience. If another article’s high-frequency words are China, trade, export, import, banks, and currency, you would know it belongs to the economics category. This method does not require analyzing and parsing the grammatical structure of sentences and paragraphs; it simply aggregates these high-frequency words into a “bag of words”.
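A minimal sketch of this idea, with hypothetical keyword lists that are purely illustrative: count word frequencies, ignore grammar entirely, and score each category by how often its keywords appear.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count word frequencies, ignoring grammar and word order."""
    return Counter(text.lower().split())

# Hypothetical keyword sets for two categories, purely for illustration.
categories = {
    "neuroscience": {"vision", "perception", "brain", "neurons", "cells"},
    "economics": {"china", "trade", "export", "import", "banks", "currency"},
}

def classify(text: str) -> str:
    counts = bag_of_words(text)
    # Score each category by the total frequency of its keywords in the text.
    scores = {name: sum(counts[w] for w in words)
              for name, words in categories.items()}
    return max(scores, key=scores.get)

print(classify("neurons in the brain support vision and perception of cells"))
# -> neuroscience
```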

How can we apply this method to image recognition? When recognizing images, we can also aggregate the “high-frequency words” from the image. What are the “words” here? Intuitively, they are small image blocks. For instance, when recognizing a face, the image will contain blocks such as skin or eyes. Similarly, if we are recognizing a bicycle, there will be image blocks related to the bicycle, such as the seat and frame. These image blocks are the “words”, so we can apply the “bag of words” method. In reality, the “words” in an image are not as intuitive as this suggests; they are very small image blocks, typically 3×3, 5×5, or 7×7 pixels in size, and such small blocks do not express highly abstract semantics.
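A minimal sketch of this bag-of-visual-words pipeline, assuming NumPy and scikit-learn are available; the patch size, the codebook size of 50, and the random toy images are illustrative assumptions, not details from the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_patches(image: np.ndarray, size: int = 5, stride: int = 5) -> np.ndarray:
    """Cut a grayscale image into small flattened patches (the visual 'words')."""
    h, w = image.shape
    patches = [image[i:i + size, j:j + size].ravel()
               for i in range(0, h - size + 1, stride)
               for j in range(0, w - size + 1, stride)]
    return np.array(patches)

# Toy data standing in for a training set of grayscale images.
rng = np.random.default_rng(0)
train_images = [rng.random((64, 64)) for _ in range(10)]

# 1. Collect patches from all training images and cluster them into a codebook.
all_patches = np.vstack([extract_patches(img) for img in train_images])
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(all_patches)

# 2. Represent an image as a histogram over codebook entries ("word" frequencies).
def bag_of_visual_words(image: np.ndarray) -> np.ndarray:
    words = codebook.predict(extract_patches(image))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

feature = bag_of_visual_words(train_images[0])  # feed this to any classifier, e.g. an SVM
```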

After this method was proposed, many interesting related papers were published. However, the method also has its flaws, which can be seen in the numbers from image recognition competitions. In these competitions, participants are given a set of images on which to design and train their algorithms; then new images are provided, and the algorithms must identify the category of each image. If the correct answer is among the top five predicted categories, the prediction counts as correct; otherwise, it counts as incorrect. In the first competition in 2010, the first-place score was 72%, and in 2011 it was 74%. So even with so many excellent teams and resources worldwide, the annual improvement was only about 1-2 percentage points.
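The “top five” rule described above can be sketched as follows; the scores and labels are made up, and only the evaluation logic is the point.

```python
import numpy as np

def top5_accuracy(scores: np.ndarray, true_labels: np.ndarray) -> float:
    """scores: (n_images, n_classes) predicted scores; true_labels: (n_images,)."""
    # For each image, take the five highest-scoring classes.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # The image counts as correct if the true class is among those five.
    hits = [true_labels[i] in top5[i] for i in range(len(true_labels))]
    return float(np.mean(hits))

# Toy example: 3 images, 10 classes.
rng = np.random.default_rng(1)
scores = rng.random((3, 10))
true_labels = np.array([2, 7, 9])
print(top5_accuracy(scores, true_labels))
```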

After 2000, what has machine learning been doing? Machine learning continues to focus on fundamental research, yielding many excellent results. In 2006, Hinton published an article in Science introducing their deep learning method. Someone suggested Hinton try their method on the object recognition problem. As a result, in the 2012 competition, they achieved first place with an 85% recognition rate. Later, everyone realized how effective this method was, leading to a surge in its application to various problems of interest. Why is artificial intelligence so hot now? Mainly for this reason.

Professor Zhang Changshui’s lab also used this method for traffic sign recognition, which was a project funded by the National Natural Science Foundation of China. A lot of effort was put into it, and the results were excellent, reaching a practical level.

5. Challenges Ahead and Future Research Directions

It seems that image recognition has made great progress, and many people are optimistic and excited. However, image recognition has not yet achieved its full potential. What challenges remain? Let’s consider a few examples.

For instance, when conducting image recognition, we typically need to annotate data, indicating what is a bird, what is a cat, and so on. Annotating data is a tedious task that is time-consuming and costly. The first dataset collected by Li Fei-Fei’s project team contained 101 categories of objects. This image library was so well constructed that some algorithms could achieve over 99% recognition rate on it; people said the images were too clean and the variety too limited. Later, they created another database with 256 categories of objects, in which the images were not as well aligned. Even so, these databases were still too small.

In 2009, Li Fei-Fei and her team released a new database called ImageNet, which contains tens of millions of images.

Annotating data is a headache. For example, this database requires each object to be enclosed in a box and assigned a category label. These are typical images, and each object must be boxed and labeled correctly.

Another image database is called LabelMe. The image above is one example; it is labeled very finely, with the shapes and outlines of buildings, their windows, the cars, every patch of grass, and the roads all marked clearly. The database contains over 100,000 images, of which about 10,000 are labeled very thoroughly. Professor Zhang once told a student from MIT that their database was remarkable and must have taken a great deal of effort. The student replied that another student had built it, and that in fact most of the images were labeled by that student’s mother, who is retired and spends her days labeling data for him. What an impressive mother!

Another remarkable Chinese scientist, Zhu Songchun, argued that we should label images even more finely. For example, in this image the chair is labeled very precisely, detailing the contours of the seat, backrest, and legs, and they labeled many different types of chairs. They hired dozens of artists to label data day and night for several years, yet the database still contained only hundreds of thousands of images. Annotating data is therefore a very costly task. As a result, machine learning researchers are considering whether it is possible to improve image recognition without spending so much effort on data annotation. For example, in this image, if you only tell me that there is a motorcycle somewhere, without specifying where it is, the aim is to still detect and recognize the motorcycle.

Currently, many problems remain unresolved. For instance, our technology can only perform some parsing on images, recognizing that this part is a bird and this is a tree, but it does not provide a deeper understanding of the image. For example, the algorithm does not understand the relationships between these objects, which is crucial for understanding an image.

In this regard, I would like to introduce a line of work we are doing called image captioning. Several institutions, including Microsoft, Baidu, and Google, are working on this; the results shown here are from our laboratory’s work. What is image captioning? You give me an image, and instead of labeling which region is the dog and which region is the frisbee, you only provide a sentence such as “a dog is holding a frisbee”. We are currently using over 80,000 images with corresponding natural-language sentences to design and train a model, so that when given a new image, the algorithm can generate a natural-language sentence to describe it. For example, this is the sentence generated for this image: “A train is parked on the tracks next to the train station.” Another example: “A group of zebras is standing closely together.” And for this image: “A dog is holding a frisbee.”

Moreover, while doing this, we obtained some interesting results. The algorithm includes a visual attention model that can automatically find the relevant image blocks. When generating the sentence “A brown cow is standing in the grass” for this image, we see that “brown”, “cow”, and “grass” correspond to the correct image blocks. Notice that in the training data we did not tell the algorithm which block is the cow and which block is the grass; this indicates that the algorithm has learned these concepts on its own. If that is the case, can it also identify other concepts? We examined other words to see which image blocks they correspond to. For example, one row shows the image blocks corresponding to the concept of a fire hydrant, and another the concept of a black cat. Interestingly, the program also identified concepts corresponding to verbs: for example, “fill with” shows images of containers filled with things.
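To make the visual attention idea concrete, here is a minimal NumPy sketch of soft attention over image regions; the dimensions are hypothetical and this is not the lab’s actual model. Each word being generated attends to the region features, and the softmax weights indicate which image blocks that word “looks at”.

```python
import numpy as np

def soft_attention(region_features: np.ndarray, query: np.ndarray):
    """region_features: (n_regions, d) features of image blocks;
    query: (d,) decoder state for the word currently being generated."""
    # Relevance of each region to the current word.
    scores = region_features @ query
    # Softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is the attention-weighted sum of region features.
    context = weights @ region_features
    return context, weights

# Toy example: 9 image regions with 16-dimensional features.
rng = np.random.default_rng(2)
regions = rng.random((9, 16))
word_state = rng.random(16)
context, weights = soft_attention(regions, word_state)
print(weights.argmax())  # index of the image block this word attends to most
```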

This result is fascinating and resembles how a child learns while growing up. We teach a child just over one year old by saying, “This is a table,” or “This is a laser pointer.” We do not, and cannot, explain that “a” is a quantifier and “table” is the noun, yet the child gradually learns these concepts. Our algorithm learns in a similar manner.

The above lists the results we have achieved. However, many, many problems remain unresolved. For example, we can currently identify that this is a bouquet of flowers, but the algorithm cannot point out where the leaves are, where the flower center is, and where the petals are. So although we seem to have made significant progress, and in some sense these results can already be turned into products that serve us, there are still many unresolved issues that require continued effort.

This article primarily discusses images. However, we see that many of the methods used are machine learning methods. Therefore, it is the joint effort of machine learning researchers and computer vision scientists that has led to the current achievements. We can use these results to develop products that make our lives more intelligent.
