Recommended by New Intelligence
Source: Authorized Reprint by Data Party
Author: Zhang Changshui
[New Intelligence Guide] Professor Zhang Changshui from the Department of Automation at Tsinghua University delivered a speech titled 'Image Recognition and Machine Learning' at the 'Tsinghua Artificial Intelligence' forum, introducing the development, applications, and challenges of image recognition technology, with particular emphasis on the driving role of machine learning. Professor Zhang also raised questions that concern machine learning practitioners: Why are deep networks effective? Are there other models? And, from an engineering perspective, with ever more data, can it be computed? Can the machines handle it?
Image recognition is a core topic in the field of artificial intelligence. From a research perspective, machine learning is also a research direction under artificial intelligence. Therefore, this topic resonates more easily with everyone.
What is image recognition? Take this image as an example. The first question: is there a streetlight in this image? In academic research, this question is called detection. The second question is to find the location of the streetlight, which is called localization. The third is the classification and recognition of objects: pointing out that this is a mountain, this is a tree, this is a sign, this is a building. We may also classify the entire image scene, determining the environment in which the photo was taken: it could be an outdoor image related to urban life, and so on. These, basically, are the research questions involved in image recognition.
What can these studies be used for? For example, in self-driving cars: if a car has an auxiliary system with a camera that can recognize all situations in the scene, including lane lines, traffic signs, obstacles, etc., it can make driving easier and more convenient.
Additionally, some cameras, when taking pictures, will find the location of a person’s face halfway through pressing the shutter. Once the face is found, the camera will focus on the face, making the image clearer.
Moreover, our computers often have thousands of photos; how can we organize them so that users can quickly find a photo? If there is such an image recognition system, I might tell the computer that I want to find a photo with two people, and that the photo was taken at the Summer Palace.
There are many difficulties in image recognition. The first difficulty is the variation in viewpoints. When we take pictures of the same object, the appearance of the image varies due to different viewpoints. Therefore, looking at the same object from different angles can yield very different appearances. However, two different objects may also look very similar. This is one of the challenges of image recognition.
The second difficulty is the scale issue. Objects appear larger when closer and smaller when farther away, which presents challenges for image recognition.
The variation of light and shadow has always been a significant concern in computer vision; it is the third challenge of image recognition. The same person can look completely different under different lighting conditions.
The fourth difficulty is the complexity of the background. Finding a person with a cane or a person wearing a hat in a complex background can be very challenging.
The fifth difficulty is occlusion. Occlusion is a significant concern in computer vision. For example, in a crowded image, we may recognize this is probably a girl: she has brown hair and is wearing a short-sleeved shirt. Humans are very capable, and we can still identify gender in such cases, but computers are not there yet.
The sixth difficulty is deformation. Non-rigid objects can deform during motion. The same horse may appear very differently in various situations.
Image recognition began with single object recognition. The image above shows the results of traditional geometric methods. Our objective world is so complex and diverse; how should we approach recognition? We start with particularly simple problems, which is a common approach in scientific research: start with simple issues. For example, we began with the recognition of building blocks, as they have a few standardized shapes. The image above shows a simple razor that has been recognized. These standardized geometric combinations can be easily detected by identifying rectangles, squares, triangles, etc. Another method is to recognize based on appearance. I do not consider the geometric structure of the object to be recognized, only its visual appearance. Here are examples of face detection.
The history of face recognition research is relatively long; it began around the 1970s and continues to see many publications in face recognition research.
Another topic is handwritten digit recognition. Handwritten digit recognition may seem simple, but the research has led to a significant number of methods and many results, making it a very interesting topic. Other topics include vehicle detection. I have only listed a few here; in fact, during the same period, there was also fingerprint recognition, OCR, etc. Some research had already reached the level of commercialization, including OCR and fingerprint recognition.
Before 2000, image recognition used geometric methods, structural methods, rule-based methods, and some simpler pattern recognition methods.
In the late 1980s and 1990s, what happened in the field of machine learning? During this period, machine learning developed rapidly and produced remarkable results, including support vector machines, AdaBoost, and computational learning theory. These achievements significantly advanced machine learning and recognition. After 2002, the Chinese scientist Li Fei-Fei began to approach image recognition with a new idea. Her group aimed to design a unified framework for image recognition rather than developing specialized methods for specific recognition tasks. They hoped this unified framework could recognize thousands of objects, and that the best results in machine learning could be applied to image recognition. They also borrowed the 'bag of words' method from text analysis for image recognition.
What is the ‘bag of words’ method? For example, to recognize a face, we do not consider the complex structure of the face; we only need to check for the presence of a nose, eyes, mouth, and chin. If these components are together, we can say this is a face. You might think this is simple.
This method originates from text research. In natural language processing, there is a task of text classification. The ‘bag of words’ method is used in text classification.
For example, if we have an article and we want to know which category it belongs to—whether it discusses military topics or scientific topics—what do we do? One approach is to understand each sentence, parse it, understand its grammatical structure, and comprehend its content. However, parsing grammar is hard, and understanding sentences is challenging. We often do not do well, so we do not use this method. In fact, we can use a simple method: we only need to look at which words appear frequently in the article. The high-frequency words in this article are: vision, perception, brain, neurons, cells; you would say this article belongs to neuroscience. Another article may have high-frequency words like China, trade, export, import, bank, currency, etc., and you would know this article belongs to economics. This method does not require analyzing and parsing the grammatical structure of sentences and paragraphs; it simply groups these high-frequency words together, called the ‘bag of words.’
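The high-frequency-word idea above can be sketched in a few lines of Python. The categories, keyword lists, and scoring rule here are illustrative assumptions for this sketch, not an actual production classifier:

```python
import re
from collections import Counter

# Illustrative keyword lists per category (assumed for this sketch).
CATEGORY_KEYWORDS = {
    "neuroscience": {"vision", "perception", "brain", "neurons", "cells"},
    "economics": {"china", "trade", "export", "import", "bank", "currency"},
}

def classify(text):
    """Classify a document by counting how often each category's
    keywords appear -- no grammar parsing needed."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        category: sum(words[w] for w in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)
```

For example, `classify("the brain has neurons and cells supporting vision")` would score highest for `neuroscience`, without any parsing of sentence structure.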
How can we apply this method to image recognition? When recognizing images, we can also group the ‘high-frequency words’ in the images. What are these ‘words’? Intuitively, they are small image blocks. For example, to recognize a face, the image will contain blocks like skin or eyes. In contrast, for recognizing a bicycle, there will be image blocks related to the bike, such as the seat and frame. These image blocks are the ‘words.’ Thus, we can use the ‘bag of words’ method. In reality, the ‘words’ in images are not as straightforward as we say; they are very small image blocks, such as 3×3, 5×5, or 7×7 in size. These small image blocks do not convey very abstract semantics.
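A bag of visual words along the lines described above can be sketched as follows: cut the image into small blocks, cluster the blocks into a codebook of 'visual words,' and represent each image as a histogram over that codebook. Patch size, codebook size, and the tiny k-means loop are all assumptions of this sketch, not the specific pipeline used by any group mentioned here:

```python
import numpy as np

def extract_patches(image, size=5, stride=5):
    """Slice a grayscale image into small size x size blocks --
    the visual 'words' of the bag-of-words analogy."""
    h, w = image.shape
    return np.array([
        image[i:i + size, j:j + size].ravel()
        for i in range(0, h - size + 1, stride)
        for j in range(0, w - size + 1, stride)
    ])

def build_codebook(patches, k=8, iters=10, seed=0):
    """Tiny k-means that clusters patches into k 'visual words'."""
    patches = patches.astype(float)
    rng = np.random.default_rng(seed)
    centers = patches[rng.choice(len(patches), k, replace=False)].copy()
    for _ in range(iters):
        dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = patches[labels == c].mean(axis=0)
    return centers

def bow_histogram(image, centers, size=5, stride=5):
    """Represent an image as a histogram over the visual codebook."""
    patches = extract_patches(image, size, stride).astype(float)
    dists = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    return np.bincount(labels, minlength=len(centers))
```

The resulting histogram plays the same role as the word-frequency vector in text classification: two face images should produce similar histograms even though their pixels differ.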
This method led to many interesting papers. However, it also has limitations, which a few numbers illustrate. In the field of image recognition there is an object recognition competition: it provides images on which you design and train your algorithm, and at competition time new images are provided whose categories your algorithm must identify. If the correct answer is among the top five predicted categories, the prediction counts as correct; otherwise it counts as incorrect. In 2010 the first-place result was 72%, and in 2011 it was 74%. With so many excellent teams and resources worldwide, the annual progress was only about 1-2%.
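The top-five scoring rule of the competition can be written down directly; the category names in the usage example are made up for illustration:

```python
def top5_correct(ranked_predictions, true_label):
    """A prediction counts as correct if the true category appears
    among the top five ranked guesses."""
    return true_label in ranked_predictions[:5]

def top5_accuracy(predictions, labels):
    """Fraction of images whose true category is in the top five."""
    correct = sum(top5_correct(p, y) for p, y in zip(predictions, labels))
    return correct / len(labels)

preds = [["cat", "dog", "car", "tree", "boat"],
         ["car", "bus", "truck", "bike", "van"]]
labels = ["dog", "plane"]
print(top5_accuracy(preds, labels))  # first image correct, second not: 0.5
```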
After 2000, what has machine learning been doing? Machine learning is still conducting foundational research, yielding many excellent results. Among them, in 2006, Hinton published an article in Science introducing their deep learning method. Someone suggested that Hinton try their method on this object recognition problem. As a result, they achieved first place in the 2012 competition with an 85% recognition rate. Later, everyone realized this method was so effective that many rushed to use it to solve their respective problems. Why is artificial intelligence so popular now? Mainly because of this reason.
Professor Zhang Changshui’s lab also used this method for traffic sign recognition, which is a project funded by the National Natural Science Foundation. They put in a lot of effort, and the results were quite good, reaching a practical level.
It seems that image recognition has made significant progress, and many people are optimistic and excited. However, image recognition is not yet perfect. What challenges remain? Let’s consider a few examples.
For instance, when we conduct image recognition, we usually need annotated data, meaning we must label whether an image shows a bird or a cat, and then use these labeled images for training. Annotating data is quite a headache: it is time-consuming and expensive. The first dataset collected by Li Fei-Fei's group included 101 object categories. This image library was well constructed, and some algorithms could achieve over 99% recognition accuracy on it; people said the image quality was too good and the variety too limited. Later, they created another database with 256 object categories whose images were less canonical. Even so, this database was still too small.
In 2009, Li Fei-Fei and her team released a new database called ImageNet, which contains tens of millions of image data.
Annotating data is a headache. For example, this database requires each object to be enclosed in a box and assigned a category label. These are typical images where each object must be boxed and labeled accordingly.
There is also an image database called LabelMe. The above image is an example: the outlines of houses, windows, cars, and all the grass and roads are clearly labeled. There are over 100,000 images, and about 10,000 are very well annotated. Professor Zhang once told a student from MIT that this database was impressive and must have required a lot of effort. The student replied that it was done by another student; in fact, most of the images were annotated by that student's mother, who was retired and spent her days annotating data. What an impressive mother!
Another remarkable Chinese scientist, Zhu Songchun, said we should annotate images in more detail. For example, in this image, the chair can be annotated very precisely, with the seat, backrest, and leg contours accurately marked. They also annotated various types of chairs. They hired dozens of artists to annotate data day and night, but after several years, the database only contained a few hundred thousand images. Therefore, annotating data is a very costly endeavor. Thus, those in machine learning are considering whether it is possible to improve image recognition without such painstaking annotation. For example, in this image, if you tell me there is a motorcycle, I can detect and recognize the motorcycle without needing you to specify where it is.
Currently, there are still many unresolved issues. For example, our technology only performs some analysis on images, recognizing this part as a bird and that part as a tree, but it does not provide a deeper understanding of the image. For instance, the algorithm does not understand the relationships between these objects. Understanding the relationships between objects is crucial for comprehending an image.
In this regard, I would like to introduce a project we are working on called image captioning. Several organizations, including Microsoft, Baidu, and Google, are working on this; the results shown here are from our laboratory's work. What is image captioning? Given an image, the algorithm should generate a natural-language sentence describing it, for example, 'a dog is holding a frisbee,' without anyone having to point out where the dog is. We use over 80,000 images paired with natural-language sentences to design and train a model, so that when given a new image, the algorithm can generate a sentence describing it. For example, for one image the generated sentence might be: 'A train is stopped on the tracks next to the train station.' Another image might yield: 'A group of zebras is standing closely together.' Another: 'A dog is holding a frisbee.'

Moreover, while doing this, we obtained some interesting results. The algorithm includes a visual attention model that can automatically find relevant image blocks. When generating the sentence 'A brown cow is standing in the grass,' we see that 'brown,' 'cow,' and 'grass' each correspond to the correct image blocks. Note that the training data never told the algorithm which block is the cow and which is the grass; the algorithm learned these concepts on its own. Since this is the case, can it also correctly identify other concepts? We examined other words and their corresponding image blocks. For instance, one row corresponds to the concept of 'fire hydrant,' another to 'black cat.' Interestingly, besides nouns, the program also identified concepts corresponding to verbs. For example, 'fill with' shows images of containers filled with something.
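A visual attention step of the general kind described, one soft weighting over image-region features, can be sketched as follows. This is a generic illustration with made-up dimensions and a simple dot-product score, not the lab's actual captioning model:

```python
import numpy as np

def soft_attention(region_features, query):
    """Soft attention: score each image region against the decoder
    state ('query'), softmax the scores into weights, and return the
    weighted sum of region features together with the weights."""
    scores = region_features @ query             # one score per region
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    context = weights @ region_features          # weighted feature sum
    return context, weights
```

The weights form a distribution over image regions; visualizing them is what lets one check, as described above, that 'cow' or 'grass' attends to the right blocks.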
This result is fascinating, and it closely resembles how a child learns while growing up. We teach a child over one year old by saying, 'This is a table,' 'This is a laser pointer.' We cannot and do not explain that 'a' is an article and 'table' is a noun; yet the child gradually learns these concepts. Our algorithm works similarly.
We have listed the results we have achieved. However, there are still many unresolved issues ahead. For instance, we currently see that deep networks are very effective, but we do not fully understand why they are effective. Are there other models? Additionally, there are engineering issues: with so much data, can you compute it? Can machines handle it? These are significant concerns for those in machine learning.
Furthermore, for example, consider a bouquet of flowers. The algorithm can now recognize it as a bouquet of flowers, but it cannot identify where the leaves, flower center, and petals are. Although we have achieved so many results, from a certain perspective, we can turn them into products to serve us. However, we also see that there are many more problems that have not been well solved, which requires our continued efforts.
This article primarily discusses images. However, we see that many methods used are based on machine learning. Therefore, it is the joint efforts of machine learning practitioners and computer vision scientists that have led to the current achievements. We can transform these achievements into products to make our lives a bit smarter.
Compiled by: Li Kenan
Proofread by: Guo Xinrui
Edited by: Zhang Meng
