
To some extent, the greatest advantage of deep learning is its ability to automatically create features that no one would think of.
Deep learning has earned a place in many fields, especially computer vision. Yet however fascinating it may be, a deep network is essentially a black box, and most of us, even scientists trained in the field, do not really know how it works.
Deep learning’s many successes and failures have taught us valuable lessons about handling data correctly. In this article, we will examine the potential of deep learning, its relationship to classical computer vision, and the potential dangers of using it for critical applications.
The Simplicity and Complexity of Visual Problems
First, some framing of visual/computer vision problems. In principle, the setup is this: given an image captured by a camera, we ask a computer to answer questions about the content of that image.
The questions range from simple ones like “Is there a triangle in the image?” or “Is there a face in the image?” to more complex ones such as “Is there a dog chasing a cat in the image?” Although these questions may look alike, and even seem trivial to humans, the complexity hidden in them varies enormously.
While answering questions like “Is there a red circle in the image?” or “How many bright spots are in the image?” is relatively easy, other seemingly simple questions like “Is there a cat in the image?” are much more complex. The distinction between “simple” visual questions and “complex” visual questions is hard to delineate.
This is noteworthy because for humans, who are highly visual animals, none of these questions is difficult; even children answer them without effort. For computers, however, even in the midst of the deep learning revolution, answering them reliably remains a struggle.
Traditional Computer Vision vs. Deep Learning
Traditional computer vision is a broad collection of algorithms that let computers extract information from images (usually represented as arrays of pixel values). It is applied today in many ways, such as denoising, enhancement, and the detection of various objects.
Some techniques aim to find simple geometric primitives, using tools such as edge detection, morphological analysis, the Hough transform, blob detection, corner detection, and various image-thresholding methods. There are also feature-representation techniques, such as the Histogram of Oriented Gradients (HOG), which can serve as the front end of a machine learning classifier to build more complex detectors.
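As a concrete illustration, here is a minimal sketch of that HOG-plus-linear-classifier recipe, in the style of the classic Dalal–Triggs pedestrian detector. It assumes scikit-image and scikit-learn are installed; `load_patches` is a hypothetical helper standing in for whatever labeled image patches you have on hand.

```python
# Classic pipeline: hand-crafted HOG features feeding a linear classifier.
# Sketch only; `load_patches` is a hypothetical helper that returns
# same-sized grayscale patches (e.g. 64x128) with binary labels.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

patches, labels = load_patches()

# Histogram of Oriented Gradients: a fixed, human-designed feature map.
features = np.array([
    hog(p, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    for p in patches
])

# A linear SVM on top of HOG features -- the Dalal-Triggs style detector.
clf = LinearSVC(C=0.01).fit(features, labels)
```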
Contrary to popular belief, these tools can be combined to build detectors for specific objects that are both powerful and efficient. One can build face detectors, car detectors, and traffic-sign detectors this way that will likely outperform deep learning solutions in accuracy and computational complexity.
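For instance, a serviceable classical face detector takes only a few lines using the pre-trained Haar cascades (Viola–Jones detectors) that ship with OpenCV. This is a sketch, and "photo.jpg" is a placeholder path:

```python
# Classical face detection with a pre-trained Haar cascade (Viola-Jones),
# no deep network involved. "photo.jpg" is a placeholder path.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Multi-scale sliding-window detection over the grayscale image.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```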
The problem is that each detector must be built from scratch by capable people, which is inefficient and expensive. Historically, therefore, well-performing detectors have only been built for objects that need to be detected often enough to justify the upfront investment.
Many of these detectors are proprietary and unavailable to the public: face detectors, license-plate recognition systems, and so on. But no sane person would spend money hand-writing a dog detector, or a classifier that distinguishes dog breeds from images. This is where deep learning comes in.
Inspiration from the Top Student
Suppose you are teaching a computer vision course. In the first half you walk students through a lot of specialized material, then leave time for them to complete assignments; each assignment provides a collection of images and asks a question about their content. The assignments start simple, such as asking whether there is a circle or a square in the image, and move on to harder tasks, such as distinguishing cats from dogs.
Students are required to write computer programs weekly to complete the tasks, and you are responsible for reviewing the code they write and running it to see how it performs.
This semester, a new student joins your class. He is quiet, not social, and never asks questions. But when he submits his first assignment, you are taken aback. His code is hard to follow, unlike anything you have seen before: he appears to convolve each image with what look like random filters and then uses very strange logic to arrive at the final answer.
You run the code, and it performs exceptionally well. You decide that although the solution is unusual, as long as it works, that is enough. Weeks pass, the assignments grow harder, and the code you receive from this student grows more complex. It keeps succeeding at the increasingly difficult tasks, but you still cannot truly understand how it works.
At the end of the semester, you assign the students a set of real images and ask them to distinguish cats from dogs. No other student gets above 65% accuracy, but the new student’s code reaches 95%, leaving you astonished. Over the next few days you analyze the cryptic code in depth: you feed it new examples, modify it, and try to identify what drives its decisions, reverse-engineering it piece by piece.
Eventually, you come to a very surprising conclusion: the code first looks for a dog tag, as on a collar. If it finds one, it checks whether the bottom of the object is brown; if so, it returns “cat”, otherwise “dog”. If it finds no tag, it checks whether the left side of the object is more yellow than the right; if so, it returns “dog”, otherwise “cat”.
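Spelled out as code, the reverse-engineered rule amounts to something like the toy sketch below. Everything here is a hypothetical stand-in for what the story describes: `find_tag`, the color tests, and the thresholds are all invented for illustration.

```python
import numpy as np

def find_tag(img: np.ndarray) -> bool:
    """Stand-in for the student's tag detector; the story leaves it unspecified."""
    return False  # stub

def classify(img: np.ndarray) -> str:
    """Toy version of the reverse-engineered rule. img is an HxWx3 RGB array."""
    h, w = img.shape[:2]
    bottom = img[2 * h // 3:]                  # lower third of the image
    left, right = img[:, : w // 2], img[:, w // 2:]

    if find_tag(img):
        # "Brown-ish": red clearly above green, green above blue, on average.
        r, g, b = (bottom[..., c].mean() for c in range(3))
        return "cat" if r > g > b else "dog"

    # "Yellowness": red plus green, relative to blue.
    def yellowness(region: np.ndarray) -> float:
        return float(region[..., 0].mean() + region[..., 1].mean()
                     - region[..., 2].mean())

    return "dog" if yellowness(left) > yellowness(right) else "cat"
```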
You invite the new student to your office and present your findings. You ask him whether he thinks he has really solved the problem. After a long silence, he finally mumbles that he solved the task as defined by the dataset, but that he has no idea what a dog looks like or how dogs differ from cats…
In a sense he cheated: his solution had nothing to do with the purpose of the task or with what you actually wanted. And yet he did not cheat, because his solution genuinely worked. The other students, meanwhile, performed worse: they tried to solve the task as stated in the question rather than by exploiting the raw dataset, and although their programs scored lower, they did not make bizarre mistakes.
The Blessing and Curse of Deep Learning
Deep learning is a technique that uses an optimization method called gradient backpropagation to generate “programs” (also known as “neural networks”), much like the one written by the student in the story above. These “programs” and the optimization method know nothing about the world; all they care about is building a set of transformations and conditions that assigns the correct labels to the images in the dataset.
Adding more data to the training set can push out spurious biases; but with millions of parameters and thousands of conditional checks, the “programs” generated by backpropagation are large and complex enough to latch onto combinations of much subtler biases. Anything that statistically optimizes the objective function and assigns the correct labels will be used, whether or not it relates to the “semantic spirit” of the task.
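Mechanically, that is all there is to it. The following is a minimal PyTorch sketch, with random placeholder data and an arbitrary architecture: backpropagation adjusts the parameters solely to make the labels come out right, and nothing in the loop encodes what a cat or a dog is.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: random "images" with random cat/dog labels.
images = torch.rand(256, 3, 32, 32)
labels = torch.randint(0, 2, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=32)

# Arbitrary stand-in "program": stacked convolutions plus a linear read-out.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),                  # two outputs: "cat" / "dog"
)
opt = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for x, y in train_loader:
    opt.zero_grad()
    loss = loss_fn(net(x), y)          # the ONLY objective: match the labels
    loss.backward()                    # backpropagate the error signal
    opt.step()                         # nudge the parameters downhill
```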
Could these networks end up latching onto “semantically correct” priors? Certainly, in some cases. But there is now overwhelming evidence that this often does not happen. Adversarial examples show that tiny, imperceptible modifications to an image can flip the detection result.
Studies that collect new test samples resembling the original datasets show that generalization beyond the original dataset is much weaker than within it, indicating that what the network learns is tied to the low-level specifics of the given dataset. In some cases, modifying a single pixel is enough to fool a deep network classifier.
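The fast gradient sign method (FGSM) is the canonical way to construct such adversarial examples. Below is a minimal sketch that continues the training-loop snippet above, reusing its `net` and `loss_fn` with one batch `x`, `y`; the epsilon value is an arbitrary choice.

```python
import torch

# FGSM: perturb the input in the direction that most increases the loss.
epsilon = 0.01                         # arbitrary, visually imperceptible budget

x = x.clone().detach().requires_grad_(True)
loss = loss_fn(net(x), y)
loss.backward()                        # gradient w.r.t. the *input pixels*

# Step every pixel slightly uphill on the loss, then clip to valid range.
x_adv = (x + epsilon * x.grad.sign()).clamp(0.0, 1.0).detach()

with torch.no_grad():
    print(net(x).argmax(1))            # original predictions
    print(net(x_adv).argmax(1))        # may flip, though x_adv looks identical
```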
To some extent, the greatest advantage of deep learning is its ability to automatically create features that no one would think of, which is also its greatest weakness because most of these features, at least semantically, appear to be “suspicious.”
When Does It Make Sense, When Doesn’t It?
Deep learning is undoubtedly an interesting addition to the computer vision toolbox. We can now, with relative ease, “train” detectors for objects that would previously have been too expensive or impractical to engineer by hand. We can also scale these detectors up, to some extent, by spending more computational power.
But this luxury comes at a price: we do not know how a deep model arrives at its judgments, and we do know that its basis for classification is likely unrelated to the “semantic spirit” of the task. Worse, the detector will fail whenever the input data violates the low-level biases of the training set, and these failure conditions are essentially unknown in advance.
In practice, therefore, deep learning is very useful for applications where errors are not serious, where inputs are guaranteed not to differ much from the training data, and where an error rate of a few percent is tolerable: image search, surveillance, automated retail, and nearly everything that is not “mission-critical”.
Ironically, many people believe that deep learning will revolutionize precisely the fields in which decisions must be made in real time and errors carry serious, even fatal, consequences: autonomous driving and autonomous robotics, for example (and recent studies have shown that deep-network-based autonomous driving really is susceptible to adversarial attacks in the physical world). I can only describe this belief as an unfortunate misunderstanding.
Some place high hopes on deep learning in medicine and diagnostics. Here too there are worrying findings, such as models trained on data from one institution failing on data from another. This again reinforces the view that what these models learn is shallower than many researchers would hope.
Data Is Shallower Than We Think
Surprisingly, deep learning has taught us something interesting about visual data (and high-dimensional data generally): in a sense, the data is much “shallower” than we previously thought.
There seem to be many more ways to statistically separate a visual dataset labeled with high-level human categories than there are ways to separate it that are “semantically correct”. In other words, low-level image features are far more statistically informative than we imagined. This is deep learning’s great discovery.
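A toy way to see this “shallowness” for yourself: even a feature as crude as an image’s mean color separates CIFAR-10 classes well above the 10% chance level. This is a sketch assuming torchvision and scikit-learn are available; the exact accuracy will vary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from torchvision.datasets import CIFAR10

train = CIFAR10("data", train=True, download=True)
test = CIFAR10("data", train=False, download=True)

def mean_color(ds) -> np.ndarray:
    # One absurdly low-level feature per image: its mean R, G, B values.
    return ds.data.reshape(len(ds.data), -1, 3).mean(axis=1) / 255.0

clf = LogisticRegression(max_iter=1000)
clf.fit(mean_color(train), train.targets)
print(f"mean-color accuracy: {clf.score(mean_color(test), test.targets):.1%}")
```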
The open question is how to steer a model toward the “semantically sound” ways of separating a visual dataset, and in fact that question now looks harder to answer than it did before.
Conclusion
Deep learning has become an important component of computer vision systems. But traditional computer vision has not gone away, and it can still be used to build very powerful detectors. These hand-engineered detectors may not reach deep learning’s performance on certain dataset-specific metrics, but they come with the guarantee that their features have a “semantic relevance” to the input.
Deep learning provides statistically powerful detectors without the cost of hand feature engineering, although it does demand large amounts of labeled data, plenty of GPUs, and deep learning experts. Yet these powerful detectors can fail unexpectedly, because their domain of applicability cannot easily be characterized (or, more precisely, cannot be characterized at all).
It should be noted that the discussion above has nothing to do with the “intelligence” in artificial intelligence: I do not believe deep learning gets us anywhere on the problem of AI itself. I do believe, however, that combining deep learning, feature engineering, and logical reasoning can achieve very interesting and useful technical capabilities across a broad automation space.