(Author: Jian Sun, Chief Researcher at Microsoft Research Asia / Editor: Guangpu)
From the primitive days of raw survival to the concrete jungles of modern cities, humanity has gradually projected its capabilities onto computers. In both computational power and memory, today’s computers perform remarkably well. But these abilities alone are far from enough; we expect computers to do much more. The globally popular film “Interstellar” has stirred countless desires to explore the mysteries of the vast universe, and it has made many viewers remember TARS, the smart, lovable, and humorous robot. Hollywood films about artificial intelligence have always been beloved by audiences, as filmmakers construct one fascinating future world after another with boundless imagination and dazzling special effects. Yet back in reality, the progress of computer scientists seems to lag far behind the imagination of filmmakers. After all, movies are just movies. To build an intelligent robot like TARS that can understand the surrounding world, comprehend human language, and converse smoothly with humans, we still have a long way to go.
TARS, the robot from “Interstellar”
For a long time, enabling computers to see, hear, and speak has been a goal pursued tirelessly by myself and my colleagues in the field of computing. Having worked in the field of computer vision for over a decade, empowering computers with keen eyesight to understand this colorful world has been a significant driving force for me on this challenging journey. Although computers cannot yet match the intelligence portrayed in films, they have already achieved many surprising results. In this article, I will introduce the basic concepts of computer vision, the challenges faced in this field, some key breakthrough technologies, and prospects for future evolution.
How Does the World Form in Our Eyes?
For humans, “recognizing people” seems to be an innate ability: newborns can mimic their parents’ expressions within days of birth, and a few details are enough for us to tell one another apart. Even in dim light, we can recognize a friend at the far end of a corridor. Yet this ability, effortless for humans, is a formidable challenge for computers, and for a long time progress in computer vision was painfully slow. Before delving deeper, let’s first look at how we observe the world with our eyes.
Everyone has encountered the principle of pinhole imaging in physics class, but the human eye is far more complex than a pinhole camera. When we observe an object, our eyes make roughly three rapid movements per second, each followed by a brief fixation. When the photoreceptors in the retina pick up the outline of, say, a candle, a region known as the fovea actually records the candle’s shape in a distorted manner.
So, the question arises: why do we see a world that is neither distorted nor deformed? The answer is simple: because humans have the cerebral cortex, a universal “converter” that transforms the signals captured by our visual neurons into real images. This “converter” can be simplified into four regions, which biologists refer to as V1, V2, V4, and IT. Neurons in the V1 region respond only to a small part of the entire visual field; for example, some neurons become exceptionally active when detecting a straight line. This line could be part of anything—a table edge, the floor, or a stroke of a character in this article. Every time the eye scans, the activity of this group of neurons may change rapidly.
The mystery lies in the IT (inferotemporal) region at the top of the visual processing hierarchy. Biologists have found that when a particular object, for example a face, appears anywhere in the visual field, certain neurons there remain consistently active. In other words, human visual recognition proceeds from the retina up to the IT region, with the nervous system progressing from detecting subtle, low-level features to gradually identifying whole targets. If computer vision had a similar “converter,” the efficiency of computer recognition would improve greatly. The workings of human visual neurons thus provide the inspiration for breakthroughs in computer vision technology.
Why Does Computer Vision Always Seem to Be “Foggy”?
Although the mysteries of human visual recognition are gradually being unveiled, applying this knowledge directly to computers is no easy task. Computer recognition often seems to operate “in a fog”: once lighting, viewing angle, or other factors change, computers struggle to keep up and frequently misidentify what they see. For a computer, recognizing the same person across different environments is harder than distinguishing two people in the same environment. This is because early researchers treated the human face as a template and used machine learning to capture the rules of that template. A face may seem fixed, but variations in angle, lighting, and appearance make it difficult for a simple template to match every face.
Thus, the core issue of facial recognition lies in how to enable computers to ignore the internal differences of the same person while being able to distinguish between two different individuals—that is, making the same person appear similar while different individuals appear distinct.
The introduction of artificial neural networks was the key to moving beyond template matching in computer vision. But how can we guide computers forward when humans have yet to fully understand how neurons work? Artificial neural networks began to emerge in the 1960s, and early theories settled on a simple model: the “input, hidden layer, output” structure familiar from biology class. When explaining how neurons work, teachers generally say that external stimuli reach the input neurons, which connect to other neurons to form hidden layers, and the result is finally expressed through the output neurons. The strengths of these connections vary, much like the different dynamics marked in a musical score, and an artificial neural network learns how to map inputs to outputs through these varying connection strengths.
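To make the “input, hidden layer, output” picture concrete, here is a minimal NumPy sketch of such a network’s forward pass. The layer sizes, the sigmoid activation, and the random “stimulus” are illustrative assumptions, not details from the article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny "input -> hidden layer -> output" network.
# The connection strengths are the weight matrices W1 and W2.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 2             # illustrative sizes
W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # input-to-hidden weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_out))  # hidden-to-output weights
b2 = np.zeros(n_out)

def forward(x):
    """Map an input vector to an output vector through one hidden layer."""
    h = sigmoid(x @ W1 + b1)   # hidden-layer activations
    y = sigmoid(h @ W2 + b2)   # output-layer activations
    return h, y

x = rng.normal(size=n_in)      # a stand-in for an external stimulus
_, y = forward(x)
print(y)
```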
However, this “musical score” was static: information could only flow from input to output, with nothing sent back the other way. This means that if a person stayed perfectly still, a computer might manage to read them by this principle, but in real life that never happens. In the late 1980s, the “backpropagation algorithm” was invented for artificial neural networks, enabling the error measured at the output units to be sent back toward the input units so that the connection strengths along the way could be adjusted. With this method, an artificial neural network can learn statistical patterns from large numbers of training samples and make predictions about unseen cases. Yet compared with the complexity and hierarchy of the brain, a neural network containing only one hidden layer still seems trivial.
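As a rough illustration of backpropagation, the sketch below trains a tiny one-hidden-layer network on the XOR problem: the output error is sent backward through the network and the connection strengths are adjusted accordingly. The task, layer sizes, and learning rate are arbitrary choices for demonstration, and convergence depends on the random initialization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set: XOR, a classic task a network with no hidden layer cannot learn.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1 = rng.normal(0, 1.0, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1.0, (4, 1)); b2 = np.zeros(1)   # hidden -> output
lr = 1.0                                            # learning rate (illustrative)

for step in range(5000):
    # Forward pass: input -> hidden -> output.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Backward pass: send the output error back toward the input,
    # adjusting each layer's connection strengths along the way.
    dY = (Y - T) * Y * (1 - Y)          # error signal at the output units
    dH = (dY @ W2.T) * H * (1 - H)      # error propagated back to the hidden units

    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y, 2))   # predictions should approach the XOR targets
```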
Deep Neural Networks Help Computers “See Through the Fog”
In 2006, Professor Geoffrey Hinton of the University of Toronto made a breakthrough in training deep neural networks. On one hand, he showed that artificial neural networks with many hidden layers have superior feature-learning capabilities; on the other, he overcame the training difficulties that had long plagued researchers through layer-wise initialization: the network is first initialized, or pre-trained, with a large amount of unlabeled data, and the pre-trained network is then fine-tuned using labeled data.
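A minimal PyTorch sketch of this pretrain-then-fine-tune recipe follows, using random stand-in data: each layer is first trained as a small autoencoder on unlabeled inputs, then the pre-trained layers are stacked under a classifier head and fine-tuned with labels. The layer widths, optimizers, and iteration counts are illustrative assumptions, not Hinton’s original settings.

```python
import torch
import torch.nn as nn

# Stand-ins for a large unlabeled set and a smaller labeled set.
X_unlabeled = torch.randn(1024, 64)
X_labeled, y_labeled = torch.randn(256, 64), torch.randint(0, 10, (256,))

sizes = [64, 32, 16]            # illustrative layer widths
encoders = []

# Step 1: greedy layer-wise pre-training with unlabeled data.
# Each layer is trained as a small autoencoder on the previous layer's output.
h = X_unlabeled
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):
        recon = dec(torch.sigmoid(enc(h)))
        loss = nn.functional.mse_loss(recon, h)
        opt.zero_grad(); loss.backward(); opt.step()
    encoders.append(enc)
    h = torch.sigmoid(enc(h)).detach()   # feed this layer's codes to the next layer

# Step 2: stack the pre-trained layers, add a classifier head,
# and fine-tune the whole network with the labeled data.
layers = []
for enc in encoders:
    layers += [enc, nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(sizes[-1], 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = nn.functional.cross_entropy(model(X_labeled), y_labeled)
    opt.zero_grad(); loss.backward(); opt.step()
```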
Inspired by this, most current research on face and image recognition is based on CNNs (convolutional neural networks). A CNN can be viewed as a “machine” that scans an image layer by layer. The first layer detects edges, corners, and flat or uneven areas, features that carry almost no semantic information; the second layer combines the outputs of the first layer and passes the combinations on to the next layer, and so on. Through this multi-layer scanning, the extracted features become increasingly abstract and discriminative, and the computer moves toward the goal mentioned earlier: making the same person appear similar while different individuals appear distinct.
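This layer-by-layer “scanning” can be sketched as a stack of convolution and pooling layers, as in the hypothetical PyTorch snippet below; the channel counts and kernel sizes are arbitrary illustrations rather than any particular published architecture.

```python
import torch
import torch.nn as nn

# A minimal convolutional "scanner": each layer combines the patterns
# detected by the layer below it over a slightly larger neighborhood.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # layer 1: edges, corners, color patches
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), # layer 2: combinations of layer-1 patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), # layer 3: larger, more semantic structures
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
)

image = torch.randn(1, 3, 224, 224)   # a stand-in for an input photo
features = cnn(image).flatten(1)      # a 64-dimensional description of the image
print(features.shape)                 # torch.Size([1, 64])
```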
CNN is short for convolutional neural network, a deep neural network with a convolutional structure, and object recognition with such a network can be divided into two steps: image classification and object detection. In the first step, the computer identifies what kind of object is present, such as a person, an animal, or some other item; in the second step, it determines the precise location of that object within the image. The two steps answer the questions “what is it” and “where is it.” Microsoft’s intelligent chatbot “Xiaoice” shows off the CNN’s capabilities by identifying dog breeds. First, a multi-layer deep convolutional network is built. The first layer, much like the early stages of the human visual system, detects small edges or color patches; the second layer composes these small structures into larger ones, such as a dog’s legs or eyes; the process continues upward until the breed of the dog can be identified. Next, a large number of images are fed into this deep convolutional network to train the system to identify dogs accurately.
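A hedged sketch of that training step is shown below: a small convolutional backbone followed by a classification head that answers “what is it,” trained with cross-entropy on labeled images. The article does not describe Xiaoice’s actual network, so the data here are random stand-ins and NUM_BREEDS and the layer sizes are made-up values for illustration.

```python
import torch
import torch.nn as nn

NUM_BREEDS = 120   # hypothetical number of breed classes

# A small convolutional backbone plus a classification head ("what is it").
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Sequential(backbone, nn.Linear(32, NUM_BREEDS))

# Stand-in for a large labeled collection of dog photos.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, NUM_BREEDS, (64,))

opt = torch.optim.SGD(classifier.parameters(), lr=0.01, momentum=0.9)
for epoch in range(5):                       # in practice: many passes over many images
    logits = classifier(images)
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    print(epoch, loss.item())
```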
In 2013, researchers at the University of California, Berkeley proposed an object-detection method called R-CNN (Region-based CNN), which achieves high recognition accuracy. It divides each image into many candidate windows, or sub-regions, and applies a neural network to classify each sub-region. Its main drawback is that the algorithm is far too slow for real-time detection: to detect even a few objects in a single image, the neural network may need to run thousands of times.
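The bottleneck shows up clearly in a simplified sketch of the per-region approach. Here `propose_regions` is a hypothetical stand-in for the proposal step (the real R-CNN uses selective search), and the tiny classifier is purely illustrative; the point is that every one of the roughly two thousand candidate boxes needs its own forward pass.

```python
import torch
import torch.nn as nn

def propose_regions(image, n=2000):
    """Return n random (x, y, w, h) boxes as placeholder region proposals."""
    H, W = image.shape[1:]
    xs = torch.randint(0, W - 32, (n,)); ys = torch.randint(0, H - 32, (n,))
    return [(int(x), int(y), 32, 32) for x, y in zip(xs, ys)]

cnn = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 21),  # 20 classes + background
)

image = torch.randn(3, 480, 640)
detections = []
for (x, y, w, h) in propose_regions(image):          # ~2000 candidate regions ...
    crop = image[:, y:y + h, x:x + w].unsqueeze(0)
    scores = cnn(crop)                                # ... each needs its own forward pass
    cls = scores.argmax(dim=1).item()
    if cls != 20:                                     # 20 = background in this sketch
        detections.append(((x, y, w, h), cls))
```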
At Microsoft Research Asia, researchers in the Visual Computing group developed a new algorithm called Spatial Pyramid Pooling (SPP), which runs the convolutional computation only once over the entire image and then reads out features for each candidate region from the shared result, instead of computing everything from scratch for every region. With this new algorithm, object-detection speed improved by hundreds of times without compromising accuracy. In the 2014 ImageNet Large Scale Visual Recognition Challenge, the Microsoft Research Asia system based on the SPP algorithm placed third in classification and second in detection. The technology has since been integrated into OneDrive, which can now automatically tag uploaded images, and users can search for the corresponding images by entering keywords.
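The contrast with the per-region approach can be sketched as follows: the convolutional layers run once over the whole image, and a spatial-pyramid pooling step then turns each region of the shared feature map into a fixed-length vector for classification. The pyramid levels, layer sizes, and the coarse coordinate mapping (dividing by the network stride) are simplifications for illustration, not the published SPP-net design.

```python
import torch
import torch.nn as nn

# Run the convolutional layers ONCE on the whole image, then pool a
# fixed-length feature for each candidate region from the shared feature map.
conv_layers = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),  # overall stride 4
)
classifier = nn.Linear(8 * (1 + 4 + 16), 21)   # pyramid levels 1x1 + 2x2 + 4x4

def spp(region_features, levels=(1, 2, 4)):
    """Pool a region of the feature map into a fixed-length vector."""
    parts = [nn.functional.adaptive_max_pool2d(region_features, l).flatten(1)
             for l in levels]
    return torch.cat(parts, dim=1)

image = torch.randn(1, 3, 480, 640)
fmap = conv_layers(image)                       # one computation for the entire image
for (x, y, w, h) in [(40, 80, 128, 128), (200, 160, 96, 64)]:      # example regions
    fx, fy = x // 4, y // 4                                        # map to feature coords
    fw, fh = max(w // 4, 1), max(h // 4, 1)
    region = fmap[:, :, fy:fy + fh, fx:fx + fw]
    scores = classifier(spp(region))            # fixed length regardless of region size
```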
Looking Ahead: Computer Vision Dancing with Humanity
If we look only at faces, ignoring hairstyle and the rest of the body, human accuracy is about 97.5%, while computers can now exceed 99%. Does this mean computers have surpassed humans? Not at all, because we do not rely on faces alone; body shape and posture also help us recognize others. Under complicated lighting, humans can more intelligently draw on these additional cues to make a decision, whereas computers still lag behind in this respect. When faced with huge amounts of data or unfamiliar faces, however, computers can be the stronger party. If the two can play to each other’s strengths, the lyric “lend me your keen eyes” may become reality.
Humans constantly invent new technologies to replace old ones, completing tasks more efficiently and economically. This is also true in the field of computer vision, where we develop more convenient facial recognition systems for access control, replacing manual input of usernames and passwords—Microsoft’s Xbox One facial recognition system designed with an infrared camera has received positive feedback from users.
In addition to the recognition functions that humans can also perform, computer vision can also be applied in areas beyond human capabilities, in fields where sensory organs cannot reach, and in tedious tasks—automatically pressing the shutter at the moment of a smile, assisting drivers in parking, capturing body poses for interaction with computer games, accurately welding parts and inspecting defects in factories, helping warehouses sort goods during busy shopping seasons, cleaning rooms with robotic vacuums when leaving home, and automatically categorizing digital photos… Perhaps in the near future, supermarket scales will be able to identify types of vegetables; access control systems will recognize friends carrying gifts or thieves wielding crowbars; wearable devices and smartphones will help us identify any object in the frame and search for related information. Even more remarkably, it can surpass human sensory perception, using sound waves and infrared rays to perceive the world, observing the surging clouds to predict the weather, monitoring vehicle operations to manage traffic, and even breaking through our imagination to assist theoretical physicists in analyzing the motion of objects in higher-dimensional spaces.
Once, humans recorded the magnificent history with their eyes. In the future, we hope to gradually open the eyes of computers, allowing them not only to understand this colorful world but also to help humanity accomplish work and life more efficiently and intelligently. We look forward to a world that is not only colorful but also wise, dancing together with computer vision and humanity.
Image Source: Shutterstock