Can Computers See? Understanding Computer Vision

1. The Birth of Vision

For billions of years after life first emerged on Earth, it changed very little. Early organisms lay “flat” on the ocean floor, unable to move on their own or hunt for food.

It wasn’t until about 500 million years ago that evolution suddenly accelerated: in just a few tens of millions of years, life experimented with a wide variety of body structures, producing almost all the major animal body plans we see today. These creatures developed complex behaviors, such as predation, phototaxis, and avoiding danger.

There are many proposed explanations for this Cambrian explosion of life, but one key factor was the emergence of vision. Vision allowed organisms to perceive and adapt to their environments far more effectively, and it became their most important sense.

At first glance, vision seems to be a function of the eyes, as we always use our eyes to see things. However, the eyes are merely sensory organs that passively receive light information from the outside world. This information must undergo complex decoding before the brain can understand it, allowing us to know what is happening around us and how we should respond. Therefore, the brain is actually the most important visual organ.

For computers, simulating the function of the “eyes” is not difficult; a camera does it easily. But truly understanding visual information, as the brain’s visual cortex does, is extremely challenging.


Image source: Pixabay

As children, we only need to see a few cats to grasp their visual characteristics; the next time we encounter an unfamiliar cat, we recognize it at a glance. However, it is hard to express those characteristics in a form a computer can use. To a human, photos of different cats all obviously show cats; to a computer, the raw pixel values of those photos may have almost nothing in common.

Traditional vision algorithms therefore relied on hand-crafted rules to extract various image features, but they never truly understood the content of images. They struggled even with tasks that are trivial for humans, such as telling whether the object in an image is a cat or a dog.

2. The Aid of Neural Network Algorithms

To benchmark how accurately algorithms could classify images, computer scientist Fei-Fei Li, then teaching at Princeton University, released the massive ImageNet dataset in 2009, with an accompanying competition built on a subset of one thousand categories. In 2010, the competition's first year, the most advanced algorithms could correctly identify only about 72% of the images.

However, the advent of deep learning changed everything. In 2012, Geoffrey Hinton of the University of Toronto and two of his students published AlexNet, a neural network that immediately achieved a major breakthrough on ImageNet, raising accuracy to over 84%.

A few years later, Hinton received the Turing Award, and another author of the paper, Ilya Sutskever, went on to become a co-founder of OpenAI, but that is a story for another time.

How does a neural network recognize an image? Consider a simple example: recognizing a handwritten digit in a 28×28 image. We can stretch the image's pixels into a sequence of 784 numbers and feed that sequence to the neural network as input. The network's output layer has 10 neurons, one for each digit.

Initially, the network's outputs for an input image are essentially random. But if we train it on a large amount of labeled data, repeatedly comparing its outputs against the correct answers and feeding the error back so it can adjust its parameters, the network gradually learns to recognize digits correctly.
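
To make this concrete, here is a minimal sketch of such a network and one training step in PyTorch (the framework, layer sizes, and learning rate are illustrative choices, not details from the article):

```python
import torch
from torch import nn

# The network described above: 784 inputs, one hidden layer,
# and 10 output neurons, one per digit.
model = nn.Sequential(
    nn.Flatten(),             # stretch the 28x28 image into 784 numbers
    nn.Linear(28 * 28, 100),  # hidden layer of 100 neurons
    nn.ReLU(),
    nn.Linear(100, 10),       # one output neuron for each digit 0-9
)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(images, labels):
    """One round of feedback: compare outputs to the correct answers,
    then nudge the parameters to reduce the error."""
    optimizer.zero_grad()
    logits = model(images)          # images: (batch, 1, 28, 28)
    loss = loss_fn(logits, labels)  # how wrong were we?
    loss.backward()                 # propagate the error backward
    optimizer.step()                # adjust the parameters
    return loss.item()
```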

Yet, this simple neural network also has its issues.

3. The Emergence of New Problems

The first issue is the sheer number of parameters. Even with a modest hidden layer of 100 neurons between the input and output, we already have 784×100 + 100×10 = 79,400 connections, and the images we need to process are usually far larger than 28×28 pixels, so the model quickly grows too large to train effectively. The second issue is that flattening the image into a sequence destroys the spatial arrangement of the pixels, which is nothing like the way humans look at images.
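
A quick back-of-the-envelope calculation shows how fast this blows up (the 224×224 photo size below is an illustrative assumption, not a figure from the article):

```python
# Weight counts for the fully connected network above (biases omitted).
small = 28 * 28 * 100 + 100 * 10        # 28x28 grayscale digit image
print(small)                            # 79400

# The same architecture applied to a modest 224x224 color photo:
large = 224 * 224 * 3 * 100 + 100 * 10  # 3 channels for RGB
print(large)                            # 15053800 -- nearly 200x more
```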

How can we solve these two problems? Researchers observed two characteristics.

First, identifying an object in an image does not require scanning every pixel; it is enough to find a key feature in some local region of the image. If we spot black-and-white striped fur, for instance, we might immediately conclude that the animal in the image is a zebra.

Second, the location of this feature in the image is not crucial. Regardless of where a cat appears in a photo, it is still a cat.

Thus, researchers stopped stretching the pixels into a flat sequence and instead slid something like a small window over the image, capturing local features at each position. Because one shared set of parameters is reused as the window slides across the entire image, the number of parameters drops sharply while every region of the image is still examined. Neural networks built from such “small windows” are called convolutional neural networks, and AlexNet is in fact a fairly simple convolutional neural network.
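
To sketch why this saves parameters, here is a single convolutional layer in PyTorch (the channel and kernel sizes are illustrative assumptions): its parameter count depends only on the window size, not on the size of the image it slides over.

```python
import torch
from torch import nn

# One "small window": a 3x3 kernel producing 32 feature maps
# from a single-channel (grayscale) image.
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 32 * (3*3*1) weights + 32 biases = 320 parameters

# The same 320 parameters handle any image size -- the window
# simply slides over a larger area.
print(conv(torch.randn(1, 1, 28, 28)).shape)    # (1, 32, 26, 26)
print(conv(torch.randn(1, 1, 224, 224)).shape)  # (1, 32, 222, 222)
```

Compare those 320 parameters with the 79,400 connections of the fully connected version above: the saving comes entirely from reusing one window everywhere.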

Neural network technology kept improving from there, with more neurons and more layers yielding better performance. Within a few years, accuracy on ImageNet exceeded 97%, reaching human-level performance on this dataset.

However, image classification is only one of many computer vision tasks. Object detection, a somewhat harder task, requires not only identifying the objects in an image but also marking where they are, and a single image may contain objects of more than one type.

Object detection is widely used in autonomous driving, since an autonomous system must recognize many different kinds of objects: other vehicles, pedestrians, traffic lights, road signs, and so on.
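
As a hedged sketch of what a detector consumes and produces, the example below uses torchvision's pretrained Faster R-CNN (a choice made for illustration; the article does not name any particular detection model):

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a detector pretrained on the COCO dataset.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A random tensor stands in here for a real road-scene photo.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])

# Unlike a classifier, a detector returns a box, a class label,
# and a confidence score for every object it finds.
print(predictions[0]["boxes"].shape)  # (N, 4) corner coordinates
print(predictions[0]["labels"])       # class index of each detection
print(predictions[0]["scores"])       # confidence of each detection
```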

In addition, we want models that can understand different “modalities” of data and combine them. Models that link text and images, for example, can generate an image from a text description.

Beyond analyzing existing images, we also want machines to generate new images and videos. Organizations such as OpenAI, Google, and Baidu have built relatively mature image generation tools, but video generation is still in its infancy, with considerable room for improvement.

Another open question in computer vision is whether we can build a general-purpose vision model analogous to GPT-4 or ChatGPT. After all, visual understanding is an indispensable part of intelligence; a language model without visual capabilities can hardly be said to represent complete intelligence.

This article is produced by the Science Popularization China – Starry Sky Project (Creative Cultivation). Please credit the source when reprinting.

Author: Guan Xinyu, Science Popularization Author

Reviewed by: Yu Yang, Head of Tencent Xuanwu Laboratory
