Understanding Applications of Deep Learning in Computer Vision

Source: New Machine Vision
Originally from: Chengmai Technology
Abstract: This article introduces the five major technologies in computer vision: image classification, object detection, object tracking, semantic segmentation, and instance segmentation. For each technology, the basic concept and representative methods are presented in a way that is simple and easy to read.
Computer vision is one of the hottest research areas today. It is a multidisciplinary field that draws on computer science (graphics, algorithms, theoretical research, etc.), mathematics (information retrieval, machine learning), engineering (robotics, NLP, etc.), biology (neuroscience), and psychology (cognitive science). Because computer vision amounts to a degree of understanding of the visual environment and its context, many scientists believe that research in this field will lay the foundation for the development of the artificial intelligence industry.
So, what is computer vision? Here are some recognized definitions:
1. The clear and meaningful description of the structure of physical objects from images (Ballard & Brown, 1982);
2. Calculating the properties of the three-dimensional world from one or more digital images (Trucco & Verri, 1998);
3. Making useful decisions about real objects and scenes based on sensed images (Stockman & Shapiro, 2001).
So, why study computer vision? The answer is obvious; a series of applications can be derived from this field, such as:
1. Face recognition: algorithms detect faces in a photo and recognize the person’s identity;
2. Image retrieval: Similar to Google Images, using content-based queries to search for relevant images, algorithms return images that best match the query content.
3. Gaming and control: Motion-sensing games;
4. Surveillance: Surveillance cameras commonly found in public places to monitor suspicious behavior;
5. Biometric technology: Fingerprint, iris, and facial recognition are commonly used methods in biometric identification;
6. Intelligent vehicles: Vision remains the primary source of information for observing traffic signs, signals, and other visual features;
As Stanford University’s open course CS231n notes, most computer vision tasks, such as image classification, localization, and detection, are tackled with convolutional neural networks. So which computer vision tasks occupy a central position and have a real impact on the world? This article shares five important computer vision technologies, along with their related deep learning models and applications. I believe these five technologies can change how you see the world.
1. Image Classification
The task of image classification occurs constantly in daily life, and we usually take it for granted. Every morning when we brush our teeth, we pick up items such as a toothbrush and a towel; recognizing which item is which before grabbing it is itself an image classification task. The formal definition is: given a set of images, each labeled with a category, predict the category labels of a new set of test images and measure the accuracy of the predictions.
How do we write an algorithm that can classify images? Computer vision researchers have proposed a data-driven approach. Rather than specifying in code what each category looks like, they provide the computer with many labeled images of every category and develop learning algorithms that let the computer learn the features of these images itself, then classify new images based on the learned features.
Accordingly, image classification generally proceeds in the following steps (a minimal code sketch follows the list):
1. First, input a set of training image datasets;
2. Then, train a classifier using the training set that can learn the features of each category;
3. Finally, use a test set to evaluate the performance of the classifier by comparing the predicted results with the actual category labels;
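To make these steps concrete, below is a minimal sketch of the data-driven pipeline using the simplest possible classifier, a nearest-neighbour rule on raw pixels. The random arrays and the use of NumPy are illustrative assumptions standing in for a real labeled image dataset; they are not part of the original article.

```python
# Toy illustration of steps 1-3: train set in, classifier "trained" (memorised), test accuracy out.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for real data: 100 training "images" (32x32 RGB, flattened), 10 classes.
X_train = rng.random((100, 32 * 32 * 3))
y_train = rng.integers(0, 10, size=100)
X_test = rng.random((20, 32 * 32 * 3))
y_test = rng.integers(0, 10, size=20)

def nearest_neighbour(x):
    """Step 2: this 'classifier' returns the label of the closest training image (L2 distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

# Step 3: evaluate by comparing predictions with the true labels of the test set.
predictions = np.array([nearest_neighbour(x) for x in X_test])
print("test accuracy:", (predictions == y_test).mean())
```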
For image classification, the most popular method is the convolutional neural network (CNN). CNNs are a widely used deep learning method, and their performance far exceeds that of traditional machine learning algorithms on this task. A CNN is composed mainly of convolutional layers, pooling layers, and fully connected layers, with the convolutional layer regarded as the main component for extracting image features. The convolutional layer works like a scanner: a convolution kernel is convolved with the image’s pixel matrix, covering only a kernel-sized region at a time and then sliding on to the next region; this pattern of computation is known as a sliding window.
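As a sketch of the convolution, pooling, and fully connected structure just described, here is a minimal convolutional network written in PyTorch. PyTorch itself, the layer sizes, and the 32x32 input are illustrative assumptions, not details taken from the article or from any specific published model.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: the kernel slides over the image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # pooling: filters out detail, halves the resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected layer on flattened features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)       # feature extraction
        x = torch.flatten(x, 1)    # flatten the features before the fully connected layer
        return self.classifier(x)  # class scores

scores = TinyCNN()(torch.randn(1, 3, 32, 32))  # one random 32x32 RGB "image"
print(scores.shape)                            # torch.Size([1, 10])
```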
[Figure: CNN image classification pipeline (convolution, pooling, fully connected layers, classifier)]
As the figure shows, the input image is fed into the convolutional neural network: features are extracted by the convolutional layers, detail is filtered out by the pooling layers (typically max pooling or average pooling), and finally the features are flattened in the fully connected layers and passed to a classifier to produce the classification result.
Most image classification algorithms are trained on the ImageNet dataset, which consists of 1.2 million images covering 1,000 categories. It has been called the dataset that changed artificial intelligence and the world: ImageNet made people realize that building a good dataset is central to AI research, and that data is as crucial as algorithms. An annual competition, the ImageNet Challenge (ILSVRC), was therefore organized around this dataset.
The 2012 ImageNet Challenge was won by Alex Krizhevsky’s deep convolutional neural network, AlexNet (NIPS 2012), whose structure is shown in the following figure. The model combined several techniques, such as max pooling, the rectified linear unit (ReLU) activation function, and GPU-accelerated training, and it opened the curtain on the current wave of deep learning research.
[Figure: AlexNet network architecture]
Since AlexNet won the competition, many CNN-based models have achieved excellent results on ImageNet, such as ZFNet (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), and DenseNet (2016).
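For readers who want to try one of these ImageNet-trained classifiers, the sketch below loads a pretrained ResNet-50 from torchvision. The use of torchvision, the `weights` argument (which needs a reasonably recent version of the library), and the random tensor standing in for a real photo are all assumptions made for illustration.

```python
import torch
from torchvision import models, transforms

# Load a ResNet-50 pretrained on the 1000-class ImageNet dataset.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.eval()

# Preprocessing that would be applied to a real photo (a PIL image) before classification.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# batch = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # with a real image file
batch = torch.randn(1, 3, 224, 224)  # random stand-in so the sketch runs without an image file

with torch.no_grad():
    probs = model(batch).softmax(dim=1)
print("predicted ImageNet class index:", int(probs.argmax(dim=1)))
```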
2. Object Detection
Object detection outputs bounding boxes and category labels for the individual objects in an image. For example, in vehicle detection, bounding boxes must be drawn around every vehicle in a given image.
The CNNs that shone in image classification can be applied here as well. The first influential model was R-CNN (Region-based Convolutional Neural Network), shown in the following figure. In this approach, a selective search algorithm first scans the image and generates candidate regions (region proposals). A CNN is then run on each candidate region, and its output is fed to an SVM classifier that labels the region and to a linear regressor that refines the bounding box around the target.
[Figure: R-CNN pipeline]
Essentially, it transforms object detection into an image classification problem. However, this method has some issues, such as slow training speed, high memory consumption, and long prediction time.
To solve these problems, Ross Girshick proposed the Fast R-CNN algorithm, which improved detection speed in two ways:
1) Feature extraction is performed on the whole image before the region proposals are classified, so the CNN only needs to run once per image;
2) A softmax classifier replaces the SVM classifier.
[Figure: Fast R-CNN architecture]
Although Fast R-CNN improved speed, the selective search algorithm still takes a long time to generate region proposals. The Faster R-CNN algorithm was therefore proposed: it introduces a Region Proposal Network (RPN) to replace selective search, integrating everything into a single network and greatly improving both detection speed and accuracy.
[Figure: Faster R-CNN architecture with Region Proposal Network]
In recent years, the trend in object detection research has mainly developed towards faster and more efficient detection systems. There are currently other methods available, such as YOLO, SSD, and R-FCN, etc.
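As a quick illustration of what a modern detector returns, the sketch below runs torchvision’s pretrained Faster R-CNN on a random image-sized tensor. torchvision, the specific model constructor, the `weights` argument (recent torchvision versions), and the 0.5 score threshold are illustrative assumptions.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights

model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = torch.rand(3, 480, 640)  # random stand-in for a real RGB image with values in [0, 1]
with torch.no_grad():
    output = model([image])[0]   # the model takes a list of images and returns one dict per image

# Each detection comes with a bounding box, a class label and a confidence score.
for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
    if score > 0.5:
        print(label.item(), round(score.item(), 3), box.tolist())
```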
3. Object Tracking
Object tracking refers to the process of tracking specific objects of interest or multiple objects in a given scene. In simple terms, it involves providing the initial state (such as position and size) of the target in the first frame of the tracking video and automatically estimating the state of the target object in subsequent frames. This technology is crucial in fields like autonomous vehicles.
Based on the observation model, object tracking methods fall into two categories: generative methods and discriminative methods. Generative methods use a generative model to describe the appearance of the target and then search among candidate regions for the one that minimizes reconstruction error; common algorithms include sparse coding and principal component analysis (PCA). Discriminative methods, in contrast, train a classifier to distinguish the target from the background; they tend to perform more stably and have gradually become the main line of research in object tracking. Common approaches include stacked autoencoders (SAE) and convolutional neural networks (CNN).
The classic deep network for SAE-based object tracking is the Deep Learning Tracker (DLT), which proposed offline pre-training followed by online fine-tuning. Its main steps are as follows (a simplified code sketch is given below):
1. First, use a stacked denoising autoencoder (SDAE) for unsupervised offline pre-training on a large-scale natural image dataset to obtain general object representation capabilities.
2. Combine the encoding part of the pre-trained network with a classifier to form a classification network, and then fine-tune the network using positive and negative samples obtained from the initial frame to distinguish the current object from the background. During tracking, the patch with the highest score output by the classification network is selected as the final predicted target.
3. The model is updated with a threshold-based strategy: when the confidence of the predicted target falls below a preset threshold, the network is fine-tuned again to adapt to appearance changes.
[Figure: DLT offline pre-training and online tracking framework]
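Below is a much-simplified sketch of the DLT idea in PyTorch: pre-train a denoising autoencoder on unlabeled patches, then reuse its encoder with a small classification head to score candidate patches as target or background. The layer sizes, noise level, optimizer settings, and random data are illustrative assumptions and do not reproduce the actual DLT implementation.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, dim_in: int = 1024, dim_hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(dim_hidden, dim_in), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        noisy = x + 0.1 * torch.randn_like(x)      # corrupt the input (the "denoising" part)
        return self.decoder(self.encoder(noisy))   # try to reconstruct the clean input

# Step 1: offline pre-training on (stand-in) unlabeled natural-image patches.
dae = DenoisingAutoencoder()
patches = torch.rand(512, 1024)
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)
for _ in range(10):  # a few illustrative epochs
    loss = nn.functional.mse_loss(dae(patches), patches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 2: online part: reuse the pre-trained encoder with a sigmoid head that scores
# candidate patches from the current frame as target (high score) vs background (low score).
classifier = nn.Sequential(dae.encoder, nn.Linear(256, 1), nn.Sigmoid())
candidates = torch.rand(32, 1024)                  # stand-in candidate patches
scores = classifier(candidates).squeeze(1)
best = int(scores.argmax())                        # the highest-scoring patch becomes the predicted target
print("predicted target patch:", best, "score:", float(scores[best]))
```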
Typical CNN-based object tracking algorithms include FCNT and MDNet.
One of the highlights of FCNT is its in-depth analysis of the performance of CNN features pre-trained on ImageNet for object tracking tasks:
1. CNN feature maps can be used for locating tracking targets;
2. Many CNN feature maps contain noise or are less relevant to the task of distinguishing between the target and the background;
3. Different layers of CNN extract different features. High-level features are more abstract and excel at distinguishing between different categories of objects, while low-level features focus more on the local details of the target.
Based on these observations, FCNT ultimately proposed a model structure as shown in the following figure:
1. From the conv4-3 and conv5-3 layers of the VGG network, select the feature-map channels most relevant to the current tracking target;
2. To avoid overfitting, build two networks on the selected features: GNet, which captures category information from the conv5-3 features, and SNet, which captures target-specific information from the conv4-3 features;
3. Use the bounding box given in the first frame to generate a target heatmap and train both SNet and GNet by regression;
4. For each new frame, crop a region centered on the previous prediction and feed it into GNet and SNet to obtain two predicted heatmaps; which heatmap is used for the final prediction depends on whether distractors are present.
[Figure: FCNT model structure (SNet and GNet built on selected VGG features)]
In contrast to FCNT, MDNet trains on entire video sequences. Training across sequences raises a problem, however: the tracked target differs from sequence to sequence, and an object that is the target in one sequence may be background in another. MDNet therefore proposed the idea of multi-domain training. Its network structure, shown in the following figure, is divided into shared layers and domain-specific classification layers: the shared layers extract features common to all sequences, while each sequence has its own classification layer that separates the target from the background.
[Figure: MDNet network structure with shared layers and domain-specific classification layers]
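The sketch below illustrates the multi-domain structure in PyTorch: shared layers that extract features for every sequence, plus one small target-vs-background classification layer per training sequence (domain). The feature dimensions and the use of fully connected layers are simplifying assumptions, not the actual MDNet architecture.

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    def __init__(self, num_domains: int, feat_dim: int = 512):
        super().__init__()
        self.shared = nn.Sequential(        # shared layers, trained on all sequences
            nn.Linear(1024, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # one domain-specific classification layer per training sequence
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 2) for _ in range(num_domains)])

    def forward(self, x: torch.Tensor, domain: int) -> torch.Tensor:
        return self.heads[domain](self.shared(x))  # target/background scores for that sequence

net = MultiDomainNet(num_domains=3)
patch_features = torch.rand(8, 1024)               # stand-in features for 8 candidate patches
print(net(patch_features, domain=1).shape)         # torch.Size([8, 2])
```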
4. Semantic Segmentation
Segmentation is a core process in computer vision: it divides an image into groups of pixels that are then labeled and classified. Semantic segmentation tries to understand, for every pixel, which semantic category it belongs to (e.g., car, motorcycle, etc.).
CNNs also demonstrate excellent performance on this task. A typical method is the fully convolutional network (FCN), whose structure is shown in the following figure. Given an input image, the FCN directly produces dense predictions at its output, i.e., a category for every pixel, achieving end-to-end semantic segmentation.
[Figure: FCN architecture for semantic segmentation]
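As an illustration of end-to-end dense prediction, the sketch below runs torchvision’s pretrained FCN and turns the per-class score maps into a per-pixel label map. torchvision, the model constructor, the `weights` argument (recent versions), and the random input tensor are illustrative assumptions.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50, FCN_ResNet50_Weights

model = fcn_resnet50(weights=FCN_ResNet50_Weights.DEFAULT)
model.eval()

image = torch.rand(1, 3, 384, 512)         # stand-in for a normalised RGB image batch
with torch.no_grad():
    logits = model(image)["out"]           # shape (1, num_classes, H, W): one score map per class
labels = logits.argmax(dim=1)              # per-pixel class label, i.e. the semantic segmentation
print(labels.shape)                        # torch.Size([1, 384, 512])
```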
Unlike FCN, which learns to upsample, SegNet passes the max-pooling indices from the encoder to the decoder and uses them for upsampling, which improves segmentation resolution at object boundaries and makes memory usage more efficient.
[Figure: SegNet encoder-decoder architecture]
There are also other methods, such as dilated convolutions, DeepLab, and RefineNet.
5. Instance Segmentation
Beyond semantic segmentation, instance segmentation also separates different instances of a class, for example marking five cars with five different colors. In classification there is usually a single image focused on one object, and the task is to say what that image shows. Instance segmentation is a more complex task: we see cluttered scenes with multiple overlapping objects and ordinary backgrounds, and we must not only classify these objects but also determine their boundaries, their differences, and their relationships to one another.
So far, we have seen how to effectively locate everyday items in images with bounding boxes using CNN features. Can we extend these techniques to locate the precise pixels of each object, not just bounding boxes?
CNNs also perform excellently on this task; a typical algorithm is Mask R-CNN. Mask R-CNN extends Faster R-CNN with a branch that outputs a binary mask, running in parallel with the existing classification and bounding-box regression branches, as shown in the following figure:
[Figure: Mask R-CNN architecture with the added mask branch]
Faster R-CNN on its own performs poorly on instance segmentation: its RoIPool layer rounds region coordinates, and this quantization misaligns the extracted features with the underlying pixels. To correct this, Mask R-CNN introduces a RoIAlign layer, which uses bilinear interpolation instead of rounding and thereby avoids the errors that lead to inaccurate detection and segmentation.
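The bilinear-interpolation idea can be seen directly in torchvision’s `roi_align` operator, sketched below with a made-up feature map and a region whose boundaries fall between pixels; this is an illustration, not the internals of any specific Mask R-CNN implementation.

```python
import torch
from torchvision.ops import roi_align

feature_map = torch.rand(1, 256, 50, 50)              # (batch, channels, H, W) from a backbone
# One region of interest: (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0, 10.3, 12.7, 30.1, 28.4]])
# RoIAlign samples the feature map with bilinear interpolation instead of rounding coordinates,
# so fractional box boundaries like 10.3 are not snapped to the nearest integer cell.
pooled = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=1.0, sampling_ratio=2)
print(pooled.shape)                                   # torch.Size([1, 256, 7, 7])
```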
Once the masks are generated, Mask R-CNN combines the classifier and bounding boxes to produce very accurate segmentations:
[Figure: example Mask R-CNN segmentation results]
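The sketch below runs torchvision’s pretrained Mask R-CNN and shows how each detected instance comes back with a label, a box, a score, and a per-pixel mask. The library choice, the `weights` argument (recent torchvision versions), the random input, and the 0.5 thresholds are illustrative assumptions.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights

model = maskrcnn_resnet50_fpn(weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

image = torch.rand(3, 480, 640)                        # stand-in for an RGB image with values in [0, 1]
with torch.no_grad():
    output = model([image])[0]

# Each detected instance has a class label, a bounding box, a score and a soft per-pixel mask.
for mask, label, score in zip(output["masks"], output["labels"], output["scores"]):
    if score > 0.5:
        binary_mask = mask[0] > 0.5                    # threshold the soft mask into a binary mask
        print(label.item(), round(score.item(), 3), int(binary_mask.sum()), "mask pixels")
```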
Conclusion
The five computer vision technologies mentioned above can help computers extract, analyze, and understand useful information from a single image or a series of images. Additionally, there are many other advanced techniques waiting for our exploration, such as style transfer and action recognition. I hope this article can guide you to change the way you view the world.