Overview of Deep Learning Convolutional Neural Networks (CNN): From Basic Technology to Research Prospects


Source: Machine Heart

Editor: Algorithm Insights

Convolutional Neural Networks (CNN) have achieved unprecedented success in the field of computer vision, but we still do not have a comprehensive understanding of the reasons behind their remarkable effectiveness. Recently, Isma Hadji and Richard P. Wildes from the Department of Electrical Engineering and Computer Science at York University published a paper titled “What Do We Understand About Convolutional Networks?” which outlines the technical foundations, components, current status, and research prospects of convolutional networks, introducing our current understanding of CNNs. Machine Heart has compiled a summary of this paper; for more detailed information, please refer to the original paper and the related literature indexed within it.

Paper link: https://arxiv.org/abs/1803.08834


1 Introduction

1.1 Motivation

In recent years, research in computer vision has focused primarily on Convolutional Neural Networks (commonly referred to as ConvNets or CNNs), which have achieved new state-of-the-art performance on a wide range of classification and regression tasks. Although the history of these methods stretches back a number of years, theoretical understanding of how these systems achieve their excellent results still lags behind. In fact, many current achievements in computer vision treat CNNs as black boxes. This approach works, but why it works remains very unclear, which falls short of the standards of scientific research. In particular, there are two complementary questions:

(1) In terms of what is learned (e.g., the convolution kernels), what exactly is the network learning?

(2) In terms of architectural design choices (e.g., the number of layers, the number of kernels, pooling strategies, the choice of non-linearity), why are some choices better than others? The answers to these questions not only deepen our scientific understanding of CNNs but can also improve their practical utility.

Moreover, current approaches to building CNNs require large amounts of training data, and design decisions have a significant impact on performance. A deeper theoretical understanding should help reduce this reliance on data-driven design. Although empirical studies have investigated how trained networks operate, to date these results have largely been limited to visualizations of internal processing, aimed at understanding what happens in the different layers of a CNN.

1.2 Objective

In response to the above situation, this report will outline the most prominent methods proposed by researchers using multi-layer convolutional architectures. It is important to emphasize that this report will discuss various components of typical convolutional networks by outlining different methods and will introduce the biological discoveries and/or reasonable theoretical foundations on which their design decisions are based. Additionally, this report will outline different attempts to understand CNNs through visualization and empirical research. The ultimate goal of this report is to clarify the role of each processing layer involved in CNN architectures, consolidate our current understanding of CNNs, and highlight the unresolved issues.

1.3 Report Outline

The structure of this report is as follows: this chapter provides the motivation for reviewing our understanding of convolutional networks. Chapter 2 will describe various multi-layer networks and provide the most successful architectures used in computer vision applications. Chapter 3 will focus more specifically on each construction module of typical convolutional networks and will discuss the design of different components from both biological and theoretical perspectives. Finally, Chapter 4 will discuss current trends in CNN design and understanding, while also highlighting some key shortcomings that still exist.

2 Multi-layer Networks

This chapter briefly outlines the most prominent multi-layer architectures used in the field of computer vision. Note that although it covers the most significant contributions in the literature, it does not provide a comprehensive survey of these architectures, since such surveys already exist elsewhere (e.g., [17, 56, 90]). Rather, its purpose is to lay the groundwork for the remainder of this report, so that we can examine and discuss in detail the current understanding of convolutional networks used for visual information processing.

2.1 Multi-layer Architectures

Before the recent success of deep-learning-based networks, state-of-the-art computer vision systems for recognition relied on two separate but complementary steps. First, the input data is transformed through a set of hand-designed operations (such as convolution with a basis set, or local or global encoding methods) into a suitable form. This transformation usually entails finding a compact and/or abstract representation of the input data while injecting some invariances appropriate to the task at hand; its goal is to change the data in a way that makes it easier for a classifier to separate. Second, the transformed data is used to train some type of classifier (such as a support vector machine) to recognize the content of the input signal. In general, the performance of any classifier is strongly influenced by the transformation used.

Multi-layer learning architectures bring a different perspective to this problem, proposing not only to learn classifiers but also to learn the required transformation operations directly from the data. This form of learning is often referred to as “representation learning,” and when applied in deep multi-layer architectures, it is called “deep learning.”

Multi-layer architectures can be defined as computational models that allow useful information to be extracted from multi-layer abstractions of input data. In general, the design goal of multi-layer architectures is to highlight important aspects of the input at higher layers while becoming increasingly robust to less important variations. Most multi-layer architectures stack simple building blocks with alternating linear and non-linear functions. Over the years, researchers have proposed many different types of multi-layer architectures, and this chapter will cover the most prominent of these architectures used in computer vision applications. Artificial neural networks are a key focus here, as their performance is outstanding. For simplicity, these networks will be referred to as “neural networks”.

2.1.1 Neural Networks

A typical neural network consists of an input layer, an output layer, and multiple hidden layers, each containing multiple units.


Figure 2.1: Schematic diagram of a typical neural network architecture, image from [17]
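To make the layer structure concrete, here is a minimal numpy sketch (not from the paper; the tiny weights are made up for illustration) of a forward pass through a fully connected network of the kind shown in Figure 2.1:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass through a fully connected network: each hidden
    layer applies an affine map followed by a ReLU non-linearity."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)   # hidden layer: affine + ReLU
    return h @ weights[-1] + biases[-1]  # linear output layer

# One input unit -> two hidden units -> one output unit.
x = np.array([2.0])
weights = [np.array([[1.0, -1.0]]), np.array([[1.0], [1.0]])]
biases = [np.zeros(2), np.zeros(1)]
y = mlp_forward(x, weights, biases)      # a single output value
```

The same loop extends to any number of hidden layers by appending more weight matrices.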

Autoencoders can be defined as multi-layer neural networks consisting of two main parts. The first part is the encoder, which transforms the input data into a feature vector; the second part is the decoder, which maps the generated feature vector back to the input space.


Figure 2.2: Structure of a typical autoencoder network, image from [17]
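The two-part encoder/decoder structure can be sketched in a few lines of numpy (an illustrative toy with random, untrained weights, not the trained model of Figure 2.2):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_enc):
    """Encoder: project the input into a lower-dimensional feature vector."""
    return np.tanh(x @ W_enc)

def decode(z, W_dec):
    """Decoder: map the feature vector back to the input space."""
    return z @ W_dec

x = rng.normal(size=(4, 8))        # 4 samples, 8-dimensional input
W_enc = rng.normal(size=(8, 3))    # bottleneck of 3 features
W_dec = rng.normal(size=(3, 8))

z = encode(x, W_enc)               # compressed representation
x_hat = decode(z, W_dec)           # reconstruction in the input space
```

Training would adjust `W_enc` and `W_dec` to minimize the reconstruction error between `x` and `x_hat`.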

2.1.2 Recurrent Neural Networks

When it comes to tasks that rely on sequential input, Recurrent Neural Networks (RNN) are one of the most successful multi-layer architectures. RNNs can be viewed as a special type of neural network where the input to each hidden unit consists of the data observed at the current time step and the state from the previous time step.


Figure 2.3: Schematic diagram of standard recurrent neural network operations. The input to each RNN unit is the new input at the current time step and the previous time step’s state; the new output is then computed based on this input, which can be fed to the next layer of the multi-layer RNN for processing.
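The recurrence described above can be written directly: each step combines the current input with the previous state. The following is a minimal numpy sketch with hypothetical random weights (a vanilla RNN, not an LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrent update: the new hidden state mixes the current
    input with the state carried over from the previous time step."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

def rnn_forward(xs, W_xh, W_hh, b_h):
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:                 # unroll over the sequence
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(5, 2))       # a sequence of 5 two-dimensional inputs
W_xh = rng.normal(size=(2, 3))     # input-to-hidden weights
W_hh = rng.normal(size=(3, 3))     # hidden-to-hidden (recurrent) weights
states = rnn_forward(xs, W_xh, W_hh, np.zeros(3))
```

In a multi-layer RNN, `states` would be fed as the input sequence to the next layer.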


Figure 2.4: Schematic diagram of a typical LSTM unit. The input to this unit consists of the input at the current time step and the state from the previous time step; it then produces an output that is fed to the next time step. The final output of the LSTM unit is controlled by the input gate, output gate, and memory cell state. Image from [33]

2.1.3 Convolutional Networks

Convolutional Networks (CNNs) are a class of neural networks particularly suitable for computer vision applications because they can use local operations to build layered abstractions of representations. Two key design ideas have driven the success of convolutional architectures in computer vision. First, CNNs exploit the 2D structure of images, in which pixels in adjacent regions are typically highly correlated. Therefore, rather than using one-to-one connections between all pixel units (as most neural networks do), CNNs can use grouped local connections. Second, CNN architectures rely on weight sharing: each channel (i.e., output feature map) is generated by convolving the same filter across all positions of the input.


Figure 2.5: Schematic diagram of the structure of a standard convolutional network, image from [93]
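Both ideas — local connections and a shared filter — are visible in a naive numpy implementation of a single convolutional layer (an illustrative sketch, not an efficient implementation):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation: the same kernel (shared weights)
    is slid over every spatial position, and each output value depends
    only on a local neighbourhood of the input."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2)) / 4.0     # a simple 2x2 averaging filter
fmap = conv2d(image, kernel)       # one output feature map, shape (3, 3)
```

A full convolutional layer would apply many such kernels, producing one feature map per kernel.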


Figure 2.6: Schematic diagram of the structure of Neocognitron, image from [49]

2.1.4 Generative Adversarial Networks

A typical Generative Adversarial Network (GAN) consists of two competing modules or sub-networks: the generator network and the discriminator network.


Figure 2.7: Schematic diagram of the general structure of a generative adversarial network
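The two competing sub-networks can be sketched as follows (a toy with random, untrained weights; real GANs train both modules adversarially, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)

def generator(z, W_g):
    """Generator: maps a noise vector to a synthetic sample."""
    return np.tanh(z @ W_g)

def discriminator(x, W_d):
    """Discriminator: maps a sample to a probability of being real."""
    return 1.0 / (1.0 + np.exp(-(x @ W_d)))

z = rng.normal(size=(5, 4))        # batch of 5 noise vectors
W_g = rng.normal(size=(4, 8))      # generator weights (noise -> sample)
W_d = rng.normal(size=(8,))        # discriminator weights (sample -> score)

fake = generator(z, W_g)           # synthetic samples
scores = discriminator(fake, W_d)  # probabilities in (0, 1)
```

During training, the discriminator is pushed to assign low scores to `fake` and high scores to real data, while the generator is pushed to raise the scores of its own samples.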

2.1.5 Training Multi-layer Networks

As discussed earlier, the success of many multi-layer architectures largely depends on the success of their learning process. That training process is usually based on backpropagation of error using gradient descent, which, owing to its simplicity, is widely used for training multi-layer architectures.
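The core of gradient descent is easy to show on a toy problem (a one-parameter least-squares fit, chosen here purely for illustration — real networks compute the gradient via backpropagation through every layer):

```python
import numpy as np

# Gradient descent on a 1-parameter least-squares problem:
# minimise L(w) = mean((w * x - y)^2), whose true minimiser is w = 2.
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x

w, lr = 0.0, 0.1
for _ in range(100):
    grad = np.mean(2.0 * (w * x - y) * x)  # analytic gradient dL/dw
    w -= lr * grad                          # descent step
```

Each iteration moves `w` against the gradient of the loss; the same update rule, applied to millions of parameters, trains the architectures above.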

2.1.6 Brief Discussion on Transfer Learning

The applicability of features extracted using multi-layer architectures across various datasets and tasks can be attributed to their hierarchical nature, where representations evolve from simple and local to abstract and global. Therefore, features extracted at lower levels in their hierarchy are often common across various different tasks, making it easier to implement transfer learning with multi-layer structures.


2.2 Spatial Convolutional Networks

In theory, convolutional networks can be applied to data of any dimension. Their 2D instances are particularly well suited to the structure of single images, and have therefore attracted considerable attention in computer vision. With large-scale datasets and powerful machines for training, the use of CNNs across a variety of tasks has recently grown rapidly. This section introduces prominent 2D CNN architectures that introduced relatively novel components beyond the original LeNet.

2.2.1 Key Architectures in Recent Developments of CNN


Figure 2.8: AlexNet architecture. It should be noted that although this appears to be a two-stream architecture, it is actually a single-stream architecture; this image simply illustrates the parallel training of AlexNet on 2 different GPUs. Image from [88]


Figure 2.9: GoogLeNet architecture. (a) A typical inception module demonstrating sequential and parallel operations. (b) A schematic diagram of a typical inception architecture consisting of stacked inception modules. Image from [138]


Figure 2.10: ResNet architecture. (a) Residual module. (b) A schematic diagram of a typical ResNet architecture consisting of stacked residual modules. Image from [64]


Figure 2.11: DenseNet architecture. (a) Dense module. (b) A schematic diagram of a typical DenseNet architecture consisting of stacked dense modules. Image from [72]

2.2.2 Achieving Invariance in CNNs

One major challenge of using CNNs is the requirement for very large datasets to learn all the fundamental parameters. Even large-scale datasets like ImageNet, which contain over one million images, are still considered too small when training specific deep architectures. One way to meet this large dataset requirement is to artificially augment the dataset, which includes random flipping, rotation, and jittering of images. One significant advantage of these augmentation methods is that they allow the resulting networks to better maintain invariance when faced with various transformations.
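The augmentations mentioned above (flipping, jittering) are simple array operations; the following is an illustrative numpy sketch (the flip probabilities and noise scale are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Randomly flip the image and jitter its intensities -- two of the
    simple label-preserving transformations used for augmentation."""
    if rng.random() < 0.5:
        image = image[:, ::-1]               # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]               # vertical flip
    noise = rng.normal(0.0, 0.05, image.shape)
    return np.clip(image + noise, 0.0, 1.0)  # intensity jitter

image = rng.random((8, 8))                   # a toy grayscale image in [0, 1]
batch = np.stack([augment(image, rng) for _ in range(4)])  # 4 random variants
```

Each training epoch can draw fresh random variants, so the network effectively sees a much larger dataset.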

2.2.3 Achieving Localization in CNNs

In addition to simple classification tasks such as object recognition, CNNs have recently excelled in tasks requiring precise localization, such as semantic segmentation and object detection.

2.3 Spatio-Temporal Convolutional Networks

The use of CNNs has brought significant performance improvements for various image-based applications and has also sparked researchers’ interest in extending 2D spatial CNNs to 3D spatio-temporal CNNs for video analysis. Generally speaking, the various spatio-temporal architectures proposed in the literature attempt to extend the 2D architecture in the spatial domain (x,y) into the temporal domain (x, y, t). In the field of training spatio-temporal CNNs, there are three prominent different architectural design decisions: LSTM-based CNNs, 3D CNNs, and Two-Stream CNNs.

2.3.1 LSTM-based Spatio-Temporal CNNs

LSTM-based spatio-temporal CNNs were among the earliest attempts to extend 2D networks to spatio-temporal data. Their operation can be summarized in three steps, as shown in Figure 2.16. First, a 2D network processes each frame, and feature vectors are extracted from its last layer. Second, these features from different time steps are fed to an LSTM, yielding temporal results. Third, these results are averaged or linearly combined and passed to a softmax classifier for the final prediction.
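The three-step pipeline can be sketched as follows. This is a heavily simplified stand-in with random weights: a ReLU projection replaces the 2D network of step 1, and a plain recurrent accumulator replaces the LSTM of step 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_features(frame, W):
    """Step 1 (stand-in): per-frame feature vector from a '2D network'."""
    return np.maximum(0.0, frame.ravel() @ W)

def recurrent_pool(feats, W_hh):
    """Step 2 (stand-in for the LSTM): accumulate features over time."""
    h = np.zeros(W_hh.shape[0])
    for f in feats:
        h = np.tanh(f + h @ W_hh)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

frames = rng.random((6, 4, 4))     # 6 frames of a tiny 4x4 'video'
W = rng.normal(size=(16, 8))       # hypothetical per-frame feature weights
W_hh = rng.normal(size=(8, 8)) * 0.1
W_out = rng.normal(size=(8, 3))    # 3 hypothetical action classes

feats = [frame_features(f, W) for f in frames]   # step 1: per-frame features
h = recurrent_pool(feats, W_hh)                  # step 2: temporal pooling
probs = softmax(h @ W_out)                       # step 3: final prediction
```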

2.3.2 3D CNNs

3D CNNs are the most direct generalization of 2D CNNs to the spatio-temporal domain. They process the temporal stream of RGB images directly, by applying learned 3D convolution filters to those images.
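A 3D convolution simply extends the sliding window into the time axis; the following naive numpy sketch (illustrative, single channel, single filter) makes the (x, y, t) structure explicit:

```python
import numpy as np

def conv3d(video, kernel):
    """'Valid' 3D convolution: the filter spans space and time (t, y, x),
    so each output value summarises a small spatio-temporal volume."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

video = np.ones((5, 6, 6))             # 5 frames of a constant 6x6 image
kernel = np.ones((2, 3, 3)) / 18.0     # spatio-temporal averaging filter
fmap = conv3d(video, kernel)           # output volume, shape (4, 4, 4)
```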

2.3.3 Two-Stream CNNs

This type of spatio-temporal architecture relies on a two-stream design. The standard two-stream architecture employs two parallel pathways—one for processing appearance and the other for processing motion; this method is similar to the dual-stream hypothesis in biological vision system research.
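A common way to combine the two pathways is late fusion of their class scores; the sketch below illustrates this with made-up logits standing in for the outputs of the appearance and motion streams:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Late fusion in a two-stream model (a simple sketch): one stream scores
# appearance (an RGB frame), the other scores motion (e.g. optical flow),
# and their class distributions are averaged.
appearance_logits = np.array([2.0, 0.5, 0.1])   # hypothetical stream outputs
motion_logits = np.array([1.5, 1.0, 0.2])

fused = 0.5 * (softmax(appearance_logits) + softmax(motion_logits))
pred = int(np.argmax(fused))                     # predicted class index
```

Other fusion points (earlier in the network) have also been explored in the literature.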

2.4 Overall Discussion

It is important to note that although these networks have achieved competitive results in many computer vision applications, their main drawbacks remain: limited understanding of the exact nature of the learned representations, reliance on large-scale training datasets, lack of precise performance bounds, and unclear procedures for selecting network hyperparameters.


3 Understanding the Building Blocks of CNNs

Given the numerous unresolved issues in the field of CNNs, this chapter will introduce the role and significance of each processing layer in typical convolutional networks. To this end, this chapter will outline the most prominent works addressing these issues. Notably, we will demonstrate how CNN components are modeled from both theoretical and biological perspectives. Each component’s introduction will be followed by a summary of our current level of understanding.

3.1 Convolutional Layer

The convolutional layer can be considered one of the most critical steps in CNN architectures. Essentially, convolution is a linear, translation-invariant operation composed of local weighted combinations performed on the input signal. Depending on the chosen set of weights (i.e., the selected point spread function), different properties of the input signal will be revealed. In the frequency domain, the modulation function associated with the point spread function indicates how the frequency components of the input are modulated through scaling and phase shifting. Thus, selecting an appropriate kernel is crucial for capturing the most significant and important information contained in the input signal, allowing the model to make better inferences about the content of that signal. This section will discuss various methods for implementing this kernel selection step.
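The point that different kernels reveal different properties of the input can be shown on a 1D step signal (an illustrative aside, not an example from the paper): an averaging kernel smooths the step, while a differencing kernel responds only at the edge.

```python
import numpy as np

# A step signal: flat, then a jump to 1.
signal = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Averaging kernel: a low-pass filter that smooths the transition.
smooth = np.convolve(signal, np.ones(3) / 3.0, mode='valid')

# Differencing kernel: a high-pass filter that isolates the edge.
edge = np.convolve(signal, np.array([1.0, -1.0]), mode='valid')
```

`smooth` ramps gradually from 0 to 1, while `edge` is zero everywhere except at the step, illustrating how the choice of weights determines which aspects of the signal are emphasized.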

3.2 Rectification

Multi-layer networks are often highly non-linear, and rectification is usually the first processing stage to introduce non-linearity into the model. Rectification refers to applying point-wise non-linearity (also known as activation functions) to the outputs of the convolutional layer. This term borrows from the field of signal processing, where rectification refers to converting AC into DC. This is also a processing step that can be found from both biological and theoretical perspectives. Computational neuroscientists introduce the rectification step to find suitable models that best explain current neuroscience data. On the other hand, machine learning researchers use rectification to allow models to learn faster and better. Interestingly, researchers from both areas often agree that they not only need rectification but also converge on the same type of rectification.


Figure 3.7: Non-linear rectification functions used in the literature of multi-layer networks
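A few of the point-wise rectification functions compared in this literature are one-liners in numpy (a small illustrative selection, not an exhaustive reproduction of Figure 3.7):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: zero out negative responses."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keep a small slope for negative inputs."""
    return np.where(x >= 0, x, alpha * x)

def sigmoid(x):
    """Logistic sigmoid: squash responses into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
r = relu(x)
l = leaky_relu(x)
s = sigmoid(x)
```

All three are applied element-wise to the convolutional layer's output; the ReLU family is the type that both the biological and machine learning communities have largely converged on.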

3.3 Normalization

As mentioned earlier, due to the presence of cascading non-linear operations in these networks, multi-layer architectures are highly non-linear. In addition to the rectification non-linearity discussed in the previous section, normalization is another non-linear processing module that plays an important role in CNN architectures. The most widely used form of normalization in CNNs is known as Divisive Normalization (DN, also referred to as Local Response Normalization). This section will introduce the role of normalization and describe how it corrects the shortcomings of the first two processing modules (convolution and rectification). Similarly, we will discuss normalization from both biological and theoretical perspectives.
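As an illustration of divisive normalization, here is a sketch in the style of the cross-channel local response normalization popularized by AlexNet (the constants `k`, `alpha`, `beta`, and neighbourhood size are the commonly cited defaults, used here as assumptions):

```python
import numpy as np

def local_response_norm(fmaps, k=2.0, alpha=1e-4, beta=0.75, n=2):
    """Divisive normalization across channels: each activation is divided
    by a term summarising the squared activity of neighbouring feature
    maps, so strong responses suppress their neighbours."""
    C = fmaps.shape[0]
    out = np.empty_like(fmaps)
    for c in range(C):
        lo, hi = max(0, c - n), min(C, c + n + 1)
        denom = (k + alpha * np.sum(fmaps[lo:hi] ** 2, axis=0)) ** beta
        out[c] = fmaps[c] / denom
    return out

fmaps = np.ones((5, 4, 4))          # 5 feature maps of size 4x4
normed = local_response_norm(fmaps)  # every response is scaled down
```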

3.4 Pooling

Whether biologically inspired, purely learning-based, or completely manually designed, almost all CNN models include a pooling step. The goal of pooling operations is to provide a degree of invariance to changes in position and size, as well as to aggregate responses within and across feature maps. Similar to the three CNN modules discussed in previous sections, pooling is supported by both biological and theoretical research. The main debate regarding this processing layer in CNN networks is the choice of pooling function. The two most widely used pooling functions are average pooling and max pooling. This section will explore the advantages and disadvantages of various pooling functions described in the relevant literature.


Figure 3.10: Comparison of average pooling and max pooling on images after Gabor filtering. (a) Shows the effect of average pooling at different scales, where the top row in (a) presents results applied to the original grayscale image, and the bottom row in (a) shows results applied to the Gabor-filtered image. Average pooling provides a smoother version of the grayscale image, while the sparse Gabor-filtered image fades. In contrast, (b) presents the effect of max pooling at different scales, where the top row in (b) shows results applied to the original grayscale image, and the bottom row in (b) shows results applied to the Gabor-filtered image. Here, max pooling leads to a decrease in quality for the grayscale image, while the sparse edges in the Gabor-filtered image are enhanced. Image from [131]
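The difference between the two pooling functions compared in Figure 3.10 is easy to see in code (an illustrative numpy sketch with non-overlapping 2x2 windows):

```python
import numpy as np

def pool2d(fmap, size, op):
    """Non-overlapping pooling: aggregate each size x size block with
    either the mean (average pooling) or the maximum (max pooling)."""
    H, W = fmap.shape
    blocks = fmap[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size)
    return op(blocks, axis=(1, 3))

fmap = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, 7.0, 8.0],
                 [9.0, 1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0, 7.0]])

avg = pool2d(fmap, 2, np.mean)   # smooth summary of each 2x2 block
mx = pool2d(fmap, 2, np.max)     # keeps the strongest response per block
```

Average pooling blurs the block into its mean, whereas max pooling preserves the single strongest activation, mirroring the smoothing-versus-sharpening behaviour shown in the figure.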

4 Current Status

The discussion of the roles of the various components in CNN architectures highlights the importance of the convolutional module, which is largely responsible for capturing the most abstract information in the network. It is also the module we understand least, even though it accounts for the heaviest computation. This chapter introduces current trends in attempts to understand what the different layers of a CNN learn, and highlights the issues these trends leave unresolved.

4.1 Current Trends

Despite the continued advancement of various CNN models in achieving state-of-the-art performance in multiple computer vision applications, progress in understanding how these systems work and why they are so effective remains limited. This issue has attracted the interest of many researchers, leading to the emergence of various methods for understanding CNNs. Generally, these methods can be categorized into three directions: visualizing the learned filters and extracted feature maps, conducting ablation studies inspired by biological methods for understanding the visual cortex, and minimizing the learning process by introducing analytical principles into network design. This section will briefly overview each of these methods.

4.2 Unresolved Issues

Based on the above discussion, the following key research directions exist for visualization-based methods:

  • First and foremost, it is crucial to develop methods that make the assessment of visualizations more objective, which can be achieved by introducing metrics to evaluate the quality and/or significance of the generated visualization images.

  • Additionally, although network-centric visualization methods seem more promising (as they do not rely on external data to generate visualization results), it also appears necessary to standardize their evaluation processes. One possible solution is to use a benchmark to generate visualization results for networks trained under the same conditions. Such standardization would also enable metric-based evaluations rather than the current interpretative analyses.

  • Another direction is to visualize multiple units simultaneously, in order to better capture the distributed nature of the representations under investigation, ideally while following a controlled approach.

Here are potential research directions based on ablation studies:

  • Using a common, systematically organized dataset that covers the typical challenges in computer vision (such as viewpoint and lighting variations) and includes more complex categories (such as texture, parts, and object complexity). Indeed, such datasets have recently begun to appear [6]. Performing ablation studies on such datasets, together with an analysis of the resulting confusion matrices, could identify the error patterns of CNN architectures and thereby lead to better understanding.

  • Furthermore, systematic studies on the impact of multiple coordinated ablations on model performance are of great interest. Such studies should extend our understanding of how independent units operate.

Finally, controlled methods offer promising directions for future research, because compared with fully learning-based methods, they give us deeper insight into the operations and representations of these systems. Some intriguing directions include:

  • Gradually fixing network parameters and analyzing their impacts on network behavior. For example, fixing the convolution kernel parameters of one layer at a time (based on existing prior knowledge of the task) to analyze the applicability of the chosen kernels at each layer. This progressive approach is expected to reveal the role of learning and can also serve as an initialization method to minimize training time.

  • Similarly, studying the design of the network architecture itself (such as the number of layers or the number of filters per layer) by analyzing the properties of the input signals (such as common content in the signals). This method helps achieve the appropriate complexity of the architecture for applications.

  • Finally, applying controlled methods to the implementation of networks can simultaneously conduct systematic studies on the roles of other aspects of CNNs, as this area has received less attention due to the focus on learned parameters. For instance, various pooling strategies and residual connections can be studied while fixing most of the learned parameters.

[ Summary ]

