Overview of 7 Major Innovations in Convolutional Neural Networks (CNN)

Editor’s Note

This review categorizes recent innovations in CNN architectures into seven different categories based on spatial utilization, depth, multi-path, width, feature map utilization, channel enhancement, and attention.

Deep Convolutional Neural Networks (CNNs) are a special type of neural network that has demonstrated state-of-the-art results on various competitive benchmarks. The high performance achieved by deep CNN architectures on challenging benchmark tasks indicates that innovative architectural concepts, as well as parameter optimization, can enhance CNN performance on a variety of vision-related tasks.

Introduction

CNNs first gained attention through LeCun's 1989 work on data with a grid-like topology (images and time-series data). CNNs are regarded as one of the best techniques for understanding image content and show state-of-the-art performance on tasks such as image recognition, segmentation, detection, and retrieval. Their success has attracted attention beyond academia: in industry, companies such as Google, Microsoft, AT&T, NEC, and Facebook have established research teams to explore new CNN architectures, and most of the current front-runners in image-processing competitions use models based on deep CNNs.

Since 2012, many innovations in CNN architectures have been proposed. These innovations can be categorized as parameter optimization, regularization, structural reorganization, and so on. However, it has been observed that the performance gains of CNNs come primarily from the restructuring of processing units and the design of new blocks. Ever since AlexNet demonstrated extraordinary performance on the ImageNet dataset, CNN-based applications have grown increasingly popular. Similarly, Zeiler and Fergus introduced layer-wise visualization of features, which shifted the trend toward extracting features at low spatial resolution in deep architectures, as done in VGG. Today, most new architectures build on the simple principles and homogeneous topology introduced by VGG.

The Google team, for its part, introduced the well-known split-transform-merge concept in the form of the Inception block. The Inception block was the first to use intra-layer branching, allowing features to be extracted at different spatial scales. In 2015, the residual connections introduced by ResNet became famous for making very deep CNNs trainable, and most subsequent networks, such as Inception-ResNet, WideResNet, and ResNeXt, make use of them. Similarly, architectures such as WideResNet, Pyramidal Nets, and Xception introduced multi-level transformations through extra cardinality and increased width. The focus of research has therefore shifted from parameter optimization and connection readjustment to the design of the network architecture itself (layer structure). This has spawned new architectural ideas such as channel enhancement, spatial and channel utilization, and attention-based information processing.

The structure of this article is as follows:

Figure 1: Structure of the Article

Figure 2: Basic layout of a typical pattern recognition (PR) system. The PR system is divided into three phases: Phase 1 is related to data mining, Phase 2 performs preprocessing and feature selection, and Phase 3 covers model selection, parameter tuning, and analysis. CNNs have good feature-extraction capabilities and strong discriminative abilities, so they can be used in the feature extraction/generation and model selection phases of a PR system.

Architectural Innovations in CNNs

Since 1989 there have been many improvements to CNN architectures, all achieved through a combination of depth-related and spatial considerations. Depending on the type of architectural modification, CNNs can be broadly divided into seven categories: spatial utilization, depth, multi-path, width, channel enhancement, feature map utilization, and attention-based CNNs.

The classification of deep CNN architectures is shown in Figure 3.

Figure 3: Classification of Deep CNN Architectures

Spatial Utilization-Based CNNs

CNNs have a large number of parameters and hyperparameters, such as the number of processing units (neurons), the number of layers, filter sizes, strides, learning rates, and activation functions. Because CNNs consider the neighborhood (locality) of input pixels, filters of different sizes can be used to explore different levels of correlation. Accordingly, in the early 2000s researchers exploited spatial transformations to boost performance and evaluated the effect of different filter sizes on network learning. Filters of different sizes encapsulate different levels of granularity: in general, small filters extract fine-grained information, while large filters extract coarse-grained information. By adjusting filter sizes, a CNN can therefore perform well on both coarse and fine details.
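To make the filter-size intuition concrete, here is a minimal PyTorch-style sketch, my own illustration rather than code from the survey, of an Inception-style block that runs small and large filters in parallel and concatenates their outputs:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Extracts features at several spatial scales in parallel:
    small kernels capture fine detail, large kernels capture
    coarser context (Inception-style intra-layer branching)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # "same" padding keeps the spatial size identical across branches
        self.fine = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.medium = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.coarse = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # concatenate along the channel axis: fine + medium + coarse cues
        return torch.cat([self.fine(x), self.medium(x), self.coarse(x)], dim=1)

x = torch.randn(1, 3, 32, 32)      # a dummy RGB image
y = MultiScaleBlock(3, 16)(x)
print(y.shape)                     # torch.Size([1, 48, 32, 32])
```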
Depth-Based CNNs

Deep CNN architectures rest on the hypothesis that, as depth increases, the network can better approximate the target function through a larger number of nonlinear mappings and richer feature representations. Network depth has played an important role in the success of supervised learning. Theoretical studies show that deep networks can represent certain classes of functions exponentially more efficiently than shallow networks. In 2001, Csáji presented the universal approximation theorem, which states that a single hidden layer suffices to approximate any function, but at the cost of an exponential number of neurons, which is generally computationally infeasible. In this regard, Bengio and Delalleau argued that deeper networks can maintain the expressive power of a network at lower cost. In 2013, Bengio et al. empirically demonstrated that deep networks are more effective, both computationally and statistically, on complex tasks. Inception and VGG, the best performers in the 2014 ILSVRC competition, further confirmed that depth is an important dimension for regulating a network's learning capacity.

Once a feature has been extracted, its exact location becomes less important as long as its approximate position relative to other features is preserved. Pooling, or down-sampling, is (like convolution) an interesting local operation: it summarizes similar information in the neighborhood of a receptive field and outputs the dominant response within that local region. As a consequence of the convolution operation, feature patterns may appear at different locations in the image. (A short pooling example appears after the next subsection.)

Multi-Path-Based CNNs

Training deep networks is quite challenging, and this has been a theme of much recent research on deep networks. Deep CNNs are computationally and statistically efficient on complex tasks, but deeper networks may suffer from performance degradation and vanishing or exploding gradients, problems usually caused by the increased depth rather than by overfitting. Notably, the vanishing-gradient problem raises not only the test error but also the training error. To train deeper networks, the concept of multi-path or cross-layer connections was proposed. Multi-path, or shortcut, connections systematically connect one layer to another while skipping some intermediate layers, allowing specific information to flow across layers; such cross-layer connections partition the network into several blocks. These paths also mitigate the vanishing-gradient problem by giving lower layers direct access to the gradient. Different types of shortcut connections are used for this purpose, such as zero-padded, projection-based, dropout, and 1×1 connections.
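First, the pooling example promised above: a tiny PyTorch demonstration, my own rather than the survey's, of how 2×2 max pooling keeps only the dominant response in each local region while halving spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 1., 2., 1.],
                    [1., 0., 1., 3.]]]])   # one 4x4 feature map

# 2x2 max pooling summarizes each local neighborhood by its
# dominant response, adding tolerance to small translations
print(nn.MaxPool2d(kernel_size=2)(x))
# tensor([[[[4., 8.],
#           [1., 3.]]]])
```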
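And for the shortcut connections just described, a minimal PyTorch-style sketch of an identity shortcut in the spirit of ResNet; the block layout here is my illustrative assumption, not code from the survey:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """y = f(x) + x: the shortcut skips the intermediate layers,
    letting gradients flow directly to earlier layers and easing
    the vanishing-gradient problem."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))   # transformation path
        out = self.conv2(out)
        return F.relu(out + x)        # identity shortcut skips both convs

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)     # torch.Size([1, 16, 32, 32])
```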
The activation function serves as a decision function and helps in learning complex patterns. Choosing an appropriate activation function can accelerate the learning process. The activation of a convolutional feature map is defined as in equation (3):

T_l^k = f_A(F_l^k)    (3)

where F_l^k is the convolved feature map produced by the k-th kernel of layer l and f_A is the activation function, applied to it element-wise.

Width-Based Multi-Connection CNNs

From 2012 to 2015, the focus of network design was on the power of depth and the importance of multi-path connections in regularizing deep networks. However, the width of a network is as important as its depth. By using multiple processing units in parallel within a layer, the multilayer perceptron gained, over the single perceptron, the advantage of mapping complex functions. This suggests that width, like depth, is an important parameter in defining learning principles. Lu et al. and Hanin & Sellke recently showed that neural networks with ReLU activation functions must be sufficiently wide to retain the universal-approximation property as depth increases. Moreover, if the maximum width of a network does not exceed its input dimension, the class of continuous functions on compact sets cannot be arbitrarily well approximated, no matter how deep the network is. Simply stacking more layers may therefore not increase a network's representational capacity. A related issue with deep architectures is that some layers or processing units may fail to learn useful features. To address this, the focus of research shifted from deep, narrow architectures to shallower, wider ones.

Feature Map (Channel Feature Map) Based CNNs

CNNs are known in machine vision (MV) tasks for their hierarchical learning and automatic feature-extraction capabilities. Feature selection plays a crucial role in determining the performance of classification, segmentation, and detection modules. In traditional feature-extraction techniques, the performance of the classification module is limited by its reliance on a single type of feature. Compared with such techniques, CNNs extract features in multiple stages, learning different types of features (called feature maps in CNNs) for the inputs they are given. However, some feature maps carry little or no discriminative information about the target, and a large feature set can introduce noise and push the network toward overfitting. This suggests that, beyond network engineering, the selection of class-specific feature maps is crucial for improving a network's generalization. In this section, the terms feature map and channel are used interchangeably, since many researchers use channel in place of feature map.
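One concrete way to realize feature-map selection is squeeze-and-excitation-style gating, which learns a weight for every feature map and damps the less discriminative ones. The following is a minimal, hedged PyTorch-style sketch of that idea; the class name and reduction ratio are my illustrative choices, not code from any particular paper:

```python
import torch
import torch.nn as nn

class FeatureMapGate(nn.Module):
    """Learns one weight per feature map (channel) and rescales the
    maps, so weakly discriminative maps can be suppressed."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # global context per map
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))  # one weight per map
        return x * w.view(b, c, 1, 1)                # rescale each map

x = torch.randn(2, 32, 8, 8)
print(FeatureMapGate(32)(x).shape)                   # torch.Size([2, 32, 8, 8])
```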
Channel (Input Channel) Utilization-Based CNNs

Image representation plays a significant role in determining the performance of an image-processing algorithm: a good representation can define the salient features of an image with a compact code. In various studies, different kinds of traditional filters have been used to extract different levels of information from a single type of image, and these different representations are then used as model inputs to improve performance. CNNs are excellent feature learners that can automatically extract discriminative features for a given problem; however, their learning depends on the input representation. If the input lacks diversity and class-defining information, the performance of the CNN as a discriminator suffers. To address this, the concept of auxiliary learners has been introduced into CNNs to boost the input representation of the network. (A minimal sketch of this channel-boosting idea appears after the next subsection.)

Attention-Based CNNs

Different levels of abstraction play an important role in defining the discriminative power of a neural network, and selecting contextually relevant features is important for image localization and recognition. In the human visual system this phenomenon is called attention: humans view a scene in a sequence of quick glances and attend to its contextually relevant parts. In this process, humans not only focus on the selected regions but also infer different interpretations of the objects at those locations, which helps them capture the visual structure better. Similar interpretative capabilities have been added to neural networks such as RNNs and LSTMs, which use attention modules to generate sequential data and weight new samples according to their occurrence in previous iterations. Various researchers have incorporated the notion of attention into CNNs to improve representations and overcome the computational limits imposed by data. Attention also makes CNNs smarter, enabling them to recognize objects even in cluttered backgrounds and complex scenes.
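Picking up the channel-boosting idea from the channel-utilization subsection above: a minimal PyTorch-style sketch of concatenating auxiliary channels onto the input. The toy convolutional auxiliary learner here is a deliberate stand-in for whatever auxiliary model (e.g., a pretrained encoder) one would actually use:

```python
import torch
import torch.nn as nn

class ChannelBoostedInput(nn.Module):
    """Boosts the input representation by concatenating auxiliary
    channels, produced here by a small stand-in auxiliary learner,
    with the original input channels before the main CNN sees them."""
    def __init__(self, in_ch=3, aux_ch=8):
        super().__init__()
        # stand-in auxiliary learner; in practice this could be any
        # auxiliary model or transformation of the input
        self.aux = nn.Conv2d(in_ch, aux_ch, kernel_size=3, padding=1)
        self.main = nn.Conv2d(in_ch + aux_ch, 16, kernel_size=3, padding=1)

    def forward(self, x):
        boosted = torch.cat([x, self.aux(x)], dim=1)  # original + auxiliary channels
        return self.main(boosted)

x = torch.randn(1, 3, 32, 32)
print(ChannelBoostedInput()(x).shape)                 # torch.Size([1, 16, 32, 32])
```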
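And for attention, a minimal PyTorch-style sketch of spatial attention in the spirit of CBAM-like designs; the 7×7 kernel and the average/max pooling over channels are illustrative assumptions, not a prescription from the survey:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Weights each spatial location so the network can focus on
    contextually relevant regions of a cluttered scene."""
    def __init__(self):
        super().__init__()
        # 2 input channels: per-pixel average and max over feature maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # average over channels
        mx, _ = x.max(dim=1, keepdim=True)       # max over channels
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                          # emphasize attended locations

x = torch.randn(1, 16, 32, 32)
print(SpatialAttention()(x).shape)               # torch.Size([1, 16, 32, 32])
```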

Paper: A Survey of the Recent Architectures of Deep Convolutional Neural Networks

Paper Link: https://arxiv.org/abs/1901.06032

Abstract: Deep Convolutional Neural Networks (CNNs) are a special type of neural network that has demonstrated state-of-the-art results on various competitive benchmarks. The superb learning ability of deep CNNs is achieved primarily through multiple nonlinear feature-extraction stages that automatically learn hierarchical representations from data. The availability of large amounts of data and improvements in hardware processing units have accelerated CNN research, and very interesting deep CNN architectures have been reported recently. The high performance achieved by deep CNN architectures on challenging benchmarks indicates that innovative architectural concepts, as well as parameter optimization, can improve CNN performance on a variety of vision-related tasks. In this light, various ideas in CNN design have been explored, such as the use of different activation and loss functions, parameter optimization, regularization, and the restructuring of processing units. However, the main improvement in representational capacity has come from restructuring the processing units; in particular, the idea of using a block rather than a layer as the structural unit has received wide appreciation. This review categorizes recent innovations in CNN architectures into seven different categories, based on spatial utilization, depth, multi-path, width, feature map utilization, channel enhancement, and attention. Furthermore, it covers the basic components of CNNs and discusses the challenges currently faced by CNNs and their applications.

Editor / Fan Ruiqiang

Reviewer / Fan Ruiqiang

Recheck / Fan Ruiqiang
