Development of CNN Network Structures

Source: Deep Learning Enthusiasts

This article is about 3000 words long and is recommended to be read in 10 minutes.
This article introduces the basic components of CNN and classic network structures.

The full name of CNN is “Convolutional Neural Network”. A neural network is a mathematical model or computational model that mimics the structure and function of biological neural networks (the central nervous system of animals, especially the brain).

Author: zzq @ Zhihu
Link: https://zhuanlan.zhihu.com/p/68411179

1. Introduction to Basic Components of CNN

1. Local Receptive Field

In images, the connections between local pixels are relatively close, while the connections between distant pixels are relatively weak. Therefore, each neuron does not need to perceive the entire image globally; it only needs to perceive local information, which can then be combined at higher levels to obtain global information. The convolution operation is the realization of the local receptive field, and because convolution operations can share weights, it also reduces the number of parameters.
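As a rough illustration of why weight sharing matters, the following PyTorch sketch (with arbitrary, illustrative layer sizes) compares the parameter count of a small convolutional layer with that of a fully connected layer over the same input:

```python
import torch.nn as nn

# Illustrative sizes only: a 3-channel 16x16 input mapped to 16 output feature maps.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
fc = nn.Linear(3 * 16 * 16, 16 * 16 * 16)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))  # 448: each 3x3x3 kernel is shared across all spatial positions
print(count(fc))    # 3,149,824: every input pixel connects to every output unit
```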

2. Pooling

Pooling reduces the input image size, decreases pixel information, and retains only important information, mainly to reduce computational load. It mainly includes max pooling and average pooling.
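A minimal PyTorch sketch (the feature-map size is an arbitrary assumption) showing how 2×2 pooling halves the spatial resolution:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)           # a dummy 16-channel feature map
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)  # torch.Size([1, 16, 16, 16]) -- keeps the strongest local response
print(avg_pool(x).shape)  # torch.Size([1, 16, 16, 16]) -- keeps the average local response
```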

3. Activation Function

The activation function is used to introduce non-linearity. Common activation functions include sigmoid, tanh, and ReLU; the first two are often used in fully connected layers, while ReLU is common in convolutional layers.

4. Fully Connected Layer

The fully connected layer acts as a classifier in the entire convolutional neural network. The outputs from previous layers need to be flattened before entering the fully connected layer.
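Putting the four components together, here is a minimal PyTorch sketch; the layer sizes and the 10-class output are illustrative assumptions, not taken from the article:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: convolution -> ReLU -> pooling -> flatten -> fully connected."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local receptive field, shared weights
            nn.ReLU(inplace=True),                       # non-linearity
            nn.MaxPool2d(2),                             # downsample, keep salient responses
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # acts as the classifier

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)          # flatten before the fully connected layer
        return self.classifier(x)

logits = TinyCNN()(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```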

2. Classic Network Structures

1. LeNet5

LeNet-5 consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolution kernels are all 5×5 with stride 1, and the pooling layers use max pooling.
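A PyTorch sketch of this structure, following the description above (5×5 kernels, stride 1, max pooling, two fully connected layers); the channel counts (6 and 16), the 32×32 grayscale input, and the 10-class output follow the classic LeNet-5 setup and are assumptions here:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch following the description above: 5x5 convs (stride 1), max pooling, two FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32 -> 28x28
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                            # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5, stride=1),  # 14x14 -> 10x10
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                            # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(inplace=True),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```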


2. AlexNet

The model consists of eight layers (not counting the input layer), including five convolutional layers and three fully connected layers. The last layer uses softmax for classification output.

AlexNet uses ReLU as the activation function; it employs dropout and data augmentation to prevent overfitting; it is trained across two GPUs in parallel; and it uses Local Response Normalization (LRN).
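A compact PyTorch sketch of the eight learned layers described above; the channel sizes (96-256-384-384-256 convolutions, then 4096-4096-1000 fully connected units) follow the original paper, while LRN and the dual-GPU split are omitted for brevity:

```python
import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # dropout against overfitting
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),   # softmax is applied by the loss (e.g. nn.CrossEntropyLoss)
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])
```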


3. VGG

VGG stacks 3×3 convolution kernels to cover the receptive field of a larger kernel while making the network deeper. It has five convolutional segments, each followed by a max pooling layer, and the number of convolution kernels increases gradually from segment to segment.

Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions are also effective but not as good as 3×3.
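A sketch of one way to express a VGG segment in PyTorch (the channel counts follow the usual VGG configuration and are assumptions here):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs=2):
    """One VGG segment: stacked 3x3 convolutions followed by 2x2 max pooling.

    Two stacked 3x3 convs see a 5x5 region, three see a 7x7 region,
    with fewer parameters and more non-linearities than a single large kernel.
    """
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Five segments with gradually increasing channel counts, as described above.
features = nn.Sequential(
    vgg_block(3, 64), vgg_block(64, 128), vgg_block(128, 256),
    vgg_block(256, 512), vgg_block(512, 512),
)
```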


4. GoogLeNet (Inception v1)

From VGG, we learned that deeper networks perform better. However, as the model gets deeper, the number of parameters increases, leading to easier overfitting and requiring more training data. Additionally, complex networks imply more computational load and larger model storage needs, which require more resources and reduce speed. GoogLeNet is designed to reduce parameters.

GoogLeNet instead widens the network: each Inception module applies convolution kernels of several sizes in parallel and lets the network learn which responses to rely on. This design reduces parameters while improving the network's adaptability to multiple scales, and 1×1 convolutions are used to reduce channel dimensionality so that complexity can grow without a large increase in parameters.
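A sketch of an Inception-v1 module in PyTorch; the branch sizes follow GoogLeNet's first Inception module (3a) and are otherwise illustrative:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel branches with different kernel sizes; 1x1 convs reduce channels first."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # Concatenate the branch outputs along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```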


Inception-v2

Building on v1, batch normalization (BN) is added; in the TensorFlow implementation, placing BN before the activation function gives better results. The 5×5 convolution is replaced with two consecutive 3×3 convolutions, making the network deeper while using fewer parameters.

Inception-v3

The core idea is to decompose convolution kernels into smaller convolutions, for example, decomposing 7×7 into 1×7 and 7×1 convolutions to reduce network parameters and increase depth.
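The parameter savings from both factorizations (two 3×3 convs replacing a 5×5 in v2, and 1×7 plus 7×1 replacing a 7×7 in v3) can be checked with a small PyTorch sketch; the channel count is an arbitrary assumption:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
C = 64  # assumed channel count, kept the same for input and output

five = nn.Conv2d(C, C, kernel_size=5, padding=2)
two_threes = nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.Conv2d(C, C, 3, padding=1))

seven = nn.Conv2d(C, C, kernel_size=7, padding=3)
asym = nn.Sequential(nn.Conv2d(C, C, (1, 7), padding=(0, 3)),
                     nn.Conv2d(C, C, (7, 1), padding=(3, 0)))

print(count(five), count(two_threes))  # 102464 vs 73856: two 3x3 convs are cheaper
print(count(seven), count(asym))       # 200768 vs 57472: 1x7 + 7x1 is cheaper still
```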

Inception-v4 Structure

Inception-v4 introduces residual connections (from ResNet) to accelerate training and improve performance. However, when the number of filters is too large (>1000), training becomes very unstable; scaling the residual activations by a small factor alleviates this.

5. Xception

Based on Inception-v3, the basic idea is depthwise separable convolution, but with differences: the model has slightly fewer parameters yet higher accuracy. Xception first performs a 1×1 convolution and then a 3×3 convolution, i.e., it merges channels first and then performs the spatial convolution. Depthwise separable convolution does the opposite: first the spatial 3×3 convolution, then the 1×1 convolution across channels. The core assumption is that the channel-wise convolution should be decoupled from the spatial convolution. MobileNet-v1 uses the depthwise order and adds BN and ReLU. Xception's parameter count is not much different from Inception-v3; it increases network width to improve accuracy, while MobileNet-v1 aims to reduce parameters and improve efficiency.
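A PyTorch sketch of the two orderings (channel counts are illustrative; BN and ReLU are shown only on the MobileNet-style branch, as described above):

```python
import torch.nn as nn

in_ch, out_ch = 64, 128

# Xception-style order: 1x1 pointwise conv first (mix channels), then a 3x3 conv applied
# per channel (groups equal to the channel count).
xception_style = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch),
)

# Depthwise-separable order (as in MobileNet-v1): 3x3 depthwise conv first, then 1x1 pointwise,
# with BN and ReLU after each step.
mobilenet_style = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
    nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
)
```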


6. MobileNet Series

V1

V1 uses depthwise separable convolutions, and it abandons pooling layers in favor of stride-2 convolutions. In a standard convolution, each kernel has as many channels as the input feature map, whereas a depthwise convolution kernel has a single channel. Two hyperparameters can be tuned: the width multiplier α controls the number of input and output channels, and the resolution multiplier ρ controls the input image (feature map) resolution.
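A sketch of a MobileNet-v1 style block in PyTorch, with the width multiplier α applied to the channel counts and stride-2 downsampling in the depthwise convolution (the concrete sizes are assumptions):

```python
import torch.nn as nn

def mobilenet_v1_block(in_ch, out_ch, stride=1, alpha=1.0):
    """Depthwise separable block; alpha (width multiplier) scales the channel counts.

    Downsampling uses stride=2 in the depthwise conv instead of a pooling layer.
    (The resolution multiplier rho simply scales the input image size before the network.)
    """
    in_ch, out_ch = int(in_ch * alpha), int(out_ch * alpha)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),                                         # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

block = mobilenet_v1_block(64, 128, stride=2, alpha=0.5)  # halves channels and resolution
```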


V2

Compared with v1 there are three differences (a sketch of the resulting block follows this list):

  • A residual structure is introduced.
  • Before the depthwise (dw) convolution, a 1×1 convolution expands the feature map channels, which is the opposite of a standard residual block. The features a dw layer can extract are limited by its input channel count; with the traditional compress-first residual block, the dw layer could extract even less, so the block expands first instead of compressing.
  • After the pointwise convolution, ReLU is replaced with a linear activation to prevent ReLU from damaging the features. With the expand-convolve-compress design, the output has already been compressed, and passing it through ReLU would discard part of the features, so a linear output is used.
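A PyTorch sketch of such an inverted residual block (the expansion factor of 6 and ReLU6 follow the usual MobileNet-v2 setup and are assumptions here):

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand with 1x1 conv -> 3x3 depthwise -> 1x1 linear projection (no ReLU), plus shortcut."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),  # expand
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),                               # depthwise
            nn.Conv2d(hidden, out_ch, 1), nn.BatchNorm2d(out_ch),                         # linear bottleneck
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```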


V3

Complementary search techniques are combined: a resource-constrained NAS searches for the network's modules (blocks), while NetAdapt performs layer-wise local search. Structural improvements: the final average pooling layer is moved earlier and the last convolution layer is removed; the h-swish activation function is introduced; and the initial set of filters is modified.

V3 integrates the depthwise separable convolutions of v1, the linear-bottleneck residual structure of v2, and the lightweight SE (squeeze-and-excitation) attention module.
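A PyTorch sketch of the two ingredients mentioned above, h-swish and a squeeze-and-excitation block; the reduction ratio and the plain sigmoid gate are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    """h-swish(x) = x * ReLU6(x + 3) / 6 -- a cheap, piecewise-linear approximation of swish."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

class SEBlock(nn.Module):
    """Lightweight channel attention: squeeze (global average pool), then excite (two FC layers)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=(2, 3))                        # squeeze: per-channel global average
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)    # excite: per-channel weights in [0, 1]
        return x * w                                  # reweight the channels

print(SEBlock(64)(HSwish()(torch.randn(1, 64, 8, 8))).shape)  # torch.Size([1, 64, 8, 8])
```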


7. EffNet

EffNet is an improvement on MobileNet-v1. The main idea is to decompose MobileNet-v1's depthwise (dw) layer into a 1×3 dw layer and a 3×1 dw layer, with pooling applied after the first of the two so that the second layer's computation is reduced. EffNet is smaller and faster than MobileNet-v1 and ShuffleNet-v1.
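A PyTorch sketch of the factorization idea; the direction of the intermediate pooling is an assumption made only to show how the second depthwise convolution runs on a smaller feature map:

```python
import torch.nn as nn

def spatially_factorized_dw(channels):
    """Sketch of the idea above: split a 3x3 depthwise conv into 1x3 and 3x1 depthwise convs,
    with pooling in between so the second conv operates on a smaller feature map."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1), groups=channels),
        nn.MaxPool2d(kernel_size=(2, 1)),   # downsample one spatial dimension first (assumed direction)
        nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0), groups=channels),
    )
```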


8. EfficientNet

EfficientNet studies how to scale a network along depth, width, and input resolution, and how these three dimensions interact; scaling them jointly in a balanced way achieves higher efficiency and accuracy.
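The compound scaling rule from the EfficientNet paper can be sketched as follows; the base depth, width, and resolution values here are purely illustrative:

```python
# Compound scaling: a single coefficient phi scales depth, width, and resolution together.
# alpha, beta, gamma are found by a small grid search subject to alpha * beta**2 * gamma**2 ≈ 2.
alpha, beta, gamma = 1.2, 1.1, 1.15   # the values reported in the EfficientNet paper

def scale(phi, base_depth, base_width, base_resolution):
    """Return (depth, width, resolution) scaled by the compound coefficient phi."""
    return (round(base_depth * alpha ** phi),
            round(base_width * beta ** phi),
            round(base_resolution * gamma ** phi))

print(scale(1, base_depth=18, base_width=16, base_resolution=224))  # illustrative base values
```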


9. ResNet

VGG showed that deeper networks are an effective way to improve accuracy, but deeper networks easily suffer from vanishing gradients and become hard to converge. Tests showed that beyond 20 layers, convergence got worse as more layers were added. By adding shortcut connections, ResNet effectively alleviates (rather than truly solves) the vanishing-gradient problem.
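A PyTorch sketch of a basic residual block (same-channel, stride-1 case):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: output = F(x) + x, so gradients can also flow through the shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut connection
```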


10. ResNeXt

ResNeXt combines ResNet with Inception's split + transform + concatenate idea, and it performs better than ResNet, Inception, and Inception-ResNet. Grouped convolution can be used. In general, there are three ways to increase a network's expressive power (see the sketch after this list):

  • Increase network depth, as from AlexNet to ResNet; but experiments show that the gains from added depth become smaller and smaller.
  • Increase the width of network modules; but widening sharply inflates the parameter count, so it is not a mainstream CNN design.
  • Improve the structural design, as in the Inception series and ResNeXt. Experiments found that increasing the cardinality, i.e., the number of identical branches in a block, is a better way to improve the model's expressive power.
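A PyTorch sketch of a ResNeXt-style bottleneck in which the identical branches are fused into a single grouped convolution; the widths follow the common 32×4d configuration and are otherwise assumptions:

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck whose 3x3 conv is a grouped convolution; `cardinality` is the number of groups."""
    def __init__(self, channels, bottleneck_width=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_width, 1, bias=False),
            nn.BatchNorm2d(bottleneck_width), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_width, bottleneck_width, 3, padding=1,
                      groups=cardinality, bias=False),   # 32 identical branches fused as groups
            nn.BatchNorm2d(bottleneck_width), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))   # residual shortcut, as in ResNet

block = ResNeXtBlock(channels=256)
```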


11. DenseNet

DenseNet significantly reduces the number of parameters through feature reuse and alleviates the gradient vanishing problem to some extent.
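A PyTorch sketch of the concatenation-based feature reuse (the growth rate of 32 is an assumption):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Each layer's output is concatenated to its input, so later layers reuse all earlier features."""
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        self.layer = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth_rate, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return torch.cat([x, self.layer(x)], dim=1)   # feature reuse via concatenation

block = nn.Sequential(DenseLayer(64), DenseLayer(64 + 32), DenseLayer(64 + 64))
print(block(torch.randn(1, 64, 8, 8)).shape)  # torch.Size([1, 160, 8, 8])
```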


12. SqueezeNet

SqueezeNet proposes the Fire module: a squeeze layer followed by an expand layer. The squeeze layer is a 1×1 convolution; the expand layer uses 1×1 and 3×3 convolutions in parallel, whose outputs are concatenated. SqueezeNet has about 1/50 of AlexNet's parameters (about 1/510 after compression), with accuracy comparable to AlexNet.
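A PyTorch sketch of the Fire module; the channel sizes follow SqueezeNet's fire2 module and are otherwise illustrative:

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs and concatenate."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 3, padding=1),
                                       nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

print(FireModule(96, 16, 64)(torch.randn(1, 96, 55, 55)).shape)  # torch.Size([1, 128, 55, 55])
```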


13. ShuffleNet Series

V1

V1 reduces computation through grouped 1×1 (pointwise group) convolutions and enriches each channel's information through channel shuffle. Xception and ResNeXt are inefficient in small network models because their many dense 1×1 convolutions consume considerable resources, so pointwise group convolution is proposed to reduce that cost. However, pointwise group convolution has the side effect of blocking information flow between groups, so channel shuffle is proposed to help information flow. Although depthwise convolution reduces computation and parameter count, it is less efficient than dense operations on low-power devices, so ShuffleNet applies depthwise convolution only on the bottleneck feature maps to minimize this overhead.
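A PyTorch sketch of the channel shuffle operation, which is just a reshape, transpose, reshape over the channel dimension:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups so information can flow between grouped convolutions."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel dimensions
    return x.view(n, c, h, w)                  # flatten back: channels are now interleaved

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```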


V2

Design criteria for making neural networks more efficient:

  • Keeping the number of input channels equal to the number of output channels can minimize memory access costs;
  • Using too many groups in grouped convolutions increases memory access costs;
  • Overly complex network structures (too many branches and basic units) reduce the network’s parallelism;
  • Element-wise operations are also not negligible.


14. SENet

[Figure: SENet network structure]

15. SKNet

[Figure: SKNet network structure]

Editor: Huang Jiyuan

