Development of CNN Network Structures: A Comprehensive Overview

Introduction to Basic Components of CNN

1. Local Receptive Field

In an image, nearby pixels are strongly correlated, while pixels that are far apart are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to perceive a local region, and the local information can be combined at higher layers to obtain global information. The convolution operation implements the local receptive field, and because convolutions share weights, they also reduce the number of parameters.
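
As a rough illustration of why weight sharing matters (the layer sizes here are arbitrary), a 3×3 convolution over a 32×32 RGB image needs far fewer parameters than a fully connected layer over the same input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                         # one RGB image, 32x32

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)     # local receptive field, shared weights
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)             # every output connected to every input

print(conv(x).shape)                                  # torch.Size([1, 16, 32, 32])
print(sum(p.numel() for p in conv.parameters()))      # 3*3*3*16 + 16 = 448
print(sum(p.numel() for p in fc.parameters()))        # ~50 million
```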

2. Pooling

Pooling downsamples the input feature map, retaining the important information while discarding fine pixel-level detail, mainly to reduce the computational load. The two main types are max pooling and average pooling.
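
For example, 2×2 max pooling and average pooling each halve the spatial size of a feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)                     # a feature map with 16 channels

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)   # keeps the strongest response in each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)   # keeps the mean of each 2x2 window

print(max_pool(x).shape)   # torch.Size([1, 16, 16, 16])
print(avg_pool(x).shape)   # torch.Size([1, 16, 16, 16])
```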

3. Activation Function

The activation function is used to introduce non-linearity. Common activation functions include sigmoid, tanh, and ReLU, with the first two often used in fully connected layers and ReLU commonly found in convolutional layers.
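
For example, the three activations applied to a small tensor:

```python
import torch

x = torch.linspace(-3, 3, 7)

print(torch.sigmoid(x))   # squashes values into (0, 1)
print(torch.tanh(x))      # squashes values into (-1, 1)
print(torch.relu(x))      # zeroes out negative values, keeps positives
```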

4. Fully Connected Layer

The fully connected layer acts as a classifier in the entire convolutional neural network. Prior to the fully connected layer, the previous outputs need to be flattened.
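
A minimal sketch of the flatten-then-classify step (the channel count, spatial size, and number of classes are arbitrary examples):

```python
import torch
import torch.nn as nn

features = torch.randn(8, 64, 7, 7)       # batch of 8 feature maps from the last conv/pool layer

flatten = nn.Flatten()                     # (8, 64, 7, 7) -> (8, 64*7*7)
classifier = nn.Linear(64 * 7 * 7, 10)     # fully connected layer acting as a 10-class classifier

logits = classifier(flatten(features))
print(logits.shape)                        # torch.Size([8, 10])
```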

Classic Network Structures

1. LeNet5

LeNet-5 is composed of two convolutional layers, two pooling layers, and two fully connected layers. All convolutional kernels are 5×5 with stride 1, and the pooling layers use max pooling.
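
A minimal sketch following that description (two 5×5 convolutions, two pooling layers, two fully connected layers); the channel counts and the 32×32 input are assumptions, not necessarily the exact original configuration:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.ReLU(),    # 32x32 -> 28x28
            nn.MaxPool2d(2),                                        # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.ReLU(),   # 14x14 -> 10x10
            nn.MaxPool2d(2),                                        # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```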

2. AlexNet

The model consists of eight layers (excluding the input layer): five convolutional layers and three fully connected layers, with the last layer using softmax for the classification output. AlexNet uses ReLU as the activation function; to prevent overfitting it employs dropout and data augmentation; it was trained across two GPUs; and it uses local response normalization (LRN).
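
A rough sketch of those ingredients in PyTorch (the layer sizes are illustrative, not the full AlexNet configuration; the LRN hyperparameters follow the commonly cited values):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),
    nn.ReLU(inplace=True),                                        # ReLU instead of sigmoid/tanh
    nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),   # LRN
    nn.MaxPool2d(kernel_size=3, stride=2),
)

classifier_head = nn.Sequential(
    nn.Dropout(p=0.5),                                            # dropout against overfitting
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 1000),                                        # softmax is applied in the loss
)
```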

3. VGG

VGG stacks 3×3 convolutional kernels to obtain the receptive field of a larger kernel, and the network is deeper. It has five stages of convolutions, each followed by a max pooling layer, and the number of convolutional kernels gradually increases from stage to stage.
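
For instance, two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution while using fewer parameters; a small sketch (the channel count of 64 is an arbitrary example):

```python
import torch.nn as nn

c = 64
single_5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2)
stacked_3x3 = nn.Sequential(                 # same 5x5 receptive field, plus an extra non-linearity
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(single_5x5))    # 64*64*5*5 + 64       = 102464
print(params(stacked_3x3))   # 2*(64*64*3*3 + 64)   = 73856
```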

Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions are effective but not as good as 3×3.

4. GoogLeNet (Inception v1)

From VGG, we learned that deeper networks yield better results. However, as models become deeper, the number of parameters increases, leading to a higher risk of overfitting, which requires more training data. Additionally, complex networks mean more computational load and larger model storage, requiring more resources and slower speeds. GoogLeNet was designed to reduce parameters.

GoogLeNet increases network complexity by widening the network instead of only deepening it, letting the network choose among convolutional kernels of different sizes through parallel branches. This design reduces parameters while making the network more adaptive to multiple scales. It uses 1×1 convolutions to add depth and non-linearity (and to reduce channel dimensions) without a large increase in parameters.
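
A minimal sketch of an Inception-style block in that spirit: parallel branches with different kernel sizes, 1×1 convolutions to reduce channels before the expensive branches, and concatenation of the branch outputs (the branch widths are arbitrary):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, 1)                                   # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 3, padding=1))            # 1x1 reduce, then 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))            # 1x1 reduce, then 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))                    # pool, then 1x1

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

print(InceptionBlock(64)(torch.randn(1, 64, 28, 28)).shape)   # torch.Size([1, 128, 28, 28])
```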

Inception-v2

Built on v1, it adds batch normalization; in practice (e.g. in TensorFlow), applying BN before the activation function gives better results. It also replaces each 5×5 convolution with two consecutive 3×3 convolutions, which deepens the network while reducing parameters.
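
Both ideas in a small sketch: a conv-BN-activation ordering and the replacement of a 5×5 convolution by two 3×3 convolutions (channel counts arbitrary):

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, k):
    # BN placed after the convolution and before the activation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# a 5x5 convolution replaced by two consecutive 3x3 convolutions
factorized_5x5 = nn.Sequential(
    conv_bn_relu(64, 64, 3),
    conv_bn_relu(64, 64, 3),
)
```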

Inception-v3

The core idea is to decompose convolutional kernels into smaller convolutions, such as decomposing 7×7 into 1×7 and 7×1 convolutions, reducing network parameters while deepening the structure.
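
As a sketch of that factorization (channel count and feature-map size are arbitrary):

```python
import torch
import torch.nn as nn

c = 64
factorized_7x7 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=(7, 1), padding=(3, 0)), nn.ReLU(inplace=True),
)

x = torch.randn(1, c, 17, 17)
print(factorized_7x7(x).shape)   # torch.Size([1, 64, 17, 17])
# parameters: 2 * (64*64*7 + 64) vs. 64*64*7*7 + 64 for a full 7x7 convolution
```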

Inception-v4 structure

Introduces residual connections (from ResNet) to accelerate training and improve performance. However, when the number of filters exceeds 1000, training becomes unstable; scaling the residual activations by a small factor alleviates this.
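
The activation-scaling idea as a sketch: the output of the residual branch is multiplied by a small constant before being added to the shortcut (0.1 is an assumed value within the range suggested by the paper):

```python
import torch.nn as nn

class ScaledResidual(nn.Module):
    def __init__(self, branch: nn.Module, scale: float = 0.1):
        super().__init__()
        self.branch = branch      # e.g. an Inception block
        self.scale = scale        # small factor to stabilize training with many filters

    def forward(self, x):
        return x + self.scale * self.branch(x)

block = ScaledResidual(nn.Conv2d(64, 64, 3, padding=1))   # toy residual branch for illustration
```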

5. Xception

Proposed on the basis of Inception-v3, its basic idea is depthwise separable convolution, but with a difference in ordering. The model slightly reduces parameters while achieving higher accuracy. Xception performs the 1×1 convolution first and the 3×3 convolution second, i.e. it merges channels before doing the spatial convolution. Depthwise separable convolution does the opposite: the per-channel spatial 3×3 convolution comes first, followed by the 1×1 convolution across channels. The core idea follows the assumption that, during convolution, channel-wise correlations can be handled separately from spatial correlations. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU. Xception's parameter count is not much different from Inception-v3's; it spends its capacity on increasing network width to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
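
The two orderings as a sketch (channel counts arbitrary): the Xception-style block applies the 1×1 channel convolution first and the per-channel 3×3 spatial convolution second, while the usual depthwise separable block does the reverse:

```python
import torch.nn as nn

in_ch, out_ch = 64, 128

# Xception-style order: 1x1 across channels first, then per-channel 3x3 spatial conv
xception_style = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch),
)

# usual depthwise separable order (as in MobileNet-v1): per-channel 3x3 first, then 1x1
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)
```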

6. MobileNet Series

V1

Uses depthwise separable convolutions and abandons pooling layers, using stride-2 convolutions instead. A standard convolution kernel has as many channels as the input feature map, whereas each depthwise kernel has a single channel. Two hyperparameters control the model size: the width multiplier α scales the number of input and output channels, and the resolution multiplier ρ scales the input image (feature-map) resolution.
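
A sketch of one depthwise separable unit with the two multipliers applied (written here as alpha and rho; the base channel counts and input size are arbitrary):

```python
import torch
import torch.nn as nn

def ds_conv(in_ch, out_ch, stride=1):
    # depthwise 3x3 (one filter per channel) followed by pointwise 1x1
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

alpha, rho = 0.5, 0.5                        # width and resolution multipliers
in_ch, out_ch = int(64 * alpha), int(128 * alpha)
size = int(224 * rho)

block = ds_conv(in_ch, out_ch, stride=2)     # stride-2 conv instead of pooling
print(block(torch.randn(1, in_ch, size, size)).shape)   # torch.Size([1, 64, 56, 56])
```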

V2

Compared with v1 there are three differences: 1. it introduces residual structures; 2. before the depthwise (dw) convolution it performs a 1×1 convolution to increase the number of feature-map channels, which is the opposite of a typical residual block; 3. after the pointwise convolution it discards ReLU in favour of a linear activation, to prevent ReLU from destroying the features. The reasoning is that the features a dw layer can extract are limited by the number of input channels; if a traditional residual block were used, which compresses the channels first, the dw layer would have even fewer features to extract, so the block expands first instead of compressing. With this expand-convolve-compress design, a problem appears after the compression step: ReLU damages the already-compressed features and causes additional information loss, so a linear activation is used instead.
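
A sketch of the expand-depthwise-compress block with a linear bottleneck (the expansion factor of 6 follows the commonly cited setting; other numbers are arbitrary):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, ch, expand=6):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 1, bias=False),                                 # 1x1 expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),   # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, ch, 1, bias=False),                                 # 1x1 compress
            nn.BatchNorm2d(ch),                                                   # linear: no ReLU here
        )

    def forward(self, x):
        return x + self.block(x)    # residual connection (stride 1, equal channels)

print(InvertedResidual(32)(torch.randn(1, 32, 56, 56)).shape)   # torch.Size([1, 32, 56, 56])
```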

V3

Combines complementary search techniques: resource-constrained NAS performs the module-level (block-wise) search, while NetAdapt performs the local, layer-level search. Network-structure improvements: the final average pooling layer is moved forward and the last convolutional layer is removed, the h-swish activation function is introduced, and the initial set of filters is modified.

V3 integrates the depthwise separable convolutions of v1, the linear-bottleneck inverted residual structure of v2, and the lightweight attention module of SE (squeeze-and-excitation).
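
The h-swish activation mentioned above is x · ReLU6(x + 3) / 6; a sketch, alongside PyTorch's built-in equivalent:

```python
import torch
import torch.nn as nn

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a cheap approximation of swish
    return x * nn.functional.relu6(x + 3.0) / 6.0

x = torch.linspace(-4, 4, 9)
print(h_swish(x))
print(nn.Hardswish()(x))   # built-in module implementing the same function
```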

7. EffNet

EffNet is an improvement over MobileNet-v1. Its main idea is to decompose the depthwise 3×3 layer of MobileNet-v1 into separate 1×3 and 3×1 depthwise layers, so that pooling can be applied after the first of them to reduce the computation of the second. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.
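
A simplified sketch of that factorization: the depthwise 3×3 becomes a 1×3 and a 3×1 depthwise convolution with a one-dimensional pooling step in between (channel counts arbitrary; this is not the complete EffNet block):

```python
import torch
import torch.nn as nn

ch = 64
effnet_style_dw = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1), groups=ch),   # 1x3 depthwise
    nn.MaxPool2d(kernel_size=(2, 1)),                                   # pool after the first dw layer
    nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0), groups=ch),   # 3x1 depthwise
)

print(effnet_style_dw(torch.randn(1, ch, 32, 32)).shape)   # torch.Size([1, 64, 16, 32])
```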

8. EfficientNet

Studies how to scale a network in depth, width, and input resolution, and how the three dimensions interact, in order to achieve higher efficiency and accuracy.
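
The compound-scaling rule scales depth, width, and resolution together from a single coefficient φ, roughly d = α^φ, w = β^φ, r = γ^φ with α·β²·γ² ≈ 2; a small sketch using the base coefficients reported in the paper:

```python
# compound scaling: one coefficient phi scales depth, width, and resolution together
alpha, beta, gamma = 1.2, 1.1, 1.15     # base coefficients reported for EfficientNet

def compound_scale(phi):
    depth_mult = alpha ** phi           # more layers
    width_mult = beta ** phi            # more channels
    res_mult = gamma ** phi             # larger input images
    return depth_mult, width_mult, res_mult

for phi in range(4):                    # roughly corresponds to the B0..B3 variants
    print(phi, [round(m, 2) for m in compound_scale(phi)])
```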

9. ResNet

VGG showed that increasing network depth is an effective way to improve accuracy, but deeper networks are prone to vanishing gradients, which hurts convergence. Tests show that beyond roughly 20 layers, convergence gets worse as depth increases. ResNet effectively addresses the vanishing-gradient problem (more precisely, it alleviates rather than completely solves it) by adding shortcut connections.
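
A minimal sketch of a basic residual block with its shortcut connection (identity shortcut, no downsampling; channel count arbitrary):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        # the shortcut lets gradients flow directly to earlier layers
        return torch.relu(x + self.body(x))

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 64, 56, 56])
```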

10. ResNeXt

Based on ResNet and Inception, it combines the split-transform-merge strategy with residual connections, and it outperforms ResNet, Inception, and Inception-ResNet. Grouped convolution can be used to implement it. In general, there are three ways to increase the expressive power of a network: 1. increase its depth, as in the progression from AlexNet to ResNet, although experiments show that the gains from depth diminish; 2. increase the width of its modules, which leads to a sharp growth in parameters and is not the mainstream direction of CNN design; 3. improve the structural design itself, as in the Inception series and ResNeXt. Experiments found that increasing the cardinality, i.e. the number of identical branches in a block, is a more effective way to improve model expressiveness.
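
A sketch of a ResNeXt-style bottleneck in which the cardinality appears as the `groups` argument of the 3×3 convolution (the 32×4d setting is used as the example):

```python
import torch
import torch.nn as nn

class ResNeXtBottleneck(nn.Module):
    def __init__(self, in_ch=256, cardinality=32, width=4):
        super().__init__()
        inner = cardinality * width        # 32 branches of width 4 -> 128 channels
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, inner, 1, bias=False), nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, inner, 3, padding=1, groups=cardinality, bias=False),  # 32 parallel paths
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))

print(ResNeXtBottleneck()(torch.randn(1, 256, 14, 14)).shape)   # torch.Size([1, 256, 14, 14])
```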

11. DenseNet

DenseNet significantly reduces the number of parameters through feature reuse and alleviates the gradient vanishing problem to some extent.
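
A sketch of the feature-reuse idea behind a dense block, where each layer receives the concatenation of all earlier feature maps (the growth rate and layer count here are arbitrary examples):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False),
            )
            for i in range(n_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))   # reuse all earlier features
        return torch.cat(features, dim=1)

print(DenseBlock(16)(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```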

12. SqueezeNet

Introduced the fire module: a squeeze layer followed by an expand layer. The squeeze layer consists of 1×1 convolutions, while the expand layer uses both 1×1 and 3×3 convolutions whose outputs are concatenated. SqueezeNet has 1/50 the parameters of AlexNet, and 1/510 after compression, yet its accuracy is comparable to AlexNet's.
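
A sketch of the fire module as described (the squeeze/expand channel counts follow a typical configuration):

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)   # concatenate expand branches

print(FireModule(96)(torch.randn(1, 96, 55, 55)).shape)   # torch.Size([1, 128, 55, 55])
```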

13. ShuffleNet Series

V1

Reduces computational load through grouped 1×1 pointwise convolutions and enriches the information exchanged across channels through channel shuffle (channel reorganization). Xception and ResNeXt are less efficient in small network models because their many dense 1×1 convolutions are resource-intensive; pointwise group convolutions are therefore proposed to reduce the computational complexity. However, pointwise group convolutions have the side effect of blocking information flow between groups, which leads to the introduction of channel shuffle to help information flow. In addition, although depthwise convolution reduces computation and parameter counts, it is less efficient than dense operations on low-power devices, so ShuffleNet applies depthwise convolution only to the bottleneck feature maps to minimize the overhead.
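
The channel shuffle operation itself is just a reshape, transpose, and reshape; a sketch:

```python
import torch

def channel_shuffle(x, groups):
    # interleave channels so information can flow between the groups of the next grouped conv
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap the group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())   # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```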

V2

Design principles for more efficient CNN network structures:

Keeping the number of input and output channels equal minimizes memory access costs.

Using too many groups in grouped convolutions increases memory access costs.

A complex network structure (too many branches and basic units) reduces the network’s parallelism.

The cost of element-wise operations should not be overlooked.

14. SENet

15. SKNet

-End-

Author: zzq

Source: https://zhuanlan.zhihu.com/p/68411179
