Reprinted from | Jishi Platform
Author | zzq
Source | https://zhuanlan.zhihu.com/p/68411179
1. 『Introduction to Basic Components of CNN』
1. Local Receptive Field
In images, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the whole image; it only needs to perceive a local region, and the local information can be integrated at higher layers to obtain a global view. The convolution operation implements the local receptive field, and its weight sharing also reduces the number of parameters.

2. Pooling
Pooling reduces the spatial size of the feature maps, discarding pixel-level detail and retaining only the important information, mainly to reduce the computational load. The two main variants are max pooling and average pooling.

3. Activation Function
The activation function introduces non-linearity. Common choices are sigmoid, tanh, and ReLU; the first two are typically used in fully connected layers, while ReLU is the usual choice in convolutional layers.

4. Fully Connected Layer
The fully connected layer acts as the classifier of the convolutional network. The outputs of the preceding layers must be flattened before entering the fully connected layer.
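To make the four components concrete, here is a minimal sketch, assuming PyTorch; the model name (TinyCNN) and the layer sizes (3-channel 32×32 input, 16 filters, 10 classes) are illustrative only.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # local receptive field with shared weights
        self.relu = nn.ReLU()                                    # non-linearity
        self.pool = nn.MaxPool2d(2)                              # keeps the strongest responses, halves resolution
        self.fc = nn.Linear(16 * 16 * 16, num_classes)           # classifier on the flattened feature map

    def forward(self, x):                          # x: (N, 3, 32, 32)
        x = self.pool(self.relu(self.conv(x)))     # -> (N, 16, 16, 16)
        x = torch.flatten(x, 1)                    # flatten before the fully connected layer
        return self.fc(x)                          # -> (N, num_classes)

logits = TinyCNN()(torch.randn(2, 3, 32, 32))      # torch.Size([2, 10])
```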
2. 『Classic Network Structures』
1. LeNet5
Consists of two convolutional layers, two pooling layers, and two fully connected layers. All convolution kernels are 5×5 with stride = 1, and the pooling layers use max pooling.

2. AlexNet
The model has eight layers (excluding the input layer): five convolutional layers and three fully connected layers, with softmax at the last layer for classification. AlexNet uses ReLU as the activation function, applies dropout and data augmentation to prevent overfitting, trains on two GPUs, and uses local response normalization (LRN).

3. VGG
Stacks 3×3 convolution kernels to emulate a larger receptive field and builds a deeper network. VGG has five convolutional stages, each followed by a max pooling layer, and the number of convolution kernels increases from stage to stage.
Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions are also effective, but not as good as 3×3.

4. GoogLeNet (Inception v1)
VGG taught us that deeper networks give better results. However, as a model gets deeper its parameter count grows, the network becomes more prone to overfitting and needs more training data; in addition, a more complex network means more computation and a larger model, requiring more resources and running more slowly. GoogLeNet was designed from the perspective of reducing parameters.
GoogLeNet increases network complexity by widening the network, letting the network choose among convolution kernels on its own. This design reduces parameters while improving the network's adaptability to multiple scales. The 1×1 convolutions allow network complexity to increase without adding parameters.

Inception-v2
Builds on v1 by introducing batch normalization (BN); in TensorFlow, applying BN before the activation function gives better results. It also replaces 5×5 convolutions with two consecutive 3×3 convolutions, making the network deeper with fewer parameters.

Inception-v3
The core idea is to factorize convolution kernels into smaller convolutions, e.g. decomposing a 7×7 convolution into 1×7 and 7×1 convolutions, which reduces parameters and increases depth.

Inception-v4
Introduces residual connections (ResNet) to accelerate training and improve performance. However, when the number of filters is too large (>1000), training becomes unstable; adding an activation scaling factor alleviates this.

5. Xception
Proposed on top of Inception-v3; the basic idea is depthwise separable convolution, with some differences. The model has slightly fewer parameters but higher accuracy. Xception applies the 1×1 convolution first and then the 3×3 convolution, merging channels before the spatial convolution; depthwise separable convolution does the opposite, applying the spatial 3×3 convolution first and then the 1×1 channel convolution. The core idea follows one assumption: during convolution, the cross-channel convolution should be separated from the spatial convolution. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU. Xception's parameter count is not much different from Inception-v3's; it increases network width and aims to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
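The two orderings contrasted above are easy to see in code. Below is a minimal sketch, assuming PyTorch; the channel counts and input size are illustrative only, and BN/ReLU placement is shown only for the MobileNet-style branch.

```python
import torch
import torch.nn as nn

in_ch, out_ch = 32, 64  # illustrative channel counts

# MobileNet-v1 order: spatial 3x3 depthwise convolution first, then 1x1 pointwise, with BN + ReLU.
mobilenet_style = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),     # depthwise: one filter per channel
    nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
    nn.Conv2d(in_ch, out_ch, 1, bias=False),                             # pointwise: mixes channels
    nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
)

# Xception order: 1x1 pointwise convolution first (merge channels), then the 3x3 spatial depthwise convolution.
xception_style = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, 1, bias=False),                             # pointwise first
    nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch, bias=False),  # then depthwise
)

x = torch.randn(1, in_ch, 56, 56)
print(mobilenet_style(x).shape, xception_style(x).shape)  # both: torch.Size([1, 64, 56, 56])
```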
6. MobileNet Series
V1
Uses depthwise separable convolutions and abandons pooling layers in favor of stride-2 convolutions. In a standard convolution each kernel has as many channels as the input feature map, whereas a depthwise convolution kernel has a single channel. Two hyperparameters control the model: the width multiplier α controls the input and output channel counts, and the resolution multiplier ρ controls the image (feature map) resolution.

V2
Compared with v1 there are three differences: 1. residual structures are introduced; 2. a 1×1 convolution is applied before the depthwise (dw) convolution to increase the channel count of the feature map, which differs from an ordinary residual block; 3. after the pointwise convolution, ReLU is replaced by a linear activation to prevent ReLU from damaging the features. The reason is that the features a dw layer can extract are limited by the number of input channels; with a traditional residual block, which compresses first, the dw layer would extract even fewer features, so it is better to expand first and then compress. With the expand-convolve-compress scheme, however, a problem arises after compression: ReLU damages the features, and since the features are already compressed, passing them through ReLU loses even more of them, hence the linear activation.

V3
A combination of complementary search techniques: resource-constrained NAS performs the module-level search, and NetAdapt performs the local search. Network structure improvements: the average pooling layer is moved before the last step and the last convolution layer is removed, the h-swish activation function is introduced, and the initial filter set is modified. V3 combines the depthwise separable convolutions of v1, the linear bottleneck of v2, and the lightweight SE attention module.

7. EffNet
EffNet is an improvement on MobileNet-v1. The main idea is to decompose the dw layer of MobileNet-v1 into separate 3×1 and 1×3 dw layers, so that pooling can be applied after the first layer to reduce the computation of the second. EffNet is smaller and more efficient than MobileNet-v1 and ShuffleNet-v1.

8. EfficientNet
Studies how to scale a network's depth, width, and input resolution, and how the three interact, achieving higher efficiency and accuracy.

9. ResNet
VGG proved that increasing depth is an effective way to improve accuracy, but deeper networks are prone to vanishing gradients and fail to converge; tests show that beyond 20 layers, convergence gets worse as layers are added. By adding shortcut connections, ResNet effectively alleviates (mitigates rather than completely solves) the vanishing gradient problem.
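The shortcut connection described above can be sketched in a few lines. This is a minimal basic block, assuming PyTorch, not the exact ResNet definition; the channel count is illustrative.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A minimal residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut: gradients flow through the addition unchanged

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))  # same shape in and out: torch.Size([1, 64, 32, 32])
```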
10. ResNeXt
Builds on ResNet and Inception, combining split + transform + concatenate, and performs better than ResNet, Inception, and Inception-ResNet. Group convolution can be used. In general there are three ways to increase a network's expressive power: 1. increase depth, as from AlexNet to ResNet, although experiments show the gains from depth diminish; 2. increase the width of network modules, but this makes the parameter count grow exponentially and is not a mainstream CNN design; 3. improve the network structure itself, as in the Inception series and ResNeXt. Experiments show that increasing the cardinality, i.e., the number of identical branches in a block, is a better way to improve a model's expressive power.

11. DenseNet
DenseNet greatly reduces the number of parameters through feature reuse and also alleviates the vanishing gradient problem to some extent.

12. SqueezeNet
Introduces the fire module: a squeeze layer followed by an expand layer. The squeeze layer uses 1×1 convolutions, while the expand layer uses 1×1 and 3×3 convolutions whose outputs are concatenated. SqueezeNet has 1/50 of AlexNet's parameters, and 1/510 after compression, with accuracy comparable to AlexNet.

13. ShuffleNet Series
V1
Reduces computation through grouped convolutions and 1×1 pointwise group convolutions, and enriches cross-channel information by reorganizing the channels (see the channel-shuffle sketch at the end of this section). Xception and ResNeXt are inefficient as small network models because their many 1×1 convolutions are expensive, so pointwise group convolutions are proposed to reduce the computational complexity; since these have side effects, channel shuffle is introduced to help information flow across groups. Although dw convolutions reduce computation and parameters, they are less efficient than dense operations on low-power devices, so ShuffleNet uses depthwise convolutions at the bottleneck to keep the overhead as small as possible.

V2
Design criteria for more efficient CNN structures:
Keeping the input and output channel counts equal minimizes memory access cost.
Using too many groups in grouped convolutions increases memory access cost.
Overly complex structures (too many branches and basic units) reduce parallelism.
Element-wise operations should not be overlooked.

14. SENet

15. SKNet
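For reference, the channel shuffle operation mentioned under ShuffleNet-v1 is just a reshape and transpose. A minimal sketch, assuming PyTorch; the function name and tensor sizes are illustrative only.

```python
import torch

def channel_shuffle(x, groups):
    """Reorder channels so information can flow between groups after a grouped convolution."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split the channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave channels across groups
    return x.view(n, c, h, w)                 # flatten back to (N, C, H, W)

x = torch.arange(8.0).view(1, 8, 1, 1)         # channels 0..7 in two groups of four
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```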