Author: zzq
https://zhuanlan.zhihu.com/p/68411179
This article is reposted with authorization; reprinting without permission is prohibited.
Introduction to Basic Components of CNN
1. Local Receptive Field
In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to perceive a local region, and that local information can be combined at higher layers to obtain global information. The convolution operation implements the local receptive field, and because the convolution kernel's weights are shared across all spatial positions, it also greatly reduces the number of parameters.
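To make weight sharing concrete, here is a minimal PyTorch sketch (the image size and channel counts are arbitrary choices for illustration, not from the original post):

```python
import torch
import torch.nn as nn

# Each output neuron of a 3x3 convolution only "sees" a 3x3 local region,
# and the same small kernel is reused at every spatial position.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# A fully connected layer computing a same-sized mapping, for comparison.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)  # torch.Size([1, 16, 32, 32])

print(sum(p.numel() for p in conv.parameters()))  # 448
print(sum(p.numel() for p in fc.parameters()))    # 50348032
```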
2. Pooling
Pooling downsamples the feature map, discarding fine detail while retaining the most important information, primarily to reduce computation. The two common variants are max pooling and average pooling.
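A short sketch of the two pooling variants (sizes are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# Both halve the spatial resolution: max pooling keeps the strongest
# response in each 2x2 window, average pooling keeps the mean.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)  # torch.Size([1, 16, 16, 16])
print(avg_pool(x).shape)  # torch.Size([1, 16, 16, 16])
```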
3. Activation Function
The activation function introduces non-linearity. Common choices are sigmoid, tanh, and ReLU; the first two are typically used in fully connected layers, while ReLU is the usual choice in convolutional layers.
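The three functions side by side:

```python
import torch

x = torch.linspace(-3, 3, 5)
print(torch.sigmoid(x))  # squashes values into (0, 1)
print(torch.tanh(x))     # squashes values into (-1, 1)
print(torch.relu(x))     # zeroes negatives, keeps positives unchanged
```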
4. Fully Connected Layer
The fully connected layer acts as a classifier in the entire convolutional neural network. The outputs from previous layers need to be flattened before entering the fully connected layer.
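A minimal sketch of the flatten-then-classify step (shapes are arbitrary):

```python
import torch
import torch.nn as nn

features = torch.randn(8, 64, 4, 4)      # a batch of 8 feature maps
flat = features.flatten(start_dim=1)     # -> shape (8, 1024)
classifier = nn.Linear(64 * 4 * 4, 10)   # 10-way classifier head
print(classifier(flat).shape)            # torch.Size([8, 10])
```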
Classic Network Architectures
1. LeNet5
It consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolution kernels are all 5×5 with stride 1, and the pooling layers use max pooling.
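A minimal sketch following the description above (the 6/16 channel sizes are the commonly used convention; exact sizes vary between implementations):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # conv1 + pool1
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # conv2 + pool2
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),                        # fc1
    nn.Linear(120, 10),                                           # fc2
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```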

2. AlexNet
The model has eight layers (excluding the input layer): five convolutional layers and three fully connected layers, with the final layer using softmax to produce the classification output.
AlexNet uses ReLU as the activation function, employs dropout and data augmentation to prevent overfitting, trains across two GPUs, and uses local response normalization (LRN).


3. VGG
Uses only stacked 3×3 convolution kernels to obtain the receptive field of larger kernels, which allows a deeper network. VGG has five convolutional stages, each followed by a max pooling layer, and the number of convolution kernels increases stage by stage.
Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions are useful but not as effective as 3×3.
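The parameter saving from stacking 3×3 kernels is easy to verify (the channel count is an arbitrary choice):

```python
import torch.nn as nn

c = 64
# Two stacked 3x3 convolutions cover the same 5x5 receptive field as a
# single 5x5 convolution, with fewer parameters and one more non-linearity.
stacked = nn.Sequential(
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
)
single = nn.Conv2d(c, c, 5, padding=2)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(stacked), params(single))  # 73856 vs 102464
```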

4. GoogLeNet (Inception v1)
From VGG, we learned that deeper networks yield better results. However, as the model deepens, the number of parameters increases, leading to a higher risk of overfitting, requiring more training data; moreover, complex networks mean more computational load, larger model storage, and slower speeds. GoogLeNet was designed to reduce parameters.
GoogLeNet instead increases network complexity by widening the network: each Inception module applies 1×1, 3×3, and 5×5 convolutions (and pooling) in parallel, letting the network learn which kernel sizes to rely on. This design reduces parameters while making the network adaptive to multiple scales. 1×1 convolutions are used to reduce channel dimensionality first, adding depth and non-linearity at very low parameter cost.
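A simplified Inception-v1 module (the branch widths here are illustrative, not the paper's exact numbers):

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches; 1x1 convolutions reduce
    channels before the larger kernels, and the outputs are concatenated."""
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 64, 1)
        self.b2 = nn.Sequential(nn.Conv2d(c_in, 32, 1),
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

print(InceptionBlock(192)(torch.randn(1, 192, 28, 28)).shape)  # (1, 192, 28, 28)
```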

Inception-v2
Introduces batch normalization on top of v1 (in TensorFlow implementations, applying BN before the activation function yields better results) and replaces the 5×5 convolution with two consecutive 3×3 convolutions, deepening the network while reducing its parameters.
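Both changes in one small sketch (channel count arbitrary):

```python
import torch.nn as nn

# The 5x5 convolution is replaced by two 3x3 convolutions, and BN sits
# between each convolution and its activation.
block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
)
```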
Inception-v3
The core idea is to decompose convolution kernels into smaller convolutions, such as breaking down a 7×7 convolution into 1×7 and 7×1 convolutions, reducing network parameters while increasing depth.
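The asymmetric factorization, with parameter counts printed for comparison (channel count arbitrary):

```python
import torch.nn as nn

c = 64
# 7x7 factorized into 1x7 followed by 7x1.
factorized = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, 7), padding=(0, 3)),
    nn.Conv2d(c, c, kernel_size=(7, 1), padding=(3, 0)),
)
full = nn.Conv2d(c, c, 7, padding=3)

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(factorized), params(full))  # 57472 vs 200768
```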
Inception-v4
Introduces residual connections (from ResNet) to speed up training and improve performance. However, when the number of filters exceeds 1000, training becomes unstable; scaling the residual activations by a small factor alleviates this.
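The scaling trick in one line (0.2 is an illustrative value; small constants in roughly the 0.1-0.3 range are commonly cited):

```python
def scaled_residual(x, residual, scale=0.2):
    # Scale down the residual branch before adding it back, to stabilize
    # training when the number of filters is very large.
    return x + scale * residual
```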
5. Xception
Based on Inception-v3, its core idea is the depthwise separable convolution, though with a difference in ordering: Xception performs a 1×1 convolution first and the 3×3 convolution afterwards, i.e., it merges channels before the spatial convolution, whereas standard depthwise convolution performs the spatial 3×3 convolution first and the channel-wise 1×1 convolution afterwards. In both cases the core idea is to decouple channel convolution from spatial convolution. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU. Compared with Inception-v3, Xception's parameter count is similar (slightly reduced) while its accuracy improves: Xception spends the savings on network width to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
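The two orderings side by side (channel counts are illustrative):

```python
import torch.nn as nn

c_in, c_out = 64, 128
# Xception order: 1x1 channel mixing first, then per-channel 3x3.
xception_style = nn.Sequential(
    nn.Conv2d(c_in, c_out, 1),
    nn.Conv2d(c_out, c_out, 3, padding=1, groups=c_out),  # depthwise
)
# Depthwise-separable order (as in MobileNet-v1, with BN and ReLU added):
mobilenet_style = nn.Sequential(
    nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),     # depthwise 3x3
    nn.BatchNorm2d(c_in), nn.ReLU(),
    nn.Conv2d(c_in, c_out, 1),                            # pointwise 1x1
    nn.BatchNorm2d(c_out), nn.ReLU(),
)
```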


6. MobileNet Series
V1
Uses depthwise separable convolutions, and abandons pooling layers in favor of stride-2 convolutions. In a standard convolution, each kernel has as many channels as the input feature map; a depthwise convolution kernel has a single channel. Two hyperparameters control the model's size: the width multiplier α scales the input/output channel counts, and the resolution multiplier ρ scales the input image (feature map) resolution.
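A sketch of one v1 block with the width multiplier applied (a simplified illustration, not the full network):

```python
import torch.nn as nn

def mobilenet_v1_block(c_in, c_out, stride=1, alpha=1.0):
    # stride=2 here plays the role of pooling; alpha scales channel counts.
    c_in, c_out = int(c_in * alpha), int(c_out * alpha)
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in),
        nn.BatchNorm2d(c_in), nn.ReLU(),
        nn.Conv2d(c_in, c_out, 1),
        nn.BatchNorm2d(c_out), nn.ReLU(),
    )
```

(The resolution multiplier ρ is applied to the input image rather than to the layers, so it does not appear in the block itself.)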


V2
Compared with v1 there are three differences: 1. residual structures are introduced; 2. a 1×1 convolution is performed before the depthwise convolution to increase the number of feature map channels, the opposite of a typical residual block, which compresses first; 3. after the pointwise convolution, ReLU is replaced with a linear activation to keep ReLU from destroying features. The reasoning: the features a depthwise layer can extract are limited by its number of input channels, so if the traditional residual block were used, compressing first, the depthwise layer would extract even fewer features; hence the block first expands and then compresses. But with this expand-convolve-compress scheme, a problem appears after compression: ReLU damages features, and since the features are already compressed, passing them through ReLU would lose even more information, so a linear activation is used instead.
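A sketch of the inverted residual block described above (the expansion factor 6 is the commonly cited default; other sizes are illustrative):

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expand (1x1) -> depthwise 3x3 -> project (1x1, LINEAR, no ReLU),
    with a shortcut connection when input and output shapes match."""
    def __init__(self, c, expand=6):
        super().__init__()
        h = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, h, 1), nn.BatchNorm2d(h), nn.ReLU6(),
            nn.Conv2d(h, h, 3, padding=1, groups=h), nn.BatchNorm2d(h), nn.ReLU6(),
            nn.Conv2d(h, c, 1), nn.BatchNorm2d(c),   # linear bottleneck
        )

    def forward(self, x):
        return x + self.block(x)

print(InvertedResidual(32)(torch.randn(1, 32, 56, 56)).shape)  # (1, 32, 56, 56)
```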


V3
A combination of complementary search techniques: resource-constrained NAS performs the module-level search, while NetAdapt performs the local, layer-wise search. Network structure improvements: the final average pooling layer is moved forward and the last convolution layer removed, the h-swish activation function is introduced, and the initial filter set is modified.
V3 combines v1's depthwise separable convolutions, v2's linear-bottleneck residual structure, and the SE block's lightweight attention mechanism.
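h-swish itself is simple enough to write out:

```python
import torch
import torch.nn.functional as F

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a cheap piecewise approximation
    # of the swish activation.
    return x * F.relu6(x + 3) / 6

print(h_swish(torch.tensor([-4.0, 0.0, 4.0])))  # ≈ [0., 0., 4.]
```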



7. EffNet
EffNet is an improvement on MobileNet-v1. Its main idea is to decompose MobileNet-v1's 3×3 depthwise layer into a 1×3 and a 3×1 depthwise layer, with pooling applied after the first of the two, which reduces the computation of the second layer. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.
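A simplified sketch of that decomposition (the real EffNet block has more detail; this follows only the description above, with an arbitrary channel count):

```python
import torch.nn as nn

c = 64
effnet_dw = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1), groups=c),  # 1x3 dw
    nn.MaxPool2d(kernel_size=(1, 2)),   # pool along one axis after layer one
    nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0), groups=c),  # 3x1 dw
)
```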


8. EfficientNet
Studies how to scale a network in depth, width, and input resolution, and how these three dimensions interact; scaling them jointly achieves higher efficiency and accuracy.
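A toy illustration of compound scaling; the α/β/γ values below are the ones commonly cited for EfficientNet and should be treated as an assumption here:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # depth / width / resolution factors

def scale(phi, base_depth=1.0, base_width=1.0, base_res=224):
    # One coefficient phi scales all three dimensions together.
    return {
        "depth": base_depth * alpha ** phi,
        "width": base_width * beta ** phi,
        "resolution": round(base_res * gamma ** phi),
    }

print(scale(phi=1))  # {'depth': 1.2, 'width': 1.1, 'resolution': 258}
```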

9. ResNet
VGG demonstrated that deeper networks are an effective way to improve accuracy, but deeper networks are prone to gradient vanishing and may fail to converge. Tests show that beyond roughly 20 layers, convergence worsens as layers are added. ResNet effectively alleviates the gradient vanishing problem (though it does not completely solve it) by adding shortcut connections.
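The shortcut is a one-line change to an ordinary two-convolution block (channel count arbitrary):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """A ResNet-style basic block: the input is added to the residual
    branch, giving gradients a direct path through the network."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))  # shortcut connection

print(BasicBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # (1, 64, 8, 8)
```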

10. ResNeXt
Combines ResNet and Inception through a split+transform+merge strategy, and outperforms ResNet, Inception, and Inception-ResNet; the parallel branches can be implemented with grouped convolution. In general there are three ways to increase a network's expressive power: 1. increase its depth, as from AlexNet to ResNet, though experiments show the gains from added depth diminish; 2. increase the width of its modules, but this rapidly inflates the parameter count and is not mainstream in CNN design; 3. improve the network structure itself, as in the Inception series and ResNeXt. Experiments found that increasing the cardinality, i.e., the number of identical branches in a block, is a more effective way to improve model expressiveness.
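A sketch of a ResNeXt-style bottleneck, where a grouped 3×3 convolution realizes the 32 identical branches (the 256-128-256 layout echoes the commonly cited configuration, used here as an assumption):

```python
import torch.nn as nn

cardinality = 32
resnext_branch = nn.Sequential(
    nn.Conv2d(256, 128, 1), nn.BatchNorm2d(128), nn.ReLU(),
    # groups=32 is equivalent to 32 parallel branches (split+transform+merge)
    nn.Conv2d(128, 128, 3, padding=1, groups=cardinality),
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, 1), nn.BatchNorm2d(256),
)
```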


11. DenseNet
Through dense connections, each layer takes the concatenated feature maps of all preceding layers as input; this feature reuse significantly reduces the number of parameters and alleviates the gradient vanishing problem to some extent.
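A minimal dense layer showing the concatenation-based reuse (the growth rate 32 is an illustrative choice):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Each layer's output is concatenated onto its input, so every later
    layer sees the feature maps of all earlier layers."""
    def __init__(self, c_in, growth=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(c_in), nn.ReLU(),
            nn.Conv2d(c_in, growth, 3, padding=1),
        )

    def forward(self, x):
        return torch.cat([x, self.conv(x)], dim=1)

x = torch.randn(1, 64, 8, 8)
x = DenseLayer(64)(x)   # -> 96 channels
x = DenseLayer(96)(x)   # -> 128 channels
```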

12. SqueezeNet
Introduces the Fire module: a squeeze layer followed by an expand layer. The squeeze layer uses 1×1 convolutions; the expand layer applies 1×1 and 3×3 convolutions in parallel and concatenates their outputs. SqueezeNet has 1/50 the parameters of AlexNet (1/510 after model compression) while achieving comparable accuracy.
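A sketch of the Fire module (the 96/16/64 sizes are used here just for illustration):

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze with 1x1 convolutions, then expand with parallel 1x1 and
    3x3 convolutions whose outputs are concatenated."""
    def __init__(self, c_in, squeeze, expand):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(c_in, squeeze, 1), nn.ReLU())
        self.e1 = nn.Conv2d(squeeze, expand, 1)
        self.e3 = nn.Conv2d(squeeze, expand, 3, padding=1)

    def forward(self, x):
        s = self.squeeze(x)
        return torch.relu(torch.cat([self.e1(s), self.e3(s)], dim=1))

print(FireModule(96, 16, 64)(torch.randn(1, 96, 55, 55)).shape)  # (1, 128, 55, 55)
```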

13. ShuffleNet Series
V1
Reduces computation through grouped convolution and 1×1 pointwise group convolutions, and enriches cross-channel information by reorganizing (shuffling) the channels. Xception and ResNeXt are inefficient in small models because their many dense 1×1 convolutions are expensive, so pointwise group convolution was proposed to reduce that cost; since grouping has the side effect of blocking information flow between groups, channel shuffle was introduced to restore it. While depthwise convolution reduces computation and parameters, it is less efficient than dense operations on low-power devices, so ShuffleNet uses depthwise convolution only at the bottleneck to minimize its overhead.
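Channel shuffle itself is a cheap reshape-transpose-reshape:

```python
import torch

def channel_shuffle(x, groups):
    # Reorder channels so the next grouped convolution sees information
    # from every group: (n, g, c/g, h, w) -> transpose -> flatten back.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())  # [0., 4., 1., 5., 2., 6., 3., 7.]
```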

V2
Design principles for making neural networks more efficient:
1. Keeping input and output channel counts equal minimizes memory access cost (a short justification follows the list).
2. Using too many groups in grouped convolutions increases memory access cost.
3. Overly complex network structures (too many branches and basic units) reduce network parallelism.
4. Element-wise operations also have non-negligible cost.
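A short justification of the first rule, following the ShuffleNet-v2 paper's argument: for a 1×1 convolution with c1 input and c2 output channels on an h×w feature map, the memory access cost is MAC = hw(c1 + c2) + c1c2 and the FLOPs are B = hw·c1·c2; by the AM-GM inequality, MAC ≥ 2√(hw·B) + B/hw, with equality exactly when c1 = c2.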

14. SENet
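SENet introduces the squeeze-and-excitation (SE) block, the lightweight channel attention mentioned in the MobileNet-v3 section: global average pooling squeezes each channel to a single value, a small two-layer bottleneck produces per-channel weights, and the feature maps are rescaled by those weights. A minimal sketch (the reduction ratio 16 is the commonly cited default):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global average pool) -> excitation (bottleneck FC + sigmoid)
    -> channel-wise rescaling of the input feature maps."""
    def __init__(self, c, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(),
            nn.Linear(c // reduction, c), nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: (n, c)
        return x * w[:, :, None, None]    # excite: rescale each channel

print(SEBlock(64)(torch.randn(1, 64, 8, 8)).shape)  # (1, 64, 8, 8)
```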

15. SKNet
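SKNet (Selective Kernel Networks) extends the SE idea from channel attention to kernel selection: each block runs parallel branches with different kernel sizes (e.g., 3×3 and 5×5), and a softmax attention over the branches lets each neuron adaptively choose its receptive field size.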
