The Development of CNN Architectures: From LeNet to EfficientNet

Author: zzq

https://zhuanlan.zhihu.com/p/68411179

This article is published with the author's authorization; reproduction without permission is not allowed.

Introduction to Basic Components of CNN

1. Local Receptive Field

In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Therefore each neuron does not need to perceive the entire image; it only needs to respond to a local region, and the local information can then be aggregated at higher layers to obtain global information. The convolution operation implements the local receptive field, and because convolution shares its weights across positions, it also greatly reduces the number of parameters.
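As a rough, hypothetical illustration of how weight sharing cuts the parameter count (the 32×32 input size and the use of PyTorch are my own choices, not from the original article):

import torch.nn as nn

# A fully connected mapping over a 32x32 single-channel image connects every output
# position to every input pixel; a 3x3 convolution connects each output position only
# to its local 3x3 neighborhood and reuses the same 9 weights everywhere.
fc = nn.Linear(32 * 32, 32 * 32)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 1,049,600 parameters
print(count(conv))  # 10 parameters (3*3 kernel + bias), shared across all positions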

2. Pooling

Pooling downsamples the feature map, discarding some pixel-level information while retaining the important information, primarily to reduce the amount of computation. The main variants are max pooling and average pooling.

3. Activation Function

The activation function is used to introduce non-linearity. Common activation functions include sigmoid, tanh, and ReLU; the first two are often used in fully connected layers, while ReLU is commonly used in convolutional layers.

4. Fully Connected Layer

The fully connected layer acts as a classifier in the convolutional neural network. The output from previous layers needs to be flattened before entering the fully connected layer.

Classic Network Architectures

1. LeNet5

Consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolution kernels are all 5×5 with stride = 1, and the pooling layers perform 2×2 downsampling (the original paper used average-pooling-style subsampling; max pooling is common in modern reimplementations).
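A minimal PyTorch sketch of this structure (the 6/16/120/84 channel and unit counts follow the classic LeNet-5 layout; treat it as an illustration rather than an exact reproduction):

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.ReLU(),   # 32x32 -> 28x28
            nn.MaxPool2d(2),                                       # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.ReLU(),  # 14x14 -> 10x10
            nn.MaxPool2d(2),                                       # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])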

2. AlexNet

The model contains a total of eight layers (excluding the input layer), including five convolutional layers and three fully connected layers. The last layer uses softmax for classification output.

AlexNet uses ReLU as the activation function; it employs dropout and data augmentation to prevent overfitting; it trains on two GPUs in parallel; and it incorporates local response normalization (LRN).

3. VGG

VGG stacks 3×3 convolution kernels throughout to obtain the receptive field of a larger kernel with fewer parameters, resulting in a deeper network. It consists of five convolutional blocks, each followed by a max pooling layer, and the number of convolution kernels increases block by block.
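A small sketch of why stacked 3×3 kernels are attractive: two 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, but with fewer parameters (the channel count of 64 below is arbitrary):

import torch.nn as nn

C = 64
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_5x5))  # 25 * C * C = 102400
print(count(two_3x3))  # 2 * 9 * C * C = 73728, same 5x5 receptive field, one extra non-linearity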

Summary: LRN has little effect; deeper networks yield better results; 1×1 convolutions are effective but not as good as 3×3.

4. GoogLeNet (Inception v1)

From VGG we learned that deeper networks yield better results. However, as a model gets deeper its parameter count grows, the network becomes more prone to overfitting and needs more training data; in addition, a more complex network means more computation, larger model storage, and slower speed. GoogLeNet was designed to go deeper while keeping the number of parameters low.

GoogLeNet increases network complexity by making the network wider and lets the network choose among kernel sizes itself: an Inception module runs 1×1, 3×3, and 5×5 convolutions and a pooling branch in parallel and concatenates their outputs. This design reduces parameters while making the network more adaptive to multiple scales. The 1×1 convolutions reduce the channel dimension before the expensive convolutions, so network complexity can be increased with little growth in parameters.
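A simplified Inception-style module as a sketch of this idea; the branch widths below follow the commonly quoted first Inception block of GoogLeNet, but the exact numbers are not essential:

import torch
import torch.nn as nn

# Parallel 1x1 / 3x3 / 5x5 / pool branches, with 1x1 "reduction" convolutions
# keeping the channel counts (and therefore the parameters) low.
class InceptionBlock(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU())
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU())
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192, 64, 96, 128, 16, 32, 32)(x).shape)  # [1, 256, 28, 28]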

Inception-v2

Building on v1, it introduces batch normalization (BN); in practice (e.g., in TensorFlow implementations), placing BN before the activation function gives better results. It also replaces the 5×5 convolution with two consecutive 3×3 convolutions, which deepens the network while reducing parameters.
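A small sketch of both points, assuming a Conv-BN-ReLU ordering and arbitrary channel sizes:

import torch.nn as nn

# Conv -> BN -> ReLU ordering (BN normalizes the pre-activations).
def conv_bn_relu(in_ch, out_ch, k):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# One 5x5 conv-BN-ReLU block ...
block_5x5 = conv_bn_relu(64, 64, 5)
# ... replaced by two stacked 3x3 conv-BN-ReLU blocks with the same receptive field.
block_3x3x2 = nn.Sequential(conv_bn_relu(64, 64, 3), conv_bn_relu(64, 64, 3))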

Inception-v3

The core idea is to factorize convolution kernels into smaller ones, for example decomposing a 7×7 kernel into a 1×7 kernel followed by a 7×1 kernel, which reduces parameters while increasing network depth.
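A quick parameter count for this factorization (the channel count of 128 is arbitrary):

import torch.nn as nn

C = 128
params = lambda m: sum(p.numel() for p in m.parameters())

full_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
factored = nn.Sequential(                       # asymmetric 1x7 + 7x1 factorization
    nn.Conv2d(C, C, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.Conv2d(C, C, kernel_size=(7, 1), padding=(3, 0), bias=False),
)
print(params(full_7x7))  # 49 * C * C = 802816
print(params(factored))  # 14 * C * C = 229376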

Inception-v4

Introduces residual connections (combining Inception with ResNet) to accelerate training and improve performance. However, when the number of filters exceeds 1000, the residual variants become unstable during training; scaling the residual branch by a small factor before adding it to the shortcut alleviates this.
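A minimal sketch of this residual scaling trick; the 0.2 factor and tensor shapes are only examples (the paper reports factors roughly in the 0.1-0.3 range):

import torch

# The output of the residual (Inception) branch is multiplied by a small constant
# before being added back to the shortcut path.
def scaled_residual_add(shortcut, branch_out, scale=0.2):
    return shortcut + scale * branch_out

x = torch.randn(1, 1152, 8, 8)       # illustrative shortcut activations
branch = torch.randn(1, 1152, 8, 8)  # illustrative branch output
out = scaled_residual_add(x, branch)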

5. Xception

Proposed as an extension of Inception-v3, its core idea is the depthwise separable convolution, with some differences from the standard form. The model has slightly fewer parameters yet higher accuracy. Xception applies the 1×1 pointwise convolution first and then a per-channel 3×3 spatial convolution, i.e., it merges channels before the spatial convolution; the standard depthwise separable convolution does the opposite, performing the per-channel 3×3 spatial convolution first and then the 1×1 pointwise convolution. The core idea is to decouple channel correlations from spatial correlations. MobileNet-v1 uses the depthwise-first order and inserts BN and ReLU after each step. Xception has roughly the same parameter count as Inception-v3 but spends the savings on a wider network to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
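The two orderings side by side, as a sketch with illustrative channel counts (64 in, 128 out):

import torch.nn as nn

# "Xception order": pointwise (channel fusion) first, then per-channel spatial conv.
xception_style = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, bias=False),                          # 1x1 pointwise
    nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128, bias=False),  # 3x3 depthwise
)

# Standard depthwise separable order (as in MobileNet-v1): per-channel spatial conv first,
# then 1x1 pointwise to mix channels, with BN + ReLU after each step.
mobilenet_style = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),     # 3x3 depthwise
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=1, bias=False),                          # 1x1 pointwise
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
)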

6. MobileNet Series

V1

Uses depthwise separable convolutions, and replaces pooling layers with stride-2 convolutions. In a standard convolution, each kernel has as many channels as the input feature map; in a depthwise convolution, each kernel has a single channel and operates on one input channel. Two hyperparameters control the model size: the width multiplier α scales the number of input/output channels, and the resolution multiplier ρ scales the input image (feature map) resolution.
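A sketch of a MobileNet-v1-style depthwise separable block together with the two multipliers; the base channel counts, α = 0.5 and ρ ≈ 0.714 are example values only:

import torch
import torch.nn as nn

def ds_block(in_ch, out_ch, stride=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),   # depthwise: one 3x3 filter per channel
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),  # pointwise: mixes channels
    )

alpha, rho = 0.5, 0.714                     # width and resolution multipliers
in_ch, out_ch = int(64 * alpha), int(128 * alpha)
size = int(224 * rho)
block = ds_block(in_ch, out_ch, stride=2)   # stride-2 conv instead of pooling
print(block(torch.randn(1, in_ch, size, size)).shape)  # torch.Size([1, 64, 80, 80])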

V2

Compared with v1, there are three differences: 1. residual structures are introduced; 2. a 1×1 convolution is performed before the depthwise (dw) layer to expand the number of feature-map channels, which differs from a conventional residual block; 3. after the pointwise projection, ReLU is replaced with a linear activation to keep it from destroying features. The reasoning: the features a dw layer can extract are limited by the number of input channels, and a traditional residual block (which compresses first) would leave the dw layer even fewer channels to work with, so it is better to expand first. With this expand-convolve-compress design, however, applying ReLU after the compression causes a problem: the features are already compressed, and ReLU would discard even more of them, so a linear activation is used instead.
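A minimal inverted-residual sketch in this spirit (the expansion factor 6 follows the paper; the channel sizes are examples):

import torch
import torch.nn as nn

# 1x1 expansion -> 3x3 depthwise -> 1x1 projection with NO activation,
# plus a shortcut when the shapes match.
class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                             # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                            # project (compress)
            nn.BatchNorm2d(out_ch),                                              # linear: no ReLU here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

print(InvertedResidual(32, 32)(torch.randn(1, 32, 56, 56)).shape)  # [1, 32, 56, 56]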

V3

A combination of complementary search techniques: a resource-constrained NAS searches for the overall module structure, and NetAdapt then performs a layer-wise local search. Network structure improvements: the average pooling layer is moved forward and the last convolutional layer is removed, the h-swish activation function is introduced, and the initial set of filters is modified.

V3 combines the depthwise separable convolutions of v1, the linear-bottleneck inverted residual structure of v2, and the lightweight SE (squeeze-and-excitation) attention module.
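Sketches of two of these ingredients, h-swish and an SE-style channel-attention block; the reduction ratio of 4 is an example value, and this generic SE block uses a sigmoid where MobileNet-v3 actually uses a hard sigmoid:

import torch
import torch.nn as nn
import torch.nn.functional as F

# h-swish: x * ReLU6(x + 3) / 6, a cheap approximation of swish.
def h_swish(x):
    return x * F.relu6(x + 3.0) / 6.0

# A squeeze-and-excitation style channel-attention module.
class SEBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                            # squeeze: global average pooling
        w = torch.sigmoid(self.fc2(F.relu(self.fc1(w))))  # excitation: per-channel weights
        return x * w[:, :, None, None]                    # rescale the feature-map channels

x = torch.randn(1, 16, 32, 32)
print(SEBlock(16)(h_swish(x)).shape)  # [1, 16, 32, 32]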

7. EffNet

EffNet is an improvement on MobileNet-v1. Its main idea is to decompose the depthwise (dw) layer of MobileNet-v1 into a 1×3 dw layer and a 3×1 dw layer, and to apply pooling after the first of them, so that the second layer runs on a smaller feature map and requires less computation. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.
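A sketch of just this decomposition idea (not the full EffNet block): the 3×3 depthwise convolution is split into 1×3 and 3×1 depthwise convolutions with a pooling step in between, so the second spatial convolution sees a smaller feature map:

import torch
import torch.nn as nn

def effnet_style_dw(channels):
    return nn.Sequential(
        nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), groups=channels, bias=False),
        nn.MaxPool2d(kernel_size=(2, 1), stride=(2, 1)),   # downsample one axis early
        nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), groups=channels, bias=False),
    )

x = torch.randn(1, 32, 56, 56)
print(effnet_style_dw(32)(x).shape)  # [1, 32, 28, 56]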

8. EfficientNet

Studies how to scale network depth, width, and input resolution, and how the three interact, in order to design networks with higher efficiency and accuracy (compound scaling).
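The compound scaling rule can be written as depth = α^φ, width = β^φ, resolution = γ^φ, with the base coefficients chosen so that α·β²·γ² ≈ 2. A tiny script using the values reported for the B0 baseline (α = 1.2, β = 1.1, γ = 1.15), shown purely for illustration:

# For each compound coefficient phi, print the depth/width/resolution multipliers
# and the approximate FLOPs growth (alpha * beta^2 * gamma^2) ** phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

for phi in range(4):
    depth_mult = alpha ** phi
    width_mult = beta ** phi
    res_mult = gamma ** phi
    print(f"phi={phi}: depth x{depth_mult:.2f}, width x{width_mult:.2f}, "
          f"resolution x{res_mult:.2f}, FLOPs x{(alpha * beta**2 * gamma**2) ** phi:.2f}")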

9. ResNet

VGG demonstrated that increasing depth is an effective way to improve accuracy, but deeper networks are prone to vanishing gradients and converge poorly; tests showed that beyond roughly 20 layers, convergence gets worse as depth increases. By adding shortcut (skip) connections, ResNet effectively alleviates (though does not completely solve) the vanishing gradient problem.
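A basic residual block as a sketch; the identity shortcut gives the signal and its gradient a path that bypasses the convolutional branch:

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))   # shortcut: F(x) + x

print(BasicBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # [1, 64, 56, 56]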

10. ResNeXt

Builds on ResNet and Inception by combining the split-transform-merge idea with residual connections, and it outperforms ResNet, Inception, and Inception-ResNet; the parallel branches can be implemented with grouped convolutions (see the sketch after this list). In general, there are three ways to increase a network's expressive power:

1. Increase depth, as in the progression from AlexNet to ResNet; but experiments show diminishing returns from ever deeper networks.

2. Increase the width of the network's modules; but this quickly blows up the parameter count, so it is not the mainstream CNN design.

3. Improve the structural design itself, as in the Inception series and ResNeXt. Experiments show that increasing the cardinality, i.e., the number of identical parallel branches in a block, is a better way to improve the model's expressiveness.
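A ResNeXt-style bottleneck sketch in which the groups argument of the 3×3 convolution plays the role of cardinality; the 256/128/32 sizes follow the commonly used 32×4d setting, but are otherwise just an example:

import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, in_ch=256, width=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),  # grouped conv
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, in_ch, 1, bias=False), nn.BatchNorm2d(in_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

print(ResNeXtBlock()(torch.randn(1, 256, 56, 56)).shape)  # [1, 256, 56, 56]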

11. DenseNet

DenseNet connects each layer to all preceding layers, taking the concatenation of their feature maps as input; this feature reuse significantly reduces the number of parameters and also alleviates the vanishing gradient problem to some extent.
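A minimal dense-block sketch showing the feature reuse: each layer takes the concatenation of all previous feature maps as input and contributes growth_rate new channels (the sizes below are examples):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of everything produced so far.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

print(DenseBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # [1, 64 + 4*32, 32, 32] = [1, 192, 32, 32]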

12. SqueezeNet

Proposes the fire module: a squeeze layer followed by an expand layer. The squeeze layer uses 1×1 convolutions, while the expand layer uses both 1×1 and 3×3 convolutions, whose outputs are concatenated. SqueezeNet has about 1/50 the parameters of AlexNet (roughly 1/510 after model compression), yet its accuracy is comparable to AlexNet's.
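A fire-module sketch; the 96/16/64/64 channel sizes follow a typical early fire module, but the exact numbers are not important:

import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1_ch, expand3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand1_ch, 1), nn.ReLU(inplace=True))
        self.expand3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand3_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)                                        # 1x1 squeeze
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)  # 1x1 + 3x3 expand, concatenated

print(FireModule(96, 16, 64, 64)(torch.randn(1, 96, 55, 55)).shape)  # [1, 128, 55, 55]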

13. ShuffleNet Series

V1

Reduces computation with grouped convolutions and 1×1 pointwise group convolutions, and reorganizes (shuffles) the channels so that information is exchanged across groups. Xception and ResNeXt are less efficient as small network models because their many dense 1×1 convolutions are expensive, so pointwise group convolutions were proposed to lower that cost. However, pointwise group convolutions have the side effect of blocking information flow between groups, hence the introduction of channel shuffle to restore it. Although depthwise convolution greatly reduces computation and parameters, it is less efficient on low-power devices than dense operations, so ShuffleNet applies depthwise convolution only on the bottleneck feature maps to minimize the overhead.
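A minimal channel-shuffle sketch; the 8-channel example makes the reordering visible:

import torch

# After a grouped 1x1 convolution, channels are reordered so that the next grouped
# convolution sees channels coming from every group.
def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)

x = torch.arange(8).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]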

V2

Design principles for more efficient CNN network structures:

Keeping the number of input channels equal to the number of output channels minimizes memory access costs.

Excessive grouping in grouped convolutions increases memory access costs.

Overly fragmented network structures (too many branches and small basic units) reduce the network's degree of parallelism.

Element-wise operations (such as ReLU and tensor addition) also contribute non-negligible overhead.

14. SENet

SENet introduces the squeeze-and-excitation (SE) block mentioned above: global average pooling squeezes each channel to a single value, two small fully connected layers produce per-channel weights, and the feature-map channels are rescaled by these weights, adding channel attention at very low cost.

15. SKNet

SKNet (Selective Kernel Networks) extends the SE idea: each block runs convolutions with different kernel sizes in parallel and uses an attention mechanism to let the network adaptively select among the resulting receptive fields.
