Comprehensive Overview of CNN Architecture Development


Author: zzq Editor: Jishi Platform
Source: https://zhuanlan.zhihu.com/p/68411179
For academic sharing only; does not represent the position of this public account. Contact for removal if infringing.

Introduction to Basic Components of CNN

1. Local Receptive Field
In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to respond to a local region, and these local responses are combined at higher layers to obtain global information. The convolution operation realizes the local receptive field, and because the kernel's weights are shared across positions, it also reduces the number of parameters.
2. Pooling
Pooling reduces the spatial size of the feature maps, discarding pixel-level detail while retaining the important information, mainly to reduce computation. The two main types are max pooling and average pooling.
3. Activation Function
The activation function introduces non-linearity. Common activation functions include sigmoid, tanh, and ReLU; the first two are typically used in fully connected layers, while ReLU is common in convolutional layers.
4. Fully Connected Layer
The fully connected layer acts as the classifier of the convolutional neural network. The feature maps must be flattened into a vector before being fed into the fully connected layer.
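To make the four components above concrete, here is a minimal PyTorch sketch (the layer sizes and channel counts are illustrative assumptions, not taken from the text) chaining a convolution (local receptive field with weight sharing), ReLU, max pooling, flattening, and a fully connected classifier:

```python
import torch
import torch.nn as nn

# Minimal CNN illustrating the four basic components; sizes are illustrative.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # local receptive field, shared weights
            nn.ReLU(inplace=True),                       # non-linearity
            nn.MaxPool2d(kernel_size=2, stride=2),       # pooling: halves the spatial size
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected classifier

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)   # flatten before the fully connected layer
        return self.classifier(x)

x = torch.randn(1, 3, 32, 32)     # a single 32x32 RGB image
print(TinyCNN()(x).shape)         # torch.Size([1, 10])
```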

Classic Network Architectures

1. LeNet5
Consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolution kernels are all 5×5 with stride=1, and the pooling layers use max pooling.
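A sketch of the LeNet-5 layout described above (two 5×5 convolutions, two pooling layers, two fully connected layers); the 6/16 channel counts and the 120-unit hidden layer follow the common configuration and are assumptions here, as is the 32×32 single-channel input:

```python
import torch
import torch.nn as nn

# LeNet-5-style sketch: 5x5 convolutions with stride 1, pooling after each,
# then two fully connected layers.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.ReLU(),   # 32x32 -> 28x28
    nn.MaxPool2d(2),                                        # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.ReLU(),   # 14x14 -> 10x10
    nn.MaxPool2d(2),                                         # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),                   # fully connected layer 1
    nn.Linear(120, 10),                                      # fully connected layer 2 (output)
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```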
2. AlexNet
The model has eight layers (excluding the input layer), including five convolutional layers and three fully connected layers. The last layer uses softmax for classification output.
AlexNet uses ReLU as the activation function, dropout and data augmentation to prevent overfitting, a dual-GPU implementation, and local response normalization (LRN).
3. VGG
Uses stacks of 3×3 convolution kernels to obtain the receptive field of a larger kernel while making the network deeper. VGG has five stages of convolutions, each followed by a max-pooling layer, and the number of convolution kernels increases stage by stage.
Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions are useful but not as effective as 3×3 convolutions.
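The receptive-field argument can be seen in one VGG-style stage: two stacked 3×3 convolutions cover a 5×5 region (three cover 7×7) with fewer parameters than a single large kernel. A sketch, with illustrative channel counts:

```python
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """One VGG-style stage: a stack of 3x3 convolutions followed by 2x2 max pooling.
    Two stacked 3x3 convolutions see a 5x5 receptive field, three see 7x7,
    with fewer parameters than a single large kernel."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

stage = vgg_stage(64, 128, num_convs=2)  # e.g. the second of VGG's five stages
```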
4. GoogLeNet (Inception v1)
From VGG, we learned that deeper networks yield better results. However, as the model gets deeper, the number of parameters increases, making the network prone to overfitting, requiring more training data; additionally, complex networks mean more computation, larger model storage, and slower speeds. GoogLeNet was designed to reduce parameters.
GoogLeNet increases network capacity by widening rather than only deepening: each Inception module runs several kernel sizes in parallel and lets the network learn which to rely on. This reduces parameters while improving adaptability to features at different scales, and the 1×1 convolutions reduce channel dimensionality so depth and width can grow without a large increase in parameters.
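A sketch of a single Inception module showing the parallel kernel sizes and the 1×1 reductions described above; the branch widths below follow the published GoogLeNet "3a" block and are an assumption for illustration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception-v1 style module: parallel 1x1, 3x3, 5x5 and pooling branches,
    concatenated along the channel dimension. The 1x1 convolutions reduce the
    channel count before the expensive 3x3/5x5 convolutions."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = InceptionModule(192, 64, 96, 128, 16, 32, 32)          # "3a"-like branch widths
print(m(torch.randn(1, 192, 28, 28)).shape)                # torch.Size([1, 256, 28, 28])
```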
Inception-v2
Building on v1, batch normalization (BN) was added; in the TensorFlow implementation, applying BN before the activation function works better. A 5×5 convolution is replaced with two consecutive 3×3 convolutions, which deepens the network while reducing parameters.
Inception-v3
The core idea is to decompose convolution kernels into smaller convolutions, such as decomposing 7×7 into two kernels of 1×7 and 7×1, reducing network parameters and increasing depth.
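The two factorizations described for v2 and v3 can be sketched directly: two stacked 3×3 convolutions replace a 5×5 kernel, and a 1×7 followed by a 7×1 convolution replaces a 7×7 kernel, keeping the receptive field while cutting parameters. The channel count below is illustrative:

```python
import torch.nn as nn

C = 64  # illustrative channel count

# 5x5 receptive field from two stacked 3x3 convolutions (Inception-v2 idea),
# with BN placed before the activation.
five_by_five_equiv = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1), nn.BatchNorm2d(C), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, 3, padding=1), nn.BatchNorm2d(C), nn.ReLU(inplace=True),
)

# 7x7 receptive field from a 1x7 followed by a 7x1 convolution (Inception-v3 idea).
seven_by_seven_equiv = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, 7), padding=(0, 3)), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(7, 1), padding=(3, 0)), nn.ReLU(inplace=True),
)

# Weight counts per layer pair: 5x5 -> 25*C*C vs. 2 * 9*C*C; 7x7 -> 49*C*C vs. 2 * 7*C*C.
```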
Inception-v4
Introduced residual connections (Inception-ResNet) to accelerate training and improve performance. However, when the number of filters is very large (>1000), training becomes unstable; scaling the residual activations by a small factor alleviates this.
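The activation scaling mentioned above simply damps the residual branch before it is added back onto the shortcut; a minimal sketch (the 0.1 factor is an illustrative value in the small range suggested by the Inception-v4 paper):

```python
import torch

def scaled_residual_add(shortcut, residual, scale=0.1):
    """Scale the residual branch before adding it to the shortcut, which
    stabilizes training when the number of filters is very large."""
    return shortcut + scale * residual

out = scaled_residual_add(torch.randn(1, 1024, 8, 8), torch.randn(1, 1024, 8, 8))
```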
5. Xception
Proposed as an improvement on Inception-v3, its basic building block is the depthwise separable convolution, though with some differences. Its parameter count is roughly the same as Inception-v3's, yet its accuracy is higher. Xception applies the 1×1 convolution first and then the 3×3 convolution, i.e. it merges channels before the spatial convolution; the standard depthwise separable convolution does the opposite, performing a 3×3 spatial (depthwise) convolution first and then a 1×1 channel (pointwise) convolution. The core idea is to decouple channel correlations from spatial correlations. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU. With a similar parameter budget, Xception spends the savings on extra width to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
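The ordering difference described above can be seen directly in code. This sketch contrasts the standard depthwise separable block (3×3 depthwise, then 1×1 pointwise, as in MobileNet-v1) with the Xception-style ordering (1×1 pointwise first, then 3×3 spatial); channel counts are illustrative:

```python
import torch.nn as nn

in_ch, out_ch = 64, 128  # illustrative channel counts

# MobileNet-v1 / "depthwise" order: 3x3 spatial conv per channel, then 1x1 across channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),     # depthwise (spatial)
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                             # pointwise (channels)
)

# Xception-style order: 1x1 across channels first, then 3x3 spatial conv per channel.
xception_style = nn.Sequential(
    nn.Conv2d(in_ch, out_ch, kernel_size=1),                             # pointwise (channels)
    nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch),  # depthwise (spatial)
)
```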
6. MobileNet Series
V1
Uses depthwise separable convolutions and replaces pooling layers with stride=2 convolutions. A standard convolution kernel has as many channels as the input feature map, whereas each depthwise kernel has only one channel. Two hyper-parameters control the model size: the width multiplier α scales the number of input and output channels, and the resolution multiplier ρ scales the input (feature map) resolution.
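A sketch of one MobileNet-v1 depthwise separable block with BN and ReLU, plus the two hyper-parameters mentioned above; the channel counts, α=1.0, and ρ=0.5 values are illustrative assumptions:

```python
import torch
import torch.nn as nn

def mobilenet_v1_block(in_ch, out_ch, stride=1, alpha=1.0):
    """Depthwise separable block: 3x3 depthwise conv (one filter per channel)
    followed by a 1x1 pointwise conv, each with BN + ReLU. The width multiplier
    alpha thins every layer; stride=2 replaces pooling for downsampling."""
    in_ch, out_ch = int(in_ch * alpha), int(out_ch * alpha)
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

rho = 0.5                                            # resolution multiplier: 224 -> 112 input
x = torch.randn(1, 32, int(224 * rho), int(224 * rho))
block = mobilenet_v1_block(32, 64, stride=2, alpha=1.0)  # alpha=1.0 keeps the 32 input channels
print(block(x).shape)                                # torch.Size([1, 64, 56, 56])
```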
V2
Compared with v1, there are three differences: 1. residual structures are introduced; 2. a 1×1 convolution is applied before the depthwise (dw) convolution to increase the number of channels, the opposite of an ordinary residual block; 3. the ReLU after the pointwise convolution is replaced with a linear activation so that ReLU does not damage the features. The reasoning is that the features a dw layer can extract are limited by its number of input channels; with an ordinary compress-then-convolve residual block, the dw layer would have too few channels to work with, so expanding first and compressing afterwards works better. But once the expand-convolve-compress order is used, applying ReLU after the compression would discard information from the already-compressed features, so a linear activation is used instead.
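A sketch of the inverted residual block described above: 1×1 expansion, 3×3 depthwise convolution, then 1×1 projection with a linear (no ReLU) output, and a shortcut only when the block keeps the resolution and channel count. The expansion factor of 6 follows the paper and is an assumption here:

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNet-v2 block: expand with 1x1 -> 3x3 depthwise -> project with 1x1.
    The projection has no ReLU (linear bottleneck) so the compressed features
    are not damaged; the residual shortcut is added when shapes match."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),  # linear, no ReLU
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out

print(InvertedResidual(32, 32)(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```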
V3
A combination of complementary search techniques: hardware-aware NAS performs the module-level (block-wise) search, while NetAdapt performs a local, layer-wise search. Network structure improvements: the final average pooling layer is moved earlier and the last convolution layer is removed, the h-swish activation function is introduced, and the initial set of filters is modified.
V3 combines the depthwise separable convolutions of v1, the inverted residual structure with linear bottleneck of v2, and the lightweight SE (Squeeze-and-Excitation) attention module.
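The h-swish activation introduced in v3 replaces the sigmoid inside swish with a piecewise-linear ReLU6 approximation, which is cheaper on mobile hardware; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, a cheap approximation of swish (x * sigmoid(x))."""
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-5, 5, steps=11)
print(h_swish(x))
```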
7. EffNet
EffNet is an improvement on MobileNet-v1. Its main idea is to split the depthwise (dw) layer of MobileNet-v1 into a pair of 1×3 and 3×1 dw layers, with pooling applied after the first layer so the second layer operates on a smaller map and requires less computation. EffNet is smaller and more efficient than MobileNet-v1 and ShuffleNet-v1.
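A sketch of the separable-spatial idea just described: the 3×3 depthwise convolution is split into a 1×3 and a 3×1 depthwise convolution with a pooling step in between, so the second spatial convolution runs on a smaller feature map. The channel count and the 1×2 pooling shape are illustrative assumptions:

```python
import torch
import torch.nn as nn

ch = 64  # illustrative channel count
effnet_like_spatial = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1), groups=ch),  # 1x3 depthwise
    nn.MaxPool2d(kernel_size=(1, 2), stride=(1, 2)),                   # pool one dimension early
    nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0), groups=ch),  # 3x1 depthwise on the smaller map
)
print(effnet_like_spatial(torch.randn(1, ch, 32, 32)).shape)           # torch.Size([1, 64, 32, 16])
```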
8. EfficientNet
Studies how to scale a network's depth, width, and input resolution, and how the three interact, in order to achieve higher efficiency and accuracy.
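EfficientNet ties the three factors together with compound scaling: a single coefficient φ scales depth by α^φ, width by β^φ, and resolution by γ^φ, with α·β²·γ² ≈ 2. The base constants below follow the published grid-searched values for the B0 baseline and are assumed here for illustration:

```python
# Compound scaling: one coefficient phi scales depth, width, and resolution together.
alpha, beta, gamma = 1.2, 1.1, 1.15   # subject to alpha * beta**2 * gamma**2 ≈ 2

def scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    depth = base_depth * alpha ** phi            # more layers
    width = base_width * beta ** phi             # more channels
    resolution = base_resolution * gamma ** phi  # larger input images
    return depth, width, resolution

for phi in range(4):
    d, w, r = scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution ~{r:.0f}")
```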
9. ResNet
VGG proved that increasing the depth of networks is an effective means to improve accuracy, but deeper networks are prone to gradient vanishing, causing the network to fail to converge. Tests show that beyond 20 layers, convergence deteriorates as the number of layers increases. ResNet effectively addresses the gradient vanishing problem (actually alleviates it, rather than truly solving it) by adding shortcut connections.
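A sketch of the shortcut connection: the block learns a residual F(x) and outputs F(x) + x, so gradients can flow through the identity path. This is a basic two-convolution block; the 1×1 projection on the shortcut when shapes change follows common practice and is an assumption here:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """ResNet basic block: two 3x3 convolutions plus an identity shortcut.
    The output is F(x) + x, which lets gradients flow through the skip path."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection on the shortcut when the spatial size or channel count changes.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        return torch.relu(self.conv(x) + self.shortcut(x))

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```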
10. ResNeXt
Combines ResNet with Inception's split+transform+concatenate strategy. Its performance surpasses ResNet, Inception, and Inception-ResNet, and it can be implemented with group convolutions. In general, there are three ways to increase a network's expressive power: 1. increase depth, as in the progression from AlexNet to ResNet, although experiments show the gains from added depth diminish; 2. increase the width of the network modules, but this quickly inflates the parameter count and is not a mainstream CNN design; 3. improve the network's structural design, as in the Inception family and ResNeXt. Experiments show that increasing the cardinality, i.e. the number of identical parallel branches in a block, is a more effective way to improve model expressiveness.
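The cardinality idea is usually implemented with grouped convolution: the 3×3 convolution inside a bottleneck block is split into, say, 32 parallel groups. A sketch, with widths following the common 32×4d configuration (an assumption here):

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck: 1x1 reduce -> 3x3 grouped conv (cardinality groups) -> 1x1 expand,
    plus a residual shortcut. Raising the number of groups (cardinality) adds parallel
    transformation branches without blowing up width or depth."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.block(x))

print(ResNeXtBlock()(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```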
11. DenseNet
DenseNet significantly reduces the number of parameters through feature reuse and alleviates the gradient vanishing problem to some extent.
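Feature reuse in DenseNet comes from concatenation: each layer receives the feature maps of all earlier layers in the block and adds a small number of new channels. A minimal dense-block sketch; the growth rate and layer count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as input
    and contributes `growth_rate` new channels (feature reuse)."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            ch = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

print(DenseBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 192, 32, 32])
```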
12. SqueezeNet
Introduced the fire module: a squeeze layer plus an expand layer. The squeeze layer consists of 1×1 convolutions, while the expand layer uses both 1×1 and 3×3 convolutions whose outputs are concatenated. SqueezeNet has about 1/50 the parameters of AlexNet, and about 1/510 after compression, while achieving comparable accuracy.
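A sketch of the fire module described above: a 1×1 squeeze layer followed by parallel 1×1 and 3×3 expand layers whose outputs are concatenated. The channel counts follow the paper's early fire modules and are assumptions here:

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """SqueezeNet fire module: a 1x1 "squeeze" conv reduces channels, then parallel
    1x1 and 3x3 "expand" convs are applied and concatenated along the channel axis."""
    def __init__(self, in_ch, squeeze_ch, expand1_ch, expand3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand1_ch, 1), nn.ReLU(inplace=True))
        self.expand3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand3_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)

print(FireModule(96, 16, 64, 64)(torch.randn(1, 96, 55, 55)).shape)  # torch.Size([1, 128, 55, 55])
```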
13. ShuffleNet Series
V1
Reduces computation with grouped convolutions and 1×1 pointwise group convolutions, and enriches cross-channel information by shuffling (rearranging) the channels. Xception and ResNeXt are inefficient in small network models because their many dense 1×1 convolutions are expensive, which motivated pointwise group convolutions to cut the computational cost. However, grouping the 1×1 convolutions blocks information flow between groups, which is why channel shuffle is introduced. In addition, although depthwise convolution reduces computation and parameters, on low-power devices it is less friendly to computation and memory access than dense operations, so ShuffleNet applies depthwise convolution only on the bottleneck feature maps to keep the overhead low.
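The channel shuffle operation mentioned above is just a reshape, transpose, reshape that interleaves channels from different groups so information can cross group boundaries; a minimal sketch:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups: reshape to (N, g, C/g, H, W), swap the
    group and per-group channel axes, then flatten back to (N, C, H, W)."""
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)    # channels 0..7 in two groups of four
print(channel_shuffle(x, groups=2).view(-1))     # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```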
V2
Design principles for making neural networks more efficient:
Keep the number of input and output channels equal to minimize memory access costs.
Using too many groups in group convolutions increases memory access costs.
Overly complex network structures (too many branches and basic units) reduce network parallelism.
The cost of element-wise operations (such as ReLU and tensor addition) should not be ignored.
14. SENet
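SENet's core is the Squeeze-and-Excitation block already mentioned as the lightweight attention module in MobileNet-v3: global average pooling squeezes each channel to a scalar, two fully connected layers produce per-channel weights, and the input is rescaled by them. A sketch; the reduction ratio of 16 follows the paper and is an assumption here:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: squeeze spatial dims with global average pooling,
    excite with a two-layer bottleneck MLP, then rescale each channel by the result."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.size()
        w = self.fc(x.mean(dim=(2, 3)))   # (N, C) channel weights in [0, 1]
        return x * w.view(n, c, 1, 1)     # reweight each channel

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```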
15. SKNet