Comprehensive Overview of CNN Architecture Development

Author: zzq
Source: https://zhuanlan.zhihu.com/p/68411179
Editor: Jishi Platform

Introduction to Basic Components of CNN

1. Local Receptive Field
In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to sense local information, which higher layers then combine into global information. The convolution operation implements this local receptive field, and weight sharing greatly reduces the number of parameters.
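As a rough illustration of how weight sharing cuts the parameter count, the sketch below compares a small convolution with a fully connected layer producing an output of the same size (the 32×32 input size and channel counts are arbitrary choices for the comparison):

```python
import torch.nn as nn

# A 3x3 convolution maps a 1-channel 32x32 image to 8 feature maps with only
# 8 * (3*3*1) + 8 = 80 parameters, because the same kernel slides over every
# spatial position (weight sharing over a local receptive field).
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

# A fully connected layer producing an output of the same size would need
# (32*32*1) * (32*32*8) + 32*32*8 = ~8.4 million parameters.
fc = nn.Linear(32 * 32 * 1, 32 * 32 * 8)

print(sum(p.numel() for p in conv.parameters()))  # 80
print(sum(p.numel() for p in fc.parameters()))    # 8396800
```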
2. Pooling
Pooling reduces the spatial size of the feature maps, discarding fine pixel detail while retaining the important information, primarily to reduce computational load. The two main variants are max pooling and average pooling.
3. Activation Function
The activation function is used to introduce non-linearity. Common activation functions include sigmoid, tanh, and ReLU; the first two are commonly used in fully connected layers, while ReLU is often found in convolutional layers.
4. Fully Connected Layer
The fully connected layer acts as a classifier within the entire convolutional neural network. The outputs from previous layers must be flattened before entering the fully connected layer.
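Putting the four components together, here is a minimal PyTorch sketch of the conv → activation → pooling → flatten → fully connected pipeline (shapes and channel counts are illustrative only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # batch of one 3-channel 32x32 image

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
relu = nn.ReLU()                         # non-linearity after the convolution
pool = nn.MaxPool2d(kernel_size=2)       # halves spatial size: 32x32 -> 16x16
fc   = nn.Linear(16 * 16 * 16, 10)       # classifier over 10 classes

h = pool(relu(conv(x)))                  # (1, 16, 16, 16)
h = h.flatten(start_dim=1)               # flatten before the fully connected layer
logits = fc(h)                           # (1, 10)
print(logits.shape)
```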

Classic Network Structures

1. LeNet5
Consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolutional kernel is 5×5, with a stride of 1, and the pooling layer uses max pooling.
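A minimal PyTorch sketch of this layout, following the description above (5×5 kernels, two pooling layers, two fully connected layers); the 32×32 grayscale input and the 120-unit hidden layer are illustrative choices:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Two 5x5 convolutions, two pooling layers, two fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```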
2. AlexNet
The model has eight layers (excluding the input layer), including five convolutional layers and three fully connected layers. The last layer uses softmax for classification output.
AlexNet uses ReLU as the activation function; dropout and data augmentation are used to prevent overfitting; training was split across two GPUs; and local response normalization (LRN) is employed.
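For reference, a sketch of this 5-conv + 3-FC layout, assuming a 224×224 RGB input; the channel counts follow the common single-GPU variant rather than the original two-GPU split, and the final softmax is left to the loss function:

```python
import torch
import torch.nn as nn

alexnet_sketch = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),            # conv1 + LRN
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),            # conv2 + LRN
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(inplace=True),      # conv3
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),      # conv4
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),      # conv5
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # fc6
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # fc7
    nn.Linear(4096, 1000),                                               # fc8 (softmax applied in the loss)
)

print(alexnet_sketch(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```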
3. VGG
Uses stacks of 3×3 convolutional kernels to obtain the receptive field of a larger kernel with fewer parameters, and builds a deeper network. VGG has five convolutional segments, each followed by a max pooling layer, and the number of convolutional kernels increases from segment to segment.
Summary: LRN has little effect; deeper networks yield better results; 1×1 convolutions are effective but not as good as 3×3 ones.
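The stacked-3×3 idea can be sketched as a reusable block: two stacked 3×3 convolutions cover a 5×5 receptive field with fewer parameters (channel counts here are illustrative):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A VGG-style segment: several 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convs on C channels: 2 * (3*3*C*C) = 18C^2 weights,
# versus 5*5*C*C = 25C^2 for a single 5x5 conv with the same receptive field.
block = vgg_block(64, 128, num_convs=2)
```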
4. GoogLeNet (Inception v1)
From VGG, we learned that deeper networks yield better results. However, as the model deepens, the number of parameters increases, leading to a higher chance of overfitting and requiring more training data; additionally, complex networks mean more computation, larger model storage, and slower speeds. GoogLeNet was designed to reduce parameters.
GoogLeNet instead increases capacity by widening the network: each Inception module runs 1×1, 3×3, and 5×5 convolutions (plus pooling) in parallel and concatenates the results, letting the network learn which kernel sizes to rely on. This design reduces parameters while improving adaptability to multiple scales, and the 1×1 convolutions reduce channel dimensionality so that complexity can grow without a large increase in parameters.
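A minimal sketch of an Inception-v1 module along these lines; the 1×1 convolutions shrink the channel count before the expensive 3×3 and 5×5 branches, and the branch widths below are just one illustrative configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```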
Inception-v2
Building on v1, batch normalization was added (placing BN before the activation function works better in practice), and 5×5 convolutions were replaced with two consecutive 3×3 convolutions, deepening the network while reducing parameters.
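A sketch of that factorization: two 3×3 conv-BN-ReLU blocks (BN placed before the activation) replace one 5×5 convolution while covering the same receptive field:

```python
import torch.nn as nn

def factorized_5x5(in_ch, out_ch):
    """Two stacked 3x3 conv-BN-ReLU blocks covering a 5x5 receptive field."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),   # BN before the activation
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```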
Inception-v3
The core idea is to decompose larger convolutions, such as 7×7, into two convolutions of 1×7 and 7×1, reducing parameters while increasing depth.
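The asymmetric factorization can be sketched the same way: a 1×7 convolution followed by a 7×1 convolution uses 14 weights per channel pair instead of 49, at the cost of extra depth:

```python
import torch.nn as nn

def factorized_7x7(in_ch, out_ch):
    """Replace a 7x7 convolution with a 1x7 followed by a 7x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
        nn.ReLU(inplace=True),
    )
```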
Inception-v4 Structure
Introduced residual connections (as in ResNet) to accelerate training and improve performance. However, when the number of filters exceeds 1000, training becomes unstable; scaling the residual activations by a small factor alleviates this.
5. Xception
Proposed on the basis of Inception-v3, its basic idea is depthwise separable convolution, but with a difference in ordering. The model slightly reduces parameters while achieving higher accuracy. Xception performs the 1×1 convolution first and then the 3×3 convolution, i.e., it merges channels before the spatial convolution; standard depthwise separable convolution does the opposite, performing the spatial 3×3 convolution first and the channel-wise 1×1 convolution afterwards. The core idea is to decouple channel correlations from spatial correlations. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU between the two steps. Xception's parameter count is similar to Inception-v3's; it increases network width to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
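The ordering difference can be sketched like this (channel counts illustrative): the Xception-style block does the 1×1 pointwise convolution first, while the MobileNet-style depthwise separable block does the 3×3 depthwise convolution first:

```python
import torch.nn as nn

def xception_style(in_ch, out_ch):
    """1x1 pointwise first, then a 3x3 convolution applied per channel (groups=out_ch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),                              # merge channels
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),   # spatial, per channel
    )

def depthwise_separable(in_ch, out_ch):
    """3x3 depthwise first (groups=in_ch), then a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),      # spatial, per channel
        nn.Conv2d(in_ch, out_ch, 1),                              # merge channels
    )
```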
6. MobileNet Series
V1
Uses depthwise separable convolutions and eliminates pooling layers, using stride-2 convolutions for downsampling instead. A standard convolutional kernel spans all input feature map channels, whereas each depthwise kernel has a channel count of 1 and operates on a single channel. Two hyperparameters, a width multiplier and a resolution multiplier, control the channel counts and the input image resolution.
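A sketch of the MobileNet-v1 building block: a depthwise 3×3 convolution followed by a pointwise 1×1 convolution, each with BN and ReLU, with stride-2 in the depthwise step standing in for pooling:

```python
import torch.nn as nn

def mobilenet_v1_block(in_ch, out_ch, stride=1):
    """Depthwise 3x3 + pointwise 1x1, each followed by BN and ReLU.

    stride=2 in the depthwise conv replaces a pooling layer for downsampling;
    the width/resolution multipliers of the paper would simply scale the channel
    counts and the input image size fed into blocks like this one.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```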
V2
Has three differences compared to v1: 1. it introduces residual structures; 2. it performs a 1×1 expansion convolution before the depthwise convolution to increase the number of feature map channels, which differs from traditional residual blocks; 3. after the final pointwise convolution, ReLU is discarded in favor of a linear activation, to keep ReLU from destroying the features. The reasoning: the features a depthwise layer can extract are limited by the number of input channels, so if a traditional residual block (which compresses channels first) were used, the depthwise layer would extract even fewer features; hence the block expands first. But with this expand–convolve–compress pattern a new problem arises: applying ReLU to the already-compressed output damages the low-dimensional features, so a linear activation is used there instead.
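A sketch of the inverted residual block described above: expand with 1×1, depthwise 3×3, then a linear 1×1 projection with no ReLU; the shortcut is used only when the input and output shapes match:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                    # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                # linear projection
            nn.BatchNorm2d(out_ch),                                  # no ReLU here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```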
V3
Combines complementary search techniques: NAS searches for the overall module structure under resource constraints, while NetAdapt performs a local, layer-wise search. Structural improvements include moving the final average pooling layer forward and removing the last convolutional layer, introducing the h-swish activation function, and modifying the initial filter set.
V3 integrates the depthwise separable convolutions of v1, the linear-bottleneck inverted residual structure of v2, and lightweight SE (squeeze-and-excitation) attention modules.
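The h-swish activation mentioned above replaces the sigmoid in swish with a piecewise-linear approximation, h-swish(x) = x · ReLU6(x + 3) / 6, which is cheaper to compute and quantization-friendly; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    """h-swish(x) = x * ReLU6(x + 3) / 6, a hard approximation of x * sigmoid(x)."""
    return x * F.relu6(x + 3.0) / 6.0

print(h_swish(torch.tensor([-4.0, 0.0, 4.0])))  # tensor([-0., 0., 4.])
```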
7. EffNet
EffNet is an improvement on MobileNet-v1. The main idea is to decompose MobileNet-v1's 3×3 depthwise layer into a 1×3 and a 3×1 depthwise layer, with pooling applied after the first of the two so that the second operates on a smaller feature map, reducing computation. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.
8. EfficientNet
Studies how to scale a network in depth, width, and input resolution jointly, and how the three factors interact, in order to achieve higher efficiency and accuracy.
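The compound scaling rule balances the three factors with a single coefficient φ: depth scales as α^φ, width as β^φ, and resolution as γ^φ, with α·β²·γ² ≈ 2 so that each unit increase in φ roughly doubles the FLOPs. A small sketch, using the commonly cited base coefficients α = 1.2, β = 1.1, γ = 1.15 (stated here from memory of the paper):

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for a given compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# Scaling up from the baseline: alpha * beta**2 * gamma**2 is constrained to be ~2,
# so each step of phi roughly doubles the computational cost.
for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```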
9. ResNet
VGG showed that increasing network depth is an effective way to improve accuracy, but deeper networks are prone to vanishing gradients and poor convergence. Tests show that beyond roughly 20 layers, convergence worsens as depth increases. ResNet adds shortcut connections, which effectively address the vanishing gradient problem (more precisely, alleviate rather than truly solve it).
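A sketch of the basic residual block with its shortcut connection; the 1×1 projection on the shortcut is only needed when the shape changes:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut, or a 1x1 projection when the shape changes.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```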
10. ResNeXt
Built on ResNet and Inception, combining the split-transform-merge strategy with residual connections; it outperforms ResNet, Inception, and Inception-ResNet, and can be implemented with grouped convolutions. In general there are three ways to increase a network's expressiveness: 1. increase depth, as from AlexNet to ResNet, though experiments show the gains from added depth diminish; 2. increase the width of network modules, but this causes a rapid growth in parameter count and is not a mainstream CNN design direction; 3. improve the architecture design itself, as in the Inception series and ResNeXt. Experiments show that increasing cardinality, i.e. the number of identical branches in a block, improves model expressiveness more effectively.
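Increasing cardinality is usually implemented with grouped convolution, as in this sketch of a ResNeXt-style bottleneck (cardinality = number of groups; channel widths are illustrative, and the residual add is omitted):

```python
import torch.nn as nn

def resnext_bottleneck(in_ch, mid_ch, out_ch, cardinality=32):
    """1x1 reduce -> 3x3 grouped conv (one group per branch) -> 1x1 expand."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 1, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=cardinality, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

block = resnext_bottleneck(256, 128, 256, cardinality=32)
```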
11. DenseNet
DenseNet significantly reduces the number of parameters through feature reuse: each layer receives the concatenated feature maps of all preceding layers. This also alleviates the gradient vanishing problem to some extent.
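Feature reuse means each layer sees the concatenation of all previous feature maps; a minimal sketch of a dense block (the growth rate and layer count are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all earlier feature maps.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```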
12. SqueezeNet
Proposed the fire module: a squeeze layer followed by an expand layer. The squeeze layer uses 1×1 convolutions; the expand layer uses parallel 1×1 and 3×3 convolutions whose outputs are concatenated. SqueezeNet has 1/50 the parameters of AlexNet (1/510 after compression) with comparable accuracy.
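A sketch of the fire module: squeeze down to a few channels with 1×1 convolutions, then expand with parallel 1×1 and 3×3 branches that are concatenated (channel counts illustrative):

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1_ch, expand3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand1_ch, 1), nn.ReLU(inplace=True))
        self.expand3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand3_ch, 3, padding=1),
                                     nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)

out = FireModule(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)  # torch.Size([1, 128, 55, 55])
```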
13. ShuffleNet Series
V1
Reduces computation by replacing dense 1×1 convolutions with 1×1 pointwise group convolutions, and uses channel shuffle to enrich each channel with information from other groups. Xception and ResNeXt are less efficient in small models because their dense 1×1 convolutions are resource-intensive, so pointwise group convolutions were proposed to reduce computational complexity; since group convolutions block information exchange between groups, channel shuffle is introduced to help information flow across channels. Depthwise convolution reduces computation and parameter count, but is less efficient than dense operations on low-power devices, so ShuffleNet applies it only on the bottleneck feature maps to minimize overhead.
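The channel shuffle operation itself is just a reshape and transpose that interleaves channels from different groups; a minimal sketch:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: split into groups, swap axes, flatten back."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back to a single channel axis

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```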
V2
Design principles for making neural networks more efficient:
Maintaining equal input and output channel counts minimizes memory access costs.
Excessive grouping in grouped convolutions increases memory access costs.
Overly complex network structures (too many branches and basic units) reduce network parallelism.
Element-wise operations also consume significant resources.
14. SENet
SENet introduces the squeeze-and-excitation (SE) block: feature maps are squeezed by global average pooling into a channel descriptor, passed through two small fully connected layers, and the resulting per-channel weights are used to recalibrate the feature maps, improving accuracy at very small extra cost.
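A sketch of a squeeze-and-excitation block: global average pooling, two fully connected layers with a reduction ratio, sigmoid gating, then channel-wise rescaling:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # recalibrate channel responses

print(SEBlock(64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```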
15. SKNet
SKNet (Selective Kernel Networks) extends this idea: each unit contains branches with different kernel sizes, and a softmax attention mechanism fuses them, letting neurons adaptively adjust their receptive field size according to the input.
