
Introduction
In computer vision (CV), the most basic knowledge to master is the family of Convolutional Neural Network (CNN) architectures. Whether we work on image classification, segmentation, or object detection, or even apply convolutions in NLP, these basic CNN architectures come up again and again.
Since AlexNet appeared in 2012, followed by VGG in 2014 and ResNet in 2015, CNNs have dominated the field. Network models have grown deeper, and deeper networks have been shown to fit data better. However, parameter counts and computational costs grow rapidly with depth, which hinders deployment of the technology.
As a result, lightweight network structures gradually emerged, such as the MobileNet series, the ShuffleNet series, ResNeXt, DenseNet, and EfficientNet. They borrow from each other's strengths, reducing parameter counts or computational load while reaching higher classification accuracy, and have therefore attracted much attention. Below we walk through these CNN architectures and their advantages and disadvantages.
AlexNet (2012)
1. Introduced the ReLU non-linear activation function, enhancing the model's non-linear expression capability, which became a standard for subsequent convolutional layers.
2. The dropout layer prevents overfitting, becoming a standard for subsequent fully connected layers.
3. Data augmentation is used to reduce overfitting.
4. Introduced Local Response Normalization (LRN) layers: each activation is normalized by the responses of its neighboring channels, which relatively amplifies neurons that respond strongly to a class and suppresses those that respond weakly (a form of lateral inhibition).
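As a sketch of how LRN works, here is a minimal NumPy implementation across channels, using the hyperparameters reported in the AlexNet paper (n=5, k=2, alpha=1e-4, beta=0.75); the function name and the (C, H, W) array layout are illustrative choices, not from the original post:

```python
import numpy as np

def local_response_norm(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """LRN across channels for a feature map x of shape (C, H, W)."""
    c = x.shape[0]
    out = np.empty_like(x)
    for i in range(c):
        # Each channel is normalized by a window of up to n neighboring channels.
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(x[lo:hi] ** 2, axis=0)) ** beta
        out[i] = x[i] / denom
    return out
```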
VGG (2014)
The main innovations of the paper are:
1. Replaced large 5x5 and 7x7 convolution kernels with stacks of small 3x3 kernels, which cover the same receptive field with fewer parameters and extra non-linearities.
2. Deepened the network based on AlexNet, proving that deeper networks can better extract features.
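The saving from stacking 3x3 kernels can be checked with a quick back-of-the-envelope calculation (the channel count 256 is just an illustrative choice):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer, bias omitted."""
    return k * k * c_in * c_out

C = 256
# Two stacked 3x3 layers see the same 5x5 receptive field as one 5x5 layer;
# three stacked 3x3 layers see the same 7x7 field as one 7x7 layer.
print(2 * conv_params(3, C, C) / conv_params(5, C, C))  # 18/25 = 0.72
print(3 * conv_params(3, C, C) / conv_params(7, C, C))  # 27/49, about 0.55
```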
GoogLeNet (2014)
The network structure considers not only depth but also width: each stage concatenates parallel branches of different kernel sizes, a building block called the Inception module.
1. 1x1 convolutions are introduced primarily for dimensionality reduction; each is followed by a ReLU, so they also add non-linearity.
2. The network finally uses average pooling to replace fully connected layers.
The later Inception v2/v3 extend the approach of v1, with two main goals:
1. Reducing the number of parameters and computational load.
2. Deepening the network, enhancing the network's non-linear expression capability.
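The effect of the 1x1 bottleneck in v1 can be illustrated with a concrete branch: a 5x5 convolution from 192 input channels to 32 outputs, with or without first reducing to 16 channels (these particular channel sizes are an illustrative assumption):

```python
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

direct = conv_params(5, 192, 32)                            # 153,600 weights
reduced = conv_params(1, 192, 16) + conv_params(5, 16, 32)  # 3,072 + 12,800 = 15,872
print(direct, reduced)  # the 1x1 reduction cuts this branch almost 10x
```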
ResNet (2015)
Problems:
1. The first issue brought by increasing depth is gradient explosion/vanishing: as layers are added, the gradients during backpropagation become unstable, growing very large or very small, with vanishing gradients the more common case. Many remedies have been proposed, such as BatchNorm, ReLU activations, and Xavier initialization, so gradient vanishing is now largely under control.
2. The second issue is network degradation: even with those fixes, simply stacking more layers makes training accuracy saturate and then drop.
Given this degradation, instead of demanding that extra depth improve accuracy, can we at least let a deep network match a shallow one? The extra layers would then only need to learn an identity mapping. Based on this idea, the authors proposed the residual module, which makes identity mappings easy to represent.
Design characteristics of ResNet:
1. Core unit modularization allows for simple stacking.
2. Shortcut connections alleviate gradient vanishing and the degradation problem in deep networks.
3. Average pooling layers replace fully connected layers.
4. Introduced BN layers to speed up training and stabilize convergence.
5. Increased network depth to improve the model's feature extraction capability.
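A toy sketch of the residual idea (fully connected layers stand in for convolutions; all names and shapes are illustrative): if the weights of the residual branch shrink toward zero, the block collapses to an identity-like mapping, so extra depth cannot do worse than the shallow network.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """out = ReLU(F(x) + x), with residual branch F(x) = W2 @ ReLU(W1 @ x)."""
    return relu(w2 @ relu(w1 @ x) + x)

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))  # identity on non-negative inputs: [1. 2. 3.]
```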
MobileNet v1
In 2017, Google proposed a lightweight CNN network focused on mobile or embedded devices: MobileNet. The most significant innovation is depthwise separable convolutions.
By decomposing standard convolutions into depthwise convolutions and pointwise convolutions, it significantly reduces the number of parameters and computational load. Introduced the ReLU6 activation function.
(The original post shows figures here: the parameter and FLOP calculation for depthwise separable convolution, and the MobileNet v1 network structure.)
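The parameter/computation comparison can be sketched as follows (the layer sizes below are illustrative): a depthwise separable layer costs roughly 1/C_out + 1/k² of a standard k x k convolution.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    return k * k * c_in * c_out * h * w        # multiply-adds

def dwsep_conv_macs(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w           # one k x k filter per input channel
    pointwise = c_in * c_out * h * w           # 1x1 convolution mixes channels
    return depthwise + pointwise

args = (3, 128, 256, 56, 56)
ratio = dwsep_conv_macs(*args) / standard_conv_macs(*args)
print(ratio)  # equals 1/256 + 1/9, about 0.115
```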
MobileNet v2
The improvements mainly include the following aspects:
1. Introduced residual structures, first increasing dimensions and then decreasing them, enhancing gradient propagation and significantly reducing memory usage during inference.
Inverted Residuals:
Residual module: The input first undergoes 1x1 convolution for compression, then uses 3x3 convolution for feature extraction, and finally uses 1x1 convolution to change the number of channels back. The entire process is “compress - convolution - expand.” This approach aims to reduce the computational load of the 3x3 module and improve the computational efficiency of the residual module.
Inverted Residual Module: The input first undergoes 1x1 convolution for channel expansion, then uses 3x3 depthwise convolution, and finally uses 1x1 pointwise convolution to compress the number of channels back. The entire process is “expand - convolution - compress.”
Performing ReLU operations on low dimensions can easily lead to information loss. However, performing ReLU operations in high dimensions results in minimal information loss.
Linear Bottleneck:
Because the inverted residual block ends in a low-dimensional projection, applying ReLU6 there would lose information (the low-dim to high-dim to low-dim problem noted above). The linear bottleneck therefore replaces the last 1x1 layer's ReLU6 with a linear activation, while the other layers keep ReLU6.
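The whole block can be sketched in NumPy (a minimal stride-1 version with illustrative shapes and names; real implementations add batch norm and handle stride or channel changes):

```python
import numpy as np

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def conv1x1(x, w):
    """Pointwise conv: x (C_in, H, W), w (C_out, C_in) -> (C_out, H, W)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def depthwise3x3(x, w):
    """Depthwise conv, stride 1, zero padding 1: x (C, H, W), w (C, 3, 3)."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[:, i, j][:, None, None] * xp[:, i:i + h, j:j + wd]
    return out

def inverted_residual(x, w_expand, w_dw, w_project):
    h = relu6(conv1x1(x, w_expand))   # expand: C -> t*C channels
    h = relu6(depthwise3x3(h, w_dw))  # 3x3 depthwise in high dimensions
    h = conv1x1(h, w_project)         # project back: linear, no ReLU6 (linear bottleneck)
    return x + h                      # shortcut (valid for stride 1, same channels)
```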
MobileNet v3
V3 combines depthwise separable convolutions from V1, Inverted Residuals and Linear Bottleneck from V2, along with the SE module, and employs NAS (Neural Architecture Search) to search for network parameters.
Complementary Search Techniques - NAS & NetAdapt
h-swish activation function
def forward(self, x):
    # h-swish(x) = x * ReLU6(x + 3) / 6
    out = F.relu6(x + 3., inplace=self.inplace) / 6.
    return out * x
Improvement 1: a redesigned head. In MobileNet v2, the last part of the network first maps to high dimensions with a 1x1 convolution, then aggregates features with global average pooling (GAP), and finally uses a 1x1 convolution to produce the K class scores; the feature-extracting 1x1 convolution therefore runs at 7x7 resolution.
In contrast, V3 first performs pooling and then 1x1 convolution for feature extraction, while V2 first performs 1x1 convolution for feature extraction and then pooling.
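A rough sense of the saving, using the head widths of MobileNetV3-Large (960 expanded to 1280; these particular channel numbers are quoted as an assumption):

```python
def conv1x1_macs(c_in, c_out, h, w):
    return c_in * c_out * h * w

# V2-style head: the expensive 1x1 convolution runs at 7x7 resolution.
before_pool = conv1x1_macs(960, 1280, 7, 7)
# V3-style head: pool first, then the same 1x1 convolution at 1x1 resolution.
after_pool = conv1x1_macs(960, 1280, 1, 1)
print(before_pool // after_pool)  # 49x fewer multiply-adds for this layer
```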
ShuffleNet
Group convolution is a technique that groups different feature maps from the input layer and applies different convolution kernels to each group, thereby reducing the computational load of convolutions.
In general, convolutions are performed across all input feature maps, which is known as full-channel convolution, a type of channel dense connection. In contrast, group convolution is a form of channel sparse connection.
Depthwise convolution is a special type of group convolution where the number of groups equals the number of channels, meaning each group contains only one feature map.
Group convolution blocks information flow between groups: output channels in one group only ever see input channels from the same group, so feature maps from different groups cannot communicate. Networks like MobileNet therefore follow group (depthwise) convolutions with dense 1x1 pointwise convolutions to exchange information across all channels.
ShuffleNet instead avoids the expensive dense pointwise convolution and restores cross-group information flow with a different mechanism: channel shuffle.
The core of ShuffleNet is the use of two operations: pointwise group convolution and channel shuffle, which significantly reduces the computational load of the model while maintaining accuracy. Its basic unit is an improvement based on a residual unit.
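Channel shuffle itself is just a reshape-transpose-reshape; a minimal NumPy sketch (the (C, H, W) array layout is an illustrative choice):

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (C, H, W). Interleave channels so each group sees every other group."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# Six channels labelled by index, shuffled with 2 groups:
x = np.arange(6, dtype=float).reshape(6, 1, 1)
print(channel_shuffle(x, 2).ravel())  # [0. 3. 1. 4. 2. 5.]
```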
Copyright Statement: This article is an original work by CSDN blogger "Jeremy_lf", and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement when reprinting.
Original link:
https://blog.csdn.net/Jeremy_lf/article/details/105501697
Editor: Guyueju