Comprehensive Overview of CNN Architecture Development

Author: zzq
Source: https://zhuanlan.zhihu.com/p/68411179
Editor: Jishi Platform

Introduction to Basic Components of CNN

1. Local Receptive Field
In an image, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to sense local information, which higher layers then combine into global information. The convolution operation implements this local receptive field, and weight sharing greatly reduces the number of parameters.
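As a rough illustration of how weight sharing cuts the parameter count, the sketch below compares a small convolution with a fully connected layer producing an output of the same size (the 32×32 input size and channel counts are arbitrary choices for the comparison):

```python
import torch.nn as nn

# A 3x3 convolution maps a 1-channel 32x32 image to 8 feature maps with only
# 8 * (3*3*1) + 8 = 80 parameters, because the same kernel slides over every
# spatial position (weight sharing over a local receptive field).
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

# A fully connected layer producing an output of the same size would need
# (32*32*1) * (32*32*8) + 32*32*8 = ~8.4 million parameters.
fc = nn.Linear(32 * 32 * 1, 32 * 32 * 8)

print(sum(p.numel() for p in conv.parameters()))  # 80
print(sum(p.numel() for p in fc.parameters()))    # 8396800
```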
2. Pooling
Pooling reduces the spatial size of the feature maps, discarding fine pixel detail while retaining the important information, primarily to reduce computational load. The two main variants are max pooling and average pooling.
3. Activation Function
The activation function is used to introduce non-linearity. Common activation functions include sigmoid, tanh, and ReLU; the first two are commonly used in fully connected layers, while ReLU is often found in convolutional layers.
4. Fully Connected Layer
The fully connected layer acts as a classifier within the entire convolutional neural network. The outputs from previous layers must be flattened before entering the fully connected layer.
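Putting the four components together, here is a minimal PyTorch sketch of the conv → activation → pooling → flatten → fully connected pipeline (shapes and channel counts are illustrative only):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)            # batch of one 3-channel 32x32 image

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
relu = nn.ReLU()                         # non-linearity after the convolution
pool = nn.MaxPool2d(kernel_size=2)       # halves spatial size: 32x32 -> 16x16
fc   = nn.Linear(16 * 16 * 16, 10)       # classifier over 10 classes

h = pool(relu(conv(x)))                  # (1, 16, 16, 16)
h = h.flatten(start_dim=1)               # flatten before the fully connected layer
logits = fc(h)                           # (1, 10)
print(logits.shape)
```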

Classic Network Structures

1. LeNet5
Consists of two convolutional layers, two pooling layers, and two fully connected layers. The convolutional kernel is 5×5, with a stride of 1, and the pooling layer uses max pooling.
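A minimal PyTorch sketch of this layout, following the description above (5×5 kernels, two pooling layers, two fully connected layers); the 32×32 grayscale input and the 120-unit hidden layer are illustrative choices:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Two 5x5 convolutions, two pooling layers, two fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```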
2. AlexNet
The model has eight layers (excluding the input layer), including five convolutional layers and three fully connected layers. The last layer uses softmax for classification output.
AlexNet uses ReLU as the activation function; dropout and data augmentation are used to prevent overfitting; training was split across two GPUs; and local response normalization (LRN) is employed.
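For reference, a sketch of this 5-conv + 3-FC layout, assuming a 224×224 RGB input; the channel counts follow the common single-GPU variant rather than the original two-GPU split, and the final softmax is left to the loss function:

```python
import torch
import torch.nn as nn

alexnet_sketch = nn.Sequential(
    nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),            # conv1 + LRN
    nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5), nn.MaxPool2d(3, stride=2),            # conv2 + LRN
    nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(inplace=True),      # conv3
    nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),      # conv4
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),      # conv5
    nn.MaxPool2d(3, stride=2),
    nn.Flatten(),
    nn.Dropout(), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # fc6
    nn.Dropout(), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # fc7
    nn.Linear(4096, 1000),                                               # fc8 (softmax applied in the loss)
)

print(alexnet_sketch(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```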
3. VGG
Uses stacks of 3×3 convolutional kernels to obtain the receptive field of a larger kernel with fewer parameters, and builds a deeper network. VGG has five convolutional segments, each followed by a max pooling layer, and the number of convolutional kernels increases from segment to segment.
Summary: LRN has little effect; deeper networks yield better results; 1×1 convolutions are effective but not as good as 3×3 ones.
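The stacked-3×3 idea can be sketched as a reusable block: two stacked 3×3 convolutions cover a 5×5 receptive field with fewer parameters (channel counts here are illustrative):

```python
import torch.nn as nn

def vgg_block(in_ch, out_ch, num_convs):
    """A VGG-style segment: several 3x3 convolutions followed by 2x2 max pooling."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convs on C channels: 2 * (3*3*C*C) = 18C^2 weights,
# versus 5*5*C*C = 25C^2 for a single 5x5 conv with the same receptive field.
block = vgg_block(64, 128, num_convs=2)
```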
4. GoogLeNet (Inception v1)
From VGG, we learned that deeper networks yield better results. However, as the model deepens, the number of parameters increases, leading to a higher chance of overfitting and requiring more training data; additionally, complex networks mean more computation, larger model storage, and slower speeds. GoogLeNet was designed to reduce parameters.
GoogLeNet instead increases capacity by widening the network: each Inception module runs 1×1, 3×3, and 5×5 convolutions (plus pooling) in parallel and concatenates the results, letting the network learn which kernel sizes to rely on. This design reduces parameters while improving adaptability to multiple scales, and the 1×1 convolutions reduce channel dimensionality so that complexity can grow without a large increase in parameters.
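A minimal sketch of an Inception-v1 module along these lines; the 1×1 convolutions shrink the channel count before the expensive 3×3 and 5×5 branches, and the branch widths below are just one illustrative configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

out = InceptionModule(192, 64, 96, 128, 16, 32, 32)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```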
Inception-v2
Building on v1, batch normalization was added (placing BN before the activation function works better in practice), and 5×5 convolutions were replaced with two consecutive 3×3 convolutions, deepening the network while reducing parameters.
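A sketch of that factorization: two 3×3 conv-BN-ReLU blocks (BN placed before the activation) replace one 5×5 convolution while covering the same receptive field:

```python
import torch.nn as nn

def factorized_5x5(in_ch, out_ch):
    """Two stacked 3x3 conv-BN-ReLU blocks covering a 5x5 receptive field."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),   # BN before the activation
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```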
Inception-v3
The core idea is to decompose larger convolutions, such as 7×7, into two convolutions of 1×7 and 7×1, reducing parameters while increasing depth.
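The asymmetric factorization can be sketched the same way: a 1×7 convolution followed by a 7×1 convolution uses 14 weights per channel pair instead of 49, at the cost of extra depth:

```python
import torch.nn as nn

def factorized_7x7(in_ch, out_ch):
    """Replace a 7x7 convolution with a 1x7 followed by a 7x1 convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, 7), padding=(0, 3)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=(7, 1), padding=(3, 0)),
        nn.ReLU(inplace=True),
    )
```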
Inception-v4 Structure
Introduced residual connections (as in ResNet) to accelerate training and improve performance. However, when the number of filters exceeds 1000, training becomes unstable; scaling the residual activations by a small factor alleviates this.
5. Xception
Proposed on the basis of Inception-v3, its basic idea is depthwise separable convolution, but with a difference in ordering. The model slightly reduces parameters while achieving higher accuracy. Xception performs the 1×1 convolution first and then the 3×3 convolution, i.e., it merges channels before the spatial convolution; standard depthwise separable convolution does the opposite, performing the spatial 3×3 convolution first and the channel-wise 1×1 convolution afterwards. The core idea is to decouple channel correlations from spatial correlations. MobileNet-v1 uses the depthwise-first order and adds BN and ReLU between the two steps. Xception's parameter count is similar to Inception-v3's; it increases network width to improve accuracy, whereas MobileNet-v1 aims to reduce parameters and improve efficiency.
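The ordering difference can be sketched like this (channel counts illustrative): the Xception-style block does the 1×1 pointwise convolution first, while the MobileNet-style depthwise separable block does the 3×3 depthwise convolution first:

```python
import torch.nn as nn

def xception_style(in_ch, out_ch):
    """1x1 pointwise first, then a 3x3 convolution applied per channel (groups=out_ch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),                              # merge channels
        nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),   # spatial, per channel
    )

def depthwise_separable(in_ch, out_ch):
    """3x3 depthwise first (groups=in_ch), then a 1x1 pointwise convolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),      # spatial, per channel
        nn.Conv2d(in_ch, out_ch, 1),                              # merge channels
    )
```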
6. MobileNet Series
V1
Uses depthwise separable convolutions and eliminates pooling layers, using stride-2 convolutions for downsampling instead. A standard convolutional kernel spans all input feature map channels, whereas each depthwise kernel has a channel count of 1 and operates on a single channel. Two hyperparameters, a width multiplier and a resolution multiplier, control the channel counts and the input image resolution.
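A sketch of the MobileNet-v1 building block: a depthwise 3×3 convolution followed by a pointwise 1×1 convolution, each with BN and ReLU, with stride-2 in the depthwise step standing in for pooling:

```python
import torch.nn as nn

def mobilenet_v1_block(in_ch, out_ch, stride=1):
    """Depthwise 3x3 + pointwise 1x1, each followed by BN and ReLU.

    stride=2 in the depthwise conv replaces a pooling layer for downsampling;
    the width/resolution multipliers of the paper would simply scale the channel
    counts and the input image size fed into blocks like this one.
    """
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```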
V2
Has three differences compared to v1: 1. it introduces residual structures; 2. it performs a 1×1 expansion convolution before the depthwise convolution to increase the number of feature map channels, which differs from traditional residual blocks; 3. after the final pointwise convolution, ReLU is discarded in favor of a linear activation, to keep ReLU from destroying the features. The reasoning: the features a depthwise layer can extract are limited by the number of input channels, so if a traditional residual block (which compresses channels first) were used, the depthwise layer would extract even fewer features; hence the block expands first. But with this expand–convolve–compress pattern a new problem arises: applying ReLU to the already-compressed output damages the low-dimensional features, so a linear activation is used there instead.
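A sketch of the inverted residual block described above: expand with 1×1, depthwise 3×3, then a linear 1×1 projection with no ReLU; the shortcut is used only when the input and output shapes match:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),                # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),                    # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),                # linear projection
            nn.BatchNorm2d(out_ch),                                  # no ReLU here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```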
V3
Combines complementary search techniques: NAS searches for the overall module structure under resource constraints, while NetAdapt performs a local, layer-wise search. Structural improvements include moving the final average pooling layer forward and removing the last convolutional layer, introducing the h-swish activation function, and modifying the initial filter set.
V3 integrates the depthwise separable convolutions of v1, the linear-bottleneck inverted residual structure of v2, and lightweight SE (squeeze-and-excitation) attention modules.
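The h-swish activation mentioned above replaces the sigmoid in swish with a piecewise-linear approximation, h-swish(x) = x · ReLU6(x + 3) / 6, which is cheaper to compute and quantization-friendly; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    """h-swish(x) = x * ReLU6(x + 3) / 6, a hard approximation of x * sigmoid(x)."""
    return x * F.relu6(x + 3.0) / 6.0

print(h_swish(torch.tensor([-4.0, 0.0, 4.0])))  # tensor([-0., 0., 4.])
```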
7. EffNet
EffNet is an improvement on MobileNet-v1. The main idea is to decompose MobileNet-v1's 3×3 depthwise layer into a 1×3 and a 3×1 depthwise layer, with pooling applied after the first of the two so that the second operates on a smaller feature map, reducing computation. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.
8. EfficientNet
Studies how to scale a network in depth, width, and input resolution jointly, and how the three factors interact, in order to achieve higher efficiency and accuracy.
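The compound scaling rule balances the three factors with a single coefficient φ: depth scales as α^φ, width as β^φ, and resolution as γ^φ, with α·β²·γ² ≈ 2 so that each unit increase in φ roughly doubles the FLOPs. A small sketch, using the commonly cited base coefficients α = 1.2, β = 1.1, γ = 1.15 (stated here from memory of the paper):

```python
def compound_scaling(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution) multipliers for a given compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# Scaling up from the baseline: alpha * beta**2 * gamma**2 is constrained to be ~2,
# so each step of phi roughly doubles the computational cost.
for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```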
9. ResNet
VGG showed that increasing network depth is an effective way to improve accuracy, but deeper networks are prone to vanishing gradients and poor convergence. Tests show that beyond roughly 20 layers, convergence worsens as depth increases. ResNet adds shortcut connections, which effectively address the vanishing gradient problem (more precisely, alleviate rather than truly solve it).
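A sketch of the basic residual block with its shortcut connection; the 1×1 projection on the shortcut is only needed when the shape changes:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut, or a 1x1 projection when the shape changes.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))
```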
10. ResNeXt
Built on ResNet and Inception, combining the split-transform-merge strategy with residual connections; it outperforms ResNet, Inception, and Inception-ResNet, and can be implemented with grouped convolutions. In general there are three ways to increase a network's expressiveness: 1. increase depth, as from AlexNet to ResNet, though experiments show the gains from added depth diminish; 2. increase the width of network modules, but this causes a rapid growth in parameter count and is not a mainstream CNN design direction; 3. improve the architecture design itself, as in the Inception series and ResNeXt. Experiments show that increasing cardinality, i.e. the number of identical branches in a block, improves model expressiveness more effectively.
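Increasing cardinality is usually implemented with grouped convolution, as in this sketch of a ResNeXt-style bottleneck (cardinality = number of groups; channel widths are illustrative, and the residual add is omitted):

```python
import torch.nn as nn

def resnext_bottleneck(in_ch, mid_ch, out_ch, cardinality=32):
    """1x1 reduce -> 3x3 grouped conv (one group per branch) -> 1x1 expand."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 1, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=cardinality, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

block = resnext_bottleneck(256, 128, 256, cardinality=32)
```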
11. DenseNet
DenseNet significantly reduces the number of parameters through feature reuse: each layer receives the concatenated feature maps of all preceding layers. This also alleviates the gradient vanishing problem to some extent.
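Feature reuse means each layer sees the concatenation of all previous feature maps; a minimal sketch of a dense block (the growth rate and layer count are illustrative):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth_rate, growth_rate, 3, padding=1, bias=False),
            )
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all earlier feature maps.
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)
```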
12. SqueezeNet
Proposed the fire module: a squeeze layer followed by an expand layer. The squeeze layer uses 1×1 convolutions; the expand layer uses parallel 1×1 and 3×3 convolutions whose outputs are concatenated. SqueezeNet has 1/50 the parameters of AlexNet (1/510 after compression) with comparable accuracy.
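A sketch of the fire module: squeeze down to a few channels with 1×1 convolutions, then expand with parallel 1×1 and 3×3 branches that are concatenated (channel counts illustrative):

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1_ch, expand3_ch):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(in_ch, squeeze_ch, 1), nn.ReLU(inplace=True))
        self.expand1 = nn.Sequential(nn.Conv2d(squeeze_ch, expand1_ch, 1), nn.ReLU(inplace=True))
        self.expand3 = nn.Sequential(nn.Conv2d(squeeze_ch, expand3_ch, 3, padding=1),
                                     nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1(s), self.expand3(s)], dim=1)

out = FireModule(96, 16, 64, 64)(torch.randn(1, 96, 55, 55))
print(out.shape)  # torch.Size([1, 128, 55, 55])
```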
13. ShuffleNet Series
V1
Reduces computation by replacing dense 1×1 convolutions with 1×1 pointwise group convolutions, and uses channel shuffle to enrich each channel with information from other groups. Xception and ResNeXt are less efficient in small models because their dense 1×1 convolutions are resource-intensive, so pointwise group convolutions were proposed to reduce computational complexity; since group convolutions block information exchange between groups, channel shuffle is introduced to help information flow across channels. Depthwise convolution reduces computation and parameter count, but is less efficient than dense operations on low-power devices, so ShuffleNet applies it only on the bottleneck feature maps to minimize overhead.
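The channel shuffle operation itself is just a reshape and transpose that interleaves channels from different groups; a minimal sketch:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: split into groups, swap axes, flatten back."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()         # swap group and per-group channel axes
    return x.view(n, c, h, w)                  # flatten back to a single channel axis

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```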
V2
Design principles for making neural networks more efficient:
Maintaining equal input and output channel counts minimizes memory access costs.
Excessive grouping in grouped convolutions increases memory access costs.
Overly complex network structures (too many branches and basic units) reduce network parallelism.
Element-wise operations also consume significant resources.
14. SENet
SENet introduces the squeeze-and-excitation (SE) block: feature maps are squeezed by global average pooling into a channel descriptor, passed through two small fully connected layers, and the resulting per-channel weights are used to recalibrate the feature maps, improving accuracy at very small extra cost.
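A sketch of a squeeze-and-excitation block: global average pooling, two fully connected layers with a reduction ratio, sigmoid gating, then channel-wise rescaling:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (N, C)
        w = self.fc(w).view(n, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # recalibrate channel responses

print(SEBlock(64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```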
15. SKNet
SKNet (Selective Kernel Networks) extends this idea: each unit contains branches with different kernel sizes, and a softmax attention mechanism fuses them, letting neurons adaptively adjust their receptive field size according to the input.
