Author: zzq
Source: https://zhuanlan.zhihu.com/p/68411179
This article is published with the author's authorization; please contact the original author for reprints.
Introduction to Basic Components of CNN
1. Local Receptive Field
In images, nearby pixels are strongly correlated, while distant pixels are only weakly related. Each neuron therefore does not need to perceive the entire image; it only needs to perceive a local region, and local information can be combined at higher layers to obtain global information. The convolution operation implements the local receptive field, and because convolution shares weights, it also reduces the number of parameters.

2. Pooling
Pooling reduces the size of the input feature map, discarding pixel-level detail while retaining the important information, primarily to reduce computation. Common variants are max pooling and average pooling.

3. Activation Function
The activation function introduces non-linearity. Common choices are sigmoid, tanh, and ReLU; the first two are often used in fully connected layers, while ReLU is common in convolutional layers.

4. Fully Connected Layer
The fully connected layer acts as the classifier of the convolutional neural network. The output of the preceding layers must be flattened before entering the fully connected layer.
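As a minimal sketch of the pooling and activation components above (illustrative code, not from the original article): 2×2 max pooling with stride 2 halves each spatial dimension while keeping the strongest response in each window, and ReLU then zeroes out negative values.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D feature map (list of lists)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, -1, 6, 2],
    [3, 2, 1, 4],
]
pooled = max_pool_2x2(fmap)                         # 4x4 -> 2x2
activated = [[relu(v) for v in row] for row in pooled]
print(pooled)       # [[4, 5], [3, 6]]
```

Note how the 4×4 input shrinks to 2×2: three quarters of the values are discarded, but each window's dominant activation survives.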
Classic Network Structures
1. LeNet5
Consists of two convolutional layers, two pooling layers, and two fully connected layers. All convolution kernels are 5×5 with stride 1, and the pooling layers use max pooling.

2. AlexNet
The model has eight layers (excluding the input layer): five convolutional layers and three fully connected layers, with softmax classification output at the last layer. AlexNet uses ReLU as the activation function; to prevent overfitting it employs dropout and data augmentation; it trains on dual GPUs; and it uses LRN.

3. VGG
Stacks 3×3 convolution kernels to obtain the receptive field of a larger kernel, with a deeper network structure. VGG has five convolutional stages, each followed by a max pooling layer, and the number of convolution kernels increases stage by stage. Summary: LRN has little effect; deeper networks perform better; 1×1 convolutions also help, but not as much as 3×3.

4. GoogLeNet (Inception v1)
From VGG we learned that deeper networks yield better results. However, as a model deepens, its parameter count grows, making overfitting more likely and demanding more training data; a more complex network also means more computation and larger model storage, requiring more resources and running slower. GoogLeNet was designed with parameter reduction in mind. It increases network complexity by widening the network, allowing the network to choose among convolution kernels on its own. This design reduces parameters while improving the network's adaptability to multiple scales.
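The VGG point above — stacking 3×3 kernels to emulate a larger kernel — can be checked with simple arithmetic (an illustrative sketch, assuming equal input/output channel counts C and ignoring biases): two stacked 3×3 convolutions cover the same 5×5 receptive field as a single 5×5 convolution, yet use fewer weights and insert an extra non-linearity.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # receptive field 5x5, two non-linearities
one_5x5 = conv_params(5, C, C)       # receptive field 5x5, one non-linearity
print(two_3x3, one_5x5)              # 73728 102400
```

The same arithmetic underlies Inception-v2's replacement of 5×5 convolutions with two consecutive 3×3 convolutions, discussed below.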
It uses 1×1 convolutions to increase network complexity without adding many parameters.

Inception-v2
Builds on v1 by adding batch normalization; in TensorFlow, applying BN before the activation function works better. It replaces the 5×5 convolution with two consecutive 3×3 convolutions, deepening the network while reducing parameters.

Inception-v3
The core idea is factorizing convolution kernels into smaller convolutions, e.g. decomposing a 7×7 convolution into 1×7 and 7×1 convolutions, which reduces parameters and increases depth.

Inception-v4
Introduces residual connections from ResNet to accelerate training and improve performance. However, when the number of filters exceeds 1000, training becomes unstable; adding an activation scaling factor helps alleviate this.

5. Xception
Proposed on the basis of Inception-v3; the basic idea is depthwise separable convolution, but with differences. The model slightly reduces parameters while achieving higher accuracy. Xception first performs a 1×1 convolution and then a 3×3 convolution, i.e. it merges channels first and then applies the spatial convolution; depthwise separable convolution does the opposite, applying the spatial 3×3 convolution first and then the 1×1 channel convolution. The core idea is to separate channel convolution from spatial convolution. MobileNet-v1 uses the depthwise order and adds BN and ReLU. Xception's parameter count is similar to Inception-v3; it increases network width to improve accuracy, while MobileNet-v1 aims to reduce parameters and improve efficiency.

6. MobileNet Series
V1
Uses depthwise separable convolutions; abandons pooling layers in favor of stride-2 convolutions. A standard convolution kernel has as many channels as the input feature map, whereas each depthwise convolution kernel has a single channel. Two hyperparameters control the model: a width multiplier that scales the input/output channel counts, and a resolution multiplier that scales the feature-map resolution.

V2
Three differences from v1: 1. Introduces a residual structure. 2. Performs a 1×1 convolution before the depthwise layer to increase the number of feature-map channels, unlike a traditional residual block. 3. After the pointwise convolution, ReLU is replaced with a linear activation to keep ReLU from destroying features. The reasoning: the features the depthwise layer can extract are limited by the number of input channels; with a traditional residual block, which compresses first, the features available to the depthwise layer would be squeezed even further, so v2 expands first rather than compressing. But with this expand-convolve-compress pattern, a problem appears after compression: the features are already compressed, and passing them through ReLU loses still more of them, so a linear activation is used instead.

V3
Combines complementary search techniques: resource-constrained NAS performs the module-level search, and NetAdapt performs local search. Network structure improvements include moving the final average pooling layer forward and removing the last convolution layer, introducing the h-swish activation function, and modifying the initial filter group. V3 integrates the depthwise separable convolutions of v1, the linear bottleneck structure of v2, and the lightweight SE attention module.

7. EffNet
An improvement over MobileNet-v1. The main idea: decompose MobileNet-v1's depthwise layer into 3×1 and 1×3 depthwise layers, so that pooling can be applied after the first of them, reducing the computation of the second. EffNet is smaller and more efficient than both MobileNet-v1 and ShuffleNet-v1.

8. EfficientNet
Studies how to scale network depth, width, and input resolution, and how the three interact, achieving higher efficiency and accuracy.

9. ResNet
VGG demonstrated that deeper networks can effectively improve accuracy, but very deep networks are prone to vanishing gradients and become hard to converge.
Tests show that beyond about 20 layers, convergence worsens as depth increases. By adding shortcut connections, ResNet effectively addresses the vanishing-gradient problem (more precisely, it alleviates rather than completely solves it).

10. ResNeXt
Combines ResNet with Inception's split-transform-merge idea, yet outperforms ResNet, Inception, and Inception-ResNet; it can be implemented with group convolutions. In general there are three ways to increase a network's expressive power: 1. Increase depth, as from AlexNet to ResNet, though experiments show the gains from added depth shrink as networks get deeper; 2. Increase the width of network modules, but this causes parameter counts to grow exponentially and is not mainstream in CNN design; 3. Improve the structural design, as in the Inception series and ResNeXt. Experiments show that increasing cardinality, i.e. the number of identical branches in a block, is a more effective way to enhance model expressiveness.

11. DenseNet
Significantly reduces parameter count through feature reuse, while also alleviating the vanishing-gradient problem to some extent.

12. SqueezeNet
Introduced the fire module: a squeeze layer followed by an expand layer. The squeeze layer is a 1×1 convolution; the expand layer applies 1×1 and 3×3 convolutions in parallel and concatenates their outputs. SqueezeNet has 1/50 the parameters of AlexNet (1/510 after compression) while achieving comparable accuracy.

13. ShuffleNet Series
V1
Reduces computation through group convolutions and pointwise group convolutions, and enriches cross-channel information by reorganizing channels. Xception and ResNeXt are inefficient as small models because their many dense 1×1 convolutions are resource-intensive, so pointwise group convolutions are proposed to lower the computational cost. However, pointwise group convolutions have the side effect of restricting information flow between groups, so channel shuffle is introduced to let information cross group boundaries.
Although depthwise convolution reduces computation and parameter count, it is less efficient than dense operations on low-power devices, so ShuffleNet uses depthwise convolutions only at the bottlenecks to minimize overhead.

V2
Design principles for making neural networks more efficient:
Keeping input and output channel counts equal minimizes memory access cost.
Using too many groups in group convolutions increases memory access cost.
Overly fragmented network structures (too many branches and basic units) reduce parallelism.
Element-wise operations are not free; their cost should not be overlooked.

14. SENet
Introduces the squeeze-and-excitation block, which learns per-channel weights to recalibrate feature responses, adding channel attention at low cost.

15. SKNet
Uses selective kernel units, letting neurons adaptively adjust their receptive field size by attending over branches with different kernel sizes.
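The parameter savings of the depthwise separable convolution used by Xception and the MobileNet series can be made concrete (an illustrative sketch, assuming a 3×3 kernel and ignoring biases): a standard convolution costs k·k·C_in·C_out weights, while the depthwise + pointwise factorization costs k·k·C_in + C_in·C_out, roughly a 1/k² + 1/C_out fraction of the original.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights of a standard k x k convolution: every filter spans all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    """k x k depthwise (one single-channel filter per input channel) + 1x1 pointwise."""
    return k * k * c_in + c_in * c_out

std = standard_conv_weights(3, 128, 128)         # 147456
sep = depthwise_separable_weights(3, 128, 128)   # 1152 + 16384 = 17536
print(sep / std)                                 # ~0.119, i.e. about 1/9 + 1/128
```

This is why MobileNet-v1 can shrink so dramatically relative to a standard CNN at similar depth.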
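The ResNet shortcut described above reduces to output = F(x) + x. A minimal sketch (toy code, not from the original article) shows why this helps gradients and signal propagate: even if the learned transform F contributes nothing, the input passes through unchanged.

```python
def residual_block(x, transform):
    """Apply a transform and add the block input back on via the shortcut."""
    fx = transform(x)
    return [a + b for a, b in zip(fx, x)]

x = [1.0, 2.0, 3.0]
out = residual_block(x, lambda v: [0.0] * len(v))  # degenerate F(x) = 0
print(out)  # [1.0, 2.0, 3.0] -- the shortcut preserves the signal
```

A deep stack of such blocks only needs each F to learn a small residual correction, rather than the full mapping.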
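ShuffleNet's channel shuffle, mentioned above, can be sketched in a few lines (illustrative code using channel labels instead of real feature maps): with g groups of n channels, the channels are reshaped to (g, n), transposed, and flattened, so the next group convolution sees one channel from every previous group.

```python
def channel_shuffle(channels, groups):
    """Interleave a flat channel list: reshape to (groups, n), transpose, flatten."""
    n = len(channels) // groups
    return [channels[j * n + i] for i in range(n) for j in range(groups)]

# 3 groups of 3 channels, labeled by group (a, b, c) and index within group
chans = ['a0', 'a1', 'a2', 'b0', 'b1', 'b2', 'c0', 'c1', 'c2']
print(channel_shuffle(chans, 3))
# ['a0', 'b0', 'c0', 'a1', 'b1', 'c1', 'a2', 'b2', 'c2']
```

After the shuffle, each new group ('a0','b0','c0', etc.) mixes channels from all three original groups, which is exactly the cross-group information flow that plain pointwise group convolution lacks.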
-End-