Illustrated Overview of 10 Major CNN Architectures


Author | Raimi Karim

Translator | Major

Editor | Zhao Xue

Produced by | AI Technology Camp (ID: rgznai100)

Introduction: In recent years, many convolutional neural networks (CNNs) have come into the spotlight, and as they grow ever deeper it becomes hard to keep a clear picture of any particular CNN's structure. This article therefore presents carefully selected, detailed illustrations of 10 CNN architectures and discusses each in turn.

How do you keep track of the different CNNs? In recent years, we have seen many CNNs emerge. These networks have become so deep that visualizing an entire model is extremely difficult, and we end up treating them as black boxes.

Perhaps you do not feel this way. But if you do run into this problem, this article is exactly what you need. It is an illustrated overview of 10 common CNN architectures, carefully selected by the author. These visualizations capture the essence of each model without requiring you to scroll through the softmax layers one by one. Besides the diagrams, the author adds notes on how the architectures have evolved over time: from 5 to 50 convolutional layers, from plain convolutional layers to convolutional modules, from 2-3 towers to 32 towers, and from 7×7 to 5×5 convolution kernels.

By “common”, the author means models whose pre-trained weights are usually shared by deep learning libraries (such as TensorFlow and PyTorch) for developers to use, and models that are commonly taught in classes. Some of these models have been successful in competitions such as the ILSVRC (ImageNet Large Scale Visual Recognition Challenge).


10 architectures to be discussed and the corresponding paper publication time

Pre-trained weights for 6 of these network architectures are available in Keras; see https://keras.io/applications/?source=post_page

The reason for writing this article is that not many blogs or articles lay out these compact structural illustrations, so the author decided to write one as a reference. To that end, the author read the papers and the code mentioned in this article (most of it in TensorFlow and Keras) to produce these diagrams. It should also be noted that the origins of these CNN architectures vary: improvements in computer hardware performance, the ImageNet competition, solutions to specific problems, new ideas, and so on. Christian Szegedy, a researcher at Google, once remarked:

“Most of this progress is not just the result of more powerful hardware, larger datasets, and bigger models, but mainly a consequence of new ideas, algorithms, and improved network architectures.”

Now let’s take a look at how these “beast-like” network architectures have gradually evolved.

[Author’s Note] A note on the visualizations: please note that in these illustrations, the author has omitted information such as the number of convolution filters, padding, stride, dropout, and flatten operations.

Table of Contents (Sorted by Publication Time)

  1. LeNet-5

  2. AlexNet

  3. VGG-16

  4. Inception-v1

  5. Inception-v3

  6. ResNet-50

  7. Xception

  8. Inception-v4

  9. Inception-ResNets

  10. ResNeXt-50

Legend


1. LeNet-5 (1998)

Figure 1: LeNet-5 Network Structure

LeNet-5 is one of the simplest network architectures. It has 2 convolutional layers and 3 fully connected layers (5 layers in total; this naming convention is common for neural networks, where the number counts only convolutional and fully connected layers). What we now call the average-pooling layer was then called a subsampling layer, and it had trainable weights (something rarely seen in today's CNN designs). This architecture has about 60,000 parameters.
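To make the layer count concrete, below is a minimal LeNet-5-style sketch in Keras. It is an approximation under stated assumptions (32×32 grayscale inputs, 10 classes, plain average pooling standing in for the original trainable subsampling layers), not a line-for-line reproduction of the paper:

```python
# A minimal LeNet-5-style sketch (assumptions: 32x32 grayscale input, 10 classes;
# the original trainable subsampling layers are approximated by plain average pooling).
from tensorflow.keras import layers, models

def lenet5_sketch(num_classes=10):
    return models.Sequential([
        layers.Input(shape=(32, 32, 1)),
        layers.Conv2D(6, 5, activation="tanh"),    # C1: 6 filters of 5x5
        layers.AveragePooling2D(2),                # S2
        layers.Conv2D(16, 5, activation="tanh"),   # C3: 16 filters of 5x5
        layers.AveragePooling2D(2),                # S4
        layers.Flatten(),
        layers.Dense(120, activation="tanh"),      # C5
        layers.Dense(84, activation="tanh"),       # F6
        layers.Dense(num_classes, activation="softmax"),
    ])

lenet5_sketch().summary()  # roughly 60k trainable parameters, as noted above
```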

What Innovations Are There?

This network architecture has become the standard “template”: stacking convolutional and pooling layers, with one or more fully connected layers at the end of the network.

Related Literature

  • Paper: Gradient-Based Learning Applied to Document Recognition

    Link:http://yann.lecun.com/exdb/publis/index.html?source=post_page

  • Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner
  • Published in: Proceedings of the IEEE (1998)

2. AlexNet (2012)


Figure 2: AlexNet Network Structure

AlexNet has 60M parameters and a total of 8 layers: 5 convolutional layers and 3 fully connected layers. AlexNet simply stacked a few more layers onto LeNet-5. At the time of publication, the authors pointed out that their architecture was “one of the largest convolutional neural networks to date on the subsets of ImageNet.”

What Innovations Are There?

1. Their network architecture was the first CNN to use ReLU as the activation function;

2. The use of overlapping pooling in CNNs (see the sketch after this list).
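Here is a small sketch of what “overlapping pooling” means in practice: the pooling window (3×3) is larger than the stride (2), so neighbouring windows overlap. The input shape is purely illustrative:

```python
# Overlapping max pooling: pool_size (3) > stride (2), so adjacent windows overlap.
# A non-overlapping pool would use pool_size == strides. Shapes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 55, 55, 96))               # e.g. a feature map after an early conv layer
pool = layers.MaxPooling2D(pool_size=3, strides=2)
print(pool(x).shape)                                 # (1, 27, 27, 96)
```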

Related Literature

  • Paper: ImageNet Classification with Deep Convolutional Neural Networks
    Link:https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks?source=post_page
  • Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. University of Toronto, Canada.
  • Published in: NeurIPS 2012

3. VGG-16 (2014)


Figure 3: VGG-16 Network Structure

You may have noticed that CNNs are becoming deeper and deeper. This is because the most direct way to improve the performance of deep neural networks is to increase their size (Szegedy et al.). Researchers in the Visual Geometry Group (VGG) invented VGG-16, which has 13 convolutional layers and 3 fully connected layers and inherits the ReLU tradition from AlexNet. It has 138M parameters and occupies about 500 MB of storage space. They also designed a deeper variant, VGG-19.
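Since VGG-16 is one of the architectures whose pre-trained weights ship with Keras (see the keras.io link above), a quick way to inspect its 13+3 layers and parameter count is the snippet below (it downloads the roughly 500 MB ImageNet weights on first use):

```python
# Load the pre-trained VGG-16 bundled with keras.applications and print its summary.
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
model.summary()   # 13 conv layers + 3 fully connected layers, ~138M parameters
```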

What Innovations Are There?

  1. As they mentioned in the paper abstract, the contribution of this paper is to design deeper networks (about twice the depth of AlexNet).
Related Literature
  • Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
    Link: https://arxiv.org/abs/1409.1556?source=post_page
  • Authors: Karen Simonyan, Andrew Zisserman. University of Oxford, UK.
  • Published in arXiv preprint, 2014

4. Inception-v1 (2014)

Figure 4: Inception-v1 Network Structure. This CNN has two auxiliary networks (which are discarded during inference), and the network structure is based on Figure 3 in the paper.
This 22-layer network architecture has 5M parameters and is known as Inception-v1. In this architecture, the Network in Network approach (see Appendix) is used heavily, via the Inception module. The architectural design of the module is the product of research on approximating sparse structures.
Each module embodies three ideas:
  1. Using parallel towers of convolutions with different kernel sizes (1×1, 3×3, 5×5), then concatenating them, so that different features are captured and thereby “clustered” (see the sketch after this list). This idea is inspired by Arora et al.'s paper “Provable bounds for learning some deep representations”, which suggests a layer-by-layer construction in which one analyzes the correlation statistics of the last layer and clusters them into groups of highly correlated units.
  2. 1×1 convolution kernels are used for dimensionality reduction to avoid computational bottlenecks.
  3. 1×1 convolution kernels add non-linearity within a convolution.
  4. The paper’s authors also introduced two auxiliary classifiers to encourage discrimination in the lower stages of the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. The auxiliary networks (the branches connected to the auxiliary classifiers) are discarded at inference time.
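The sketch below illustrates such a module: parallel 1×1, 3×3, and 5×5 towers (plus a pooling tower) whose outputs are concatenated along the channel axis, with 1×1 convolutions used for dimensionality reduction. The function name and filter counts are illustrative, not the exact configuration from the paper:

```python
# Simplified Inception-style module: parallel towers concatenated channel-wise.
# Filter counts are illustrative; the 1x1 convs before the 3x3/5x5 towers do the
# dimensionality reduction mentioned above.
from tensorflow.keras import layers

def inception_module_sketch(x, f1=64, f3r=96, f3=128, f5r=16, f5=32, fp=32):
    t1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)

    t3 = layers.Conv2D(f3r, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    t3 = layers.Conv2D(f3, 3, padding="same", activation="relu")(t3)

    t5 = layers.Conv2D(f5r, 1, padding="same", activation="relu")(x)   # 1x1 reduction
    t5 = layers.Conv2D(f5, 5, padding="same", activation="relu")(t5)

    tp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    tp = layers.Conv2D(fp, 1, padding="same", activation="relu")(tp)

    return layers.Concatenate()([t1, t3, t5, tp])

inputs = layers.Input(shape=(28, 28, 192))
outputs = inception_module_sketch(inputs)   # shape (28, 28, 64 + 128 + 32 + 32)
```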
It is worth noting that “the main achievement of this network architecture is to improve the utilization of internal computational resources of the network”.
Author’s Note: The naming of the modules (Stem and Inception) was not used in this version of the Inception architecture; these names were officially adopted only in later versions, namely Inception-v4 and Inception-ResNet. The author includes them here for easier comparison.
What Innovations Are There?
  1. Building networks out of dense modules/blocks: instead of stacking convolutional layers directly, we stack modules that are themselves composed of convolutional layers. (The name Inception comes from the sci-fi film “Inception”.)
Related Literature
  • Paper: Going Deeper with Convolutions
    Link: https://arxiv.org/abs/1409.4842?source=post_page
  • Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Google, University of Michigan, University of North Carolina
  • Published in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

5. Inception-v3 (2015)

Figure 5: Inception-v3 Network Structure. This CNN has two auxiliary networks (which are discarded during inference). Note: Batch norm and ReLU activation are used after all convolution layers.
Inception-v3 is the successor to Inception-v1 and has 24M parameters. Where did Inception-v2 go? Don’t worry: it is just an early prototype of v3, very similar to it but not commonly used. When proposing Inception-v2, the authors ran many experiments and documented what worked well. Inception-v3 is the culmination of those successes (improvements to the optimizer and the loss function, and the addition of batch normalization to the auxiliary layers of the auxiliary networks, among others).
The motivation for Inception-v2 and Inception-v3 is to avoid representational bottlenecks (that is, drastically reducing the input dimensions of the next layer) and to make computation more efficient by using factorization methods.
The naming of the modules (Stem and Inception) was not proposed in this version of the Inception network architecture and was officially used only in later versions, namely Inception-v4 and Inception-ResNet. The author included these here for easier comparison.
What Innovations Are There?
  1. Among the first designs to use batch normalization (not reflected in the diagram above, for simplicity).
What Improvements Are There Compared to the Previous Inception-v1 Version?
  1. Factorizing n×n convolutions into asymmetric convolutions: a 1×n convolution followed by an n×1 convolution.
  2. Decomposing 5×5 convolutions into 2 3×3 convolution operations.
  3. Replacing 7×7 convolutions with a series of 3×3 convolutions (both factorizations are sketched after this list).
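As a rough sketch of these factorizations (filter counts and function names are illustrative, not taken from the paper):

```python
# Two factorizations used in Inception-v2/v3: a 5x5 conv replaced by two stacked
# 3x3 convs, and an n x n conv replaced by asymmetric 1 x n and n x 1 convs.
from tensorflow.keras import layers

def two_3x3_instead_of_5x5(x, filters=64):
    # Two 3x3 convs cover the same receptive field as one 5x5 conv, with fewer weights.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def asymmetric_nxn(x, filters=64, n=7):
    # A 1 x n convolution followed by an n x 1 convolution in place of a full n x n.
    x = layers.Conv2D(filters, (1, n), padding="same", activation="relu")(x)
    return layers.Conv2D(filters, (n, 1), padding="same", activation="relu")(x)
```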
Related Literature
  • Paper: Rethinking the Inception Architecture for Computer Vision
    Link: https://arxiv.org/abs/1512.00567?source=post_page
  • Authors: Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Google, University College London
  • Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

6. ResNet-50 (2015)

Figure 6: ResNet-50 Network Structure
In the CNNs above, we have seen nothing but more layers being added, with performance improving as a result. However, “with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly.” Researchers at Microsoft Research addressed this problem with ResNet (Residual Network), using skip connections (also known as shortcut connections, or residuals) while building deeper models.
ResNet is one of the first networks to adopt batch normalization (Ioffe and Szegedy submitted the batch normalization paper to ICML in 2015). The above diagram is the network architecture of ResNet-50, which uses 26 M parameters.
ResNet’s basic building blocks are the conv block and the identity block. Because they look very similar, you can simplify ResNet-50 into the following diagram:
(Figure: ResNet-50 simplified into a stack of conv blocks and identity blocks)
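To make the skip connection concrete, here is a minimal sketch of a bottleneck identity block, the kind of block repeated throughout ResNet-50. Filter counts are illustrative, and it assumes the input already has the output channel count (otherwise a conv block with a projection shortcut is used):

```python
# Minimal ResNet-style bottleneck identity block: the input is added back onto
# the output of a 1x1 -> 3x3 -> 1x1 path (the skip / shortcut connection).
# Assumes x already has f3 channels so the addition is valid.
from tensorflow.keras import layers

def identity_block_sketch(x, f1=64, f2=64, f3=256):
    shortcut = x

    y = layers.Conv2D(f1, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(f2, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(f3, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)

    y = layers.Add()([y, shortcut])    # the residual connection
    return layers.ReLU()(y)
```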
What Innovations Are There?
  1. Avoiding the use of fully connected layers (they are not the first to do this).
  2. Designing even deeper CNNs (up to 152 layers) without compromising the model’s generalization power.
  3. One of the first network architectures to adopt batch normalization.
Related Literature
  • Paper: Deep Residual Learning for Image Recognition
    Link: https://arxiv.org/abs/1512.03385?source=post_page
  • Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Microsoft
  • Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

7. Xception (2016)

Figure 7: Xception Network Structure. Note: Depthwise separable convolution is referred to as conv sep.
Xception is an adaptation of the Inception architecture in which the Inception modules are replaced with depthwise separable convolutions. It has roughly the same number of parameters as Inception-v1 (23M).
Xception takes the Inception hypothesis to the eXtreme (hence the name):
  • First, cross-channel (or cross-feature-map) correlations are captured by 1×1 convolutions.
  • Consequently, spatial correlations within each channel are captured by regular 3×3 or 5×5 convolutions.
Pushing this idea to the extreme means performing a 1×1 convolution for every channel, then a 3×3 convolution on each output. This is equivalent to replacing the Inception modules with depthwise separable convolutions.
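Below is a sketch of the depthwise separable convolution (“conv sep” in the figure) in Keras. Note that, as commonly implemented (and as in Keras’s SeparableConv2D), the depthwise 3×3 runs before the pointwise 1×1; the ordering differs from the “extreme” formulation above, but the idea of decoupling spatial and cross-channel correlations is the same. Filter counts are illustrative:

```python
# Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
# followed by a pointwise 1x1 convolution that mixes channels.
from tensorflow.keras import layers

def separable_conv_sketch(x, filters=128):
    x = layers.DepthwiseConv2D(3, padding="same")(x)      # spatial correlations, per channel
    return layers.Conv2D(filters, 1, padding="same")(x)   # cross-channel correlations

# Keras bundles both steps into a single layer:
# x = layers.SeparableConv2D(128, 3, padding="same")(x)
```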
What Innovations Are There?
  1. Introduced a CNN built entirely from depthwise separable convolution layers.
Related Literature
  • Paper: Xception: Deep Learning with Depthwise Separable Convolutions
    Link: https://arxiv.org/abs/1610.02357?source=post_page
  • Authors: François Chollet. Google.
  • Published in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

8. Inception-v4 (2016)

Figure 8: Inception-v4 Network Structure. This CNN has two auxiliary networks (which are discarded during inference). Note: Batch norm and ReLU activation are used after all convolution layers.
Researchers at Google followed up with Inception-v4 (43M parameters). It is an improvement over Inception-v3; the main differences are the Stem group and minor modifications to the Inception-C module. The paper’s authors also mention that making uniform choices for the Inception blocks of each grid size can significantly speed up training.
In summary, it is noteworthy that the paper mentions Inception-v4 performs better due to the increase in model size.
What Improvements Are There Compared to the Previous Inception-v3 Version?
  1. Changed the Stem module.
  2. Added more Inception modules.
  3. Made uniform choices for the Inception modules, meaning the same number of filters is used in every module.
Related Literature
  • Paper: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
    Link: https://arxiv.org/abs/1602.07261?source=post_page
  • Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Google.
  • Published in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence

9. Inception-ResNet-V2 (2016)

Figure 9: Inception-ResNet-V2 Network Structure. Note: Batch norm and ReLU activation are used after all convolution layers.
In the same paper that proposed Inception-v4, the authors also introduced the Inception-ResNet family: Inception-ResNet-v1 and Inception-ResNet-v2, the latter having 56M parameters.
What Improvements Are There Compared to the Previous Inception-v3 Version?
  1. Converted Inception modules into residual Inception modules (a sketch follows this list).
  2. Added more Inception modules.
  3. Added a new type of Inception module (Inception-A) after the Stem module.
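The sketch below shows the shape of such a residual Inception block: a small multi-branch Inception-style transform whose output is projected by a 1×1 convolution and added back onto the block input. The branch structure and filter counts are illustrative, not the exact Inception-ResNet-v2 configuration:

```python
# Residual Inception block sketch: Inception-style branches, concatenated, then a
# 1x1 projection added back onto the input (ResNet-style). Assumes x has `channels`
# channels so the addition is valid; filter counts are illustrative.
from tensorflow.keras import layers

def residual_inception_sketch(x, channels=256):
    b1 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)

    b2 = layers.Conv2D(32, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(b2)

    mixed = layers.Concatenate()([b1, b2])
    up = layers.Conv2D(channels, 1, padding="same")(mixed)   # match input channel count
    return layers.ReLU()(layers.Add()([x, up]))              # residual connection
```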
Related Literature
  • Paper: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
    Link: https://arxiv.org/abs/1602.07261?source=post_page
  • Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Google
  • Published in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence

10. ResNeXt-50 (2017)

Figure 10: ResNeXt Network Structure
If you are thinking of ResNet, yes, they are related. ResNeXt-50 has 25M parameters (ResNet-50 has 25.5M). The difference between them is that ResNeXt adds parallel towers/branches within each module, indicated in the figure above by “32 towers in total”.
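In code, the parallel towers are usually implemented as a single grouped convolution, where the number of groups is the cardinality. The sketch below assumes a recent TensorFlow/Keras version (Conv2D with a `groups` argument) and uses illustrative filter counts:

```python
# ResNeXt-style block sketch: a bottleneck whose 3x3 convolution is grouped,
# with one group per "tower" (cardinality = 32 here). Assumes x already has
# `out_channels` channels so the residual addition is valid.
from tensorflow.keras import layers

def resnext_block_sketch(x, cardinality=32, bottleneck=128, out_channels=256):
    shortcut = x
    y = layers.Conv2D(bottleneck, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(bottleneck, 3, padding="same", activation="relu",
                      groups=cardinality)(y)        # 32 parallel towers as one grouped conv
    y = layers.Conv2D(out_channels, 1, padding="same")(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```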
What Innovations Are There?
  1. Increased the number of parallel towers (cardinality) in a module.
Related Literature
  • Paper: Aggregated Residual Transformations for Deep Neural Networks
    Link: https://arxiv.org/abs/1611.05431?source=post_page
  • Authors: Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. University of California San Diego, Facebook Research
  • Published in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Appendix: Network In Network (2014)

Recall that in a convolution, the value of a pixel is a linear combination of the weights in the filter and the current sliding window. The authors proposed replacing this linear combination with a mini neural network containing one hidden layer, which they called Mlpconv. So what we are dealing with is a (one-hidden-layer) network inside a (convolutional neural) network.
The idea of Mlpconv is closely related to 1×1 convolution kernels, which became a major feature of the Inception network architecture.
What Innovations Are There?
  1. MLP convolution layer, 1×1 convolution.
  2. Global average pooling (averaging each feature map and feeding the resulting vector into the softmax layer; see the sketch after this list).
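A small sketch of the global-average-pooling idea: the network ends with one feature map per class, each map is averaged to a single number, and the resulting vector goes straight into the softmax, with no fully connected layers. The input shape here is illustrative:

```python
# Global average pooling as a classifier head: one feature map per class,
# averaged to a single value each, then softmax. No fully connected layers.
from tensorflow.keras import layers, models

num_classes = 10
head = models.Sequential([
    layers.Input(shape=(8, 8, 192)),                  # illustrative feature-map size
    layers.Conv2D(num_classes, 1, padding="same"),    # 1x1 conv -> one map per class
    layers.GlobalAveragePooling2D(),                  # (8, 8, num_classes) -> (num_classes,)
    layers.Softmax(),
])
head.summary()
```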
Related Literature
  • Paper: Network In Network
    Link: https://arxiv.org/abs/1312.4400?source=post_page
  • Authors: Min Lin, Qiang Chen, Shuicheng Yan. National University of Singapore
  • Published in: arXiv preprint, 2013
The following resources can help you visualize neural networks:
  • Netron
  • TensorBoard API by TensorFlow
  • plot_model API by Keras
  • pytorchviz package

References
The author used the papers proposing these CNN architectures as references in the text. In addition to these papers, the author listed some other references in this article:
  • https://github.com/tensorflow/models/tree/master/research/slim/nets(github.com/tensorflow)

  • Implementation of deep learning models from the Keras team(github.com/keras-team)

  • Lecture Notes on Convolutional Neural Network Architectures: from LeNet to ResNet (slazebni.cs.illinois.edu)

  • Review: NIN — Network In Network (Image Classification)(towardsdatascience.com)

Original link:
https://towardsdatascience.com/illustrated-10-cnn-architectures-95d78ace614d
