10 Major CNN Architectures Explained Clearly

Author | Raimi Karim, Produced by | AI Technology Camp (ID: rgznai100)

This article presents 10 carefully selected, detailed diagrams of CNN architectures. The diagrams show the essence of each model at a glance, so you do not have to scroll through every layer down to the softmax. Alongside the illustrations, the author adds notes on how the architectures evolved over time: from 5 to 50 convolutional layers, from plain convolutional layers to convolutional modules, from 2-3 towers to 32 towers, and from 7×7 to 5×5 kernels.

These architectures are “common” in the sense that pre-trained weights for them are usually distributed with deep learning libraries (such as TensorFlow and PyTorch) for developers to use, and that the models are typically taught in classrooms. Some of them have also achieved success in competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

The 10 architectures to be discussed along with their respective paper publication dates

Pre-trained weights for six of these architectures are available in Keras; see https://keras.io/applications/?source=post_page

The author wrote this article because few blogs or articles offer such compact structural diagrams, and decided to create one as a reference. To do so, the author read the papers and the code (mostly TensorFlow and Keras) behind each architecture. It is also worth noting that these CNN architectures grew out of very different sources: improvements in computer hardware, the ImageNet competition, solutions to specific problems, new ideas, and so on. Christian Szegedy, a researcher at Google, once noted:

“Most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.”

Now let’s take a look at how these “behemoth” network architectures have gradually evolved.

[Author’s Note] On visual annotations: Please note that in these illustrations, the author has omitted some information, such as the number of convolutional filters, padding, stride, dropout, and flatten operations.

Table of Contents (sorted by publication date)

  1. LeNet-5

  2. AlexNet

  3. VGG-16

  4. Inception-v1

  5. Inception-v3

  6. ResNet-50

  7. Xception

  8. Inception-v4

  9. Inception-ResNet-V2

  10. ResNeXt-50

[Figure: legend for the diagrams below]

1. LeNet-5 (1998)

Figure 1: LeNet-5 Network Structure

LeNet-5 is one of the simplest network architectures. It has 2 convolutional layers and 3 fully connected layers, hence the “5”: it is common to name a network after its total number of convolutional and fully connected layers. What we now call an average-pooling layer was then called a subsampling layer, and it had trainable weights (which is uncommon in CNN design today). The network has approximately 60,000 parameters.
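
The figure of roughly 60,000 parameters can be checked by hand. The sketch below is not the original 1998 design: it assumes the simplified LeNet-5 found in many modern reimplementations (32×32 single-channel input, every second-layer filter connected to all six input maps, pooling layers without trainable weights, and a plain dense output layer with 10 classes). The helper names are illustrative, and the total lands close to, not exactly on, the figure quoted above.

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus one bias per output channel for a k x k convolution."""
    return in_ch * out_ch * k * k + out_ch


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out


# Assumed simplified LeNet-5: full connectivity, pooling without weights.
lenet5_layers = {
    "conv 5x5, 1 -> 6": conv_params(1, 6, 5),
    "conv 5x5, 6 -> 16": conv_params(6, 16, 5),
    "dense 400 -> 120": dense_params(16 * 5 * 5, 120),
    "dense 120 -> 84": dense_params(120, 84),
    "dense 84 -> 10": dense_params(84, 10),
}

lenet5_total = sum(lenet5_layers.values())
for name, count in lenet5_layers.items():
    print(f"{name}: {count:,}")
print(f"total: {lenet5_total:,}")
```

Almost all of the weights sit in the first fully connected layer, which is typical of this "convolutions first, dense layers last" template.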

What Innovations Are There?

This network architecture has become the standard “template”: stacking convolutional and pooling layers, with one or more fully connected layers at the end of the network.

Related Works

  • Paper: Gradient-Based Learning Applied to Document Recognition
    Link: http://yann.lecun.com/exdb/publis/index.html?source=post_page
  • Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner
  • Published in: Proceedings of the IEEE (1998)

2. AlexNet (2012)

Figure 2: AlexNet Network Structure

AlexNet has 60M parameters and consists of 8 layers: 5 convolutional layers and 3 fully connected layers. AlexNet simply stacks more layers than LeNet-5. At the time of the paper’s publication, the authors pointed out that their network architecture was “one of the largest convolutional neural networks for the current ImageNet subset.”

What Innovations Are There?

1. Their network architecture was the first CNN to use ReLU as the activation function;

2. Used overlapping pooling in CNN.
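
Overlapping pooling simply means the pooling window is wider than the stride, so neighbouring windows share inputs. A minimal one-dimensional sketch follows; the input values are made up, and the window of 3 with stride 2 is an assumption chosen to match the setting commonly reported for AlexNet.

```python
def max_pool_1d(values, size, stride):
    """Slide a window of `size` over `values` in steps of `stride`, keeping each maximum."""
    return [
        max(values[start:start + size])
        for start in range(0, len(values) - size + 1, stride)
    ]


row = [1, 5, 2, 8, 3, 7, 9, 6, 0]

# Non-overlapping: window size equals stride, so each input falls in one window.
non_overlapping = max_pool_1d(row, size=2, stride=2)

# Overlapping: the window (3) is wider than the stride (2), so windows share inputs.
overlapping = max_pool_1d(row, size=3, stride=2)

print(non_overlapping)
print(overlapping)
```

In the overlapping case the value 9 falls inside two adjacent windows, which is exactly the sharing that non-overlapping pooling rules out.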

Related Works

  • Paper: ImageNet Classification with Deep Convolutional Neural Networks
    Link: https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks?source=post_page
  • Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. University of Toronto, Canada.
  • Published in: NeurIPS 2012

3. VGG-16 (2014)

Figure 3: VGG-16 Network Structure

You may have noticed that CNNs are getting deeper. This is because the most straightforward way to improve the performance of deep neural networks is to increase their size (Szegedy et al.). Researchers at the Visual Geometry Group (VGG) invented VGG-16, which has 13 convolutional layers and 3 fully connected layers and carries on the ReLU tradition from AlexNet. It has 138M parameters and takes up about 500MB of storage. They also designed a deeper variant, VGG-19.
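
The 138M figure can be reproduced with a few lines of arithmetic. The sketch below assumes the standard VGG-16 layout (3×3 convolutions throughout, a 224×224 RGB input, 1000 output classes) and 32-bit floats for the storage estimate; the constant and function names are illustrative.

```python
# Numbers are output channels of 3x3 convolutions; "M" marks a max-pooling layer.
VGG16_CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
# Fully connected sizes, assuming a 224x224 input (7 * 7 * 512 = 25088 features).
VGG16_FC_CFG = [(25088, 4096), (4096, 4096), (4096, 1000)]


def vgg16_param_count():
    """Count weights and biases of the 13 conv and 3 dense layers."""
    total = 0
    in_ch = 3  # RGB input
    for item in VGG16_CONV_CFG:
        if item == "M":
            continue  # pooling has no parameters
        total += in_ch * item * 3 * 3 + item  # weights + biases
        in_ch = item
    for n_in, n_out in VGG16_FC_CFG:
        total += n_in * n_out + n_out
    return total


params = vgg16_param_count()
print(f"parameters: {params:,}")
print(f"float32 size: {params * 4 / 1e6:.0f} MB")
```

Under these assumptions the three fully connected layers account for the large majority of the parameters, and the float32 size comes out in the same range as the storage figure quoted above.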

What Innovations Are There?

  1. As mentioned in the paper’s abstract, the contribution of this paper is the design of a deeper network (approximately twice the depth of AlexNet).
Related Works
  • Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
    Link: https://arxiv.org/abs/1409.1556?source=post_page
  • Authors: Karen Simonyan, Andrew Zisserman. University of Oxford, UK.
  • Published in: arXiv preprint, 2014

4. Inception-v1 (2014)

Figure 4: Inception-v1 Network Structure. This CNN has two auxiliary networks (discarded during inference), and the network structure is based on Figure 3 in the paper.
This 22-layer architecture has 5M parameters and is called Inception-v1. It makes heavy use of the Network In Network approach (see the appendix), implemented through Inception modules. The design of the Inception module is the product of research on approximating sparse structures.
Each module embodies three ideas:
  1. Running parallel towers of convolutions with different filter sizes (1×1, 3×3, and 5×5) and then concatenating their outputs captures different features and thereby “clusters” them. This idea is inspired by the paper “Provable bounds for learning some deep representations” by Arora et al., which suggests a layer-by-layer construction in which one analyzes the correlation statistics of the last layer and clusters them into groups of highly correlated units.
  2. 1×1 convolutions are used for dimensionality reduction to remove computational bottlenecks.
  3. Because each 1×1 convolution is followed by an activation function, adding it also adds non-linearity.
The authors of the paper also introduced two auxiliary classifiers to encourage discrimination in the lower stages of the classifier, to strengthen the gradient signal that is propagated back, and to provide additional regularization. The auxiliary networks (the branches connected to the auxiliary classifiers) are discarded during inference.
It is worth noting that “the main achievement of this network architecture is to improve the utilization of computational resources within the network”.
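To see why a 1×1 convolution in front of a larger kernel saves computation, compare the number of weights with and without it. This is a rough sketch: the channel sizes (192 in, 16 after reduction, 32 out) are assumptions picked for illustration, and biases are ignored.

```python
def conv_weights(in_ch, out_ch, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return in_ch * out_ch * k * k


# Assumed channel sizes: 192 input channels, 32 output channels, and a
# 1x1 bottleneck down to 16 channels in front of the 5x5 convolution.
in_ch, reduced_ch, out_ch = 192, 16, 32

direct = conv_weights(in_ch, out_ch, 5)
bottleneck = conv_weights(in_ch, reduced_ch, 1) + conv_weights(reduced_ch, out_ch, 5)

print(f"5x5 directly: {direct:,} weights")
print(f"1x1 then 5x5: {bottleneck:,} weights")
print(f"reduction:    {direct / bottleneck:.1f}x")
```

The number of multiplications per output position scales the same way, so the saving applies to compute as well as to model size.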
Author’s Note: The module names (Stem and Inception) were not used for this version of the architecture; they were only introduced in later versions, namely Inception-v4 and Inception-ResNet. The author uses them here for easier comparison.
What Innovations Are There?
  1. Using tightly integrated modules to build networks. Instead of stacking convolutional layers, stacking modules composed of convolutional layers. The name Inception comes from the sci-fi movie Inception.
Related Works
  • Paper: Going Deeper with Convolutions
    Link: https://arxiv.org/abs/1409.4842?source=post_page
  • Authors: Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Google, University of Michigan, University of North Carolina
  • Published in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

5. Inception-v3 (2015)

Figure 5: Inception-v3 Network Architecture. This CNN has an auxiliary network (discarded during inference). Note: All convolutional layers use batch norm and ReLU activation.
Inception-v3 is a successor to Inception-v1, with 24M parameters. Where did Inception-v2 go? Don’t worry: it was an early prototype of v3, so it is very similar to v3 but not commonly used. When proposing Inception-v2, the authors of the paper ran many experiments and documented the tweaks that worked. Inception-v3 is the network that incorporates these successful tweaks (such as improvements to the optimizer and the loss function, and batch normalization in the auxiliary layers).
The motivation for Inception-v2 and Inception-v3 is to avoid representational bottlenecks (that is, drastically reducing the input dimensions of the next layer) and to make computation more efficient by using factorization methods.
Author’s Note: The module names (Stem and Inception) were not used for this version of the architecture; they were only introduced in later versions, namely Inception-v4 and Inception-ResNet. The author uses them here for easier comparison.
What Innovations Are There?
  1. Among the first to use batch normalization (not shown in the diagram above, for simplicity).
What Improvements Are There Compared to the Previous Version, Inception-v1?
  1. Factorizing n×n convolutions into asymmetric convolutions: a 1×n convolution followed by an n×1 convolution.
  2. Factorizing 5×5 convolutions into two 3×3 convolutions.
  3. Replacing the 7×7 convolution with a series of 3×3 convolutions.
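The saving from these factorizations is easy to quantify. The sketch below counts weights per input/output channel pair, assuming the channel count stays the same across the factorized layers and ignoring biases; the function names are illustrative.

```python
def square_weights(n):
    """Weights of a single n x n kernel."""
    return n * n


def stacked_3x3_weights(n):
    """Stack of 3x3 kernels covering the same receptive field as one n x n kernel (n odd)."""
    num_layers = (n - 1) // 2  # each 3x3 layer widens the receptive field by 2
    return num_layers * 3 * 3


def asymmetric_weights(n):
    """A 1 x n kernel followed by an n x 1 kernel."""
    return 2 * n


for n in (3, 5, 7):
    print(f"{n}x{n}: full={square_weights(n)}, "
          f"stacked 3x3={stacked_3x3_weights(n)}, "
          f"1x{n} + {n}x1={asymmetric_weights(n)}")
```

For a 5×5 kernel this gives 25 weights versus 18 for two 3×3 kernels, and the gap widens as n grows.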
Related Works
  • Paper: Rethinking the Inception Architecture for Computer Vision
    Link: https://arxiv.org/abs/1512.00567?source=post_page
  • Authors: Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Google, University College London
  • Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

6. ResNet-50 (2015)

Figure 6: ResNet-50 Network Architecture
Yes, this answers the question at the beginning of the article.
In the previous CNNs, we only saw the number of layers increase during design, with better performance as a result. However, “with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly.” Researchers at Microsoft Research addressed this problem with ResNet (Residual Network), which uses skip connections (also known as shortcut connections or residuals) while building deeper models.
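The core of residual learning is that a block outputs its input plus a learned correction, rather than a completely new representation. Here is a toy sketch on plain Python lists: the two branch functions are hypothetical stand-ins for the convolutional layers inside a real block, which would also include batch normalization and activations.

```python
def residual_block(x, residual_fn):
    """Return residual_fn(x) + x: the input is added back to the branch output."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """A branch whose weights are all zero, so it outputs zeros."""
    return [0.0 for _ in x]


def small_branch(x):
    """Toy stand-in for the conv layers: a small correction to the input."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))   # the block behaves as the identity
print(residual_block(x, small_branch))  # the input plus a small correction
```

Because a block can fall back to the identity by driving its branch toward zero, stacking more blocks should not, in principle, make the network worse than a shallower one.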
ResNet is one of the first networks to adopt batch normalization (Ioffe and Szegedy submitted a paper on batch normalization to ICML in 2015). The image above shows the network architecture of ResNet-50, which uses 26M parameters.
ResNet’s basic building blocks are the conv block and the identity block. Since the two look quite similar, ResNet-50 can be simplified into the following diagram:
[Figure: simplified ResNet-50 diagram]
What Innovations Are There?
  1. Popularizing skip connections (they were not the first to use them).
  2. Designing deeper CNN architectures (up to 152 layers) without sacrificing the generalization ability of the network.
  3. One of the first network architectures to adopt batch normalization.
Related Works
  • Paper: Deep Residual Learning for Image Recognition
    Link: https://arxiv.org/abs/1512.03385?source=post_page
  • Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Microsoft
  • Published in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

7. Xception (2016)

Figure 7: Xception Network Architecture. Note: Depthwise separable convolutions are denoted by conv sep.
Xception is an adaptation of the Inception architecture in which the Inception modules are replaced by depthwise separable convolutions. It has roughly the same number of parameters as Inception-v3 (about 23M).
Xception takes the Inception hypothesis to the eXtreme (hence the name):
  • First, cross-channel (or cross-feature-map) correlations are captured by 1×1 convolutions.
  • Then, the spatial correlations within each channel are captured by regular 3×3 or 5×5 convolutions.
Taking this idea to the extreme means applying a 1×1 convolution across all channels and then a separate 3×3 convolution to each output channel. This is equivalent to replacing the Inception modules with depthwise separable convolutions.
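Separating the spatial step from the cross-channel step also cuts the weight count sharply. The sketch below compares a regular convolution with a depthwise separable one in its usual layer form (a depthwise k×k filter per channel followed by a 1×1 pointwise convolution). The layer size of 256 channels is an assumption for illustration, and biases are ignored.

```python
def standard_conv_weights(in_ch, out_ch, k):
    """Regular convolution: every output channel sees every input channel."""
    return in_ch * out_ch * k * k


def separable_conv_weights(in_ch, out_ch, k):
    """One k x k depthwise filter per input channel, then a 1x1 pointwise convolution."""
    depthwise = in_ch * k * k
    pointwise = in_ch * out_ch
    return depthwise + pointwise


# Assumed layer size, for illustration only.
in_ch, out_ch, k = 256, 256, 3
standard = standard_conv_weights(in_ch, out_ch, k)
separable = separable_conv_weights(in_ch, out_ch, k)
print(f"standard 3x3:  {standard:,}")
print(f"separable 3x3: {separable:,}")
```

With these sizes the separable version needs a little over a tenth of the weights of the regular convolution.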
What Innovations Are There?
  1. Introduced a CNN based entirely on depthwise separable convolution layers.
Related Works
  • Paper: Xception: Deep Learning with Depthwise Separable Convolutions
    Link: https://arxiv.org/abs/1610.02357?source=post_page
  • Authors: François Chollet. Google.
  • Published in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

8. Inception-v4 (2016)

Figure 8: Inception-v4 Network Architecture. This CNN has two auxiliary networks (discarded during inference). Note: All convolutional layers use batch norm and ReLU activation.
Google researchers strike again with Inception-v4 (43M parameters). It is an improvement over Inception-v3; the main differences are the Stem group and minor changes to the Inception-C module. The authors of the paper also made uniform choices for the Inception blocks at each grid size, and they noted that residual connections greatly improve training speed.
What Improvements Are There Compared to the Previous Version, Inception-v3?
  1. Changed the Stem module.
  2. Added more Inception modules.
  3. Made uniform choices for the Inception-v3 modules, meaning the same number of filters is used in every module.
Related Works
  • Paper: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
    Link: https://arxiv.org/abs/1602.07261?source=post_page
  • Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Google.
  • Published in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence

9. Inception-ResNet-V2 (2016)

Figure 9: Inception-ResNet-V2 Network Structure. Note: All convolutional layers use batch norm and ReLU activation.
In the same paper proposing Inception-v4, the authors also introduced Inception-ResNet: the Inception-ResNet-v1 and Inception-ResNet-v2 network series, with the v2 series having 56M parameters.
What Improvements Are There Compared to the Previous Version, Inception-v3?
  1. Converted Inception modules into residual Inception modules.
  2. Added more Inception modules.
  3. After the Stem module, added a new type of Inception module (Inception-A).
Related Works
  • Paper: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
    Link: https://arxiv.org/abs/1602.07261?source=post_page
  • Authors: Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Google
  • Published in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence

10. ResNeXt-50 (2017)

Figure 10: ResNeXt Network Architecture
If you recall ResNet, yes, they are related. ResNeXt-50 has 25M parameters (ResNet-50 has 25.5M). The difference between them is that ResNeXt adds more parallel towers (also called branches or paths) within each module. The diagram above shows 32 towers in total.
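Splitting a convolution into parallel towers is what keeps the parameter count close to ResNet's even though the block is wider. The sketch below compares one bottleneck block of each style. The sizes (256 channels in and out, inner width 64 with 1 tower versus inner width 128 split into 32 towers of 4 channels) are assumptions for illustration, and biases are ignored.

```python
def grouped_conv_weights(in_ch, out_ch, k, groups=1):
    """Weights of a k x k convolution split into `groups` independent towers."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return groups * (in_ch // groups) * (out_ch // groups) * k * k


def bottleneck_weights(io_ch, width, groups):
    """1x1 reduce, then a 3x3 (possibly grouped) convolution, then 1x1 expand."""
    return (
        grouped_conv_weights(io_ch, width, 1)
        + grouped_conv_weights(width, width, 3, groups)
        + grouped_conv_weights(width, io_ch, 1)
    )


resnet_block = bottleneck_weights(256, 64, groups=1)
resnext_block = bottleneck_weights(256, 128, groups=32)
print(f"ResNet-style block:  {resnet_block:,}")
print(f"ResNeXt-style block: {resnext_block:,}")
```

Both blocks come out at roughly 70,000 weights, so the extra towers add representational paths at nearly constant cost.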
What Innovations Are There?
  1. Increased the number of parallel towers (cardinality) within a module.
Related Works
  • Paper: Aggregated Residual Transformations for Deep Neural Networks
    Link: https://arxiv.org/abs/1611.05431?source=post_page
  • Authors: Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. University of California San Diego, Facebook Research
  • Published in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Appendix: Network In Network (2014)

Recall that in a convolution, the value of a pixel is a linear combination of the weights in the filter and the current sliding window. The authors proposed replacing this linear combination with a mini neural network with one hidden layer, which they called Mlpconv. So what we are dealing with here is a (simple, one-hidden-layer) network inside a (convolutional neural) network.
The idea of Mlpconv is closely related to 1×1 convolution kernels, which have become a key feature of the Inception network architecture.
What Innovations Are There?
  1. MLP convolutional layers, 1×1 convolutions.
  2. Global average pooling (averaging each feature map and feeding the resulting vector into the Softmax layer).
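Global average pooling is simple enough to write out directly. A minimal sketch, assuming one feature map per class and made-up values; the softmax helper is included only to show where the pooled vector goes next.

```python
import math


def global_average_pool(feature_maps):
    """Average every H x W feature map down to a single number."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def softmax(scores):
    """Turn a vector of scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Three toy 2x2 feature maps, one per class (made-up values).
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [2.0, 2.0]],
    [[4.0, 4.0], [4.0, 4.0]],
]

pooled = global_average_pool(feature_maps)
probs = softmax(pooled)
print(pooled)
print(probs)
```

Unlike a fully connected layer, this step has no weights of its own, which removes a large block of parameters from the end of the network.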
Related Works
  • Paper: Network In Network
    Link: https://arxiv.org/abs/1312.4400?source=post_page
  • Authors: Min Lin, Qiang Chen, Shuicheng Yan. National University of Singapore
  • Published in: arXiv preprint, 2013
The following resources are listed to help you visualize neural networks:
  • Netron
  • TensorBoard API by TensorFlow
  • plot_model API by Keras
  • pytorchviz package

References
The author referenced the papers proposing these CNN architectures in the text. Besides these papers, the author listed some other references in this article:
  • https://github.com/tensorflow/models/tree/master/research/slim/nets (github.com/tensorflow)

  • Implementation of deep learning models from the Keras team(github.com/keras-team)

  • Lecture Notes on Convolutional Neural Network Architectures: from LeNet to ResNet (slazebni.cs.illinois.edu)

  • Review: NIN — Network In Network (Image Classification)(towardsdatascience.com)
