Evolution of CNN Architecture: From AlexNet to ResNet
Hello everyone, I am Sister Liu. Today we will delve into the evolution of Convolutional Neural Networks (CNNs), one of the most important technological developments in the field of computer vision.
Background Knowledge
Before the rise of deep learning, traditional image recognition methods relied on manually designed feature extraction techniques. These methods often had significant limitations and struggled with complex visual tasks. The emergence of CNNs completely changed this situation.
Basic Concepts
- Convolutional Layer: Extracts local features by sliding convolution kernels over the input data.
- Pooling Layer: Reduces the spatial size of feature maps and decreases computational cost.
- Fully Connected Layer: Maps the extracted features to the final classification results.
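To make these building blocks concrete, here is a minimal sketch in PyTorch (the same framework used in the implementation section below) that chains a convolutional layer, a pooling layer, and a fully connected layer into a hypothetical toy classifier:

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Toy model illustrating the three basic layer types."""
    def __init__(self, num_classes=10):
        super(TinyCNN, self).__init__()
        # Convolutional layer: slides 3x3 kernels over the input to extract local features
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        # Pooling layer: halves the spatial resolution, reducing computation
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Fully connected layer: maps the pooled features to class scores
        self.fc = nn.Linear(16 * 16 * 16, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        x = torch.flatten(x, 1)
        return self.fc(x)

# Example: a batch of four 32x32 RGB images
logits = TinyCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])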
Technical Evolution
AlexNet (2012)
AlexNet is a milestone in deep learning image recognition. It first demonstrated the immense potential of deep convolutional neural networks in large-scale image recognition tasks.
Key Innovations
- Utilized the ReLU activation function to mitigate the vanishing gradient problem and speed up training.
- Adopted the Dropout regularization method to reduce overfitting.
- Employed data augmentation techniques.
- Leveraged GPUs for parallel computing.
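As a rough illustration of these ideas (not the original AlexNet code), the sketch below shows what ReLU, Dropout, and simple data augmentation look like in PyTorch/torchvision; the .to(device) call moves computation onto a GPU when one is available:

import torch
import torch.nn as nn
from torchvision import transforms

# Data augmentation in the spirit of AlexNet: random crops and horizontal flips
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# A classifier head using ReLU activations and Dropout regularization
classifier = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),   # randomly zeroes activations during training
    nn.Linear(4096, 1000),
)

# Run on a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = classifier.to(device)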
VGGNet (2014)
VGGNet demonstrated the impact of network depth on performance by stacking more convolutional layers.
Main Features
- Used small 3×3 convolution kernels throughout.
- Reached a network depth of 16-19 layers.
- Followed a simple, uniform network design philosophy.
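For intuition, a VGG-style stage can be sketched as a stack of 3×3 convolutions followed by pooling (a simplified illustration under assumed channel sizes, not the full VGG-16 definition):

import torch.nn as nn

def vgg_stage(in_channels, out_channels, num_convs):
    """Simplified VGG-style stage: repeated 3x3 convs, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        ]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Two stacked 3x3 convs cover the receptive field of one 5x5 conv with fewer parameters
stage = vgg_stage(64, 128, num_convs=2)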
ResNet (2015)
ResNet introduced the concept of shortcut connections, addressing the degradation problem of deep networks.
Core Breakthroughs
- Design of residual blocks.
- Extremely deep network structures (up to 152 layers).
- Better propagation of gradient information through shortcut connections.
Implementation Method: PyTorch Implementation of ResNet Residual Block
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super(BasicBlock, self).__init__()
        # First 3x3 convolution; stride > 1 downsamples the feature map
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Second 3x3 convolution keeps the resolution unchanged
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Identity shortcut; use a 1x1 convolution when shape or channel count changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != self.expansion * out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, self.expansion * out_channels,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion * out_channels)
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Residual connection: add the shortcut before the final activation
        out += self.shortcut(x)
        out = self.relu(out)
        return out
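A quick shape check of this block (a usage sketch relying on the imports and BasicBlock class defined above):

# Stride-2 block: halves the spatial resolution and changes the channel count,
# so the 1x1 shortcut projection is used
block = BasicBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 128, 28, 28])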
Performance Analysis
Comparative Analysis
| Network Model | Top-1 Accuracy | Parameter Count | Computational Complexity |
|---|---|---|---|
| AlexNet | 57.1% | 60M | Medium |
| VGGNet | 71.3% | 138M | High |
| ResNet-50 | 76.2% | 25M | Low |
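Parameter counts like those in the table can be checked for any nn.Module with a small helper (a sketch; the torchvision model used here is an assumption for illustration):

import torch.nn as nn
from torchvision import models

def count_parameters(model: nn.Module) -> float:
    """Return the number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# torchvision's ResNet-50 reports roughly 25.6M parameters
print(f"ResNet-50: {count_parameters(models.resnet50()):.1f}M parameters")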
Limitations and Improvement Directions
- High computational resource demands.
- Risk of overfitting.
- Transfer learning capabilities (see the sketch after this list).
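On the last point, a common transfer learning pattern is to load a pretrained ResNet from torchvision and replace only its classification head (a sketch assuming a recent torchvision with the weights API; the 10-class task is hypothetical):

import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights and freeze the feature extractor
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 10-class task
model.fc = nn.Linear(model.fc.in_features, 10)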
Extended Application Case: Object Detection
In detection frameworks such as Faster R-CNN, a ResNet commonly serves as the backbone feature extractor. The implementation below defines such a backbone, shown here with a classification head, which detection pipelines replace with detection-specific heads.
class ResNetBackbone(nn.Module):
    def __init__(self, block, num_blocks, num_classes=1000):
        super(ResNetBackbone, self).__init__()
        self.in_channels = 64
        # Stem: 7x7 convolution plus max pooling reduces the input resolution by 4x
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Four stages of residual blocks; each later stage halves resolution and doubles channels
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        # Classification head; detection pipelines replace this with task-specific heads
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, out_channels, num_blocks, stride):
        # Only the first block of a stage downsamples; the remaining blocks keep stride 1
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_channels, out_channels, stride))
            self.in_channels = out_channels * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.maxpool(self.relu(self.bn1(self.conv1(x))))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x
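For instance, the ResNet-18-style configuration uses two BasicBlocks per stage; in a detection pipeline one would typically take the outputs of layer1 through layer4 as multi-scale features rather than the final class scores (usage sketch based on the classes defined above):

# ResNet-18-style backbone: [2, 2, 2, 2] BasicBlocks per stage
backbone = ResNetBackbone(BasicBlock, [2, 2, 2, 2], num_classes=1000)
scores = backbone(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 1000])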
Conclusion
The development of CNN demonstrates the revolutionary progress of deep learning in the field of computer vision. From AlexNet to ResNet, each generation of networks continues to break performance limits, making significant contributions to the development of artificial intelligence.
Sister Liu hopes everyone can deeply understand the design philosophies of these networks, not only to use them but also to comprehend the underlying principles. Continuous learning and innovation are the keys to success in the field of research!