What is a Convolutional Neural Network (CNN)? A CNN is a deep learning model architecture, typically composed of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Its core idea is to extract local features from the data through convolution operations and to learn higher-level abstract representations layer by layer. CNNs are mainly used in image processing tasks such as object detection and face recognition.
Structure of a CNN: a CNN resembles a pipeline that processes data layer by layer: Input Layer – [Convolutional Layer – (Batch Normalization) – ReLU Activation – Pooling Layer] – Fully Connected Layer – Output Layer, where the bracketed block is typically repeated several times.
(1) Function of the Input Layer: receives raw data, such as a 28×28 handwritten digit image. Format: usually a three-dimensional array of height × width × channels. For example, a grayscale image [28×28×1] has one channel, while an RGB color image [224×224×3] has three channels.
(2) Function of the Convolutional Layer: extracts local features such as edges, lines, colors, and shapes. Method: a convolutional kernel slides over the input image, performing a convolution operation (a weighted summation) at each position. Calculation formula: Feature Map = Input * Filter + Bias. Key hyperparameters: filter (kernel) size; stride – how many pixels the kernel moves at each step; padding – whether to pad zeros at the edges to prevent the output from shrinking. We can think of the filter as a “magnifying glass” sliding over different areas of the image, with each filter focusing on a different specific pattern such as edges or corners.
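As a minimal sketch of the sliding weighted sum described above (not the text's own code; the image and filter values are made-up illustrations, and like most deep learning frameworks this actually computes cross-correlation):

```python
import numpy as np

def conv2d(image, kernel, stride=1, bias=0.0):
    """Valid (no padding) 2D convolution: slide the kernel and take a weighted sum + bias."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

# Toy 6x6 image: bright left half, dark right half (a vertical edge in the middle).
image = np.array([[1, 1, 1, 0, 0, 0]] * 6, dtype=float)

# A classic vertical-edge filter.
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)

fmap = conv2d(image, edge_filter)
print(fmap.shape)  # (4, 4), since (6 - 3) / 1 + 1 = 4
print(fmap[0])     # [0. 3. 3. 0.] -- strong response exactly at the edge
```

Note how the feature map responds only where the bright/dark boundary sits, which is what “each filter focuses on a specific pattern” means in practice.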
(3) Function of the ReLU Activation Layer: Increases non-linearity, allowing the model to learn complex features. Calculation method: f(x) = max(0, x). This function allows negative values to become 0, removing meaningless information while retaining positive values that preserve important features. For instance, if certain pixels in the image are useless for recognition, such as overly dark areas, ReLU acts like a “filter,” zeroing them out.
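The element-wise f(x) = max(0, x) rule above is a one-liner in NumPy; the feature-map values here are invented for illustration:

```python
import numpy as np

# ReLU: negative activations become 0, positive ones pass through unchanged.
feature_map = np.array([[-2.0, 0.5],
                        [ 3.0, -1.0]])
activated = np.maximum(0, feature_map)
# negatives zeroed: [[0.0, 0.5], [3.0, 0.0]]
```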
(4) Function of the Pooling Layer: Reduces dimensionality, decreases computational load while retaining the most important information. Method: Max pooling or average pooling. Max pooling takes the maximum value from the region, while average pooling takes the average value. Key parameters: Pooling window size and stride. We can understand pooling as the process of shrinking the image, where the image is reduced in size but we still retain the most crucial information. For example, when recognizing a cat image, we do not need every hair’s information, just the overall outline.
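A minimal sketch of max pooling with a 2×2 window and stride 2 (the input values are made up; average pooling would simply replace `.max()` with `.mean()`):

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: keep only the strongest response in each window."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 1, 8]], dtype=float)
pooled = max_pool(x)
print(pooled)  # [[6. 4.] [7. 9.]] -- a 4x4 map shrunk to 2x2, strongest values kept
```

This is the “shrinking the image” intuition from the text: the map is a quarter of its original size, yet each region's strongest feature survives.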
(5) Function of the Fully Connected Layer: combines the features extracted earlier to form the final classification decision. Every neuron is connected to all activations of the previous layer, as in a traditional neural network. Method: flatten the output of the last convolutional/pooling layer into a 1D vector, then use a Softmax or Sigmoid activation function to compute class probabilities. The fully connected layer is like the final step of a decision process, integrating all the evidence to deliver a judgment.
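The flatten-then-classify step can be sketched as follows (a toy example: the 4×4×8 feature volume, the 10-class weight matrix, and the random values are all assumptions for illustration, not the text's own setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is the last feature volume from the conv/pool stack: 4x4 spatial, 8 channels.
features = rng.standard_normal((4, 4, 8))
flat = features.reshape(-1)                      # flattened to a 1D vector of 128 values

# One dense (fully connected) layer mapping 128 features to 10 class scores.
W = rng.standard_normal((10, flat.size)) * 0.01  # small random weights (illustrative)
b = np.zeros(10)
logits = W @ flat + b

def softmax(z):
    z = z - z.max()                              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(logits)
print(flat.shape)   # (128,)
print(probs.sum())  # probabilities sum to 1
```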
(6) Function of the Output Layer: Provides the final classification result. Method:
For classification tasks, use Softmax to calculate the probability for each category.
For regression tasks, directly output continuous values.
This is akin to a judge making a final ruling after hearing all the evidence.
In summary, the workflow of a CNN is as follows:
(1) Input an image (say, a cat photo).
(2) A convolutional layer extracts low-level features (e.g., edges, contours).
(3) ReLU introduces non-linearity.
(4) A pooling layer reduces dimensionality, decreasing the computational load.
(5) Stacked convolution + pooling layers extract higher-level features (e.g., cat ears, whiskers).
(6) The fully connected layer integrates all features and makes a final judgment.
(7) The output layer produces the classification result (e.g., 90% likely a cat, 10% likely a dog).
Three properties make CNNs efficient: shared parameters (each convolutional kernel is reused across the whole image, reducing computation), local connectivity (each neuron looks only at a small region, so global connections are unnecessary), and translation invariance (a feature remains detectable regardless of where it appears in the image).
A brief note on Batch Normalization (BN), which is optional. Its main purpose is to stabilize the training process and improve the model's convergence speed. It is commonly applied when: (1) the network is deep – with many layers, gradients easily vanish or explode; (2) training is unstable – the loss fluctuates heavily, making the learning rate hard to tune; (3) a large learning rate is desired – BN makes training with a large learning rate more stable; (4) the data distribution shifts significantly – BN can alleviate the “covariate shift” problem; (5) sensitivity to initialization should be reduced – BN makes the network less sensitive to parameter initialization.
Therefore, using Batch Normalization can: (1) accelerate training – BN makes the network less sensitive to the learning rate, allowing a larger learning rate and faster convergence; (2) alleviate gradient vanishing and explosion – by keeping intermediate activation values stable, it prevents gradients from becoming too large or too small; (3) reduce dependence on parameter initialization – less tuning of initialization methods is needed; (4) reduce overfitting to some extent – the noise introduced by per-mini-batch statistics has a regularization-like effect.
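The core of BN during training can be sketched in a few lines: normalize each feature over the mini-batch, then scale by gamma and shift by beta (the mini-batch values below are invented, and this omits the running statistics used at inference time):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time BN: normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)              # per-feature mean over the batch
    var = x.var(axis=0)                # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A mini-batch of 4 samples with 3 features on wildly different scales.
x = np.array([[1.0, 100.0, -5.0],
              [2.0, 200.0, -6.0],
              [3.0, 300.0, -7.0],
              [4.0, 400.0, -8.0]])
out = batch_norm(x)
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```

After normalization every feature has roughly zero mean and unit variance regardless of its original scale, which is exactly the activation stability the benefits above rely on.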