Fundamentals of Convolutional Neural Networks (CNN)

Source: Deep Learning Beginner

This article is about 2,400 words; estimated reading time: 5 minutes.
This article summarizes some common knowledge about Convolutional Neural Networks.

0 Introduction

In the past few days, I have watched some videos and blogs about Convolutional Neural Networks. I have organized the knowledge and content that I find useful, clarified the logic, and written it here for my future reference, so I will never lose it.

1 Convolutional Neural Networks

As the name suggests, a Convolutional Neural Network is a combination of two ideas: convolution and neural networks. The concept of convolution comes from signal processing, where it is an operation performed on two signals, as shown in the figure below:

[Figure: convolution of two signals]

Neural networks are a cornerstone of machine learning, modeled on the working mechanism of neurons in the human brain. Each neuron is a computational unit: the inputs are multiplied by weights and summed, a bias is added, and the result is passed through an activation function to produce an output, as shown in the figure below. Many interconnected neurons form a neural network, which I will not elaborate on here.
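As a minimal sketch of this computation (not from the original article; the inputs, weights, and bias below are made-up numbers, and sigmoid is just one possible activation):

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs plus bias, through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

out = neuron([1.0, 2.0], [0.5, -0.25], 0.1)  # z = 0.5 - 0.5 + 0.1 = 0.1
```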

[Figure: the computation of a single neuron]

Convolutional Neural Networks are widely used in image classification and recognition; they were initially applied to handwritten digit classification and developed from there.

2 Image Formats

Let’s start with handwritten digit recognition. A grayscale image can be viewed as a two-dimensional numerical matrix, where each pixel’s color is represented by a grayscale value. A color image can be viewed as a stack of three such grayscale images, one per RGB channel.

[Figure: a color image as three RGB channels]

Each pixel is a numerical value, so a color image as a whole can be viewed as a three-dimensional matrix (height × width × channels).
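A small NumPy illustration of these two formats (the pixel values are arbitrary):

```python
import numpy as np

# A grayscale image is a 2-D matrix of intensity values (0-255)
gray = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# A color image stacks three such matrices, one per RGB channel
rgb = np.zeros((4, 4, 3), dtype=np.uint8)  # height x width x channels
rgb[:, :, 0] = 255  # a purely red image
```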

3 Image Convolution Operations

So what happens when we convolve a color image? The animation below illustrates the computation well. The original image has three channels (channel1-3), matched by three convolution kernels (Kernel1-3). Each channel is multiplied element-wise with its kernel and summed; the per-channel results are then added together, plus a shared bias, to produce one value in the feature map.
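A minimal NumPy sketch of this per-channel multiply-sum-add, assuming a single output kernel and no padding (the all-ones inputs are just for illustration):

```python
import numpy as np

def conv_multichannel(image, kernels, bias=0.0):
    """image: (C, H, W); kernels: (C, kh, kw), one kernel slice per channel.
    Each channel is convolved with its kernel slice, the per-channel results
    are summed, and the bias is added, giving one feature map."""
    C, H, W = image.shape
    _, kh, kw = kernels.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[:, i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernels) + bias
    return out

# 3-channel 5x5 input, 3x3 kernels: each output value is 3*3*3 + 1 = 28
fmap = conv_multichannel(np.ones((3, 5, 5)), np.ones((3, 3, 3)), bias=1.0)
```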

[Figure: animated multi-channel convolution producing one feature map]

The following image provides a three-dimensional display.

[Figure: three-dimensional view of the convolution]

4 Kernels and Feature Maps

The first question is why convolution kernels are typically 3*3. This size emerged from years of research: a 3*3 receptive field is currently considered sufficient, and its computational cost is relatively low. 1*1 kernels are also used in practice, while other sizes are comparatively rare.

The second question is where the kernel parameters come from. In fact, they are learned during training: once all kernel parameters are fixed, the model is determined. There are also hand-crafted (prior) kernels, such as the ones below, which produce sharpening or edge-extraction effects after convolution.
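As an illustration, here is one commonly cited hand-crafted edge-extraction kernel applied to a flat, featureless patch; because the kernel weights sum to zero, a uniform region produces zero response:

```python
import numpy as np

# A classic hand-crafted edge-extraction (Laplacian-style) kernel
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)

def conv2d(img, k):
    """Plain 2-D convolution (no padding, stride 1)."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

flat = np.full((5, 5), 9.0)           # a featureless region
response = conv2d(flat, edge_kernel)  # zero everywhere: no edges to find
```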

[Figures: hand-crafted kernels for sharpening and edge extraction]

Convolving an image produces a feature map, which captures certain features of the image. Convolving with different kernels yields multiple feature maps.

  • Convolution kernel: also called a filter.
  • Feature map: the result obtained by filtering the image’s pixel values.

The following two images visually demonstrate what kernels and feature maps look like.

[Figures: visualized convolution kernels and feature maps]

As a Convolutional Neural Network deepens, the spatial size (h*w) of the feature maps shrinks, while the number of extracted features (channels) grows.

[Figure: feature maps shrink spatially while the number of features grows]

5 Padding

Because of boundary effects, each convolution inevitably shrinks the image a little, which leads to the concept of padding. If padding is set to “same”, a ring of pixels (generally zeros) is added around the original image so that subsequent feature maps keep the same spatial size as the input. The default, “valid”, means the kernel slides only over valid image pixels, i.e. no padding is added.

“same” (with padding):

[Figure: convolution with a ring of zero padding]

“valid” (no padding):

[Figure: convolution without padding]

The image below demonstrates the effect of convolution with padding. One caveat: it uses a 4*4 kernel, which is not used in practice.

With a 3*3 kernel and “same” padding, the image size stays unchanged after convolution.
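The “valid” shrinkage versus “same” preservation follows from the standard output-size formula; a small sketch (the sizes below are illustrative):

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension:
    (n + 2*padding - k) // stride + 1."""
    return (n + 2 * padding - k) // stride + 1

valid = conv_output_size(5, 3, padding=0)  # "valid": a 5-wide image shrinks to 3
same = conv_output_size(5, 3, padding=1)   # "same" with a 3x3 kernel: stays 5
```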

[Figure: convolution with padding]

Image source: https://github.com/vdumoulin/conv_arithmetic

6 Stride

The image above shows a stride of 1. With a stride of 2, the kernel moves two rows or columns at a time, which reduces dimensionality and yields a smaller feature map.
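A strided convolution can be sketched as follows (the all-ones inputs are just for illustration):

```python
import numpy as np

def conv2d_strided(img, k, stride=2):
    """2-D convolution that moves the kernel `stride` pixels at a time."""
    kh, kw = k.shape
    H, W = img.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(img[r:r + kh, c:c + kw] * k)
    return out

# 5x5 input, 3x3 kernel, stride 2: the feature map is only 2x2
fmap = conv2d_strided(np.ones((5, 5)), np.ones((3, 3)), stride=2)
```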

[Figure: convolution with stride 2]

Image source: https://github.com/vdumoulin/conv_arithmetic

7 Pooling

Pooling primarily serves to reduce dimensionality, also known as downsampling, which can effectively prevent overfitting. There are mainly two pooling methods: Max pooling and Avg pooling. Typically, the pooling area is 2*2 in size, so a 4*4 image will become 2*2 after pooling.
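A minimal NumPy sketch of 2*2 max pooling (assuming even height and width; the input values are arbitrary):

```python
import numpy as np

def max_pool_2x2(img):
    """Max pooling with a 2x2 window and stride 2."""
    H, W = img.shape
    # Split into 2x2 blocks, then take the maximum of each block
    return img.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)
pooled = max_pool_2x2(x)  # 4x4 -> 2x2: [[4, 8], [12, 16]]
```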

[Figure: max pooling and average pooling]

8 Shape

In TensorFlow and PyTorch, there are differences in the structure of shapes.

TensorFlow input shape is (batch_size, height, width, in_channels) / (number of samples, image height, image width, number of image channels).

PyTorch input shape is (batch_size, in_channels, height, width).

[Figure: input, kernel, and output shapes]

In the above image,

Input image shape: [inChannels, height, width] / [3, 8, 8];

Convolution kernel shape: [outChannels, inChannels, height, width] / [5, 3, 3, 3];

Output image shape: [outChannels, outHeight, outWidth] / [5, 6, 6];

The kernel’s number of input channels (its depth) is determined by the input matrix’s channel count (inChannels). An RGB image, for example, has an input channel count of 3.

The output matrix’s number of channels (outChannels) is determined by the number of convolution kernels. For instance, in the animation below, with 8 convolution kernels the output has outChannels = 8.
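These shapes can be checked with simple arithmetic; the sizes below mirror the figure’s example of a 3-channel 8*8 input and five 3*3 kernels (a “valid” convolution with stride 1 is assumed):

```python
# Sizes from the example: 3-channel 8x8 input, five 3x3 kernels
in_channels, H, W = 3, 8, 8
out_channels, kh, kw = 5, 3, 3

tf_input = ("batch_size", H, W, in_channels)     # TensorFlow layout: NHWC
torch_input = ("batch_size", in_channels, H, W)  # PyTorch layout: NCHW
kernel_shape = (out_channels, in_channels, kh, kw)

# "valid" convolution, stride 1: each spatial dim shrinks by k - 1
output_shape = (out_channels, H - kh + 1, W - kw + 1)  # (5, 6, 6)
```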

[Figure: animated convolution with multiple kernels]

Image source: https://animatedai.github.io/

9 Epoch, Batch, Batch Size, Step

Epoch: Represents the training process where all samples in the training dataset are passed through once (and only once). In one epoch, the training algorithm will input all samples into the model in the set order for forward propagation, loss calculation, backward propagation, and parameter updates. One epoch usually contains multiple steps.

Batch: Generally translated as “batch”, representing a group of samples fed to the model at once. Training datasets are often large, for example tens or even hundreds of thousands of records; feeding all of them into the model at once would demand too much of both the hardware and the model’s learning ability. The training data is therefore divided into multiple batches, and each batch of samples goes through forward propagation, loss calculation, backward propagation, and parameter updates together. Note, however, that the term “batch” itself is not commonly used; in most cases people care only about the batch size.

Batch Size: Indicates the number of images passed to the model in a single training iteration. During the training process of neural networks, we often need to divide the training data into multiple batches; the specific number of samples in each batch is specified by the batch size.

Step: Generally translated as “step”, representing the operation of updating parameters once in an epoch. In simple terms, each time we complete training on a batch of data, we have completed one step.
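A quick sketch of how these quantities relate (the sample count, batch size, and epoch count are made-up numbers):

```python
import math

# Hypothetical training setup: 50,000 samples, batch size 128, 10 epochs
num_samples, batch_size, epochs = 50_000, 128, 10

# One step = one parameter update on one batch; the last batch may be partial
steps_per_epoch = math.ceil(num_samples / batch_size)  # 391 batches per epoch
total_steps = steps_per_epoch * epochs                 # updates over training
```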

10 Neural Networks

The convolution process described above is all about extracting features from the image; to perform classification or prediction, we still need a neural network. The convolution output is therefore usually flattened into a one-dimensional vector so it can be fed into the network’s input layer.
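A minimal sketch of the flatten step, assuming the convolution stage produced five 6*6 feature maps (the values are arbitrary):

```python
import numpy as np

# Hypothetical convolution output: 5 feature maps of size 6x6
fmaps = np.arange(5 * 6 * 6, dtype=float).reshape(5, 6, 6)

# Flatten into one 180-dimensional vector for the dense input layer
flat = fmaps.reshape(-1)
```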

In neural network models (see the figure below), fully connected (Dense) layers are a commonly used layer type in deep learning, also known as densely connected layers or multilayer perceptron layers. They can serve as input, hidden, or output layers.

[Figure: a neural network with fully connected layers]

Recommended tool for drawing neural network diagrams: NN-SVG.

11 Activation Functions

In neural networks, activation functions are used to introduce non-linearity, allowing the network to learn complex mapping relationships. Without activation functions, the output of each layer would be a linear function of the input from the previous layer. Regardless of how many layers the neural network has, the output will be a linear combination of the input. Below are some commonly used activation functions:

[Figures: common activation functions]
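As a sketch, the most common activations can be written directly with NumPy (the sample inputs are arbitrary):

```python
import numpy as np

def relu(x):
    """ReLU: clips negative values to zero."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
r = relu(x)     # negative inputs become 0
s = sigmoid(x)  # sigmoid(0) = 0.5
t = np.tanh(x)  # squashed into (-1, 1); tanh(0) = 0
```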

That’s all for now.

Editor: Huang Jiyan
