Fundamentals of Convolutional Neural Networks (CNN)

0 Introduction

In the past few days, I have watched several videos and read blog posts about convolutional neural networks. I have organized the material I found useful, clarified the logic, and written it down here for future reference, so I can easily revisit it later.

1 Convolutional Neural Networks

As the name suggests, a convolutional neural network is a combination of two things: convolution and a neural network. The concept of convolution comes from the field of signal processing, where it generally refers to the convolution of two signals, as shown in the figure below:

[Figure: convolution of two signals]

Neural networks are one of the oldest ideas in machine learning, modeled on the working mechanism of neurons in the human brain. Each neuron is a computational unit: the inputs are multiplied by weights, summed, and a bias is added; the result is then passed through an activation function to produce the output, as shown in the figure below. Many such neurons connected together form a neural network, but I won’t elaborate on that here.

[Figure: computation performed by a single neuron]
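As a minimal sketch of what a single neuron computes, here is the weighted sum plus bias followed by an activation function; the input, weight, and bias values are made-up illustrative numbers:

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: weighted sum of inputs plus bias, passed through an activation."""
    z = np.dot(w, x) + b             # weighted sum + bias
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid activation

# Illustrative values only
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.7, -0.2])   # weights (learned during training)
b = 0.1                          # bias
print(neuron(x, w, b))
```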

Convolutional neural networks are widely used in image classification and recognition; they were first applied to handwritten digit classification and have developed steadily since.

2 Image Formats

Let’s start with handwritten character recognition. A grayscale image can be viewed as a two-dimensional numerical matrix, where each pixel is represented by a grayscale value. A color image can be viewed as a stack of three such matrices, one for each of the R, G, and B channels.

[Figures: grayscale and RGB image representations]

Each pixel of an image is simply a number, so a color image as a whole can be viewed as a three-dimensional matrix.
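For example, a grayscale image can be stored as a 2-D array and an RGB image as a 3-D array; the 28×28 size below is just an illustration:

```python
import numpy as np

gray = np.zeros((28, 28), dtype=np.uint8)     # grayscale: height x width
rgb  = np.zeros((28, 28, 3), dtype=np.uint8)  # color: height x width x 3 channels (R, G, B)

print(gray.shape)  # (28, 28)
print(rgb.shape)   # (28, 28, 3)
```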

3 Image Convolution Operations

So what happens when we perform convolution on a color image? The animation below illustrates the calculation well. The original image has three RGB channels (channel 1-3), corresponding to three convolution kernels (Kernel 1-3). Each channel is multiplied element-wise with its corresponding kernel and accumulated, and the three per-channel results are then summed together with an overall bias to produce one value in the feature map.

[Animation: convolution over the three RGB channels]

The following image provides a three-dimensional representation.

[Figure: three-dimensional view of the convolution]
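A rough NumPy sketch of the computation described above: at each position, every channel is multiplied element-wise with its own kernel, the partial sums are added together with a single bias, and the result is one value in the feature map. The shapes and the stride-1, no-padding setup are assumptions for illustration:

```python
import numpy as np

def conv_single_output(image, kernels, bias):
    """image: (3, H, W); kernels: (3, k, k), one 2-D kernel per input channel.
    Returns the (H-k+1, W-k+1) feature map for stride 1 and no padding."""
    c, h, w = image.shape
    k = kernels.shape[-1]
    out = np.zeros((h - k + 1, w - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[:, i:i+k, j:j+k]              # 3 x k x k patch
            out[i, j] = np.sum(patch * kernels) + bias  # multiply-accumulate over all channels + bias
    return out

rgb = np.random.rand(3, 8, 8)      # toy RGB input
kernels = np.random.rand(3, 3, 3)  # Kernel 1-3, one per channel
feature_map = conv_single_output(rgb, kernels, bias=0.5)
print(feature_map.shape)           # (6, 6)
```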

4 Kernel and Feature Map

One question here is why the convolution kernel is 3×3. This size was settled on through years of research: a 3×3 receptive field is currently considered sufficient while keeping the computation relatively low. 1×1 kernels are also used, while other sizes are rarely used.

The second question is how the parameters in the convolution kernel are obtained. These parameters are learned during training; once all the kernel parameters have been tuned, the model is determined. There are also hand-crafted (prior) kernels, such as the ones below, which produce sharpening and edge-extraction effects after convolution.

[Figures: hand-crafted kernels for sharpening and edge extraction]
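As a small illustration of such hand-crafted kernels, the sketch below applies a common sharpening kernel and an edge-detection kernel to a grayscale image with SciPy; the kernel values here are standard textbook examples and are not taken from the figures above:

```python
import numpy as np
from scipy.signal import convolve2d

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])

image = np.random.rand(8, 8)                         # stand-in for a grayscale image
sharpened = convolve2d(image, sharpen, mode='same')  # output has the same size as the input
edges     = convolve2d(image, edge, mode='same')
print(sharpened.shape, edges.shape)                  # (8, 8) (8, 8)
```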

Convolving an image produces a feature map, which captures certain features of the image. Convolving with different kernels yields multiple feature maps.

Convolution kernels (Kernels) are also called filters.

The feature map is the result obtained after filtering the image pixel values.

The following two images intuitively show the actual appearance of kernels and feature maps.

[Figures: what kernels and feature maps actually look like]

As a convolutional neural network gets deeper, the spatial dimensions of the image (h×w) shrink, while the number of extracted feature maps grows.

[Figure: spatial size decreases while the number of feature maps increases]

5 Padding

Because of boundary effects, the image inevitably shrinks a little after each convolution. This is where padding comes in. If padding is set to "same", a ring of pixels, usually zeros, is added around the original image so that the output has the same dimensions as the input. The default is "valid", which means the kernel only operates on valid image pixels, i.e. no zero padding is added.

padding="same"

[Figure: convolution with zero padding]

padding="valid"

[Figure: convolution without padding]

The image below shows the effect of convolution with padding. One caveat: it uses a 4×4 kernel, which is not used in practice.

With a 3×3 kernel and one ring of padding, the image dimensions are preserved after convolution.

[Animation: convolution with padding]

Image source: https://github.com/vdumoulin/conv_arithmetic
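A minimal sketch of the two padding options in Keras; the layer arguments here (8 filters, 3×3 kernel, 28×28 input) are just example values:

```python
import tensorflow as tf

x = tf.random.normal((1, 28, 28, 3))  # (batch, height, width, channels)

same  = tf.keras.layers.Conv2D(8, 3, padding='same')(x)   # zero-pads, keeps 28x28
valid = tf.keras.layers.Conv2D(8, 3, padding='valid')(x)  # no padding, shrinks to 26x26

print(same.shape)   # (1, 28, 28, 8)
print(valid.shape)  # (1, 26, 26, 8)
```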

6 Stride

The images above show a stride of 1. With a stride of 2, the kernel moves two pixels at a time across rows and columns, which effectively downsamples the image and yields a smaller feature map after convolution.

[Animation: convolution with a stride of 2]

Image source: https://github.com/vdumoulin/conv_arithmetic
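The output size for a given input size, kernel size, padding, and stride follows a simple formula; a quick sketch with arbitrary example numbers:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(8, 3))       # stride 1, no padding -> 6
print(conv_output_size(8, 3, p=1))  # "same" padding with a 3x3 kernel -> 8
print(conv_output_size(8, 3, s=2))  # stride 2 -> 3
```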

7 Pooling

Pooling is mainly used to reduce dimensionality (also called downsampling) and helps prevent overfitting. There are two main types: max pooling and average pooling. The pooling window is typically 2×2, so a 4×4 image becomes 2×2 after pooling.

[Figure: max pooling and average pooling]
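A brief PyTorch sketch of both pooling types with a 2×2 window on a toy input:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)              # (batch, channels, height, width)

max_pool = nn.MaxPool2d(kernel_size=2)   # 2x2 window, stride defaults to the window size
avg_pool = nn.AvgPool2d(kernel_size=2)

print(max_pool(x).shape)  # torch.Size([1, 1, 2, 2])
print(avg_pool(x).shape)  # torch.Size([1, 1, 2, 2])
```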

8 Shape

TensorFlow and PyTorch use different shape conventions.

TensorFlow input shape is (batch_size, height, width, in_channels) (sample count, image height, image width, image channels).

PyTorch input shape is (batch_size, in_channels, height, width).

[Figure: input, kernel, and output shapes]

In the above image,

Input image shape: [inChannels, height, width] / [3,8,8];

Convolution kernel shape: [outChannels, inChannels, height, width] / [5,3,3,3];

Output image shape: [outChannels, outHeight, outWidth] / [5,6,6];

The number of input channels of the convolution kernel (in depth) is determined by the number of channels in the input matrix (inChannels). For example, in an RGB formatted image, the number of input channels is 3.

The number of output channels of the output matrix (out depth) is determined by the number of convolution kernels. For instance, in the animation below, if there are 8 convolution kernels, the output outChannels will be 8.

[Animation: multiple kernels producing multiple output channels]

Image source: https://animatedai.github.io/
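The shapes listed above can be checked with a small PyTorch sketch; a batch size of 1 is added here for illustration:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 8, 8)  # (batch, inChannels, height, width)
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)
y = conv(x)

print(conv.weight.shape)  # torch.Size([5, 3, 3, 3]) -> (outChannels, inChannels, kH, kW)
print(y.shape)            # torch.Size([1, 5, 6, 6]) -> (batch, outChannels, outHeight, outWidth)
```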

9 Epoch, Batch, Batch Size, Step

Epoch: indicates the training process where all samples in the training dataset are passed through once (and only once). In one epoch, the training algorithm will sequentially input all samples into the model for forward propagation, calculate loss, perform backpropagation, and update parameters. One epoch typically contains multiple steps.

Batch: a group of samples fed to the model at once. In neural network training, the training data is often large, with tens or even hundreds of thousands of samples. If all of this data were fed into the model at once, the computational load on the machine and the demands on the model’s learning capacity would be too high. Therefore, the training data is divided into multiple batches, and each batch is passed through the model for forward propagation, loss calculation, backpropagation, and parameter updates. Note, however, that the term batch itself is not used that often; in most cases people only talk about the batch size.

Batch Size: the number of samples passed to the model in a single training step. During neural network training, we usually need to divide the training data into multiple batches; the number of samples in each batch is specified by the batch size.

Step: a single parameter update performed by the model. In simple terms, each time one batch of data is trained, one step is completed; one epoch therefore contains multiple steps.
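The relationship between these terms comes down to a couple of lines of arithmetic; the sample count and batch size below are made-up numbers:

```python
import math

num_samples = 50_000  # training set size (illustrative)
batch_size  = 128
epochs      = 10

steps_per_epoch = math.ceil(num_samples / batch_size)  # one step = one parameter update
total_steps     = steps_per_epoch * epochs

print(steps_per_epoch)  # 391
print(total_steps)      # 3910
```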

10 Neural Networks

All of the convolution processing described above is really about extracting features from the image. To actually classify or predict, we need a neural network. Therefore, after the convolution stages, the data generally needs to be flattened into a one-dimensional vector before it is fed into the neural network’s input layer.

In the neural network model (see the image below), fully connected (Dense) layers are a commonly used layer type in deep learning, also known as densely connected or multilayer perceptron layers. They can serve as input, hidden, or output layers.

[Figure: fully connected neural network]

Recommended tool for drawing neural network diagrams: NN-SVG.
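A minimal sketch of this overall structure in Keras: convolution and pooling for feature extraction, then Flatten, then fully connected (Dense) layers for classification. The layer sizes and the 10-class output are assumptions for illustration, not a prescribed architecture:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),                       # convert feature maps to a 1-D vector
    tf.keras.layers.Dense(64, activation='relu'),    # hidden fully connected layer
    tf.keras.layers.Dense(10, activation='softmax')  # output layer, e.g. 10 digit classes
])

model.summary()
```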

11 Activation Functions

In neural networks, activation functions are used to introduce non-linearity, allowing the network to learn complex mappings. Without activation functions, the output of each layer is a linear function of the previous layer’s input, so no matter how many layers the network has, the output is always a linear combination of the input. Below are some commonly used activation functions.

[Figures: common activation functions]
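The most common of these can be written in a few lines of NumPy; the four below (sigmoid, tanh, ReLU, leaky ReLU) are typical examples and may not match every curve in the figures above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep='\n')
```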

Editor / Zhang Zhihong

Reviewed / Fan Ruiqiang

Checked / Zhang Zhihong
