Understanding the Mathematical Essence of Convolutional Networks

Researchers from the South China University of Technology have published a paper that describes the mathematical principles of convolutional networks, explaining their operations and propagation processes from a mathematical perspective. The paper is very helpful for understanding the mathematical essence of convolutional networks, and it helps readers implement convolutional networks “by hand” (without using libraries).


In this article, we will explore the mathematical essence of convolutional networks through their architecture, component modules, and propagation processes. Readers may already understand the concrete operations of convolutional networks; beginners can refer to the first part of the Capsule paper interpretation for a detailed explanation of the convolution process. However, we generally do not focus on how convolutional networks are implemented mathematically. Because the major deep learning frameworks provide concise convolution-layer APIs, we can build all kinds of convolution layers without writing a single mathematical expression, paying attention only to the tensor dimensions of the inputs and outputs of the convolution operations. This lets us implement the network perfectly well, yet leaves us unclear about the mathematical essence and inner processes of convolutional networks, which is exactly what this article aims to address.

Below we briefly introduce the main content of this paper and attempt to understand the mathematical processes of convolutional networks. Readers with the relevant background can refer to the original paper for a deeper understanding; in addition, using the computational formulas in this paper, we can implement a simple convolutional network without relying on high-level APIs.

Convolutional Neural Networks (CNN), also known as ConvNets, are widely used in many tasks such as visual image and speech recognition. After Krizhevsky et al. first applied deep convolutional networks in the 2012 ImageNet Challenge, the design of deep convolutional neural network architectures has attracted contributions from many researchers. This has also significantly influenced the construction of deep learning frameworks such as TensorFlow, Caffe, Keras, and MXNet. Although deep learning can be implemented easily through these frameworks, the underlying mathematical theories and concepts are very difficult for beginners and practitioners to understand. This paper attempts to outline the architecture of convolutional networks and explain the mathematical derivations involving activation functions, loss functions, forward propagation, and backward propagation. In this paper, grayscale images are used as the input, ReLU and Sigmoid activation functions construct the nonlinear properties of the convolutional network, and the cross-entropy loss function measures the distance between the predicted and true values. The architecture of this convolutional network includes one convolution layer, one pooling layer, and multiple fully connected layers.

2 Architecture


Figure 2.1: Convolutional Neural Network Architecture

2.1 Convolution Layer

The convolution layer consists of a set of parallel feature maps, which are formed by sliding different convolution kernels over the input image and performing certain operations. At each sliding position, the convolution kernel performs an element-wise multiplication-and-summation with the input image, projecting the information within the receptive field onto a single element of the feature map. The step size of this sliding process is called the stride Z_s, and it is one of the factors that control the size of the output feature map. A convolution kernel is much smaller than the input image and acts on overlapping or parallel regions of it. All elements of one feature map are computed with the same convolution kernel, i.e., a feature map shares the same weights and bias.

However, using small convolution kernels leads to imperfect coverage and limits the capability of the learning algorithm. We therefore generally use zero-padding Z_p around the image to control the size of the input image; zero-padding around the image [10] also controls the size of the feature maps. During training, a set of convolution kernels of dimensions (k_1, k_2, c) slides over an input image of fixed size (H, W, C). Stride and padding are important means of controlling the dimensions of the convolution layer; the resulting feature maps are stacked together to form the convolution layer. The size of the convolution layer (feature maps) can be calculated using formula 2.1 below.

$$H_1 = \frac{H - k_1 + 2Z_p}{Z_s} + 1, \qquad W_1 = \frac{W - k_2 + 2Z_p}{Z_s} + 1, \qquad D_1 = D_n \tag{2.1}$$

Where H_1, W_1, and D_1 are the height, width, and depth of a feature map, respectively, Z_p is Padding, and Z_s is the stride size.
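
To make formula 2.1 concrete, here is a minimal Python sketch (the function name and the integer division are my own choices, assuming the sizes divide evenly):

```python
def conv_output_size(H, W, k1, k2, Z_p, Z_s, D_n):
    """Compute the feature-map dimensions (H_1, W_1, D_1) of a convolution
    layer from input size (H, W), kernel size (k1, k2), padding Z_p,
    stride Z_s, and number of kernels D_n (formula 2.1)."""
    H_1 = (H - k1 + 2 * Z_p) // Z_s + 1   # output height
    W_1 = (W - k2 + 2 * Z_p) // Z_s + 1   # output width
    D_1 = D_n                             # depth = number of kernels
    return H_1, W_1, D_1

# Example: a 28x28 grayscale image, 5x5 kernels, no padding, stride 1, 6 kernels
print(conv_output_size(28, 28, 5, 5, 0, 1, 6))  # -> (24, 24, 6)
```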

2.2 Activation Function

The activation function defines the output of a neuron given a set of inputs. We pass the weighted sum of the linear network inputs to the activation function for a nonlinear transformation. A typical activation function is based on conditional probability, returning 1 or 0 as its output value op, i.e., P(op = 1 | ip) or P(op = 0 | ip). When the network input ip exceeds a threshold, the activation function returns 1 and passes the information to the next layer; if the network input ip is below the threshold, it returns 0 and does not pass the information on. Based on this separation of relevant and irrelevant information, the activation function decides whether a neuron should be activated: the larger the network input, the greater the activation. Different types of activation functions suit different applications; some commonly used ones are shown in Table 1.


Table 1: Nonlinear Activation Functions
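
As a concrete illustration of two of these activation functions, the ReLU and Sigmoid used later in this article, here is a minimal NumPy sketch:

```python
import numpy as np

def relu(z):
    """ReLU: max(0, z), applied element-wise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid: 1 / (1 + exp(-z)), squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))     # [0. 0. 3.]
print(sigmoid(z))  # [0.119... 0.5 0.952...]
```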

2.3 Pooling Layer

The pooling layer, also called the downsampling layer, combines the outputs of a cluster of neurons in the previous layer into a single neuron in the next layer. Pooling is performed after the nonlinear activation; it helps reduce the number of parameters and avoid overfitting, and it can also serve as a smoothing technique that removes unwanted noise. The most common pooling method is simple max pooling; in some cases average pooling or L2-norm pooling is also used.

When pooling is performed over the D_n feature maps produced by the convolution kernels with a stride of Z_s, the dimensions of the pooling layer can be calculated by the following equation:

$$H_1 = \frac{H - k_1}{Z_s} + 1, \qquad W_1 = \frac{W - k_2}{Z_s} + 1, \qquad D_1 = D_n$$
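
Analogously to formula 2.1, here is a minimal Python sketch of this pooling-dimension calculation, assuming a pooling window of size (k_1, k_2) and no padding (these assumptions are mine, not spelled out in the excerpt):

```python
def pool_output_size(H, W, k1, k2, Z_s, D_n):
    """Dimensions of the pooling layer for an input of size (H, W, D_n),
    pooling window (k1, k2) and stride Z_s (no padding assumed)."""
    H_1 = (H - k1) // Z_s + 1
    W_1 = (W - k2) // Z_s + 1
    return H_1, W_1, D_n  # depth is unchanged by pooling

# Example: 24x24x6 feature maps, 2x2 max pooling with stride 2
print(pool_output_size(24, 24, 2, 2, 2, 6))  # -> (12, 12, 6)
```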

2.4 Fully Connected Layer

After the pooling layer, the three-dimensional pixel tensor needs to be converted into a single vector. These vectorized and concatenated data points are then fed into fully connected layers for classification. The fully connected layer computes a weighted sum of the features plus a bias term and feeds the result to an activation function. The architecture of the convolutional network is shown in Figure 2.1. This kind of locally connected architecture surpasses traditional machine learning algorithms on image classification problems [11][12].
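
A minimal NumPy sketch of this flatten-and-classify step; the sizes, weight initialization, and function names below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def fully_connected(pooled, W, b, activation):
    """Flatten a (H, W, D) tensor of pooled features into a vector,
    apply a weighted sum plus bias, then the activation function."""
    x = pooled.reshape(-1)          # vectorize the 3-D tensor
    z = W @ x + b                   # weighted sum of features + bias
    return activation(z)

pooled = np.random.rand(12, 12, 6)              # pooled feature maps
W = np.random.randn(10, 12 * 12 * 6) * 0.01     # 10 output neurons
b = np.zeros(10)
out = fully_connected(pooled, W, b, lambda z: np.maximum(0.0, z))  # ReLU
print(out.shape)  # (10,)
```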

2.5 Loss or Cost Function

The loss function maps an event of one or more variables to a real number that represents a cost associated with that event. It is used to measure the performance of the model, i.e., the inconsistency between the actual values y_i and the predicted values ŷ_i. The model's performance improves as the value of the loss function decreases.

If the output vector of all possible outputs is y_i = {0, 1} and the event x has a set of input variables x = (x_1, x_2, …, x_t), then the mapping from x to y_i is as follows:

$$x = (x_1, x_2, \ldots, x_t) \;\longmapsto\; \hat{y}_i, \qquad L\!\left(\hat{y}_i,\, y_i\right) \in \mathbb{R}$$

Where L(ŷ_i, y_i) is the loss function. Many types of loss functions are used for different purposes; some of them are listed below.

2.5.1 Mean Squared Error

Mean squared error, or the square loss function, is often used to evaluate the performance of linear regression models. If ŷ_i is the predicted output for the i-th of t training samples and y_i is the corresponding label value, then the mean squared error (MSE) is:

$$MSE = \frac{1}{t}\sum_{i=1}^{t}\left(y_i - \hat{y}_i\right)^{2}$$

The downside of MSE is that, when used together with the Sigmoid activation function, it can lead to slow learning (slow convergence).
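
A minimal NumPy sketch of the MSE computation:

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error over t samples: mean of (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

y     = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.8, 0.4])
print(mse(y_hat, y))  # 0.1125
```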

Other loss functions described in this section include Mean Squared Logarithmic Error, L_2 loss function, L_1 loss function, Mean Absolute Error, Mean Absolute Percentage Error, etc.

2.5.7 Cross-Entropy

The most commonly used loss function is the cross-entropy loss function, shown below. Suppose the output ŷ_i is the probability that y_i belongs to the training-set label, and 1 − ŷ_i is the probability that it does not; with the expected label y, we then have:

$$P\!\left(y \mid x\right) = \hat{y}^{\,y}\left(1 - \hat{y}\right)^{1 - y}$$

To minimize the cost function, we take the negative log-likelihood:

$$-\log P\!\left(y \mid x\right) = -\left[\, y \log \hat{y} + \left(1 - y\right)\log\!\left(1 - \hat{y}\right) \right]$$

For t training samples, the cost function is:

$$L\!\left(\hat{y},\, y\right) = -\frac{1}{t}\sum_{i=1}^{t}\left[\, y_i \log \hat{y}_i + \left(1 - y_i\right)\log\!\left(1 - \hat{y}_i\right) \right]$$
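
A minimal NumPy sketch of this cross-entropy cost; the epsilon clipping is my addition for numerical stability and is not part of the paper's formula:

```python
import numpy as np

def cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross-entropy averaged over the training samples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.1, 0.8])
print(cross_entropy(y_hat, y))  # ~0.1446
```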

3 Learning Convolutional Networks

3.1 Feedforward Inference Process

The feedforward propagation process of a convolutional network can be explained mathematically as follows: the input values are multiplied by randomly initialized weights, an initial bias term is added for each neuron, the products over all inputs are summed, and the sum is fed into the activation function, which applies a nonlinear transformation to it and outputs the activation result.

In a discrete color space, an image and a convolution kernel can be represented as three-dimensional tensors of sizes (H, W, C) and (k_1, k_2, c), respectively, where the indices m, n, c refer to the pixel in row m and column n of the c-th channel. The first two indices give the spatial coordinates and the third the color channel.

If a convolution kernel slides over a color image, the convolution operation of the multidimensional tensor can be represented as:

$$\left(I \otimes K\right)_{m,n} = \sum_{c=1}^{C}\sum_{u=1}^{k_1}\sum_{v=1}^{k_2} K_{u,v,c}\; I_{m+u-1,\; n+v-1,\; c}$$

The convolution process is denoted by the symbol ⊗. For a grayscale (scalar) image, the convolution process can be written as:

$$\left(I \otimes K\right)_{m,n} = \sum_{u=1}^{k_1}\sum_{v=1}^{k_2} K_{u,v}\; I_{m+u-1,\; n+v-1}$$

A convolution kernel (hereinafter denoted k_{p,q|u,v}) slides over the image I_{m,n} with a stride of 1 and zero-padding. The feature map (hereinafter denoted C_{p,q|m,n}) can then be calculated as

$$C_{p,q|m,n} = \sum_{u=1}^{k_1}\sum_{v=1}^{k_2} k_{p,q|u,v}\; I_{m+u-1,\; n+v-1} + b_{p,q}$$
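
A minimal NumPy sketch of this feature-map computation for a grayscale image with a single kernel, stride 1, and no padding (implemented, as is conventional in CNN code, as a cross-correlation; all names are illustrative):

```python
import numpy as np

def conv2d(I, K, b=0.0):
    """Slide kernel K over grayscale image I (stride 1, no padding) and
    compute the feature map C[m, n] = sum_{u,v} K[u, v] * I[m+u, n+v] + b."""
    H, W = I.shape
    k1, k2 = K.shape
    C = np.zeros((H - k1 + 1, W - k2 + 1))
    for m in range(C.shape[0]):
        for n in range(C.shape[1]):
            C[m, n] = np.sum(K * I[m:m + k1, n:n + k2]) + b
    return C

I = np.random.rand(28, 28)       # grayscale input image
K = np.random.randn(5, 5) * 0.1  # one 5x5 convolution kernel
print(conv2d(I, K).shape)        # (24, 24)
```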


Figure 3.1: Convolutional Neural Network

After performing convolution, we need to use a nonlinear activation function to obtain the feature map:

$$C_{p,q|m,n} = \sigma\!\left(\sum_{u=1}^{k_1}\sum_{v=1}^{k_2} k_{p,q|u,v}\; I_{m+u-1,\; n+v-1} + b_{p,q}\right)$$

Where σ is the ReLU activation function. The pooling layer P_{p,q|m,n} can be constructed by selecting the maximum value of the convolution layer in the neighborhood of (m, n); the construction of the pooling layer can be written as

$$P_{p,q|m,n} = \max\!\left\{\, C_{p,q|\,m+u,\; n+v} \;:\; 0 \le u < k_1,\; 0 \le v < k_2 \,\right\}$$
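
A minimal NumPy sketch of the ReLU activation followed by max pooling on one feature map, assuming a 2 × 2 pooling window with stride 2 (the window size is an illustrative choice):

```python
import numpy as np

def relu(C):
    """ReLU nonlinearity applied element-wise to the feature map."""
    return np.maximum(0.0, C)

def max_pool(C, k=2, stride=2):
    """Max pooling: take the maximum of each k x k window of C."""
    H, W = C.shape
    H_out, W_out = (H - k) // stride + 1, (W - k) // stride + 1
    P = np.zeros((H_out, W_out))
    for m in range(H_out):
        for n in range(W_out):
            P[m, n] = np.max(C[m * stride:m * stride + k,
                               n * stride:n * stride + k])
    return P

C = np.random.randn(24, 24)      # one convolution feature map
P = max_pool(relu(C))
print(P.shape)                   # (12, 12)
```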

The output of the pooling layer P^{p,q} can be concatenated into a vector of length p·q, which is then fed into the fully connected network for classification. The vectorized data points a^{l−1} of layer l − 1 can be calculated using the following equation:

$$a^{l-1} = \operatorname{vec}\!\left(P^{p,q}\right) = \left[\, P_{1,1|1,1},\; P_{1,1|1,2},\; \ldots,\; P_{p,q|m,n} \,\right]^{T}$$

The long vector is fed from layer l through the fully connected network up to layer L + 1. If there are L fully connected layers and n neurons, then l denotes the first fully connected layer, L the last fully connected layer, and L + 1 the classification layer shown in Figure 3.2. The feedforward process in the fully connected layer can be represented as:

$$z^{l} = W^{l}\, a^{l-1} + b^{l}, \qquad a^{l} = \sigma\!\left(z^{l}\right)$$

Figure 3.2: Feedforward Process in the Fully Connected Layer

As shown in Figure 3.3, consider a single neuron j in fully connected layer l. The input values a_{l−1,i} are weighted by the weights w_{ij}, and a bias term b_{l,j} is added; the resulting input value z_{l,j} of layer l is then fed into the nonlinear activation function σ. The input value can be calculated using the following equation,

$$z_{l,j} = \sum_{i} w_{ij}\, a_{l-1,i} + b_{l,j}$$

Where z_{l,j} is the input value of the activation function for neuron j in layer l. The output of this neuron is then:

$$a_{l,j} = \sigma\!\left(z_{l,j}\right)$$

Thus, the output of layer l is

$$a^{l} = \sigma\!\left(z^{l}\right)$$


Figure 3.3: Feedforward Process of Neuron j in Layer l

$$a^{l} = \sigma\!\left(W^{l}\, a^{l-1} + b^{l}\right)$$

Where a^l is

$$a^{l} = \left[\, a_{l,1},\; a_{l,2},\; \ldots,\; a_{l,n} \,\right]^{T}$$

W^l is

$$W^{l} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix}$$

Similarly, the output value of the last layer L is

$$a^{L} = \sigma\!\left(z^{L}\right)$$

Where

$$z^{L} = W^{L}\, a^{L-1} + b^{L}$$

Extending this to the classification layer, the final predicted output ŷ_i of neuron unit i in layer L + 1 can be represented as:

$$\hat{y}_{i} = \sigma\!\left(z_{L+1,\,i}\right) = \sigma\!\left(\sum_{j} w_{ij}\, a_{L,j} + b_{L+1,\,i}\right)$$
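
Putting the fully connected equations together, here is a minimal NumPy sketch of the feedforward pass a^l = σ(W^l a^{l−1} + b^l) through the hidden layers with ReLU, followed by a sigmoid output layer; the layer sizes and initialization are illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Compute a^l = sigma(W^l a^{l-1} + b^l) layer by layer:
    ReLU for the fully connected layers, sigmoid for the output layer L+1."""
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    W, b = weights[-1], biases[-1]
    return sigmoid(W @ a + b)      # predicted y_hat

a0 = np.random.rand(864)           # flattened pooling output (12*12*6)
sizes = [864, 128, 64, 10]         # layer widths, ending in 10 classes
weights = [np.random.randn(o, i) * 0.01 for i, o in zip(sizes[:-1], sizes[1:])]
biases  = [np.zeros(o) for o in sizes[1:]]
print(feedforward(a0, weights, biases).shape)  # (10,)
```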

If the predicted value is ŷ_i and the actual label value is y_i, the performance of the model can be calculated with the following loss function. According to Eqn. 2.14, the cross-entropy loss function is:

$$L\!\left(\hat{y},\, y\right) = -\frac{1}{t}\sum_{i=1}^{t}\left[\, y_i \log \hat{y}_i + \left(1 - y_i\right)\log\!\left(1 - \hat{y}_i\right) \right]$$

This is a brief overview of the mathematical process of forward propagation. This paper also emphasizes the mathematical process of backward propagation; however, due to space limitations, we do not present it in this article. Interested readers can refer to the original paper.

4 Conclusion

This article provides an overview and explanation of the architecture of convolutional neural networks, including different activation functions and loss functions, while detailing the steps of feedforward and backward propagation. For mathematical clarity, grayscale images are used as the input. The stride of the convolution kernel is set to 1, and padding is used. The nonlinear transformations of the intermediate and last layers are performed by the ReLU and sigmoid activation functions, respectively. The cross-entropy loss function is used to measure the model's performance. However, many optimization and regularization steps are still needed to minimize the loss function, speed up learning, and avoid overfitting. This paper considers only a typical convolutional neural network architecture optimized with gradient descent.
