Introduction
This article uses visual explanations to make the underlying mathematics easier to follow. It focuses on how neural networks work, concentrating on topics specific to CNNs: the data structure of digital images, strided convolution, connection pruning and parameter sharing, backpropagation through convolutional layers, pooling layers, and more.
Original Title: Gentle Dive into Math Behind Convolutional Neural Networks
Fields such as autonomous driving, smart healthcare, and self-service retail were considered impossible until recently; computer vision has helped bring them within reach. Today, the dream of autonomous vehicles or automated grocery stores no longer sounds so far-fetched. In fact, we use computer vision every day: when we unlock our phones with facial recognition or auto-enhance photos before posting them on social media. Convolutional neural networks may be the most important building block behind this tremendous success. This time, we will deepen our understanding of how neural networks work by looking at CNNs. Fair warning: this article includes fairly complex mathematical equations, but do not be discouraged if you are not comfortable with linear algebra and calculus. My goal is not to make you memorize formulas but to give you an intuitive understanding of what is happening under the hood.
Note: In this article, I mainly focus on some typical issues of CNNs. If you are looking for more basic information about deep neural networks, I suggest you read my other articles in this series (https://towardsdatascience.com/https-medium-com-piotr-skalski92-deep-dive-into-deep-networks-math-17660bc376ba). As always, you can find the complete source code with visualizations and annotations on my GitHub (https://github.com/SkalskiP/ILearnDeepLearning.py). Let’s get started!
We are already familiar with densely connected neural networks. Their neurons are grouped into consecutive layers, and each neuron is connected to every neuron in the adjacent layer. The figure below shows an example of this architecture.

This approach works well when we solve a classification problem based on a limited set of hand-crafted features, for example predicting a soccer player's position on the pitch from his match statistics. The situation becomes more complicated with photos. Of course, we could treat the brightness of each pixel as a separate feature and pass it as input to our dense network. Unfortunately, to make this work for a typical smartphone photo, the network would need to contain tens or even hundreds of millions of neurons. We could instead shrink the photos, but in the process we would lose valuable information. We quickly see that the traditional strategy does not work; we need a new, efficient way to use as much of the data as possible while reducing the required amount of computation and parameters. This is where CNNs come into play.
Let's take a moment to explain how digital images are stored. Most of you probably know that an image is really a matrix of numbers, each corresponding to the brightness of a single pixel. In the RGB model, a color image consists of three such matrices, one per color channel: red, green, and blue. A grayscale image needs only one matrix. Each matrix stores values from 0 to 255. This range is a compromise between efficient storage of image information (a value below 256 fits in a single byte) and the sensitivity of the human eye (we can distinguish only a limited number of shades of the same color).
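As a quick illustration, here is a minimal NumPy sketch (with a made-up 4×4 image; the variable names are mine) showing what those matrices look like in code:

```python
import numpy as np

# A hypothetical 4x4 RGB image: three matrices (channels) of brightness values, 0-255, one byte each.
rgb_image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(rgb_image.shape)    # (4, 4, 3) -> height, width, channels
print(rgb_image[..., 0])  # the red channel alone is just a 4x4 matrix

# A grayscale image needs only a single matrix.
gray_image = rgb_image.mean(axis=2).astype(np.uint8)
print(gray_image.shape)   # (4, 4)
```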

Convolution is not only used in neural networks; it is also a key component of many other computer vision algorithms. In this process, we take a small matrix of numbers (called a kernel or filter), slide it over the image, and transform the image based on the filter values. Subsequent feature map values are computed with the following formula, where the input image is denoted by f, our kernel by h, and the row and column indices of the result matrix by m and n, respectively.
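The formula itself (written out in LaTeX here, since the original figure is not reproduced) is the standard discrete 2D convolution:

$$
G[m, n] = (f * h)[m, n] = \sum_{j} \sum_{k} h[j, k] \, f[m - j, n - k]
$$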


After placing the filter over a selected pixel, we take each value from the kernel and multiply it pairwise with the corresponding value in the image. Finally, we sum everything up and place the result in the corresponding position of the output feature map. Above we can see in detail how a single such operation is carried out, but more interesting is what we can achieve by performing kernel convolution over a full image. Figure 4 shows the results of convolving with several different filters.
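To make the multiply-and-sum procedure concrete, here is a minimal NumPy sketch of a "valid" convolution (function and variable names are mine; like most deep-learning code, it slides the kernel without flipping it):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image, multiply element-wise, and sum each window."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1      # only positions where the kernel fully fits
    output = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            window = image[m:m + kh, n:n + kw]       # region currently covered by the filter
            output[m, n] = np.sum(window * kernel)   # element-wise product, then sum
    return output

image = np.arange(36, dtype=float).reshape(6, 6)     # toy 6x6 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                   # a simple vertical-edge filter
print(convolve2d_valid(image, kernel).shape)         # (4, 4), as in Figure 3
```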

As shown in Figure 3, when we convolve a 6×6 image with a 3×3 kernel, we get a 4×4 feature map. This is because there are only 16 distinct positions where the filter can be placed inside the image. Since every convolution shrinks the image, we can perform only a limited number of convolutions before the image disappears entirely. What is more, if we watch how the kernel moves across the image, we see that pixels at the edges have much less influence than pixels near the center, so we lose some of the information contained in the image. The diagram below shows how a pixel's position affects its influence on the feature map.
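In numbers, with no padding and a stride of 1, the filter fits along each dimension in

$$
6 - 3 + 1 = 4 \text{ positions}, \qquad 4 \times 4 = 16 \text{ positions in total.}
$$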

To solve both problems, we can pad the image with an extra border. For example, with 1 px of padding we enlarge a 6×6 photo to 8×8, and the output of convolving it with a 3×3 filter becomes 6×6. In practice, the padded area is usually filled with zeros. Depending on whether we pad, we distinguish two types of convolution: valid and same. The naming is somewhat unfortunate, so for clarity: valid means we use only the original image, while same means we also use the border around it, so that the input and output images have the same size. In the second case, the padding width should satisfy the following equation, where p is the padding width and f is the filter dimension (usually odd).
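Written out (reconstructed from the definitions above), the condition for same convolution is:

$$
p = \frac{f - 1}{2}
$$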



In the previous examples we always moved the kernel by one pixel at a time, but the stride can also be treated as a hyperparameter of the convolutional layer. Figure 6 shows what convolution looks like with a larger stride. When designing a CNN architecture, we may decide to increase the stride if we want the receptive fields to overlap less or if we want the spatial dimensions of the feature map to be smaller. The size of the output matrix, taking padding and stride into account, can be calculated with the following formula.
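With n as the image size, f the filter size, p the padding, and s the stride, the output dimension (reconstructed here in LaTeX) is:

$$
n_{out} = \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1
$$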

Convolution over volumes (three-dimensional convolution) is a very important concept: it lets us work with color images and, even more importantly, apply multiple kernels within a single layer. The first important rule is that the filter and the image it is applied to must have the same number of channels. Essentially, we proceed much as in the example in Figure 3, except that this time we multiply pairs of values in three dimensions. If we want to apply several filters to the same image, we convolve with each of them separately, stack the results on top of each other, and combine them into a single whole. The dimensions of the resulting tensor (i.e., our three-dimensional matrix) satisfy the following equation, where n is the image size, f the filter size, n_c the number of channels in the image, p the padding used, s the stride used, and n_f the number of filters.
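In symbols (reconstructed from the definitions above), convolving an n × n × n_c tensor with n_f filters of size f × f × n_c gives:

$$
[n, n, n_c] * [f, f, n_c] \;\longrightarrow\; \left[\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1,\; \left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1,\; n_f\right]
$$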


Now it is time to apply what we have learned today and build a layer of our CNN. The approach is almost the same as the one we used for densely connected neural networks; the only difference is that instead of a simple matrix multiplication we use convolution. Forward propagation consists of two steps. The first step is to compute the intermediate value Z by convolving the input data from the previous layer with the weight tensor W (which holds all the filters) and then adding the bias b. The second step is to apply a nonlinear activation function to that intermediate value (the activation is denoted by g). Readers interested in the matrix equations can find the corresponding formulas below. If any of the operations are unclear, I strongly recommend my previous article, where I discuss the details of densely connected neural networks. By the way, in the diagram below you can see a small visualization describing the dimensions of the tensors used in the equations.
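Written out for layer l (reconstructed in LaTeX from the description above), the two steps are:

$$
Z^{[l]} = W^{[l]} * A^{[l-1]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right)
$$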

Connection Pruning and Parameter Sharing
At the beginning of the article I mentioned that densely connected neural networks are poorly suited to processing images because they would have to learn a huge number of parameters. Now that we understand what convolution is, let's consider how it optimizes the computation. In the diagram below, the two-dimensional convolution is shown in a slightly different way: the neurons labeled 1-9 form the input layer and receive the brightness values of the image pixels, units A-D denote the computed feature map elements, and I-IV are the kernel values that have to be learned.

Now, let’s focus on two very important properties of the convolution layer. First, you can see that not all neurons in consecutive layers are connected to each other. For example, neuron 1 only affects the value of A. Secondly, we see that some neurons share the same weights. These two properties mean that we need to learn far fewer parameters. By the way, it is worth noting that one value in the filter affects every element in the feature map—this is very important in the backpropagation process.
Backpropagation of Convolutional Layers
Anyone who has ever tried to write their own neural network from scratch knows that getting forward propagation right is not even half of the battle. The real fun begins when you want to go back. Nowadays we do not need to worry about backpropagation ourselves, because deep learning frameworks handle it for us, but I think it is worth understanding what happens underneath. Just as in densely connected networks, our goal is to compute derivatives and then use them to update our parameter values in a process called gradient descent.
In our calculations, we need to use the chain rule—which I mentioned in previous articles. We want to evaluate how changes in parameters affect the final feature map and subsequently the final result. Before we start discussing the details, let’s unify the mathematical symbols used—so as to simplify the process, I will abandon the full notation of partial derivatives and use a shorter notation as shown below. But remember, when I use this notation, I always refer to the partial derivative of the loss function.
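Concretely, the shorthand used below (with L denoting the loss function) is:

$$
dA \equiv \frac{\partial L}{\partial A}, \qquad dZ \equiv \frac{\partial L}{\partial Z}, \qquad dW \equiv \frac{\partial L}{\partial W}, \qquad db \equiv \frac{\partial L}{\partial b}
$$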


Our task is to compute dW[l] and db[l], which are the derivatives with respect to the current layer's parameters, as well as dA[l-1], which will be passed on to the previous layer. As shown in Figure 10, we receive dA[l] as input. Naturally, the tensors dW and W, db and b, and dA and A have matching dimensions. The first step is to obtain the intermediate value dZ[l] by applying the derivative of the activation function to Z[l] and, following the chain rule, multiplying it element-wise by dA[l]. The result of this operation will be used later.
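In shorthand (reconstructed from the description above), this first step reads:

$$
dZ^{[l]} = dA^{[l]} \cdot g'\!\left(Z^{[l]}\right)
$$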

Now we need to handle the backpropagation of the convolution itself. To achieve this we use a matrix operation called full convolution, illustrated in the figure below. Note that during this operation the kernel is rotated by 180 degrees. The operation can be described by the following formula, where the filter is denoted by W and dZ[m,n] is a scalar taken from the partial derivatives of the previous layer.
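One standard way to write this full convolution (with the kernel effectively rotated 180°; the exact indexing convention is my reconstruction) is:

$$
dA^{[l-1]}[i, j] = \sum_{m} \sum_{n} W^{[l]}[i - m,\, j - n] \cdot dZ^{[l]}[m, n]
$$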


Besides convolutional layers, CNNs very often use so-called pooling layers. They are used mainly to reduce the size of the tensor and to speed up computation. These layers are simple: we divide the image into regions and then perform some operation on each of them. For example, a max pooling layer selects the maximum value from each region and places it in the corresponding position of the output. As with convolutional layers, we have two hyperparameters available: the filter size and the stride. Last but not least, if pooling is performed on a multi-channel image, it is applied to each channel separately.
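Here is a minimal NumPy sketch of max pooling on a single channel (the function name and the toy feature map are mine, not from the article's repository):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Divide the single-channel input into regions and keep the maximum of each."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            output[i, j] = region.max()
    return output

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [7., 2., 9., 8.],
                 [1., 0., 3., 4.]])
print(max_pool(fmap))   # [[6. 5.]
                        #  [7. 9.]]
```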

In this article we discuss only the backpropagation of max pooling, but the rules we learn apply, with minor adjustments, to all types of pooling layers. Since a layer of this type has no parameters to update, our task is simply to distribute the gradients appropriately. As we remember, during the forward pass of max pooling we select the maximum value from each region and pass it on to the next layer. It is therefore clear that during backpropagation the gradient should not affect elements of the matrix that were not selected in the forward pass. In practice this is done with a mask that remembers the positions of the values used in the first phase; we can later use it to route the gradients.
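A minimal sketch of that masking idea, continuing the NumPy example above (again, the function name is mine):

```python
import numpy as np

def max_pool_backward(d_output, feature_map, size=2, stride=2):
    """Route each incoming gradient only to the position that held the maximum."""
    d_input = np.zeros_like(feature_map)
    out_h, out_w = d_output.shape
    for i in range(out_h):
        for j in range(out_w):
            region = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            mask = (region == region.max())             # remembers where the maximum was
            d_input[i * stride:i * stride + size,
                    j * stride:j * stride + size] += mask * d_output[i, j]
    return d_input

fmap = np.array([[1., 3., 2., 1.],
                 [4., 6., 5., 0.],
                 [7., 2., 9., 8.],
                 [1., 0., 3., 4.]])
d_out = np.ones((2, 2))                                 # gradient arriving from the next layer
print(max_pool_backward(d_out, fmap))                   # nonzero only where 6, 5, 7, 9 were
```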

Congratulations on making it this far, and thank you for taking the time to read this article. If you enjoyed it, consider sharing it with a friend, or two, or even five. If you notice any mistakes in my reasoning, formulas, animations, or code, please let me know.
This article is another part of the “Mysteries of Neural Networks” series. If you haven’t had the chance to read the other articles, please check them out (https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a). Also, if you like the work I do, follow me on Twitter and Medium, and check out other projects I’m working on, like GitHub (https://github.com/SkalskiP) and Kaggle (https://www.kaggle.com/skalskip). Stay curious!
via: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9