How to Use CNN for Image Recognition Tasks

The input layer reads in normalized images of uniform size. Each neuron takes a small local neighborhood of units in the previous layer as its input, exploiting local receptive fields and weight sharing. Early neurons extract basic visual features such as edges and corners, which higher-level neurons then combine. Convolutional Neural Networks (CNNs) obtain feature maps through convolution operations; at each position, units in different feature maps extract different types of features. A convolutional layer typically contains multiple feature maps with distinct weight vectors, so that richer image features are retained. Pooling layers follow convolutional layers and perform down-sampling, which reduces the image resolution and the parameter count while providing robustness against translations and deformations. Alternating convolutional and pooling layers gradually increase the number of feature maps while decreasing their resolution, forming a dual-pyramid structure.
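The convolution and pooling operations described above can be sketched in plain NumPy. The vertical-edge kernel and the 8x8 test image below are illustrative choices, not from the original article; real CNNs learn their kernel weights during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take a
    weighted sum at each position, producing one feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: down-sample by keeping the strongest
    response in each size x size window."""
    h, w = fmap.shape
    out = np.zeros((h // size, w // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

# A hand-crafted vertical-edge kernel applied to a synthetic 8x8 image
image = np.zeros((8, 8))
image[:, 4:] = 1.0                       # bright right half: one vertical edge
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
fmap = conv2d(image, edge_kernel)        # 6x6 feature map; responds at the edge
pooled = max_pool(fmap)                  # 3x3 after 2x2 down-sampling
```

The feature map responds only where the edge lies under the kernel, and pooling halves the resolution while keeping that response, which is the translation-robustness the paragraph describes.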

Features of CNN

(1) CNNs have advantages that traditional techniques lack: good fault tolerance, parallel processing capability, and self-learning ability. They can handle complex environmental information, unclear background knowledge, and ambiguous inference rules, tolerating samples with significant defects and distortions, while operating quickly, adapting well, and discriminating finely. They integrate feature extraction into the multi-layer perceptron itself through structural reorganization and weight reduction, omitting the complex image feature extraction step that traditionally precedes recognition.

(2) Their generalization ability is significantly better than that of other methods. CNNs have been applied to pattern classification, object detection, and object recognition. A pattern classifier can be built with a CNN, treating it as a universal classifier that can be applied directly to grayscale images.

(3) CNNs are feedforward neural networks that can extract the topological structure of a 2D image. They use the backpropagation algorithm to optimize the network structure and solve for the network's unknown parameters.

(4) They are multi-layer neural networks designed specifically for 2D data. CNNs are considered the first truly successful deep learning method to employ a robust multi-layer hierarchical structure. By exploiting the spatial correlations in data, CNNs reduce the number of trainable parameters in the network, improving the efficiency of backpropagation in this feedforward architecture. Because CNNs require very little data preprocessing, they are also regarded as a deep learning method. In a CNN, small regions of an image (known as 'local receptive fields') serve as the lowest-level input of the hierarchy; information propagates through the successive layers of the network, each composed of filters that capture salient features of the observed data. Since local receptive fields can capture basic features such as edges and corners, this method provides a degree of invariance to displacement, stretching, and rotation.

(5) The tight connections between layers and spatial information in CNNs make them particularly suitable for image processing and understanding, enabling automatic extraction of rich relevant characteristics from images.

(6) CNNs fully utilize features such as locality contained in the data by combining local receptive fields, weight sharing, and spatial or temporal down-sampling to optimize network structures, ensuring a certain degree of invariance to displacement and deformation.

(7) CNNs are a machine learning model under deep supervised learning. They have strong adaptability, are adept at mining local features of data, and extract global features for classification. Their weight-sharing structure makes them more similar to biological neural networks, and they have achieved excellent results in many areas of pattern recognition.

(8) CNNs can be used to recognize 2D or 3D images with invariance to displacement, scaling, and other forms of distortion. The parameters of the feature extraction layers in CNNs are learned from training data, thus avoiding manual feature extraction and instead learning directly from training data. Moreover, the neurons in the same feature map share weights, reducing the network parameters, which is a significant advantage of convolutional networks over fully connected networks. This special structure of shared local weights is closer to real biological neural networks, giving CNNs unique advantages in image processing and speech recognition. On the other hand, weight sharing also reduces the complexity of the network, and the characteristic of allowing multi-dimensional input signals (speech, images) to be directly input into the network avoids the data reordering process during feature extraction and classification.
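The parameter savings from weight sharing can be made concrete with a little arithmetic. The layer sizes below (a 28x28 grayscale input, eight 5x5 filters) are illustrative assumptions, not figures from the article:

```python
# Hypothetical layer sizes for illustration
in_h, in_w = 28, 28          # e.g. a 28x28 grayscale input image
num_filters, f = 8, 5        # 8 feature maps, each with a 5x5 kernel

# Convolutional layer: every neuron in a feature map shares one kernel
# and one bias, so the parameter count is independent of the image size
conv_params = num_filters * (f * f + 1)

# A fully connected layer producing the same number of output units
# needs a separate weight per input pixel for every output neuron
out_h, out_w = in_h - f + 1, in_w - f + 1      # 24 x 24 units per map
fc_params = (in_h * in_w + 1) * (num_filters * out_h * out_w)

print(conv_params)   # 208
print(fc_params)     # 3,617,280
```

Even at this toy scale, sharing weights cuts the parameter count by roughly four orders of magnitude, which is the advantage over fully connected networks the paragraph refers to.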

(9) The classification model of CNNs differs from traditional models in that a 2D image can be fed directly into the model, which then produces the classification result at its output end. Its advantage lies in not requiring complex preprocessing: feature extraction and pattern classification are fully encapsulated in a black box that is continuously optimized to obtain the parameters the network needs, with the required classification provided at the output layer. The core of the network lies in the design of the network structure and in solving for its parameters. This structure performs better than many earlier algorithms.

(10) The number of parameters in a hidden layer is unrelated to the number of neurons in that layer; it depends only on the size of the filters and the number of filter types. The number of hidden-layer neurons, by contrast, depends on the original image size (the number of input neurons), the filter size, and the filter's sliding stride across the image.
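These two counting rules can be written as a small helper. The formulas assume a square input, square filters, and no padding; the example numbers (a 32x32 input with six 5x5 filters at stride 1) match the first layer of a LeNet-style network:

```python
def conv_layer_stats(n_in, f, stride, num_filters):
    """Parameter and neuron counts for one convolutional layer
    (square n_in x n_in input, f x f filters, no padding)."""
    n_out = (n_in - f) // stride + 1          # output side length
    params = num_filters * (f * f + 1)        # weights + bias per filter type only
    neurons = num_filters * n_out * n_out     # depends on image size and stride
    return params, neurons

print(conv_layer_stats(32, 5, 1, 6))          # (156, 4704)
```

Note that `params` never mentions `n_in` or `stride`, while `neurons` depends on both, exactly as the paragraph states.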

Solving CNN

Essentially, a CNN is a mapping from input to output: it can learn a large number of input-output mapping relationships without requiring any precise mathematical expression between input and output. By training the convolutional network on known patterns, the network gains the ability to map between input-output pairs.

The convolutional network is trained with supervision, so its sample set consists of vector pairs of the form (input vector, ideal output vector). All of these vector pairs should come from the actual 'running' behavior of the system the network is to simulate, and they can be collected from that system in operation.

(1) Parameter Initialization:

Before training begins, all weights should be initialized with distinct small random numbers. 'Small' ensures that the network does not enter a saturated state due to excessively large weights, which would cause training to fail; 'distinct' ensures that the network can learn normally. In fact, if the weight matrix is initialized with a single identical value, the network has no ability to learn.
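A minimal sketch of the two cases, assuming a 5x5 weight matrix and a Gaussian with standard deviation 0.01 (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Distinct small random values: breaks symmetry between neurons and keeps
# pre-activations small enough to stay out of the sigmoid's saturated region
W_good = rng.normal(loc=0.0, scale=0.01, size=(5, 5))

# Constant initialization: every neuron computes the same function and
# receives the same gradient, so the layer can never learn distinct features
W_bad = np.full((5, 5), 0.01)
```

With `W_bad`, all rows stay identical under gradient descent forever (the symmetry is never broken), which is why identical initialization leaves the network with no learning capability.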

(2) The training process includes four steps:

① First Stage: Forward Propagation Stage

  • Take a sample from the sample set and input it into the network.

  • Calculate the corresponding actual output; during this stage, information is transmitted from the input layer through successive transformations to the output layer, which is also the process executed when the network operates normally after training.

② Second Stage: Backward Propagation Stage

  • Calculate the difference between the actual output and the corresponding ideal output.

  • Adjust the weight matrix according to the method of minimizing errors.
The training process of the network is as follows:
① Select a training group, randomly seeking N samples from the sample set as the training group;
② Set all weights and thresholds to small random values close to 0, and initialize precision control parameters and learning rates;
③ Add an input pattern from the training group to the network and provide its target output vector;
④ Calculate the output vector of the intermediate layer and the actual output vector of the network;
⑤ Compare the elements in the output vector with those in the target vector to calculate the output error; errors for the hidden units in the intermediate layer also need to be calculated;
⑥ Sequentially calculate the adjustment amounts for each weight and threshold;
⑦ Adjust weights and thresholds;
⑧ After M iterations, check whether the error meets the precision requirement. If not, return to ③ and continue iterating; if it does, proceed to the next step;
⑨ Training ends; save the weights and thresholds to a file. At this point, all weights can be considered stable and the classifier has been formed. For subsequent training, the weights and thresholds can be loaded directly from the file, with no need for re-initialization.
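The nine steps above can be sketched end-to-end on a tiny fully connected network (the convolutional case differs only in how the forward and backward passes are computed). Everything below is an illustrative assumption: the 8 four-feature samples, the sign-of-sum target, the layer sizes, the sigmoid/MSE choice, and the omission of thresholds (biases) for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Steps 1-2: select N training samples; small random weights near 0
X = rng.normal(size=(8, 4))                    # N=8 samples, 4 features each
T = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # ideal output vectors
W1 = rng.normal(scale=0.1, size=(4, 5))        # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(5, 1))        # hidden -> output weights
lr, eps = 0.5, 1e-3                            # learning rate, precision target

for epoch in range(5000):                      # step 8: iterate until precise enough
    # Steps 3-4: forward pass -- intermediate-layer and actual output vectors
    H = sigmoid(X @ W1)
    Y = sigmoid(H @ W2)
    # Step 5: output error, then hidden-unit errors via backpropagation
    err_out = (Y - T) * Y * (1 - Y)
    err_hid = (err_out @ W2.T) * H * (1 - H)
    # Steps 6-7: compute the adjustment for each weight and apply it
    W2 -= lr * H.T @ err_out
    W1 -= lr * X.T @ err_hid
    if np.mean((Y - T) ** 2) < eps:
        break

# Step 9: weights have stabilized; save them so later runs can skip initialization
np.savez("weights.npz", W1=W1, W2=W2)
```

The two stages of the training process are visible in each iteration: the forward-propagation stage (computing `H` and `Y`) and the backward-propagation stage (computing the errors and adjusting the weight matrices to minimize them).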

Link: https://blog.csdn.net/jiaoyangwm/article/details/80011656

Source: Xiaobai Learns Vision

Author: Daidai Cat

This article is for technical information exchange only. If there is any infringement, please contact for deletion.
