
First, let us visualize where “Deep Learning” sits. The following diagram illustrates the relationship between AI, Machine Learning, and Deep Learning.
AI is a relatively broad field; Machine Learning is a subfield of AI, and Deep Learning is in turn a subset of Machine Learning.
Deep learning algorithms have recently become increasingly popular and useful, and much of the success of deep learning, or deep neural networks, is attributable to the continuous emergence of new neural network architectures. In this article, the author reviews the development of deep neural network architectures over the past 18 years, starting from 1998.
From the axes in the diagram, we can see that the horizontal axis represents operational complexity and the vertical axis represents accuracy. Initially, as the number of model parameters (weights) grew, model size increased and accuracy improved. However, with the emergence of network architectures such as ResNet, GoogLeNet, and Inception, the number of parameters has steadily decreased while the same or even higher accuracy is achieved. It is worth noting that moving right on the horizontal axis does not necessarily mean longer computation time: the focus here is not on timing statistics but on comparing model parameters against network accuracy.
Among the various networks, the author identifies several that are essential and worth learning: AlexNet, LeNet, GoogLeNet, VGG-16, and NiN.
The First Generation of Artificial Neural Networks
In 1943, psychologist Warren McCulloch and mathematician Walter Pitts proposed the concept of artificial neural networks and the mathematical model of artificial neurons in their collaborative paper “A logical calculus of the ideas immanent in nervous activity”, thus initiating the era of artificial neural network research. In 1949, psychologist Donald Hebb described the neuron learning rule in his paper “The Organization of Behavior”.
Later, the American neuroscientist Frank Rosenblatt proposed a machine that could simulate human perceptual capabilities, calling it the “Perceptron”. In 1957, he successfully simulated the Perceptron on an IBM 704 machine at the Cornell Aeronautical Laboratory, and in 1960 he built the Perceptron-based neural computer Mark I, which could recognize some English letters.
The first generation of neural networks could classify simple shapes (such as triangles and quadrilaterals), and people gradually came to see this approach as a promising route toward machines with capabilities resembling human sensation, learning, memory, and recognition.
However, structural flaws of the first-generation neural networks limited their development. In the Perceptron, the parameters of the feature-extraction layer were adjusted by hand, which contradicted the requirement for “intelligence”. Moreover, the single-layer structure limited what it could learn: many simple functions, such as XOR, lay beyond its reach.
The Second Generation of Neural Networks
In 1985, Geoffrey Hinton used multiple hidden layers to replace the single feature layer in the Perceptron, employing the Back-propagation (BP) algorithm (proposed in 1969, practical in 1974) to calculate network parameters.
In 1989, Yann LeCun and others used deep neural networks to recognize handwritten characters in mail. LeCun later applied Convolutional Neural Networks (CNNs) to recognize handwritten characters on bank checks, achieving commercial-grade recognition accuracy. Despite the algorithm’s great success, training on the dataset took about three days.
The network structure is divided into an input layer, multiple hidden layers, and an output layer. Before training the network, weights are randomly initialized, and network parameters are adjusted using the BP algorithm.
The BP algorithm does not always work well. Even with stochastic gradient descent, BP can easily get stuck in local optima, and as the number of layers in the network increases, training becomes more difficult.
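To make the training procedure concrete, the following is a minimal sketch of such a network: two layers with randomly initialized weights, trained by BP and gradient descent on the XOR problem, one of the simple functions a single-layer perceptron cannot learn. The architecture, learning rate, and iteration count are illustrative assumptions, not details from the original work.

```python
import numpy as np

# Minimal sketch (illustrative only): a two-layer network with randomly
# initialized weights, trained by back-propagation and gradient descent on XOR.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initialization of both layers (input -> hidden -> output).
W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
lr = 1.0  # learning rate (an arbitrary choice for this toy example)

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error signal layer by layer (BP).
    d_out = (out - y) * out * (1 - out)   # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient at the hidden layer

    # Gradient-descent parameter updates.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

# With a favorable initialization the outputs approach [0, 1, 1, 0];
# an unlucky initialization can stall in a poor local optimum.
print(out.round(3).ravel())
```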
The second generation neural networks have the following main drawbacks:
1. They must be trained on labeled data and cannot learn from unlabeled data.
2. As the number of layers increases, the error signals propagated back by BP become progressively weaker (vanishing gradients), which limits the practical depth of the network.
3. Propagating signals forward and backward through multiple hidden layers makes training slow.
4. Training can get stuck in local optima.
5. Many settings, such as the number of layers and the number of units per layer, must be chosen manually by experience and skill, which restricted the development of neural networks.
Subsequently, researchers attempted to enlarge datasets and improve weight initialization to overcome these shortcomings of artificial neural networks. However, the emergence of the Support Vector Machine (SVM) pushed research on artificial neural networks into a winter period.
The SVM, with its simple structure, trains quickly and is relatively easy to implement. However, that same simplicity means it handles simple features well but not complex ones. Learning with an SVM requires problem-specific prior knowledge, yet universal prior knowledge is hard to come by. Furthermore, an SVM does not select its own features; they must be extracted manually.
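To illustrate this last point, here is a small sketch (for illustration only, not from the article) in which a few features are hand-crafted from 8×8 digit images and fed to an SVM: the classifier works, but the feature design is entirely manual. The dataset, feature choice, and kernel are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative sketch: an SVM learns from hand-crafted features, but the
# features themselves are designed by a human, not learned by the model.
digits = load_digits()
images, labels = digits.images, digits.target   # 8x8 grayscale digit images

# Manually extracted features: mean intensity of each row and each column.
features = np.concatenate([images.mean(axis=2), images.mean(axis=1)], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
print("accuracy with hand-crafted features:", clf.score(X_test, y_test))
```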
Although the SVM performs well in certain areas, its shallow structure is a fatal limitation, and it did not turn out to be a promising long-term direction for artificial intelligence.
The Principle of Human Visual Perception
In 1958, David Hubel and Torsten Wiesel studied the correspondence between the pupil region and neurons in the cerebral cortex, discovering orientation-selective cells in the visual cortex at the back of the brain. The cortex performs low-level abstraction of raw signals and iterates step by step toward higher-level abstractions.
Further research indicates that the cerebral cortex, which underlies many human cognitive abilities, does not explicitly preprocess sensory signals; instead, it passes them through a complex hierarchy of modules that, over time, learns to represent them according to the regularities present in the observations.
In summary, information processing in the human visual system is hierarchical: edge features are extracted in the low-level V1 area, shapes or object parts are represented in V2, and higher levels represent whole objects and their behaviors. In other words, higher-level features are combinations of lower-level features, and the feature representations become increasingly abstract from lower to higher levels. This physiological discovery contributed to breakthroughs in artificial intelligence forty years later.
Around 1995, Bruno Olshausen and David Field used physiological and computational methods together to study vision. They proposed a sparse coding algorithm, iterating over 400 image patches to select the best basis (weight) coefficients. Remarkably, the selected basis functions were mostly edge segments of the various objects in the photographs, similar in shape but differing in orientation.
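The sketch below reproduces the flavor of that experiment with modern off-the-shelf tools; the sample image, patch size, and dictionary size are illustrative assumptions rather than the settings of the original study.

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

# Illustrative sketch in the spirit of Olshausen and Field: learn a sparse
# dictionary from small natural-image patches; the learned atoms tend to
# look like oriented edge segments.
img = load_sample_image("china.jpg").mean(axis=2)          # grayscale natural image
patches = extract_patches_2d(img, (8, 8), max_patches=400, random_state=0)
patches = patches.reshape(len(patches), -1).astype(float)
patches -= patches.mean(axis=1, keepdims=True)             # remove each patch's mean

# Sparse coding: represent each patch as a sparse combination of dictionary atoms.
dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0)
dico.fit(patches)

# Reshape the learned atoms back to 8x8; visualizing them typically shows
# edge-like segments of similar shape but different orientation.
edges = dico.components_.reshape(64, 8, 8)
print(edges.shape)
```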
The findings of Olshausen and Field align with the physiological discoveries made by Hubel and Wiesel some forty years earlier. Further research indicates that information processing in deep neural networks is hierarchical in a similar way, moving from low-level edge features to high-level abstract representations through a complex hierarchical structure.
Research has found that this pattern exists not only in images but also in sound. Scientists identified 20 basic acoustic structures in unlabeled sounds, with the remaining sounds composed of these basic structures. In 1997, the Long Short-Term Memory network (LSTM, a special type of RNN) was proposed and showed good results in natural language understanding.
The Rise and Development of Deep Neural Networks
In 2006, Hinton proposed the Deep Belief Network (DBN), a deep network model. It employs a greedy, layer-wise unsupervised training method that eases optimization and achieves good results. The DBN training procedure reduces the difficulty of learning the hidden-layer parameters, and its training time scales roughly linearly with the size and depth of the network.
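A rough sketch of the greedy, layer-wise idea follows (an illustration, not Hinton's original implementation): each layer is trained as a restricted Boltzmann machine on the outputs of the layer below, using no labels at all. The dataset, layer sizes, and hyperparameters are assumed for the example.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

# Illustrative sketch of greedy, layer-wise unsupervised pre-training in the
# DBN spirit: train one RBM per layer, bottom-up, without using labels.
X = load_digits().data / 16.0          # scale pixel values to [0, 1]

layers = []
h = X
for n_hidden in (128, 64):             # illustrative layer sizes
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                       n_iter=20, random_state=0)
    h = rbm.fit_transform(h)           # hidden activations feed the next layer
    layers.append(rbm)

print("top-layer feature shape:", h.shape)
# These unsupervised features can then be fine-tuned with labeled data,
# for example using BP, which is the usual DBN recipe.
```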
Unlike traditional shallow learning, deep learning emphasizes the depth of the model structure and makes explicit the importance of feature learning. Through layer-by-layer feature transformations, it maps a sample's representation in the original feature space into a new feature space in which classification or prediction is easier. Compared with constructing features by hand-crafted rules, learning features from big data better captures the rich intrinsic information of the data.
Compared to shallow models, deep models have immense potential. With massive amounts of data, it is easy to achieve higher accuracy by increasing the model size. Deep models can perform unsupervised feature extraction, directly handle unlabeled data, and learn structured features; therefore, deep learning is also referred to as Unsupervised Feature Learning. With the use of high-performance computing devices like GPUs and FPGAs, the emergence of neural network hardware, and distributed deep learning systems, the training time for deep learning has been significantly reduced, allowing people to enhance learning speed simply by increasing the number of devices used. The emergence of deep network models has enabled countless difficult problems to be solved, making deep learning the hottest research direction in the field of artificial intelligence.
In 2010, the U.S. Defense Advanced Research Projects Agency (DARPA) funded deep learning projects for the first time.
In 2011, researchers from Microsoft Research and Google’s speech recognition team adopted DNN technology to reduce speech recognition error rates by 20%-30%, marking the biggest breakthrough in the field in a decade.
In 2012, Hinton's team reduced the Top-5 error rate on the ImageNet image classification task from 26% to 15%. That same year, Andrew Ng and Jeff Dean established the Google Brain project, training a deep network with over 1 billion neurons on a parallel computing platform with 16,000 CPU cores, achieving breakthrough progress in speech and image recognition.
In 2013, Hinton’s DNN Research company was acquired by Google, and Yann LeCun joined Facebook’s artificial intelligence lab.
In 2014, Google improved its speech recognition accuracy from 84% in 2012 to 98%, with a 25% improvement in speech recognition accuracy on the mobile Android system. In face recognition, Google's FaceNet system achieved 99.63% accuracy on LFW.
In 2015, Microsoft used a deep residual learning method (ResNet) to reduce the ImageNet classification error rate to 3.57%, below the human error rate of 5.1% measured in comparable experiments, with the network reaching 152 layers.
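The core of residual learning can be sketched in a few lines (a toy illustration, not Microsoft's actual 152-layer architecture): each block learns only a residual mapping F(x) and outputs F(x) + x, so the identity shortcut lets signals and gradients pass through very deep stacks of layers. The layer sizes and weight scales below are arbitrary assumptions.

```python
import numpy as np

# Illustrative sketch of residual learning with fully connected layers.
def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Two layers that learn the residual F(x), plus an identity shortcut."""
    residual = relu(x @ W1) @ W2     # F(x): the learned residual mapping
    return relu(residual + x)        # output = F(x) + x

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(1, d))

# Stack several blocks; with small weights each block stays close to the
# identity mapping, which is what makes very deep networks trainable.
for _ in range(10):
    W1 = rng.normal(scale=0.01, size=(d, d))
    W2 = rng.normal(scale=0.01, size=(d, d))
    x = residual_block(x, W1, W2)

print(x.shape)
```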
In 2016, DeepMind's deep-learning-based Go program AlphaGo, running on a cluster of 1,920 CPUs and 280 GPUs, defeated the human Go champion Lee Sedol.
Research on deep learning in China is also accelerating:
In 2012, Huawei established the “Noah’s Ark Lab” in Hong Kong, focusing on research in natural language processing, data mining, machine learning, social media, and human-computer interaction.
In 2013, Baidu established the Institute of Deep Learning (IDL), applying deep learning to speech recognition as well as image recognition and retrieval. In 2014, Andrew Ng joined Baidu.
In 2013, Tencent began building Mariana, a deep learning platform that provides parallel implementations of deep learning algorithms for various application areas, including recognition and advertising recommendation.
In 2015, Alibaba released the DTPAI artificial intelligence platform, which includes open modules for deep learning.
Research on deep learning has permeated various fields of life and has become a primary development direction of artificial intelligence technology. The ultimate goal of artificial intelligence is to enable machines to possess inductive capabilities, learning abilities, analytical capabilities, and logical thinking abilities comparable to humans. Although current technology is still far from this goal, deep learning undoubtedly provides a possible pathway for machines to surpass human capabilities in specific domains.
This article is sourced from: Mathematics China
