Machine Learning
A magnificent history of artificial intelligence development
The victory of AlphaGo, the success of autonomous driving, and breakthroughs in pattern recognition have repeatedly captured public attention as artificial intelligence advances at a rapid pace. As the core of artificial intelligence, machine learning has drawn particular attention amid this progress, shining brightly.
Today, the applications of machine learning span across various branches of artificial intelligence, such as expert systems, automated reasoning, natural language understanding, pattern recognition, computer vision, and intelligent robotics.
However, we may never have imagined that the origins of machine learning, and even of artificial intelligence itself, lie in the exploration of philosophical questions about human consciousness, self, and mind. Along the way, the field has also absorbed knowledge from disciplines such as statistics, neuroscience, information theory, control theory, and computational complexity theory.
Overall, the development of machine learning is a significant branch in the history of artificial intelligence development. The stories within are full of twists and turns, surprising and captivating.
Countless stories of brilliant individuals are interspersed throughout, and in the introduction that follows you will meet many of these legendary figures as we move along the timeline of machine learning's progress.
The Enthusiastic Foundational Period
From the early 1950s to the mid-1960s
In 1949, Hebb took the first step in machine learning, basing it on neuropsychological learning mechanisms; this later became known as the Hebb learning rule. The Hebb learning rule is an unsupervised learning rule, and the result of this learning is that the network can extract the statistical properties of the training set, thereby classifying input information into several categories by similarity. This aligns closely with how humans observe and understand the world, which largely involves classifying things according to their statistical features.
The Hebb rule updates each connection weight in proportion to the product of its input and output (Δw = η · x · y, with learning rate η), so frequently occurring patterns exert a greater influence on the weight vector. For this reason, the Hebb learning rule needs a preset weight saturation value to prevent the weights from growing without bound when input and output consistently share the same sign.
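As an illustration (not from the original article), here is a minimal numpy sketch of Hebbian updating; the learning rate eta, the cap w_max standing in for the weight saturation value, and the toy input distribution are all hypothetical choices.

```python
import numpy as np

def hebb_update(w, x, eta=0.1, w_max=5.0):
    """One Hebbian step: the weight change is proportional to input times output."""
    y = w @ x                             # output of a single linear unit
    w = w + eta * y * x                   # delta_w = eta * y * x
    return np.clip(w, -w_max, w_max)      # saturation keeps the weights bounded

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=3)
for _ in range(200):
    # inputs drawn around a dominant pattern; that pattern comes to dominate the weights
    x = rng.normal(size=3) + np.array([1.0, 0.0, -1.0])
    w = hebb_update(w, x)
print(w)
```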
The Hebb learning rule is consistent with the mechanism of the "conditioned reflex" and has since been supported by studies of real neurons. Pavlov's conditioned-reflex experiment is the classic example: every time he fed the dog he rang a bell, and over time the dog came to associate the bell with food, so that if the bell rang without food being given, the dog would still salivate.
In 1950, Alan Turing proposed the Turing Test to determine whether a computer is intelligent: if a machine can hold a conversation with a human (over an electronic link) without being identified as a machine, it is considered intelligent. This simplification allowed Turing to argue convincingly that a "thinking machine" is possible.
On June 8, 2014, a computer (Eugene Goostman, a chatbot program) successfully convinced humans that it was a 13-year-old boy, becoming the first computer in history to pass the Turing Test. This is considered a milestone event in the development of artificial intelligence.
In 1952, IBM scientist Arthur Samuel developed a checkers program. The program could observe the current board position and learn an implicit model that gave better guidance for later moves, and Samuel found that the longer the program ran, the better its guidance became.
With this program, Samuel challenged the claim that machines cannot go beyond what humans explicitly program them to do and cannot learn the way humans do. He coined the term "machine learning" and defined it as "the study of how to provide computers with the ability to learn without explicit programming."
In 1957, Rosenblatt, drawing on a background in neuro-perceptual science, proposed a model that closely resembles today's machine learning models. This was a very exciting discovery at the time and more broadly applicable than Hebb's idea. Based on this model, Rosenblatt designed the first computer neural network – the perceptron – which simulates the way the human brain operates.
Three years later, Widrow first applied the delta learning rule to the perceptron's training procedure; this method later became known as the least mean squares (LMS) rule. The combination of the two produced a good linear classifier.
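To make the idea concrete, here is a minimal sketch (not from the article) of a perceptron trained with a delta-style error-correction update on a hypothetical toy problem (logical AND); the learning rate and epoch count are illustrative.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            error = target - pred        # delta-style rule: update proportional to the error
            w += eta * error * xi
            b += eta * error
    return w, b

# Linearly separable toy data: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([1 if xi @ w + b > 0 else 0 for xi in X])  # expected: [0, 0, 0, 1]
```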
In 1967, the nearest neighbor algorithm emerged, allowing computers to perform simple pattern recognition. The core idea of the kNN algorithm is that if most of a sample's k nearest neighbors in feature space belong to a particular category, then the sample is assigned to that category and shares the characteristics of samples in it. The classification decision is therefore based only on the categories of the one or few nearest samples.
The advantage of kNN lies in its ease of understanding and implementation, requiring no parameter estimation or training, making it suitable for classifying rare events, especially for multi-class problems (multi-modal, objects with multiple category labels), and it can even outperform SVM in some cases.
In 2002, Han et al. used a weight-adjusted k-nearest-neighbor method (WAkNN), tuned with a greedy procedure, for document classification in order to improve results; in 2004, Li et al. proposed that, because the number of documents differs across categories, the number of nearest neighbors used in classification should be chosen per category according to how many documents of that category appear in the training set.
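A minimal kNN sketch is shown below; it is illustrative only, with hypothetical toy data and k = 3, and it uses a plain majority vote rather than the weighted variants mentioned above.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]             # majority class among the neighbors

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # "A"
print(knn_predict(X_train, y_train, np.array([8, 7])))  # "B"
```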
In 1969, Marvin Minsky brought perceptron research to a turning point. He raised the famous XOR problem, pointing out that a single-layer perceptron cannot handle data that is not linearly separable.
Minsky also combined artificial intelligence with robotics, developing Robot C, one of the world's earliest robots able to simulate human activities, and lifting robotics to a new level. Another significant move by Minsky was his role in founding the famous "Thinking Machines" company, which built intelligent computers.
After that, research on neural networks went dormant until the 1980s. Although the core idea behind backpropagation (BP) neural networks had already appeared around 1970, it did not attract enough attention at the time.
The Stagnant Period of Calm
From the mid-1960s to the late 1970s
From the mid-60s to the late 70s, the development of machine learning was almost at a standstill. Although Winston's structural learning system and Hayes-Roth's logic-based inductive learning system made significant progress during this period, they could only learn single concepts and were never applied in practice. In addition, neural network learning machines fell short of expectations because of theoretical flaws, and the field entered a lull.
The research goal of this period was to simulate the human process of concept learning, using logical structures or graph structures as the machine's internal representation. Machines could describe concepts with symbols (symbolic concept acquisition) and propose various hypotheses about the concepts being learned.
In fact, the entire AI field encountered a bottleneck during this period. The limited memory and processing speed of computers at that time were insufficient to solve any practical AI problems. Researchers quickly found that the requirement for programs to have a child-level understanding of the world was too high: in 1970, no one could create such a vast database, nor did anyone know how a program could learn such rich information.
The Revival Period of Renewed Hope
From the late 1970s to the mid-1980s
Starting from the late 70s, people expanded from learning single concepts to learning multiple concepts, exploring different learning strategies and various learning methods. During this period, machine learning gradually returned to public attention and slowly revived.
In 1980, the First International Conference on Machine Learning was held at Carnegie Mellon University (CMU) in the United States, marking the rise of machine learning research worldwide. Subsequently, machine inductive learning entered practical applications.
After some setbacks, Werbos proposed in 1981 applying the backpropagation (BP) algorithm to the multilayer perceptron (MLP). BP remains a key ingredient of today's neural network architectures, and with these new ideas, research on neural networks accelerated again.
From 1985 to 1986, neural network researchers (Rumelhart, Hinton, Williams, and others) successively put forward the idea of training the MLP with BP.
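As a concrete, hedged sketch (not from the article), the numpy snippet below trains a tiny MLP with backpropagation on the XOR problem that defeated the single-layer perceptron; the network size, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer (4 units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

eta = 0.5
for _ in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: propagate the output error back through the layers
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # gradient-descent updates
    W2 -= eta * h.T @ d_out; b2 -= eta * d_out.sum(axis=0)
    W1 -= eta * X.T @ d_h;   b1 -= eta * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```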
A very famous ML algorithm was proposed by Quinlan in 1986: the decision tree algorithm, or more precisely the ID3 algorithm. It was another spark in mainstream machine learning. In contrast to black-box neural network models, the ID3 decision tree could be shipped as straightforward software, and its simple rules and clear interpretation opened it up to many real-life applications.
(Figure: a decision tree that classifies whether to play tennis based on the weather.)
A decision tree is a predictive model that represents a mapping between object attributes and object values. Each internal node tests an attribute, each branch corresponds to a possible value of that attribute, and each leaf node gives the value for the object described by the path from the root to that leaf. A decision tree has a single output; if multiple outputs are needed, independent decision trees can be built for each. Decision trees are a frequently used technique in data mining for analysis and prediction.
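The heart of ID3 is choosing the attribute with the highest information gain, measured with entropy. The snippet below is an illustrative sketch of that criterion on a hypothetical "play tennis" style mini-dataset (not a full ID3 implementation).

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    total = len(labels)
    split_entropy = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy

outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast", "sunny", "rain"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes",      "yes",   "yes"]
# ID3 computes this gain for every candidate attribute and splits on the largest one
print(information_gain(outlook, play))
```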
The Formative Period of Modern Machine Learning
From the early 1990s to the early 21st century
In 1990, Schapire constructed the first polynomial-time algorithm showing that a weak learner can be boosted into a strong one; this was the original Boosting algorithm. A year later, Freund proposed a more efficient Boosting algorithm. Both algorithms, however, share a practical flaw: they require prior knowledge of a lower bound on the weak learner's accuracy.
In 1995, Freund and Schapire improved the Boosting algorithm, proposing the AdaBoost (Adaptive Boosting) algorithm, which is almost identical in efficiency to the Boosting algorithm proposed by Freund in 1991 but does not require any prior knowledge about weak learners, making it easier to apply to practical problems.
Boosting is a method for improving the accuracy of weak classification algorithms: it constructs a series of prediction functions and then combines them, with appropriate weights, into a single stronger predictor. It is a framework algorithm that works mainly by reweighting or resampling the training set and then using a weak learning algorithm to train a sequence of base classifiers on those samples.
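For a concrete feel of the framework, here is a hedged sketch using scikit-learn's AdaBoostClassifier on synthetic data; the dataset, number of rounds, and train/test split are illustrative choices, not anything specified in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# By default each base classifier is a depth-1 decision tree ("stump");
# boosting reweights the samples so that later stumps focus on earlier mistakes.
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```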
In the same year, another significant breakthrough occurred in machine learning: Vapnik and Cortes introduced the support vector machine (SVM), backed by extensive theoretical analysis and empirical results. From then on, the machine learning community split into a neural network camp and a support vector machine camp.
The competition was not easy for the neural network side: once the kernel trick was applied, SVMs outperformed neural networks on many tasks well into the 2000s, achieving good results on problems that earlier neural network models could not solve. In addition, SVMs could exploit prior knowledge through convex optimization, margin-based generalization theory, and kernel design, which produced substantial theoretical and practical advances across many disciplines.
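The snippet below is an illustrative sketch of why kernelization mattered: on synthetic concentric-circle data, which is not linearly separable, an RBF-kernel SVM succeeds where a linear model does not. The dataset and parameters are hypothetical choices.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)

linear = LogisticRegression()
rbf_svm = SVC(kernel="rbf", C=1.0)   # kernel trick: no explicit nonlinear feature map is written

print(cross_val_score(linear, X, y, cv=5).mean())   # near chance on circular data
print(cross_val_score(rbf_svm, X, y, cv=5).mean())  # close to 1.0
```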
Support vector machines, Boosting, and maximum entropy methods (such as logistic regression, LR) are typical examples of such models. Structurally, they can generally be viewed as having a single layer of hidden nodes (as in SVM and Boosting) or no hidden nodes at all (as in LR). These models achieved great success in both theoretical analysis and application.
Another ensemble decision tree model was proposed by Breiman in 2001: each tree is trained on a random subset (bootstrap sample) of the instances, and each node considers only a random subset of the features when choosing its split. Because of these properties it is called a Random Forest (RF), and it has been shown both theoretically and empirically to resist overfitting.
While even AdaBoost shows weaknesses with respect to overfitting and outlier instances, Random Forest is more robust to these issues. Random Forest has proven successful on many different tasks, including competitions such as those on DataCastle and Kaggle.
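A hedged scikit-learn sketch of the two sources of randomness just described follows; the synthetic dataset and the settings (200 trees, square-root feature subsets) are illustrative, not prescriptive.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees in the ensemble
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the instances
    random_state=0,
)
print(cross_val_score(rf, X, y, cv=5).mean())
```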
The Flourishing Development Period
From the early 21st century to the present
The development of machine learning can be divided into two stages: shallow learning and deep learning. Shallow learning took off with the invention of the backpropagation algorithm for artificial neural networks in the 1980s, which made statistics-based machine learning algorithms flourish. Although the neural network algorithms of that time were also called multilayer perceptrons, the difficulty of training multilayer networks meant they were usually shallow models with only a single hidden layer.
Hinton, a leader in neural network research, proposed deep learning in 2006, greatly enhancing the capabilities of neural networks and mounting a challenge to support vector machines. That year, Hinton and his student Salakhutdinov published an article in the prestigious journal Science that set off a wave of deep learning in both academia and industry.
This article had two main messages: 1) Artificial neural networks with many hidden layers possess excellent feature learning abilities, and the features learned provide a more essential characterization of the data, thus facilitating visualization or classification; 2) The difficulty of training deep neural networks can be effectively overcome through “layer-wise pre-training”, which is achieved through unsupervised learning in this article.
Yann LeCun, who had worked in Hinton's lab, developed the LeNet deep learning networks, which came to be widely used by banks and in ATMs around the world. At the same time, Yann LeCun and Andrew Ng argued that convolutional neural networks allow artificial neural networks to be trained quickly: because the same filter is shared across every position in the image, they use very little memory and do not need to store separate weights for each location, which makes them well suited to building scalable deep networks and therefore ideal for recognition models.
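To illustrate the weight sharing mentioned above (an illustrative sketch, not part of the article), the PyTorch snippet below compares the parameter count of a small convolutional layer with that of a fully connected layer producing an output of the same size; the image size and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 1, 28, 28)        # one 28x28 grayscale image

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=5)  # 8 shared 5x5 filters
fc = nn.Linear(28 * 28, 8 * 24 * 24)     # dense layer producing the same number of outputs

conv_params = sum(p.numel() for p in conv.parameters())
fc_params = sum(p.numel() for p in fc.parameters())
print(conv_params)        # 8 * (5*5*1) + 8 = 208 parameters, reused at every image position
print(fc_params)          # over 3.6 million parameters for the equivalent dense layer
print(conv(image).shape)  # torch.Size([1, 8, 24, 24])
```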
In 2015, to commemorate the 60th anniversary of the concept of artificial intelligence, LeCun, Bengio, and Hinton released a joint review on deep learning.
Deep learning enables computational models with multiple processing layers to learn representations of data at multiple levels of abstraction. These methods have brought significant improvements in many areas, including state-of-the-art speech recognition, visual object recognition, and object detection, as well as fields such as drug discovery and genomics. Deep learning can discover complex structure in big data, and it uses the BP algorithm to do so: BP tells the machine how to adjust the internal parameters that are used to compute each layer's representation from the representation in the previous layer. Deep convolutional networks have made breakthroughs in processing images, video, speech, and audio, while recurrent networks have shown remarkable performance on sequential data such as text and speech.
The most popular methods in the current statistical learning field are mainly deep learning and SVM (support vector machine), which are representative methods of statistical learning. It can be considered that neural networks and support vector machines both originated from the perceptron.
Neural networks and support vector machines have long been in a "competitive" relationship. SVM draws on the kernel expansion theorem, so the nonlinear mapping never needs to be written out explicitly; because the linear learning machine is built in a high-dimensional feature space, it adds little computational complexity compared with a linear model and, to some extent, avoids the "curse of dimensionality". In contrast, earlier neural network algorithms were prone to overfitting, required many hyperparameters to be set by experience, trained relatively slowly, and, when the number of layers was small (three or fewer), did not clearly outperform other methods.
Neural network models now appear able to tackle harder tasks, such as object recognition, speech recognition, and natural language processing. However, this does not mean the end of other machine learning methods. Despite the rapidly growing number of deep learning success stories, training these models is costly and tuning their hyperparameters is cumbersome. Meanwhile, the simplicity of SVM keeps it among the most widely used machine learning approaches.
Machine learning, as part of artificial intelligence, is a young discipline that emerged in the mid-20th century. It has significantly affected how people work and live and has sparked intense philosophical debate. Overall, though, the development of machine learning is not so different from the development of other things, and it too can be viewed through the lens of philosophical development.
The development of machine learning has not been smooth sailing; it has followed a spiral of ascent in which achievements and setbacks coexist. The contributions of countless researchers have produced today's unprecedented prosperity in artificial intelligence: a process of quantitative change leading to qualitative change, the result of both internal and external factors.
Looking back, we are all likely to be captivated by this grand history.
Source: DataCastle