The 80-Year Journey of Artificial Neural Networks

Image Source: Pixabay

Written by | Sun Ruichen

Edited by | Li Shanshan

  

Today, large pre-trained neural network language models such as ChatGPT have become household names. Yet the algorithmic core behind GPT, the artificial neural network, has ridden an 80-year rollercoaster of ups and downs. Apart from a few explosive moments during those 80 years, the theory spent most of its time silent and neglected, at times even treated as poison by funding agencies.

The birth of artificial neural networks came from a golden pairing: the unruly genius Pitts and the well-established neurophysiologist McCulloch. Their theory, however, outran the technology of their time and thus failed to gain widespread attention or empirical validation.

Fortunately, in the first two decades after its inception, researchers kept contributing to the field, which evolved from the simplest mathematical models of neurons and learning rules to perceptron models capable of learning. But attacks from rival researchers, together with the tragic early death of Rosenblatt, the father of the "perceptron," plunged the field into a winter of nearly two decades, which ended only when the backpropagation algorithm was introduced into the training of artificial neural networks.

Research in artificial neural networks then restarted, and over the following two decades of buildup, convolutional neural networks and recurrent neural networks made their appearance.

The field's rapid takeoff in academia and industry, however, had to wait until 17 years ago, for a breakthrough in hardware: the arrival of general-purpose GPU computing. That breakthrough ultimately paved the way for large pre-trained language models like ChatGPT to become known to all.

In a sense, the success of artificial neural networks is a stroke of luck, for not every line of research survives long enough for the missing piece to fall into place. In many fields, the enabling breakthrough arrives too early or too late, and the field quietly dies. Yet behind this luck stands the steadfastness and perseverance of the researchers involved. Thanks to their idealism, artificial neural networks have ridden out their 80-year rollercoaster and finally reaped the rewards.

McCulloch-Pitts Neuron

In 1941, Warren Sturgis McCulloch moved to the University of Chicago Medical School as a professor of neurophysiology. Shortly after arriving in Chicago, a friend introduced him to Walter Pitts. Pitts, then pursuing his PhD at the University of Chicago, shared McCulloch's interest in neuroscience and logic, and the two quickly became like-minded friends and research partners.

Pitts had been insatiably curious from childhood: at the age of 12 he read Russell and Whitehead's “Principia Mathematica” in the library and wrote to Russell pointing out several errors in the book. Russell, delighted by the young reader's letter, replied with an invitation to study at Cambridge University (even though Pitts was only 12). Pitts's family, however, had little education, could not understand his thirst for knowledge, and often answered it with harsh words. His relationship with them deteriorated until he ran away from home at 15. From then on, Pitts lived as a drifter on the University of Chicago campus, auditing the courses he liked by day and sleeping in whatever classroom he could find by night. By the time he met McCulloch, Pitts was an enrolled PhD student but still had no fixed residence; on learning this, McCulloch invited Pitts to live with him.

By the time they met, McCulloch had already published several papers on the nervous system and was a well-known expert in the field. Although Pitts was still a PhD student, he had already made significant contributions to mathematical logic and gained recognition from prominent figures in the field, including von Neumann. Despite their very different professional fields, both were deeply interested in the workings of the human brain and firmly believed that mathematical models could describe and simulate brain functions. Driven by this common belief, they collaborated on multiple papers and established the first artificial neural network model. Their work laid the foundation for modern artificial intelligence and machine learning, and they are recognized as pioneers in the fields of neuroscience and artificial intelligence.

In 1943, McCulloch and Pitts proposed the earliest artificial neural network model: the McCulloch-Pitts Neuron model[1]. This model aimed to simulate the workings of neurons using the binary switch mechanism of “on” and “off.” The main components of the model include: input nodes that receive signals, intermediate nodes that process input signals through preset thresholds, and output nodes that generate output signals. In their paper, McCulloch and Pitts proved that this simplified model could be used to implement basic logic operations (such as “AND,” “OR,” and “NOT”). In addition, the model could also be used to solve simple problems, such as pattern recognition and image processing.
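
To make this mechanism concrete, here is a minimal sketch in Python. It is our illustration rather than code from the 1943 paper: the function names are ours, and inhibitory inputs are simplified to a plain negation.

```python
def mp_neuron(inputs, threshold):
    """McCulloch-Pitts neuron: fire (1) if enough inputs are active, else 0."""
    return 1 if sum(inputs) >= threshold else 0

# The basic logic operations from the 1943 construction:
AND = lambda a, b: mp_neuron([a, b], threshold=2)  # both inputs must fire
OR  = lambda a, b: mp_neuron([a, b], threshold=1)  # either input suffices
NOT = lambda a: 1 - a  # inhibition, simplified here to negation

assert AND(1, 1) == 1 and AND(1, 0) == 0
assert OR(0, 1) == 1 and OR(0, 0) == 0
assert NOT(0) == 1 and NOT(1) == 0
```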


McCulloch-Pitts Neuron
Image Source: www.cs.cmu.edu/~./epxing/Class/10715/reading/McCulloch.and.Pitts.pdf

Hebbian Learning


In 1949, Canadian psychologist Donald Hebb published “The Organization of Behavior,” proposing the famous theory of Hebbian learning[2]. The theory is summed up as “neurons that fire together wire together”: when two neurons are repeatedly active at the same time, the synapse between them, the junction where neurons connect and pass signals, is strengthened. This synaptic plasticity, Hebb argued, is the basis of the brain's learning and memory.

A key question in machine learning is how a model is updated as it learns. When a neural network is used for machine learning, the architecture and parameters of the initial model must first be set; then, during training, each input from the training dataset triggers an update of the model's parameters, and this update requires an algorithm. Hebbian learning supplied machine learning's first update rule: Δw = η · x_pre · x_post, where Δw is the change in the synaptic weight, η is the learning rate, x_pre is the activity of the presynaptic neuron, and x_post is the activity of the postsynaptic neuron.
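
As a quick illustration, the rule fits in a few lines of Python (the variable names follow the equation above; the numbers are arbitrary):

```python
def hebbian_update(w, x_pre, x_post, eta=0.01):
    """One Hebbian step: delta_w = eta * x_pre * x_post."""
    return w + eta * x_pre * x_post

w = 0.5
w = hebbian_update(w, x_pre=1.0, x_post=1.0)  # co-active neurons: weight grows
w = hebbian_update(w, x_pre=1.0, x_post=0.0)  # only one active: weight unchanged
print(w)  # 0.51
```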

The Hebbian update rule gave a theoretical basis for using artificial neural networks to mimic the behavior of biological ones. The Hebbian learning model is a form of unsupervised learning: it learns by adjusting connection strengths according to correlations it detects in the input data. For this reason it is particularly suited to clustering subcategories within input data, and as neural network research deepened, Hebbian learning was later found applicable to reinforcement learning and other subfields as well.

Perceptron

In 1957, American psychologist Frank Rosenblatt first proposed the Perceptron model, together with its update algorithm[3]. The Perceptron update rule extends the Hebbian rule with an iterative, trial-and-error training process: for each new data point, the model computes the difference between its predicted output and the measured output, then uses that difference to update the model's coefficients. The rule is Δw = η · (t − y) · x, where t is the target output, y is the model's prediction, and x is the input. Rosenblatt continued to develop the theory of the Perceptron, and in 1959 he completed a neural computer, the Mark I Perceptron, that used the model to recognize English letters.
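
Below is a minimal sketch of this trial-and-error loop in Python, using the common textbook variant with a 0/1 threshold output; the bias term and all names are our additions for clarity, not Rosenblatt's original code. It learns the linearly separable OR function.

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=10):
    """Perceptron learning: for each sample, apply delta_w = eta * (t - y) * x."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            y = 1 if np.dot(w, x_i) + b >= 0 else 0  # thresholded prediction
            w += eta * (t_i - y) * x_i               # correct weights by the error
            b += eta * (t_i - y)
    return w, b

# Learn the OR function, which is linearly separable:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 1])
w, b = perceptron_train(X, t)
print([1 if np.dot(w, x) + b >= 0 else 0 for x in X])  # [0, 1, 1, 1]
```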

Like the McCulloch-Pitts neuron, the Perceptron is modeled on the biological neuron: its basic operating mechanism is to receive input signals, process them, and produce an output signal. The differences lie in what the Perceptron adds. The McCulloch-Pitts neuron simply outputs 1 when the sum of its inputs exceeds a preset threshold and 0 otherwise, whereas the Perceptron first assigns each input signal a coefficient, so that inputs differ in how strongly they influence the output, and applies its activation to this continuously varying weighted sum. Most importantly, the Perceptron is a learning algorithm: its coefficients are adjusted in response to the data it sees, while the McCulloch-Pitts neuron has no adjustable coefficients and cannot update its behavior from data.

In 1962, Rosenblatt collected his years of Perceptron research in the book “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms”[4]. The Perceptron marked a major advance for artificial intelligence: it was the first algorithmic model capable of learning, able to pick up patterns and regularities from incoming data on its own, and it could classify data automatically according to their features. It was also relatively simple, demanding few computational resources.

For all its advantages and promise, however, the Perceptron remained a highly simplified model with many limitations. In 1969, computer scientists Marvin Minsky and Seymour Papert published the book “Perceptrons”[5], an in-depth critique that analyzed the limits of the single-layer neural networks the Perceptron represented, including, among other things, their inability to implement “XOR” logic or to handle linearly inseparable problems. Both the authors and Rosenblatt had already realized that multi-layer networks could solve problems single-layer networks could not. Unfortunately, the book's negative verdict on the Perceptron proved enormously influential, and public and governmental interest in Perceptron research plummeted. In 1971, Rosenblatt, the Perceptron's chief theorist and champion, died tragically in a boating accident at the age of 43. Under the double blow of “Perceptrons” and Rosenblatt's death, the number of Perceptron papers dwindled year by year, and the development of artificial neural networks entered a “winter.”


Perceptron Model

Image Source: towardsdatascience.com

Backpropagation Algorithm

Multi-layer neural networks can solve problems that single-layer networks cannot, but they bring a new difficulty: updating the weight of every neuron in a multi-layer network requires a great deal of precise computation, and conventional methods are so laborious that training becomes painfully slow and impractical.

To solve this problem, the American scholar and machine learning researcher Paul Werbos proposed the backpropagation algorithm in his 1974 Harvard doctoral dissertation, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences”[6]. The algorithm's basic idea is to take the error between the predicted and actual outputs at the output layer and propagate it backward through the network to adjust the weight of every neuron. In essence, it applies the chain rule of calculus to work backward from the output layer to the input layer, descending along the negative gradient, to train networks built from multi-layer perceptrons.
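
The following sketch shows the idea on the smallest interesting example, the XOR function that a single-layer perceptron cannot represent. It is a modern, simplified illustration of the chain-rule bookkeeping, not Werbos's original formulation; the layer sizes, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)  # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

eta = 1.0
for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule carries the output error toward the input layer
    dy = (y - t) * y * (1 - y)       # error signal at the output layer
    dh = (dy @ W2.T) * h * (1 - h)   # error signal at the hidden layer
    # Move every weight a small step along the negative gradient
    W2 -= eta * h.T @ dy; b2 -= eta * dy.sum(axis=0)
    W1 -= eta * X.T @ dh; b1 -= eta * dh.sum(axis=0)

print(np.round(y.ravel(), 2))  # should approach [0, 1, 1, 0]
```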

Regrettably, Werbos's dissertation attracted little attention for years after its publication. It was not until 1986 that David Rumelhart, a psychologist at the University of California, San Diego, together with cognitive psychologist and computer scientist Geoffrey Hinton and computer scientist Ronald Williams, published a paper on applying backpropagation to neural networks[7] that drew wide attention in the field of artificial intelligence. The ideas of Rumelhart and his colleagues were essentially similar to Werbos's, but they did not cite his dissertation, an omission that has drawn criticism in recent years.

The backpropagation algorithm plays a crucial role in the development of artificial neural networks and makes the training of deep learning models possible. Since the backpropagation algorithm was re-emphasized in the 1980s, it has been widely applied to train various neural networks. In addition to the initially proposed multi-layer perceptron neural networks, the backpropagation algorithm is also applicable to convolutional neural networks, recurrent neural networks, and more. Due to the significance of the backpropagation algorithm, Werbos and Rumelhart et al. are considered pioneers in the field of neural networks.

The backpropagation algorithm was in fact a signature achievement of artificial intelligence's “Renaissance” of the 1980s and 1990s. The dominant methodology of this period, Parallel Distributed Processing, centered on multi-layer neural networks and advocated accelerating their training and application through parallel computation. This was a marked departure from the era's mainstream thinking in artificial intelligence, and thus carried epoch-making significance. The methodology was also welcomed by scholars far beyond computer science, in psychology, cognitive science, and neuroscience, which is why this period is often regarded as the field's Renaissance.


Principle of the Backpropagation Algorithm

Image Source: www.i2tutorials.com

Convolutional Neural Networks (CNN)

If the McCulloch-Pitts neuron marks the birth of artificial neural networks, then the United States can be called their birthplace. For three decades after that birth, the U.S. led the field of artificial intelligence, nurturing key technologies such as the Perceptron and backpropagation. During the first AI “winter,” however, American government and academia alike lost confidence in the potential of artificial neural networks, and support and investment in the technology slowed sharply. As this winter swept across the U.S., neural network research in other countries stepped into the historical spotlight. Convolutional neural networks and recurrent neural networks emerged against this backdrop.

Convolutional neural networks are multi-layer neural networks with several distinctive structures: convolutional layers, pooling layers, and fully connected layers. The model uses convolutional layers to extract local features from the input, pooling layers to reduce the data's dimensionality and complexity, and finally flattens the data into a one-dimensional feature vector that fully connected layers turn into the output (usually a prediction or classification result). This structure makes convolutional networks especially effective on data with a grid structure, such as images and time series.
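
A bare-bones sketch of the two distinctive operations, convolution and pooling, might look like this in Python with NumPy (the kernel and sizes are arbitrary illustrations; real libraries implement these far more efficiently):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution: slide the kernel and take local weighted sums."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample by keeping the maximum of each size-by-size block."""
    H, W = fmap.shape
    H, W = H - H % size, W - W % size  # trim so the blocks tile exactly
    blocks = fmap[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, -1.0]])  # a crude horizontal-difference filter
features = max_pool(conv2d(image, edge_kernel))
print(features.shape)  # (4, 3): local features extracted, then downsampled
```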

Convolutional Neural Network

Image Source: https://www.analyticsvidhya.com/blog/2022/01/convolutional-neural-network-an-overview/

The earliest convolutional neural network was proposed in 1980 by Japanese computer scientist Kunihiko Fukushima[8]. Fukushima's model already contained convolutional layers and downsampling layers, a structure still used in today's mainstream convolutional networks. The main difference from today's networks is that Fukushima's model was not trained with backpropagation, which, as noted above, did not gain attention until 1986. Without that algorithm's support, his convolutional network, like other multi-layer networks of the time, suffered from long training times and heavy computational demands.

In 1989, Yann LeCun, a French computer scientist working at Bell Labs in the U.S., and his team began applying backpropagation to train convolutional neural networks, a line of work that culminated in the model known as LeNet-5[9]. LeCun demonstrated that such networks could recognize handwritten digits and characters, marking the start of convolutional networks' wide application in image recognition.

Recurrent Neural Networks (RNN)

Like convolutional networks, recurrent neural networks have a distinctive structure: they contain feedback connections, so that the network's hidden state at each step depends not only on the current input but also on the state carried over from the previous step, rather than information flowing strictly forward. This structure makes recurrent networks particularly well suited to natural language and other sequential, text-like data.

In 1990, American cognitive scientist and psycholinguist Jeffrey Elman proposed the Elman network (also known as the simple recurrent network)[10], the first recurrent neural network. Elman showed that a recurrent network could preserve the sequential structure of its data during training, laying the groundwork for such models' later use in natural language processing.
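
A minimal forward pass of an Elman-style network can be sketched as follows; the weights are random and the dimensions arbitrary, so this only illustrates how the hidden state carries context from step to step:

```python
import numpy as np

def elman_forward(inputs, Wxh, Whh, Why):
    """h_t = tanh(Wxh @ x_t + Whh @ h_{t-1});  y_t = Why @ h_t."""
    h = np.zeros(Whh.shape[0])          # the "context" starts empty
    outputs = []
    for x in inputs:                    # process the sequence in order
        h = np.tanh(Wxh @ x + Whh @ h)  # hidden state mixes input with history
        outputs.append(Why @ h)
    return outputs, h

rng = np.random.default_rng(0)
Wxh = rng.normal(size=(8, 3))   # input (3-dim) -> hidden (8-dim)
Whh = rng.normal(size=(8, 8))   # hidden -> hidden (the recurrent loop)
Why = rng.normal(size=(2, 8))   # hidden -> output (2-dim)
sequence = [rng.normal(size=3) for _ in range(5)]
outputs, h_final = elman_forward(sequence, Wxh, Whh, Why)
print(len(outputs), outputs[0].shape)  # 5 time steps, each output 2-dim
```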

Recurrent neural networks, however, face the problem of vanishing gradients: when trained with backpropagation, the weight-update gradients of layers close to the input shrink toward zero, so those weights barely change and training stalls. To address this, German computer scientist Sepp Hochreiter and his doctoral advisor Jürgen Schmidhuber proposed the Long Short-Term Memory (LSTM) network in 1997[11]. This special kind of recurrent network introduces memory cells that give the model much better long-term retention, alleviating the vanishing gradient phenomenon. LSTM remains one of the most widely used recurrent architectures today.
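
The gating arithmetic of one LSTM step can be sketched like this (a textbook formulation with our own variable names, not the exact 1997 version); the additive update of the cell state c is what lets gradients survive across many time steps:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, output, and candidate gates."""
    n = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[0*n:1*n])   # input gate: how much new content to write
    f = sigmoid(z[1*n:2*n])   # forget gate: how much old cell state to keep
    o = sigmoid(z[2*n:3*n])   # output gate: how much of the cell to expose
    g = np.tanh(z[3*n:4*n])   # candidate content
    c = f * c + i * g         # additive memory path: eases gradient vanishing
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d, n = 3, 8  # input and hidden sizes (arbitrary)
W, U, b = rng.normal(size=(4*n, d)), rng.normal(size=(4*n, n)), np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
for x in [rng.normal(size=d) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (8,) (8,)
```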

General-Purpose GPU Chips

In 2006, the American company NVIDIA released CUDA (Compute Unified Device Architecture), a platform that turned its GPUs (Graphics Processing Units) into general-purpose processors. Until then, GPUs had been designed specifically for graphics rendering and computation, used mainly in computer graphics applications such as image processing, real-time rendering of game scenes, and video playback. CUDA made general-purpose parallel computing possible, letting tasks that previously could only run on the CPU (Central Processing Unit) be computed on the GPU instead. A GPU's massively parallel design lets it execute many computations simultaneously, making it far faster than a CPU at workloads such as matrix operations. Training neural networks requires exactly such large-scale matrix and tensor operations, and before general-purpose GPU computing, the development of artificial neural networks had long been constrained by the limited throughput of CPUs, in both innovative theoretical research and the productization and industrialization of existing models. The arrival of general-purpose GPUs greatly eased both constraints.
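
To give a feel for what "calling the GPU instead of the CPU" means in practice, here is a hedged sketch using CuPy, a NumPy-like Python library that runs array operations through CUDA. It assumes an NVIDIA GPU and an installed cupy package, and the matrix sizes are arbitrary.

```python
import numpy as np
import cupy as cp  # NumPy-compatible arrays backed by CUDA

a = np.random.rand(4096, 4096).astype(np.float32)
b = np.random.rand(4096, 4096).astype(np.float32)

c_cpu = a @ b                                # matrix multiply on the CPU

a_gpu, b_gpu = cp.asarray(a), cp.asarray(b)  # copy the data to GPU memory
c_gpu = a_gpu @ b_gpu                        # the same multiply, spread across GPU cores
cp.cuda.Stream.null.synchronize()            # wait for the asynchronous GPU work
print(np.allclose(c_cpu, cp.asnumpy(c_gpu), atol=1e-2))  # results agree
```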

In 2010, Dan Ciresan, a postdoctoral researcher on Schmidhuber's team, used GPUs to dramatically accelerate the training of neural networks[12]. But GPUs truly made their name in artificial neural networks in 2012, when computer scientists Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, then at the University of Toronto, proposed the AlexNet model[13]. AlexNet is at heart a convolutional network; Krizhevsky and his colleagues trained it on GPUs and entered it in ImageNet ILSVRC, an internationally renowned image classification and labeling competition, where it took the championship by a striking margin. AlexNet's success ignited broad interest in applying artificial neural networks to computer vision.

Generative Neural Networks and Large Language Models

Recurrent neural networks can generate text sequences word by word, and so are often regarded as early generative neural network models. But although they handle and generate natural language well, they struggle to capture global information in long sequences, failing to connect pieces of information that lie far apart.


Transformer Model

Image Source: [14]

In 2017, researchers at Google led by Ashish Vaswani proposed the Transformer model[14]. This large neural network has two main parts, an encoder and a decoder. The encoder encodes the input sequence and processes the encoded information further through self-attention layers and other mechanisms; the information is then passed to the decoder, which generates the output sequence through its own self-attention layers and other structures. The model's key innovation is the self-attention layer, which frees the network from processing text strictly in order: it can directly draw on information from any position in the text, identify the dependencies among those pieces of information, and compute the semantic relevance of different positions in parallel. The Transformer's appearance has had a tremendous impact on natural language processing and on artificial intelligence as a whole; within just a few years it was adopted across a wide range of large AI models.
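
The heart of the model, scaled dot-product self-attention, fits in a few lines. The sketch below is a single attention head without masking or multi-head splitting, with arbitrary dimensions; it follows the paper's formula Attention(Q, K, V) = softmax(QKᵀ/√d)V.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One attention head over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])        # relevance of every position to every other
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V  # each position gathers information from all positions at once

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))  # 6 tokens, 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 16)
```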

Among the many large language models built on the Transformer architecture, OpenAI's chatbot ChatGPT is the best known. The model behind ChatGPT at launch was GPT-3.5 (Generative Pre-trained Transformer 3.5), which OpenAI trained on a vast corpus, giving it broad language understanding and generation abilities: supplying information, conversing, composing text, writing software code, and handling all manner of language-comprehension examinations with ease.

Conclusion

A few weeks ago, I participated in a volunteer activity where middle school students had lunch with researchers. During the event, I chatted with several fifteen- and sixteen-year-old students. Naturally, we talked about ChatGPT. I asked them, “Do you use ChatGPT? You can be honest with me; I won’t tell your teachers.” One boy shyly smiled and said he couldn’t live without ChatGPT now.

80 years ago, the drifting Pitts could only imagine a mathematical model that might simulate the functions of the brain. Today, in young people's world, neural networks are no longer abstract formulas but an omnipresent reality. What will the next 80 years bring? Will artificial neural networks develop consciousness, as human neural networks did? Will carbon-based brains continue to rule silicon-based brains, or will it be the other way around?

References:

1. Warren S. McCulloch and Walter Pitts. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” The Bulletin of Mathematical Biophysics, vol. 5, no. 4, 1943, pp. 115-133.

2. Donald O. Hebb. “The Organization of Behavior: A Neuropsychological Theory.” Wiley, 1949.

3. Frank Rosenblatt. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review, vol. 65, no. 6, 1958, pp. 386-408.

4. Frank Rosenblatt. “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.” Spartan Books, 1962.

5. Marvin Minsky and Seymour Papert. “Perceptrons: An Introduction to Computational Geometry.” MIT Press, 1969.

6. Paul Werbos. “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.” PhD dissertation, Harvard University, 1974.

7. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors.” Nature, vol. 323, no. 6088, 1986, pp. 533-536.

8. Kunihiko Fukushima. “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.” Biological Cybernetics, vol. 36, no. 4, 1980, pp. 193-202.

9. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, vol. 86, no. 11, 1998, pp. 2278-2324.

10. Jeffrey L. Elman. “Finding Structure in Time.” Cognitive Science, vol. 14, no. 2, 1990, pp. 179-211.

11. Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory.” Neural Computation, vol. 9, no. 8, 1997, pp. 1735-1780.

12. Dan C. Ciresan, Ueli Meier, Luca Maria Gambardella, and Jürgen Schmidhuber. “Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition.” Neural Computation, vol. 22, no. 12, 2010, pp. 3207-3220.

13. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.

14. Ashish Vaswani, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
