An Overview of Natural Language Processing by Sun Maosong

This article is reprinted from: Language Monitoring and Intelligent Learning

Written by / Sun Maosong

The importance of human language (i.e., natural language) cannot be overstated. Edward O. Wilson, the father of sociobiology, once said, “Language is the greatest evolutionary achievement after eukaryotic cells.” James Gleick, author of the popular science bestseller “The Information: A History, a Theory, a Flood,” likewise observed profoundly that “language itself is the greatest technological invention in human history.” These assertions carry the weight of scientific philosophy and reflect the ever-deepening modern understanding of the essence of language.

It is well known that language is unique to humans; it is the carrier of thought and the most natural, profound, and convenient tool for humans to communicate ideas and express emotions. The word “most” here is significant. Language is to humans what air is to living beings: it permeates our world silently and constantly, so naturally that we often do not notice its existence, yet without it humanity could hardly survive. Unfortunately, human language capability is precisely what modern computer systems lack, and this leaves a fundamental gap in their overall intelligence. The logic is obvious: a machine without language capability cannot possess true intelligence.

Natural language has infinite semantic combinability, high ambiguity, and continuous evolution, making it extraordinarily difficult for machines to achieve a complete understanding of it. Natural language understanding (or, in more modest terms, natural language processing), with its unparalleled scientific significance and academic challenge, has attracted generations of scholars to devote their utmost thought and effort to it.

Introduction


Sun Maosong

Foreign member of the European Academy, Executive Vice Dean and Professor of the Institute of Artificial Intelligence at Tsinghua University. His main research directions are natural language processing, artificial intelligence, and social, humanistic, and art computing. He led the completion of two international standards on word segmentation for information processing. CAAI Fellow.

The Three Milestone Contributions of Natural Language Processing to the Development of Artificial Intelligence in the World

“Yet looking back on the path we have come, lush greenery stretches across the hills.” The author believes that natural language processing research (including text processing and speech processing, which complement each other) has made three milestone contributions to the history of artificial intelligence worldwide. This is merely a personal opinion, not necessarily correct, and is offered to invite discussion.

The First Milestone Contribution

Artificial intelligence research in the modern sense originated with natural language processing. The fascination with and exploration of machine intelligence has a long history, and the advent of the first general-purpose computer, ENIAC, in 1946 was undoubtedly a historical watershed. As early as 1947, Warren Weaver, then head of the Natural Sciences Division of the Rockefeller Foundation, discussed the possibility of using digital computers to translate human languages in a letter to Norbert Wiener, the father of cybernetics. In 1949, he issued the famous “Translation” memorandum, formally proposing the task of machine translation and sketching a scientifically sound development path (its content in fact covers the two major research paradigms of rationalism and empiricism). In 1951, the Israeli philosopher, linguist, and mathematician Yehoshua Bar-Hillel began machine translation research at the Massachusetts Institute of Technology. In 1954, the experimental machine translation system developed jointly by Georgetown University and IBM was publicly demonstrated. Machine translation is a typical cognitive task and clearly belongs to the field of artificial intelligence.

The Second Milestone Contribution

Natural language processing was among the first, both within artificial intelligence and within computer science and technology as a whole, to propose and systematically practice the concept of unstructured “big data,” achieving a shift from the rationalist research paradigm to the empiricist one. Two representative works illustrate this.

First, continuous speech recognition. From the mid-1970s, the renowned scholar Frederick Jelinek led an IBM research team that proposed a large-vocabulary continuous speech recognition method based on a corpus-trained n-gram language model (essentially an (n-1)-th order Markov model), significantly improving recognition performance. This idea had a profound impact on the field of speech recognition for roughly 20 years, extending even to the IBM statistical machine translation models launched in the early 1990s, which created a new landscape for machine translation (those models returned machine translation research to the empiricist paradigm suggested by Warren Weaver in 1949, fully demonstrating his foresight).
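To make the n-gram idea just described concrete, here is a minimal sketch of a bigram (2-gram) language model estimated from a toy corpus with add-one smoothing. The corpus, the smoothing scheme, and all names are illustrative assumptions, not a reconstruction of the IBM system.

```python
from collections import defaultdict

# A minimal bigram language model: P(w_i | w_{i-1}) is estimated from corpus
# counts with add-one smoothing. Purely illustrative; real systems of the
# Jelinek era used far larger corpora and more careful smoothing schemes.
corpus = [
    "we recognize speech".split(),
    "we recognize continuous speech".split(),
]

unigram = defaultdict(int)   # counts of context words
bigram = defaultdict(int)    # counts of (previous word, current word) pairs
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        unigram[prev] += 1
        bigram[(prev, cur)] += 1

vocab_size = len({w for s in corpus for w in s} | {"</s>"})

def bigram_prob(prev: str, cur: str) -> float:
    """Add-one smoothed conditional probability P(cur | prev)."""
    return (bigram[(prev, cur)] + 1) / (unigram[prev] + vocab_size)

def sentence_prob(sent: list[str]) -> float:
    """Sentence probability under the first-order Markov assumption."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(sentence_prob("we recognize speech".split()))
```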

Second, automatic part-of-speech tagging. In 1971, researchers built the TAGGIT English part-of-speech tagging system, which used some 3,300 manually compiled context-sensitive rules and achieved a tagging accuracy of 77% on the one-million-word Brown corpus. Between 1983 and 1987, a research group at Lancaster University in the UK took a different approach and proposed a data-driven method requiring no manual rules: using the already tagged Brown corpus, they constructed the CLAWS English part-of-speech tagging system based on hidden Markov models and automatically tagged the one-million-word LOB corpus, with accuracy jumping to about 96%.
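As a rough illustration of this data-driven idea (and only an illustration, not the actual CLAWS system), the sketch below decodes the most probable tag sequence with the Viterbi algorithm. The tiny transition and emission tables are hand-made assumptions; in practice they would be estimated by counting a tagged corpus such as Brown.

```python
import math

# Hand-made toy parameters of an HMM part-of-speech tagger (assumptions).
transition = {  # P(tag_i | tag_{i-1})
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
    ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.5,
}
emission = {  # P(word | tag)
    ("DET", "the"): 0.7, ("NOUN", "dog"): 0.4,
    ("NOUN", "barks"): 0.1, ("VERB", "barks"): 0.6,
}
TAGS = ["DET", "NOUN", "VERB"]
EPS = 1e-8  # floor probability for unseen events

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {t: (math.log(transition.get(("<s>", t), EPS))
                + math.log(emission.get((t, words[0]), EPS)), [t])
            for t in TAGS}
    for w in words[1:]:
        new_best = {}
        for t in TAGS:
            score, path = max(
                (best[p][0]
                 + math.log(transition.get((p, t), EPS))
                 + math.log(emission.get((t, w), EPS)),
                 best[p][1] + [t])
                for p in TAGS)
            new_best[t] = (score, path)
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))  # expected: ['DET', 'NOUN', 'VERB']
```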

The Third Milestone Contribution

The current wave of artificial intelligence sweeping the globe also originated in natural language processing. Between 2009 and 2010, the renowned scholar Geoffrey Hinton collaborated with Dr. Li Deng of Microsoft to propose a speech recognition method based on deep neural networks, breaking through a performance bottleneck that had lasted nearly a decade. This allowed the academic community to experience for the first time the power of deep learning, greatly boosting confidence in deep learning frameworks and dispelling doubts about them. Other research fields soon rushed to follow suit. In 2016, Google launched the deep neural network machine translation system GNMT, bringing the era of IBM-style statistical machine translation to a close and opening a new chapter.

Deep Learning-Based Natural Language Processing: The Current Basic Situation

Since 2010, deep learning has emerged rapidly, driving the comprehensive development of artificial intelligence. The result of ten years of development is that, on one hand, deep learning has transformed artificial intelligence technology from being almost completely “unusable” to “usable,” achieving historically extraordinary progress; on the other hand, while it has significantly improved the performance of artificial intelligence systems across almost all classic tasks, deep learning methods have profound shortcomings that prevent many application scenarios from achieving “usable, manageable, and effective” results. The field of natural language processing is fundamentally the same; this article will not elaborate further.

From a macro perspective, the development of artificial intelligence has undoubtedly benefited from two major types of methodological tools: convolutional neural networks (CNN) for images and recurrent neural networks (RNN) for natural language text. Initially, the former was more prominent, but in recent years, the latter has made more significant contributions. Several major ideas influencing the global landscape of deep learning, such as attention mechanisms, self-attention mechanisms, and the Transformer architecture, have originated from the latter.
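The core computation behind the attention and self-attention mechanisms mentioned above is commonly summarized as softmax(QK^T / sqrt(d)) V. Below is a minimal NumPy sketch of that single computation; the toy shapes and random projection matrices are assumptions for illustration, not any particular published model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Core of (self-)attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V have shape (sequence_length, d). In self-attention all three
    are linear projections of the same input sequence.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                             # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of values

# Toy example: a 3-token sequence with 4-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
print(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv).shape)  # (3, 4)
```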

Deep learning-based natural language processing has completed three splendid iterations of its model framework in just ten years, “traveling the mountain path, where mountains and streams mirror one another, leaving one unable to take it all in,” successively attaining three realms (which are also the three realms of deep learning).

The First Realm

For each natural language processing task, an independent manually annotated dataset is prepared, and a neural network model exclusive to that task is trained almost from scratch (often supplemented by word2vec word vectors). Its characteristic can be described as “starting from scratch + each family sweeping the snow from its own doorstep.”
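A minimal sketch of this first-realm pipeline follows, assuming a toy binary sentence-classification task; the random tensors stand in for real word2vec vectors and annotated data, and the LSTM-plus-classifier architecture is just one typical choice.

```python
import torch
from torch import nn

# First realm: a dedicated model trained from scratch on one task's own
# annotated data, with pre-trained word2vec vectors (a random stand-in here)
# as the only shared ingredient. Sizes are assumptions for illustration.
VOCAB, DIM, NUM_CLASSES = 10_000, 300, 2
word2vec_vectors = torch.randn(VOCAB, DIM)      # stand-in for real word2vec

class TaskSpecificModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize the embedding layer from the (stand-in) word2vec vectors.
        self.embed = nn.Embedding.from_pretrained(word2vec_vectors, freeze=False)
        self.encoder = nn.LSTM(DIM, 128, batch_first=True)
        self.classifier = nn.Linear(128, NUM_CLASSES)

    def forward(self, token_ids):               # (batch, seq_len)
        vectors, _ = self.encoder(self.embed(token_ids))
        return self.classifier(vectors[:, -1])  # use the last hidden state

model = TaskSpecificModel()
logits = model(torch.randint(0, VOCAB, (4, 12)))           # a fake batch
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, NUM_CLASSES, (4,)))
loss.backward()   # gradients reach every parameter; an optimizer would then
                  # update them all, for this task alone
```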

The Second Realm

First, based on a large-scale corpus, a large-scale pre-trained language model (PLM) is trained through self-supervised (unsupervised) learning. Then, for each natural language processing task (now referred to as a downstream task), an independent manually annotated dataset is prepared. With the PLM as a common foundation, a lightweight fully connected feedforward neural network is trained for that downstream task; during this process, the parameters of the PLM are also adaptively adjusted. Its characteristic can be described as “pre-trained large model + fine-tuning.”
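The sketch below illustrates the “pre-trained large model + fine-tuning” pattern under stated assumptions: ToyPLM is a small stand-in for a real pre-trained Transformer encoder (in practice one would load pre-trained weights), and the optimizer updates both the PLM parameters and the lightweight task head.

```python
import torch
from torch import nn

DIM, NUM_CLASSES = 256, 2

class ToyPLM(nn.Module):
    """Placeholder for a large pre-trained encoder (an assumption, not a real PLM)."""
    def __init__(self, vocab=10_000, dim=DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))   # (batch, seq, dim)

plm = ToyPLM()                       # in practice: load pre-trained weights here
head = nn.Linear(DIM, NUM_CLASSES)   # lightweight task-specific head

# Fine-tuning: the optimizer updates the PLM's parameters *and* the head's.
optimizer = torch.optim.AdamW(list(plm.parameters()) + list(head.parameters()),
                              lr=2e-5)
token_ids = torch.randint(0, 10_000, (4, 16))        # fake annotated batch
labels = torch.randint(0, NUM_CLASSES, (4,))
logits = head(plm(token_ids)[:, 0])                  # pool the first position
loss = nn.CrossEntropyLoss()(logits, labels)
loss.backward()
optimizer.step()
```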

The Third Realm

First, based on an extremely large-scale corpus, a giant PLM is trained through self-supervised (unsupervised) learning. Then, for each natural language processing downstream task, with the PLM as a common foundation, the task is completed through few-shot learning or prompt learning. During this process, the parameters of the PLM are not adjusted at all (in fact, given the model’s enormous scale, downstream tasks cannot afford to adjust them). Its characteristic can be described as “pre-trained giant model + one giant supporting many small ones.”
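The sketch below illustrates the “frozen giant model + prompting” pattern. The generate function is a hypothetical placeholder for whatever text-continuation interface the frozen model exposes, and the task, labels, and examples are invented for illustration; the essential point is that no model parameter is updated.

```python
# Third realm: the giant PLM stays frozen, and a downstream task (here,
# sentiment classification) is expressed as a few-shot prompt; the model's
# completion is read off as the answer.

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a frozen giant PLM's text continuation."""
    raise NotImplementedError("call your own pre-trained model here")

FEW_SHOT_EXAMPLES = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I walked out after twenty minutes.", "negative"),
]

def classify_sentiment(sentence: str) -> str:
    # Build the prompt: task instruction + a handful of labelled examples
    # + the new input. No PLM parameter is updated at any point.
    lines = ["Decide whether each review is positive or negative."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {sentence}\nSentiment:")
    completion = generate("\n\n".join(lines))
    return completion.strip().split()[0]   # first generated word is the label
```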

These three realms go one deeper than the last, each with a stronger “metaphysical” flavor than the one before. Performance on the GLUE and SuperGLUE public benchmarks also improves markedly from one realm to the next (we are currently in the third realm).

In recent years, talented researchers across the global artificial intelligence community have competed fiercely around pre-trained language models, with model sizes expanding rapidly (for example, OpenAI’s GPT-3 model, launched in June 2020, reached 175 billion parameters, and the MT-NLG model jointly launched by Microsoft and NVIDIA in October 2021 soared to 530 billion parameters), making for an extraordinarily lively scene. In August 2021, Stanford University held a two-day academic workshop that named the “pre-trained giant model” of the third realm the “foundation model,” and subsequently published a document of several hundred pages elaborating its views comprehensively. The document includes a diagram (see Figure 1) showing the central role of the “foundation model” in intelligent information processing (its scope of influence has expanded to all data types and modalities).


Figure 1 The Central Role of the “Foundation Model” in Intelligent Information Processing

There have been many voices questioning the “foundation model.” For instance, Turing Award winner Judea Pearl asked on Twitter: “What is the scientific principle by which ‘Foundation models’ can circumvent the theoretical limitations of data-centric methods as we know them?” Nevertheless, regardless of the criticism, the status of the “foundation model” as a public infrastructure for intelligent information processing seems to have been established.

The first realm is like every family building its house on a beach at nearly zero elevation: each can only lay its own foundation, which is laborious, so efficiency is low. The second realm erects a common platform at an elevation of twenty or thirty meters, and everyone builds on it; since most of the foundational work can be omitted, efficiency rises markedly, but the platform is still not high or solid enough, so adjustments are required during construction. The third realm raises the public platform to two or three hundred meters on a solid foundation, so that even a simple pavilion built on it commands a fine view; the matters each builder must worry about decrease sharply, and efficiency improves further. This is a metaphorical explanation of how I position the “foundation model.”

Let us take Baidu’s Wenxin NLP large model, with a parameter scale in the billions (https://wenxin.baidu.com/wenxin/ernie), as an example to get a brief feel for large models. First, let us try its “free creation” function. We input the following sentence and ask the machine to continue writing:

Walking along the winding path, I did not see

The machine’s continuation is quite impressive:

a person, only a few tall ginkgo trees and some unknown flowers blooming brilliantly in the sunlight, occasionally two small birds can also be seen.

Using the same input again, we ask the machine to continue writing:

Any human presence, yet a small village stands quietly there. This is Qian Shiwucun, located in Yufang Town, Daxing District, with about 300 households and more than 800 people in total. Nowadays, rural areas have become one of the main sources of livelihood for many families. Now, we not only have good food and clothing but also can enjoy many beautiful and wonderful programs!

This passage, although fluent, has become somewhat strange in its narrative flow.

Let’s examine its “free question and answer” function. Question:

Which mountain is the highest in Shandong?

Machine answer:

Mount Tai.

Rephrasing the question:

What is the highest mountain in Shandong?

Machine answer:

Mount Tai!

Considering this is a generative response, the machine’s ability to achieve this is quite impressive. However, if the same question is asked multiple times, the machine can become confused and respond randomly:

Laoshan in Qingdao. This is straightforward.

Wulian Mountain, with an elevation of 1095 meters.

Jinan

This shows that while the large model has broad knowledge coverage, it lacks precision and “self-awareness,” making it easy to “attack” or trip up logically. Let us turn back to its “free creation” function once more. Input:

Walking along the bustling winding path, I did not see

The machine continues:

a passerby, only a few lanterns illuminating the shadows on the ground.

Input:

Walking along the crowded winding path, I did not see

The machine continues:

anyone, yet there is a small mountain village standing quietly there. This is Qian Shiwucun, located in Yufang Town, Daxing District, with about 300 households and more than 800 people in total. Nowadays, rural areas have become one of the main sources of livelihood for many families. Now, we not only have good food and clothing but also can enjoy many beautiful and wonderful programs!

From the above examples, it is evident that the large model has distinct strengths but also significant weaknesses. It serves as a platform at a certain elevation, where houses can be conveniently built, but this does not imply that the houses are already well-built. In fact, to construct a truly “usable, manageable, and effective” house on it, considerable effort is still required.

The Confusions and Thoughts Brought by the Third Realm: The Main Challenges for Natural Language Processing in the Coming Years

I believe that natural language processing will face two main challenges in the coming years.

The First Challenge: Is Size King?

This challenge follows the current mainstream trend; it is the natural extension of the “foundation model” approach. Although this path will continue to move forward, the “foundation model” itself clearly cannot solve the deeper problems of semantics; in other words, it is destined not to lead directly to general artificial intelligence (although it will certainly be a basic component of any envisioned general artificial intelligence). An obvious question therefore arises: how far can this strategy of pursuing extreme scale (of data, models, and computing power) go? A related question is: what should we do about it?

I believe we can answer these questions from the two perspectives of exploitation and exploration of the “foundation model.”

Exploitation concerns mainly the engineering side of the “foundation model.” Several points are worth noting.

Currently, the algorithms used to construct and utilize the “foundation model” are still quite rough. The issues observed in the performance of the Baidu Wenxin NLP model mentioned earlier can hopefully be partially resolved through active algorithm improvements.

Research and development should be strengthened on new techniques that complement the “foundation model,” such as few-shot learning, prompt learning, and adapter-based learning (a minimal adapter sketch follows this list of points).

Is having a comprehensive training dataset necessarily a good thing? Should we filter out the significant noise present in large datasets?

Leaderboards are undoubtedly very important for model development. However, leaderboards are not the only gold standard; application is the ultimate gold standard.

Companies developing the “foundation model” should not engage in self-promotion; they must open their models to academic testing. A “foundation model” that is not open to academic testing raises doubts about its performance. The academic community should not blindly trust or follow.

“Foundation models” urgently need to find killer applications to convincingly demonstrate their capabilities.
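As a concrete, deliberately simplified illustration of the adapter-based learning mentioned above, the sketch below shows a bottleneck adapter module: it would be inserted into a frozen pre-trained model, and only its own parameters would be trained for the downstream task. The dimensions are assumptions for illustration, and real adapter designs vary.

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """A small bottleneck adapter: project down, apply a nonlinearity, project up."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, dim)     # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual connection keeps the pre-trained representation intact
        # when the adapter output is small at initialization.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
frozen_hidden = torch.randn(4, 16, 768)   # stand-in for a frozen PLM layer's output
print(adapter(frozen_hidden).shape)       # torch.Size([4, 16, 768])
```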

Exploration concerns mainly the scientific side of the “foundation model.” The “foundation model” has exhibited some surprising (even “strange”) phenomena for which no scientific explanation has yet been given. Typical examples include:

Why do large-scale pre-trained language models exhibit the deep double descent phenomenon (which seems to transcend the golden rule of “data complexity and model complexity should generally match” in machine learning)?

Why do “foundation models” possess few-shot and even zero-shot learning capabilities? How are these capabilities acquired? Does this involve the emergence of complex giant systems?

Why does prompt learning work? Does this imply that the “foundation model” may spontaneously generate several functional partitions, with each prompt providing the key to activate one of these functional partitions?

If so, what might the distribution of these functional partitions look like? Given that the core training algorithm of the “foundation model” is extremely simple (language model or cloze model), what deeper implications does this have?

I personally believe that exploring the scientific significance of the “foundation model” may be even more important than its engineering significance. If it indeed harbors the mysteries listed above, it will profoundly inform the further development of artificial intelligence models, and the “foundation model” may open a new vista of “beyond endless mountains and rivers, where there seems to be no road, another village suddenly appears amid dark willows and bright flowers.” Furthermore, it may also provide inspiration for research in brain science and cognitive neuroscience.

The Second Challenge: Is Intelligence Supreme?

This is the “original intention” and eternal dream of artificial intelligence, which is quite different from the first challenge, but its necessity is undeniable. Here’s an example.

The aforementioned pioneer of machine translation, Yehoshua Bar-Hillel, published a lengthy article in 1960 titled “The Present Status of Automatic Translation of Languages,” assessing the prospects of machine translation. In it, he presented a short passage that is easy for humans but extraordinarily challenging for machine translation (note that the word “pen” has two senses here: “fountain pen” and “enclosure”):

Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.

For the machine to translate “pen” here correctly as “enclosure,” it needs to understand the spatial meaning of the preposition “in” and to possess the relevant world knowledge (a toy box is far too large to fit inside a fountain pen). Let’s input this simple English passage into machine translation systems armed with deep neural networks and big data.

Google Translate result: The box is in the pen.

Baidu Translate result: The box is in the fountain pen.

Sixty years on, the problem still has not been solved.

Fortunately, amidst the grand and seemingly unstoppable trend of “size is king,” a group of scholars still insists on and actively advocates a development concept for the next generation of artificial intelligence that emphasizes small data, rich knowledge, and causal reasoning, that is, “intelligence is supreme.” However, progress along this line has been limited. Two unavoidable “roadblocks” stand in its way.

First, there is still a serious lack of formal knowledge bases and world knowledge bases. Knowledge graphs like Wikidata may seem vast, but if examined closely, it becomes evident that their coverage of knowledge is still quite limited. In fact, Wikidata exhibits significant compositional deficiencies, primarily containing static attribute knowledge about entities, with almost no formal descriptions of actions, behaviors, states, and event logical relationships. This severely restricts its scope and significantly diminishes its practical effectiveness.

Second, the ability to systematically acquire formal knowledge of “actions, behaviors, states, and event logical relationships” is still severely lacking. Large-scale syntactic and semantic analysis of open text (such as Wikipedia) is a necessary path to such acquisition. Unfortunately, this syntactic and semantic analysis capability is itself still immature (although deep learning methods have brought considerable progress in recent years).

These two “roadblocks” must be cleared. Otherwise, as the saying goes, even the cleverest cook cannot prepare a meal without rice, and this path will be hard to travel.

These two major challenges are, in fact, what the entire field of artificial intelligence must confront.

Conclusion

Natural language processing has traveled a long road to reach the present, forming two paths: “size is king” and “intelligence is supreme.” The former is a broad road on which the sailing has been smooth, but its end seems to be coming into view; the latter is a narrow road against the wind, but it should run long and deep. Looking ahead, the two can coexist and complement each other, learning from and supporting one another; for example, the “foundation model” is expected to effectively enhance the capability of large-scale automatic syntactic and semantic analysis, thereby providing the prerequisites for large-scale knowledge acquisition. The “foundation model” may also harbor certain profound computational principles or mysteries whose clarification could lead to major breakthroughs, and this deserves close attention. In the coming decade, natural language processing is expected to create a grand pattern of research and application and to make key contributions to the development of artificial intelligence.

(References omitted)
