This article was first published on the WeChat public account "Intelligent Cube".
Author Introduction: Liu Zhiyuan, Assistant Researcher at Tsinghua University, Senior Member of the China Computer Federation. He obtained his PhD from Tsinghua University in 2011 and won the Excellent Doctoral Dissertation Award from the Chinese Association for Artificial Intelligence. His main research areas are natural language processing and social computing, with over 20 papers published in top conferences and journals in related fields.
A few years ago, at the recommendation of Professor Ma Shaoping, I wrote a short article introducing natural language processing for a popular science book, which covered the basic concepts, tasks, and challenges of NLP and can serve as a beginner’s reference.
1. What is Natural Language Processing
In simple terms, Natural Language Processing (NLP) is the use of computers to process, understand, and make use of human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, often referred to as computational linguistics. Natural language is a fundamental characteristic that distinguishes humans from other animals: without language, human thought as we know it would not be possible. Natural language processing therefore represents one of the highest goals of artificial intelligence, in the sense that only when computers can genuinely process natural language can machines be said to have achieved true intelligence.
From a research perspective, natural language processing covers syntactic analysis, semantic analysis, and discourse understanding. From an application perspective, NLP has broad prospects, especially in the information age: machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, public opinion analysis and sentiment mining, and so on. It also draws on data mining, machine learning, knowledge acquisition, knowledge engineering, and artificial intelligence research, as well as the branches of linguistics concerned with language processing.
It is worth mentioning that the rise of natural language processing is closely tied to one specific task: machine translation, the automatic translation of one natural language into another by computer. For example, automatically translating the English "I love Beijing Tiananmen" into "我爱北京天安门", or conversely translating "我爱北京天安门" into "I love Beijing Tiananmen". Manual translation requires well-trained bilingual experts and is very time-consuming and labor-intensive; when translating literature in specialized fields, the translator must also have basic knowledge of that field. There are several thousand languages in the world, and the United Nations alone has six working languages. If accurate machine translation between languages could be achieved, it would greatly improve the efficiency of human communication and understanding.
There is a story in the Bible about the people of Babylon wanting to build a tower that would reach heaven. The builders all spoke the same language, shared the same intentions, and worked together. Seeing that humans dared to attempt such a thing, God confused their languages so that they could no longer understand one another; they argued all day long and the construction could not continue. The tower was later called the Tower of Babel, where "Babel" means "confusion". Although the building of the Tower of Babel was halted, a dream has always lingered in people's hearts: when will humanity share a common language and rebuild the Tower of Babel? Machine translation is seen as a great endeavor to "rebuild the Tower of Babel". If machine translation between different languages can be realized, we will be able to understand what anyone in the world says and communicate with them, no longer troubled by mutual incomprehension.
In fact, when “artificial intelligence” was formally proposed as a research issue, the founders identified computer chess and machine translation as two iconic tasks, believing that as long as the chess system could defeat the human world champion and the machine translation system reached human translation levels, it could be declared a victory for artificial intelligence. Forty years later, in 1997, IBM’s Deep Blue supercomputer was able to defeat the world chess champion Garry Kasparov. However, machine translation still cannot compare to human translation levels, which shows how difficult natural language processing is!
Natural language processing originated in the United States. In the 1950s, shortly after World War II, when electronic computers were still in their infancy, the idea of using computers to process human language had already emerged. At that time, the U.S. hoped to use computers to automatically translate large amounts of Russian material into English in order to keep track of the latest developments in Soviet science and technology. Inspired by the wartime deciphering of military codes, researchers regarded different languages as merely different encodings of the "same semantics", and thus naively believed that languages could be "deciphered" in the same way codes are broken.
On January 7, 1954, Georgetown University and IBM collaborated on an experiment that automatically translated more than 60 sentences from Russian into English. Although the machine translation system was very simple, containing only 6 grammar rules and 250 words, extensive media coverage led it to be hailed as a huge breakthrough, and the U.S. government increased its investment in natural language processing research. The experiment's participants confidently predicted that the problem of automatic translation between languages would be completely solved within three to five years. They believed that as long as enough translation rules were written down, piling them together would yield perfect automatic translation.
The reality, however, is that understanding human language is far more complex than deciphering codes, and research progress was very slow. A 1966 research report concluded that after ten years of work the results fell far short of expectations; funding support declined sharply, and research on natural language processing (especially machine translation) entered a two-decade low tide. It was not until the 1980s, as computing power increased rapidly and the cost of computers fell significantly, that researchers returned to this challenging field. Over those three decades, researchers came to recognize that simply piling up language rules cannot achieve true understanding of human language, and found that automatically learning statistical regularities from large amounts of text data is a better way to solve natural language processing problems such as automatic translation. This approach is known as the statistical learning model of natural language processing, and it continues to flourish today.
So, what are the main difficulties or challenges in natural language processing that attract so many researchers to tirelessly explore solutions for decades?
2. Main Difficulties in Natural Language Processing
The difficulties of natural language processing can be listed in many ways, but the key lies in resolving ambiguity, which arises in lexical analysis, syntactic analysis, semantic analysis, and other stages of processing; this is collectively referred to as disambiguation. Correct disambiguation requires a large amount of knowledge, including linguistic knowledge (such as morphology, syntax, semantics, and context) and world knowledge (which is language-independent). This gives rise to the two main difficulties of natural language processing.
Firstly, language is full of ambiguity, which manifests mainly at three levels: lexical, syntactic, and semantic. Ambiguity arises because natural language must describe an extremely complex world of human activity with a limited vocabulary and a limited set of syntactic rules, so the same linguistic form comes to carry multiple meanings.
For example, the word boundary problem is a disambiguation task at the lexical level. In speech, words are usually uttered continuously, and in written form, languages such as Chinese have no explicit boundaries between words. Since words are the smallest units that carry meaning, natural language processing must begin by determining where words start and end. Chinese text in particular consists of continuous character sequences with no natural separators between words, which makes Chinese information processing more complex than processing Western languages such as English: an extra step is needed to determine word boundaries, known as the "Chinese automatic word segmentation" task. Simply put, the computer must automatically insert delimiters between words, thereby segmenting Chinese text into independent words. For example, inserting delimiters into the sentence "今天天气晴朗" yields "今天|天气|晴朗". Chinese automatic word segmentation is the foundational task of Chinese natural language processing and plays an important role; its main challenges are new word discovery and ambiguous segmentations. Note that correct segmentation depends on a correct understanding of the text's meaning, yet segmentation is itself the first step toward understanding the text. This "chicken and egg" problem naturally becomes the first hurdle in (Chinese) natural language processing.
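To make the segmentation task concrete, here is a minimal sketch of dictionary-based segmentation using forward maximum matching, one of the simplest classical approaches (not a method specifically advocated in this article). The toy dictionary and sentence are illustrative assumptions; a practical segmenter would combine a large dictionary with statistical models to handle new words and ambiguous boundaries.

```python
def forward_max_match(text, vocab, max_word_len=4):
    """Greedily take the longest dictionary word starting at each position."""
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing matches
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

# Toy dictionary for illustration only.
vocab = {"今天", "天气", "晴朗"}
print("|".join(forward_max_match("今天天气晴朗", vocab)))  # 今天|天气|晴朗
```

Even this greedy matcher exposes the core difficulty: when dictionary entries overlap, the longest match is not necessarily the correct one, which is exactly why segmentation ultimately requires semantic understanding.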
Other levels of linguistic units also face various kinds of ambiguity. At the phrase level, for example, "进口彩电" can be understood either as a verb-object construction (to import color TVs from abroad) or as a modifier-head construction (color TVs that were imported from abroad). Similarly, at the sentence level, "做手术的是她的父亲" can mean either that her father is the patient undergoing surgery or that her father is the surgeon performing it. In short, the same word, phrase, or sentence can have multiple interpretations and thus express several possible meanings. If we cannot properly resolve ambiguity at every level of linguistic units, we cannot correctly understand what the language is intended to express. The toy grammar sketched below illustrates the phrase-level case.
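As an illustration of structural ambiguity (an assumed toy setup, not an analysis taken from this article), the following sketch uses the NLTK toolkit with a deliberately ambiguous context-free grammar so that "进口彩电" receives two parse trees, one per reading.

```python
import nltk

# Toy grammar: the VP reading is verb-object ("import color TVs"),
# the NP reading is modifier-head ("imported color TVs").
toy_grammar = nltk.CFG.fromstring("""
S  -> VP | NP
VP -> V NP
NP -> V NP | N
V  -> '进口'
N  -> '彩电'
""")

parser = nltk.ChartParser(toy_grammar)
for tree in parser.parse(["进口", "彩电"]):
    print(tree)  # two distinct trees are printed, one for each interpretation
```

A real parser would need statistics or context to decide which of the two trees is intended, which is precisely the disambiguation problem described above.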
Secondly, the knowledge required for disambiguation is difficult to acquire, represent, and apply, and the complexity of language processing makes it hard to design suitable methods and models.
Consider, for example, the problem of acquiring contextual knowledge. Even when a sentence contains no ambiguity, understanding it often requires taking context into account. "Context" here refers to the linguistic environment in which the current sentence is situated, such as the setting of the conversation or the preceding and following sentences. If the current sentence contains a pronoun, we need to look back at the preceding text to work out what the pronoun refers to. Take the example "小明欺负小亮,因此我批评了他". Does the "他" in the second clause refer to "小明" or to "小亮"? To understand the sentence correctly, we must recognize that the first clause, "小明欺负小亮", implies that "小明" did something wrong, so the "他" in the second clause should refer to "小明". Because context can bear on the current sentence in many different ways, deciding how to take context into account is one of the main difficulties of natural language processing.
Another example is the problem of background knowledge. Correctly understanding human language also requires sufficient background knowledge. A simple example often cited in the early days of machine translation research illustrates how daunting the task is. The English sentence "The spirit is willing but the flesh is weak." means roughly "心有余而力不足". Yet when a machine translation system of the time translated this sentence into Russian and then back into English, the result was "The vodka is strong but the meat is rotten." Taken word by word this is not unreasonable: "spirit" can indeed mean strong liquor, which is close to "vodka", and "flesh" and "meat" both refer to meat. So why do the two sentences mean such different things? The key is that the machine translation system had no understanding of the English idiom and simply translated it literally, producing a serious misunderstanding.
From the two main aspects of difficulties outlined above, we see that the root of the problem of natural language processing lies in the complexity of human language and the complexity of the external world described by language. Human language serves important functions in expressing emotions, communicating thoughts, and disseminating knowledge, thus requiring strong flexibility and expressive ability, while the knowledge required to understand language is endless. So how are people currently attempting to perform natural language processing?
3. Development Trends of Natural Language Processing
Currently, people approach natural language processing mainly along two lines: rule-based rationalism and statistics-based empiricism. The rationalist approach holds that human language is generated and governed by rules, so as long as those rules can be expressed in a suitable formal way, computers can understand human language and perform tasks such as translation. The empiricist approach holds instead that statistical knowledge of language can be learned from large amounts of language data and used to build effective statistical models, so with enough data, human language can be understood. When confronted with a real world full of ambiguity and uncertainty, however, each method runs into problems it cannot solve on its own. Human language does follow rules, but in actual use it is accompanied by a great deal of noise and irregularity, and a major weakness of the rationalist approach is its poor robustness: it cannot handle anything that deviates from its rules. The empiricist approach, for its part, can never collect unlimited language data for statistical learning, so it cannot understand human language perfectly either. Since the 1980s, rule-based rationalism has been increasingly questioned, processing large-scale language data has become the main goal of natural language processing for the present and the foreseeable future, statistical learning methods have attracted more and more attention, and machine learning is used ever more widely to acquire linguistic knowledge.
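As a concrete (and deliberately tiny) illustration of the empiricist idea, the sketch below estimates a bigram language model from a made-up two-sentence corpus and uses it to score word sequences; the corpus, smoothing constant, and vocabulary size are illustrative assumptions, not anything prescribed by this article.

```python
from collections import Counter

# Toy corpus: two already-segmented Chinese sentences (illustrative only).
corpus = [
    ["我", "爱", "北京", "天安门"],
    ["我", "爱", "自然", "语言", "处理"],
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    unigrams.update(tokens[:-1])                  # contexts for bigrams
    bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs

def bigram_prob(prev, word, k=0.5, vocab_size=1000):
    # Add-k smoothing so unseen word pairs still get a small probability.
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * vocab_size)

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["我", "爱", "北京", "天安门"]))  # fluent order: higher score
print(sentence_prob(["天安门", "北京", "爱", "我"]))  # scrambled order: much lower
```

The model knows nothing about grammar rules; it simply prefers word sequences that resemble the data it has seen, which is the essence of the statistical approach described above.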
With the arrival of the 21st century, we have entered an era of massive information represented in natural language. On the one hand, this flood of information provides far more "material" from which computers can learn human language; on the other hand, it offers natural language processing a much broader stage for applications. Search engines, an important application of NLP, have become essential tools for obtaining information, giving rise to giants such as Baidu and Google; machine translation has moved out of the laboratory and into ordinary households, with companies like Google and Baidu offering translation and translation-assistance tools built on massive online data; Chinese input methods based on natural language processing (such as Sogou, Microsoft, and Google input methods) have become indispensable for computer users; and computers and mobile phones equipped with speech recognition increasingly help users work and learn more effectively. In short, with the spread of the Internet and the explosion of information, natural language processing is playing an ever more important role in people's daily lives.
However, we also face a grim reality: how to effectively utilize massive information has become a global bottleneck issue restricting the development of information technology. Natural language processing inevitably becomes a new strategic high ground for the long-term development of information science and technology. At the same time, people are gradually realizing that simply relying on statistical methods can no longer quickly and effectively learn linguistic knowledge from massive data; only by fully leveraging the respective advantages of both rule-based rationalism and statistics-based empiricism, allowing them to complement each other, can we better and faster perform natural language processing.
As a young discipline with a history of less than a century, natural language processing is developing rapidly. Looking back, its development has not been smooth; there have been lows as well as highs, and we now face new challenges and opportunities. Current web search engines, for instance, still operate essentially at the level of keyword matching and lack deep natural language processing and understanding; speech recognition, text recognition, question-answering systems, and machine translation remain at fairly rudimentary levels. The road ahead is long and arduous. As a highly interdisciplinary emerging field, natural language processing, whether in exploring the nature of language or in practical application, is bound to bring exciting surprises and rapid progress in the future.