Liu Haitao: Reflections on Natural Language Processing

This article is adapted from the public account: Quantitative Linguistics

This short article was published in “Terminology Standardization and Information Technology” in 2001, twenty years ago. It is the last article I published while studying linguistics as an amateur. It not only marks the end of my identity as a language research enthusiast but also opened the door to my career in linguistics. In the summer twenty years ago, in the conference room on the third floor of the main building of Beijing Broadcasting Institute, I met President Liu Jinan, together with Professor Li Xiaohua, dean of the Broadcasting Academy, and the then director of the Personnel Department and head of the Organization Department. In 2002, I joined the Department of Applied Linguistics at the Broadcasting Academy. This short article also marked the beginning of a change in my approach to language research. Given the limited conditions of spare-time study, my so-called research had mostly been speculative: I would take a few language examples, work them over in my head, and write down some thoughts. Reflecting in this short article on certain fundamental issues of language made me feel that if, as a professional linguist, I were to continue in this way, what would be the point of my efforts?

I remember that a few years ago, at a performance appraisal meeting, an expert suggested that as a senior scholar I should not always be doing data-based empirical research and could appropriately do something more philosophical; I replied that that was exactly what I had done in my amateur period. Looking back at this article, many of the issues it raises still have practical significance today; indeed, some of its ideas, such as usage-based NLP and human-machine intelligence integration, have become mainstream in today’s natural language processing. Over the past twenty years, working with real corpora, we have also discovered many interesting patterns in the way human language operates, patterns that better reflect human realities and have found applications in fields such as artificial intelligence, second language acquisition, and language in special populations. Our recent reflections on language research in the era of artificial intelligence can be found in the extended reading section at the end. In the intelligent era, language is central to intelligence: without language it is hard to speak of intelligence. But this does not mean that linguists are necessarily useful, because human intelligence is not the same as artificial intelligence.

1. Language and Intelligence

The idea of using artificial objects to mimic certain intelligent behaviors of humans can be traced back to the 16th and 17th centuries. Language, as the most important characteristic of human intelligent behavior and as the externalization of intelligence, has long been regarded as a key to understanding and unraveling the mysteries of human intelligence. Even before the advent of computers, people attempted to carry out, by computational or mechanical means, language processing tasks that only humans could accomplish. When computers appeared in the 1940s as tools for extending human intelligence, the first project to apply them in the humanities, machine translation, was again concerned with language; this is no coincidence, but a natural continuation of human exploration in this field. The effort to mimic human language processing abilities with computers has produced specialized branches in both computer science and linguistics: natural language processing and computational linguistics. The two are essentially the same; the difference may only be that natural language processing leans more toward practice, while computational linguistics emphasizes theory. One can also say that computational linguistics provides the theoretical foundation for building natural language processing systems; for convenience, this article does not distinguish between the terms. It should be noted that the decades of work on processing natural language by computer have produced real achievements, which in turn have greatly helped humanity understand its own language. Overall progress, however, has not been encouraging; many problems remain unsolved, and much of our understanding still needs to be clarified. To address this situation, it is necessary to broaden our horizons and to study and think about it from multiple perspectives and disciplines.

Artificial intelligence is a discipline that studies intelligence using computational ideas and methods; in other words, artificial intelligence is the study of human intelligent behavior by simulating it with artificial objects such as computers. At present we do not clearly understand the mechanisms of human intelligence, which makes simulating such behavior difficult. Humans are language animals; language is one of the characteristics that distinguish humans from other animals and is the most important tool by which humans express and transmit knowledge and communicate thoughts. In other words, language is the most evident form of human intelligent behavior. Some scholars therefore propose that “the process of analyzing language is essentially the process of dissecting humanity itself, an analysis and understanding of the mechanisms of human intelligence.” Natural language processing, one of the most significant yet most challenging branches of artificial intelligence, has attracted numerous researchers over the years. To simulate human language processing capabilities, we must first understand language phenomena at a fundamental level. In our view, a natural language processing system without a linguistic theory as its foundation is unlikely to become a truly meaningful simulation of the human language processing mechanism. However, after carefully reading and analyzing a large body of literature from linguistics and related disciplines, we find that the issue is not so simple; it involves many disciplines, including philosophy, logic, and psychology. Reflection on this issue can only be placed within the historical context of humanity’s understanding of itself.

Language and human thought are closely related, a fact supported by extensive research in related fields. In the view of some scholars, language is not only a tool for humans to communicate knowledge but also a primary carrier of knowledge and even a defining tool for human knowledge. Here, we cannot discuss whether this assertion overemphasizes the importance of language to humanity; however, it is undoubtedly true that research on language aids in deciphering the mysteries of human intelligence.

2. Is Language Computable?

Why do we believe that computers can simulate human language processing mechanisms, and even the entire range of human intelligent behavior? Treating the computer as a qualitative, discrete machine for processing language materials requires that we first understand the structure of language and its other characteristics, and that we can accurately rewrite that structure and the other necessary materials into programs and data structures the computer can handle. Clearly, the theoretical basis for this idea is the notion that “the world is composed of numerous discrete facts”; in other words, that knowledge about everything in the world can be described with so-called “knowledge factors”. The philosophical assertions supporting this claim can be traced back to Plato, and were later championed by figures such as Leibniz, Hume, Russell, and the early Wittgenstein. With an ideal device for processing discrete facts (the computer) and these philosophical theories as support, people believed they could construct the world Plato sought: a world of guaranteed clarity, certainty, and control. In the eyes of artificial intelligence researchers, it is a world composed of data structures, decision theories, and automation. However, philosophers themselves began to doubt these assertions before they had been fully worked out; Wittgenstein’s later “Philosophical Investigations” is a notable critique of the arguments he himself had advanced in the earlier “Tractatus Logico-Philosophicus.” Wittgenstein’s transformation is an important event in the “linguistic turn” of contemporary philosophy. If the philosophical community has begun to move away from a research orientation based on decomposition and discreteness, should the field of artificial intelligence (and natural language processing), which developed on the basis of that very idea, not also reflect on itself?

For computers to process language, we must first ask: is language computable? This is a fundamental question for natural language processing and computational linguistics. Computability first requires that language be decomposable. The earliest observation of this characteristic of language was made by the German scholar Humboldt, who said, “Language faces an infinite, boundless field, the totality of all thinkable objects; it must therefore make infinite use of limited means, and it is the unity of the power of thought and the creative power of language that enables it to do so.” In fact, our present understanding of Humboldt’s statement owes much to Chomsky, who built the world-renowned theory of generative grammar on this basis. Since the 1950s, Chomsky and his followers have introduced many variants of generative grammar, one main purpose being to rein in its excessively strong generative capacity. We can therefore say that language can generate infinitely many sentences through a limited set of rules, but our understanding of this generative mechanism is still unclear, and this insufficient understanding of language generation leads to various problems in the natural language processing systems we construct. To borrow an idea from automation theory: when the controlled object is not fully understood, the mathematical model built for it cannot fully reflect reality, and the accuracy of the system declines. As the title of an article by the Chinese machine translation expert Liu Yongquan puts it, “Machine translation is ultimately a linguistic problem”; the statement applies to other areas of natural language processing as well, because machine translation is a field that integrates numerous natural language processing technologies.
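
To make the “finite means, infinite use” idea concrete, consider a minimal sketch of a toy generative grammar: a handful of rewrite rules, one of them recursive, suffices to generate an unbounded set of sentences. The grammar and vocabulary below are invented purely for illustration and say nothing about how the human generative mechanism actually works.

```python
import random

# A toy context-free grammar. The recursive rule "NP -> NP PP" is what lets
# a finite rule set generate an unbounded number of distinct sentences.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["NP", "PP"]],
    "VP":  [["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["linguist"], ["computer"], ["sentence"]],
    "V":   [["parses"], ["generates"]],
    "P":   [["with"], ["about"]],
}

def expand(symbol, depth=0, max_depth=6):
    """Recursively rewrite a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:            # terminal word
        return [symbol]
    options = GRAMMAR[symbol]
    if depth >= max_depth:               # force termination: the first production
        options = [options[0]]           # listed for each symbol is non-recursive
    production = random.choice(options)
    words = []
    for sym in production:
        words.extend(expand(sym, depth + 1, max_depth))
    return words

if __name__ == "__main__":
    for _ in range(3):
        print(" ".join(expand("S")))
    # e.g. "the linguist parses a sentence about the computer"
```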

Although Chomsky’s theory has only partially demonstrated the computability of language, it is interesting that the history of planned languages demonstrates fully that infinitely many texts can be generated from a limited set of grammatical rules and a limited vocabulary. Planned languages can prove this, however, only as languages run by the human brain, not by computers. We thus have reason to say that language is computable, but how to simulate its operating mechanism with artificial objects remains a matter for further research.

3. Semantics and “Decomposition”

If the idea of “decomposition” gives us the “computability” of language, it also to some extent hinders our further understanding and practical application of that computability. Generating infinite texts through limited rules generally refers to the formal side of language as a symbolic system, that is, to syntax. Naturally, people have carried the idea of “decomposition”, so effective in the syntactic domain, over to the content side of the language symbol, that is, to semantics. In linguistics and computational linguistics, the idea of “decomposition” has produced the most influential semantic processing method to date, the “semantic feature” method. The essence of theories and methods centered on semantic decomposition is to describe the deep structure of meaning with a set of stipulated “semantic features” or “semantic markers”. In theory, given enough “semantic features”, all meanings of words could be described. In practice, however, it is very difficult to determine how many components a word contains and which components they are. This follows from fundamental attributes of meaning, its vagueness and indeterminacy; in addition, different individuals understand the same word differently, which makes it hard to establish a unified set of semantic markers and semantic features. The different senses of a word form are better seen as points on a continuum: meaning is like a seamless fabric with no obvious boundaries. Meaning is, in essence, indivisible, or at least not cleanly quantifiable, and applying a decomposition approach to something that cannot be divided leads to predictable results. As the philosopher Putnam put it, “Words in natural language generally cannot be bounded by yes or no: some things can clearly be called trees, some cannot, while others are marginal cases. Worse still, the boundary between the clear cases and the marginal cases is itself unclear.” This may explain why semantic decomposition techniques are unlikely to yield satisfactory results, or at least cannot fully solve the semantic problem.
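
A deliberately simplified sketch can show both what the componential method does and where it breaks down. Below, a few words are described by binary semantic features; the features and entries are invented for illustration, and Putnam’s marginal cases appear as features to which no honest yes/no value can be assigned.

```python
# A toy componential ("semantic feature") lexicon. Each word is described by a
# set of binary features; None marks a feature that cannot honestly be given a
# yes/no value -- Putnam's marginal cases.
LEXICON = {
    "man":    {"HUMAN": True,  "ADULT": True,  "MALE": True},
    "woman":  {"HUMAN": True,  "ADULT": True,  "MALE": False},
    "boy":    {"HUMAN": True,  "ADULT": False, "MALE": True},
    "oak":    {"TREE": True},
    "bamboo": {"TREE": None},   # tree or grass? the boundary itself is unclear
}

def shared_features(w1, w2):
    """Return the features on which two words agree (both defined and equal)."""
    f1, f2 = LEXICON[w1], LEXICON[w2]
    return {k: v for k, v in f1.items()
            if k in f2 and v is not None and f2[k] == v}

print(shared_features("man", "boy"))    # {'HUMAN': True, 'MALE': True}
print(shared_features("oak", "bamboo")) # {} -- the marginal case resists decomposition
```

The clear cases decompose neatly; the marginal case simply cannot be assigned a value, which is exactly the difficulty the paragraph above describes.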

Because of the problems and defects exposed by “semantic feature” analysis, the focus of semantic research has shifted from “semantic features” or “componential analysis” to “semantic fields.” The essence of “semantic field” research is the division and representation of human knowledge, and this way of knowing and representing the world is often seen in planned languages, where such schemes are referred to as a priori and ideographic systems. Among the hundreds of ideographic schemes, the most elaborately worked out is undoubtedly the one published by the Englishman John Wilkins in 1668, which divided the entire world into 40 major categories, further subdivided into subcategories and species. To represent the concepts so divided, Wilkins invented a symbolic system he called the “real character.” After Wilkins, many planned language schemes based on the classification of human knowledge appeared, using numbers, images, specially designed symbols, and other devices. Wilkins hoped that his scheme would become a universal tool for expressing knowledge and exchanging information among humans, but, like the many other schemes based on knowledge classification, it failed. Incidentally, Wilkins’ scheme is generally regarded as representative of the seventeenth-century attempts to address language problems by mechanical means.

Human understanding of the world is constantly changing, and this change arises from human progress and social development. Over time, the classification of knowledge by humans will also change; we believe that the “semantic field” theory may solve some semantic issues, but it will certainly be limited. This is because, like the semantic feature analysis method, it is also based on the premise that knowledge is decomposable and discrete. At the same time, the practice of planned languages has proven the limitations of this method.

4. Ambiguity and Knowledge

Our inability to handle semantics satisfactorily with the “decomposition” method does not mean that semantics is entirely non-computable. In fact, we say that the problems of natural language processing are language problems because ambiguity exists at every level of natural language. With only slight exaggeration, the history of natural language processing over the past few decades is a history of battling ambiguity. Why do these ambiguities pose no serious problem for humans, yet stall research on computer understanding of language?

Ambiguity, as the greatest obstacle to the correct understanding of language, naturally becomes the focus of semantic research in computational linguistics. This research has given rise to computational semantics, which studies theories and methods for formalizing the semantics of natural language. Narrowly understood, computational semantics treats semantic analysis as a process of calculation and handles semantic problems with logical methods; broadly understood, it studies how to use computers to process, and to simulate, the human semantic processing mechanism, in particular how to handle and resolve ambiguity.
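
The narrow, logic-based sense of computational semantics can be illustrated with a minimal Montague-style fragment: word meanings are functions over a tiny model, and the meaning of a sentence is computed by applying the subject’s denotation to the verb’s. The model, lexicon, and sentences below are invented for illustration and cover only this one construction.

```python
# A minimal Montague-style fragment: word meanings are functions over a tiny
# model, and sentence meaning is computed by function application.
MODEL = {"sleeps": {"john", "mary"}, "works": {"mary"}}   # verb denotations (sets)
STUDENTS = {"john", "mary"}

# Subjects denote generalized quantifiers: functions from a predicate
# (a set of individuals) to a truth value.
john          = lambda pred: "john" in pred
every_student = lambda pred: all(x in pred for x in STUDENTS)
some_student  = lambda pred: any(x in pred for x in STUDENTS)

def sentence(subject, verb):
    """Compute the truth value of '<subject> <verb>' by applying the
    subject's denotation to the verb's denotation."""
    return subject(MODEL[verb])

print(sentence(john, "works"))            # False: john is not in the 'works' set
print(sentence(every_student, "sleeps"))  # True
print(sentence(some_student, "works"))    # True
```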

The Chinese computational linguist Feng Zhiwei has proposed a “Potential Ambiguity (PA) theory” based on the characteristics of ambiguous structures. The theory gives an objective account of the structure of ambiguity and of the process by which it is resolved, and it goes deeper than previous work on the problem. Natural language is full of ambiguity, but it also provides means for resolving it; otherwise language could hardly serve as an important tool for transmitting and preserving human knowledge. By further refining word classes, that is, by introducing semantic information into them, PA theory extends the description of ambiguous formats from purely syntactic structure to the semantic level. This is undoubtedly a significant advance, since ambiguity is ultimately a semantic-level phenomenon; but once semantics is involved, we inevitably touch on matters we do not yet understand well. PA theory emphasizes the semantic relationships among the various syntactic components, and it is precisely the existence of these semantic relationships that supports its new account of ambiguity. How computers are to understand these semantic relationships, and by what methods they are to process them, still requires further research.
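
The sketch below is not Feng Zhiwei’s formalism, but it illustrates the general move that PA theory makes: a pattern that is potentially ambiguous at the level of bare word classes (here the English “V-ing + N” pattern) can be fully or partly disambiguated once semantic information is attached to those word classes. The lexicon and semantic types are invented for illustration.

```python
# Potentially ambiguous pattern: "V-ing + N" (e.g. "visiting relatives").
# At the level of bare word classes two readings always exist:
#   (a) gerund:     [the act of] V-ing the N
#   (b) participle: N that is V-ing
# Semantic information attached to the word classes rules readings in or out.
LEXICON = {
    # verb: semantic type required of its agent / its patient (None = no patient)
    "visit": {"agent": "human", "patient": "human"},
    "read":  {"agent": "human", "patient": "text"},
    "sleep": {"agent": "human", "patient": None},
}
NOUNS = {"relatives": "human", "books": "text", "children": "human"}

def readings(verb, noun):
    sem, ntype = LEXICON[verb], NOUNS[noun]
    out = []
    if sem["patient"] == ntype:            # the noun can be the verb's patient
        out.append(f"gerund reading: the act of {verb}ing {noun}")
    if sem["agent"] == ntype:              # the noun can be the verb's agent
        out.append(f"participle reading: {noun} that are {verb}ing")
    return out

print(readings("visit", "relatives"))  # two readings: genuinely ambiguous
print(readings("read", "books"))       # one reading: books cannot read
print(readings("sleep", "children"))   # one reading: sleep takes no patient
```

When both readings survive, the phrase remains genuinely ambiguous and further context or knowledge is needed; when only one survives, the potential ambiguity has already been resolved at the lexical-semantic level.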

If we accept that computational semantics simulates the human semantic processing mechanism, then it may help to analyze how humans handle and resolve ambiguity. The key to how humans deal with ambiguity lies in the vast amount of knowledge stored in their brains: syntactic, semantic, and other kinds of common knowledge. Using this knowledge, people easily understand sentences that are ambiguous for computers; it is also this knowledge that, to some extent, supports the correctness and operability of PA theory. Like humans, computers will need a large amount of diverse knowledge to solve this problem adequately. Because computers differ greatly from humans, that knowledge must be represented explicitly; yet much knowledge is vague and difficult to quantify. Finding suitable and effective methods of knowledge representation is therefore the only way to realize natural language processing systems with existing computational resources. In theory it is not difficult to endow computers with some knowledge of the external world; the difficulty lies in the fact that the world’s knowledge is unbounded, and we are still not clear about exactly what knowledge a system needs in order to eliminate ambiguity.

The indivisibility and implicitness of meaning, the complexity of ambiguity, the open-endedness of language understanding, the relational nature of semantics, and the urgency of processing large-scale real texts: these factors intertwine and compel us to seek new methods and mechanisms for semantic processing. We believe that, for an ambiguous utterance, the task of understanding is to select the most suitable and most probable structure from among several. Note that we use the non-absolute terms “suitable” and “probable” here to indicate that in language understanding there is no absolute correctness, only relative “possibility.” Guided by this idea, we once proposed the semantic notion that “meaning equals the sum of its contextual relationships” and a semantic processing mechanism based on the “analogy principle.” The “corpus-based” approach to language processing now valued by the international computational linguistics community is, in essence, a shift from the qualitative to the non-qualitative. Whether this shift parallels the philosophical shift mentioned above remains to be seen.
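
The notion that “meaning equals the sum of its contextual relationships” is close in spirit to corpus-based, distributional methods: a word’s use is characterized by the company it keeps in real texts. The following sketch, offered only as an illustration under that reading, builds co-occurrence vectors from a toy corpus and compares words by cosine similarity; the corpus and window size are arbitrary.

```python
from collections import Counter
from math import sqrt

# A toy corpus; in practice this would be a large collection of real texts.
CORPUS = [
    "the linguist reads the book in the library",
    "the student reads the paper in the library",
    "the chef cooks the meal in the kitchen",
]

def context_vector(target, window=2):
    """Count words co-occurring with `target` within a +/- `window` span."""
    vec = Counter()
    for sentence in CORPUS:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                for j in range(max(0, i - window), min(len(words), i + window + 1)):
                    if j != i:
                        vec[words[j]] += 1
    return vec

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[k] * v2[k] for k in v1)
    norm = sqrt(sum(x * x for x in v1.values())) * sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

# Words used in similar contexts end up with similar vectors.
print(cosine(context_vector("linguist"), context_vector("student")))  # high
print(cosine(context_vector("linguist"), context_vector("chef")))     # lower
```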

Human language processing is a highly intelligent behavior. If intelligence is understood as the ability to solve problems using knowledge, then constructing any knowledge-based artificial system involves collecting knowledge, organizing it, and planning the strategies by which it is applied. Given humanity’s current theoretical and technical level, building a machine that completely replaces human intelligence is impossible. We believe a more realistic research goal at this stage is to build systems of “human-machine intelligence integration” to tackle problems that require human knowledge but cannot, for various reasons, yet be fully automated. In such a system, humans and machines (generally computers) each play to their strengths and work together toward an optimal or at least feasible solution. This shows both the necessity and the feasibility of building automatic language processing systems on the basis of “human-machine cooperation and mutual assistance.” In this way we give the saying “humans are machines” a new meaning: both humans and machines are components of the intelligent processing system to be constructed.

Language and knowledge are strongly holistic and relational in character, and we must take this into account when designing natural language processing systems; otherwise the systems we develop will have inherent deficiencies and will struggle with many complex language phenomena. Language and its products can be seen as the output of human intelligent behavior, and they constitute the largest original resource for studying that behavior. Indeed, throughout the long history of human development, language (and its products) has been the only visible carrier of knowledge and the most important means of carrying human intelligence forward. Our ignorance of the specific mechanisms of human intelligent processing, and perhaps the unknowability of those mechanisms themselves, compel us to start from the products and external characteristics of intelligent behavior in order to simulate the processing mechanisms of the human intelligent system. The result can be seen as a gray simulation system situated between a white box and a black box. Research results from the various branches of linguistics and from cognitive psychology can be regarded as its main theoretical foundation, and many theories and methods of computational linguistics serve as the means of realizing it.

5. Conclusion

Looking at humanity’s study of language, we find that the depth of our understanding of and research into language is closely tied to social development and to humanity’s understanding of the world as a whole; that is, language research bears the marks of its era. Numerous facts indicate that we are now in an era in which information and knowledge are tending toward “explosion”; the characteristic of language research in this period is that it must consider not only human needs but also the needs of machines. Research on “human-machine shared” dictionaries, grammars, and the like therefore becomes a focus of language research. The spread of computers and the emergence of the Internet have brought humanity into a new phase: as the virtual distance between people has shrunk dramatically, humans can no longer be satisfied with traditional modes of language communication. How to use computers to address the increasingly serious problems of human language communication that their own spread has brought about has thus become an important task facing many scholars. Unfortunately, because computers and humans differ fundamentally in structure and in how they solve problems, and because much about our own mechanisms for processing language remains unclear, our efforts to simulate human language behavior with computers have made little progress. If we do not view automatic language processing in isolation, but rather see it as one link in humanity’s exploration of itself, we will gain new insight into the problem. This article presents my reflections on the computer processing of language, drawing on insights from other fields.

References (Omitted)

Citation format: Liu Haitao. Reflections on Natural Language Processing [J]. Terminology Standardization and Information Technology, 2001(01): 23-27.

Extended Reading: Interested friends can also refer to some articles published by the author in recent years, such as “Methods and Trends in Language Research in the Era of Big Data,” “Two Major Tasks in the Construction of Chinese Linguistics: Internationalization of Achievements and Scientific Methods,” “Data-Driven Research in Applied Linguistics,” and “Paths and Significance of Theoretical Research in Linguistics in the Era of Big Data.”


END

Today’s editor: The little curly-haired one who likes Sanmao

