This article is sourced from the “Han Yu Tang” WeChat official account.
Author Introduction
Yuan Yulin, male, born in 1962, from Kunshan, Jiangsu Province. Graduated from Peking University in 1990 with a Ph.D. Currently a professor in the Department of Chinese Language and Literature at the Faculty of Arts, University of Macau. His main research areas include theoretical linguistics and Chinese linguistics, particularly syntax, semantics, pragmatics, computational linguistics, and Chinese information processing. He has published over 100 papers in journals such as “Social Sciences in China”, “Chinese Language”, “Contemporary Linguistics”, and “Journal of Chinese Information Processing”, and has published more than 10 books. In 2015, he was appointed as a Distinguished Professor of the Chang Jiang Scholars Program by the Ministry of Education, and in 2017, he was selected as a leading talent in the third batch of the National ‘Ten Thousand Talents Program’ in philosophy and social sciences.
Abstract
The success of ChatGPT in language understanding and generation poses challenges to traditional concepts in linguistics. This paper focuses on three unavoidable questions for linguists in the era of artificial intelligence: 1) How should we view the language generated by large language models like ChatGPT? 2) How should we provide semantic support for AI systems like language models? 3) How can we establish human-machine collaboration with AI systems like large language models? Our tentative answers are: 1) “machine-generated language” is a form of “interlanguage”; 2) we should combine referential semantics with distributional semantics; 3) human-machine collaboration can occur in three modes: embedded, co-pilot, and agent.
Keywords: ChatGPT; large language models; artificial intelligence; semantics; human-machine collaboration
Citation: Yuan Yulin. Challenges and Solutions in Linguistics Under the ChatGPT Context [J]. Modern Foreign Languages, 2024, 47(04): 572-577.
*This article is supported by the National Social Science Fund Special Project “Research on Basic Theoretical Issues of Chinese Linguistics with Chinese Characteristics in the New Era” (19VXK06), the University of Macau Chair Professor Research and Development Fund project “Research on the Visualization and Application of Syntactic-Semantic Knowledge in Dictionary Resources” (CPG2024-00005-FAH), and the Research Startup Fund project “Research on Semantic Reasoning Patterns and Mechanisms for Explainable AI” (SRG2022-00011-FAH).
1 Challenges of Large Models Like ChatGPT to Linguistics
For a long time, those engaged in linguistics (including language teaching and research) have held a firm conviction: as long as humanity does not perish, language will not disappear; and as long as language does not disappear, the profession of linguistics will certainly continue to exist. Moreover, over thousands of years, linguists have accumulated rich experience regarding the structure, function, and historical evolution of language, developed systematic bodies of knowledge, and built specialized disciplinary institutions. Although there are still debates and competing schools of thought within linguistics, and the competitive pressure should not be underestimated, the situation seems, on the surface at least, relatively stable, and the overall prosperity index of the field should not be low.
Who would have thought that at the end of 2022 the AI chatbot ChatGPT, with its powerful, nearly human-level language understanding and generation capabilities and its abilities in knowledge learning and common-sense reasoning, would burst onto the scene and sweep across the internet; at the same time, it inadvertently pushed linguistics to a historical crossroads, challenging a series of traditional concepts in the field. For example:
1) Language is a phenomenon unique to humans, and only the human brain can produce natural language;
2) Human language has recursion, so it cannot be characterized by statistical probability models;
It also raises a series of tricky questions for linguists, such as:
1) Is the language generated by large models like ChatGPT human natural language? Why?
2) Language symbols are arbitrary (i.e., there is no necessary relationship between form and meaning); how do language models bridge this semantic gap?
3) To what extent do current linguistic theories reveal the actual language mechanisms in the human brain?
4) Does the success of modern large language models affect the survival of linguistics as a discipline? Will the success of large language models in machine translation affect the survival of foreign language majors?
Clearly, considering these questions in the context of large models like ChatGPT can broaden the academic perspective of linguistics. This paper focuses on three questions: 1) How should linguists view the language generated by large models like ChatGPT? 2) How can linguists provide semantic support for AI systems like language models? 3) How can linguists collaborate with AI systems like large language models?
2 How to View Language Generated by Large Models Like ChatGPT?
Next, we will discuss the first question: Is the language generated by large models like ChatGPT human natural language? The answer to this question depends on how we define “human natural language”. If we stipulate that only language produced by humans (brains) can be called “human natural language”, then the language generated by large models like ChatGPT cannot be considered human natural language. However, this may be problematic, because language models have learned the syntactic, semantic, and pragmatic patterns of human natural language through massive text training, and under appropriate prompts from human users (including questions or commands), they can generate texts similar to human natural language. For example, here is a Q&A between me and ChatGPT:
I: Please explain to linguists how ChatGPT works in simple terms.
ChatGPT: Of course, I would be happy to explain it to you. The working principle of ChatGPT is based on a method called “deep learning” in artificial intelligence technology. Specifically, it is based on a model called “neural networks”. (Here, a small section is omitted)
During the training process, ChatGPT learns through a large amount of text data, which contains various language information. By observing the context of the input text and the related outputs (answers, suggestions, etc.), the model tries to learn the patterns, structures, and semantics of language. In this process, ChatGPT gradually establishes its understanding of language and learns to generate natural and fluent texts. (Here, a small section is omitted)
In summary, the working principle of ChatGPT is to train deep neural networks through large-scale data, enabling it to understand and generate natural language texts. This model can be used for various tasks, including conversation, text generation, question answering, etc.
The machine’s answer is not only fluent in phrasing and complete in structure but also correct in content and logically coherent, even a bit more comprehensive than the answer I had in mind. The only drawback is that the organization of the text seems a bit loose, possibly influenced by the prompt to “explain in simple terms”. Excluding such text, which is learned from human texts and which people themselves might well say or write, from the category of “human natural language” would be a shortsighted and narrow-minded approach. It would be somewhat like denying that test-tube babies conceived in hospital laboratories through assisted reproductive technology are “human babies”, which is unreasonable. In fact, the pragmatically minded AI community already treats machine-generated language as “human natural language”, both to expand the scale of training corpora quickly and to avoid the copyright risks that may arise from using real “human natural language” texts.
However, to be fair, if we acknowledge that machine-generated language (hereafter “machine language”) is “human natural language”, then when collecting corpora and extracting grammatical rules, can linguists use this “machine language” to generalize the grammatical rules of “human natural language”? After all, this “machine language” varies in quality and sometimes contains grammatical errors. For instance, in the answer above, the phrase introduced by “based on” lacks the main verb “is” before it, the collocation “establishing … ability” is unusual, and the phrase introduced by “through” is missing a main verb such as “adopt”. Just as the interlanguage produced by second language learners is an important research object in linguistics, “machine language” may become an important object of study in future linguistic research and engineering applications.
3 How Can Linguists Provide Semantic Support for Artificial Intelligence?
Next, we will discuss the second question: how do large models like ChatGPT bridge the semantic gap created by the arbitrariness of language symbols? As we know, human natural language is a symbolic system, and the connection between the sound and the meaning of a language symbol is established by the members of a society over the course of history. For language learners, this sound-meaning relationship, which is not one of logical necessity, is gradually acquired through embodied cognition, as they repeatedly combine particular language symbols with the relevant external world and inner experiences in a specific cultural and linguistic environment. For machines, this kind of denotational semantics does not work, because machines cannot understand the discrete semantic expressions of dictionary definitions. How to express the meaning of language symbols in a computable, continuous numerical form therefore becomes a challenge in natural language processing (NLP).

Fortunately, AI experts, guided by linguistic ideas such as “meaning is use” and “words with similar meanings have similar distributions”, represent the meaning of a word as a high-dimensional vector based on the various contexts in which the word appears. Roughly speaking, the meaning and usage of a word are represented by the contexts of that word. The word vectors in modern large language models can easily reach thousands of dimensions, which amounts to using thousands of pieces of contextual information about each word to represent its meaning and usage. Since words with similar meanings tend to appear in similar contexts, their vector values are also close; in other words, they are located near each other in the high-dimensional semantic space, and there are parallel derivational relationships between the vectors of related words. A famous example: the vector of “king” minus the vector of “man” (roughly, a vector for “royal”) plus the vector of “woman” is approximately equal to the vector of “queen”. It is this vectorized embedding representation of word meaning, based on distributional semantics, that injects semantics into large models; and with the support of large-scale training data, the attention mechanism of the Transformer, the ability to transfer knowledge learned from a small number of samples to other domains, and various other pre-training techniques, language models can bridge the semantic gap of language symbols.
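To make the vector arithmetic above concrete, here is a minimal, self-contained sketch in Python. The four-dimensional vectors are made-up toy numbers chosen purely for illustration (real embeddings are learned from corpora and have hundreds or thousands of dimensions), but the computation (cosine similarity applied to the vector king - man + woman) is the same kind used to illustrate the famous analogy.

```python
import numpy as np

# Toy, made-up 4-dimensional "word vectors" purely for illustration;
# real embeddings are learned from corpus co-occurrence statistics
# and have hundreds or thousands of dimensions.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.7]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point in the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# The classic analogy: king - man + woman ≈ queen
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the stored word whose vector lies closest to the derived vector.
best = max(vectors, key=lambda w: cosine(target, vectors[w]))
print(best)  # with these toy numbers, "queen" comes out on top
```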
Although distributional semantics has achieved great success in AI language models, the linguistic community has paid seriously insufficient attention to it and to its predecessor, latent semantic analysis. We hope that young scholars interested in AI-oriented semantic research will attend to and study distributional semantics from a linguistic perspective.

In fact, for AI applications, distributional semantics centered on word vectors only solves the problem of expressing and computing meaning in the digital world; the language symbols have not yet established any connection with the physical world, that is, they remain disconnected from the external environment. Language is a tool of communication, and language symbols ultimately need to establish relations of reference and binding with the relevant things in the physical world in which human life unfolds. This is known as the “symbol grounding problem”: projecting the semantic interpretation of language symbols from the brain or the machine into the physical world and establishing reference relations with the relevant things, events, or states in the context. For example, a household robot equipped with a large language model can understand the meaning of the command “Put the big apple on the table into the gift bag!”, but when performing the action it must establish a one-to-one correspondence between symbols such as “table”, “big apple”, and “gift bag” and the relevant objects in the current environment. This involves a series of complex issues of multimodal perception and semantic understanding, including the robot’s recognition of objects and the disambiguation of language symbols. Unless this symbol grounding problem is solved, artificial intelligence cannot achieve “embodied intelligence” and ultimately cannot enter people’s daily lives. It is precisely in disambiguation and semantic understanding that linguists’ knowledge and wisdom about meaning are indispensable. Evidently, in the era of artificial intelligence, how to combine referential semantics and distributional semantics organically so as to solve the problems of meaning and reference of language symbols for robots is a task shared by linguists and AI experts.
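To illustrate, at the simplest possible level, what “grounding” a phrase might involve, here is a small hypothetical sketch: it matches the phrase “the big apple on the table” to one object among those a robot’s perception module is assumed to have detected. All class names, fields, and scene data are invented for this illustration; a real system would require multimodal perception, spatial reasoning, and much richer disambiguation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DetectedObject:
    label: str        # category from a hypothetical vision system, e.g. "apple"
    size_cm: float    # rough size estimate
    location: str     # symbolic location, e.g. "on_table"

def ground_phrase(label: str, location: str,
                  scene: List[DetectedObject]) -> Optional[DetectedObject]:
    """Pick the referent of a phrase like 'the big apple on the table':
    filter by category and location, then take the largest candidate."""
    candidates = [o for o in scene if o.label == label and o.location == location]
    return max(candidates, key=lambda o: o.size_cm, default=None)

# An invented scene, standing in for what perception might report.
scene = [
    DetectedObject("apple", 6.0, "on_table"),
    DetectedObject("apple", 9.5, "on_table"),   # the "big apple"
    DetectedObject("gift_bag", 30.0, "on_floor"),
]

referent = ground_phrase("apple", "on_table", scene)
print(referent)  # DetectedObject(label='apple', size_cm=9.5, location='on_table')
```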
4 Three Modes of Collaboration Between Linguists and Large Models Like ChatGPT
At the beginning of 2024, Jensen Huang (Huang Renxun), the CEO of NVIDIA, said: “This is an amazing era because we are at the beginning of a new industrial revolution: steam engines, electricity, PCs, and the internet brought about the information revolution, and now it is artificial intelligence.” (Note 1: see https://mp.weixin.qq.com/s/QK1CiRvJcT3s2dZ0RaYlkQ) OpenAI CEO Sam Altman claimed that artificial intelligence will “change the world like mobile phones” and fundamentally reshape fields such as education, healthcare, and scientific research; it will become a powerful tool for humanity to build the future, empowering us to express our creative imagination and will. He encourages the younger generation to actively use AI tools to realize their personal visions and create unprecedented value. (Note 2: see https://mp.weixin.qq.com/s/_DrOy1eOkauTUzIWevh_aA) Indeed, this round of artificial intelligence has also opened a great era of academic innovation and research paradigm shifts.
Standing at the threshold of this new era, we linguists should also actively embrace artificial intelligence. On the one hand, we need to think about how to reveal the structure and function of human natural language more profoundly, so as to provide knowledge resources for natural language processing in artificial intelligence. On the other hand, we should consider how to make better use of modern large language models like ChatGPT to assist our language teaching and linguistic research. Here, we first need to understand and consider three possible modes of human-machine collaboration (Note 3: for more on the three modes of human-machine collaboration, see https://mp.weixin.qq.com/s/ws3U1NBouPivqatUb_ldQA):
1) Embedded mode, in which linguists communicate with large models like ChatGPT, using prompts to set specific goals and letting the AI system assist in completing certain tasks in research or teaching; the linguists then finish the remaining work independently. For example, prompts to ChatGPT could be:
Find modern Chinese two-character words ending in “得” like “获得, 落得”, the more the better;
Create 10 pairs of sentences like “难道要我下跪不成?→难不成要我下跪?”;
Generate 10 semantically violated English sentences similar to “Colorless green ideas sleep furiously” and translate them into Chinese;
Summarize the following paragraph into a sentence of 30 words or less, and refine and polish the abstract of the following article;
After the results are returned, linguists check whether they meet the requirements; they can also point out deficiencies, provide more specific prompts, and ask the machine to redo the work. In this mode, large models like ChatGPT serve as tools for executing instructions and completing specific tasks, while linguists act as task designers and commanders.
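As an illustration of the embedded mode, the sketch below sends the first of the prompts listed above to a chat model through the OpenAI Python SDK and prints the returned candidates for manual checking. The model name is a placeholder and the exact SDK details may differ depending on the installed version; this is a sketch of the workflow under those assumptions, not a prescribed implementation.

```python
# Embedded mode, minimally scripted: one clearly scoped prompt, then the
# linguist checks every returned item by hand.
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Find modern Chinese two-character words ending in '得' "
    "like '获得, 落得'; list as many as possible, one per line."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# The model executes the task; the linguist still verifies every candidate.
candidates = response.choices[0].message.content.splitlines()
for word in candidates:
    print(word)
```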
2) Co-pilot mode, in which linguists form a partnership with large models like ChatGPT and jointly participate in the whole research process, from proposing research plans, collecting relevant example sentences, and summarizing relevant rules to drafting paper outlines and writing relevant paragraphs, with human and machine collaborating throughout and each playing its part. For instance, prompts to ChatGPT could be:
What are negative polarity items? Give examples of typical negative polarity items in English. What are their syntactic and semantic characteristics? What references and research conclusions are there in this area?
Explain how the concept of negative polarity items plays a role in describing and explaining grammatical phenomena.
Are there negative polarity items in modern Chinese? If so, is it appropriate to conduct specialized research and writing on the topic of “Syntactic and Semantic Research on Negative Polarity Items in Modern Chinese”?
What important references and research conclusions are there on this topic? What other areas can be expanded and explored?
From which aspects should I approach the corpus survey and example analysis of negative polarity items in modern Chinese?
At the same time, linguists on the one hand search real text corpora and analyze them themselves to summarize rules; on the other hand, they let the machine analyze the corpora and summarize rules, then check its analysis to see whether it supplements or corrects their own conclusions or offers new perspectives; finally, they inform the machine of their findings, consult it on how to outline the paper, and even ask it to help draft relevant sections, which are then revised by hand. In this mode, large models like ChatGPT and linguists complement each other’s strengths and cooperate, with linguists as the leaders and machines as collaborators.
3) Agent mode, where linguists provide goals and necessary resources (such as specific literature and corpora, relatively clear viewpoints, conclusions, or evaluative tendencies) for large models like ChatGPT to independently complete certain tasks (such as writing a popular science article), with linguists supervising the process and evaluating the final results. For example, depending on the nature of the article or real needs, prompts to ChatGPT could be:
Please write, as a language teacher, an article for the general public about the importance of Chinese letter words (like “X-ray, CT scan”, etc.) for language communication, arguing that they can neither be abolished nor replaced, limited to 800 words;
Please generalize the following three pairs of sentences (the asterisk “*” indicates that a sentence is ungrammatical) into a grammatical rule, provide explanations and justifications, and write it up as an article of no more than 3000 words:
1) 买这种汽车的人不太多~这种汽车买的人不太多
2) 看见那个学生的人来了~*那个学生看见的人来了
3) 我见过买这种汽车的人~*这种汽车我见过买的人
For highly specialized topics like the latter, multiple rounds of human-machine question-and-answer interaction may be needed, with linguists providing more positive examples (supporting their argument) and negative examples (refuting opposing arguments); ultimately, linguists must meticulously check and rigorously evaluate the results. In this mode, large models like ChatGPT fully exhibit the interactive, autonomous, and adaptive characteristics of agents, approaching the level of an independent actor and agent, while linguists serve as supervisors of task execution and evaluators of the results. Admittedly, current large models like ChatGPT may not yet have reached this level of autonomy.