Source: Han Yu Tang
Author Introduction
Yuan Yulin, male, born in 1962 in Kunshan, Jiangsu Province, received his doctorate from Peking University in 1990. He is currently a professor in the Department of Chinese Language and Literature at the Faculty of Humanities, University of Macau. His main research areas are theoretical linguistics and Chinese linguistics, particularly syntax, semantics, pragmatics, computational linguistics, and Chinese information processing. He has published over 100 papers in journals such as “Social Sciences in China,” “Chinese Language,” “Contemporary Linguistics,” and “Journal of Chinese Information Processing,” as well as more than 10 books. In 2015, he was appointed a Distinguished Professor under the Ministry of Education’s Chang Jiang Scholars Program, and in 2017, he was selected as a leading talent in philosophy and the social sciences under the national “Ten Thousand Talents Program.”
Abstract
The success of ChatGPT in language understanding and generation challenges traditional concepts in linguistics. This article focuses on three unavoidable questions for linguists in the era of artificial intelligence: 1) How should we view the language generated by large models like ChatGPT? 2) How can we provide semantic support for artificial intelligence systems such as language models? 3) How can we achieve human-machine collaboration with large language models? Our tentative answers are: 1) “machine-generated language” is a type of “interlanguage”; 2) we should combine referential semantics and distributional semantics; 3) human-machine collaboration has three modes: embedding, co-pilot, and agent.
Keywords: ChatGPT; large language models; artificial intelligence; semantics; human-machine collaboration
Citation Style: Yuan Yulin. Challenges and Solutions in Linguistics Under ChatGPT Context [J]. Modern Foreign Languages, 2024, 47(04): 572-577.
*This article is supported by the National Social Science Fund project “Research on Basic Theoretical Issues of Chinese Linguistics with Chinese Characteristics in the New Era” (19VXK06), as well as the University of Macau’s Chair Professorship Research and Development Fund “Visualization and Application Research of Syntactic and Semantic Knowledge in Dictionary Resources” (CPG2024-00005-FAH) and the Research Startup Fund “Research on Semantic Reasoning Patterns and Mechanisms for Explainable AI” (SRG2022-00011-FAH).
1 Challenges of Large Models like ChatGPT to Linguistics
For a long time, those engaged in linguistics (including language teaching and research) have held a firm belief: as long as humanity does not perish, language will not die; and as long as language does not die, the profession of linguistics will surely endure. Moreover, over thousands of years, practitioners of linguistics have accumulated rich experience concerning the structure, functions, and historical evolution of language, as well as strategies and methods for native- and foreign-language teaching, constructing systematic bodies of knowledge and establishing specialized institutions and academic disciplines. Although there are disputes and competing schools of thought within linguistics, and the competitive pressure is not to be underestimated, the overall situation has seemed relatively stable, and the field’s overall prosperity index should not be low.
Who would have thought that at the end of 2022, the AI chatbot ChatGPT emerged, with its super-strong, near-human-level language understanding and generation capabilities, along with knowledge learning and common-sense reasoning abilities, quickly becoming a sensation online; at the same time, it inadvertently pushed linguistics to a historical crossroads, challenging a series of traditional concepts in linguistics. For example:
1) Language is a phenomenon unique to humans, and only the human brain can produce natural language;
2) Human language has recursion, so it cannot be described by statistical probability models;
It has also posed a series of tricky questions for linguists, such as:
1) Is the language generated by large models like ChatGPT human natural language? Why?
2) Language symbols are arbitrary (i.e., there is no necessary relationship between form and meaning); how, then, do language models cross this semantic gap?
3) To what extent do current linguistic theories reveal the actual language mechanisms in the human brain?
4) Has the success of modern large language models affected the existence of linguistics as a discipline? Will the success of large language models in automatic translation affect the existence of foreign language majors?
Clearly, thinking about these questions in the context of large models like ChatGPT can broaden the academic perspective of linguistics. This article focuses on three questions: 1) How should linguists view the language generated by large models like ChatGPT? 2) How can linguists provide semantic support for artificial intelligence systems such as language models? 3) How can linguists coexist (collaborate) with large models like ChatGPT?
2 How to View the Language Generated by Large Models like ChatGPT?
Next, we will discuss the first question: Is the language generated by large models like ChatGPT human natural language? The answer to this question depends on how we define “human natural language.” If we stipulate that only language produced by humans (the brain) can be called “human natural language,” then the language generated by large models like ChatGPT cannot be considered human natural language. However, this may pose a problem, as language models have essentially learned the syntax, semantics, and pragmatic patterns of human natural language through training on massive amounts of text and can generate text similar to human natural language under appropriate prompts from human users. For example, here is a question-and-answer session between myself and ChatGPT:
I: Please explain to linguists how ChatGPT works in simple terms.
ChatGPT: Of course, I would be happy to explain. ChatGPT’s working principle is based on a method in artificial intelligence technology called “deep learning.” Specifically, it is based on a model called “neural networks.” (A small section omitted)
During the training process, ChatGPT learns from a large amount of text data containing all kinds of linguistic information. By observing the context of the input text and the corresponding outputs (responses, suggestions, etc.), the model attempts to learn the patterns, structures, and semantics of language. In this process, ChatGPT gradually establishes an understanding of language and learns to generate natural, fluent text. (A small section omitted)
In summary, ChatGPT’s working principle is to train deep neural networks on large-scale data, enabling it to understand and generate natural language text. This model can be used for various tasks, including dialogue, text generation, question answering, etc.
The machine’s answer is not only fluent in its sentences and complete in structure, but also correct in content and logically coherent, even somewhat more comprehensive than what I had in mind. The only flaw is that the layout of the text is slightly loose, possibly influenced by the prompt’s request to “explain in simple terms.” To exclude such text, which the machine has learned from human texts and which a person might equally well say or write, from “human natural language” would be short-sighted and narrow. It would be somewhat like refusing to recognize test-tube babies conceived in hospital laboratories through assisted reproductive technology as “human babies,” which is plainly unreasonable. In fact, the pragmatically minded AI community already treats machine-generated language as “human natural language,” in order to quickly and effectively expand the scale of training corpora while avoiding the copyright risks that may arise from using genuine “human natural language” texts.
That said, if we acknowledge that machine-generated language (hereafter “machine language”) is “human natural language,” a further question arises: when collecting corpora and extracting grammatical rules, can linguists use this “machine language” to induce the grammar of “human natural language”? After all, “machine language” varies in quality and sometimes contains grammatical errors. In the answer above, for instance (in the Chinese original), the preposition “based on” lacks the core verb “is,” the collocation “establishing… ability” is unusual, and the core verb “adopt” is missing after the preposition “through.” Just as the interlanguage produced by second language learners is an important research subject in linguistics, “machine language” may become an important object of future linguistic research and engineering applications.
3 How Can Linguists Provide Semantic Support for Artificial Intelligence?
Next, we discuss the second question: how do large models like ChatGPT cross the semantic gap created by the arbitrariness of language symbols? As we know, human natural language is a symbolic system, and the connection between the sound and meaning of a language symbol is a convention established by members of a society over a long historical period. For language learners, this sound–meaning relationship, which has no logical necessity, is gradually acquired through embodied cognition, by repeatedly pairing particular language symbols with the relevant external world and inner experience in a specific cultural and linguistic environment. For machines, this kind of referential semantics does not work, because machines cannot comprehend the discrete semantic expressions in dictionary definitions. How to express the meaning of language symbols in a computable, continuous numerical form therefore became a challenge for natural language processing (NLP). Fortunately, guided by linguistic ideas such as “meaning is use” and “words with similar meanings have similar distributions,” AI researchers represent the meaning of a word as a high-dimensional vector based on the various contexts in which it appears. Roughly speaking, this means using a word’s various contexts to represent its meaning and usage. The word vectors of modern large language models often have thousands of dimensions, representing the meaning and usage of each word with thousands of dimensions of contextual information. Since words with similar meanings tend to appear in similar contexts, their vector values are also close; in other words, they are positioned close together in the high-dimensional semantic space, and parallel inferential relationships hold between the vectors of related words.
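The core idea, using a word’s contexts to represent its meaning, can be sketched in a few lines of Python. The toy corpus, window size, and word choices below are invented purely for illustration; real models learn dense vectors with thousands of dimensions from billions of tokens, rather than raw co-occurrence counts.

```python
from collections import Counter
from math import sqrt

# Toy corpus: each sentence is a list of tokens (invented for illustration).
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
    "the dog chases the cat".split(),
    "a car needs fuel".split(),
    "a truck needs fuel".split(),
]

def context_vector(word, window=2):
    """Count the words co-occurring with `word` within +/-window positions."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[sent[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cat, dog, car = context_vector("cat"), context_vector("dog"), context_vector("car")
print(round(cosine(cat, dog), 2))  # high (~0.95): shared contexts
print(round(cosine(cat, car), 2))  # 0.0: no shared contexts
```

Because “cat” and “dog” occur in near-identical contexts in this corpus, their vectors come out close, while “car” shares no context with “cat” at all; this is the computational counterpart of “words with similar meanings have similar distributions.”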
A famous example is that the vector of “king” minus the vector of “man” (approximately equal to the vector of “royal”) plus the vector of “woman” is approximately equal to the vector of “queen.” It is precisely this vectorized embedding expression of word meaning based on distributional semantics that injects semantics into large models; and with the support of various pre-training techniques such as large-scale training data, the attention mechanism of transformers, the ability to transfer knowledge learned from a small number of samples to other fields, and learning in context, language models can cross the semantic gap of language symbols.
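The “king − man + woman ≈ queen” relation can be made concrete with hand-made vectors. The four dimensions below (roughly: royalty, maleness, femaleness, humanness) and all their values are hypothetical; in a real model the thousands of dimensions are learned from text and are not individually interpretable.

```python
# Hypothetical 4-dimensional word vectors, hand-made for illustration only.
vectors = {
    "king":  [0.9, 0.9, 0.1, 0.4],
    "man":   [0.1, 0.9, 0.1, 0.3],
    "woman": [0.1, 0.1, 0.9, 0.3],
    "queen": [0.9, 0.1, 0.9, 0.4],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def nearest(v, exclude=()):
    """Return the vocabulary word whose vector is closest (squared Euclidean) to v."""
    def dist(w):
        return sum((x - y) ** 2 for x, y in zip(v, vectors[w]))
    return min((w for w in vectors if w not in exclude), key=dist)

# king - man + woman lands (here, exactly) on queen.
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```

Subtracting “man” strips away the maleness component while keeping royalty; adding “woman” restores femaleness, so the result falls next to “queen.” Excluding the query words before searching is standard practice in analogy evaluation.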
Although distributional semantics has achieved great success in AI language models, the attention the linguistic community has paid to it, and to its predecessor latent semantic analysis, is seriously insufficient. We hope that young scholars interested in AI semantics will attend to and study distributional semantics from a linguistic perspective. In fact, for AI applications, distributional semantics centered on word vectors only solves the expression and computation of meaning within the digital world; the language symbols have not yet established any connection with the physical world, which means they are disconnected from the external environment. Language is a tool for communication, and language symbols ultimately need to establish referential and binding relationships with the relevant entities in the physical world in which human life unfolds. This is the well-known “symbol grounding problem”: projecting the semantic interpretation of language symbols from the human brain or the machine onto the physical world, establishing referential relationships with the relevant entities, events, or states in the context. For example, a household robot equipped with a large language model can understand the meaning of the instruction “Put the big apple on the table into the gift bag!”, but when performing the operation it must establish a one-to-one correspondence between symbols such as “table,” “big apple,” and “gift bag” and the relevant objects in the current environment. This involves a series of complex problems of multimodal perception and semantic understanding, such as the robot’s recognition of objects and its disambiguation of the referential meanings of language symbols.
If the symbol grounding problem is not solved, artificial intelligence cannot achieve “embodied intelligence” and ultimately cannot enter people’s daily lives. It is precisely in disambiguation and semantic understanding that the semantic knowledge and wisdom of linguists are indispensable. Evidently, in the era of artificial intelligence, how to organically combine referential semantics and distributional semantics, so as to solve the meaning and reference problems of language symbols for robots, is a topic shared by linguists and AI experts.
4 Three Modes of Collaboration Between Linguists and Large Models like ChatGPT
At the beginning of 2024, Huang Renxun (Jensen Huang), CEO of the AI chip company NVIDIA, said: “This is an amazing era, because we are at the beginning of a new industrial revolution. In the past, steam engines, electricity, PCs, and the internet brought about the information revolution; now it is artificial intelligence.” (See: https://mp.weixin.qq.com/s/QK1CiRvJcT3s2dZ0RaYlkQ.) OpenAI CEO Sam Altman stated that artificial intelligence will change the world “like mobile phones,” completely transforming education, healthcare, and scientific research; it will become a powerful tool for humanity to build the future, empowering us to express our creative imagination and will. He encourages the younger generation to actively use AI tools to realize their personal visions and create unprecedented value. (See: https://mp.weixin.qq.com/s/_DrOy1eOkauTUzIWevh_aA.) Indeed, this round of artificial intelligence has also opened a great era of academic innovation and transformation of research paradigms.
Standing at the threshold of this new era, we linguists should also actively embrace artificial intelligence. On the one hand, we need to think about how to reveal the structure and functions of human natural language more profoundly to provide knowledge resources for natural language processing in artificial intelligence. On the other hand, we should consider how to better utilize modern large language models like ChatGPT to assist our language teaching and linguistic research. Here, we first need to understand and consider three possible modes of human-machine collaboration:
1) Embedding mode, where linguists interact with large models like ChatGPT, using prompt words to set specific goals, allowing the AI system to assist us in completing certain tasks in research or teaching; then, linguists autonomously conclude the work. For example, prompts to ChatGPT could include:
Find two-character words in modern Chinese that end with “得” like “获得, 落得”; the more, the better;
Create 10 sentence pairs similar to “难道要我下跪不成?→难不成要我下跪?”;
Create 10 semantically violated English sentences similar to “Colorless green ideas sleep furiously” and translate them into Chinese;
Summarize the following paragraph into a sentence of no more than 30 words; polish and refine the abstract of the following article;
After returning the results, linguists check whether they meet the requirements; they can also point out deficiencies, provide clearer prompts, and ask the machine to rewrite. In this mode, large models like ChatGPT serve as tools for executing instructions and completing specific tasks, while linguists act as task designers and directors.
2) Co-pilot mode, where linguists form a partnership with large models like ChatGPT, jointly participating in the research process, from proposing research plans, collecting relevant example sentences, summarizing related patterns, to drafting outlines and writing relevant sections of papers, all in human-machine collaboration, allowing both to play their respective roles. For example, prompts to ChatGPT could include:
What are negative polarity items? Please illustrate with typical examples of negative polarity items in English. What are their syntactic and semantic characteristics? What references and research conclusions are available in this area?
What role does the concept of negative polarity items play in describing and explaining grammatical phenomena?
Are there negative polarity items in modern Chinese? If so, is it appropriate to conduct a special study and write a paper on “Syntactic and Semantic Research on Negative Polarity Items in Modern Chinese”?
What important references and research conclusions are available on this topic? What further explorations and excavations can be conducted?
From which aspects is it appropriate to approach the corpus investigation and example analysis of negative polarity items in modern Chinese?
Meanwhile, linguists on the one hand search real text corpora, analyzing them and summarizing patterns themselves; on the other hand, they provide the corpus to the machine to analyze and summarize patterns, and then check its analysis to see whether it supplements or corrects their own conclusions or offers new perspectives; finally, they inform the machine of their findings, consult it on how to outline the paper, and may even ask it to help draft relevant sections, which they then revise themselves. In this mode, large models like ChatGPT and linguists complement each other’s strengths and cooperate, with the linguist as the leader and the machine as the collaborator.
3) Agent mode, where linguists provide goals and necessary resources (such as specific literature and corpora, relatively clear viewpoints, conclusions, or positions) for large models like ChatGPT to independently complete a task (such as writing a popular science article), while linguists supervise the process and evaluate the final results. For example, prompts to ChatGPT could include:
Please write an article for the public, as a language teacher, about the importance of Chinese letter words (such as “X-ray, CT scan,” etc.) for language communication (cannot be abolished, irreplaceable), limited to 800 words;
Please summarize the grammatical rules from the following three pairs of example sentences (with the asterisk “*” indicating that the sentence is ungrammatical) and provide explanations, justifications, and arguments in an article of no more than 3000 words:
1) Not many people buy this car~*Not many people this car buy.
2) The person who saw that student has come~*The person that student saw has come.
3) I have seen people who buy this car~*This car I have seen people who buy.
For highly specialized topics like the latter, multiple rounds of human-machine question-and-answer interaction may be needed, and the linguist must supply additional positive examples (supporting the argument) and negative examples (refuting opposing arguments); ultimately, the linguist must meticulously check and strictly verify the results. In this mode, large models like ChatGPT fully display the interactivity, autonomy, and adaptability of agents, approaching the status of independent actors, while linguists act as supervisors and evaluators of the task’s execution.