Challenges and Solutions in Linguistics Under ChatGPT Context

Author Introduction

Yuan Yulin, male, born in 1962, from Kunshan, Jiangsu Province. Graduated from Peking University in 1990 with a doctorate. Currently a professor in the Department of Chinese Language and Literature at the Faculty of Humanities, University of Macau. His main research areas include theoretical linguistics and Chinese linguistics, particularly syntax, semantics, pragmatics, computational linguistics, and Chinese information processing. He has published over 100 papers in journals such as “Social Sciences in China,” “Chinese Language,” “Contemporary Linguistics,” and “Journal of Chinese Information Science,” and has published more than 10 books. In 2015, he was appointed as a Distinguished Professor of the Chang Jiang Scholars Program by the Ministry of Education, and in 2017, he was selected as a leading talent in philosophy and social sciences under the National “Ten Thousand Talents Program.”

Abstract

The success of ChatGPT in language understanding and generation challenges traditional concepts in linguistics. This article focuses on three unavoidable questions for linguists in the era of artificial intelligence: 1) How should we view the language generated by large language models like ChatGPT? 2) How can we provide semantic support for AI systems like language models? 3) How can we form human-machine collaboration with AI systems like large language models? Our tentative answers are: 1) “machine-generated language” is a form of “interlanguage”; 2) we should combine referential semantics and distributional semantics; 3) there are three modes of human-machine collaboration: embedding, co-pilot, and agent.

Keywords: ChatGPT; large language models; artificial intelligence; semantics; human-machine collaboration

Citation: Yuan Yulin. Challenges and Solutions in Linguistics Under ChatGPT Context [J]. Modern Foreign Languages, 2024, 47(04): 572-577.

*This article is supported by the National Social Science Fund Special Project “Research on Basic Theoretical Issues of Chinese Linguistics with Chinese Characteristics in the New Era” (19VXK06) and the University of Macau Chair Professorship Research and Development Fund “Visualization of Syntactic and Semantic Knowledge in Dictionary Resources and Its Application Research” (CPG2024-00005-FAH) and the Research Start-up Fund “Research on Semantic Reasoning Patterns and Mechanisms for Explainable AI” (SRG2022-00011-FAH).

1 Challenges of Large Models like ChatGPT to Linguistics

For a long time, those engaged in linguistics (including language teaching and research) have held a firm conviction: as long as humanity does not perish, language will not disappear; and as long as language does not disappear, the profession of linguistics will surely endure. Moreover, over thousands of years linguists have accumulated rich experience concerning the structure, function, and historical evolution of language, as well as strategies and methods for teaching native and foreign languages, building systematic bodies of knowledge and forming specialized disciplinary institutions and norms. Although linguistics has its internal tensions, with various schools continually debating the concepts and methods of teaching and research, and although competitive pressure cannot be underestimated, the field has appeared relatively stable, and its overall prosperity index should not be low.

Who would have thought that, at the end of 2022, the AI chatbot ChatGPT would emerge with superhuman-level language understanding and generation capabilities, as well as knowledge learning and commonsense reasoning abilities, and become an overnight sensation online; at the same time, it inadvertently pushed linguistics to a historical crossroads, challenging a series of its traditional concepts. For example:

1) Language is a phenomenon unique to humans, and only the human brain can produce natural language;

2) Human language has recursion, so it cannot be characterized by statistical probability models;

It also raises a series of thorny questions for linguists, such as:

1) Is the language generated by large models like ChatGPT human natural language? Why?

2) Language symbols have arbitrariness (i.e., there is no necessary relationship between form and meaning), how do language models cross this semantic gap?

3) To what extent do current linguistic theories reveal the actual language mechanisms in the human brain?

4) Has the success of modern large language models affected the existence of linguistics as a discipline? Will the success of large language models in automatic translation affect the existence of foreign language majors?

Clearly, thinking about these questions in the context of large models like ChatGPT can broaden the academic perspective of linguistics. This article focuses on three questions: 1) How should linguists view the language generated by large models like ChatGPT? 2) How can linguists provide semantic support for AI systems like language models? 3) How can linguists coexist (collaborate) with AI systems like large language models?

2 How to View Language Generated by Large Models like ChatGPT?

Next, let’s discuss the first question: Is the language generated by large models like ChatGPT human natural language? The answer to this question depends on how we define “human natural language.” If we stipulate that only language produced by humans (the brain) can be called “human natural language,” then the language generated by large models like ChatGPT cannot be considered human natural language. However, this may be problematic because large language models have essentially learned the syntactic, semantic, and pragmatic patterns of human natural language through training on vast amounts of text, and under suitable prompts from human users (including questions or commands), they can generate texts similar to human natural language. For example, here is a Q&A between me and ChatGPT:

I: Please explain to linguists how ChatGPT works in simple terms.

ChatGPT: Of course, I’m happy to explain it to you. The working principle of ChatGPT is based on a method in artificial intelligence technology called “deep learning”; specifically, it is based on a model called a “neural network.” (A small section is omitted here.)

During the training process, ChatGPT learns from large amounts of text data, which contain various kinds of linguistic information. By observing the contexts of input text and the related outputs (responses, suggestions, etc.), the model attempts to learn the patterns, structures, and semantics of language. In this process, ChatGPT gradually builds its understanding of language and learns to generate naturally fluent text. (A small section is omitted here.)

In general, the working principle of ChatGPT is to train deep neural networks on massive data, enabling it to understand and generate natural language text. This model can be used for various tasks, including dialogue, text generation, question answering, etc.

The machine’s response is not only structurally fluent but also correct in content and logically coherent, even slightly more comprehensive than my ideal answer. The only shortcoming is that the layout of the text is a bit loose, possibly influenced by the prompt to “explain in simple terms.” To exclude such text, which the machine learned from human texts and which people might equally well say or write, from “human natural language” would be a shortsighted and narrow approach, somewhat akin to refusing to recognize babies conceived through assisted reproductive technology as “human babies”: it lacks common sense. In fact, the pragmatically minded AI community already treats machine-generated language as “human natural language,” using it to expand the scale of training corpora quickly and effectively while avoiding the copyright risks of using actual human-written texts.

However, let’s return to the point. If we acknowledge that machine-generated language (“machine language” for short) is “human natural language,” can linguists include this “machine language” when collecting corpora and extracting grammatical rules, using it to generalize the grammar of “human natural language”? After all, this “machine language” is of mixed quality and sometimes contains grammatical errors. In the reply above, for example, the prepositional phrase introduced by “based on” lacks its core verb “is,” the pairing “builds … understanding” is an unusual verb-object collocation, and the preposition “through” is missing its core verb “adopt,” and so on. Just as the interlanguage produced by second-language learners is an important object of linguistic research, “machine language” may become an important topic in future linguistic research and engineering applications.

3 How Can Linguists Provide Semantic Support for AI?

Next, let’s discuss the second question: how do large models like ChatGPT cross the semantic gap created by the arbitrariness of language symbols? As we know, human natural language is a symbolic system, and the connection between language symbols and meanings is a convention established by the members of a society over a long history. For language learners, this non-logical relationship between sound and meaning is gradually acquired through embodied cognition, by associating particular language symbols with the relevant external world and internal experiences in a specific cultural and linguistic environment. For machines, such referential semantics does not work, because machines cannot make sense of the discrete semantic expressions typical of dictionary definitions. How to express the meaning of language symbols in a computable, continuous numerical form thus became a challenge for natural language processing (NLP). Fortunately, guided by linguistic ideas such as “meaning is use” and “words with similar meanings have similar distributions,” AI researchers represent the meaning of a word as a high-dimensional vector derived from the various contexts in which the word appears. Roughly speaking, the meaning and usage of a word are expressed by the contexts in which that word occurs. In modern large language models, word vectors often have thousands of dimensions, which amounts to using thousands of contextual features to represent the meaning and usage of each word. Since words with similar meanings tend to appear in similar contexts, their vectors take on similar values; in other words, they lie close together in the high-dimensional semantic space, and parallel derivational relationships hold between the vectors of related words.
A famous example is that the vector of “king” minus the vector of “man” (leaving, roughly, a vector for “royalty”) plus the vector of “woman” is approximately equal to the vector of “queen.” It is this vectorized, embedded representation of word meaning based on distributional semantics that injects semantics into large models; moreover, with the support of large-scale training data, the transformer’s attention mechanism, various pre-training techniques that allow knowledge learned from limited samples to transfer to other domains, and in-context learning, language models can cross the semantic gap of language symbols.
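The “king − man + woman ≈ queen” arithmetic can be sketched with toy numbers. The four-dimensional vectors below are invented purely for illustration (real embeddings are learned from corpora and have hundreds or thousands of dimensions); the point is only that a nearest-neighbor search under cosine similarity recovers “queen” from the combined vector.

```python
import math

# Toy 4-dimensional word vectors (hypothetical values for illustration only;
# real embeddings are learned from text and far higher-dimensional).
vectors = {
    "king":  [0.9, 0.8, 0.1, 0.7],
    "man":   [0.1, 0.8, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8, 0.7],
}

def cosine(u, v):
    """Cosine similarity: how close two vectors point in semantic space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Compute king - man + woman, component by component.
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the vocabulary word whose vector is closest to the result.
best = max(vectors, key=lambda word: cosine(vectors[word], result))
print(best)  # queen
```

With these toy values the match is exact; with real embeddings the analogy holds only approximately, which is why the relation is stated with “approximately equal.”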

Although distributional semantics has been a great success in AI language models, the linguistics community has paid far too little attention to it and to its predecessor, latent semantic analysis. We hope that young scholars interested in AI-related semantic research will attend to and study distributional semantics from a linguistic perspective. In fact, for AI applications, distributional semantics centered on word vectors only solves the representation and computation of meaning in the digital world; the language symbols still have no connection to the physical world, i.e., they remain detached from the external environment. Language is a tool of communication, and language symbols must ultimately establish referential, binding relationships with the relevant things in the physical world in which humans live. This is the well-known “symbol grounding problem”: projecting the semantic interpretation of language symbols from the human brain or the machine into the physical world, and establishing referential relationships with the relevant objects, events, or states in context. For example, a household robot equipped with a large language model may understand the meaning of its owner’s instruction “Put the big apple on the table into the gift bag!”, but to carry out the task it must establish a one-to-one correspondence between symbols such as “table,” “big apple,” and “gift bag” and the relevant objects in the current environment. This involves object recognition and the resolution of ambiguity in the referential meanings of language symbols, a series of complex problems of multimodal perception and semantic understanding.
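As a minimal sketch of what such grounding involves, the toy code below resolves the referring expression “the big apple on the table” against a hypothetical list of objects reported by the robot’s perception system. The object records, attribute names, and selection rule are all illustrative assumptions, not any real robotics API; the point is that the symbol must be mapped onto exactly one perceived object before the instruction can be executed.

```python
from dataclasses import dataclass

@dataclass
class PerceivedObject:
    kind: str        # category label from the (hypothetical) vision system
    size_cm: float   # estimated diameter
    location: str    # supporting surface the object rests on

# A hypothetical scene: several candidate referents for "apple".
scene = [
    PerceivedObject("apple", 6.0, "table"),
    PerceivedObject("apple", 9.5, "table"),    # the "big apple on the table"
    PerceivedObject("apple", 9.0, "counter"),
    PerceivedObject("gift_bag", 30.0, "floor"),
]

def ground(kind: str, location: str, scene: list) -> PerceivedObject:
    """Resolve a referring expression like 'the big apple on the table':
    restrict to the right category and location, then pick the largest."""
    candidates = [o for o in scene if o.kind == kind and o.location == location]
    return max(candidates, key=lambda o: o.size_cm)

target = ground("apple", "table", scene)
print(target.size_cm)  # 9.5 -- the larger of the two apples on the table
```

Note that the disambiguation step (“big” selects the larger of the two table apples) is exactly where the linguistic knowledge discussed in the text comes into play; a real system would need multimodal perception rather than a hand-written scene.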

If the symbol grounding problem remains unsolved, artificial intelligence cannot achieve “embodied intelligence” and will ultimately be unable to enter people’s daily lives. And it is precisely in disambiguation and semantic understanding that linguists’ semantic knowledge and insight are indispensable. Evidently, then, in the era of artificial intelligence, how to combine referential semantics and distributional semantics organically, so as to solve the problems of meaning and reference for language symbols in robots, is a challenge shared by linguists and AI experts.

4 Three Modes of Collaboration Between Linguists and Large Models like ChatGPT

At the beginning of 2024, Jensen Huang (Huang Renxun), CEO of the AI chip company NVIDIA, stated: “This is an amazing time, as we are at the beginning of a new industrial revolution. In the past, steam engines, electricity, PCs, and the internet brought about the information revolution; now it is artificial intelligence.” (See https://mp.weixin.qq.com/s/QK1CiRvJcT3s2dZ0RaYlkQ.) OpenAI CEO Sam Altman stated that AI will “change the world like mobile phones,” completely transforming education, healthcare, and scientific research; it will become a powerful tool for humanity to build the future, empowering us to express our creative imagination and will. He encourages the younger generation to make active use of AI tools to realize their personal visions and create unprecedented value. (See https://mp.weixin.qq.com/s/_DrOy1eOkauTUzIWevh_aA.) Indeed, this round of artificial intelligence has opened a great era of academic innovation and research-paradigm shift.

Standing at the threshold of this new era, we linguists should also actively embrace artificial intelligence. On the one hand, we need to think about how to reveal the structure and function of human natural language more profoundly, so as to provide knowledge resources for natural language processing in AI. On the other hand, we need to consider how to make better use of modern large language models like ChatGPT to assist our language teaching and linguistic research. Here, we first need to understand and consider three possible modes of human-machine collaboration (for more on the three modes, see https://mp.weixin.qq.com/s/ws3U1NBouPivqatUb_ldQA):

1) Embedding mode, where linguists interact with large models like ChatGPT, using prompts to set specific goals, allowing AI systems to assist us in completing one or several tasks in research or teaching; then, linguists independently conclude the work. For example, prompting ChatGPT:

Find two-character words in modern Chinese that end with “得” like “获得, 落得”; the more, the better;

Create 10 pairs of sentences like “难道要我下跪不成?→难不成要我下跪?”;

Generate 10 semantically violated English sentences similar to “Colorless green ideas sleep furiously” and translate them into Chinese;

Summarize the following paragraph into a sentence within 30 words, and refine and polish the abstract of the following article;

After the results are returned, linguists check whether they meet the requirements; they can also point out shortcomings, provide clearer prompts, and request the machine to rewrite. In this mode, large models like ChatGPT serve as tools for executing instructions and completing specific tasks, while linguists act as task designers and commanders.

2) Co-pilot mode, where linguists form a partnership with large models like ChatGPT, jointly participating in the research workflow, from proposing research plans, collecting relevant examples, summarizing relevant rules, to drafting outlines of papers and writing related paragraphs, all done through human-machine collaboration, allowing both to play their respective roles. For example, prompting ChatGPT in stages:

What are negative polarity items? Provide examples of typical negative polarity items in English. What are their syntactic and semantic characteristics? What references and research conclusions are available on this topic?

Explain how the concept of negative polarity items contributes to describing and explaining grammatical phenomena.

Are there negative polarity items in modern Chinese? If so, would it be appropriate to conduct a specialized study and write a paper on the syntactic and semantic research of negative polarity items in modern Chinese?

What are the important references and research conclusions on this topic? What further areas can be explored and excavated?

From which aspects should I approach the corpus investigation and example analysis of negative polarity items in modern Chinese?

On the one hand, linguists search real text corpora and analyze the data themselves to summarize rules; on the other hand, they give the data to the machine to analyze and summarize rules, then check its analysis to see whether it supplements or corrects their own conclusions or offers new perspectives. They then inform the machine of their findings, consult it on how to outline a paper, and may even ask it to help draft relevant sections, which they subsequently revise themselves. In this mode, large models like ChatGPT and linguists complement each other’s strengths and cooperate, with linguists as leaders and machines as collaborators.

3) Agent mode, where linguists provide goals and necessary resources (such as specific literature and corpora, relatively clear viewpoints, conclusions, or tendencies) and let large models like ChatGPT independently complete certain tasks (such as writing a popular science article), with linguists supervising the process and evaluating the final results. For example, based on the nature of the article or real needs, prompting ChatGPT:

Please write an article for the general public on the importance of Chinese letter words (like “X-ray, CT scan,” etc.) for language communication (cannot be abolished, irreplaceable) from the perspective of a language teacher, limited to 800 words;

Please write an article of no more than 3000 words from the perspective of a professional linguist, summarizing a grammatical rule based on the following three pairs of sentences (with asterisk “*” indicating that the sentence is ungrammatical):

1) Not many people buy this type of car~This type of car is not bought by many people.

2) The person who saw that student has arrived~*That student is the person who was seen.

3) I have seen people who buy this type of car~*This type of car is what I have seen people buy.

This highly specialized topic may require multiple rounds of human-machine Q&A, and linguists need to supply more supporting examples (to back their argument) and counterexamples (to refute opposing arguments), and then carefully check and scrutinize the results. In this mode, large models like ChatGPT fully display the interactivity, autonomy, and adaptability of agents, approaching the status of independent actors, while linguists serve as supervisors of task execution and evaluators of the results. Admittedly, current large models like ChatGPT may not yet have reached this level of autonomy.
