Is BERT Perfect? Do Language Models Truly Understand Language?

Machine Heart Release

Author: Tony, Researcher at Zhuiyi Technology AI Lab

Language models such as BERT are now widely used in natural language processing (NLP). A question nonetheless keeps coming up: do these models truly understand language? Experts and scholars disagree. The author examines the question from several angles: the nature of language, the relationship between linguistic symbols and meaning, and what language models actually learn from language.


After BERT exploded onto the GLUE leaderboard in 2018, pre-trained language models became the standard for natural language processing (NLP) tasks. Intuitively, the model must at least understand language to some extent to be so effective — as Sebastian Ruder noted in an article [1], to solve the task of language modeling, a language model needs to learn both syntax and semantics.
Do language models such as BERT really “understand language”? If so, then with the continued accumulation of compute and data, natural language understanding (NLU) would be thoroughly solved in the foreseeable future. However, in a talk this September [2], Emily Bender rejected that expectation: meaning cannot be learned from form alone. In other words, Bender holds that language models cannot learn semantics.
Understanding this conclusion involves a fundamental question: what does it mean to understand language? In the rapid iteration of deep learning, answering this question — at least partially — has methodological implications for natural language processing.
From the musings of philosophers millennia ago to the intense debates of the past two centuries, scholars in the philosophy of language, linguistics, and psycholinguistics have gradually sketched out answers to these questions. While they cannot settle the questions definitively, they offer many useful intuitions, which is precisely what engineering research needs.
Language is a symbolic system used for communication [1]
Language is first of all a symbolic system. It consists of discrete symbols: we identify phonemes from continuous speech signals, and graphemes (written characters) from continuous visual signals. Symbols can combine to form new symbols, and the rules of combination are specified by grammar. Morphology specifies how morphemes combine to form words, as in “un-deni-able” and “go-ing”; syntax specifies how words combine to form sentences.
Language is used for communication, so symbols and their combination rules represent certain meanings. The connection between symbols, rules, and meanings is arbitrary: Chinese uses “我” for the first person, English uses “I”, and Japanese can use “わたし”. This arbitrariness is nevertheless constrained by the conventions of the speech community. I could certainly use “你大爷” (a curse, literally “your uncle”) to express affection, but only if I am prepared for a life of loneliness.
“Understanding language” therefore means understanding both the symbols of a language and their combination rules, and the meanings they correspond to. Both are extremely difficult problems.
A Symbolic System with Order in Chaos [1]
The symbolic system of a language is conventional, and so carries a degree of arbitrariness. This arbitrariness introduces ambiguity and exceptions into the system, making language hard to analyze and hard to learn.
Ambiguity manifests in various forms. The same word may come from different combinations of morphemes:
  • unlockable = un + lockable / unlock + able

The same sentence can have different grammatical structures and different meanings:
  • Welcome new teachers and students = Welcome + [new teachers] and students / Welcome + new [teachers and students] (does “new” modify only the teachers, or the students as well?)

The same word can also have different meanings: “book” can mean “a bound volume” or “to reserve”.
The above are all cases where one word or sentence corresponds to multiple meanings, that is, ambiguity. Conversely, one meaning can be expressed by different sentences:
  • Shinomiya Kaguya is my wife = I am Shinomiya Kaguya’s husband

Language rules also admit many kinds of exceptions. Inflected languages usually have irregular forms: the past tense of “go” in English is not “goed” but “went”. Idioms and set phrases break many grammatical rules: the Chinese idiom “卧薪尝胆” (literally “sleeping on firewood and tasting gall”) means enduring hardship for the sake of a goal, nothing like its literal reading, while “long time no see” does not conform to English grammar at all. Loanwords likewise disrupt morphological rules: “鲁棒” is not a stick (棒) from Shandong (鲁) but a transliteration of the English word “robust”.
To understand the symbols of a language and their combination rules, then, we must resolve the ambiguities and distinguish the rules from the exceptions.
Meaningless Meaning [2] [3]
Symbols and rules have meaning. But what is meaning? This is a genuinely hard question, and theories of meaning in the philosophy of language have discussed it at length.
Language is used for communication, and it is often used to describe some state of reality. The meaning of language must therefore correspond to things that exist in reality, which we call referents: “Obama” refers to the real person Obama, and “Beijing” refers to the place Beijing. So meaning must involve reference; some even say that meaning is reference.
But is meaning merely reference? This naive theory of meaning is easily refuted. Many words have no referent in reality, such as “if” and “possible”, yet they clearly express meanings. Still, the meanings of proper names like “Obama” and “Beijing” seem to consist only of their referents, right? Frege’s puzzle undermines even this view. Consider the sentence “鲁迅是周树人” (“Lu Xun is Zhou Shuren”; Lu Xun: what does this have to do with me?). If the meanings of “鲁迅” and “周树人” were just their shared referent, the sentence would be equivalent to the tautology “鲁迅是鲁迅” (“Lu Xun is Lu Xun”). Yet “鲁迅是周树人” genuinely tells us something, so the meaning of a proper name must include something beyond its referent. Frege held that the meaning of an expression is its sense, which stands between the expression and its referent: the sense of “鲁迅” might include being a literary master who abandoned medicine for literature, while the sense of “周树人” might include being born at a certain time and place.
But what, then, is the sense of a word, or of a sentence? Possible-worlds semantics [2] suggests that the sense of each component of a sentence is a function that selects its referent in each possible world, and the sense of the sentence itself is a function that picks out the possible worlds in which the sentence is true. (“Possible worlds” is meant literally: in one possible world, 周树人 never abandoned medicine for literature and became a famous doctor instead; in another, 鲁迅 is a famous person actually named 马树人.) But how do we determine these functions?
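To make this concrete, here is one standard textbook formulation (the notation is ours, not the article’s): writing W for the set of possible worlds and D for the domain of individuals, the sense (intension) of a name n and of a sentence S are the functions

```latex
% Sense as a function over possible worlds:
%   the sense of a name maps each world to the individual it picks out there;
%   the sense of a sentence maps each world to a truth value.
\[
  [\![ n ]\!] : W \to D, \qquad
  [\![ S ]\!] : W \to \{0, 1\}, \qquad
  [\![ S ]\!](w) = 1 \iff S \text{ is true in } w .
\]
```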
For words, we can say that this function is the concept corresponding to the word. But what is a concept? We can say that each concept corresponds to a set of defining properties: the properties of birds include “having feathers” and “being able to fly”; but some birds, like penguins, cannot fly, so pinning down the defining properties of a concept is operationally difficult. We can instead say that each concept corresponds to its typical members: typical birds are eagles and cuckoos, which can fly, while penguins are atypical, sitting at the fringe of the concept. But then what are the typical members of abstract nouns like “love” and “ideal”?
Suppose we have somehow completed this characterization of meaning; all of it constitutes our linguistic knowledge. Linguistic knowledge, however, seems entangled with encyclopedic knowledge. “Bald means having no hair” is linguistic knowledge, describing our convention about the symbol “bald”; “programmers are prone to baldness” is encyclopedic knowledge, describing a regularity of the world. But is “light things fall down” linguistic knowledge, that is, part of the meaning of the symbol “light”, or encyclopedic knowledge about the law of gravity?
So far we have not considered another hard element: context. Without a specific context, who is “you” in “you are still too young”? Who is “the richest person in the world” in “the richest person in the world is really rich”? Such deictic expressions need contextual information, such as who uttered the sentence and when, to complete the meaning of the whole sentence.
In a specific context, the speaker’s intended meaning can also differ from the literal meaning. When someone who wants a favor calls you “handsome”, it is not really because you are handsome; “he is really interesting”, said while everyone is dozing off, means something quite different from the same words said amid applause; murmuring “dry martini” to the bartender with a melancholic look will get you a strong drink, even though it is not a complete sentence.
While debates over rigorous theories of meaning will continue, we have at least reached a rough consensus: reference is part of meaning, so language must stand in some correspondence with the real world; meaning also includes sense, a mental representation standing between reality and language, whose exact form remains unsettled; and ultimately, the meaning of language depends on context.
Understanding the meaning of language means understanding the form of sense and its reference, as well as how both change with context.
Language Models Cannot Fully Understand Language
To understand language, we need:
  1. To understand the symbols and their combination rules within the language;

  2. To understand the meaning of language, that is, the form of sense and its reference, as well as how both change with context.

However, existing language models are trained on textual corpora. The data they see is a pile of language symbols: text that complies with the rules of the language, produced by humans in some context (at a specific time and place, in a certain mental state, drawing on certain knowledge, having made certain inferences, and so on). If we treat the meaning and context that a human intends to express as the input, and the text as the output, then language is a function that maps meaning and context to rule-compliant text. A language model cannot infer the form of this function, that is, the correspondence between input and output, from the function’s possible outputs alone; the schematic below restates the point. Therefore, by this article’s definition, language models cannot fully understand language.
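Schematically (in notation of our own choosing, restating the paragraph above rather than quoting it):

```latex
% Language as a function from (meaning, context) pairs to rule-compliant text:
\[
  f : M \times C \to T
\]
% A training corpus is a sample from the image of f, i.e. a bag of outputs
% t in T. Without ever observing the paired inputs (m, c), the mapping f
% is underdetermined: many different functions share the same set of outputs.
```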

What Language Models Have Learned
Language models must have learned a considerable degree of linguistic knowledge; otherwise, they could not be so effective.
Unlike a corpus generated by monkeys randomly hitting keys, the corpus used to train language models is not random: the distribution of its symbols is constrained by the symbols and combination rules of the language, as well as by the meanings and contexts those symbols can express.
Context is related to the meaning being expressed; this is the insight of Wittgenstein’s use theory of meaning. Since context is part of the situation, “words that often appear in similar contexts have similar meanings”; this is Firth’s distributional hypothesis. So although we cannot recover the language function itself, these insights let us infer semantic similarity between language symbols. Besides reference, other properties are reflected in the meanings of symbols and rules; such properties partially determine semantic relationships, and semantic similarity is one coarse description of those relationships.
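As a minimal, toy sketch of the distributional idea (our illustration, not the article’s method; the four-sentence corpus and the window size are arbitrary assumptions), one can build co-occurrence vectors and compare them with cosine similarity:

```python
from collections import Counter
import math

# Toy corpus, purely for illustration; real corpora have billions of tokens.
corpus = [
    "i drink hot coffee every morning".split(),
    "i drink cold coffee every evening".split(),
    "i drink hot tea every morning".split(),
    "i drink cold tea every evening".split(),
]

def context_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` positions."""
    vecs = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            ctx = vecs.setdefault(word, Counter())
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    ctx[sent[j]] += 1
    return vecs

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1 if k in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vecs = context_vectors(corpus)
# "coffee" and "tea" occur in near-identical contexts, so their
# distributional similarity is high:
print(cosine(vecs["coffee"], vecs["tea"]))      # 1.0 on this toy corpus
print(cosine(vecs["coffee"], vecs["morning"]))  # noticeably lower
```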
Context consists of language symbols that comply with the combination rules. By observing which symbols can appear in a context and how they combine, the model can apparently guess at the symbols of the language and their combination rules; and with semantic similarity characterized by the distribution of contexts, it can apparently also guess at the semantic similarity between combination rules.
The semantic similarity the model learns seems sufficient for engineering needs: the model serves as an abstraction layer, unifying the many expressions of a similar meaning into a single output. Our supervised learning paradigm, taking sentiment analysis as an example, can be understood as using extra knowledge (namely annotations) to tell the model that “happy” and “joyful” are similar along the sentiment dimension. Language models trained on large amounts of data can thus assist downstream tasks such as sentiment analysis.
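For example, with the Hugging Face transformers library (a sketch; which default English checkpoint the pipeline downloads is a detail outside this article):

```python
from transformers import pipeline

# Downloads a default English sentiment-analysis checkpoint on first use.
classifier = pipeline("sentiment-analysis")

# Different surface forms with similar meaning collapse to the same label:
print(classifier("I am happy with the result."))
print(classifier("The result fills me with joy."))
# Expected shape: [{'label': 'POSITIVE', 'score': ...}] for both inputs.
```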
However, language models cannot completely solve the modeling of semantic relationships.
The distribution of contexts only indicates degrees of similarity and dissimilarity, so the semantic relationships a language model can learn are merely vague similarities. This vagueness has been criticized since the era of word vectors [4]. Semantic relationships, however, are very rich: for words there are synonymy, antonymy, hypernymy, hyponymy, derivation, and more; for sentences there are synonymy, contradiction, entailment, presupposition, and so on. None of these can be told apart by simple “similarity”.
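A toy demonstration of this blind spot (the count vectors below are hand-made assumptions, purely for illustration): an antonym pair can look exactly as “similar” as a near-synonym pair.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical co-occurrence counts over three context patterns:
#   "drink ___ water", "___ weather today", "it feels very ___"
hot  = [10, 8, 6]
cold = [9, 9, 6]   # antonym of "hot", but near-identical distribution
warm = [10, 7, 6]  # near-synonym of "hot"

print(cosine(hot, cold))  # high
print(cosine(hot, warm))  # also high: similarity alone cannot separate
                          # synonymy from antonymy
```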
Moreover, the distribution of contexts in a corpus is only an approximation of the real inputs, meaning and context, and this approximation has three sources of error. First, context is only part of the situation: of the nearly infinite information in a situation, only a small part is recognized and expressed in language. Second, meaning and situation are only correlated: the situation does not determine what a person wants to express. Facing magnificent mountains and rivers, one person may just take a photo and say nothing, another may exclaim at the beauty of nature, and a third may say “let’s have boiled pork slices tonight”. Third, sampling bias in the corpus distorts the distribution of contexts: “philosophy” appears in completely different contexts in a Bilibili corpus than in academic literature, and completely unbiased sampling is practically impossible.
Outlook
No one is perfect, and neither is BERT. From the above analysis of the limitations of language models, we can see at least two directions for improvement.
The first direction is to extract more information from the corpus. From GPT to BERT, and on to XLNet, T5, and ELECTRA, this line of work amounts to ever more refined modeling of the pair (“context”, “word in context”).
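Schematically, the two ends of this line differ in how training pairs are constructed (a simplified sketch; real systems add subword tokenization, special tokens, span masking, replaced-token detection, and so on):

```python
import random

tokens = ["the", "model", "predicts", "missing", "words"]

# GPT-style causal LM: predict each token from its left context only.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# BERT-style masked LM: hide a fraction of tokens and predict them
# from the context on both sides.
def mask_tokens(tokens, rate=0.15):
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            masked[i] = "[MASK]"
            targets[i] = tok
    return masked, targets

masked, targets = mask_tokens(tokens)
print(causal_pairs[0])   # (['the'], 'model')
print(masked, targets)   # e.g. ['the', '[MASK]', ...] {1: 'model'}
```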
The second direction is to fill in the missing inputs. Because meaning and context, the inputs of the language function, are absent from training, inferring semantic relationships suffers significant bias. Two direct approaches suggest themselves: use knowledge graphs to directly supply rich semantic relationships between language symbols, and use multimodal learning to enrich the meanings and contexts that text alone approximates poorly. Both are directions that many NLP researchers (including Zhuiyi Technology) have begun or are already laying out.
Given how limited the training signal is, pre-trained language models have achieved remarkable results; in task-specific research, researchers have almost reached the point of turning pale at the mention of Sesame Street (ELMo, BERT, and ERNIE are all Sesame Street characters). We have reason to believe that, with the further expansion of pre-training tasks and the exploration of models’ symbolic reasoning capabilities, the gem of language intelligence in the crown of AI will come ever closer.

[1] https://thegradient.pub/nlp-imagenet/

[2] http://faculty.washington.edu/ebender/papers/GeneralPurposeNLU.pdf

[3] Fromkin, V., R. Rodman, and N. Hyams. “An Introduction to Language.” Cengage Learning, 2013.

[4] Lycan, William G. “Philosophy of Language: A Contemporary Introduction.” Routledge, 2018.

[5] Saeed, J. I. “Semantics.” Wiley, 2011.

[6] Lenci, Alessandro. “Distributional Models of Word Meaning.” Annual Review of Linguistics 4.1 (2018): 151–171.