Originally published by Xi Xiaoyao Technology

Original author | Xie Nian Nian
In interpersonal communication, especially in a language as rich in implicit meaning as Chinese, people often do not answer questions directly but instead rely on implicit, oblique, or indirect expressions.
Humans can accurately interpret some implied meanings based on past experiences or understanding of the speaker, as illustrated by common childhood dialogues:
“Mom, where’s my book?”
“It’s in my hand, come get it!”
Or:
“Mom, can I have braised pork today?”
“Do I look like braised pork?”
Faced with a reply from mom that seems to answer the question yet says nothing, we immediately grasp that she is in no mood to indulge us. Can LLMs likewise understand the speaker’s true meaning when confronted with such conversational implicatures?
Recently, researchers at Shanghai Jiao Tong University built the first Chinese multi-turn dialogue dataset targeting conversational implicature, extracted from the classic sitcom “My Own Swordsman” (武林外传) and comprising 200 carefully constructed questions centered on conversational implicature. They tested eight LLMs on a multiple-choice task and an implicature-explanation task. The results show that conversational implicature remains a challenge for LLMs.
Paper Title: Do Large Language Models Understand Conversational Implicature – A case study with a Chinese sitcom
Paper Link: https://arxiv.org/pdf/2404.19509
Dataset Construction
This paper uses the widely popular Chinese sitcom “My Own Swordsman” as its data source. The show not only contains a wealth of dialogues rich in implied meaning but also features well-written exchanges grounded in naturally occurring scenarios, making it highly suitable for evaluating language models’ ability to understand and infer the deeper meanings of Chinese dialogue.
Dataset Construction Principles
The Cooperative Principle is an important theory in linguistics proposed by the British philosopher Paul Grice in his 1967 William James Lectures, “Logic and Conversation,” delivered at Harvard University. The Cooperative Principle comprises four categories, each with a maxim and, in most cases, several sub-maxims:
- Maxim of Quality: a) Do not say what you believe to be false; b) Do not say that for which you lack adequate evidence.
- Maxim of Quantity: a) Make your contribution as informative as is required; b) Do not make your contribution more informative than is required.
- Maxim of Relevance: Be relevant. For example, when asked, “Is John in the office?” Sam replies, “It’s Saturday, you know.” The reply is not directly related to the question and thus violates the maxim of relevance, implying, “John never works on weekends, so he is not in the office.”
- Maxim of Manner: Be perspicuous. a) Avoid obscurity; b) Avoid ambiguity; c) Be brief (avoid unnecessary prolixity); d) Be orderly.
However, in actual verbal communication people do not always adhere to the Cooperative Principle; when the need arises, they may deliberately flout it. Grice called the implied meaning produced by apparently violating the Cooperative Principle “conversational implicature.” This explains how listeners recover implied meanings from the surface meaning of the speaker’s words, which often gives rise to humor.
This paper selects dialogues targeting conversational implicature based on the above principles to create a multi-turn dialogue dataset in Chinese.
Identification and Classification of Implicature
The three authors selected dialogues containing conversational implicature from the scripts of “My Own Swordsman” by judging whether the conversational maxims were violated. To classify them more finely, they used the sub-maxims as criteria, assessing whether the target sentence met each requirement; if it violated a sub-maxim, it was considered a violation of the corresponding maxim. A dialogue may belong to multiple categories depending on the sub-maxims violated. A sample data entry includes the dialogue, four explanations, and the categories, as shown below:

Next, four explanations were constructed for each dialogue:
- a pragmatic interpretation, which is the correct answer;
- a literal interpretation;
- two context-related distractors.
Based on these explanations, multiple-choice questions were created, and a PhD in linguistics was invited to answer them and discuss the wrong answers and the reasoning process. This validation ensures that the provided pragmatic interpretation aligns with commonsense intuition and can be inferred from the limited context. Necessary information, such as character relationships, personalities, social background, and multimodal cues, was added at the beginning of each dialogue.
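For illustration, one entry could be organized roughly as follows. This is a hypothetical sketch: the field names and values are invented here and do not reflect the dataset’s actual schema.

```python
# Hypothetical sketch of one SwordsmanImp-style entry; field names are
# illustrative and do not reflect the dataset's actual schema.
entry = {
    "context": "Background: character relationships, personalities, setting ...",
    "dialogue": [
        {"speaker": "A", "utterance": "..."},
        {"speaker": "B", "utterance": "..."},  # target sentence carrying the implicature
    ],
    "options": {
        "A": "Pragmatic interpretation (the correct answer)",
        "B": "Literal interpretation",
        "C": "Context-related distractor 1",
        "D": "Context-related distractor 2",
    },
    "label": "A",
    "maxims_violated": ["quantity", "relevance"],  # an entry may belong to several categories
}
```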
Human Scoring
To compare with human performance, ten native speakers were invited to answer 32 questions drawn from the dataset; they achieved an average accuracy of 93.1%. The questionnaire included an equal number of questions for each type of maxim violation.
The final SwordsmanImp corpus contains 200 carefully selected questions, categorized into four types according to the Cooperative Principle, as shown in the table below. Each entry contains multi-turn dialogues and four explanations for the target sentences as options.

Experiment 1: LLM Multiple-Choice Questions
Experiment Setup
In this experiment, the model sees the dialogue together with the four manually written explanations, and its task is to choose the correct explanation of the target sentence carrying the implied meaning.
The authors tested eight models, including both open-source and closed-source models, using zero-shot prompting to simulate real-life scenarios where humans encounter these implied meanings.
For open-source models, the authors followed established LLM-evaluation practice: they computed the logits of the four candidate tokens “A”, “B”, “C”, and “D” at the answer position and took the highest-scoring one as the model’s prediction. For closed-source models, answers were generated as free text and manually checked to determine which explanation was chosen.
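Below is a minimal sketch of this logit-based selection for an open-source model, using Hugging Face Transformers. The model name and prompt template are placeholders rather than the paper’s exact setup.

```python
# Minimal sketch of logit-based multiple-choice evaluation; model name and
# prompt template are placeholders, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CausalLM/14B"  # placeholder; any open-source causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def choose_option(dialogue: str, options: dict) -> str:
    prompt = (
        f"{dialogue}\n"
        + "\n".join(f"{k}. {v}" for k, v in options.items())
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits at the answer position
    # Compare the logits of the four option letters and pick the largest.
    option_ids = {k: tokenizer.encode(k, add_special_tokens=False)[0] for k in options}
    return max(option_ids, key=lambda k: logits[option_ids[k]].item())
```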
Experiment Results
The results are shown in the table below. GPT-4 reached 94% accuracy, comparable to human performance and demonstrating strong capability. It is followed by CausalLM (14B) at 78.5%, which also performs well.
The other models, however, struggled, with accuracies generally between 20% and 60%. Notably, text-davinci-002 did not even reach the random-guessing baseline (25%), indicating that the tested models still have considerable room for improvement in understanding implied meanings.

The table below breaks down each model’s performance by the conversational maxim that is violated:
Overall, no model showed consistent strengths or weaknesses across all maxims, and human responses displayed similar variation.
Among open-source models, CausalLM (14B) achieved accuracy close to human levels, performing best among all open-source models, demonstrating its strong dialogue understanding capability.
GPT-4 stood out among all models, with an accuracy exceeding 90% across all categories of questions, reaffirming its leading position in the NLP field.
The figure below illustrates the distribution of the explanations chosen by each model. Red indicates choosing the correct answer, i.e., the pragmatic explanation; yellow indicates choosing the literal interpretation; and green indicates choosing one of the two distractors.

It can be seen that the two 13B models frequently selected distractors, which may suggest they are more easily influenced by irrelevant information in the context.
Moreover, as GPT models evolve, they are gradually better at distinguishing between literal meanings and implied meanings. Notably, GPT-4 significantly reduced the proportion of literal understanding in explanation choices, further validating the model’s progress in understanding complex linguistic phenomena.
Experiment 2: Evaluating the Quality of LLM-Generated Explanations
The authors designed open-ended questions requiring the model to generate explanations for implied meanings, which were then manually evaluated by native Chinese speakers based on the reasonability, logic, and fluency of the generated explanations. The results are shown in the table below:
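As a rough illustration of how such free-form answers might be collected and scored, here is a minimal sketch; the prompt wording, the 1–5 rating scale, and the scores are assumptions, not the paper’s actual protocol.

```python
# Minimal sketch of a free-generation evaluation flow; prompt wording, rating
# scale, and scores below are assumptions for illustration only.
from statistics import mean, stdev

def build_prompt(dialogue: str, target_sentence: str) -> str:
    # Ask the model to explain the implied meaning of the target sentence.
    return (
        f"{dialogue}\n"
        f'Question: What does the speaker really mean by "{target_sentence}"?\n'
        "Explain the implied meaning in one or two sentences."
    )

# Each model-generated explanation is rated by several native speakers on three
# dimensions; the scores below are placeholders, not results from the paper.
ratings = {"reasonability": [4, 5, 4], "logic": [4, 4, 5], "fluency": [5, 5, 4]}

for dimension, scores in ratings.items():
    print(f"{dimension}: mean={mean(scores):.2f}, std={stdev(scores):.2f}")
```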

GPT-4 ranked first in all three dimensions, with the smallest variance in scores, demonstrating stable and excellent performance. Although GPT-3.5-Turbo also scored high, the standard deviation was larger, reflecting some instability in its performance. The scores of the other three models were relatively close, with statistical tests showing no significant differences among them.
However, it is noteworthy that CausalLM (14B) scored lower than GPT-3.5-Turbo, which is inconsistent with the observations in Experiment 1. This finding reveals that the model’s excellent performance on specific tasks (such as selecting answers from four options) does not necessarily guarantee similar excellence in other tasks (such as providing coherent explanations of implied meanings). This further illustrates the performance differences that models may exhibit when handling different tasks.
The figure below presents a typical dialogue example together with the explanations generated by several models.

By analyzing the implied meanings in Xiangyu’s discourse, we can understand that she is actually warning Shitou not to drink anymore, while her words also reveal sarcasm and dissatisfaction towards Shitou.
In this example, although GPT-4 provided a concise explanation similar to the reference explanation, it mistakenly interpreted the sarcastic tone as questioning Shitou’s drinking capacity.
CausalLM (14B) provided a largely correct explanation, but the answer’s quality suffered from poor fluency, including English words and meaningless character sequences such as “NST.” Notably, the English phrase “forgot his place” does convey the correct meaning and can be viewed as code-switching rather than nonsensical output.
Openbuddy-Llama2 (13B) produced a response that was lengthy and irrelevant to the question.
Analysis: How Well Do LLMs Understand Chinese Implicature?
Results from Experiment 1 indicate that GPT-4 demonstrated performance comparable to humans in the benchmark tests set in this paper, while other models lagged at least 15 points behind, including GPT-3.5-turbo.
This suggests that although state-of-the-art LLMs can in principle learn to understand Chinese implicature, for most LLMs it remains a challenging task.
Results from Experiment 2 reveal that a model that performs excellently in multiple-choice questions (like CausalLM-14B) may fail in free-text generation tasks, where it needs to explain implied meanings independently. This finding highlights that relying solely on multiple-choice questions is insufficient to comprehensively assess a language model’s linguistic capabilities. Future research could design more complex methods to better quantify models’ free-form explanations of conversational implicature.
Conclusion
This paper constructs SwordsmanImp, the first fine-grained Chinese dataset for assessing LLMs’ understanding of conversational implicature, and evaluates LLMs through a multiple-choice task and a free-form explanation-generation task. GPT-4 remains the strongest of all the compared models, even reaching human-level performance on the multiple-choice questions.