Example Augmented Natural Language Processing

Reprinted from | RUC AI Box

Author | Gong Zheng

Institution | Renmin University of China

Research Direction | Natural Language Processing

0

Introduction

In recent years, with the development of pre-training technology, models pre-trained on massive text corpora have achieved excellent performance on most natural language processing tasks. However, models with relatively small parameter counts still appear unable to fully store the knowledge in the pre-training corpus within their parameters. Therefore, aside from building ever-larger pre-trained models, another line of research focuses on enhancing models with non-parametric external corpora or knowledge bases, for example by letting the model retrieve documents from a corpus that help with the current task.

The author believes that methods for enhancing models with external knowledge can be roughly divided into two categories according to the type of knowledge used. The first category provides the model with broad-coverage knowledge that usually contains, directly or indirectly, the answer to the current task, such as documents or knowledge graphs in question-answering tasks; the performance of this category therefore depends heavily on the accuracy of the retrieval module, that is, on whether the model can precisely locate the knowledge it needs. The second category provides examples that the model can refer to in certain respects, such as question-answer pairs in question-answering tasks. This article calls such methods example-augmented methods and briefly introduces eight papers, grouped into four sub-categories, together with some personal observations. Corrections for any errors or omissions are welcome.

1

Retrieve and Edit

A relatively obvious application scenario for using examples to enhance models is in certain generation tasks where answers from examples can be directly copied. In this type of work, the model mainly refers to the text content of the retrieved examples. Below are two related works in this area:

Prototype-to-Style: Dialogue Generation with Style-Aware Editing on Retrieval Memory, IEEE/ACM Transactions on Audio, Speech, and Language Processing 2021

This article comes from the University of Cambridge, Tencent AI LAB, and the Chinese University of Hong Kong. The paper mainly focuses on how to generate dialogues in a specific style, with the general idea being to use the retrieved dialogue content as a template, and then have the model edit the content according to the specified style, as shown in the figure below:

[Figure: overview of prototype-to-style dialogue generation]

Specifically, the method introduced in this article mainly includes three modules: Prototype Extraction, Stylistic Response Generator, and Learning.

Prototype Extraction: This article obtains a response template by masking style-related words in the retrieved response. To decide which words are style-related, the authors compute the pointwise mutual information (PMI) between every word appearing in the training set and each style:

PMI(x, s) = log( p(x, s) / ( p(x) p(s) ) )

where p(x, s) is the frequency with which word x appears in responses of style s. When PMI(x, s) exceeds a certain threshold, x is considered a style-related word.
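To make the selection criterion concrete, here is a minimal sketch of PMI-based style-word extraction from (response, style) pairs; the whitespace tokenization, the threshold value, and the function name are illustrative rather than taken from the paper.

```python
import math
from collections import Counter

def find_style_words(corpus, threshold=1.0):
    """Estimate PMI(x, s) = log( p(x, s) / (p(x) p(s)) ) over (response, style) pairs
    and return the words whose PMI with some style exceeds the threshold."""
    word_counts, style_counts, joint_counts = Counter(), Counter(), Counter()
    total = 0
    for response, style in corpus:
        for word in response.split():
            word_counts[word] += 1
            style_counts[style] += 1
            joint_counts[(word, style)] += 1
            total += 1
    style_words = set()
    for (word, style), joint in joint_counts.items():
        pmi = math.log((joint / total) /
                       ((word_counts[word] / total) * (style_counts[style] / total)))
        if pmi > threshold:
            style_words.add(word)
    return style_words
```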
Stylistic Response Generator: This article uses GPT-2 as the generation model, concatenating the query, the response prototype, and the reference response with delimiters as input and adding corresponding segment embeddings to distinguish them. To learn to generate language in a specific style, the authors additionally learn a style embedding for each style, which is added to the generated-response part. The overall framework is shown in the figure below:
[Figure: framework of the stylistic response generator]
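A minimal sketch of how such an input might be assembled on top of the HuggingFace GPT-2 implementation is given below; the class name, the three-segment layout, and the way the style embedding is added only to the response positions are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn as nn
from transformers import GPT2LMHeadModel

class StylisticResponseGenerator(nn.Module):
    def __init__(self, num_styles, model_name="gpt2"):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained(model_name)
        hidden = self.gpt2.config.n_embd
        self.segment_emb = nn.Embedding(3, hidden)      # query / prototype / response segments
        self.style_emb = nn.Embedding(num_styles, hidden)

    def forward(self, input_ids, segment_ids, style_id, response_mask, labels=None):
        # Token embeddings from GPT-2, plus a segment embedding at every position and a
        # style embedding added only to the response positions (response_mask is 0/1).
        tok = self.gpt2.transformer.wte(input_ids)
        seg = self.segment_emb(segment_ids)
        sty = self.style_emb(style_id).unsqueeze(1) * response_mask.float().unsqueeze(-1)
        return self.gpt2(inputs_embeds=tok + seg + sty, labels=labels)
```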
Learning: The training procedure does not involve the retrieval module; instead, the response prototypes are obtained by applying style-word masking, random masking, and random replacement as denoising operations directly to the reference responses in the training samples. The advantage of this approach is that it keeps the templates relevant to the references during training while preventing the model from simply copying them verbatim. Finally, in the training loss, the authors increase the loss weight on style words and introduce the language-modeling loss on the query part as an auxiliary loss.
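The denoising side of training can be pictured with a short sketch like the following, where the masking ratios and the [MASK] placeholder are illustrative choices rather than the paper's exact settings.

```python
import random

MASK = "[MASK]"

def make_prototype(reference_tokens, style_words, vocab,
                   random_mask_ratio=0.1, random_replace_ratio=0.1):
    """Build a training prototype from a reference response by (1) masking style words,
    (2) masking random tokens, and (3) randomly replacing tokens."""
    prototype = []
    for token in reference_tokens:
        if token in style_words:                      # style-word mask
            prototype.append(MASK)
        elif random.random() < random_mask_ratio:     # random mask
            prototype.append(MASK)
        elif random.random() < random_replace_ratio:  # random replace
            prototype.append(random.choice(vocab))
        else:
            prototype.append(token)
    return prototype
```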
Neural Machine Translation with Monolingual Translation Memory, ACL 2021
This article comes from the Chinese University of Hong Kong and Tencent AI LAB and is an outstanding paper at ACL 2021.
Another application scenario where examples can serve as templates is machine translation. By retrieving source sentences similar to the current input, the model can draw heavily on the target-side sentences of the retrieved examples when generating the translation of the current input. However, this approach typically requires a large amount of aligned parallel data as the retrieval library. This article tackles that pain point by using a monolingual corpus as the retrieval library to enhance the model. The model framework is shown below:
[Figure: framework of NMT with monolingual translation memory]
Retrieval Model: This article sets the retrieval corpus to be a monolingual corpus in the target language. The authors first use a Dual-Encoder to encode the input sentence and the sentences in the retrieval library separately, compute similarity via dot product, and then use FAISS to retrieve the M examples most relevant to the input sentence. To train the Retrieval Model jointly with the subsequent Translation Model, the similarities of these M retrieved examples are used further in the translation process. The retrieval model has a cold-start problem: a randomly initialized retrieval module may return examples completely unrelated to the input, which could teach the whole model to ignore the retrieved examples when translating. The article therefore also designs two alignment tasks, one at the sentence level and one at the token level, to help initialize the retrieval module.
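The dot-product retrieval step itself can be sketched with FAISS as below; producing the sentence embeddings (the dual encoder) and the alignment pre-training are out of scope here, so the embedding matrices are simply assumed as inputs.

```python
import numpy as np
import faiss

def build_memory_index(memory_embeddings):
    """Index the target-language (monolingual) sentences for dot-product search."""
    index = faiss.IndexFlatIP(memory_embeddings.shape[1])   # inner-product similarity
    index.add(memory_embeddings.astype(np.float32))
    return index

def retrieve(index, query_embedding, m=5):
    """Return the ids and similarity scores of the M most relevant memory sentences.
    The scores are kept so the translation model can later weight the retrieved examples."""
    scores, ids = index.search(query_embedding.astype(np.float32).reshape(1, -1), m)
    return ids[0], scores[0]
```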
Translation Model: The translation module uses both the input sentence and the retrieved examples during generation. Specifically, as in a standard Seq2Seq model, the Encoder first encodes the input sentence, and the Decoder then produces the representation of the token to be predicted autoregressively. On top of this, the Encoder also encodes each retrieved example; the resulting representations are concatenated and fed to the Decoder for cross-attention, yielding an attention score for each token of each example. These attention scores are further weighted by the similarities obtained during retrieval. Finally, the probability distribution of the predicted token is a weighted combination of two components: the probability produced autoregressively by the model, and the probability of directly copying tokens from the retrieved examples, which is determined by the attention scores.
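The way the final distribution is assembled can be illustrated with a schematic numpy sketch; the gating scalar, the normalization, and the array shapes below are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def mix_generate_and_copy(gen_probs, copy_attention, memory_token_ids,
                          retrieval_scores, gate):
    """Combine the autoregressive distribution with a copy distribution over retrieved tokens.
    gen_probs: (V,) softmax over the vocabulary from the decoder.
    copy_attention: (M, L) cross-attention over each token of each retrieved example.
    memory_token_ids: (M, L) vocabulary ids of the retrieved target-side tokens.
    retrieval_scores: (M,) similarities from the retrieval model.
    gate: scalar in [0, 1] balancing copying against generating."""
    weighted = copy_attention * retrieval_scores[:, None]   # re-weight attention by retrieval similarity
    weighted = weighted / weighted.sum()                    # normalize into a distribution
    copy_probs = np.zeros_like(gen_probs)
    for m in range(memory_token_ids.shape[0]):
        for l in range(memory_token_ids.shape[1]):
            copy_probs[memory_token_ids[m, l]] += weighted[m, l]
    return gate * copy_probs + (1.0 - gate) * gen_probs
```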

2

Retrieve Rather Than Memorize

An important feature of deep learning models is their good generalization ability, meaning that the knowledge learned from the training set can generalize well to the test set. The following two papers indicate that it may be more efficient to directly retrieve helpful samples rather than relying entirely on training to store all knowledge from the training set within the model parameters.
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them, TACL 2021
This article comes from Facebook and University College London, published in TACL 2021.
Current Open Domain QA (ODQA) systems typically first retrieve from a large-scale document corpus and then perform reading comprehension over the retrieved documents to obtain the final answer, which requires a large amount of space and time to store and search the document corpus. This article first designs a data augmentation method that produces Probably Asked Questions (PAQ), a dataset of 65 million automatically generated QA pairs, and then uses this dataset to explore two ways of addressing the above shortcomings of the ODQA pipeline:
1. Closed-Book QA (CBQA): By fine-tuning the model on a large number of question-answer pairs, the model learns the mapping from questions to answers directly, allowing it to answer questions without relying on external knowledge or corpora.
2. QA-Pair Retriever (RePAQ): By retrieving question-answer pairs instead of documents to help the model answer the current question, this retrieval approach offers advantages in memory, speed, and accuracy over previous retrieval paradigms (a minimal sketch of the idea follows below).
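A minimal sketch of the QA-pair retrieval idea, reduced to nearest-question lookup, is given below; the encoder is a hypothetical stand-in, and the reranking components of the actual RePAQ system are omitted.

```python
import numpy as np

class QAPairRetriever:
    """Answer a question by returning the answer of the most similar stored question.
    `encode` is a hypothetical sentence encoder returning unit-norm vectors."""
    def __init__(self, questions, answers, encode):
        self.answers = answers
        self.encode = encode
        self.question_vecs = np.stack([encode(q) for q in questions])

    def answer(self, query):
        sims = self.question_vecs @ self.encode(query)   # dot product = cosine for unit vectors
        return self.answers[int(np.argmax(sims))]
```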
The model structure diagram involved in this article and some experimental results are as follows:
[Figure: PAQ model structure]
[Table: comparative experimental results on ODQA benchmarks]

The author will not describe the data generation and utilization methods of this paper in detail, but instead focuses on some of the experimental results. In the comparative results, rows 2 and 7 and rows 3 and 8 show that enlarging the corpus benefits both fine-tuning and retrieval, which is intuitive. Comparing rows 3 and 4 reveals that using QA pairs as external knowledge is considerably less effective than using traditional documents. Finally, comparing rows 7 and 8 shows that augmenting the model by retrieving QA pairs clearly outperforms fine-tuning the model on the PAQ dataset, which suggests that even though QA pairs contain less information than documents, storing the knowledge contained in a large number of QA pairs in an external non-parametric module is better than storing it in the model parameters.

Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data, ACL 2022
This article comes from Microsoft and has been accepted to ACL 2022. Its idea is even more direct than the previous one: retrieve similar examples straight from the training set to help the model complete the current task. The method is also simple: (1) use BM25 as the retriever, filtering the current training example itself out of the retrieval results during training; (2) use plain text concatenation to incorporate the retrieved examples.
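A minimal sketch of this retrieve-and-concatenate recipe, using the rank_bm25 package as one possible BM25 implementation, is shown below; the separator strings and the whitespace tokenization are illustrative assumptions.

```python
from rank_bm25 import BM25Okapi

def build_retriever(train_inputs):
    """Index the training inputs with BM25 over whitespace tokens."""
    return BM25Okapi([text.split() for text in train_inputs])

def augment_with_retrieval(bm25, train_inputs, train_targets, query, k=2, self_index=None):
    """Concatenate the top-k retrieved training pairs with the current input.
    During training, `self_index` filters the current example out of its own results."""
    scores = bm25.get_scores(query.split())
    if self_index is not None:
        scores[self_index] = float("-inf")
    top = scores.argsort()[::-1][:k]
    retrieved = [f"{train_inputs[i]} => {train_targets[i]}" for i in top]
    return " [SEP] ".join(retrieved + [query])
```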
[Figure: results on summarization, language modeling, machine translation, and QA]
As shown in the figure above, the article evaluates four tasks, summarization, language modeling, machine translation, and QA, and achieves gains over the original models on all of them. Notably, the experimental datasets are not particularly small, which indicates that even when fine-tuning on downstream tasks, models still struggle to store all of the training set's content in their parameters and exploit it. Enhancing the model by directly retrieving relevant examples from the training set is therefore a viable approach.

3

Case-Based Reasoning

The following two works focus on Knowledge Base Question Answering (KBQA) and aim to learn how similar examples solve problems, which may be a more sensible way of using examples than simply copying and pasting.
Case-Based Reasoning for Natural Language Queries over Knowledge Bases, EMNLP 2021
This article comes from the University of Massachusetts Amherst and Google, published in EMNLP 2021. The paper addresses the supervised KBQA setting, where each query is labeled with a corresponding logic form (which can be seen as a reasoning path), and the answer can be obtained by executing that logic form on the Knowledge Base. The goal is to retrieve similar cases for the current query and use their reasoning paths to generate the reasoning path of the current query and thus the final answer. The overall framework of the model is illustrated in the figure below:
[Figure: overall framework of case-based reasoning for KBQA]
The proposed model specifically includes three parts: Retrieve, Reuse, and Revise:
Retrieve: This paper follows the DPR (Dense Passage Retriever) approach to encode queries, retrieve similar cases, and train the retriever. The authors want the retriever to focus more on the form of the question than on the specific entities it contains. For example, for the question “Who is Xiao Ming’s brother?”, they would rather retrieve “Who is Xiao Hong’s brother?” than “Who is Xiao Ming’s father?”. The retriever therefore masks the entity mentions when encoding the text.
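Entity masking before encoding can be sketched as follows; the placeholder token and the assumption that entity spans are already available (e.g. from entity linking) are illustrative.

```python
def mask_entities(question, entity_spans, mask_token="[BLANK]"):
    """Replace entity mentions with a placeholder so retrieval focuses on question form.
    entity_spans: list of (start, end) character offsets of entity mentions."""
    pieces, prev_end = [], 0
    for start, end in sorted(entity_spans):
        pieces.append(question[prev_end:start])
        pieces.append(mask_token)
        prev_end = end
    pieces.append(question[prev_end:])
    return "".join(pieces)

# mask_entities("Who is Xiao Ming's brother?", [(7, 16)]) -> "Who is [BLANK]'s brother?"
```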
Reuse: This paper adopts a generative approach to produce the logic form of the current query. Specifically, the authors concatenate all retrieved query-logic form pairs and feed them to a Seq2Seq model, which generates the logic form of the current query autoregressively. Since the concatenated input can be very long, the authors use the BigBird model as the generator.
Revise: Since some generated relations may not actually exist in the Knowledge Base, the authors further revise the generated logic form. They first encode the Knowledge Base with a pre-trained TransE model and then replace relations in the generated result that do not exist in the Knowledge Base with the most similar relations that are present.
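The revision step can be pictured with the sketch below, which assumes an embedding lookup covering every relation name the generator can emit and uses cosine similarity to pick the replacement; both are simplifying assumptions rather than the paper's exact procedure.

```python
import numpy as np

def revise_relations(generated_relations, kb_relations, relation_embeddings):
    """Replace generated relations that are absent from the KB with the most similar KB
    relation, measured by cosine similarity of pre-trained (e.g. TransE) embeddings."""
    kb_set = set(kb_relations)
    kb_matrix = np.stack([relation_embeddings[r] for r in kb_relations])
    kb_matrix = kb_matrix / np.linalg.norm(kb_matrix, axis=1, keepdims=True)
    revised = []
    for rel in generated_relations:
        if rel in kb_set:
            revised.append(rel)                       # already present in the KB, keep it
        else:
            vec = relation_embeddings[rel]
            vec = vec / np.linalg.norm(vec)
            revised.append(kb_relations[int(np.argmax(kb_matrix @ vec))])
    return revised
```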
Knowledge Base Question Answering by Case-based Reasoning over Subgraphs, Arxiv 2022
This article comes from the University of Massachusetts Amherst and Google, and is authored by the same first author as the previous paper.
KBQA problems often arise in a weakly supervised setting, where only the questions and answers are available and the specific reasoning paths from the Knowledge Base to the answers cannot be obtained. Building on the previous paper, this work addresses the weakly supervised KBQA problem, mainly replacing the Reuse and Revise parts with subgraph collection and reasoning on graphs, as illustrated in the figure:
[Figure: case-based reasoning over subgraphs]
Subgraph Collection: In the weakly supervised setting, a retrieved case provides only its question and answer. This paper extracts all entities from the question and the answer and retrieves from the Knowledge Base all paths linking question entities to answer entities, yielding a subgraph that can be seen as a less precise reasoning path. The entities in the input query are then expanded over the Knowledge Base using the relations appearing in the subgraphs of the retrieved examples, producing a KB subgraph for the input query.
Reasoning on Graphs: This paper encodes the subgraphs with a GNN, positing that the representations of answer entities in different subgraphs should be close to one another, and reinforces this assumption during training, i.e., pulling answer entities closer together while pushing answer entities away from other entities. During inference, the authors encode the subgraphs of each retrieved example and of the input query with the GNN, and select the entity in the input query's subgraph that is most similar to the answer entities of the retrieved subgraphs as the final answer.
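The inference step of picking the answer can be sketched as below; the GNN that produces the entity embeddings is out of scope, and averaging similarities over unit-normalized vectors is one simple way to aggregate across the retrieved cases.

```python
import numpy as np

def select_answer(query_entity_vecs, query_entity_names, retrieved_answer_vecs):
    """Pick the entity in the query subgraph whose embedding is most similar, on average,
    to the answer-entity embeddings from the retrieved cases' subgraphs.
    All vectors are assumed to be unit-normalized GNN outputs."""
    sims = query_entity_vecs @ retrieved_answer_vecs.T   # (num_query_entities, num_answers)
    return query_entity_names[int(np.argmax(sims.mean(axis=1)))]
```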

4

Chain Of Thought

Another type of example-based method is in-context learning, which feeds the model a few question-answer pairs as a prompt so that it “learns” to answer the current question. This line of work places high demands on the model, and is generally only applicable to large-scale pre-trained language models, but low demands on the examples, since the model mainly follows the provided examples to generate answers rather than extracting additional knowledge from them.
Chain of Thought Prompting Elicits Reasoning in Large Language Models, Arxiv 2022
This article comes from Google. The paper proposes adding an explanation to each example in the standard prompt, so that the model imitates the reasoning process shown in the examples and produces its own reasoning process for the current query, making the results of in-context learning both more interpretable and more effective. This approach is called the Chain Of Thought prompt, with examples illustrated in the figure below:
[Figure: standard prompting vs. chain-of-thought prompting examples]
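The difference between the two prompt formats can also be shown with a small prompt-building sketch; the exemplar text below is made up for illustration and is not taken from the paper.

```python
def build_prompt(exemplars, query, chain_of_thought=True):
    """Build an in-context prompt; with chain_of_thought=True the reasoning chain is inserted
    before each exemplar answer, otherwise only question-answer pairs are used (standard)."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            parts.append(f"Q: {ex['question']}\nA: {ex['rationale']} The answer is {ex['answer']}.")
        else:
            parts.append(f"Q: {ex['question']}\nA: The answer is {ex['answer']}.")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

exemplars = [{
    "question": "There are 3 cars and each car has 4 wheels. How many wheels are there?",
    "rationale": "Each of the 3 cars has 4 wheels, so there are 3 * 4 = 12 wheels.",
    "answer": "12",
}]
print(build_prompt(exemplars, "A farm has 5 hens and each hen lays 2 eggs. How many eggs is that?"))
```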
This article conducted experiments on three types of tasks: Arithmetic, Symbolic, and Commonsense Reasoning, finding that compared to the original Standard Prompt method, Chain Of Thought Prompt can better leverage the capabilities of large-scale language models. The experimental results in the arithmetic section are shown in the figure below:
[Figure: arithmetic reasoning results across model scales]
It can be seen that on the relatively simple tasks shown on the left, the performance of both methods improves steadily as the number of model parameters grows. On the more difficult datasets shown on the right, however, Chain Of Thought continues to leverage the capabilities of the larger models far better than Standard Prompting.
The Unreliability of Explanations in Few-Shot In-Context Learning, Arxiv 2022
This article comes from the University of Texas at Austin. The paper runs the Chain Of Thought method on three datasets covering QA and NLI tasks and finds that Chain Of Thought does not always bring gains on these datasets and is not necessarily better than the Standard Prompt method. The experimental results are shown in the figure below, where E-P and P-E denote placing the explanation before or after the answer, respectively, in the Chain Of Thought prompt.
[Figure: results of standard prompting, E-P, and P-E across datasets]
The paper further analyzes the explanations generated by GPT-3 and evaluates them from two perspectives: (1) correctness, whether the generated explanation contradicts conditions given in the context; (2) consistency, whether the generated explanation actually leads to the answer the model gives. The results show that GPT-3's explanations have good consistency but poor correctness, and that for questions the model gets wrong it generally generates incorrect explanations.
[Figure: analysis of the correctness and consistency of GPT-3's explanations]
Based on this finding, the paper proposes rule-based methods to estimate the correctness of the model-generated explanations and designs several calibration methods around explanations estimated to be incorrect, for example generating multiple explanation-answer pairs and taking the first one whose explanation is estimated to be correct as the final prediction. These calibration methods allow Chain Of Thought to outperform standard prompts.
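That particular calibration strategy, selecting the first sampled answer whose explanation is estimated to be correct, can be sketched as follows; `generate` and `explanation_is_correct` are hypothetical stand-ins for the language model sampling call and the rule-based correctness estimator.

```python
def calibrated_answer(prompt, generate, explanation_is_correct, num_samples=5):
    """Sample several (explanation, answer) pairs and return the first answer whose
    explanation is estimated to be correct; fall back to the first sample otherwise."""
    samples = [generate(prompt) for _ in range(num_samples)]   # each sample: (explanation, answer)
    for explanation, answer in samples:
        if explanation_is_correct(explanation):
            return answer
    return samples[0][1]
```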
Summary: This article briefly introduces eight papers on the theme of example augmentation. Except for the last section, which does not involve retrieval, the first three sections fall under the broader category of retrieval augmentation. The first and third sections use examples in a relatively direct way, explicitly borrowing the referable parts of similar examples (their content and their solution methods, respectively) to help with the current task. The third section's use of reasoning paths is, in the author's view, particularly enlightening: compared with knowledge-rich documents, examples contain less knowledge, but their advantage, apart from being more lightweight, is that they provide concrete, referable problem-solving patterns, which resembles the chain of thought in the fourth section. At present, the abstraction of the reasoning process in the third section is still limited to structured reasoning paths, whereas the fourth section uses explanations in natural-language form; however, the fourth section's way of exploiting them is less direct than the third's and relies more on the strong capabilities of large-scale pre-trained language models. Future work may combine the strengths of these two lines.
