Introduction
Hello everyone, I am Liu Cong from NLP.
RAG (Retrieval-Augmented Generation) uses a retrieval system to find information fragments relevant to the user's question, then has a large model synthesize an answer from them. This greatly alleviates issues such as hallucination and outdated knowledge in large models, and it has become an important means of putting large models into practical use.
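To make the setup concrete, here is a minimal retrieve-then-generate sketch. The `retriever` and `llm` objects are hypothetical placeholders for whatever retrieval system and model are in use, not anything from the paper:

```python
# Minimal RAG sketch. `retriever` and `llm` are hypothetical placeholders,
# not components from the paper.
def rag_answer(question: str, retriever, llm, k: int = 10) -> str:
    # Retrieve the top-k fragments most similar to the question;
    # note that high similarity does not guarantee relevance.
    fragments = retriever.search(question, top_k=k)
    context = "\n\n".join(f"[{i + 1}] {frag}" for i, frag in enumerate(fragments))
    prompt = (
        "Answer the question based on the passages below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm.generate(prompt)
```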
However, during the retrieval process, fragments that are highly similar to the question but do not contain the answer or contain misleading answers are often retrieved. What impact do these irrelevant fragments have on the answers generated by large models?
Today I happened to come across a related paper, so let me share "How Easily Do Irrelevant Inputs Skew the Responses of Large Language Models?"
Paper: https://arxiv.org/abs/2404.03302
Github: https://github.com/Di-viner/LLM-Robustness-to-Irrelevant-Information

First, let’s present the relevant conclusions, followed by additional details.
- Compared with ordinary, semantically unrelated fragments, LLMs are more easily misled by irrelevant fragments that are highly semantically related to the question;
- As the number of irrelevant fragments increases, LLMs become more distracted and less able to identify the correct information;
- LLMs' ability to recognize irrelevant fragments varies with the question format: free-form Q&A > yes/no Q&A > multiple-choice Q&A;
- Adding instructions such as "ignore irrelevant fragments" to the system prompt improves recognition ability, but only slightly;
- When highly semantically related irrelevant fragments are present, CoT or ICL can cause LLMs to overthink, making their recognition ability worse.
Data & Fragment Construction
Irrelevant fragments are divided into three categories:
- Irrelevant: paragraphs unrelated to the topic of the question but with high similarity scores;
- Partially relevant: paragraphs with high similarity scores that partially overlap with the topic of the question;
- Relevant: paragraphs with high similarity scores that overlap with the topic of the question but do not contain the correct answer.

Data construction:
- Irrelevant: directly take the top 10 paragraphs returned by the retriever;
- Partially relevant: from the top 10 retrieved paragraphs, select one that contains the subject but lacks the object as the first half; then find a fragment containing an incorrect answer (object′) as the second half (see the sketch after this list);
- Relevant: compared with "partially relevant," these fragments are highly semantically related to the question yet do not contain the correct answer, mainly covering misleading associations, common-feature types, and fictional-anecdote types.
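As a rough sketch of the "partially relevant" splicing described above (my reading of the procedure; `retrieve_top_k` and `find_fragment_with` are hypothetical helpers, not the paper's actual code):

```python
# Sketch of the "partially relevant" construction: the first half mentions
# the subject but not the gold object; the second half carries the
# incorrect answer (object'). `retrieve_top_k` and `find_fragment_with`
# are hypothetical helpers.
def build_partially_relevant(question, subject, gold_object, wrong_object,
                             retrieve_top_k, find_fragment_with):
    first_half = next(
        p for p in retrieve_top_k(question, k=10)
        if subject in p and gold_object not in p
    )
    second_half = find_fragment_with(wrong_object)
    return f"{first_half} {second_half}"
```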
Examples of the constructed fragments are shown in the figure below.
Computing similarity scores with the Contriever model shows that the relevant and partially relevant fragments are even more similar to the question than the gold fragments that actually contain the answer, which indicates that the data construction is effective.
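For reference, the standard usage pattern of the `facebook/contriever` checkpoint on Hugging Face (mean pooling over non-padding token embeddings, dot-product scoring) looks like this; the question and fragment strings below are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"]
    # Mean pooling over non-padding tokens, as in the Contriever model card
    out = out.masked_fill(~mask[..., None].bool(), 0.0)
    return out.sum(dim=1) / mask.sum(dim=1)[..., None]

question = "Who wrote the novel 1984?"           # placeholder question
fragments = ["George Orwell wrote 1984.",        # placeholder fragments
             "1984 was a leap year."]
scores = embed([question]) @ embed(fragments).T  # dot-product similarity
```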

Evaluation metrics:
- Misrepresentation Ratio (MR): the proportion of originally correct answers that the LLM changes under the influence of irrelevant information; it measures how prone the LLM is to being misled;
- Uncertainty Ratio (UR): the proportion of answers for which the LLM expresses "uncertainty" under the influence of irrelevant information; it measures the LLM's confidence in the answers it generates after interference.
To facilitate assessment, multiple-choice questions are used to evaluate LLMs, providing “correct answer,” “incorrect answer,” and “uncertain” as options for LLMs to choose from.
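In code, my reading of the two metrics is the sketch below; the paper's exact definitions and denominators may differ:

```python
def mr_and_ur(clean_answers, perturbed_answers, gold_answers):
    """Sketch of MR/UR as I read them: computed over questions the model
    answered correctly without irrelevant fragments. MR = share whose
    answer flips to something wrong; UR = share that becomes "uncertain".
    The paper's exact denominators may differ."""
    idx = [i for i, a in enumerate(clean_answers) if a == gold_answers[i]]
    if not idx:
        return 0.0, 0.0
    mr = sum(perturbed_answers[i] not in (gold_answers[i], "uncertain")
             for i in idx) / len(idx)
    ur = sum(perturbed_answers[i] == "uncertain" for i in idx) / len(idx)
    return mr, ur
```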

Experiments
The paper evaluates how LLMs perform when faced with irrelevant fragments at the three levels of semantic relevance, as shown in the table below. As the relevance of the fragments increases, every model's performance declines, while its confidence in the interfered answers rises. Closed-source models perform significantly better than open-source ones.

PS: Llama2-7B is the only open-source model tested; I feel this should be supplemented~
As the number of irrelevant fragments grows, LLMs become more distracted, as shown in the table below: with more irrelevant fragments in the context, they are more willing to choose the answers those fragments suggest.

The analyses above use the multiple-choice format for ease of evaluation, but how do other question formats perform? As shown in the table below, free-form questions are least affected by irrelevant fragments, followed by yes/no questions, while multiple-choice questions are affected the most.

PS: Because free-form answers are unconstrained, they are scattered and hard to score automatically, so GPT-3.5 was used to align them with the reference answers; a manual spot check of 300 entries showed 97% accuracy, so the alignment is deemed reliable.
Adding an ignore prompt brings a slight improvement, while CoT and ignore prompt + ICL are counterproductive and make results worse.
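For reference, an "ignore" instruction of the kind tested might look like this (illustrative wording, not the paper's exact prompt):

```python
# Illustrative "ignore irrelevant information" system prompt;
# not the paper's exact wording.
IGNORE_PROMPT = (
    "Some of the provided passages may be irrelevant to the question "
    "or contain misleading information. Ignore such passages and answer "
    "using only information that actually addresses the question."
)
```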

Conclusion
An interesting experimental report that explores the side effects retrieved fragments can have on RAG systems.