Retrieval-Augmented Generation (RAG) and fine-tuning are two common methods for enhancing the performance of large language models. So which method is better? Which is more efficient when building applications for a specific domain? This paper from Microsoft offers a useful reference for making that choice.
When building large language model applications, there are typically two common ways to incorporate proprietary and domain-specific data: Retrieval-Augmented Generation and fine-tuning. RAG enhances the prompt with external data, while fine-tuning incorporates the additional knowledge into the model itself. However, the trade-offs between the two approaches are still not well understood.
In this article, researchers from Microsoft focus on creating AI assistants for industries that require specific context and adaptive responses, with agriculture as the case study. The paper proposes a comprehensive pipeline that uses large language models to generate high-quality, industry-specific questions and answers. The method follows a systematic process: identifying and collecting relevant documents covering a wide range of agricultural topics, cleaning and structuring these documents, and then using base GPT models to generate meaningful Q&A pairs. The generated Q&A pairs are subsequently evaluated and filtered by quality.
The goal of this article is to create valuable knowledge resources for specific industries, using agriculture as a case study, with the ultimate aim of contributing to the development of LLMs in the agricultural sector.
- Paper link: https://arxiv.org/pdf/2401.08406.pdf
- Paper title: RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
The proposed process aims to generate domain-specific questions and answers that meet the needs of professionals and stakeholders in a particular industry, where the answers expected from AI assistants should be based on relevant industry-specific factors.
Since the case study targets agriculture, the starting point of the research is an agricultural dataset, which feeds into three main components: Q&A generation, Retrieval-Augmented Generation, and the fine-tuning process. Q&A generation creates question/answer pairs from the information in the agricultural dataset, while Retrieval-Augmented Generation uses the dataset as its knowledge source. The generated data is then refined and used to fine-tune several models, with quality assessed through a set of proposed metrics. Through this comprehensive approach, the power of large language models is leveraged to benefit the agricultural industry and its stakeholders.
This article makes specific contributions to the understanding of large language models in the agricultural sector, which can be summarized as follows:
1. Comprehensive Evaluation of LLMs: This article provides an extensive evaluation of large language models, including Llama2-13B, GPT-4, and Vicuna, on agriculture-related questions, using benchmark datasets from major agricultural producing countries. In this analysis, GPT-4 consistently outperformed the other models, though the costs associated with its fine-tuning and inference also need to be considered.
2. Impact of Retrieval Techniques and Fine-tuning on Performance: This article studies the impact of retrieval techniques and fine-tuning on the performance of LLMs. The research finds that both Retrieval-Augmented Generation and Fine-tuning are effective techniques for improving LLM performance.
3. Potential Applications of LLMs in Different Industries: For those looking to establish pipelines that apply RAG and fine-tuning to LLMs, this paper takes a pioneering step and promotes innovation and collaboration across multiple industries.
Methodology
Section 2 of this article details the methodology adopted, including the data acquisition process, information extraction process, question and answer generation, and model fine-tuning. The methodology revolves around a process aimed at generating and evaluating Q&A pairs for building domain-specific assistants, as illustrated in Figure 1 below.
The process begins with data acquisition, which includes obtaining data from various high-quality repositories, such as government agencies, scientific knowledge databases, and proprietary data when necessary.
After completing data acquisition, the process continues with information extraction from the collected documents. This step is crucial as it involves parsing complex and unstructured PDF files to recover their content and structure. Figure 2 below shows an example of a PDF file from the dataset.
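The paper does not prescribe a specific parsing library here, but a minimal sketch of this extraction step might look as follows, assuming the open-source pypdf library and a hypothetical folder of agricultural PDFs:

```python
# Minimal sketch of the information-extraction step: pull raw text out of a
# collection of PDFs so it can be fed to the Q&A generation stage.
# Assumes the open-source pypdf library; the paper's own tooling may differ.
from pathlib import Path
from pypdf import PdfReader

def extract_text_from_pdfs(pdf_dir: str) -> dict[str, str]:
    """Return a mapping from file name to the concatenated text of its pages."""
    extracted = {}
    for pdf_path in Path(pdf_dir).glob("*.pdf"):
        reader = PdfReader(str(pdf_path))
        pages = [page.extract_text() or "" for page in reader.pages]
        extracted[pdf_path.name] = "\n".join(pages)
    return extracted

if __name__ == "__main__":
    documents = extract_text_from_pdfs("agriculture_pdfs")  # hypothetical folder
    print(f"Extracted text from {len(documents)} documents")
```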
The next component of the process is question and answer generation. The goal here is to generate high-quality, context-grounded questions that accurately reflect the content of the extracted text. The method employs a framework that controls the structural composition of the inputs and outputs, thereby improving the overall quality of the language model's responses.
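As an illustration of what structured Q&A generation can look like, the sketch below asks a GPT model to emit question/answer pairs as JSON via the OpenAI Python SDK; the prompt wording, model name, and output schema are assumptions for illustration, not the paper's exact framework:

```python
# Sketch of structured Q&A generation: ask a GPT model to return question/answer
# pairs as JSON so the output can be parsed reliably. Prompt text, model name,
# and schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(passage: str, n_pairs: int = 3) -> list[dict]:
    prompt = (
        "You are building a Q&A dataset for agronomists.\n"
        f"From the passage below, write {n_pairs} question/answer pairs that can be "
        "answered from the passage alone.\n"
        'Return only JSON of the form {"pairs": [{"question": "...", "answer": "..."}]}.\n\n'
        f"Passage:\n{passage}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    # Assumes the model followed the JSON instruction; real code should validate.
    return json.loads(response.choices[0].message.content)["pairs"]
```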
Subsequently, the process generates answers to the formulated questions. The method used here leverages Retrieval-Augmented Generation, combining the abilities of retrieval and generation mechanisms to create high-quality answers.
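A minimal sketch of the retrieval half of this idea is shown below: rank document chunks against the question and prepend the most similar ones to the prompt. TF-IDF similarity stands in for the dense embeddings and vector index a production RAG system would use, so treat it as an illustration rather than the paper's implementation:

```python
# Sketch of the retrieval step in RAG: rank document chunks against the question
# and prepend the best ones to the prompt before asking the LLM to answer.
# TF-IDF stands in for the dense embeddings a production system would use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Crop rotation reduces soil-borne disease pressure.",            # illustrative
    "Washington State requires licensing for pesticide applicators.",
    "Cover crops limit soil erosion during the off-season.",
]
question = "How can a grower reduce soil erosion between seasons?"

vectorizer = TfidfVectorizer().fit(chunks + [question])
chunk_vecs = vectorizer.transform(chunks)
question_vec = vectorizer.transform([question])

# Take the top-k most similar chunks as grounding context.
k = 2
scores = cosine_similarity(question_vec, chunk_vecs)[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:k]]

prompt = (
    "Answer the question using only the context below.\n"
    "Context:\n- " + "\n- ".join(top_chunks) + f"\n\nQuestion: {question}"
)
print(prompt)  # this prompt would then be sent to the LLM
```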
Finally, the process fine-tunes the models through Q&A pairs. The optimization process employs methods like Low-Rank Adaptation (LoRA) to ensure a comprehensive understanding of the content and context of scientific literature, making it a valuable resource for various fields or industries.
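As a rough illustration, LoRA fine-tuning with the Hugging Face peft library might be configured as below; the base checkpoint, rank, and target modules are illustrative assumptions rather than the paper's exact settings:

```python
# Sketch of LoRA fine-tuning with Hugging Face peft: only small low-rank adapter
# matrices are trained, keeping the 13B base model frozen. Hyperparameters and
# target modules below are illustrative, not the paper's exact settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "meta-llama/Llama-2-13b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 13B parameters

# The adapted model would then be trained on the generated Q&A pairs with a
# standard causal-language-modeling objective (e.g. via transformers.Trainer).
```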
Dataset
The study evaluates language models that have been fine-tuned and enhanced through retrieval, using context-relevant question and answer datasets sourced from three major crop-producing countries: the United States, Brazil, and India. In this case study, agriculture serves as the industrial context. The available data varies widely in format and content, covering various types such as regulatory documents, scientific reports, agronomy exams, and knowledge databases.
Information was collected from publicly available online documents, manuals, and reports issued by the USDA and by state departments of agriculture and consumer services.
The available documents included federal regulations and policy information regarding crop and livestock management, diseases and best practices, quality assurance and export regulations, details of assistance programs, and insurance and pricing guidelines. The collected data totaled over 23,000 PDF files, containing over 50 million tokens, covering 44 states in the U.S. Researchers downloaded and preprocessed these files, extracting text information that could be used as input for the Q&A generation process.
To benchmark and evaluate the models, this article used documents related to Washington State, comprising 573 files containing over 2 million tokens. An example of the content from these files is shown in Listing 5 below.
Metrics
The main purpose of this section is to establish a comprehensive set of metrics aimed at guiding the quality assessment of the Q&A generation process, particularly the evaluation of fine-tuning and retrieval-augmented generation methods.
When developing the metrics, several key factors must be considered. First, the inherent subjectivity of question quality poses significant challenges.
Second, the metrics must account for the relevance of questions and their practical dependence on context.
Third, the diversity and novelty of the generated questions need to be assessed. A robust question generation system should be able to produce a wide range of questions covering various aspects of the given content. However, quantifying diversity and novelty can be challenging, as it involves assessing the uniqueness of the questions and their similarity to the content and other generated questions.
Finally, good questions should be answerable based on the provided content. Assessing whether questions can be accurately answered using existing information requires a deep understanding of the content and the ability to identify relevant information to answer the questions.
Metrics for evaluating answers play an indispensable role in ensuring that the responses provided by the model are accurate, relevant, and effectively address the questions. However, there is a significant gap when it comes to metrics specifically designed to evaluate question quality.
Recognizing this gap, this article focuses on developing metrics aimed at assessing question quality. Given the critical role of questions in driving meaningful dialogue and generating useful answers, ensuring question quality is as important as ensuring answer quality.
The metrics developed in this article aim to fill the gaps in previous research in this area, providing a means for comprehensive assessment of question quality, which will significantly impact the progress of the Q&A generation process.
Question Evaluation
The metrics developed for evaluating questions are as follows:
- Relevance
- Global Relevance
- Coverage
- Overlap
- Diversity
- Detail
- Fluency
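The paper defines these metrics with its own prompts and formulas. Purely as an illustration of how such a metric could be quantified, the sketch below computes a simple embedding-based proxy for Diversity (average pairwise cosine distance between questions); this proxy is an assumption, not the paper's definition:

```python
# Illustrative proxy for the Diversity metric: average pairwise cosine distance
# between question embeddings (TF-IDF here as a stand-in). This is a generic
# sketch, not the paper's exact formulation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def diversity_score(questions: list[str]) -> float:
    """Higher values mean the questions are less similar to one another."""
    vecs = TfidfVectorizer().fit_transform(questions)
    sims = cosine_similarity(vecs)
    n = len(questions)
    # Average similarity over distinct pairs, then convert to a distance.
    pair_sims = sims[np.triu_indices(n, k=1)]
    return float(1.0 - pair_sims.mean())

questions = [
    "What cover crops reduce erosion in Washington State?",
    "How does crop rotation affect soil-borne disease?",
    "Which cover crops help prevent soil erosion?",
]
print(round(diversity_score(questions), 3))
```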
Answer Evaluation
Since large language models tend to generate long, detailed, and informative conversational responses, evaluating the answers they generate is challenging.
This article uses AzureML model evaluation, comparing the generated answers against the ground truth with the following metrics:
- Consistency: the degree of agreement between the ground truth and the prediction within the given context.
- Relevance: how effectively the answer addresses the question within the given context.
- Truthfulness: whether the answer logically follows from the information contained in the context, expressed as an integer score.
Model Evaluation
To evaluate the performance of the different fine-tuned models, this article uses GPT-4 as the evaluator. Approximately 270 question/answer pairs were generated from agricultural documents with GPT-4 and used as the ground-truth dataset. Answers to these questions were then generated with each fine-tuned model and each retrieval-augmented generation model.
This article evaluates LLMs using several different metrics:
- Guided Evaluation: For each ground-truth Q&A pair, this article prompts GPT-4 to produce an evaluation guide listing the content a correct answer should include. GPT-4 is then prompted to score each answer against the criteria in that guide, from 0 to 1. A sketch of this setup is shown after this list.
- Conciseness: A rubric describing what concise and verbose answers look like was created. Given this rubric, the ground-truth answer, and the LLM answer, GPT-4 is asked to assign a score from 1 to 5.
- Correctness: A rubric describing what complete, partially correct, and incorrect answers should contain was created. Given this rubric, the ground-truth answer, and the LLM answer, GPT-4 is asked to label each answer as correct, incorrect, or partially correct.
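A minimal sketch of this guided, GPT-4-as-judge evaluation is shown below; the prompts and the evaluator model name are illustrative assumptions rather than the paper's exact templates:

```python
# Sketch of the guided-evaluation idea: first ask GPT-4 for a checklist of points
# a correct answer must contain, then ask it to score a candidate answer against
# that checklist. Prompts and the model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed evaluator model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def guided_evaluation(question: str, ground_truth: str, candidate: str) -> str:
    # Step 1: derive an evaluation guide from the ground-truth answer.
    guideline = ask(
        "List the key points a correct answer to the question below must contain, "
        f"based on this reference answer.\nQuestion: {question}\nReference: {ground_truth}"
    )
    # Step 2: score the candidate answer against that guide, from 0 to 1.
    return ask(
        "Score the candidate answer between 0 and 1 according to how many of the "
        f"criteria it satisfies.\nCriteria:\n{guideline}\n"
        f"Question: {question}\nCandidate answer: {candidate}\nReturn only the score."
    )
```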
Experiments
The experiments in this article are divided into several independent experiments, each focusing on specific aspects of Q&A generation and evaluation, retrieval-augmented generation, and fine-tuning.
These experiments explore the following areas:
- Q&A Quality
- Context Study
- Model-to-Metric Computation
- Combined Generation vs. Separate Generation
- Retrieval Ablation Study
- Fine-tuning
Q&A Quality
This experiment evaluates the quality of Q&A pairs generated by three large language models, namely GPT-3, GPT-3.5, and GPT-4, under different contextual settings. The quality evaluation is based on multiple metrics, including relevance, coverage, overlap, and diversity.
Context Study
This experiment studies the impact of different contextual settings on the performance of model-generated Q&A pairs. It evaluates the generated Q&A pairs under three contextual settings: no context, context, and external context. An example is provided in Table 12.
In the no-context setting, GPT-4 has the highest coverage and the largest prompt size among the three models, indicating that it covers more of the text but generates longer questions. However, the three models show similar values for diversity, overlap, relevance, and fluency.
When context is included, GPT-3.5 shows a slight increase in coverage compared to GPT-3, while GPT-4 maintains the highest coverage. GPT-4 also has the largest prompt size, indicating its ability to generate longer questions and answers.
In terms of diversity and overlap, the three models perform similarly. For relevance and fluency, GPT-4 shows a slight increase compared to the other models.
In the external context setting, similar patterns are observed.
For GPT-4 in particular, the no-context setting appears to provide the best balance of average coverage, diversity, overlap, relevance, and fluency, but yields shorter Q&A pairs. The context setting leads to longer Q&A pairs with slight decreases in the other metrics, while the external-context setting produces the longest Q&A pairs while maintaining average coverage and showing slight increases in average relevance and fluency.
Thus, the choice between these three will depend on the specific requirements of the task. If prompt length is not a consideration, external context may be the best choice due to higher relevance and fluency scores.
Model-to-Metric Computation
This experiment compares the performance of GPT-3.5 and GPT-4 when used to compute metrics for evaluating the quality of Q&A pairs.
Overall, while GPT-4 tends to rate the generated Q&A pairs as more fluent and contextually accurate, it assigns lower diversity and relevance scores than GPT-3.5 does. These insights are important for understanding how different models perceive and evaluate the quality of generated content.
Combined Generation vs. Separate Generation
This experiment explores the pros and cons of generating questions and answers separately versus generating them together, focusing on the comparison in terms of token usage efficiency.
Overall, the separate approach of generating questions on their own offers better coverage but lower diversity, while the combined generation method scores higher on overlap and relevance. In terms of fluency, both methods perform similarly. The choice between the two therefore depends on the specific requirements of the task.
If the goal is to cover more of the source information, separate question generation would be preferred. However, if higher overlap with the source material is desired, the combined generation method is the better choice.
Retrieval Ablation Study
This experiment evaluates the retrieval capabilities of Retrieval-Augmented Generation, a method that enhances the inherent knowledge of LLMs by providing additional context during the question-answering process.
This article investigates the impact of the number of retrieved segments (i.e., top-k) on the results, with the findings presented in Table 16. By considering more segments, Retrieval-Augmented Generation can more consistently recover the original excerpts.
To ensure that the model can handle questions from various geographical backgrounds and phenomena, it is necessary to expand the corpus of supporting documents to cover a wide range of topics. As more documents are considered, the size of the index is expected to increase. This may increase the number of collisions between similar segments during retrieval, thereby hindering the ability to recover relevant information for the input question and reducing recall.
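To make the top-k trade-off concrete, the sketch below runs a toy retrieval ablation with FAISS and random embeddings; the index type, embedding dimension, and noise level are all assumptions, but the recall-versus-k pattern it measures mirrors the effect studied here:

```python
# Sketch of a top-k retrieval ablation: build a vector index over text segments
# and vary k to see how often the segment that originated a question is recovered.
# FAISS and random embeddings are stand-ins for the paper's actual index and encoder.
import numpy as np
import faiss

rng = np.random.default_rng(0)
dim, n_segments = 64, 1000
segment_vecs = rng.standard_normal((n_segments, dim)).astype("float32")
faiss.normalize_L2(segment_vecs)          # cosine similarity via inner product

index = faiss.IndexFlatIP(dim)
index.add(segment_vecs)

# Queries are noisy copies of known source segments, so we can measure recall@k.
source_ids = rng.integers(0, n_segments, size=100)
noise = 0.3 * rng.standard_normal((100, dim)).astype("float32")
queries = segment_vecs[source_ids] + noise
faiss.normalize_L2(queries)

for k in (1, 3, 5, 10):
    _, retrieved = index.search(queries, k)
    recall = np.mean([src in hits for src, hits in zip(source_ids, retrieved)])
    print(f"top-{k}: recall of the source segment = {recall:.2f}")
```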
Fine-tuning
This experiment evaluates the performance differences between fine-tuned models and base instruction-tuned models, aiming to understand the potential of fine-tuning to help models learn new knowledge.
For the base models, this article evaluates the open-source models Llama2-13B-chat and Vicuna-13B-v1.5-16k. These two models are relatively small and represent an interesting trade-off between computation and performance. Both models are fine-tuned versions of Llama2-13B, using different methods.
Llama2-13B-chat underwent instruction tuning through supervised fine-tuning and reinforcement learning. Vicuna-13B-v1.5-16k is an instruction-tuned version fine-tuned on the ShareGPT dataset. Additionally, this article evaluates the base GPT-4 as a larger, more expensive, and more powerful alternative.
For the fine-tuned models, this article directly fine-tunes Llama2-13B on agricultural data to compare its performance with similar models fine-tuned for more general tasks. This article also fine-tunes GPT-4 to assess whether fine-tuning remains beneficial for very large models. The guided evaluation results are shown in Table 18.
To comprehensively assess the quality of answers, this article evaluates not only accuracy but also the conciseness of the answers.
Table 21 shows that these models do not always provide complete answers to questions. For example, some answers indicate that soil erosion is a problem but fail to mention air quality.
Overall, in terms of answering accurately and concisely relative to the reference answers, the best-performing configurations are Vicuna + Retrieval-Augmented Generation, GPT-4 + Retrieval-Augmented Generation, GPT-4 fine-tuning, and GPT-4 fine-tuning + Retrieval-Augmented Generation. These models provide a balanced mix of accuracy, conciseness, and depth of information.
Knowledge Discovery
The research objective of this article is to explore the potential of fine-tuning to help GPT-4 learn new knowledge, which is crucial for applied research.
To test this, the article identifies questions that appear in similar form in at least three of the 50 U.S. states by computing cosine similarity between question embeddings, yielding a list of 1,000 such questions. These questions are removed from the training set, and the authors then assess whether fine-tuning and retrieval-augmented generation enable GPT-4 to acquire the knowledge needed to answer them by exploiting similarities between states.
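As an illustration of this filtering step, the sketch below embeds questions from different states and keeps those with close matches in at least three states; the TF-IDF embeddings and the similarity threshold are assumptions, not the paper's exact procedure:

```python
# Sketch of the question-similarity filter for the knowledge-discovery study:
# embed questions from different states and keep those that have close matches
# in at least three states. Embeddings and the threshold are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (state, question) pairs; a real run would use the full benchmark questions.
questions = [
    ("WA", "What cover crops reduce soil erosion?"),
    ("CA", "Which cover crops help reduce soil erosion?"),
    ("TX", "What cover crops can reduce soil erosion on my farm?"),
    ("WA", "How do I register a pesticide applicator license?"),
]
texts = [q for _, q in questions]
vecs = TfidfVectorizer().fit_transform(texts)
sims = cosine_similarity(vecs)

threshold = 0.6  # assumed similarity cutoff
shared = []
for i, (state_i, text_i) in enumerate(questions):
    states = {questions[j][0] for j in range(len(questions))
              if sims[i, j] >= threshold}
    if len(states) >= 3:          # the question recurs in at least three states
        shared.append(text_i)
print(shared)
```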
For more experimental results, please refer to the original paper.
