Prompt, RAG, Fine-Tuning, or Training From Scratch? Choosing the Right Generative AI Approach

Source: DeepHub IMBA



This article is approximately 2,600 words, with a suggested reading time of 5 minutes.
It attempts to provide recommendations for choosing the right generative AI approach based on several common, quantifiable metrics.



Generative AI is rapidly evolving, and many people are trying to use this technology to solve their business problems. Generally, there are four common approaches:
  • Prompt Engineering
  • Retrieval Augmented Generation (RAG)
  • Fine-Tuning
  • Training a Foundation Model (FM) From Scratch
This article does not cover the option of using a base model as-is, since there are hardly any business cases in which one can be used effectively. A base model may work well for general-purpose queries, but any specific use case calls for one of the approaches above.

How Do We Compare the Approaches?

We will compare the approaches based on the following metrics:
  • Accuracy (How accurate are the responses?)
  • Implementation Complexity (How complex is the implementation?)
  • Effort Required (How much work does the implementation take?)
  • Total Cost (What is the total cost of owning the solution?)
  • Flexibility (How loosely coupled is the architecture? How easy is it to replace or upgrade components?)
Rating each approach against these metrics gives a simple side-by-side comparison.

Accuracy

Let’s first address the most critical point of discussion: which method provides the most accurate responses?
Prompt Engineering provides as much context as possible, offering a few examples to help the base model better understand the use case. While the results can seem impressive on their own, this approach is the least accurate of the four.
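As a concrete illustration, here is a minimal few-shot prompt sketch in Python; the support-assistant task, the example Q&A pairs, and the `build_prompt` helper are all hypothetical placeholders for whatever model interface you actually use.

```python
# Minimal few-shot prompt sketch. The task and examples are placeholders;
# the resulting string would be sent to your model client of choice.
FEW_SHOT_TEMPLATE = """You are a support assistant for an online store.

Q: Where is my order #1234?
A: You can track any order under Account > Orders; shipping updates appear there.

Q: Can I return an opened item?
A: Opened items can be returned within 14 days if the packaging seal is intact.

Q: {question}
A:"""

def build_prompt(question: str) -> str:
    # The in-context examples above steer the base model toward the use case.
    return FEW_SHOT_TEMPLATE.format(question=question)

print(build_prompt("Do you ship internationally?"))
```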
RAG produces high-quality results because it injects context specific to the use case, retrieved from a vectorized information store, directly into the prompt. Compared to Prompt Engineering, it significantly improves the results and carries a much lower likelihood of generating hallucinations.
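To make the mechanism concrete, below is a toy retrieval sketch: a deterministic `embed()` function stands in for a real embedding model, and a plain Python list stands in for a vector database, so the example stays self-contained.

```python
# Toy RAG retrieval sketch. embed() is a deterministic stand-in for a real
# embedding model, and the in-memory list stands in for a vector database.
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in embedding: a deterministic pseudo-random unit vector per text.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)

documents = [
    "Refunds are processed within 5 business days.",
    "Premium support is available 24/7 by phone.",
    "Orders over $50 ship free within the EU.",
]
store = [(doc, embed(doc)) for doc in documents]  # (text, vector) pairs

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank stored chunks by cosine similarity to the query vector.
    q = embed(query)
    ranked = sorted(store, key=lambda pair: -float(q @ pair[1]))
    return [text for text, _ in ranked[:k]]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
print(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```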
Fine-tuning also provides quite accurate results, with output quality comparable to RAG. Since the model's weights are updated on domain-specific data, it generates more contextual responses. The quality may be slightly better than RAG, but this depends on the specific case, so it is worth spending time on a trade-off analysis between the two. There are often reasons to choose fine-tuning beyond accuracy alone, including the frequency of data changes, the need to control how the model is deployed in your environment, compliance, and reproducibility.
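For concreteness, here is a minimal fine-tuning sketch using the Hugging Face `transformers` Trainer; the base model (`gpt2`), the two toy training texts, and the hyperparameters are illustrative assumptions, not a production recipe.

```python
# Minimal causal-LM fine-tuning sketch with Hugging Face transformers.
# Model, data, and hyperparameters are toy placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

examples = [  # toy domain-specific training texts
    "Q: What does clause 4.2 cover? A: It caps liability at the contract value.",
    "Q: Who signs an NDA? A: Both the disclosing and receiving parties.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = Dataset.from_dict({"text": examples}).map(tokenize, batched=True)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # updates the model weights on the domain data
```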
Training from scratch produces the highest-quality results. Since the model is trained from zero on use-case-specific data, the likelihood of generating hallucinations is close to zero and the accuracy of the output is very high.

Implementation Complexity

Aside from accuracy, another aspect to consider is the ease of implementing these methods.
Prompt Engineering has relatively low implementation complexity because it requires almost no programming. Strong language skills (in English or another language) and domain expertise are enough to create a good prompt using in-context learning and few-shot methods.
RAG is more complex than Prompt Engineering because coding and architecture skills are needed to implement the solution. Depending on the tools chosen for the RAG architecture, the complexity can be even higher.
Fine-tuning is more complex than either of the above because the model's weights/parameters are changed via tuning scripts, which requires data science and ML expertise.
Training from scratch clearly has the highest implementation complexity: it requires extensive data collection and processing, and training a large model demands deep data science and ML expertise.

Effort Required

Implementation complexity and the effort required are not always proportional.
Prompt Engineering takes a significant amount of iteration to get right. The base model is very sensitive to prompt wording; changing a single word, or even a verb, can sometimes yield a completely different response. Considerable iteration is therefore needed to tailor the prompt to the need at hand.
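One way to make that iteration systematic is a small harness that scores prompt variants against a fixed test set. In the sketch below, `ask_model()` is a hypothetical stand-in for a call to whichever model you use, and the variants and test cases are invented.

```python
# Sketch of a prompt-variant evaluation loop. ask_model() is a hypothetical
# stand-in; wire it to your model API before running the scoring loop.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("connect this to your model of choice")

VARIANTS = [
    "Classify the sentiment of this review as positive or negative: {text}",
    "Is this review positive or negative? Answer with one word. {text}",
]

TEST_SET = [
    ("The battery died after two days.", "negative"),
    ("Best purchase I made all year!", "positive"),
]

def score(variant: str) -> float:
    # Fraction of test cases where the expected label appears in the answer.
    hits = sum(
        expected in ask_model(variant.format(text=text)).strip().lower()
        for text, expected in TEST_SET
    )
    return hits / len(TEST_SET)

# Once ask_model() is wired up, keep the wording that scores best:
# best_variant = max(VARIANTS, key=score)
```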
RAG also requires substantial effort, somewhat more than Prompt Engineering, because of the tasks involved in creating embeddings and setting up the vector store.
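Much of that effort is data plumbing: splitting source documents into chunks, embedding each chunk, and writing the vectors to the store. A minimal chunking sketch is below; the chunk size and overlap values are arbitrary assumptions.

```python
# Sketch of the document-preparation step for RAG: split text into
# overlapping chunks before embedding. Size and overlap are arbitrary.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [
        text[start:start + size]
        for start in range(0, max(len(text) - overlap, 1), step)
    ]

document = "An example clause about refunds and shipping terms. " * 30
for i, chunk in enumerate(chunk_text(document)):
    # Each chunk would be embedded and written to the vector store here.
    print(i, len(chunk))
```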
Fine-tuning is more labor-intensive than the first two. Although it can be done with very little data (in some cases as few as 30 examples, or even fewer), setting it up and finding the right values for the tunable parameters takes time.
Training from scratch is the most labor-intensive method of all. It requires extensive iterative development to arrive at an optimal model that delivers the right technical and business outcomes. The process starts with data collection and curation, moves on to designing the model architecture, and involves experimenting with different modeling approaches to find the best model for the use case. It can take weeks to months.

Total Cost

We are discussing not just the cost of services and components, but the total cost of owning the solution. That includes skilled engineers (personnel), the time required to build and maintain the solution, and other expenses such as maintaining infrastructure, executing upgrades and the downtime they entail, establishing support channels, hiring, and upskilling.
The cost of Prompt Engineering is relatively low: it mainly involves maintaining the prompt templates and keeping them updated whenever the base model version changes or a new model is released, plus the cost of hosting the model or calling it directly via an API.
Because its architecture involves multiple components, RAG costs somewhat more than Prompt Engineering. The total depends on the embedding model, the vector store, and the LLM used, since all three components must be paid for.
The cost of fine-tuning is certainly higher than the first two, because tuning a model requires substantial computational power along with deep ML skills and an understanding of the model architecture. Maintenance costs are also higher, since a new tuning run is needed every time the base model version is updated or a new batch of data arrives.
Training from scratch is undoubtedly the most expensive option: the team must cover end-to-end data processing and ML training, tuning, and deployment, which takes a group of highly skilled machine learning practitioners. Maintenance costs are very high as well, given the frequent retraining cycles needed to keep the model in sync with new information around the use case.

Flexibility

Let's look at how each approach fares when it comes to making updates and changes.
Prompt Engineering offers very high flexibility, because adapting to a new base model or use case only requires changing the prompt templates.
In terms of architectural changes, RAG likewise offers the highest degree of flexibility. The embedding model, the vector store, and the LLM can each be swapped independently with minimal impact on the other components, and additional components (for example, authorization steps in more complex workflows) can be added without affecting the rest of the system.
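That loose coupling can be made explicit in code by programming against small interfaces rather than concrete services. Here is a sketch using Python's `typing.Protocol`; all class and method names are illustrative, not any particular library's API.

```python
# Sketch of loosely coupled RAG components via structural interfaces.
# Any embedder, vector store, or LLM matching the protocol can be swapped
# in without touching the pipeline code below.
from typing import Protocol

class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...

class VectorStore(Protocol):
    def search(self, vector: list[float], k: int) -> list[str]: ...

class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...

def answer(question: str, emb: Embedder, store: VectorStore, llm: LLM) -> str:
    # The pipeline depends only on the interfaces, not on any vendor.
    context = "\n".join(store.search(emb.embed(question), k=3))
    return llm.generate(f"Context:\n{context}\n\nQuestion: {question}")
```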
Fine-tuning offers very low flexibility, because any change to the data or inputs requires another fine-tuning cycle, which can be complex and time-consuming. Likewise, adapting the same fine-tuned model to a different use case takes considerable work, since weights tuned for one domain may perform worse in others.
Training from scratch has the lowest flexibility of all. Because the model is built from the ground up, any update triggers another complete retraining cycle. (The model could be fine-tuned instead of retrained from scratch, but the accuracy would differ.)

Conclusion

From all the comparisons above, it is clear that there is no outright winner. The final choice depends on which metrics matter most when designing the solution. Our recommendations are as follows:
When higher flexibility in changing models and prompt templates is desired, and the use case does not involve a lot of domain context, Prompt Engineering can be used.
When the highest degree of flexibility in swapping out components (data sources, embeddings, the FM, vector engines) is desired, use RAG: it is simple and maintains high-quality output, provided you have the data.
When better control of model artifacts and their version management is desired, fine-tuning can be used. It is especially useful in domains with highly specialized terminology and data (law, biology, etc.).
When none of the above fits, that is, when the accuracy of those solutions is deemed insufficient, training from scratch can be considered, provided the budget and time are available to do it well.
In summary, choosing the right generative AI method requires careful thought and an assessment of which trade-offs across these metrics are acceptable. It may even mean choosing different solutions at different stages.

Editor: Wenjing

