Introduction
Hello everyone, I am Liu Cong from NLP.
As the context length supported by large models continues to increase, a debate has emerged online about RAG versus Long-Context (many groups are discussing this topic, so I would like to share my thoughts). Frankly, the debate is unnecessary: the two are not in conflict, and it is not a matter of choosing one over the other.
My personal view is: if we compare it to a retrieval system, RAG should be considered coarse ranking, while Long-Context can be seen as fine ranking. The essence of RAG is to find relevant content fragments from a database/knowledge base based on user queries, and then use the large model to find or summarize answers. The essence of Long-Context is to feed all text content into the large model, where the user asks a question, and then the large model finds or summarizes the answer.
The fundamental difference lies in how external knowledge is provided to the large model and how much is given to the large model.
This is also the point of contention, as the longer the input length that the large model can accept, the less important retrieval becomes, and the reliance on retrieval effectiveness decreases. This is why some people hold the view that in the future, there will be no RAG, only Long-Context. However, as the input length of the large model increases, the resources occupied by the KV Cache increase, and costs will soar, which is also why some people believe that RAG will still exist in the future.
So how far into the future are we talking about? If the future is only 5 years away, I believe RAG will definitely exist; if the future is AGI, then perhaps RAG may not be needed.
Why RAG is Coarse Ranking and Long-Context is Fine Ranking
From a computational perspective, RAG currently relies on a retrieval system for content filtering, usually via methods like Elasticsearch (ES) or vector matching. These involve a comparatively small computational load: there is little interaction between texts. Long-Context, by contrast, engages the full model parameters as the user query interacts with the text, using every attention layer of the Transformer to locate the relevant text fragments.
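To make the "coarse ranking" framing concrete, here is a minimal sketch of the retrieval step: every document is scored with a single dot product, with no attention layers and no model parameters involved. The bag-of-words "embedding" is a toy stand-in for a real encoder, and the function names and corpus are illustrative assumptions, not from this post.

```python
import math

def bow_vector(text, vocab):
    """Toy bag-of-words 'embedding'; a real system would use a trained encoder."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def coarse_rank(query, docs, top_k=2):
    """Score each document with one dot product each -- cheap, coarse filtering."""
    words = sorted({w for text in docs + [query] for w in text.lower().split()})
    vocab = {w: i for i, w in enumerate(words)}
    q = bow_vector(query, vocab)
    scores = [sum(a * b for a, b in zip(q, bow_vector(d, vocab))) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:top_k]]

docs = [
    "KV cache memory grows with context length",
    "RAG retrieves relevant fragments from a knowledge base",
    "sliding windows reduce attention cost",
]
print(coarse_rank("how does RAG find fragments in a knowledge base", docs, top_k=1))
```

Only the top fragments survive this coarse pass and reach the model, which then does the fine-grained "ranking" with full attention.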
From the perspective of text selection, Long-Context is like already having all candidate content in front of the model for a given question, while RAG must first use retrieval to identify the content relevant to the question, which inevitably loses some information.
The Longer the Context Supported by the Large Model, the Friendlier It Is to RAG
RAG is currently the fastest, most effective, and safest way to implement large models. However, RAG still faces many issues, such as: during the segmentation of documents, the original meaning of the text can be cut off, and there are accuracy issues with retrieval matching. But when the context that the large model can support is longer, RAG can avoid or reduce document segmentation, allowing retrieval to recall more content; if Top 10 is not enough, we can use Top 100.
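The "if Top 10 is not enough, use Top 100" point can be made concrete with simple budget arithmetic: the recall depth retrieval can afford grows with the model's context window. A minimal sketch, where the chunk size and prompt overhead are assumed numbers for illustration:

```python
def budget_top_k(context_limit, chunk_tokens, prompt_overhead):
    """How many retrieved chunks fit in the model's context window,
    after reserving room for the question and instructions."""
    return max(0, (context_limit - prompt_overhead) // chunk_tokens)

# Assumed numbers: 500-token chunks, ~500 tokens reserved for the prompt
print(budget_top_k(4_096, 500, 500))      # a 4k-context model fits only a handful of chunks
print(budget_top_k(1_000_000, 500, 500))  # a 1M-context model fits nearly 2,000 chunks
```

A longer window also means chunks can be larger (or documents not split at all), which reduces the meaning-cutting problem described above.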
So,

- If you support RAG: even if the large model supports Long-Context, it is still just a large model.
- If you support Long-Context: RAG is merely a filtering method for the period before large models support infinite context.
The core issue of contention is whether the content length in the database/knowledge base will exceed the maximum length that the large model can support.
In reality, better retrieval and longer supported context are both essential to implementing large models today and to their sustainable development. We are family!!!
Long-Context Will Impact Some RAG Scenarios, but Not All
It is undeniable that Long-Context will impact some RAG scenarios.
The main scenarios affected are those where the content in the database/knowledge base does not exceed the maximum length of the large model. For example, when only a few documents serve as reference material, there is no need for retrieval; simply feeding everything to the large model (the brute-force solution) is sufficient.
But if the content exceeds the maximum length of the large model, how would you respond without retrieval filtering?
Many extreme voices say that perhaps there will be no domain-specific large models. But think it through: even if Gemini supports 1M tokens and Kimi-Chat supports 2 million words, that is only about 400 PDFs (assuming an average of 5k words per PDF). Is vertical-domain data (not to argue, I specifically mean certain large domains) really limited to that scale?
Additionally, in scenarios involving permission determinations, there is no need to use the large model for judgments; domain isolation can suffice, thus there is no need to provide all text to the large model (saving computational resources).
PS: As a side note, it seems a person generates only about 0.3 billion tokens in a lifetime (this was discussed in the group earlier). So once a model supports a maximum length of 300 million tokens, could we quickly "replicate" a person? No fine-tuning needed; just load them into the database.
Can You Deploy a 1M Token Large Model Service?
The memory occupied by the KV Cache is roughly 4 × batch size × number of model layers L × attention (hidden) dimension H × cached sequence length, where the factor of 4 comes from storing both the K and V tensors at 2 bytes each in fp16. Supporting a 1M length by brute force is indeed daunting.
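Plugging an assumed 70B-class model shape (80 layers, hidden dimension 8192; not figures from this post) into the formula above shows why. This sketch assumes full multi-head attention in fp16; real deployments using grouped-query attention or cache quantization need considerably less:

```python
def kv_cache_bytes(batch_size, n_layers, hidden_dim, seq_len, bytes_per_value=2):
    """KV Cache memory: 2 tensors (K and V) x 2 bytes (fp16) per value,
    i.e. the factor of 4 in the formula above."""
    return 2 * bytes_per_value * batch_size * n_layers * hidden_dim * seq_len

# Assumed 70B-class shape: 80 layers, hidden dimension 8192, batch size 1
cache = kv_cache_bytes(batch_size=1, n_layers=80, hidden_dim=8192, seq_len=1_000_000)
print(f"{cache / 2**30:.0f} GiB")  # well over 2 TiB for a single 1M-token sequence
```

And that is the cache alone, before counting the model weights themselves.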
Of course, there are optimization methods, such as sliding windows and cache quantization (further suggestions welcome), but even so, given the parameter scale of large models, the GPU memory usage remains frightening. Moreover, a sliding window feels rather similar to retrieval: both incur some loss. If that loss is acceptable, why not accept retrieval?
Of course, the previous discussions did not consider cost issues, but returning to reality, which manufacturers have the capability to deploy a model service supporting 1M tokens, or what is the cost-effectiveness of deploying such a model?
For ToC, the model service is deployed by the large model companies themselves and they bear the cost, so their question is how to turn a profit (of course, one can dream and set that aside).
For ToB, a very realistic issue arises: after using so many GPUs, what can be produced? Compared to RAG, does Long-Context bring accuracy improvements? How much improvement does it bring? Will it increase latency? In the case of the same GPU, does deploying a larger model with RAG offer better cost-effectiveness?
If you ever encounter a requirement to deploy a large model on a single T4 GPU, you will realize how cruel reality is. This is not meant as an argument; I am just on my own journey, trying to survive.
If You Believe in AGI, Caching All Text is Not a Dream
In an era of rapid technological development, one day, it may not be a dream to cache all text at a relatively low cost.
But is it really necessary?
Both Long-Context and RAG essentially aim to help the large model find better answers; true intelligence still relies on the model itself. Accepting longer context can indirectly reflect the intelligence of the model, but the intelligence of the model is not solely about accepting longer context.
Conclusion
Do not be too extreme; the world is not simply black and white, and there are many paths leading to AGI.
PS: Please add our public account to your favorites ⭐️ so you don’t get lost! Your likes, views, and follows are my greatest motivation to keep going!
Feel free to follow our public account “NLP Workstation”, join our discussion group, make friends, and learn together for mutual progress!
Our motto is “Life goes on, learning never stops”!
Previous recommendations:
- Musings on Role-Playing Large Models
- Self-Distillation Method – Mitigating Catastrophic Forgetting in Fine-Tuning Large Models
- Details Sharing of Yi Technical Report
- New Techniques for Incremental Pre-Training of Large Models – Solving Catastrophic Forgetting
- How to Improve Text Representation Ability of LLMs?
- DEITA – Efficient Data Screening Method for Instruction Fine-Tuning of Large Models
- Fine-Tuning Techniques for Large Models | High-Quality Instruction Data Screening Method – MoDS
- Debunking! Microsoft Retracts Claim that ChatGPT is a 20B Parameter Model and Provides Explanation
- How to View Microsoft's Claim that ChatGPT is a 20B (20 Billion) Parameter Model?
- Fine-Tuning Techniques for Large Models – Adding Noise to Embeddings to Improve Instruction Fine-Tuning Effectiveness
- How to Automatically Identify High-Quality Instruction Data from Datasets
- Details Sharing of BaiChuan2 Technical Report & Personal Thoughts
- Summary of Fine-Tuning Experiences for Large Models & Project Updates
- Building a Web UI for LLMs
- Are We Training Large Models, or Are Large Models Training Us?
- Llama2 Technical Details & Open Source Impact
- Reconsidering Industry Implementation in the Era of Large Models
- Some Thoughts on Vertical Domain Large Models and Summary of Open Source Models
- How to Evaluate the Quality of Large Models – LLMs?
- Summary | Application of Prompts in NER Scenarios