As the context length supported by large models keeps increasing, a debate between RAG and Long-Context has, somewhat surprisingly, emerged (the topic keeps coming up in several chat groups, so I'd like to share some thoughts). The debate really isn't necessary, mainly because the two do not conflict; it's not an either-or choice.
My personal opinion: if we compare them to a retrieval system, RAG should be considered as coarse ranking, while Long-Context can be seen as fine ranking. The essence of RAG is to find relevant content fragments from a database/knowledge base based on user questions, and then use the large model to find or summarize the answer. The essence of Long-Context is to input all text content into the large model, so when a user asks a question, the model retrieves or summarizes the answer.
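To make the contrast concrete, here is a minimal sketch of the two patterns in Python. The names `search_top_k` and `llm_answer` are hypothetical stand-ins for a retriever call and a model call, not any specific library's API.

```python
# Minimal sketch of the two patterns; `search_top_k` and `llm_answer` are
# hypothetical stand-ins for a retriever and a model call, not real APIs.

def rag_answer(question: str, knowledge_base: list[str],
               search_top_k, llm_answer, k: int = 10) -> str:
    # Coarse ranking: a cheap retrieval step picks k candidate chunks.
    chunks = search_top_k(question, knowledge_base, k)
    # The model then reads only those chunks to find/summarize the answer.
    return llm_answer(question, context="\n\n".join(chunks))

def long_context_answer(question: str, knowledge_base: list[str],
                        llm_answer) -> str:
    # No retrieval: all text goes into the prompt, and the model's own
    # attention does the "fine ranking" internally.
    return llm_answer(question, context="\n\n".join(knowledge_base))
```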
The fundamental difference lies in how external knowledge is provided to the large model and how much is provided.
This is also where the debate arises: as the length a large model can accept increases, the importance of retrieval decreases, and so does the dependence on retrieval quality. This is why some people believe that in the future there will be no RAG, only Long-Context. However, as the model's input length grows, the memory occupied by the KV Cache also grows, driving costs up significantly, which is why others believe RAG will still exist in the future.
So how far into the future are we talking about? If it’s just five years, I believe RAG will definitely exist; if it’s AGI in the future, then perhaps RAG won’t be necessary.
Why RAG is Coarse Ranking and Long-Context is Fine Ranking
From a computational perspective, RAG currently relies on a retrieval system to filter relevant content, generally using methods like ES (Elasticsearch) and vector matching; the computational load is relatively low, meaning there is little interaction between the query and the texts. With Long-Context, by contrast, the user's query interacts with the text through the full model parameters, using every attention layer of the Transformer to locate the relevant text fragments.
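As a rough back-of-envelope comparison of the two, with numbers I am assuming purely for illustration (corpus size, chunk count, embedding dimension, and a 7B-class model config), retrieval costs roughly one dot product per chunk, while long-context prefill touches every layer's attention and projections over the whole corpus:

```python
# Rough back-of-envelope, with assumed illustrative numbers (not measured):
# a 1M-token corpus split into 2,000 chunks, 1,024-dim embeddings, and a
# 7B-class model with 32 layers and hidden size 4,096.

n_tokens  = 1_000_000   # corpus size in tokens
n_chunks  = 2_000       # chunks in the vector index
embed_dim = 1_024       # embedding dimension used for matching
n_layers  = 32          # Transformer layers
hidden    = 4_096       # model hidden size

# RAG's coarse step: roughly one dot product per chunk against the query.
retrieval_flops = 2 * n_chunks * embed_dim

# Long-Context: rough prefill cost over the whole corpus, counting the
# attention-score term plus the projection/MLP term per layer.
prefill_flops = n_layers * (4 * n_tokens**2 * hidden      # QK^T and scores*V
                            + 24 * n_tokens * hidden**2)  # projections + MLP

print(f"retrieval ~{retrieval_flops:.1e} FLOPs vs prefill ~{prefill_flops:.1e} FLOPs")
```

The exact numbers don't matter; the point is the gap of many orders of magnitude, which is why the coarse/fine ranking analogy holds.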
From the perspective of text selection, with Long-Context the content relevant to a question is guaranteed to already be in the input, whereas RAG has to rely on retrieval to decide which content is relevant to the question, which introduces some information loss.
Longer Context Support in Large Models Is Actually Friendlier to RAG
RAG is currently the fastest, most effective, and safest way to put large models into production. But RAG still has plenty of problems, such as document chunking cutting across the original semantics of the text, and the accuracy limits of retrieval matching. When the context a large model can support gets longer, RAG can avoid or reduce document chunking and let retrieval recall more content: if Top 10 isn't enough, go for Top 100.
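A minimal sketch of that idea, assuming the chunks are already ranked by relevance and that `count_tokens` is a hypothetical tokenizer call (any real tokenizer could stand in):

```python
# Fill the model's context budget with as many ranked chunks as fit,
# instead of a hard-coded Top 10. `count_tokens` is a hypothetical
# tokenizer call; any real tokenizer could stand in for it.

def pack_context(ranked_chunks: list[str], count_tokens,
                 context_budget: int = 128_000) -> list[str]:
    packed, used = [], 0
    for chunk in ranked_chunks:            # already sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > context_budget:   # stop when the window is full
            break
        packed.append(chunk)
        used += cost
    return packed
```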
So,

- If you side with RAG, then a large model, even one that supports Long-Context, is still just the large model;
- If you side with Long-Context, then RAG is merely a stopgap filtering step until large models support unlimited context.
The core issue of the debate is whether the length of the content in the database/knowledge base will exceed the maximum length that the large model can support.
That said, better retrieval and longer supported context in large models are both essential for deploying large models today and for their sustainable development going forward. We are family!!!
Long-Context Will Impact Some RAG Scenarios, But Not All
It is undeniable that Long-Context will impact some RAG scenarios.
Primarily, these are scenarios where the content of the database/knowledge base does not exceed the maximum length the large model can take, for example when only a few documents serve as reference material: there is no need for retrieval, and simply giving everything to the large model is enough.
But if the content exceeds the maximum length of the large model, how would you handle it without retrieval filtering?
Many people take the extreme view that domain-specific large models may no longer be needed. But think about it carefully: even if Gemini supports 1M tokens and Kimi-Chat supports 2 million words, that is only about 400 PDFs (assuming an average of 5k words per PDF). How could the data in a vertical domain be only that much?
Additionally, in scenarios involving permission control, there is no need to let the large model make the access decision; data isolation by domain is enough, so there is no need to feed all the text to the large model (saving computational resources).
PS: On a side note, it seems a person only produces about 0.3 billion tokens in a lifetime (group friends discussed this before). So when a model supports a maximum length of 300 million tokens, could we quickly replicate a person? No fine-tuning needed, just hand over the database.
Can You Deploy a 1M Token Large Model Service?
The KV Cache size can be estimated as 4 * batch size * number of model layers L * attention hidden dimension H * KV Cache length, in bytes for fp16 (the factor 4 comes from storing both K and V at 2 bytes per value); supporting a hard 1M length is indeed daunting.
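To make that concrete, a quick calculation in Python, assuming a LLaMA-2-7B-like configuration (32 layers, hidden size 4096, full multi-head attention) and fp16 KV storage; these numbers are illustrative, not a benchmark of any particular deployment:

```python
# KV Cache size for a hypothetical 1M-token context, fp16, no GQA/MQA.
# Factor 4 = 2 tensors (K and V) * 2 bytes per fp16 value.

batch_size = 1
n_layers   = 32          # LLaMA-2-7B-like
hidden_dim = 4_096       # LLaMA-2-7B-like
seq_len    = 1_048_576   # "1M" tokens of cached context

kv_cache_bytes = 4 * batch_size * n_layers * hidden_dim * seq_len
print(kv_cache_bytes / 2**30, "GiB")   # -> 512.0 GiB, before model weights
```

Half a terabyte of KV Cache for a single request, on top of the model weights themselves, is exactly why this feels daunting.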
Of course, there are some optimization methods, such as sliding windows and cache quantization (additions welcome), but even so, on top of the model's own parameter scale the memory usage is still alarming. Moreover, a sliding window is, in a sense, similar to retrieval: both are lossy. So if losses are acceptable, why not accept retrieval?
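To picture why a sliding window is lossy in the same sense as retrieval, here is a toy sketch that keeps only the most recent `window` positions of the KV Cache and silently drops everything older; it is purely illustrative, not how any particular inference engine implements it:

```python
from collections import deque

class SlidingKVCache:
    """Toy sliding-window KV cache: only the last `window` positions survive."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)    # oldest K is evicted when full
        self.values = deque(maxlen=window)  # oldest V is evicted when full

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def visible(self):
        # What attention can still see: anything older has been lost,
        # much like chunks an imperfect retriever fails to recall.
        return list(self.keys), list(self.values)
```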
Of course, the content discussed earlier does not consider cost issues, but returning to reality, which manufacturers have the capability to deploy a large model service that supports 1M tokens, or what is the cost-performance ratio of deploying such a model?
For ToC, large model services are deployed by large model companies, and the costs are borne by them, so the focus is on profitability (though one can choose to ignore this for the sake of dreams).
For ToB, there's a very practical issue: with that many GPUs in use, what do you actually get in return? Compared to RAG, can Long-Context improve accuracy, by how much, and will it add latency? With the same GPUs, would deploying a larger model plus RAG be more cost-effective?
If you ever run into a requirement to deploy a large model on a single T4, you'll understand how harsh reality is. This isn't me arguing for the sake of arguing; it's just me walking my own path, trying to survive.
If You Believe in AGI, Caching All Text Isn’t a Dream
In an era of rapid technological development, perhaps one day it won’t be a dream to cache all text at a lower cost.
But is it really necessary?
Both Long-Context and RAG are essentially about helping the large model find better answers; real intelligence still comes from the model itself. Being able to accept a longer context can reflect a model's intelligence to some extent, but a model's intelligence does not depend solely on how long a context it accepts.