Will the Larger Context Window Kill RAG?
640 KB ought to be enough for anybody.
Bill Gates, 1981
There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.
Eric Schmidt, 2010
Information is the oil of the 21st century, and analytics is the combustion engine.
Peter Sondergaard, 2011
The context window will kill RAG.
Every second AI specialist, 2024.
Disclaimer:
There is no solid proof that the quotes mentioned here are accurate. The text below is purely the author’s own imagination. I assume that a wonderful future is just around the corner: a super-duper chip will be invented that resolves memory issues, LLMs will become cheaper and faster, and the hallucination problem will be solved. Therefore, this text should not be taken as the ultimate truth.
Lately, there’s been a lot of buzz around the arrival of LLMs with large context windows — millions of tokens. Some people are already saying that this will make RAG obsolete.
But is that really the case?
Are we so sure that larger context windows will always keep up with the exponential growth of data? According to estimates, the total amount of data in the world doubles every two to three years. At some point, even these huge context windows might start looking a bit too cramped.
Let’s say we’re talking about a million tokens right now. That’s roughly 2,000 pages of English text: think of 200 contracts, each a hundred pages long. Not that impressive if we’re talking about large-scale company archives. Even 10 million tokens is only about 20,000 pages of English text, and Slavic or Asian languages fare worse, since tokenizers tuned for English spend noticeably more tokens per word on them.
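To sanity-check that arithmetic, here is a quick back-of-the-envelope script. The 500 tokens-per-page figure is my own rough assumption for English prose, not a measured constant; real tokenizers vary, and vocabularies tuned for English often spend 1.5 to 3 times more tokens on Slavic or CJK text.

```python
# Back-of-the-envelope check of the context-window arithmetic above.
# Assumption: ~500 tokens per page of English prose (a rough average).
TOKENS_PER_PAGE_EN = 500

def pages_that_fit(context_tokens: int, tokens_per_page: float) -> int:
    """How many pages of prose a context window of a given size holds."""
    return int(context_tokens // tokens_per_page)

print(pages_that_fit(1_000_000, TOKENS_PER_PAGE_EN))      # 2000
print(pages_that_fit(10_000_000, TOKENS_PER_PAGE_EN))     # 20000
# The same window shrinks for less tokenizer-friendly languages:
print(pages_that_fit(1_000_000, TOKENS_PER_PAGE_EN * 2))  # 1000
```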
So, we're not talking about fitting an entire corporate database into a single context just yet. Instead, it’s more about relaxing the requirements on search accuracy: you can grab a broad set of a few hundred relevant documents and let the model do the fact extraction on its own.
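That pattern, recall-oriented retrieval plus letting the model extract facts from a big packed context, can be sketched like this. Everything here (`Doc`, `search`, `llm`, the parameter names) is a hypothetical stand-in rather than any specific library's API:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    n_tokens: int

def answer_with_long_context(query, search, llm, k=300, budget_tokens=1_000_000):
    # Recall-oriented retrieval: take a broad candidate set instead of
    # trying to nail a precise top-5.
    docs = search(query, top_k=k)
    # Pack documents until the context budget runs out.
    context, used = [], 0
    for doc in docs:
        if used + doc.n_tokens > budget_tokens:
            break
        context.append(doc.text)
        used += doc.n_tokens
    # Let the model do the fine-grained fact extraction itself.
    prompt = ("Answer using only the documents below.\n\n"
              + "\n---\n".join(context)
              + "\n\nQuestion: " + query)
    return llm(prompt)
```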
But here's what's important. We’re still in the early days of RAG. Right now, RAG handles information retrieval well but struggles with more complex analytical tasks, like the ones in the infamous FinanceBench. And if we’re talking about creative tasks that need deep integration with unique, user-specific content, RAG is still hovering at the edge of what's possible. In other words, at this stage, a million tokens feel like more of a “buffer” than a solution.
But the larger context windows might give RAG a major boost! Here’s why:
• Tackling more complex tasks. As context windows grow, RAG will be able to handle much more sophisticated analytical and creative challenges, weaving internal data together to produce insights and narratives.
• Blending internal and external data. With larger context, RAG will be able to mix internal company data with real-time info from the web, unlocking new possibilities for hybrid use cases.
• Keeping interaction context intact. Longer contexts mean keeping the entire conversation history alive, turning interactions into richer dialogues that are deeply rooted in “your” data.
So, what’s next? Once people and companies have tools to find and analyze all their stored data, they’re going to start digitizing everything. Customer calls, online and offline behavior patterns, competitor info, logs from every single meeting… You name it. Data volumes will start skyrocketing again, and no context window — no matter how big — will ever be able to capture it all.
And that’s when we’ll be heading into the next RAG evolution, which will need even more advanced techniques to keep up.
@advancedrag #blog
Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization
In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework, that integrates RAG with KG and a vector store (VS) that store factual domain specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific: (i) KG (containing structured information), and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks.
An interesting study where a Knowledge Graph is used to refine search across a large corpus of documents.
Key features:
— The KG ontology is created using NLP libraries (without the use of LLMs);
— The ontology structure is fixed and reflects parameters of scientific articles (authors, publication year, Scopus category, affiliations, affiliation country, acronyms, publisher, topics, topic keywords, citations, references) and standard NLP tags (events, persons, locations, products, organizations, and geopolitical entities);
— The question examples provided in the article include keys from the listed ontology.
In my opinion, this study shows that when creating RAG systems, the problem of filtering relevant documents and chunks is critical. It’s not enough to rely solely on re-ranking algorithms like bi-/cross-encoders. A graph-based representation helps to organize data, and queries to the graph database allow selecting only the necessary content, which can then be re-ranked.
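As a toy illustration of that filter-then-rerank pattern: a structured filter over ontology fields stands in for the graph-database query, and `score_pair` is a naive token-overlap placeholder for a real bi-/cross-encoder, not the paper's code.

```python
# Toy sketch of "graph filter, then re-rank". The metadata filter plays
# the role of a Cypher/SPARQL query over the fixed ontology (author,
# year, topic, ...); score_pair is a naive stand-in for a cross-encoder.
def score_pair(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def graph_filter_then_rerank(query, docs, must_match, top_k=5):
    # Stage 1: structured filter shrinks the candidate pool.
    pool = [d for d in docs
            if all(d["meta"].get(k) == v for k, v in must_match.items())]
    # Stage 2: re-rank only the much smaller filtered pool.
    pool.sort(key=lambda d: score_pair(query, d["text"]), reverse=True)
    return pool[:top_k]

docs = [
    {"meta": {"year": 2020}, "text": "tensor factorization methods"},
    {"meta": {"year": 2021}, "text": "knowledge graph retrieval"},
    {"meta": {"year": 2020}, "text": "knowledge graph construction"},
]
top = graph_filter_then_rerank("knowledge graph", docs, {"year": 2020})
print([d["text"] for d in top])
```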
https://arxiv.org/pdf/2410.02721
@advancedrag #digest
Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models
Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called “late chunking”, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term “late” in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.
A rather interesting article about enriching each chunk's embedding with contextual information from the surrounding text. The method noticeably increases how often chunks containing the correct answer land in the top 10 results. The overall improvement might seem small (~3.5%), but that is the essence of search: there is no "silver bullet" that instantly delivers perfect ranking, only many search factors that, combined, improve the final ranking.
What's also interesting is that as chunk size increases, the top-10 hit rate stops improving. This could suggest we are hitting the limit of simple bi-encoder solutions: even with all the contextual information from the surrounding chunks gathered in, vector similarity alone still struggles to surface the correct answer.
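A minimal numpy sketch of the idea (my own illustration, not the paper's code): run the long-context embedding model over the whole document once, then mean-pool each chunk's token span, so every chunk vector has already "seen" its surroundings through the transformer's attention.

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, spans) -> np.ndarray:
    """token_embeddings: (n_tokens, dim) transformer output for the WHOLE
    document; spans: [start, end) token ranges of each chunk. Pooling
    happens after the model has attended across chunk boundaries."""
    return np.stack([token_embeddings[s:e].mean(axis=0) for s, e in spans])

# Toy demo: 6 "token" vectors in 2 dimensions, split into two chunks.
emb = np.array([[1.0, 0.0], [3.0, 0.0],
                [0.0, 2.0], [0.0, 4.0], [5.0, 5.0], [1.0, 1.0]])
chunks = late_chunk(emb, [(0, 2), (2, 6)])
print(chunks)  # chunk means: [2, 0] and [1.5, 3]
```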
PS: Some time ago, I started following Jina, they are one of the most active companies sharing their open-weight embedding and reranker models on Hugging Face.
https://arxiv.org/pdf/2409.04701
@advancedrag #digest
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly
A great study: you can observe how the metrics degrade as the context size grows.
Using open datasets, the team built a tool for testing how different LLMs handle various tasks.
Performance degradation with longer inputs is category-dependent. Most frontier models largely retain performance on recall and RAG with longer inputs; however, even the best models significantly degrade with more contexts on tasks like re-ranking and generation with citations.
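The kind of curve such evaluations report can be sketched as a tiny harness loop. Hedged: `run_task`, the length grid, and the fake scores below are invented placeholders for illustration, not HELMET's actual interface or results.

```python
# Sketch of a length-sweep evaluation: run the same task at several
# input lengths and watch the score change. `run_task` stands in for a
# model-plus-dataset harness that returns a metric in [0, 1].
def degradation_curve(run_task, lengths=(8_000, 32_000, 128_000)):
    """Return {input_length: score} for one task/model pair."""
    return {n: run_task(max_input_tokens=n) for n in lengths}

# Example with a fake harness whose score decays with input length:
fake = lambda max_input_tokens: round(1.0 - max_input_tokens / 1_000_000, 3)
print(degradation_curve(fake))  # {8000: 0.992, 32000: 0.968, 128000: 0.872}
```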
https://arxiv.org/pdf/2410.02694
@advancedrag #digest
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM
This work presents an interpretable decision-making framework for autonomous vehicles that integrates traffic regulations, norms, and safety guidelines comprehensively and enables seamless adaptation to different regions. While traditional rule-based methods struggle to incorporate the full scope of traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on Retrieval-Augmented Generation (RAG) to automatically retrieve relevant traffic rules and guidelines from extensive regulation documents and relevant records based on the ego vehicle’s situation. Given the semantic complexity of the retrieved rules, we also design a reasoning module powered by a Large Language Model (LLM) to interpret these rules, differentiate between mandatory rules and safety guidelines, and assess actions on legal compliance and safety.
Additionally, the reasoning is designed to be interpretable, enhancing both transparency and reliability. The framework demonstrates robust performance on both hypothesized and real-world cases across diverse scenarios, along with the ability to adapt to different regions with ease.
A very unexpected, yet in hindsight obvious, application of RAG to edge cases. Using an LLM to explain the reasoning behind certain non-obvious actions is a great idea, especially when various legal nuances are involved.
https://arxiv.org/pdf/2410.04759
@advancedrag #digest
