Yuri Vorontsov: RAGs, LLM Papers & AI Search Renaissance
A perspective on the world of AI Search through the eyes of the founder of Yandex Vertical Search.
DM: @ragexpert
Will the Larger Context Window Kill RAG?

640 KB ought to be enough for anybody.

Bill Gates, 1981

There were 5 Exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days.

Eric Schmidt, 2010

Information is the oil of the 21st century, and analytics is the combustion engine.

Peter Sondergaard, 2011

The context window will kill RAG.

Every second AI specialist, 2024.

Disclaimer:
There is no solid proof that the quotes above are accurate, and the text below is purely the author's speculation. I assume that a wonderful future is just around the corner: a super-duper chip gets invented and resolves the memory bottleneck, LLMs become cheaper and faster, and the hallucination problem is solved. So don't take this text as ultimate truth.

Lately, there’s been a lot of buzz around the arrival of LLMs with large context windows — millions of tokens. Some people are already saying that this will make RAG obsolete.

But is that really the case?

Are we so sure that larger context windows will always keep up with the exponential growth of data? According to estimates, the total amount of data in the world doubles every two to three years. At some point, even these huge context windows might start looking a bit too cramped.

Let’s say we’re talking about a million tokens right now — that’s roughly 2,000 pages of text. Think of 20 contracts, each a hundred pages long. Not that impressive if we’re talking about large-scale company archives. Even if we're talking about 10 million tokens, that's 20,000 pages of English text. And what about Slavic or Eastern languages, where tokenizers typically produce more tokens per page?
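A back-of-envelope sketch of the arithmetic above (the ~500 tokens-per-page figure and the higher per-page token count for non-English text are rough assumptions, not measurements):

```python
# Rough estimate of how many pages fit in a context window.
# Assumptions (for illustration only): ~500 tokens per page of English
# text; morphologically rich languages often tokenize into noticeably
# more tokens per page with BPE vocabularies (~900 is a made-up figure).

def pages_that_fit(context_tokens: int, tokens_per_page: int = 500) -> int:
    """How many pages of text fit into a given context window."""
    return context_tokens // tokens_per_page

for window in (1_000_000, 10_000_000):
    en = pages_that_fit(window)                       # English
    ru = pages_that_fit(window, tokens_per_page=900)  # assumed denser tokenization
    print(f"{window:>10,} tokens ≈ {en:,} EN pages, ≈ {ru:,} non-EN pages")
```

One million tokens comes out to 2,000 English pages under these assumptions, and barely over 1,100 pages for a language that tokenizes less efficiently.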

So, we're not talking about fitting an entire corporate database into a single context just yet. Instead, it’s more about reducing the requirement for search accuracy. You can just grab a broad set of a few hundred relevant documents, and let the model do the fact extraction on its own.

But here's what's important. We’re still in the early days of RAG. Right now, RAG handles information retrieval well but struggles with more complex analytical tasks, like the ones in the infamous FinanceBench. And if we’re talking about creative tasks that need deep integration with unique, user-specific content, RAG is still hovering at the edge of what's possible. In other words, at this stage, a million tokens feel like more of a “buffer” than a solution.

But the larger context windows might give RAG a major boost! Here’s why:
— Tackling more complex tasks. As context windows grow, RAG will be able to handle much more sophisticated analytical and creative challenges, weaving internal data together to produce insights and narratives.
— Blending internal and external data. With larger context, RAG will be able to mix internal company data with real-time info from the web, unlocking new possibilities for hybrid use cases.
— Keeping interaction context intact. Longer contexts mean keeping the entire conversation history alive, turning interactions into richer dialogues that are deeply rooted in “your” data.

So, what’s next? Once people and companies have tools to find and analyze all their stored data, they’re going to start digitizing everything. Customer calls, online and offline behavior patterns, competitor info, logs from every single meeting… You name it. Data volumes will start skyrocketing again, and no context window — no matter how big — will ever be able to capture it all.

And that’s when we’ll be heading into the next RAG evolution, which will need even more advanced techniques to keep up.

@advancedrag #blog
Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization

In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework, that integrates RAG with KG and a vector store (VS) that store factual domain specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, but via NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific: (i) KG (containing structured information), and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks.


An interesting study where a Knowledge Graph is used to refine search across a large corpus of documents.

Key features:
— The KG ontology is created using NLP libraries (without the use of LLMs);
— The ontology structure is fixed and reflects parameters of scientific articles (authors, publication year, Scopus category, affiliations, affiliation country, acronyms, publisher, topics, topic keywords, citations, references) and standard NLP tags (events, persons, locations, products, organizations, and geopolitical entities);
— The question examples provided in the article include keys from the listed ontology.

In my opinion, this study shows that when creating RAG systems, the problem of filtering relevant documents and chunks is critical. It’s not enough to rely solely on re-ranking algorithms like bi-/cross-encoders. A graph-based representation helps to organize data, and queries to the graph database allow selecting only the necessary content, which can then be re-ranked.
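The filter-then-rerank idea can be sketched with a toy example — the metadata filter stands in for a real graph-database query, the token-overlap score stands in for a cross-encoder, and all documents and field names are invented:

```python
# Toy sketch of the two-stage idea: first narrow candidates with a
# structured (KG-style) metadata query, then re-rank only the survivors.
# The metadata filter and overlap scorer are stand-ins for a real graph
# query and a cross-encoder; all data here is hypothetical.

docs = [
    {"id": 1, "year": 2023, "topic": "tensor factorization",
     "text": "nonnegative tensor factorization with model selection"},
    {"id": 2, "year": 2021, "topic": "vector stores",
     "text": "dense embeddings stored in a vector database"},
    {"id": 3, "year": 2023, "topic": "knowledge graphs",
     "text": "building a knowledge graph ontology without LLMs"},
]

def kg_filter(docs, **constraints):
    """Keep documents whose metadata satisfies every constraint."""
    return [d for d in docs
            if all(d.get(k) == v for k, v in constraints.items())]

def overlap_score(query: str, text: str) -> int:
    """Crude relevance proxy: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query: str, docs, **constraints):
    candidates = kg_filter(docs, **constraints)  # structured narrowing first
    return sorted(candidates,
                  key=lambda d: overlap_score(query, d["text"]),
                  reverse=True)                  # then re-rank the survivors

top = retrieve("knowledge graph ontology", docs, year=2023)
print([d["id"] for d in top])  # → [3, 1]
```

The point of the design is that the expensive re-ranking step only ever sees the small candidate set that survived the structured query.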

https://arxiv.org/pdf/2410.02721

@advancedrag #digest
LATE CHUNKING: CONTEXTUAL CHUNK EMBEDDINGS USING LONG-CONTEXT EMBEDDING MODELS

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called “late chunking”, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term “late” in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.


A rather interesting article about carrying contextual information from the surrounding text into each chunk's embedding. It is clearly noticeable how this method increases the inclusion of chunks with correct answers in the top 10 results. The overall improvement might seem small (~3.5%), but this is the essence of search — there is no "silver bullet" that will instantly provide perfect ranking; instead, there are many search factors that, when combined, improve the final ranking.

What's also interesting: when the chunk size is increased, the top-10 hit metrics do not change. This may suggest we are at the limit of simple bi-encoder solutions — even with all the contextual information gathered from surrounding chunks, vector similarity alone still struggles to surface the correct answer.
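The pooling step that gives late chunking its name can be sketched in a few lines — the token vectors below are dummies standing in for the final hidden states of a long-context embedding model:

```python
# Minimal sketch of the "late chunking" pooling step: instead of
# embedding each chunk in isolation, embed the whole document once
# (so every token vector carries document-wide context), and only then
# mean-pool over each chunk's token span. The token vectors below are
# made up; a real setup would take them from the model's hidden states.

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_vectors, boundaries):
    """Pool contextualized token vectors into per-chunk embeddings.

    boundaries: list of (start, end) token index pairs, end exclusive.
    """
    return [mean_pool(token_vectors[s:e]) for s, e in boundaries]

# Six "contextualized" token vectors of dimension 2 (dummy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0],
          [4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
chunks = late_chunk(tokens, boundaries=[(0, 3), (3, 6)])
print(chunks)  # → [[1.0, 1.0], [2.0, 2.0]]
```

Because chunking happens after the transformer pass, no extra training is needed — only the pooling order changes.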

PS: Some time ago I started following Jina; they are one of the most active companies sharing their open-weight embedding and reranker models on Hugging Face.

https://arxiv.org/pdf/2409.04701

@advancedrag #digest
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly

A great study: you can observe how the metrics degrade depending on the size of the context used.

Based on open datasets, the team built a tool for testing how different LLMs solve various tasks.

Performance degradation with longer inputs is category-dependent. Most frontier models largely retain performance on recall and RAG with longer inputs; however, even the best models significantly degrade with more contexts on tasks like re-ranking and generation with citations.


https://arxiv.org/pdf/2410.02694

@advancedrag #digest
Driving with Regulation: Interpretable Decision-Making for Autonomous Vehicles with Retrieval-Augmented Reasoning via LLM

This work presents an interpretable decision-making framework for autonomous vehicles that integrates traffic regulations, norms, and safety guidelines comprehensively and enables seamless adaptation to different regions. While traditional rule-based methods struggle to incorporate the full scope of traffic rules, we develop a Traffic Regulation Retrieval (TRR) Agent based on Retrieval-Augmented Generation (RAG) to automatically retrieve relevant traffic rules and guidelines from extensive regulation documents and relevant records based on the ego vehicle’s situation. Given the semantic complexity of the retrieved rules, we also design a reasoning module powered by a Large Language Model (LLM) to interpret these rules, differentiate between mandatory rules and safety guidelines, and assess actions on legal compliance and safety.
Additionally, the reasoning is designed to be interpretable, enhancing both transparency and reliability. The framework demonstrates robust performance on both hypothesized and real-world cases across diverse scenarios, along with the ability to adapt to different regions with ease.


A very unexpected yet, in hindsight, obvious application of RAG for edge cases. It’s a great idea to use an LLM to explain the reasons behind certain non-obvious actions, especially when it comes to various legal nuances.

https://arxiv.org/pdf/2410.04759

@advancedrag #digest
GARLIC: LLM-GUIDED DYNAMIC PROGRESS CONTROL WITH HIERARCHICAL WEIGHTED GRAPH FOR LONG DOCUMENT QA

In the past, Retrieval-Augmented Generation (RAG) methods split text into chunks to enable language models to handle long documents. Recent tree-based RAG methods are able to retrieve detailed information while preserving global context. However, with the advent of more powerful LLMs, such as Llama 3.1, which offer better comprehension and support for longer inputs, we found that even recent tree-based RAG methods perform worse than directly feeding the entire document into Llama 3.1, although RAG methods still hold an advantage in reducing computational costs. In this paper, we propose a new retrieval method, called LLM-Guided Dynamic Progress Control with Hierarchical Weighted Graph (GARLIC), which outperforms previous state-of-the-art baselines, including Llama 3.1, while retaining the computational efficiency of RAG methods. Our method introduces several improvements:
(1) Rather than using a tree structure, we construct a Hierarchical Weighted Directed Acyclic Graph with many-to-many summarization, where the graph edges are derived from attention mechanisms, and each node focuses on a single event or very few events.
(2) We introduce a novel retrieval method that leverages the attention weights of LLMs rather than dense embedding similarity. Our method allows for searching the graph along multiple paths and can terminate at any depth.
(3) We use the LLM to control the retrieval process, enabling it to dynamically adjust the amount and depth of information retrieved for different queries.
Experimental results show that our method outperforms previous state-of-the-art baselines, including Llama 3.1, on two single-document and two multi-document QA datasets, while maintaining similar computational complexity to traditional RAG methods.


An interesting approach not only to organizing data as a graph but also to the retrieval step, which is implemented as a graph traversal: attention weights guide the choice of the next node, and an LLM decides how deep the traversal should go.

This approach might be suitable for analyzing legal documents, though it would need to be validated on a relevant dataset.
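A rough sketch of that traversal with an invented graph: edge weights stand in for attention scores, and a depth-capped predicate stands in for the LLM that decides when to stop descending:

```python
# Sketch of the retrieval idea: walk a weighted summary DAG from the
# root, at each step following the highest-weight outgoing edge
# (standing in for attention weights), and let a controller decide
# when to stop (standing in for the LLM). Graph and weights are
# invented for illustration.

graph = {
    "root":      [("summary_a", 0.7), ("summary_b", 0.3)],
    "summary_a": [("event_1", 0.2), ("event_2", 0.8)],
    "summary_b": [("event_3", 1.0)],
    "event_1": [], "event_2": [], "event_3": [],
}

def should_stop(node: str, depth: int, max_depth: int = 2) -> bool:
    """Stand-in for the LLM controller: stop at leaves or a depth cap."""
    return not graph[node] or depth >= max_depth

def traverse(node: str = "root", depth: int = 0):
    """Greedy descent along the heaviest edge until the controller stops."""
    path = [node]
    while not should_stop(node, depth):
        node = max(graph[node], key=lambda edge: edge[1])[0]
        path.append(node)
        depth += 1
    return path

print(traverse())  # → ['root', 'summary_a', 'event_2']
```

In the actual paper the controller also adjusts how much information is collected per query; here the depth cap is the simplest possible stand-in for that decision.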

https://arxiv.org/pdf/2410.04790

@advancedrag #digest
LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies

Generating Natural Language Explanations (NLEs) for model predictions on medical images, particularly those depicting thoracic pathologies, remains a critical and challenging task. Existing methodologies often struggle due to general models’ insufficient domain-specific medical knowledge and privacy concerns associated with retrieval-based augmentation techniques. To address these issues, we propose a novel Vision-Language framework augmented with a Knowledge Graph (KG)-based datastore, which enhances the model’s understanding by incorporating additional domain-specific medical knowledge essential for generating accurate and informative NLEs. Our framework employs a KG-based retrieval mechanism that not only improves the precision of the generated explanations but also preserves data privacy by avoiding direct data retrieval. The KG datastore is designed as a plug-and-play module, allowing for seamless integration with various model architectures. We introduce and evaluate three distinct frameworks within this paradigm: KG-LLaVA, which integrates the pre-trained LLaVA model with KG-RAG; Med-XPT, a custom framework combining MedCLIP, a transformer-based projector, and GPT-2; and Bio-LLaVA, which adapts LLaVA by incorporating the Bio-ViT-L vision model. These frameworks are validated on the MIMIC-NLE dataset, where they achieve state-of-the-art results, underscoring the effectiveness of KG augmentation in generating high-quality NLEs for thoracic pathologies.


First, a RAG for self-driving cars; next, a RAG for analyzing medical images.

https://arxiv.org/pdf/2410.04749

@advancedrag #digest
Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation

On one hand, this work demonstrates how a question can be refined and the correct answer found in a more controlled manner, using a scoring function that depends on multiple parameters. This makes it possible to manage multiple iterations effectively.

On the other hand, it raises the same issues as other approaches where the results are re-validated by a model trained on publicly available data, which can be time-consuming and expensive.
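The retrieve-rethink-revise control flow can be sketched with dummy stand-ins for the LLM and retriever calls (the corpus, the verifier, and the query rewrite below are all hypothetical; the iteration structure is the point):

```python
# Schematic retrieve-rethink-revise loop: draft an answer from retrieved
# evidence, verify it, and revise the query if verification fails, up to
# a fixed iteration budget. Every function body is a dummy stand-in for
# an LLM or retriever call.

def retrieve(query: str) -> list[str]:
    corpus = {"capital France": ["Paris is the capital of France."]}
    return corpus.get(query, [])

def draft(query: str, evidence: list[str]) -> str:
    return evidence[0] if evidence else "no answer"

def verify(answer: str) -> bool:
    return answer != "no answer"   # dummy verification step

def revise(query: str) -> str:
    # dummy query rewrite standing in for the "rethinking" step
    return query.replace("capital of France", "capital France")

def answer_with_verification(query: str, max_iters: int = 3) -> str:
    for _ in range(max_iters):
        candidate = draft(query, retrieve(query))
        if verify(candidate):      # chain-of-verification gate
            return candidate
        query = revise(query)      # rethink: reformulate and retry
    return candidate

print(answer_with_verification("capital of France"))
```

Each extra iteration means another retrieval and another model call, which is exactly where the cost concern above comes from.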

https://arxiv.org/pdf/2410.05801

@advancedrag #digest
Is Semantic Chunking Worth the Computational Cost?

There is no "silver bullet" — no perfect chunking strategy. The document chunking format should be chosen based on the specific retrieval task. It's great that this research has emerged!

TL;DR

This article examines the balance between semantic chunking and fixed-size chunking in Retrieval-Augmented Generation (RAG) systems. While semantic chunking aims to improve document retrieval and answer generation by dividing documents into semantically coherent segments, the authors question if the additional computational costs are justified compared to the simpler fixed-size chunking method.

The study conducted various experiments on document retrieval, evidence retrieval, and answer generation tasks. Results show that while semantic chunking offers some benefits in synthetic datasets with high topic diversity, it doesn't consistently outperform fixed-size chunking in real-world scenarios. Fixed-size chunking is often more computationally efficient and performs better when documents don't exhibit significant topic diversity.

The authors conclude that semantic chunking's performance improvements are context-dependent and may not justify the increased computational overhead. They call for further exploration of more efficient chunking strategies for practical applications.
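For reference, the fixed-size baseline the paper compares against is almost trivial to implement — here a word-based sketch with overlap (real systems count tokens, not words):

```python
# Baseline fixed-size chunker with overlap: the cheap strategy the paper
# compares semantic chunking against. Sizes are in words for simplicity.

def fixed_size_chunks(text: str, size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into word windows of `size`, sliding by `size - overlap`."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, len(words), step)
            if words[i:i + size]]

doc = "one two three four five six seven eight nine ten"
for chunk in fixed_size_chunks(doc):
    print(chunk)
```

The first chunk is "one two three four five", the next starts at "four", and so on — no embedding calls, no boundary detection, which is exactly why the computational-cost comparison in the paper matters.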

https://arxiv.org/pdf/2410.13070

@advancedrag #digest
ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation

This method could be useful in complex tasks such as building comprehensive summaries (e.g., wikis) or handling intricate queries in business intelligence. The approach emphasizes depth and accuracy but could be slow, raising questions about how to optimize the process.

TL;DR

The ConTReGen paper introduces a novel method for open-domain long-form text generation. The problem it tackles involves generating responses to complex queries by breaking them down into smaller, manageable sub-tasks. The authors propose a tree-structured retrieval model where queries are decomposed into sub-queries in a top-down manner and then synthesized in a bottom-up process. This hierarchical method ensures that every facet of the query is explored in depth.

Key points include:
1. Tree-structured Approach: Instead of a linear sequence of sub-questions, the system uses a tree structure where each task can have further sub-tasks, ensuring a thorough exploration of all aspects.

2. Recursive Decomposition: The model recursively breaks down the original query and explores sub-facets until a stopping condition is reached, either a predefined depth or the determination that all questions have been answered.

3. Bottom-up Synthesis: After exploring the tree, results are synthesized bottom-up, where each node's results are combined to form a comprehensive answer to the original query.

The paper's results show potential, but the authors have not yet detailed how the model handles recursion depth or the decision to stop exploration. The idea is promising for handling large, complex problems, but there are questions about performance and potential optimizations.
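A toy version of the top-down decomposition and bottom-up synthesis, with a hard-coded decomposition table standing in for the LLM and a depth cap mirroring the open question of when to stop recursing:

```python
# Top-down: recursively split a query into sub-queries (the table below
# is a hypothetical decomposition an LLM might produce).
# Bottom-up: answer the leaves, then merge answers on the way back up.

SUBQUERIES = {
    "history of RAG": ["retrieval methods", "generation methods"],
    "retrieval methods": ["sparse retrieval", "dense retrieval"],
}

def answer_leaf(query: str) -> str:
    """Stand-in for retrieval + generation at a leaf node."""
    return f"<answer to '{query}'>"

def contregen(query: str, depth: int = 0, max_depth: int = 2) -> str:
    subs = SUBQUERIES.get(query, [])
    if not subs or depth >= max_depth:   # stopping condition
        return answer_leaf(query)
    # recurse into each sub-query, then synthesize the results
    parts = [contregen(q, depth + 1, max_depth) for q in subs]
    return " ".join(parts)

print(contregen("history of RAG"))
```

Even this toy makes the performance concern visible: the number of leaf calls grows with the branching factor and depth, so each level of decomposition multiplies the retrieval and generation work.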

https://arxiv.org/pdf/2410.15511

@advancedrag #digest