XRAG provides a comprehensive RAG evaluation benchmark and toolkit, covering more than 50 test metrics for thoroughly evaluating and optimizing RAG failure points. It supports comparison across four types of advanced RAG modules (query rewriting, advanced retrieval, question-answering models, and post-processing), integrates a variety of concrete implementations within each module, and supports the OpenAI large-model API. XRAG version 1.0 also provides a simple Web UI demo, lightweight data upload in a unified standard format, and integrated methods for detecting and optimizing RAG failure points. The paper and code have been open-sourced and released.


Paper Title: XRAG: eXamining the Core – Benchmarking Foundational Component Modules in Advanced Retrieval-Augmented Generation
Author Institutions: Beihang University, ZGCLAB
Paper Link: https://arxiv.org/abs/2412.15529
Project Link: https://github.com/DocAILab/XRAG
Project Highlights
1. XRAG-Ollama Localized Retrieval Inference Framework
XRAG: A Flexible and Scalable RAG Framework
XRAG is a comprehensive and customizable RAG framework that integrates core components such as query rewriting, advanced retrieval, question-answering models, and post-processing through a component-based, modular design. It provides over 50 test metrics, a range of advanced algorithms, and efficient data preprocessing scripts, simplifying model testing and validation in complex RAG scenarios and allowing users to easily implement and optimize their own RAG systems.
Ollama: An Efficient Local Inference Engine
Ollama, as a lightweight inference framework focused on local deployment, significantly enhances the inference efficiency of large language models in heterogeneous computing environments through techniques such as hardware acceleration, quantization, and attention mechanism optimization. Its modular design supports seamless integration with vector databases (such as ChromaDB), providing high-performance local inference capabilities for building RAG systems, particularly suitable for scenarios requiring quick responses and strict data privacy requirements.
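As a minimal sketch of this kind of integration (not part of XRAG itself; the model names and collection name are placeholders), Ollama-generated embeddings can be stored and queried in a local ChromaDB instance:

```python
# Minimal sketch: a local RAG index built from Ollama embeddings + ChromaDB.
# Assumes the Ollama service is running locally and the chosen models are pulled.
import ollama
import chromadb

documents = [
    "XRAG benchmarks core RAG components such as retrievers and rerankers.",
    "Ollama runs large language models locally with quantization support.",
]

client = chromadb.Client()                          # in-memory ChromaDB instance
collection = client.create_collection("local_rag_demo")

# Embed each document locally and add it to the vector store.
for i, doc in enumerate(documents):
    emb = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
    collection.add(ids=[str(i)], embeddings=[emb], documents=[doc])

# Retrieve the most relevant document for a query, then answer with a local LLM.
query = "Which framework evaluates RAG components?"
q_emb = ollama.embeddings(model="nomic-embed-text", prompt=query)["embedding"]
hit = collection.query(query_embeddings=[q_emb], n_results=1)["documents"][0][0]

answer = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": f"Context: {hit}\n\nQuestion: {query}"}],
)
print(answer["message"]["content"])
```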
XRAG-Ollama Localized Retrieval Inference Framework: A Perfect Combination of Retrieval-Augmented Generation and Efficient Inference
The XRAG-Ollama localized retrieval inference framework combines XRAG's retrieval-augmented generation capabilities with Ollama's efficient local inference, delivering a more accurate and faster user experience. Built on XRAG's modular design and Ollama's localized inference optimizations, the framework achieves significant performance improvements over traditional inference methods. By pairing retrieval-augmented knowledge with efficient inference techniques, XRAG-Ollama provides an agile, responsive experience with more accurate outputs, meeting users' demands for a high-performance RAG system.
2. Ollama Framework: Supporting XRAG in Achieving Efficient Localized Retrieval Inference
In the XRAG-Ollama localized retrieval inference framework, Ollama plays a crucial role. As an open-source, user-friendly local large model execution framework, Ollama provides XRAG with powerful localized retrieval inference capabilities, allowing XRAG to fully leverage its retrieval-augmented generation advantages.
Why Localize Deployment of XRAG?
- Reduce External Risks: Using local deployment can reduce reliance on external services, lowering potential risks from third-party service instability or data breaches.
- Offline Availability: Localized RAG systems do not rely on internet connections and can operate normally even during network interruptions, ensuring service continuity and stability.
- Data Self-Management: Local deployment allows users to have complete control over data storage, management, and processing, such as embedding private data into local vector databases, ensuring data handling complies with the enterprise's own security standards and business requirements.
- Data Privacy and Security: Running RAG systems in a local environment avoids the risk of sensitive data leaking through network transmission and keeps data under local control. This is especially important for enterprises handling confidential information.
Why Choose Ollama?
Ollama is a lightweight, scalable framework for building and running large language models (LLMs) on a local machine. It provides a simple API for creating, running, and managing models, along with a library of pre-built models that can easily be used in a variety of applications. It supports many models, such as DeepSeek, Llama 3.3, Phi 3, Mistral, and Gemma 2, and can leverage modern hardware acceleration to provide high-performance inference support for XRAG. Ollama also supports model quantization, which significantly reduces memory requirements: 4-bit quantization, for example, compresses FP16 weight parameters to 4-bit integer precision, greatly shrinking the model weight size and the memory needed for inference, and making it possible to run large models on an ordinary home computer.
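As a rough back-of-the-envelope illustration of the memory savings (the 8-billion-parameter count is just an assumed example, and real quantized files include some extra overhead):

```python
# Rough memory estimate for model weights at different precisions.
params = 8e9                      # e.g., an ~8B-parameter model (assumed example)

fp16_gb = params * 2 / 1024**3    # FP16: 2 bytes per parameter
int4_gb = params * 0.5 / 1024**3  # 4-bit: 0.5 bytes per parameter

print(f"FP16 weights : {fp16_gb:.1f} GB")   # ~14.9 GB
print(f"4-bit weights: {int4_gb:.1f} GB")   # ~3.7 GB
```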
By combining with Ollama, XRAG can run large language models efficiently on local hardware without complex environment configuration or extensive computing resources, significantly lowering deployment and operating costs. Local deployment also gives developers full control over data processing, supporting end-to-end customization from raw data cleaning and vectorization (for example, building a private knowledge base with ChromaDB) to the final application. Because the whole stack runs on local infrastructure, it can operate offline, ensuring service continuity and meeting the strict reliability requirements of special environments such as confidential networks.
Some models are available for direct download from Ollama's online model library (the model table is omitted here).

Several localized model and GPU adaptation options are also available for reference when deploying XRAG + Ollama locally (the adaptation table is omitted here).
Installation and Usage
1. Installing and Using Ollama
Before starting to use Ollama, ensure that Docker is installed on your computer, or that your environment can run .exe installers directly. Docker is an open-source application container engine that makes deploying applications easier; if you have not installed it yet, you can visit the Docker official website to download and install the version for your operating system.
Downloading Ollama
Ollama offers various installation methods, including Docker images and direct .exe installation packages. To install directly, download the .exe installation package for your operating system from Ollama's official website or GitHub repository. When `ollama --version` runs successfully, Ollama has been installed, and you can then use the `pull` command to download models from the online model library.
Pulling and Running Models
Whether installed via Docker or the .exe package, you can use Ollama's command-line tools to pull and run models. For example, running `ollama run llama3.1` will download the llama3.1 model from Ollama's model library (if it has not been pulled yet) and run it locally, providing strong inference support for XRAG.
Installation Success Test
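For example, one minimal way to verify the setup (a sketch, not an official XRAG test script) is to query Ollama's local HTTP endpoint and confirm that the service responds and the pulled model is listed:

```python
# Quick sanity check that the local Ollama service is up and a model is available.
import requests

resp = requests.get("http://localhost:11434/api/tags")  # Ollama's default local port
resp.raise_for_status()

models = [m["name"] for m in resp.json().get("models", [])]
print("Locally available models:", models)
assert any(name.startswith("llama3.1") for name in models), "llama3.1 not pulled yet"
```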
Through the above steps, XRAG can leverage Ollama to achieve efficient localized retrieval inference, providing users with a more accurate and faster RAG system experience.

2. Installing XRAG
- Create a virtual environment using Conda.
- Clone the code locally and configure the environment.
- Try starting XRAG and check its output.

If XRAG starts successfully, the environment configuration is basically complete. The first use may require entering an email address.
3. Interactive Use of XRAG-Ollama Framework
Start XRAG and Access the Web Page

As shown on the page, we have collected and preprocessed three benchmark datasets for the XRAG framework. In addition, we have developed a unified dataset structure to facilitate performance testing of retrieval and generation modules and provide a standardized format. You can load your customized dataset into the system by uploading a JSON file in the specified format.
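For illustration only, a custom upload might be prepared as in the sketch below; the field names ("question", "answer", "context") are hypothetical placeholders, and the authoritative schema is the one specified by XRAG's documentation:

```python
# Hypothetical example of preparing a custom dataset file for upload.
# Field names are illustrative only; follow the format specified by XRAG.
import json

custom_dataset = [
    {
        "question": "Who proposed the Transformer architecture?",
        "answer": "Vaswani et al., in the 2017 paper 'Attention Is All You Need'.",
        "context": ["The Transformer was introduced in 'Attention Is All You Need' (2017)."],
    },
]

with open("my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(custom_dataset, f, ensure_ascii=False, indent=2)
```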
Select One of the Datasets and Click Load Dataset to Start Loading Data, Taking Drop as an Example

Loading Data Takes Some Time; Once Completed, Enter the Configuration Stage

The main items to configure at this stage are the generative model and the encoding (embedding) model.
- For the encoding model: XRAG supports the BGE series and the embedding models available through the Huggingface library.
- For the generative model: XRAG supports a variety of models, including:
  - OpenAI series models: XRAG supports OpenAI models and any model served behind an OpenAI-compatible API; just fill in the correct API Key and API Base to use them (a minimal sketch follows this list).
  - Ollama local large models: XRAG also supports locally deployed models; simply install Ollama locally to call models offline and fully leverage the capabilities of the XRAG framework.
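As a minimal sketch of what OpenAI-compatible means in practice (the endpoint URL and model name are illustrative, not XRAG configuration keys), the same client can target either the official OpenAI API or a locally running Ollama server, which exposes an OpenAI-compatible endpoint:

```python
# Minimal sketch: one OpenAI-style client, two possible backends.
from openai import OpenAI

# Remote: the official OpenAI API (requires a valid API key).
# client = OpenAI(api_key="sk-...", base_url="https://api.openai.com/v1")

# Local: Ollama's OpenAI-compatible endpoint (an api_key is required but ignored).
client = OpenAI(api_key="ollama", base_url="http://localhost:11434/v1")

reply = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What does RAG stand for?"}],
)
print(reply.choices[0].message.content)
```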
Next, Configure the Retrieval Part and Prompts
The main content to configure in this section includes:
- Advanced Retriever Methods
- Pre-Retriever Methods
- Postprocess Methods
- Text QA Template and Refine Template
The Text QA Template and Refine Template preset the common prompts needed for question-answering tasks.
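For reference, such templates typically look like the following simplified illustration (in the spirit of LlamaIndex-style defaults, not necessarily the exact strings XRAG ships with):

```python
# Illustrative QA and refine templates (simplified; not XRAG's exact defaults).
TEXT_QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

REFINE_TEMPLATE = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "Refine the existing answer (only if needed) with the context below.\n"
    "---------------------\n"
    "{context_msg}\n"
    "---------------------\n"
    "Refined Answer: "
)
```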
Pre-Retriever Methods: Before retrieval, the pre-retrieval component optimizes user queries, improving the quality and relevance of the retrieved information. The main methods include query expansion, which broadens the query to enrich the context available for answer generation.
Hypothetical Document Embedding (HyDE): transforms the original query into a hypothetical document that is closer in form to the indexed documents, improving retrieval consistency and efficiency.
Chain-of-Verification (CoVe): executes a verification plan to further refine the system response into an enhanced, verified response.
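To make the HyDE idea above concrete, here is a minimal sketch that embeds a generated hypothetical answer instead of the raw query; it uses the Ollama Python client purely for illustration, and the model names are placeholders:

```python
# Minimal HyDE sketch: embed a hypothetical answer rather than the raw query,
# then score corpus passages by cosine similarity against that embedding.
import numpy as np
import ollama

def embed(text: str) -> np.ndarray:
    return np.array(ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"])

def hyde_retrieve(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    # 1) Ask an LLM to write a hypothetical passage that would answer the query.
    hypo = ollama.generate(
        model="llama3.1",
        prompt=f"Write a short passage that answers the question: {query}",
    )["response"]
    # 2) Embed the hypothetical passage and rank real passages against it.
    q_vec = embed(hypo)
    scored = []
    for passage in passages:
        p_vec = embed(passage)
        sim = float(q_vec @ p_vec / (np.linalg.norm(q_vec) * np.linalg.norm(p_vec)))
        scored.append((sim, passage))
    return [p for _, p in sorted(scored, reverse=True)[:top_k]]
```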
Advanced Retriever Methods: We modularized the standard advanced methods implemented in LlamaIndex. For example, we rank documents in the LexicalBM25 retriever based on the frequency and rarity of query terms in the corpus.
The Simple Fusion Retriever (SQFusion) enhances queries by generating relevant sub-queries and returns the top-k nodes across all queries and indices. RRFusion merges the vector-index retriever with a BM25-based retriever, capturing both semantic relationships and keyword relevance.
SentenceWindow Retriever parses documents into individual sentences as leaf nodes, merging surrounding sentences when retrieving leaf nodes to increase context.
RecursiveChunk Retriever traverses node relationships to retrieve nodes based on references.
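The core idea behind rank-based fusion such as RRFusion can be illustrated with a small standalone sketch of generic reciprocal rank fusion (independent of XRAG's actual implementation):

```python
# Generic reciprocal rank fusion (RRF): combine rankings from a dense retriever
# and a BM25 retriever so that documents ranked highly by either list win.
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc3", "doc1", "doc7"]   # e.g., from the vector index
bm25_ranking  = ["doc1", "doc9", "doc3"]   # e.g., from the BM25 retriever
print(reciprocal_rank_fusion([dense_ranking, bm25_ranking]))
# doc1 and doc3 rise to the top because both retrievers rank them well.
```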
Postprocess Methods: Post-processors aim to transform and filter returned nodes, further improving retrieval accuracy and efficiency.
XRAG integrates rerankers that improve the accuracy of relevance assessment by using context-understanding models rather than pure embedding matching. Using Huggingface and transformers, we integrate the BGE-base reranker, which processes the question and document jointly through a Cross-Encoder model and directly outputs a similarity score. The ColBERT reranker uses multi-vector representations for fine-grained query-document matching. Additionally, LongContextReorder repositions high-scoring nodes to the top and bottom of the list, speeding up the identification of relevant information.
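A cross-encoder reranking step of this kind can be sketched with the sentence-transformers library; the checkpoint name below is the public BGE reranker, and XRAG's own integration may differ in detail:

```python
# Minimal cross-encoder reranking sketch with a BGE reranker checkpoint:
# the model reads (query, document) pairs jointly and outputs relevance scores.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-base")

query = "What does XRAG benchmark?"
candidates = [
    "XRAG benchmarks foundational component modules in advanced RAG systems.",
    "Ollama is a lightweight framework for running LLMs locally.",
    "ChromaDB is an open-source vector database.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])  # the XRAG description should score highest
```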

After confirming that the configuration is complete, click OK to start building the overall pipeline, which may take some time. You can then test your RAG system with individual questions.


