This post provides detailed steps and code for choosing the best embedding model and reranker model.
When building a Retrieval-Augmented Generation (RAG) pipeline, one of the key components is the retriever. We have various embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
However, among all these options, how do we determine the best combination for top retrieval performance? How do we know which embedding model is best suited for our data? Or which reranker can most enhance our results?
In this blog post, we will quickly determine the best combination of embedding and reranker models using the Retrieval Evaluation module from LlamaIndex. Let's get started!
First, let's understand the metrics available in Retrieval Evaluation.
Understanding Metrics in Retrieval Evaluation
To measure the efficiency of our retrieval system, we primarily rely on two widely accepted metrics: Hit Rate and Mean Reciprocal Rank (MRR). Let's dive into these metrics to understand their significance and how they work.
Hit Rate:
Hit Rate calculates the proportion of correct answers appearing in the top k retrieved documents for a query. Simply put, it concerns the frequency with which the correct answer appears in the first few guesses of our system.
Mean Reciprocal Rank (MRR):
For each query, MRR assesses the accuracy of the system by looking at the rank of the highest-placed relevant document. Specifically, it is the average of the reciprocals of these ranks across all queries. Thus, if the first relevant document is the top result, the reciprocal rank is 1; if it is second, the reciprocal rank is 1/2, and so on.
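As a quick illustration of both metrics, here is a toy sketch (the hit_rate_and_mrr helper and the document ids are made up for illustration and are not part of the original notebook):

def hit_rate_and_mrr(retrieved_ids_per_query, relevant_id_per_query):
    """Compute Hit Rate and MRR given ranked retrieval results.

    retrieved_ids_per_query: list of ranked doc-id lists (top-k per query)
    relevant_id_per_query:   list with the single relevant doc id per query
    """
    hits = 0
    reciprocal_ranks = []
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query):
        if relevant in retrieved:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved.index(relevant) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(relevant_id_per_query)
    return hits / n, sum(reciprocal_ranks) / n

# Toy example: 2 queries, top-3 retrieved ids each.
# Query 1 finds its relevant doc at rank 2; query 2 misses entirely.
hit_rate, mrr = hit_rate_and_mrr(
    retrieved_ids_per_query=[["d1", "d7", "d3"], ["d4", "d2", "d9"]],
    relevant_id_per_query=["d7", "d5"],
)
print(hit_rate, mrr)  # 0.5 and 0.25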
Now that we have established the scope and are familiar with these metrics, it’s time to dive into the experiments. For practical experience, you can also follow along using our Google Colab Notebook.
Setting Up the Environment
!pip install llama-index sentence-transformers cohere anthropic voyageai protobuf pypdf
Setting Up Keys
import openai

openai_api_key = 'YOUR OPENAI API KEY'
cohere_api_key = 'YOUR COHEREAI API KEY'
anthropic_api_key = 'YOUR ANTHROPIC API KEY'

openai.api_key = openai_api_key
Downloading Data
We are using the Llama2 paper for this experiment, so let's download it.
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "llama2.pdf"
Loading Data
Let’s load the data. We will use the content before page 36 for the experiment, which excludes the table of contents, references, and appendices.
Then, this data is parsed and converted into nodes that represent the data chunks we want to retrieve. We used 512 as the chunk size.
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader(input_files=["llama2.pdf"]).load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
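To keep only the content before page 36, one option (an illustrative sketch, assuming the PDF loader returns one Document object per page) is to slice the documents list between load_data() and get_nodes_from_documents():

# Keep only the content before page 36 (assumes one Document per PDF page);
# place this line after load_data() and before get_nodes_from_documents().
documents = documents[:36]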
Generating Question-Context Pairs
For evaluation purposes, we created a dataset of question-context pairs: a set of questions from our data along with their corresponding context. To eliminate bias in evaluating the embeddings (OpenAI/CohereAI) and rerankers (CohereAI), we used the Anthropic LLM to generate the question-context pairs.
Let's initialize a prompt template to generate the question-context pairs.
from llama_index.llms import Anthropic
from llama_index.evaluation import generate_question_context_pairs

# Prompt to generate questions
qa_generate_prompt_tmpl = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the below query.

You are a Professor. Your task is to setup {num_questions_per_chunk} questions
for an upcoming quiz/examination. The questions should be diverse in nature
across the document. The questions should not contain options, not start with Q1/Q2.
Restrict the questions to the context information provided.
"""

llm = Anthropic(api_key=anthropic_api_key)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
Next, a function to filter out sentences such as "Here are 2 questions based on provided context" from the generated questions:
from llama_index.finetuning.embeddings.common import EmbeddingQAFinetuneDataset

# function to clean the dataset
def filter_qa_dataset(qa_dataset):
    """
    Filters out queries from the qa_dataset that contain certain phrases and the
    corresponding entries in the relevant_docs, and creates a new
    EmbeddingQAFinetuneDataset object with the filtered data.

    :param qa_dataset: An object that has 'queries', 'corpus', and 'relevant_docs' attributes.
    :return: An EmbeddingQAFinetuneDataset object with the filtered queries, corpus and relevant_docs.
    """
    # Extract keys from queries and relevant_docs that need to be removed
    queries_relevant_docs_keys_to_remove = {
        k for k, v in qa_dataset.queries.items()
        if 'Here are 2' in v or 'Here are two' in v
    }

    # Filter queries and relevant_docs using dictionary comprehensions
    filtered_queries = {
        k: v for k, v in qa_dataset.queries.items()
        if k not in queries_relevant_docs_keys_to_remove
    }
    filtered_relevant_docs = {
        k: v for k, v in qa_dataset.relevant_docs.items()
        if k not in queries_relevant_docs_keys_to_remove
    }

    # Create a new instance of EmbeddingQAFinetuneDataset with the filtered data
    return EmbeddingQAFinetuneDataset(
        queries=filtered_queries,
        corpus=qa_dataset.corpus,
        relevant_docs=filtered_relevant_docs,
    )

# filter out pairs with phrases `Here are 2 questions based on provided context`
qa_dataset = filter_qa_dataset(qa_dataset)
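As an optional sanity check (not part of the original post), you can confirm how many question-context pairs survived the filtering; the dataset exposes queries, corpus, and relevant_docs as dictionaries:

# Inspect the filtered dataset: number of remaining queries and one example question.
print(f"{len(qa_dataset.queries)} queries after filtering")
print(next(iter(qa_dataset.queries.values())))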
Custom Retriever
To determine the optimal retriever, we adopted a combination of embedding models and rerankers. Initially, we established a basic VectorIndexRetriever. After retrieving nodes, we introduced a reranker to further optimize the results. Notably, in this specific experiment, we set similarity_top_k to 10 and selected the top 5 results from the reranker. However, you can freely adjust this parameter based on the needs of your specific experiment. Here, we showcase the code using OpenAIEmbedding; please refer to the notebook for the code for the other embedding models.
from typing import List

from llama_index import QueryBundle, ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.indices.query.schema import QueryType
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever
from llama_index.schema import NodeWithScore

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(llm=None, embed_model=embed_model)
vector_index = VectorStoreIndex(nodes, service_context=service_context)
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)

class CustomRetriever(BaseRetriever):
    """Custom retriever that performs vector search and optionally reranks the results."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
    ) -> None:
        """Init params."""
        self._vector_retriever = vector_retriever

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        retrieved_nodes = self._vector_retriever.retrieve(query_bundle)

        # `reranker` is set per experiment run: either a reranker instance
        # or the string 'None' for the no-reranker baseline.
        if reranker != 'None':
            retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
        else:
            retrieved_nodes = retrieved_nodes[:5]

        return retrieved_nodes

    async def _aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Asynchronously retrieve nodes given query.

        Implemented by the user.
        """
        return self._retrieve(query_bundle)

    async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        if isinstance(str_or_query_bundle, str):
            str_or_query_bundle = QueryBundle(str_or_query_bundle)
        return await self._aretrieve(str_or_query_bundle)

custom_retriever = CustomRetriever(vector_retriever)
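The reranker variable referenced above is set once per experiment run. Below is a sketch of one way to define the three reranker configurations; the RERANKERS dictionary is our own illustration, and the import paths assume the legacy llama_index package, so adjust them for your installed version:

from llama_index.postprocessor import SentenceTransformerRerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

RERANKERS = {
    "WithoutReranker": "None",  # the custom retriever then simply keeps the top 5 nodes
    "CohereRerank": CohereRerank(api_key=cohere_api_key, top_n=5),
    "bge-reranker-base": SentenceTransformerRerank(
        model="BAAI/bge-reranker-base", top_n=5
    ),
    "bge-reranker-large": SentenceTransformerRerank(
        model="BAAI/bge-reranker-large", top_n=5
    ),
}

# Select one configuration before building and evaluating the retriever, for example:
reranker = RERANKERS["CohereRerank"]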
Evaluation
To evaluate our retriever, we calculated the Mean Reciprocal Rank (MRR) and Hit Rate metrics:
from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=custom_retriever
)

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
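To turn the per-query results into the aggregate Hit Rate and MRR numbers reported below, the per-query metric values can be averaged. Here is a minimal sketch: the display_results helper and the pandas aggregation are our own, assuming each evaluation result exposes a metric_vals_dict of per-query scores (as the LlamaIndex retrieval evaluator results do):

import pandas as pd

def display_results(name, eval_results):
    """Average Hit Rate and MRR over all evaluated queries."""
    metric_dicts = [result.metric_vals_dict for result in eval_results]
    full_df = pd.DataFrame(metric_dicts)
    return pd.DataFrame(
        {
            "Retriever Name": [name],
            "Hit Rate": [full_df["hit_rate"].mean()],
            "MRR": [full_df["mrr"].mean()],
        }
    )

display_results("OpenAI Embedding + CohereRerank", eval_results)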
Results
We tested various embedding models and rerankers. Here are the models we considered:
Embedding Models:
- OpenAI Embedding
- Voyage Embedding
- CohereAI Embedding (v2.0/v3.0)
- Jina Embeddings (small/base)
- BAAI/bge-large-en
- Google PaLM Embedding
Rerankers:
- CohereAI
- bge-reranker-base
- bge-reranker-large
It is worth mentioning that these results provide performance insights for this specific dataset and task; actual results may vary with data characteristics, dataset size, and other variables (such as chunk_size, similarity_top_k, etc.).
The table below displays the evaluation results based on Hit Rate and Mean Reciprocal Rank (MRR) metrics:
[Table: Hit Rate and MRR for each embedding model and reranker combination]
Analysis:
Performance by Embedding Model:
- OpenAI: Demonstrated top-tier performance, especially when combined with CohereRerank (Hit Rate 0.926966, MRR 0.86573) and bge-reranker-large (Hit Rate 0.910112, MRR 0.855805), indicating strong compatibility with reranking tools.
- bge-large: Experienced significant performance improvements when using rerankers, with the best results coming from CohereRerank (Hit Rate 0.876404, MRR 0.822753).
- llm-embedder: Benefited greatly from reranking, especially when combined with CohereRerank (Hit Rate 0.882022, MRR 0.830243), which provided a significant performance boost.
- Cohere: The latest v3.0 Cohere embeddings outperformed v2.0 and significantly improved their metrics after integrating with the native CohereRerank, achieving a Hit Rate of 0.88764 and an MRR of 0.836049.
- Voyage: Showed strong initial performance, further enhanced by CohereRerank (Hit Rate 0.91573, MRR 0.851217), indicating high responsiveness to reranking.
- JinaAI: Very strong performance, with significant gains seen when using bge-reranker-large (Hit Rate 0.938202, MRR 0.868539) and CohereRerank (Hit Rate 0.932584, MRR 0.873689), indicating that reranking significantly improved its performance.
- Google-PaLM: Demonstrated strong performance, with measurable gains when using CohereRerank (Hit Rate 0.910112, MRR 0.855712), indicating that reranking provided a clear enhancement to its overall results.
Impact of Rerankers:
- No Reranker: This provided the baseline performance for each embedding model.
- bge-reranker-base: Generally improved the Hit Rate and MRR of all embedding models.
- bge-reranker-large: Often provided the highest or near-highest MRR for the embedding models; for several embeddings, its performance was comparable to or surpassed that of CohereRerank.
- CohereRerank: Consistently enhanced performance across all embedding models, often providing the best or near-best results.
Necessity of Rerankers:
- The data clearly indicates the importance of rerankers in optimizing search results. Almost all embedding models benefited from reranking, showing improved Hit Rate and MRR values.
- In particular, CohereRerank has proven its ability to turn any embedding model into a competitive one.
Overall Advantages:
- When considering both Hit Rate and MRR, the combinations of OpenAI + CohereRerank and JinaAI-Base + bge-reranker-large/CohereRerank stand out as top contenders.
- However, the consistent improvements brought by the CohereRerank and bge-reranker-large rerankers across different embedding models make them outstanding choices for enhancing search quality, regardless of the embedding model used.
In summary, to achieve optimal performance in Hit Rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large rerankers stands out.
Please note that our benchmarking aims to provide a reproducible script for your own data. Nevertheless, treat these numbers as estimates and exercise caution when interpreting them.
Conclusion:
In this blog post, we demonstrated how to evaluate and enhance the performance of retrievers using different embedding models and rerankers. Here are our final conclusions.
- Embedding Models: The OpenAI and JinaAI-Base embedding models, especially when paired with the CohereRerank/bge-reranker-large rerankers, set the gold standard for Hit Rate and MRR.
- Rerankers: The impact of rerankers, especially CohereRerank/bge-reranker-large, cannot be overstated. They play a crucial role in improving the MRR of many embedding models, demonstrating their importance in making search results better.
- Foundation is Key: Choosing the right embedding model for the initial retrieval is crucial; even the best rerankers won't help much if the base search results are poor.
- Working Together: To get the best out of the retriever, it's important to find the right combination of embedding model and reranker. This study highlights the importance of careful testing to find the best pairing.
These conclusions emphasize the importance of selecting embedding models and rerankers when building efficient retrieval systems, and how they work together to provide the best search results.
Original: https://www.llamaindex.ai/blog/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83
Author: Ravi Theja
Edit / Fan Ruiqiang
Review / Fan Ruiqiang
Verification / Fan Ruiqiang