This post provides detailed steps and code for choosing the best embedding model and reranker model.
When building a Retrieval-Augmented Generation (RAG) pipeline, one of the key components is the retriever. We have various embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
However, among all these options, how do we determine the best combination for top retrieval performance? How do we know which embedding model is best suited for our data? Or which reranker can most enhance our results?
In this blog post, we will quickly determine the best combination of embedding and reranker models using the Retrieval Evaluation module from LlamaIndex. Let's get started!
First, let's understand the metrics available in Retrieval Evaluation.
Understanding Metrics in Retrieval Evaluation
To measure the efficiency of our retrieval system, we primarily rely on two widely accepted metrics: Hit Rate and Mean Reciprocal Rank (MRR). Let's dive into these metrics to understand their significance and how they work.
Hit Rate:
Hit Rate calculates the proportion of correct answers appearing in the top k retrieved documents for a query. Simply put, it concerns the frequency with which the correct answer appears in the first few guesses of our system.
Mean Reciprocal Rank (MRR):
For each query, MRR assesses the accuracy of the system by looking at the rank of the highest-placed relevant document. Specifically, it is the average of the reciprocals of these ranks across all queries. Thus, if the first relevant document is the top result, the reciprocal rank is 1; if it is second, the reciprocal rank is 1/2, and so on.
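As a quick illustration of both metrics, here is a toy sketch (the hit_rate_and_mrr helper and the document ids are made up for illustration and are not part of the original notebook):

def hit_rate_and_mrr(retrieved_ids_per_query, relevant_id_per_query):
    """Compute Hit Rate and MRR given ranked retrieval results.

    retrieved_ids_per_query: list of ranked doc-id lists (top-k per query)
    relevant_id_per_query:   list with the single relevant doc id per query
    """
    hits = 0
    reciprocal_ranks = []
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query):
        if relevant in retrieved:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved.index(relevant) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(relevant_id_per_query)
    return hits / n, sum(reciprocal_ranks) / n

# Toy example: 2 queries, top-3 retrieved ids each.
# Query 1 finds its relevant doc at rank 2; query 2 misses entirely.
hit_rate, mrr = hit_rate_and_mrr(
    retrieved_ids_per_query=[["d1", "d7", "d3"], ["d4", "d2", "d9"]],
    relevant_id_per_query=["d7", "d5"],
)
print(hit_rate, mrr)  # 0.5 and 0.25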
Now that we have established the scope and are familiar with these metrics, it’s time to dive into the experiments. For practical experience, you can also follow along using our Google Colab Notebook.
Setting Up the Environment
!pip install llama-index sentence-transformers cohere anthropic voyageai protobuf pypdf
Setting Up Keys
import openai

openai_api_key = 'YOUR OPENAI API KEY'
cohere_api_key = 'YOUR COHEREAI API KEY'
anthropic_api_key = 'YOUR ANTHROPIC API KEY'

openai.api_key = openai_api_key
Downloading Data
We are using the Llama2 paper for this experiment, so let's download it.
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "llama2.pdf"
Loading Data
Let’s load the data. We will use the content before page 36 for the experiment, which excludes the table of contents, references, and appendices.
Then, this data is parsed and converted into nodes that represent the data chunks we want to retrieve. We used 512 as the chunk size.
from llama_index import SimpleDirectoryReader
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader(input_files=["llama2.pdf"]).load_data()

node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
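To keep only the content before page 36, one option (an illustrative sketch, assuming the PDF loader returns one Document object per page) is to slice the documents list between load_data() and get_nodes_from_documents():

# Keep only the content before page 36 (assumes one Document per PDF page);
# place this line after load_data() and before get_nodes_from_documents().
documents = documents[:36]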
Generating Question-Context Pairs
For evaluation purposes, we created a dataset of question-context pairs: a set of questions from our data along with their corresponding context. To eliminate bias in evaluating the embeddings (OpenAI/CohereAI) and rerankers (CohereAI), we used the Anthropic LLM to generate the question-context pairs.
Let's initialize a prompt template to generate the question-context pairs.
from llama_index.llms import Anthropic
from llama_index.evaluation import generate_question_context_pairs

# Prompt to generate questions
qa_generate_prompt_tmpl = """\
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the below query.

You are a Professor. Your task is to setup {num_questions_per_chunk} questions
for an upcoming quiz/examination. The questions should be diverse in nature
across the document. The questions should not contain options, not start with Q1/Q2.
Restrict the questions to the context information provided.
"""

llm = Anthropic(api_key=anthropic_api_key)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
Next, a function to filter out sentences such as "Here are 2 questions based on provided context" from the generated questions:
from llama_index.finetuning.embeddings.common import EmbeddingQAFinetuneDataset

# function to clean the dataset
def filter_qa_dataset(qa_dataset):
    """
    Filters out queries from the qa_dataset that contain certain phrases and the
    corresponding entries in the relevant_docs, and creates a new
    EmbeddingQAFinetuneDataset object with the filtered data.

    :param qa_dataset: An object that has 'queries', 'corpus', and 'relevant_docs' attributes.
    :return: An EmbeddingQAFinetuneDataset object with the filtered queries, corpus and relevant_docs.
    """
    # Extract keys from queries and relevant_docs that need to be removed
    queries_relevant_docs_keys_to_remove = {
        k for k, v in qa_dataset.queries.items()
        if 'Here are 2' in v or 'Here are two' in v
    }

    # Filter queries and relevant_docs using dictionary comprehensions
    filtered_queries = {
        k: v for k, v in qa_dataset.queries.items()
        if k not in queries_relevant_docs_keys_to_remove
    }
    filtered_relevant_docs = {
        k: v for k, v in qa_dataset.relevant_docs.items()
        if k not in queries_relevant_docs_keys_to_remove
    }

    # Create a new instance of EmbeddingQAFinetuneDataset with the filtered data
    return EmbeddingQAFinetuneDataset(
        queries=filtered_queries,
        corpus=qa_dataset.corpus,
        relevant_docs=filtered_relevant_docs,
    )

# filter out pairs with phrases `Here are 2 questions based on provided context`
qa_dataset = filter_qa_dataset(qa_dataset)
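As an optional sanity check (not part of the original post), you can confirm how many question-context pairs survived the filtering; the dataset exposes queries, corpus, and relevant_docs as dictionaries:

# Inspect the filtered dataset: number of remaining queries and one example question.
print(f"{len(qa_dataset.queries)} queries after filtering")
print(next(iter(qa_dataset.queries.values())))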
Custom Retriever
To determine the optimal retriever, we adopted a combination of embedding models and rerankers. Initially, we established a basic VectorIndexRetriever. After retrieving nodes, we introduced a reranker to further optimize the results. Notably, in this specific experiment, we set similarity_top_k to 10 and selected the top 5 results from the reranker. However, you can freely adjust this parameter based on the needs of your specific experiment. Here, we showcase the code using OpenAIEmbedding; please refer to the notebook for the code for the other embedding models.
from typing import List

from llama_index import QueryBundle, ServiceContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.indices.query.schema import QueryType
from llama_index.retrievers import BaseRetriever, VectorIndexRetriever
from llama_index.schema import NodeWithScore

embed_model = OpenAIEmbedding()
service_context = ServiceContext.from_defaults(llm=None, embed_model=embed_model)
vector_index = VectorStoreIndex(nodes, service_context=service_context)
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=10)

class CustomRetriever(BaseRetriever):
    """Custom retriever that performs vector search and optionally reranks the results."""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
    ) -> None:
        """Init params."""
        self._vector_retriever = vector_retriever

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        retrieved_nodes = self._vector_retriever.retrieve(query_bundle)

        # `reranker` is set per experiment run: either a reranker instance
        # or the string 'None' for the no-reranker baseline.
        if reranker != 'None':
            retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
        else:
            retrieved_nodes = retrieved_nodes[:5]

        return retrieved_nodes

    async def _aretrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Asynchronously retrieve nodes given query.

        Implemented by the user.
        """
        return self._retrieve(query_bundle)

    async def aretrieve(self, str_or_query_bundle: QueryType) -> List[NodeWithScore]:
        if isinstance(str_or_query_bundle, str):
            str_or_query_bundle = QueryBundle(str_or_query_bundle)
        return await self._aretrieve(str_or_query_bundle)

custom_retriever = CustomRetriever(vector_retriever)
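The reranker variable referenced above is set once per experiment run. Below is a sketch of one way to define the three reranker configurations; the RERANKERS dictionary is our own illustration, and the import paths assume the legacy llama_index package, so adjust them for your installed version:

from llama_index.postprocessor import SentenceTransformerRerank
from llama_index.postprocessor.cohere_rerank import CohereRerank

RERANKERS = {
    "WithoutReranker": "None",  # the custom retriever then simply keeps the top 5 nodes
    "CohereRerank": CohereRerank(api_key=cohere_api_key, top_n=5),
    "bge-reranker-base": SentenceTransformerRerank(
        model="BAAI/bge-reranker-base", top_n=5
    ),
    "bge-reranker-large": SentenceTransformerRerank(
        model="BAAI/bge-reranker-large", top_n=5
    ),
}

# Select one configuration before building and evaluating the retriever, for example:
reranker = RERANKERS["CohereRerank"]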
Evaluation
To evaluate our retriever, we calculated the Mean Reciprocal Rank (MRR) and Hit Rate metrics:
from llama_index.evaluation import RetrieverEvaluator

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=custom_retriever
)

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
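To turn the per-query results into the aggregate Hit Rate and MRR numbers reported below, the per-query metric values can be averaged. Here is a minimal sketch: the display_results helper and the pandas aggregation are our own, assuming each evaluation result exposes a metric_vals_dict of per-query scores (as the LlamaIndex retrieval evaluator results do):

import pandas as pd

def display_results(name, eval_results):
    """Average Hit Rate and MRR over all evaluated queries."""
    metric_dicts = [result.metric_vals_dict for result in eval_results]
    full_df = pd.DataFrame(metric_dicts)
    return pd.DataFrame(
        {
            "Retriever Name": [name],
            "Hit Rate": [full_df["hit_rate"].mean()],
            "MRR": [full_df["mrr"].mean()],
        }
    )

display_results("OpenAI Embedding + CohereRerank", eval_results)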
Results
We tested various embedding models and rerankers. Here are the models we considered:
Embedding Models:
- OpenAI Embedding
- Voyage Embedding
- CohereAI Embedding (v2.0/v3.0)
- Jina Embeddings (small/base)
- BAAI/bge-large-en
- Google PaLM Embedding
Rerankers:
- CohereAI
- bge-reranker-base
- bge-reranker-large
It is worth mentioning that these results provide performance insights for this specific dataset and task; actual results may vary with data characteristics, dataset size, and other variables (such as chunk_size, similarity_top_k, etc.).
The table below displays the evaluation results based on Hit Rate and Mean Reciprocal Rank (MRR) metrics:
[Table: Hit Rate and MRR for each embedding model and reranker combination]
Analysis:
Performance by Embedding Model:
- OpenAI: Demonstrated top-tier performance, especially when combined with CohereRerank (Hit Rate 0.926966, MRR 0.86573) and bge-reranker-large (Hit Rate 0.910112, MRR 0.855805), indicating strong compatibility with reranking tools.
- bge-large: Experienced significant performance improvements when using rerankers, with the best results coming from CohereRerank (Hit Rate 0.876404, MRR 0.822753).
- llm-embedder: Benefited greatly from reranking, especially when combined with CohereRerank (Hit Rate 0.882022, MRR 0.830243), which provided a significant performance boost.
- Cohere: The latest v3.0 Cohere embeddings outperformed v2.0 and significantly improved their metrics after integrating with the native CohereRerank, achieving a Hit Rate of 0.88764 and an MRR of 0.836049.
- Voyage: Showed strong initial performance, further enhanced by CohereRerank (Hit Rate 0.91573, MRR 0.851217), indicating high responsiveness to reranking.
- JinaAI: Very strong performance, with significant gains seen when using bge-reranker-large (Hit Rate 0.938202, MRR 0.868539) and CohereRerank (Hit Rate 0.932584, MRR 0.873689), indicating that reranking significantly improved its performance.
- Google-PaLM: Demonstrated strong performance, with measurable gains when using CohereRerank (Hit Rate 0.910112, MRR 0.855712), indicating that reranking provided a clear enhancement to its overall results.
Impact of Rerankers:
- No Reranker: This provided the baseline performance for each embedding model.
- bge-reranker-base: Generally improved the Hit Rate and MRR of all embedding models.
- bge-reranker-large: Often provided the highest or near-highest MRR for the embedding models; for several embeddings, its performance was comparable to or surpassed that of CohereRerank.
- CohereRerank: Consistently enhanced performance across all embedding models, often providing the best or near-best results.
Necessity of Rerankers:
- The data clearly indicates the importance of rerankers in optimizing search results. Almost all embedding models benefited from reranking, showing improved Hit Rate and MRR values.
- In particular, CohereRerank has proven its ability to turn any embedding model into a competitive one.
Overall Advantages:
- When considering both Hit Rate and MRR, the combinations of OpenAI + CohereRerank and JinaAI-Base + bge-reranker-large/CohereRerank stand out as top contenders.
- However, the consistent improvements brought by the CohereRerank and bge-reranker-large rerankers across different embedding models make them outstanding choices for enhancing search quality, regardless of the embedding model used.
In summary, to achieve optimal performance in Hit Rate and MRR, the combination of OpenAI or JinaAI-Base embeddings with the CohereRerank/bge-reranker-large rerankers stands out.
Please note that our benchmarking aims to provide a reproducible script for your own data. Nevertheless, treat these numbers as estimates and exercise caution when interpreting them.
Conclusion:
In this blog post, we demonstrated how to evaluate and enhance the performance of retrievers using different embedding models and rerankers. Here are our final conclusions.
- Embedding Models: The OpenAI and JinaAI-Base embedding models, especially when paired with the CohereRerank/bge-reranker-large rerankers, set the gold standard for Hit Rate and MRR.
- Rerankers: The impact of rerankers, especially CohereRerank/bge-reranker-large, cannot be overstated. They play a crucial role in improving the MRR of many embedding models, demonstrating their importance in making search results better.
- Foundation is Key: Choosing the right embedding model for the initial retrieval is crucial; even the best rerankers won't help much if the base search results are poor.
- Working Together: To get the best out of the retriever, it's important to find the right combination of embedding model and reranker. This study highlights the importance of careful testing to find the best pairing.
These conclusions emphasize the importance of selecting embedding models and rerankers when building efficient retrieval systems, and how they work together to provide the best search results.
Original: https://www.llamaindex.ai/blog/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83
Author: Ravi Theja
Edit / Fan Ruiqiang
Review / Fan Ruiqiang
Verification / Fan Ruiqiang