Three Advanced Retrieval Techniques in RAG

Source: DeepHub IMBA

This article explores three effective techniques for enhancing document retrieval in RAG-based applications. By combining them, it is possible to retrieve documents that more closely match user queries and thus generate better answers.

It is common for the documents retrieved by a RAG system not to align with the user's query. This often happens when the documents lack a complete answer to the query, contain redundant information or irrelevant details, or are ordered in a way that does not match the user's intent.

Query Expansion

Query expansion refers to a set of techniques for rephrasing the original query.
This article will discuss two popular methods that are easy to implement.
1. Use Generated Answers to Expand Queries
Given an input query, first ask the LLM to produce a hypothetical answer (regardless of its correctness), then concatenate the query and the generated answer and send the combined text to the retrieval system.
This technique works well in practice. It is described in detail in this paper: https://arxiv.org/abs/2212.10496
The idea behind this method is that we want to retrieve documents that look more like answers; we are interested in their structure and expression. Therefore, the hypothetical answer can be seen as a template to help identify relevant neighborhoods in the embedding space.
Here is an example prompt:
You are a helpful expert financial research assistant. Provide an example answer to the given question, that might be found in a document like an annual report.
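Here is a minimal sketch of this idea, assuming an OpenAI client and a Chroma collection named chroma_collection (the same kind of objects used later in this article); the example query is purely illustrative:
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def augment_query_generated(query, model="gpt-3.5-turbo"):
    # Ask the LLM for a hypothetical answer to the query
    messages = [
        {"role": "system", "content": (
            "You are a helpful expert financial research assistant. Provide an example answer "
            "to the given question, that might be found in a document like an annual report."
        )},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

query = "Was there significant turnover in the executive team?"
hypothetical_answer = augment_query_generated(query)

# Retrieve with the original query concatenated with the hypothetical answer
joint_query = f"{query} {hypothetical_answer}"
results = chroma_collection.query(query_texts=[joint_query], n_results=5, include=["documents"])
retrieved_documents = results["documents"][0]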
2. Expand Queries by Including Multiple Related Questions
The second method asks the LLM to generate N questions related to the original query, then sends all of them, plus the original query, to the retrieval system.
This allows for retrieving more documents from the vector store. However, some of these will be duplicates, so post-processing is needed to remove them.
The idea behind this method is to expand potentially incomplete or ambiguous initial queries, merging them into a final set of potentially relevant and complementary results.
Here is the prompt used to generate related questions:
You are a helpful expert financial research assistant. Your users are asking questions about an annual report. Suggest up to five additional related questions to help them find the information they need, for the provided question. Suggest only short questions without compound sentences. Suggest a variety of questions that cover different aspects of the topic. Make sure they are complete questions, and that they are related to the original question. Output one question per line. Do not number the questions.
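A minimal sketch of this method follows; it reuses the OpenAI client and Chroma collection from the previous sketch, the prompt is a condensed version of the one above, and all names and queries are illustrative:
RELATED_QUESTIONS_PROMPT = (
    "You are a helpful expert financial research assistant. Your users are asking questions "
    "about an annual report. Suggest up to five additional related questions to help them find "
    "the information they need, for the provided question. Output one question per line. "
    "Do not number the questions."
)

def augment_multiple_query(query, model="gpt-3.5-turbo"):
    # Ask the LLM for related questions, one per line
    messages = [
        {"role": "system", "content": RELATED_QUESTIONS_PROMPT},
        {"role": "user", "content": query},
    ]
    response = openai_client.chat.completions.create(model=model, messages=messages)
    return [q for q in response.choices[0].message.content.split("\n") if q.strip()]

original_query = "What were the most important factors that contributed to increases in revenue?"
queries = [original_query] + augment_multiple_query(original_query)

results = chroma_collection.query(query_texts=queries, n_results=5, include=["documents"])

# The same chunk may be retrieved by several of the expanded queries, so deduplicate
unique_documents = set()
for documents in results["documents"]:
    unique_documents.update(documents)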
The downside of this method is that it results in more documents, which may distract the LLM and prevent it from generating useful answers.
This is where the next technique, reordering (also known as re-ranking), comes in.

Reordering

This method reorders the retrieved documents by quantifying their relevance to the input query.
Using a cross-encoder for reordering:
A cross-encoder is a deep neural network that processes two input sequences together as a single input. This allows the model to directly compare the inputs and capture their relationship in a more comprehensive, fine-grained way.
Given a query, score it together with each retrieved document using the cross-encoder, then sort the documents by score in descending order; the highest-scoring documents are considered the most relevant.
Here is a simple example:
First, install sentence-transformers
pip install -U sentence-transformers

Use it to load the cross-encoder model:
from sentence_transformers import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
Here we use ms-marco-MiniLM-L-6-v2; refer to the SBERT documentation for performance metrics if you want to choose a stronger model.
Score each pair (query, document):
pairs = [[query, doc] for doc in retrieved_documents]
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:     print(score)  
# Scores: # 0.98693466 # 2.644579 # -0.26802942 # -10.73159 # -7.7066045 # -5.6469955 # -4.297035 # -10.933233 # -7.0384283 # -7.3246956
Reorganize the documents:
print("New Ordering:")
for o in np.argsort(scores)[::-1]:    print(o+1)
Reordering can be used alongside query expansion: after generating multiple related questions and retrieving the corresponding documents (say M of them), the documents are reordered and only the top K (K < M) are kept. This way, the most relevant pieces are selected and the context size is reduced.
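As a sketch, combining the two techniques might look like this; it reuses the cross_encoder loaded above and the original_query and unique_documents produced by the query-expansion sketch, with K set to 5 for illustration:
import numpy as np

unique_documents = list(unique_documents)

# Score every (query, document) pair with the cross-encoder
pairs = [[original_query, doc] for doc in unique_documents]
scores = cross_encoder.predict(pairs)

# Keep only the top K documents as context for the LLM
top_k = 5
top_indices = np.argsort(scores)[::-1][:top_k]
top_documents = [unique_documents[i] for i in top_indices]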

Embedding Adapters

This method utilizes user feedback on document relevance to train a new adapter.
Adapters are a lightweight alternative to fully fine-tuning a pre-trained model. In the typical setup, an adapter is a small feed-forward network inserted between the layers of a pre-trained model. Here, the goal of training the adapter is to transform the query embeddings so that they produce better retrieval results for a specific task.
An embedding adapter is a stage inserted after the embedding step and before retrieval. Think of it as a matrix that transforms the original query embeddings.
To train the adapter, the following steps need to be performed.
1. Prepare Data
This data can be manually labeled or generated by the LLM. The data must include tuples (query, document) and their corresponding labels (1 if the document is relevant to the query, otherwise -1).
For demonstration, we will create a synthetic dataset by first generating example questions that financial analysts might ask when analyzing financial reports.
Let's generate them directly with the LLM:
import os
import openai
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())
openai.api_key = os.environ['OPENAI_API_KEY']
openai_client = OpenAI()  # client used by the helper functions below
PROMPT_DATASET = """
You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. Suggest 10 to 15 short questions that are important to ask when analyzing an annual report. Do not output any compound questions (questions with multiple sentences or conjunctions). Output each question on a separate line divided by a newline.
"""
def generate_queries(model="gpt-3.5-turbo"):
    messages = [
        {
            "role": "system",
            "content": PROMPT_DATASET,
        },
    ]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    content = content.split("\n")
    return content

generated_queries = generate_queries()
for query in generated_queries:
    print(query)
# 1. What is the company's revenue growth rate over the past three years?
# 2. What are the company's total assets and total liabilities?
# 3. How much debt does the company have? Is it increasing or decreasing?
# 4. What is the company's profit margin? Is it improving or declining?
# 5. What are the company's cash flow from operations, investing, and financing activities?
# 6. What are the company's major sources of revenue?
# 7. Does the company have any pending litigation or legal issues?
# 8. What is the company's market share compared to its competitors?
# 9. How much cash does the company have on hand?
# 10. Are there any major changes in the company's executive team or board of directors?
# 11. What is the company's dividend history and policy?
# 12. Are there any related party transactions?
# 13. What are the company's major risks and uncertainties?
# 14. What is the company's current ratio and quick ratio?
# 15. How has the company's stock price performed over the past year?
Then, we perform document retrieval for each generated question, resulting in a collection of results.
results = chroma_collection.query(query_texts=generated_queries, n_results=10, include=['documents', 'embeddings'])
retrieved_documents = results['documents']
Next, we need to evaluate the relevance of each question to its corresponding document. Here, we also use the LLM to accomplish this task:
PROMPT_EVALUATION = """
You are a helpful expert financial research assistant. You help users analyze financial statements to better understand companies. For the given query, evaluate whether the following statement is relevant. Output only 'yes' or 'no'.
"""
def evaluate_results(query, statement, model="gpt-3.5-turbo"):
    messages = [
    {
        "role": "system",
        "content": PROMPT_EVALUATION,
    },
    {
        "role": "user",
        "content": f"Query: {query}, Statement: {statement}"
    }
    ]
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1
    )
    content = response.choices[0].message.content
    if content == "yes":
        return 1
    return -1
Now we have our training data: each tuple will contain the query embedding, document embedding, and label (1, -1).
import numpy as np
import torch
from tqdm import tqdm

retrieved_embeddings = results['embeddings']
query_embeddings = embedding_function(generated_queries)  # the embedding function used by the Chroma collection
adapter_query_embeddings = []
adapter_doc_embeddings = []
adapter_labels = []
for q, query in enumerate(tqdm(generated_queries)):
    for d, document in enumerate(retrieved_documents[q]):
        adapter_query_embeddings.append(query_embeddings[q])
        adapter_doc_embeddings.append(retrieved_embeddings[q][d])
        adapter_labels.append(evaluate_results(query, document))
Then we place them in a Torch Dataset as the training set.
adapter_query_embeddings = torch.Tensor(np.array(adapter_query_embeddings))
adapter_doc_embeddings = torch.Tensor(np.array(adapter_doc_embeddings))
adapter_labels = torch.Tensor(np.expand_dims(np.array(adapter_labels),1))
dataset = torch.utils.data.TensorDataset(adapter_query_embeddings, adapter_doc_embeddings, adapter_labels)
2. Define the Model
We define a function that takes query embeddings, document embeddings, and the adapter matrix as inputs. This function first multiplies the query embeddings by the adapter matrix and computes the cosine similarity of the result with the document embeddings.
def model(query_embedding, document_embedding, adaptor_matrix):
    updated_query_embedding = torch.matmul(adaptor_matrix, query_embedding)
    return torch.cosine_similarity(updated_query_embedding, document_embedding, dim=0)
3. Loss
Our goal is for the cosine similarity computed above to match the label (1 for a relevant document, -1 for an irrelevant one), so we can use the Mean Squared Error (MSE) between the similarity and the label as the loss.
def mse_loss(query_embedding, document_embedding, adaptor_matrix, label):
    return torch.nn.MSELoss()(model(query_embedding, document_embedding, adaptor_matrix), label)
4. Training Process
Initialize the adapter matrix and train it for 100 epochs.
# Initialize the adaptor matrix
mat_size = len(adapter_query_embeddings[0])
adapter_matrix = torch.randn(mat_size, mat_size, requires_grad=True)
min_loss = float('inf')
best_matrix = None
for epoch in tqdm(range(100)):
    for query_embedding, document_embedding, label in dataset:
        loss = mse_loss(query_embedding, document_embedding, adapter_matrix, label)
        if loss < min_loss:
            min_loss = loss
            best_matrix = adapter_matrix.clone().detach().numpy()
        loss.backward()
        with torch.no_grad():
            adapter_matrix -= 0.01 * adapter_matrix.grad
            adapter_matrix.grad.zero_()
After training, the adapter matrix can be used to transform the original query embeddings and adapt them to the user's task.
All we need to do is multiply the original embedding output by the adapter matrix before feeding it into the retrieval system.
test_vector = torch.ones((mat_size, 1))
scaled_vector = np.matmul(best_matrix, test_vector.numpy())

test_vector.shape    # torch.Size([384, 1])
scaled_vector.shape  # (384, 1)
best_matrix.shape    # (384, 384)
From then on, the query embeddings used by our retrieval system are the adapted vectors, optimized for the specific task.
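For example, a query-time sketch might look like this; embedding_function and chroma_collection are the same objects used earlier, and the query is illustrative:
new_query = "What is the company's dividend policy?"

# Embed the query, then transform it with the trained adapter matrix
query_embedding = np.array(embedding_function([new_query])[0])      # shape (384,)
adapted_query_embedding = np.matmul(best_matrix, query_embedding)   # shape (384,)

# Retrieve with the adapted embedding instead of the raw one
results = chroma_collection.query(
    query_embeddings=[adapted_query_embedding.tolist()],
    n_results=10,
    include=["documents"],
)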

Conclusion

The retrieval techniques introduced here help improve document relevance. However, research in this area is ongoing, and there are many other methods, such as fine-tuning embedding models on real user feedback, directly fine-tuning the LLM to maximize its retrieval capabilities (RA-DIT), exploring more sophisticated embedding adapters built from deep neural networks instead of a single matrix, and deeper, more intelligent chunking techniques.
We will cover these techniques in later articles. Thank you for reading.

Editor: Wang Jing

