Building a Document-Based Q&A System Using LangChain, Pinecone, and LLMs

1. Introduction

Today we will delve into the process of creating a document-based Q&A system using LangChain and Pinecone, leveraging the latest large language models (LLMs) such as OpenAI’s GPT-4 and ChatGPT.
LangChain is a powerful framework designed for developing applications driven by language models, while Pinecone is an efficient vector database for building high-performance vector search applications. Our use case focuses on generating accurate, contextually relevant answers to questions based solely on the information contained in a specific set of documents.

By combining semantic search with the generative capabilities of LLMs like GPT, we will demonstrate how to build an advanced document Q&A system.

2. Why Is Semantic Search + GPT Q&A Better than Fine-tuning GPT?

Before we dive into the implementation, let’s understand the advantages of using semantic search + GPT Q&A compared to fine-tuning GPT:

2.1. Broader Knowledge Coverage:

Semantic search + GPT Q&A involves two core steps: first finding the relevant paragraphs across a large number of documents, then generating answers based on those paragraphs. This method can provide more accurate and up-to-date information because it draws on the latest information from the indexed sources. In contrast, fine-tuning GPT relies on the knowledge encoded into the model during training, which can become outdated or incomplete over time.

2.2. Context-specific Answers:

Semantic search + GPT Q&A generates more contextually precise answers by grounding them in specific paragraphs from relevant documents. In contrast, a fine-tuned GPT model may generate answers from the general knowledge embedded in the model, which may be inaccurate or irrelevant to the context of the question.

2.3. Adaptability:

The semantic search component can easily be updated with new information sources or adjusted to different domains, making it more adaptable to specific use cases or industries. In contrast, fine-tuning GPT requires retraining the model, which can be time-consuming and computationally expensive.

2.4. Better Handling of Ambiguous Queries:

Semantic search can reduce query ambiguity by identifying the paragraphs most relevant to the question. This can lead to more accurate and relevant answers than those from a fine-tuned GPT model that lacks the appropriate context.

3. LangChain Modules

LangChain provides support for several key modules:
  • Models: the various model types and model integrations LangChain supports.
  • Indexes: language models are generally more powerful when combined with your own text data; this module covers best practices for doing so.
  • Chains: chains go beyond a single LLM call to a series of calls (whether to LLMs or to other tools). LangChain provides a standard chain interface, many integrations with other tools, and end-to-end chains for common applications. A minimal sketch of the idea follows.
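As a minimal sketch of a chain (assuming the langchain and openai packages from the next section are installed and an OpenAI API key is configured; the prompt text here is purely illustrative), a prompt template and an LLM can be linked into a reusable unit:
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A one-variable prompt template linked to an LLM as a reusable chain
prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one sentence.",
)
chain = LLMChain(llm=OpenAI(), prompt=prompt)
print(chain.run("vector databases"))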

4. Setting Up the Environment

First, we need to install the required packages and import the necessary libraries.
  • Install Required Packages:

!pip install --upgrade langchain openai -q
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install -y poppler-utils
  • Import Necessary Libraries:
import os
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
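Both the openai client and LangChain's OpenAI wrappers read the API key from the OPENAI_API_KEY environment variable, so set it before running the rest of the notebook (the value below is a placeholder):
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"  # placeholder, replace with your own key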

5. Loading Documents

Before loading documents, make sure the installed pillow version is <= 6.2.2; otherwise the following exception will be thrown:
ImportError: cannot import name 'is_directory' from 'PIL._util' (/usr/local/lib/python3.10/dist-packages/PIL/_util.py)
Check the installed pillow version, reinstall version 6.2.2, and restart the Colab runtime after the installation completes:
!pip show pillow
!pip uninstall pillow -y
!pip install --upgrade pillow==6.2.2 
First, we need to use LangChain's DirectoryLoader to load documents from a directory. In this example, we assume the documents are stored in a directory called 'data'.
directory = '/content/data'

def load_docs(directory):
  # DirectoryLoader uses the unstructured library under the hood to parse each file
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)
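By default, DirectoryLoader attempts to load every file in the directory. If you only want a particular file type, the glob parameter restricts the match; the pattern below is an illustrative assumption:
loader = DirectoryLoader(directory, glob="**/*.pdf")  # hypothetical: only load PDF files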

6. Splitting Documents

Now, we need to split the documents into smaller chunks for processing. We will use LangChain's RecursiveCharacterTextSplitter, which attempts to split on the characters ["\n\n", "\n", " ", ""] by default.
def split_docs(documents, chunk_size=1000, chunk_overlap=20):
  # chunk_overlap keeps a small amount of shared text between adjacent chunks
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

docs = split_docs(documents)
print(len(docs))
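As a quick sanity check on the split, we can peek at one chunk (the index 0 is arbitrary):
print(docs[0].page_content[:200])  # first 200 characters of the first chunk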

7. Using OpenAI to Embed Documents

Once the documents are split, we need to embed them using OpenAI's embedding model. First, we need to install the tiktoken library.
!pip install tiktoken -q
Now, we can use LangChain's OpenAIEmbeddings class to embed the documents.
import openai
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

query_result = embeddings.embed_query("memory")
len(query_result)
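embed_query embeds a single string; on the document side, the embed_documents method accepts a list of texts. A minimal sketch:
doc_vectors = embeddings.embed_documents([doc.page_content for doc in docs[:2]])
len(doc_vectors), len(doc_vectors[0])  # (2, 1536) for text-embedding-ada-002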

8. Using Pinecone for Vector Search

Next, we will use Pinecone to create an index for our documents. First, we need to install the pinecone-client.
!pip install pinecone-client -q
Before initializing, we need to create an index in the Pinecone console. Its Dimensions value must match the embedding dimension we measured earlier (1536 here); configure it according to your actual result.
Then we can initialize Pinecone and create a Pinecone index.
pinecone.init(
    api_key="pinecone api key",
    environment="env"
)

index_name = "langchain-demo"

index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
We created a new Pinecone vector index with the Pinecone.from_documents() method. This method accepts three parameters:
1. docs: A list of documents split into smaller chunks by the RecursiveCharacterTextSplitter. These smaller chunks will be indexed in Pinecone for easier searching and retrieval of related documents later.
2. embeddings: An instance of the OpenAIEmbeddings class, responsible for converting text data into embeddings (i.e., numerical representations) using OpenAI’s language model. These embeddings will be stored in the Pinecone index and used for similarity searches.
3. index_name: A string representing the name of the Pinecone index. This name is used to identify the index in Pinecone’s database and should be unique to avoid conflicts with other indexes.
Pinecone.from_documents() processes the input documents, generates embeddings using the provided OpenAIEmbeddings instance, and creates a new Pinecone index with the specified name. The returned index object can perform similarity searches and retrieve relevant documents based on user queries.
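If the documents were already indexed in an earlier session, there is no need to upload them again; the same LangChain class also provides Pinecone.from_existing_index to reconnect. A minimal sketch:
index = Pinecone.from_existing_index(index_name, embeddings)  # reuse an index built earlier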

9. Finding Similar Documents

Now, we can define a function to find similar documents based on a given query.
def get_similar_docs(query, k=2, score=False):
  # Retrieve the k most similar chunks; optionally include similarity scores
  if score:
    similar_docs = index.similarity_search_with_score(query, k=k)
  else:
    similar_docs = index.similarity_search(query, k=k)
  return similar_docs
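For example (the query string is illustrative):
similar_docs = get_similar_docs("How to install LangChain?", score=True)
for doc, score in similar_docs:
  print(score, doc.page_content[:100])  # similarity score and a preview of the chunk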

10. Using LangChain and OpenAI LLM for Q&A

With the necessary components in place, we can now create a Q&A system using LangChain's OpenAI class and a pre-built Q&A chain.
# model_name = "text-davinci-003"
# model_name = "gpt-3.5-turbo"
model_name = "gpt-4"
llm = OpenAI(model_name=model_name)

chain = load_qa_chain(llm, chain_type="stuff")

def get_answer(query):
  # Retrieve the relevant chunks, then let the QA chain answer from them
  similar_docs = get_similar_docs(query)
  answer = chain.run(input_documents=similar_docs, question=query)
  return answer
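The "stuff" chain type simply stuffs all retrieved chunks into a single prompt, which works well as long as the chunks fit in the model's context window. load_qa_chain also accepts other chain types, for example:
chain = load_qa_chain(llm, chain_type="map_reduce")  # answer per chunk, then combine the answers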

11. Example Queries and Answers

Finally, let’s test our Q&A system with some example queries.
query = "How to install LangChain?"
answer = get_answer(query)
print(answer)

query = "Managing LLM prompts?"
answer = get_answer(query)
print(answer)

12. Conclusion

In this article, we demonstrated how to build a document-based Q&A system using LangChain and Pinecone. By leveraging semantic search and large language models, this approach provides a powerful and flexible solution for extracting information from a large number of documents. You can further customize this system to suit your specific needs or domain.
Google Colab Notebook:
https://github.com/Crossme0809/langchain-tutorials/blob/main/Langchain_Semnatic_Serach_Pinecone.ipynb
