Embedding images and text in the same space allows us to perform high-precision searches on multimodal content such as web pages, PDFs, magazines, books, brochures, and papers. Why is this technology so interesting? The most exciting aspect of embedding text and images in the same space is that you can search for and retrieve text related to a specific image and vice versa. For example, if you search for a cat, you will find images showing cats, and you will also retrieve the text that references those images, even if that text never explicitly mentions the word “cat.”
Let me show you the difference between traditional text embedding similarity search and multimodal embedding space:
Example Question: What does the magazine say about cats?
Regular Similarity Search Answer
The search results contain no specific information about cats. They mention animal portraits and photography techniques, but never cats or any details related to them.
As shown in the image above, the word “cat” never appears; there is only a photo and a description of how to photograph animals. Because the word is missing, the regular similarity search comes back empty-handed.
Multimodal Search Answer
The magazine features a portrait of a cat that highlights the intricate capture of the cat’s facial features and personality. The article emphasizes that a fine animal portrait can delve into the subject’s soul and establish an emotional connection with the audience through engaging eye contact.
With multimodal search, we find the image of a cat and then link the relevant text to it. Feeding both into the model gives it the context it needs to answer the question properly.
How to Build a Multimodal Embedding and Retrieval Pipeline
Now, I will describe how such a pipeline works through several steps:
- We will use Unstructured (a powerful Python data extraction library) to extract text and images from PDF files.
- We will use the Voyage Multimodal 3 model (voyage-multimodal-3) to create multimodal vectors for text and images in the same vector space.
- We will insert the resulting vectors, together with their metadata, into a vector store (Weaviate).
- Finally, we will perform similarity searches and compare the results of text and images.
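The steps above rely on a handful of Python packages. A quick, indicative list of what this walkthrough assumes is installed (package extras and versions may differ in your environment):

# Indicative dependencies for this walkthrough (install with pip):
#   unstructured[pdf]  - PDF partitioning (hi_res strategy) and chunking
#   weaviate-client    - Weaviate v4 Python client
#   voyageai           - Voyage AI embeddings client
#   langgraph, langchain-core, python-dotenv, pillow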
Step 1: Set Up Vector Store and Extract Images and Text from Files (PDF)
Here, we need to do some manual work. Weaviate is normally a very user-friendly vector store that can vectorize data and add embeddings automatically on insert. However, there is no built-in module for Voyage Multimodal 3, so we must calculate the embeddings ourselves. In this case, we create a collection without defining a vectorizer module.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()

collection_name = "multimodal_demo"
client.collections.delete(collection_name)

try:
    client.collections.create(
        name=collection_name,
        vectorizer_config=Configure.Vectorizer.none()  # Don't set a vectorizer for this collection
    )
    collection = client.collections.get(collection_name)
except Exception:
    collection = client.collections.get(collection_name)
Here, I run a local Weaviate instance in a Docker container.
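If you do not have a local instance yet, a single Docker container is enough for this demo. A minimal sketch (the image tag below is only an example; any recent release that exposes both the REST and gRPC ports should work), followed by a readiness check against the client we just created:

# Start Weaviate locally, e.g.:
#   docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate:1.27.2
# Then verify that the client can reach it:
print(client.is_ready())  # True once the instance is up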
Step 2: Extract Documents and Images from PDF
This is the key step that makes the process work. We take a PDF containing text and images, extract that content, and group it into coherent chunks. Each chunk is a list of elements containing strings (the actual text) and Python PIL images.
We will use the Unstructured library to do some heavy lifting, but we still need to write some logic and configure library parameters.
from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title
elements = partition(
    filename="./files/magazine_sample.pdf",
    strategy="hi_res",
    extract_image_block_types=["Image", "Table"],
    extract_image_block_to_payload=True)
chunks = chunk_by_title(elements)
Here, we must use the hi_res strategy and set extract_image_block_to_payload so the images are exported into the element payload, because we will need this data later for the actual embedding. After extracting all elements, we group them into chunks based on the titles in the document.
See the Unstructured documentation on chunking for more information.
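If the default chunks come out too small or too large for your documents, chunk_by_title accepts sizing parameters. A sketch assuming you want larger chunks (parameter defaults may differ between Unstructured versions):

# Optional: tune how sections are merged and split during chunking
chunks = chunk_by_title(
    elements,
    max_characters=2000,             # hard upper bound per chunk
    combine_text_under_n_chars=500,  # merge very small sections into their neighbours
    multipage_sections=True,         # let a section span page breaks
)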
In the script below, we use these chunks to build two lists:
- A list of embedding inputs (text strings and PIL images) that we will send to Voyage Multimodal 3 to create the vectors
- A list of metadata dictionaries extracted by Unstructured. We store this metadata alongside the vectors in the vector store; it gives us additional filtering properties and tells us where each retrieved chunk came from.
from unstructured.staging.base import elements_from_base64_gzipped_json
import PIL.Image
import io
import base64

embedding_objects = []
embedding_metadatas = []

for chunk in chunks:
    embedding_object = []
    metadata_dict = {
        "text": chunk.to_dict()["text"],
        "filename": chunk.to_dict()["metadata"]["filename"],
        "page_number": chunk.to_dict()["metadata"]["page_number"],
        "last_modified": chunk.to_dict()["metadata"]["last_modified"],
        "languages": chunk.to_dict()["metadata"]["languages"],
        "filetype": chunk.to_dict()["metadata"]["filetype"]
    }
    # The chunk text is always the first element to embed
    embedding_object.append(chunk.to_dict()["text"])

    # Add the images to the embedding object
    if "orig_elements" in chunk.to_dict()["metadata"]:
        base64_elements_str = chunk.to_dict()["metadata"]["orig_elements"]
        eles = elements_from_base64_gzipped_json(base64_elements_str)
        image_data = []
        for ele in eles:
            if ele.to_dict()["type"] == "Image":
                base64_image = ele.to_dict()["metadata"]["image_base64"]
                image_data.append(base64_image)
                pil_image = PIL.Image.open(io.BytesIO(base64.b64decode(base64_image)))
                # Resize image if larger than 1000x1000 while maintaining aspect ratio
                if pil_image.size[0] > 1000 or pil_image.size[1] > 1000:
                    ratio = min(1000 / pil_image.size[0], 1000 / pil_image.size[1])
                    new_size = (int(pil_image.size[0] * ratio), int(pil_image.size[1] * ratio))
                    pil_image = pil_image.resize(new_size, PIL.Image.Resampling.LANCZOS)
                embedding_object.append(pil_image)
        metadata_dict["image_data"] = image_data

    embedding_objects.append(embedding_object)
    embedding_metadatas.append(metadata_dict)
The result of this script will be a list of lists that looks like the following:
[['FROM
ON LOCATION KIRKJUFELL, ICELAND', <PIL.Image.Image image mode=RGB size=1000x381>, <PIL.Image.Image image mode=RGB size=526x1000>], ['This iconic mountain was on the top of our list of places to shoot in Iceland, and we had seen many images taken from the nearby waterfalls long before we went there. So this was the first place we headed to at sunrise - and we weren't disappointed. The waterfalls provided the perfect foreground interest for this image (top), and Kirkjufell is a perfect pointed mountain from this viewpoint. We spent an hour or two simply exploring these waterfalls, finding several different viewpoints.']]
Step 3: Vectorize the Extracted Data
In this step, we take the chunks created in the previous step and send them to the voyageai Python package, which returns a list of embeddings. We then store these embeddings in Weaviate.
from dotenv import load_dotenv
import voyageai
load_dotenv()
vo = voyageai.Client()
# This will automatically use the environment variable VOYAGE_API_KEY.
# Alternatively, you can use vo = voyageai.Client(api_key="<your secret key>")
# Example input containing a text string and PIL image object
inputs = embedding_objects
# Vectorize inputs
result = vo.multimodal_embed(
    inputs,
    model="voyage-multimodal-3",
    truncation=False)
If we access result.embeddings, we will get a list of lists containing all the computed embedding vectors:
[[-0.052734375, -0.0164794921875, 0.050048828125, 0.01348876953125, -0.048095703125, …]]
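Before inserting anything, it is worth a quick sanity check that we got exactly one vector per chunk and that all vectors share the same dimensionality (voyage-multimodal-3 returns 1024-dimensional vectors at the time of writing; check the Voyage documentation for your version):

# Sanity check: one embedding per chunk, all with the same length
assert len(result.embeddings) == len(embedding_objects)
print({len(v) for v in result.embeddings})  # expected: a single value, e.g. {1024}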
We can now use the `batch.add_object` method to store these embeddings in Weaviate in a single batch. Note that we also pass the metadata via the properties parameter.
with collection.batch.dynamic() as batch:
    for i, data_row in enumerate(embedding_objects):
        batch.add_object(
            properties=embedding_metadatas[i],
            vector=result.embeddings[i]
        )
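As a quick check that the batch import succeeded, we can count the objects now stored in the collection; the total should match the number of chunks we embedded:

# Confirm that every chunk made it into the collection
total = collection.aggregate.over_all(total_count=True).total_count
print(f"Objects in collection: {total}")  # should equal len(embedding_objects)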
Step 4: Query the Data
Now we can perform similarity searches and query the data. This works just like a regular similarity search over text embeddings, with one difference: since Weaviate does not have a module for the Voyage multimodal model, we must embed the search query ourselves and then pass that vector to Weaviate to perform the similarity search.
from weaviate.classes.query import MetadataQuery
question = "What does the magazine say about waterfalls?"
vect = vo.multimodal_embed([[question]], model="voyage-multimodal-3")
vect.embeddings[0]
response = collection.query.near_vector(
    near_vector=vect.embeddings[0],  # your query vector goes here
    limit=2,
    return_metadata=MetadataQuery(distance=True))
# Displaying the results
for o in response.objects:
    print(o.properties['text'])
    for image_data in o.properties['image_data']:
        # Display the image using PIL
        img = PIL.Image.open(io.BytesIO(base64.b64decode(image_data)))
        width, height = img.size
        if width > 500 or height > 500:
            ratio = min(500 / width, 500 / height)
            new_size = (int(width * ratio), int(height * ratio))
            img = img.resize(new_size)
        display(img)
    print(o.metadata.distance)
The image below shows that searching for waterfalls returns both text and images related to the query. As you can see, the photo shows waterfalls, but the text itself never mentions them; the text belongs to an image that contains waterfalls, which is why it is retrieved as well. A regular text embedding search cannot do this.
Step 5: Add It to a Complete Retrieval Pipeline
Now that we have extracted text and images from the magazine, embedded them, added them to Weaviate, and set up similarity search, we can wire everything into a complete retrieval pipeline. In this example, I will use LangGraph: users ask questions about the magazine, and the pipeline answers them. With the heavy lifting done, this part is as simple as setting up a typical text-only retrieval pipeline.
I have abstracted some of the logic we discussed in the previous sections into other modules. Here is an example of how I integrated it into the LangGraph pipeline.
from typing import Annotated, List, Sequence, TypedDict

from langchain_core.documents import Document
from langchain_core.messages import AIMessage, BaseMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

# BaseNodes, Weaviate, the logger, and the LLM/chain factories come from the
# helper modules mentioned above, which wrap the logic of the previous sections.

class MultiModalRetrievalState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    results: List[Document]
    base_64_images: List[str]

class RAGNodes(BaseNodes):
    def __init__(self, logger, mode="online", document_handler=None):
        super().__init__(logger, mode)
        self.weaviate = Weaviate()
        self.mode = mode

    async def multi_modal_retrieval(self, state: MultiModalRetrievalState, config):
        collection_name = config.get("configurable", {}).get("collection_name")
        self.weaviate.set_collection(collection_name)
        print("Running multi-modal retrieval")
        print(f"Searching for {state['messages'][-1].content}")
        results = self.weaviate.similarity_search(
            query=state["messages"][-1].content, k=3, type="multimodal"
        )
        return {"results": results}

    async def answer_question(self, state: MultiModalRetrievalState, config):
        print("Answering question")
        llm = self.llm_factory.create_llm(mode=self.mode, model_type="default")
        include_images = config.get("configurable", {}).get("include_images", False)
        chain = self.chain_factory.create_multi_modal_chain(
            llm,
            state["messages"][-1].content,
            state["results"],
            include_images=include_images,
        )
        response = await chain.ainvoke({})
        message = AIMessage(content=response)
        return {"messages": message}

# Define the config
class GraphConfig(TypedDict):
    mode: str = "online"
    collection_name: str
    include_images: bool = False

graph_nodes = RAGNodes(logger)

graph = StateGraph(MultiModalRetrievalState, config_schema=GraphConfig)
graph.add_node("multi_modal_retrieval", graph_nodes.multi_modal_retrieval)
graph.add_node("answer_question", graph_nodes.answer_question)
graph.add_edge(START, "multi_modal_retrieval")
graph.add_edge("multi_modal_retrieval", "answer_question")
graph.add_edge("answer_question", END)

multi_modal_graph = graph.compile()

__all__ = ["multi_modal_graph"]
The code above generates the following graph:
In this trace, you can see that both content and images are sent to OpenAI to answer questions.
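For completeness, here is roughly how the compiled graph can be invoked. The question and collection name are just examples, and include_images controls whether the base64 images are passed along to the LLM:

from langchain_core.messages import HumanMessage

# Example invocation (async, e.g. from a notebook cell)
output = await multi_modal_graph.ainvoke(
    {"messages": [HumanMessage(content="What does the magazine say about waterfalls?")]},
    config={"configurable": {"collection_name": "multimodal_demo", "include_images": True}},
)
print(output["messages"][-1].content)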
Conclusion
Multimodal embeddings make it possible to integrate and retrieve information of different data types (such as text and images) within the same embedding space. By combining tools like the Voyage Multimodal 3 model, Weaviate, and LangGraph, we demonstrated how to build a retrieval pipeline that understands and links content more intuitively than traditional text-only methods.
This approach significantly improves search and retrieval accuracy across data sources such as magazines, brochures, and PDFs. It also shows how multimodal embeddings can deliver richer, context-aware results that connect images with descriptive text even when explicit keywords are absent. This tutorial should give you a solid starting point for exploring these techniques and applying them to your own projects.
Original Author: Ben Selleslagh
Example Notebook: https://github.com/vectrix-ai/vectrix-graphs/blob/main/examples/multi-model-embeddings.ipynb