Using Large Language Models in LlamaIndex

One of the primary steps to consider when building any LLM application based on data is choosing the right LLM.

LLMs are a core component of LlamaIndex. They can be used as standalone modules or inserted into other core LlamaIndex modules (indexers, retrievers, query engines). They are generally used during the response synthesis step after retrieval. Depending on the type of index used, LLMs can also be utilized during indexing, insertion, and query traversal.

  • During indexing, you can use an LLM to determine the relevance of data (whether to index it), or you can use an LLM to summarize raw data into a summary before indexing.

  • During querying, LLMs can be used in two ways:

    • During retrieval (fetching data from the index), the LLM can gather information from a range of sources (e.g., multiple different indexes) and decide how to find the data most relevant to our query. An agentic LLM can also use tools to query different data sources at this stage.

    • During response synthesis (converting retrieved data into answers), the LLM can combine answers from multiple subqueries into a coherent answer or transform data, such as converting unstructured text into JSON or other programming output formats.
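
As a concrete illustration of both roles, a DocumentSummaryIndex calls the LLM at build time to summarize each document and again at query time to synthesize the answer. A minimal sketch (the "data" directory and the question are placeholders):

from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader

# Indexing: the LLM summarizes each loaded document before it is stored.
documents = SimpleDirectoryReader("data").load_data()
index = DocumentSummaryIndex.from_documents(documents)

# Querying: the LLM synthesizes an answer from the retrieved content.
response = index.as_query_engine().query("What are the main topics covered?")
print(response)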

LlamaIndex provides a unified interface for defining LLM modules in the llama_index.llms module, covering a large number of different LLMs (open-source models such as Llama 2, Mixtral, and Grok, as well as closed-source ones such as ChatGPT and Gemini), so that we do not have to write boilerplate code to define the LLM interface ourselves.
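
Because every integration implements the same interface, switching vendors does not change the calling code. A sketch assuming the llama-index-llms-anthropic package is installed and API keys are configured (model names are illustrative):

from llama_index.llms.openai import OpenAI
from llama_index.llms.anthropic import Anthropic

# Both objects expose the same complete()/chat()/stream_complete() methods.
for llm in (OpenAI(model="gpt-4"), Anthropic(model="claude-3-opus-20240229")):
    print(llm.complete("Paul Graham is "))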

Using LLM as a Standalone Module

To automatically generate text:

from llama_index.llms.openai import OpenAI
# non-streaming
resp = OpenAI().complete("Paul Graham is ")
print(resp)
# using streaming endpoint
llm = OpenAI()
resp = llm.stream_complete("Paul Graham is ")
for delta in resp:
    # print only the newly generated text of each streamed chunk
    print(delta.delta, end="")

To use as a chatbot:

from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI
messages = [
    ChatMessage(
        role="system", content="You are a pirate with a colorful personality"
    ),
    ChatMessage(role="user", content="What is your name"),
]
resp = OpenAI().chat(messages)
print(resp)
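
The chat interface can also be streamed. A minimal sketch with the same OpenAI client (stream_chat yields incremental deltas, much like stream_complete above):

llm = OpenAI()
resp = llm.stream_chat(messages)
for r in resp:
    print(r.delta, end="")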

Customizing LLM

LlamaIndex defaults to using GPT-3.5-turbo. However, we can choose an LLM that suits our needs or is available to us.

Typically, we pass the LLM instance to Settings to specify the designated LLM as a global configuration:

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
Settings.llm = OpenAI(temperature=0.2, model="gpt-4")
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

In the above example, we specify the use of the GPT-4 model.
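
The LLM does not have to be set globally: many components also accept an LLM directly, overriding Settings for that component only. A sketch reusing the index built above (the question is a placeholder):

from llama_index.llms.openai import OpenAI

# Local override: this query engine uses GPT-4 even if Settings.llm differs.
query_engine = index.as_query_engine(llm=OpenAI(temperature=0.2, model="gpt-4"))
print(query_engine.query("What is this document about?"))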

The temperature parameter controls the randomness, and hence the diversity, of the LLM's output; it generally ranges from 0 to 1 (OpenAI accepts values up to 2). A higher value makes the output more varied, so the same question may be answered in different ways, while a lower value makes it more deterministic.
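
For instance (values are illustrative), a near-zero temperature gives stable, repeatable answers, while a higher value produces more varied phrasing:

from llama_index.llms.openai import OpenAI

# Lower temperature -> more deterministic output; higher -> more varied output.
precise_llm = OpenAI(model="gpt-4", temperature=0.0)
creative_llm = OpenAI(model="gpt-4", temperature=0.9)
print(precise_llm.complete("Describe LlamaIndex in one sentence."))
print(creative_llm.complete("Describe LlamaIndex in one sentence."))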

LlamaIndex now integrates most LLMs, including OpenAI, HuggingFace, LlamaCPP, etc. For a detailed list:

https://docs.llamaindex.ai/en/stable/module_guides/models/llms/modules/

When choosing to use a default or provided LLM, consider the following three aspects:

  • Privacy and Security

    By default, LlamaIndex sends your data to OpenAI to generate embeddings and natural language responses. However, this behavior can be configured according to your preferences: LlamaIndex can just as easily use your own embedding model or run an LLM locally (see the sketch after this list).

  • Data Privacy

    Regarding data privacy, when LlamaIndex is used in conjunction with OpenAI, privacy details and data handling are governed by OpenAI's policies. Other LLM vendors likewise each have their own policies.

  • Vector Storage

    LlamaIndex provides modules for integration with other vector databases. It is worth noting that each vector database has its own privacy policies and practices, and LlamaIndex is not responsible for how they handle or use your data. Note that, by default, LlamaIndex stores embeddings locally.
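
For example, if keeping data local is a priority, both the embedding model and the LLM can be swapped for local ones, and the index can be persisted to disk. The sketch below assumes the llama-index-embeddings-huggingface and llama-index-llms-ollama integration packages and a locally running Ollama server; the model choices are illustrative:

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Embeddings are computed locally instead of being sent to OpenAI.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Responses come from a local model served by Ollama.
Settings.llm = Ollama(model="llama2", request_timeout=120.0)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
# The default (simple) vector store can be persisted to local disk.
index.storage_context.persist(persist_dir="./storage")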

Tokenization

By default, LlamaIndex uses a global tokenizer for token counting. The default is tiktoken's cl100k_base encoding, which matches the tokenizer of the default LLM (gpt-3.5-turbo).

If you change the LLM, you may need to synchronize the tokenizer to ensure accurate token counting, chunking, and prompting.

The only requirement for the tokenizer is that it is a callable function that accepts a string and returns a list.

You can set the global tokenizer like this:

from llama_index.core import Settings
# tiktoken
import tiktoken
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
# huggingface
from transformers import AutoTokenizer
Settings.tokenizer = AutoTokenizer.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta"
)
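
Since the tokenizer is just a callable that returns a list, counting the tokens in a piece of text is a one-liner (shown here assuming the tiktoken tokenizer set above):

text = "Paul Graham is "
num_tokens = len(Settings.tokenizer(text))
print(f"{num_tokens} tokens")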

Setting LLM Output Token Count

The number of tokens an LLM outputs in a single response is capped; LlamaIndex assumes a limit of 256 tokens by default.

For OpenAI’s LLM, we can change this limit using the max_tokens parameter:

from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

# define global LLM
Settings.llm = OpenAI(max_tokens=512)

For the open-source Llama 2 model, we can load it with LlamaCPP and set this limit via the max_new_tokens parameter at load time.

from llama_index.llms.llama_cpp import LlamaCPP
# helper functions that format chat messages and completions for Llama 2-style prompts
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

For Llama 2, another key parameter is context_window, which sets the size of the model's context window. The larger the context window, the stronger the LLM's "memory" (it can keep more chat history in view); in RAG applications it also means the LLM can receive more retrieved input at once, allowing it to reason over larger spans of context to arrive at an answer.
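
On the LlamaIndex side, these limits can also be declared globally so that prompts and retrieved chunks are packed to fit the model. A sketch mirroring the LlamaCPP values above (assumes the Settings.context_window and Settings.num_output options):

from llama_index.core import Settings

# Tell LlamaIndex how much room the model has and how much output to reserve.
Settings.context_window = 3900
Settings.num_output = 256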
