Data is key. Before we can operate on our data with an LLM, we first need to load and process it. This is similar to ETL (Extract, Transform, Load) in machine learning engineering.
In LlamaIndex, the data ingestion pipeline typically consists of three main stages:

- Loading Data
- Transforming Data
- Indexing and Storing Data
We will cover indexing/storing in future articles. In this article, we will primarily discuss loading data and transforming data.
Loading Data
LlamaIndex uses data connectors called Readers to load data. A Reader ingests data from a data source and formats it into Document objects. A Document is a container for data (text, images, audio, video) together with its metadata.
The easiest Reader to use is SimpleDirectoryReader, which creates a Document object for each file in a given directory. It is built into LlamaIndex and can read data in multiple formats, including Markdown, PDF, Word documents, PowerPoint slides, images, audio, and video.
It is also very simple to use:
from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./data").load_data()
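SimpleDirectoryReader also takes optional arguments to narrow what it loads; a small sketch using its recursive and required_exts options:

from llama_index.core import SimpleDirectoryReader

# Recursively scan ./data, but only pick up Markdown and PDF files
documents = SimpleDirectoryReader(
    "./data",
    recursive=True,
    required_exts=[".md", ".pdf"],
).load_data()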
Since real-world data does not live only on disk, LlamaIndex offers many Readers on LlamaHub that can read from other data sources. We can browse and download the Reader we need at the following address:
https://llamahub.ai/
For example, if we want to read data from a database, we first install DatabaseReader:
pip install llama-index-readers-database
Then initialize:
import os
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)
query = "SELECT * FROM users"
documents = reader.load_data(query=query)
DatabaseReader runs the query against the SQL database and returns each row of the result as a Document object.
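As a quick sanity check, we can loop over the result and print what each row became:

for doc in documents:
    # each Document wraps one row of the query result as text
    print(doc.text)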
Sometimes, for quick experiments or debugging code, we can create Document objects directly:
from llama_index.core import Document
doc = Document(text="your text")
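This also works well for wrapping an in-memory list of strings, e.g. for a quick end-to-end test:

text_list = ["LlamaIndex is a data framework.", "Readers load data into Documents."]
documents = [Document(text=t) for t in text_list]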
Processing and Transforming Data
After loading data, we need to process and transform it before placing it into the storage system. These transformations include chunking, extracting metadata, and generating an embedding vector for each chunk produced. This is essential to ensure that the LLM can retrieve and use the most relevant data.
Transformation inputs and outputs are Node objects (a Document is a subclass of Node). We can define and chain different transformation steps as needed.
High-Level API
Each index has a .from_documents() method that accepts an array of Document objects and applies the default transformation and chunking steps to them:
from llama_index.core import VectorStoreIndex
vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()
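With the query engine in hand, we can ask natural-language questions over the indexed documents (this assumes an LLM and embedding model are configured, e.g. via an OpenAI API key; the question below is just a placeholder):

response = query_engine.query("What are the main topics in these documents?")
print(response)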
We can also take finer control over the transformations this method applies, such as the text chunking strategy, which can be set globally through Settings or passed per index:
from llama_index.core.node_parser import SentenceSplitter
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)
# global
from llama_index.core import Settings
Settings.text_splitter = text_splitter
# per-index
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)
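To see what a splitter actually produces, we can call it on a raw string; SentenceSplitter exposes a split_text method (the sample text here is made up):

sample = "LlamaIndex splits long text into chunks. " * 50
chunks = text_splitter.split_text(sample)
print(len(chunks))     # number of chunks produced
print(chunks[0][:80])  # start of the first chunk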
Low-Level API
We can use LlamaIndex's transformation modules (text splitters, metadata extractors, etc.) as standalone components, or combine them through the declarative transformation pipeline interface, IngestionPipeline.
The key step in processing Documents is splitting them into chunks that the LLM can consume directly, i.e., Node objects.
LlamaIndex supports a variety of text splitters ranging from paragraph/sentence/token-based splitters to file-based splitters (like HTML, JSON).
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
documents = SimpleDirectoryReader("./data").load_data()
pipeline = IngestionPipeline(transformations=[TokenTextSplitter()])  # add further transformations to the list as needed
nodes = pipeline.run(documents=documents)
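A pipeline can also carry the embedding step, so each Node comes out with its vector already computed. A sketch assuming the OpenAI embedding integration (llama-index-embeddings-openai) is installed and OPENAI_API_KEY is set:

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=512, chunk_overlap=20),
        OpenAIEmbedding(),  # embeds each chunk as it passes through
    ]
)
nodes = pipeline.run(documents=documents)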
Adding Metadata
We can also choose to add metadata to Documents and Nodes. This can be done manually or using an automatic metadata extractor.
document = Document(
    text="text",
    metadata={"filename": "<doc_file_name>", "category": "<category>"},
)
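For automatic extraction, LlamaIndex ships metadata extractors that can run inside an ingestion pipeline; a minimal sketch using TitleExtractor (which calls the configured LLM, so an API key must be set):

from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(),  # chunk first
        TitleExtractor(),    # then add a document title to each node's metadata
    ]
)
nodes = pipeline.run(documents=documents)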
We can directly insert Nodes into the vector index:
from llama_index.core.schema import TextNode
node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node2 = TextNode(text="<text_chunk>", id_="<node_id>")
index = VectorStoreIndex([node1, node2])