RAG Mastery Manual: Understanding the Technology Behind RAG

In a previous article, RAG Mastery Manual: Is RAG Sounding the Death Knell? Does Long Context in Large Models Mean Vector Retrieval Is No Longer Important, we explained why RAG remains indispensable for addressing the hallucination problem of large models and reviewed how vector databases can enhance RAG in practice.

Today, we continue the analysis of RAG with a detailed introduction to the development history and basic principles of the technologies behind it, such as Embedding, Transformer, BERT, and LLMs, as well as how they are applied.

01.

What is Embedding?

Embedding is a technique that transforms discrete, unstructured data into continuous vector representations.

In natural language processing, Embedding is often used to map words, sentences, or documents in text data into fixed-length real-valued vectors, enabling computers to process and understand text data more effectively. Through Embedding, each word or sentence can be represented by a real-valued vector that contains its semantic information. As a result, similar words or sentences are mapped to nearby vectors in the embedding space, and words or sentences with similar meanings are separated by shorter distances in vector space. This allows matching, classification, clustering, and other operations to be performed based on the distances or similarities between vectors in natural language processing tasks.
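To make the idea of "distance in vector space" concrete, here is a minimal sketch of cosine similarity between embedding vectors. The tiny 4-dimensional vectors are made up purely for illustration; real models produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means the vectors point in a similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings (real embeddings have hundreds of dimensions).
cat = np.array([0.80, 0.10, 0.05, 0.05])
kitten = np.array([0.75, 0.15, 0.05, 0.05])
car = np.array([0.05, 0.10, 0.80, 0.05])

print(cosine_similarity(cat, kitten))  # high: semantically close
print(cosine_similarity(cat, car))     # lower: semantically distant
```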

  • Word2Vec

Word2Vec is a word embedding method proposed by Google in 2013. It was one of the mainstream Word Embedding methods before 2018. As one of the classic algorithms for word vectors, Word2Vec has been widely used in various natural language processing tasks. It learns the semantic and syntactic relationships between words by training on a corpus, mapping words to dense vectors in high-dimensional space. The emergence of Word2Vec pioneered the conversion of words into vector representations, greatly promoting the development of the natural language processing field.

The Word2Vec model maps each word to a vector that captures the relationships between words. The following figure shows an example in a 2-dimensional vector space (in practice, the vectors usually have far more dimensions).

[Figure: example word vectors plotted in a 2-dimensional space]

As the figure shows, the words in this 2D space are not distributed arbitrarily. For example, to move from man to woman, one adds a vector pointing to the upper right, which can be thought of as a vector that "transforms male into female." Adding this same vector to king lands near the position of queen. Similarly, moving from Paris to France corresponds to a vector that captures a "from city to country" relationship.
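This analogy can be queried directly from a trained model. The sketch below assumes the gensim library (4.x) and uses a tiny made-up corpus purely to show the API; the famous king − man + woman ≈ queen result only emerges when the model is trained on a large real corpus.

```python
# pip install gensim  (sketch assumes gensim 4.x)
from gensim.models import Word2Vec

# A real corpus would contain millions of tokenized sentences; this one is a toy.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "paris"],
    ["the", "woman", "walks", "in", "france"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# vector("king") - vector("man") + vector("woman") should land near vector("queen")
# when trained on enough data.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```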

This fascinating phenomenon indicates that the distribution of vectors in the embedding space is neither chaotic nor random: regions of the space correspond to specific categories, and the offsets between regions capture meaningful relationships. We can therefore conclude that similarity between vectors reflects similarity between the original data, which means vector search is effectively semantic search over the original data. This is what allows vector search to power many semantic similarity applications.

However, as an early technology, Word2Vec also has certain limitations:

Because the mapping between words and vectors is one-to-one, Word2Vec cannot handle polysemy. For example, the word bank does not carry the same meaning in each of the following sentences:

...very useful to protect banks or slopes from being washed away by river or rain...
...the location because it was high, about 100 feet above the bank of the river...
...The bank has plans to branch throughout the country...
...They throttled the watchman and robbed the bank...

In addition, Word2Vec is a static method: once trained, each word's vector is fixed, so although the embeddings are broadly useful, they cannot be dynamically adapted or optimized for a specific task.

  • The Transformer Revolution

Although Word2Vec performed well at representing individual words as vectors, it could not capture the complex relationships that context imposes on meaning. To better handle contextual dependencies and semantic understanding, the Transformer model was born.

Transformer is a neural network model based on the self-attention mechanism, first proposed by researchers at Google in 2017 and applied to natural language processing tasks. It can model the relationships between words at different positions in the input sentence, thus better capturing contextual information. The introduction of Transformer marked a significant revolution in neural network models in the field of natural language processing, leading to significant performance improvements in tasks such as text generation and machine translation.

Initially, Transformer was proposed for machine translation tasks and achieved significant performance improvements. This model consists of an “Encoder” and a “Decoder,” where the encoder encodes the input language sequence into a series of hidden representations, and the decoder decodes these hidden representations into the target language sequence. Each encoder and decoder is composed of multiple layers of self-attention mechanisms and feedforward neural networks.

Compared to traditional RNNs (Recurrent Neural Networks), Transformer allows far more efficient parallel computation, because self-attention processes all input positions simultaneously while an RNN must step through the sequence one token at a time. Both RNNs and CNNs (Convolutional Neural Networks) also struggle with long-distance dependencies, whereas Transformer learns them directly through the self-attention mechanism.
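The core of this parallelism is scaled dot-product attention. Below is a minimal NumPy sketch of the formula softmax(QK^T / sqrt(d_k)) V; the token vectors and projection matrices are random placeholders rather than learned weights.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of every position to every other position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V               # weighted mix of value vectors

# Toy sequence of 4 tokens, each an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real Transformer, Q, K, V come from learned linear projections of x.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): every position attends to all positions in parallel
```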

Due to the excellent performance of the original Transformer model on large-scale tasks, researchers began to experiment with adjusting the model’s size to improve performance. They found that by increasing the depth, width, and number of parameters of the model, Transformer could better capture the relationships and patterns between input sequences.

Another important development of Transformer is the emergence of large-scale pre-trained models. By training on a large amount of unsupervised data, pre-trained models can learn richer semantic and syntactic features and fine-tune them for downstream tasks. These pre-trained models include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), etc., which have achieved great success in various natural language processing tasks.

The development of Transformer has brought tremendous changes to artificial intelligence. The encoder part evolved into the BERT series, which in turn gave rise to various embedding models, while the decoder part evolved into the GPT series, leading to the subsequent LLM revolution that includes today's ChatGPT.

  • BERT and Sentence Embedding

The encoder part of Transformer evolved into BERT.

BERT is pre-trained with two objectives: MLM (Masked Language Model), a cloze-style task, and NSP (Next Sentence Prediction). The MLM objective has BERT predict masked vocabulary, helping it understand the context of the entire sequence; the NSP objective has BERT judge whether two sentences are consecutive, helping it understand the relationship between sentences. These two pre-training objectives give BERT a powerful ability to learn semantic information and achieve excellent performance in various natural language processing tasks.
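The MLM objective is easy to observe in practice. The sketch below assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint; BERT fills in the [MASK] token using its bidirectional context.

```python
# Sketch of BERT's masked-language-model behavior
# (assumes the transformers library and the public bert-base-uncased checkpoint).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the masked token from the words on both sides of it.
for candidate in unmasker("The man went to the [MASK] to withdraw money."):
    print(candidate["token_str"], round(candidate["score"], 3))
```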

One very important application of BERT is sentence embedding, which generates an embedding vector from a sentence. This vector can be used for various downstream natural language processing tasks, such as sentence similarity computation, text classification, sentiment analysis, etc. By using sentence embedding, sentences can be converted into vector representations in high-dimensional space, allowing computers to understand the sentences and express their semantics.

Compared to traditional word embedding methods, BERT’s sentence embedding captures more semantic information and sentence-level relationships. By taking the entire sentence as input, the model can comprehensively consider the contextual relationships of the words within the sentence and the semantic relevance between sentences. This provides a more powerful and flexible tool for solving a range of natural language processing tasks.
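As a concrete illustration, the following sketch assumes the sentence-transformers library and the public all-MiniLM-L6-v2 checkpoint (a BERT-style encoder) to turn sentences into fixed-length vectors and compare them.

```python
# Sentence-embedding sketch (assumes sentence-transformers and the
# public all-MiniLM-L6-v2 checkpoint).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat quietly on the mat.",
    "A kitten is resting on the rug.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

# Semantically similar sentences score higher.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high
print(util.cos_sim(embeddings[0], embeddings[2]))  # low
```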

  • Why is Embedding Search Better than Frequency-Based Search?

Traditional frequency-based search algorithms include TF-IDF, BM25, etc. Frequency-based search only considers the frequency of words in the text, ignoring the semantic relationships between words. In contrast, embedding search captures the semantic relationships between words by mapping each word to a vector representation in a vector space. Therefore, when searching, it can more accurately match relevant texts by calculating the similarity between words.

Frequency-based search can only perform exact matches and performs poorly for synonyms or semantically related words. In contrast, embedding search achieves fuzzy matching for synonyms and semantically related terms by computing similarity between vectors, which improves both the coverage and the accuracy of search results.

Using frequency-based search methods, if we query “cat,” the results may rank articles containing “cat” with high frequency at the top. However, this method cannot consider the semantic relationships between “cat” and other animals, such as “British Shorthair” or “Ragdoll.” In contrast, by using embedding search methods, words can be mapped to vectors in high-dimensional space, making semantically similar words close together in space. When querying “cat,” embedding search can find semantically similar words like “British Shorthair” and “Ragdoll,” placing these related articles at the top of the results. This provides more accurate and relevant search results.
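The contrast can be seen in a few lines of code. The sketch below assumes the rank-bm25 and sentence-transformers packages; BM25 only rewards documents containing the literal token "cat," while the embedding model also scores the cat-breed documents highly.

```python
# Lexical (BM25) vs. semantic (embedding) retrieval for the "cat" example.
# Assumes the rank-bm25 and sentence-transformers packages.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "How to groom a British Shorthair",
    "Ragdoll kittens are gentle companions",
    "Cat cat cat: a repetitive article about the word cat",
]

# Lexical search: only documents containing the literal token "cat" score well.
bm25 = BM25Okapi([d.lower().split() for d in docs])
print(bm25.get_scores("cat".split()))

# Semantic search: breed names land close to "cat" in embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs)
query_vec = model.encode("cat")
print(util.cos_sim(query_vec, doc_vecs))
```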

02.

The Development of LLM

Currently, most large language models (LLMs) are derived from the "decoder-only" Transformer architecture, such as GPT. Compared to BERT, which uses only the Transformer encoder, the decoder-only structure of LLMs is built to generate text that stays consistent with the preceding context.

The training task of a language model is to predict the probability of the next word given the preceding context. By repeatedly predicting the next word and appending it to the context, the model produces increasingly accurate and fluent text. This training process helps language models better understand linguistic patterns and contextual information, thereby enhancing their natural language processing capabilities.
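This next-word objective can be inspected directly. The sketch below assumes the transformers library and the public gpt2 checkpoint, and prints the model's most probable next tokens for a prompt.

```python
# Next-token prediction sketch (assumes transformers, torch, and the public gpt2 checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (batch, sequence, vocabulary)

# Probability distribution over the vocabulary for the *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {float(p):.3f}")
```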

  • From GPT-1 to GPT-3

The GPT series is a family of large language models that OpenAI has developed and continuously improved since 2018.

The earliest GPT-1 had issues with semantic incoherence or repetition when generating long texts. GPT-2, released in 2019, was an improved version built on GPT-1, featuring several enhancements, including larger training data, deeper model structures, and more training iterations. GPT-2 significantly improved the quality and coherence of generated text and introduced zero-shot learning capabilities, allowing it to reason and generate text for unseen tasks. GPT-3 further enhanced and expanded the model’s scale and capabilities based on GPT-2. The GPT-3 model has 175 billion parameters, providing powerful generation capabilities, enabling it to produce longer, more logical, and coherent texts. GPT-3 also introduced more contextual understanding and reasoning capabilities, allowing for deeper analysis of questions and providing more accurate answers.

From GPT-1 to GPT-3, OpenAI's language generation models underwent significant improvements in data scale, model structure, and training techniques, achieving higher-quality, more logical, and more coherent text generation. By GPT-3, behaviors not seen in earlier language models began to emerge; GPT-3 has the following capabilities:

Language continuation: Given a prompt, GPT-3 can generate sentences that complete the prompt.

In-context learning: Given several examples of a task, GPT-3 can reference them and generate similar answers for new cases, also known as few-shot learning (a prompt sketch follows this list).

World knowledge: Including factual knowledge and common sense absorbed from its training data.
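As a concrete illustration of in-context learning, here is a sketch of a few-shot prompt: the worked examples live inside the prompt text itself, no model weights are updated, and the actual call to a completion-style model is omitted.

```python
# Few-shot (in-context learning) prompt sketch. The examples are part of the prompt;
# the prompt would then be sent to a completion-style model such as GPT-3.
examples = [
    ("cheese", "fromage"),
    ("book", "livre"),
    ("house", "maison"),
]

prompt = "Translate English to French.\n\n"
for english, french in examples:
    prompt += f"English: {english}\nFrench: {french}\n\n"
prompt += "English: cat\nFrench:"   # the model is expected to continue with "chat"

print(prompt)
```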

  • ChatGPT

In November 2022, OpenAI released ChatGPT, a chatbot that can answer almost any question. Its performance is surprisingly good: you can ask it to summarize documents, translate, write code, and produce all kinds of copy. With some tools, you can even have it order takeout, book flights, and help with tasks that were hard to imagine before.

This powerful functionality is supported by techniques such as Reinforcement Learning from Human Feedback (RLHF), which makes its conversations with humans far more satisfying. RLHF is a reinforcement learning method driven by human feedback, aiming to align the model's outputs with human preferences. The process works as follows: the model generates multiple candidate answers for a given prompt; human evaluators rank these answers; the rankings are used to train a preference (reward) model that learns to score answers the way humans would; and the language model is then further fine-tuned against this preference model (a minimal sketch of the preference loss appears after the capability list below). This is why ChatGPT feels so useful. Compared to GPT-3, ChatGPT goes a step further and unlocks powerful capabilities:

  • Responding to human instructions: The output of GPT-3 generally just continues from the prompt; if the prompt is an instruction, GPT-3 may simply produce more instructions like it, whereas ChatGPT responds to the instruction effectively.

  • Code generation and understanding: The model has been trained on a large amount of code, allowing ChatGPT to generate high-quality, runnable code.

  • Complex reasoning with chain-of-thought: The initial GPT-3 model had weak or almost no reasoning ability; ChatGPT can work through problems step by step, which lets upper-layer applications become more powerful and accurate through prompt engineering.

  • Detailed responses: ChatGPT's responses are generally very detailed, so users must explicitly ask it to "answer in one sentence" to receive a more concise reply.

  • Fair responses: ChatGPT usually gives balanced answers on issues involving multiple interests, striving to satisfy everyone, and it refuses to answer inappropriate questions.

  • Refusal of questions outside its knowledge scope: For example, it refuses to answer questions about new events after June 2021 because it has not been trained on data beyond that point, and it declines to answer questions about things it has never seen in its training data.
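To make the ranking stage of RLHF described above more concrete, here is a minimal PyTorch sketch of the pairwise preference loss used to train a reward model. The "answer features" and the linear reward head are toy placeholders, not ChatGPT's actual architecture.

```python
# Minimal sketch of the pairwise preference (reward-model) loss used in RLHF,
# assuming PyTorch; the reward model here is a toy linear head over made-up features.
import torch
import torch.nn.functional as F

reward_model = torch.nn.Linear(16, 1)    # maps an answer representation to a scalar score

# Hypothetical feature vectors for human-preferred and rejected answers (8 comparison pairs).
preferred_features = torch.randn(8, 16)
rejected_features = torch.randn(8, 16)

r_preferred = reward_model(preferred_features)
r_rejected = reward_model(rejected_features)

# Ranking loss: push the preferred answer's score above the rejected answer's score.
loss = -F.logsigmoid(r_preferred - r_rejected).mean()
loss.backward()                           # gradients would update the reward model
print(float(loss))
```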

However, ChatGPT currently has some shortcomings:

  • Relatively weak mathematical abilities: ChatGPT may become confused or give inaccurate answers when solving complex mathematical problems or questions involving advanced mathematical concepts.

  • Occasional hallucinations: When answering questions about the real world, ChatGPT may provide false or inaccurate information. This may be because the model encountered inaccurate or misleading examples in its training data, leading to biased answers to certain questions.

  • Inability to update knowledge in real time: Unlike humans, ChatGPT cannot continuously learn to acquire the latest knowledge. This limits its usefulness in fields that require timely updates, such as news reporting or financial market analysis.

Fortunately, we can use Retrieval-Augmented Generation (RAG) to address the hallucination issue and the inability to update knowledge in real time. RAG is an application pattern that combines vector databases with LLMs; for an introduction to RAG and its optimization techniques, please refer to our other articles.
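For intuition, here is a minimal RAG sketch. It assumes the sentence-transformers library, uses a small in-memory list in place of a real vector database such as Milvus, and leaves the final LLM call as a placeholder.

```python
# Minimal RAG sketch: embed a knowledge base, retrieve the most relevant facts for a
# question, and build an augmented prompt. In production, the in-memory list would be
# replaced by a vector database such as Milvus, and the prompt sent to an LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

knowledge_base = [
    "Zilliz develops the Milvus vector database.",
    "RAG retrieves relevant documents and feeds them to an LLM as context.",
    "Word2Vec was proposed by Google in 2013.",
]
kb_vectors = encoder.encode(knowledge_base, normalize_embeddings=True)

question = "What does RAG do?"
q_vector = encoder.encode(question, normalize_embeddings=True)

# Retrieve the top-2 most similar documents by cosine similarity (vectors are normalized).
scores = kb_vectors @ q_vector
top_docs = [knowledge_base[i] for i in np.argsort(scores)[::-1][:2]]

# Build an augmented prompt: the retrieved facts ground the LLM's answer.
prompt = (
    "Answer using only the context below.\n\nContext:\n"
    + "\n".join(top_docs)
    + f"\n\nQuestion: {question}\nAnswer:"
)
print(prompt)  # this prompt would then be sent to the LLM
```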

03.

Conclusion

In this article, we started from Embedding and introduced the current mainstream models and applications in deep learning, especially in the NLP field. From early word embeddings to today's ChatGPT, the development of AI is accelerating. With continuous technological progress and ever richer data, we can expect even more powerful models to emerge. The applications of deep learning will become more widespread, no longer limited to natural language processing but expanding into areas such as vision and speech. We believe that with continuous breakthroughs in technology and the development of society, we will witness more exciting advancements and innovations in the future.

Author of this article

Zhang Chen

Zilliz Algorithm Engineer
