Comparing Mistral AI and Meta: Top Open Source LLMs

Source: Deephub Imba


This article is about 5,000 words long; a 10-minute read is recommended.


This article will compare Mistral 7B vs Llama 2 7B and Mixtral 8x7B vs Llama 2 70B.
Large language models (LLMs) typically improve performance by increasing model size, but a larger model also increases computational cost and inference latency, raising the barriers to deploying and using LLMs in real-world scenarios.
Mistral AI is a European company based in Paris that researches ways to improve model performance while reducing the computational resources needed to deploy LLMs in practical use cases. Mistral 7B is the smallest LLM they have created; it brings two concepts, Grouped-Query Attention (GQA) and Sliding Window Attention (SWA), to the traditional Transformer architecture. These components accelerate inference and reduce memory requirements during decoding, enabling higher throughput and the ability to handle longer token sequences.
Additionally, they created Mixtral 8x7B, which uses a Sparse Mixture of Experts (SMoE) layer that activates only 2 of the 8 available experts for each token, reducing the number of parameters used to process each token from 47B to roughly 13B and thereby cutting inference time.
In this article, we will explain in detail each new concept that Mistral AI has added to the traditional Transformer architecture and compare the inference times between Mistral 7B and Llama 2 7B. In addition, we will also compare memory, inference time, and response quality between Mixtral 8x7B and Llama 2 70B.

GQA: Grouped-Query Attention

Autoregressive decoder inference is a bottleneck for Transformers because loading all the query, key, and value heads of the multi-head attention (MHA) layer at every step requires significant memory bandwidth. Multi-Query Attention (MQA) was developed to overcome this problem: it reduces the required memory by using a single key and value head with multiple query heads. However, this can lead to quality degradation and training instability, which is why open-source LLMs such as T5 and Llama opted not to use it.
GQA strikes a balance between MHA and MQA by partitioning the query heads into G groups (GQA-G), each of which shares a single key head and value head. GQA-1, where all queries belong to one group, is equivalent to MQA, while GQA-H (with H = number of heads) corresponds to MHA, where each query head forms its own group.
Reducing the number of key and value heads to one per query group shrinks the size of the cached keys and values, and therefore the amount of data that must be loaded at each decoding step. This more moderate reduction (compared to MQA) accelerates inference and lowers memory requirements during decoding, with quality close to MHA and speed nearly matching MQA.
Mistral 7B has 32 query heads and 8 key-value heads, so the query heads are divided into 8 groups of 4, with each group sharing one key-value head.
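To make the grouping concrete, here is a minimal sketch of grouped-query attention in PyTorch. The head counts follow Mistral 7B's reported layout (32 query heads, 8 key-value heads); the tensor shapes, random inputs, and variable names are purely illustrative and are not Mistral's actual implementation.

import torch
import torch.nn.functional as F

batch, seq_len, d_model = 1, 16, 4096
n_q_heads, n_kv_heads = 32, 8            # 8 groups of 4 query heads share one KV head
head_dim = d_model // n_q_heads          # 128
group_size = n_q_heads // n_kv_heads     # 4

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# broadcast each key/value head to the query heads in its group
k = k.repeat_interleave(group_size, dim=1)   # (batch, 32, seq_len, head_dim)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
out = F.softmax(scores, dim=-1) @ v          # (batch, 32, seq_len, head_dim)

The key point is that only the 8 distinct key-value heads need to be cached during decoding, which is where the memory saving comes from.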

SWA: Sliding Window Attention

Most Transformers use the vanilla attention mechanism, in which every token can attend to itself and all previous tokens. The number of operations therefore grows quadratically with the sequence length, and memory grows linearly with the number of tokens, which at inference time leads to higher latency and lower throughput due to reduced cache availability.
SWA is designed to alleviate these issues: it exploits the stacked attention layers to attend to information beyond a window of size W. Each hidden state h at position i of layer k can attend to all hidden states of the previous layer from position i-W to i, so, recursively, hidden states can access tokens from the input layer up to W × k positions away. Mistral 7B has 32 layers and a window size of 4096, giving a theoretical attention span of roughly 131K tokens.
To better understand how SWA works, imagine the following scenario where our input prompt is:
Mixtral 8x7B is a Large Language Model designed to deliver high performance while maintaining efficiency at inference time …
If the window size is 3 (W = 3), then at layer 6 (k = 6) and position 16 (i = 16) we attend to the token “at” (the 16th token of the prompt above) plus the last 3 tokens from layer 5. Because the process is recursive, layer 6 also has access to information beyond W = 3: layer 5 can attend to the last 3 tokens of layer 4, and layer 4 to the last 3 tokens of layer 3. In this way, tokens outside the sliding window still influence the prediction of the next word. A small sketch of the corresponding attention mask follows.
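As an illustration (not Mistral's code), the sliding window can be expressed as an additive attention mask in PyTorch, where position i may attend to positions i-W through i and everything earlier is masked out; the window size and sequence length below are arbitrary.

import torch

seq_len, W = 8, 3
idx = torch.arange(seq_len)
# allowed[i, j] is True when j <= i (causal) and i - j <= W (inside the window)
allowed = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] <= W)
mask = torch.zeros(seq_len, seq_len)
mask.masked_fill_(~allowed, float("-inf"))   # added to attention scores before softmax
print(mask)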
Since attention is restricted to the window, the cache only needs to hold a fixed number of entries W, so the authors use a rolling buffer cache that overwrites past values and stops the cache from growing linearly. The keys and values for timestep i are stored at cache position i mod W; once i exceeds W, older entries are overwritten by new tokens (conceptually, a FIFO buffer).
Returning to the earlier example with a window size of 3: when the model generates the fourth token, the cache slot holding the first token is overwritten, as the sketch below illustrates.
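Below is a minimal sketch of a rolling buffer key-value cache with the i mod W indexing described above; the window size, head counts, and function name are illustrative assumptions, not Mistral's actual code.

import torch

W, n_kv_heads, head_dim = 3, 8, 128
k_cache = torch.zeros(W, n_kv_heads, head_dim)
v_cache = torch.zeros(W, n_kv_heads, head_dim)

def update_cache(i: int, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
    """Store the key/value of timestep i in slot i mod W, overwriting the oldest entry."""
    slot = i % W
    k_cache[slot] = k_new
    v_cache[slot] = v_new

# generating the 4th token (i = 3) overwrites the slot that held the 1st token (i = 0)
for i in range(4):
    update_cache(i, torch.randn(n_kv_heads, head_dim), torch.randn(n_kv_heads, head_dim))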
The last memory optimization in SWA relies on pre-filling and chunking: the authors split very long prompts into chunks of size W and pre-fill the key-value cache chunk by chunk to bound memory usage. When processing a chunk (here of size W = 3), the model can attend to the current chunk and to the cached tokens that fall inside the sliding window, but not to older tokens, which lie outside the window.
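The chunking idea can be sketched as follows; model_prefill is a hypothetical stand-in for a forward pass that consumes one chunk and updates the rolling cache, and the token ids are placeholders.

# split a long prompt into chunks of size W and pre-fill the cache chunk by chunk
W = 4096
prompt_tokens = list(range(10_000))          # placeholder token ids

def model_prefill(chunk, kv_cache):
    # hypothetical forward pass: attends to `chunk` plus the cached window,
    # then keeps only the last W entries (the rolling buffer)
    kv_cache = kv_cache + chunk
    return kv_cache[-W:]

kv_cache = []
for start in range(0, len(prompt_tokens), W):
    kv_cache = model_prefill(prompt_tokens[start:start + W], kv_cache)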

SMoE: Sparse Mixture of Experts

Mixture of Experts (MoE) breaks the traditional notion of linear data processing through consecutive layers by introducing the concept of expert networks (typically feedforward neural networks), where each expert network specializes in handling specific tasks or types of data.
This architecture improves training efficiency because only the FFN layers are treated as separate experts while the remaining model parameters are shared. For instance, Mixtral 8x7B has 47B parameters rather than 8 × 7B = 56B, so it can be pre-trained with fewer computational resources than a dense 56B model. It also brings benefits at inference time: only 2 experts are activated per token, so only about 13B parameters are used, making it faster than a comparable dense model.
MoE has two main components:
Replacing FFN layers with Sparse Mixture of Experts layers: in Mixtral 8x7B each SMoE layer contains 8 experts, and each expert ends up handling particular kinds of tokens. For example, one expert might specialize in punctuation, another in visual descriptions, another in numbers; no single expert is omnipotent.
Gate or routing network: determines which tokens are sent to which experts; this network is pre-trained simultaneously with the rest of the network, learning how to assign tokens to the experts that can handle them best.
For the routing network, using only the softmax function may lead to an imbalance in load distribution among experts, so the authors proposed a noisy top-k gating function, adding adjustable Gaussian noise and sparsity before the softmax gating.
To briefly explain how top-k gating works: suppose we want each token to be routed to its top 2 experts (k = 2). The gating logits are transformed so that the top 2 values are kept and the rest are set to -∞. This sparsity saves computation, because the softmax of -∞ is 0 and the corresponding experts are simply not activated. Finally, a softmax over the remaining values gives the weight of each selected expert for the input token, and these weights define the experts' contribution to the final output. A sketch of this gating step follows.
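Here is a minimal sketch of noisy top-k gating in PyTorch, following the formulation of Shazeer et al. (2017); the dimensions, initialization, and single-token input are illustrative assumptions rather than Mixtral's actual code.

import torch
import torch.nn.functional as F

n_experts, d_model, k = 8, 4096, 2
w_gate = torch.randn(d_model, n_experts) * 0.02
w_noise = torch.randn(d_model, n_experts) * 0.02

x = torch.randn(1, d_model)                       # one token embedding
clean_logits = x @ w_gate
noisy_logits = clean_logits + torch.randn_like(clean_logits) * F.softplus(x @ w_noise)

# keep the top-k logits and set the rest to -inf so their softmax weight becomes 0
topk_vals, topk_idx = noisy_logits.topk(k, dim=-1)
sparse_logits = torch.full_like(noisy_logits, float("-inf"))
sparse_logits.scatter_(-1, topk_idx, topk_vals)

gates = F.softmax(sparse_logits, dim=-1)          # weights of the 2 selected experts
print(topk_idx, gates)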
For example, in the prompt above, the first token “Mixtral” passes through the routing network and activates only 2 of the 8 experts, saving time during inference and computational resources during training, since the token is processed by 2 expert FFNs rather than the single, much larger FFN of an equivalently sized dense model.
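Continuing the illustration, the sketch below shows how the outputs of the two selected experts are combined using the gate weights; the tiny expert FFNs, sizes, and gate layer are stand-ins, not Mixtral's real modules.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, n_experts, k = 64, 256, 8, 2
experts = nn.ModuleList(
    nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
    for _ in range(n_experts)
)
gate = nn.Linear(d_model, n_experts, bias=False)

x = torch.randn(1, d_model)                      # one token, e.g. "Mixtral"
topk_vals, topk_idx = gate(x).topk(k, dim=-1)
weights = F.softmax(topk_vals, dim=-1)           # weights of the 2 chosen experts

# only the 2 selected expert FFNs run; the other 6 stay idle for this token
y = sum(weights[0, j] * experts[int(topk_idx[0, j])](x) for j in range(k))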

Comparing Mistral AI vs Meta: Mistral 7B vs Llama 2 7B and Mixtral 8x7B vs Llama 2 70B

After introducing Mistral’s improvements, we will begin the comparison. We will create four RAG systems, with the difference between the systems being the generative models, where we will use Mistral 7B, Llama 2 7B, Mixtral 8x7B, and Llama 2 70B. We will compare the performance of Mistral 7B with Llama 2 7B in terms of inference time, as well as Mixtral 8x7B with Llama 2 70B in terms of inference time, memory, and response quality.
First, we establish a PGVector database to support semantic search for context retrieval.
postgres.env

POSTGRES_DB=postgres
POSTGRES_USER=admin
POSTGRES_PASSWORD=root
docker-compose.yaml

version: '3.8'
services:
  postgres:
    container_name: container-pg
    image: ankane/pgvector
    hostname: localhost
    ports:
      - "5432:5432"
    env_file:
      - ./env/postgres.env
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

volumes:
  postgres-data:
We run the command docker-compose up -d, and the PGVector database is ready.
The database is populated with customer reviews for the first 10 products as follows:
We embed the reviews with sentence-transformers/multi-qa-mpnet-base-dot-v1 and use LangChain to store them in PGVector. We first create a new column, full_review, that concatenates each review's title and text, then loop over the 10 product IDs, convert the reviews to Documents (the format LangChain expects), and store them in PGVector.
from encoder.encoder import Encoder
from retriever.vector_db import VectorDatabase
from langchain.docstore.document import Document
import pandas as pd

encoder = Encoder()
vectordb = VectorDatabase(encoder.encoder)

df = pd.read_csv('data/data.csv')

# create new column that concatenates title and review
df['full_review'] = df[['reviews.title', 'reviews.text']].apply(
    lambda row: ". ".join(row.values.astype(str)), axis=1
)

for product_id in df['asins'].unique()[:10]:
    # create documents to store in Postgres
    docs = [
        Document(page_content=item)
        for item in df[df['asins'] == product_id]["full_review"].tolist()
    ]
    passages = vectordb.create_passages_from_documents(docs)
    vectordb.store_passages_db(passages, product_id)
The PGVector connection settings must also be defined in an env file. Create it under env/ with the following variables:
 DRIVER=psycopg2 
HOST=localhost 
PORT=5432 
DATABASE=postgres 
USERNAME=admin 
PASSWORD=root

Now we will create 20 queries, 2 for each product, asking the LLM, “What do people like about the product?” and “What do people dislike about the product?” Before sending the questions to the LLM, we retrieve context from the vector database to help guide the answers.
To retrieve the correct context for each product, we need to send the query along with the product ID so that we can fetch the correct data from the table. By retrieving context in advance, we can ensure that both models receive the same information, making the comparison fairer.
 # generate 2 questions for each product id (20 questions in total)
like_questions = [f"{product_id}|What people like about the product?" for product_id in df["asins"].unique()[:10]]
dislike_questions = [f"{product_id}|What people dislike about the product?" for product_id in df["asins"].unique()[:10]]

# retrieve query and context to give to llama and mistral
QUERIES = []
CONTEXTS = []
for q in like_questions+dislike_questions:    
    id = q.split("|")[0]
    query = q.split("|")[1]
    context = vectordb.retrieve_most_similar_document(query, k=2, id=id)
    QUERIES.append(query)
    CONTEXTS.append(context)
We now have the questions and contexts, so we can pass them to the LLMs, recording how many words each model generates per second and the average length of its answers.
We download all the models in .gguf format so that we can run them on CPU.
We use mistral-7b-v0.1.Q4_K_M and nous-hermes-llama-2-7b.Q4_K_M: with 4-bit quantization, Mistral 7B requires 6.87 GB of RAM and Llama 2 7B requires 6.58 GB.
mixtral-8x7b-v0.1.Q4_K_M and llama-2-70b-chat.Q4_K_M also use 4-bit quantization; Mixtral 8x7B requires 28.94 GB of RAM, while Llama 2 70B requires 43.92 GB.
We then import the Generator class, which takes the model we want to use as a parameter.
 from generator.generator import Generator

llama = Generator(model='llama')
mistral = Generator(model='mistral')
llama70b = Generator(model='llama70b')
mixtral8x7b = Generator(model='mixtral8x7b')
This class is responsible for loading the model parameters defined in the configuration. The YAML file sets context_length to 1024, temperature to 0.7, and max_tokens to 2000.
generator:
  llama:
    llm_path: "model/nous-hermes-llama-2-7b.Q4_K_M.gguf"
  mistral:
    llm_path: "model/mistral-7b-v0.1.Q4_K_M.gguf"
  llama70b:
    llm_path: "model/llama-2-70b.Q4_K_M.gguf"
  mixtral8x7b:
    llm_path: "model/mixtral-8x7b-v0.1.Q4_K_M.gguf"
  context_length: 1024
  temperature: 0.7
  max_tokens: 2000
It also creates the prompt template and formats the query and context according to that template before passing them to the LLM to generate a response.
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import LlamaCpp
from base.config import Config


class Generator(Config):
    """Generator, aka LLM, to provide an answer based on some question and context"""

    def __init__(self, model) -> None:
        super().__init__()
        # template
        self.template = """
            Use the following pieces of context to answer the question at the end.
            {context}
            Question: {question}
            Answer:
        """
        # load llm from local file
        self.llm = LlamaCpp(
            model_path=f"{self.parent_path}/{self.config['generator'][model]['llm_path']}",
            n_ctx=self.config["generator"]["context_length"],
            temperature=self.config["generator"]["temperature"],
        )
        # create prompt template
        self.prompt = PromptTemplate(
            template=self.template, input_variables=["context", "question"]
        )

    def get_answer(self, context: str, question: str) -> str:
        """
        Get the answer from llm based on context and user's question
        Args:
            context (str): most similar document retrieved
            question (str): user's question
        Returns:
            str: llm answer
        """
        query_llm = LLMChain(
            llm=self.llm,
            prompt=self.prompt,
            llm_kwargs={"max_tokens": self.config["generator"]["max_tokens"]},
        )
        return query_llm.run({"context": context, "question": question})
Now we iterate through the questions and contexts, recording the metrics mentioned above, as sketched below.
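A minimal sketch of this measurement loop follows; it reuses the QUERIES, CONTEXTS, and Generator objects defined above, and the helper name benchmark is our own (the author's exact script may differ).

import time

def benchmark(generator, queries, contexts):
    """Return words-per-second and answer-length lists for one model."""
    words_per_second, answer_lengths = [], []
    for query, context in zip(queries, contexts):
        start = time.time()
        answer = generator.get_answer(context, query)
        elapsed = time.time() - start
        n_words = len(answer.split())
        words_per_second.append(n_words / elapsed)
        answer_lengths.append(n_words)
    return words_per_second, answer_lengths

mistral_wps, mistral_len = benchmark(mistral, QUERIES, CONTEXTS)
llama_wps, llama_len = benchmark(llama, QUERIES, CONTEXTS)
# the same loop is repeated for mixtral8x7b and llama70b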
The resulting metrics show that Mistral 7B is significantly faster than Llama 2 7B, generating about 1.5 words per second on average versus roughly 0.8 words per second for Llama 2 7B. Mistral 7B's answers are also more complete, with an average length of 248 words, while Llama 2 7B produces answers of only about 75 words.
In terms of time, Mixtral 8x7B takes about 3 minutes, while Llama 2 70B takes about 10 minutes.
For memory utilization, Mixtral 8x7B has 47B parameters while Llama 2 70B has 70B, so we might expect Mixtral's memory usage to be about 67% of Llama 2's; thanks to the parameters shared across the SMoE experts, its measured memory utilization comes out to only about 62.5%.

Conclusion

LLMs have made tremendous advancements in the past two years, making it possible to obtain high-quality responses, and it is often difficult to distinguish who wrote these responses, whether human or machine. Current research is shifting its focus from generating high-quality responses to creating LLMs as small as possible to run on less resource-intensive devices, saving costs and making them more accessible.
Mistral is one of the companies actively researching in this field, and as we have seen, they have achieved very good results. Their smallest model, Mistral 7B, improves memory efficiency during decoding and cuts inference time roughly in half compared with Llama 2 7B.
For Mixtral 8x7B, in addition to GQA and SWA, they introduced a third concept, the Sparse Mixture of Experts, which further improves training and inference efficiency: during inference, not all 47B parameters are used to process each token, only about 13B. When we compare the answers of Mixtral 8x7B and Llama 2 70B, we see that Mixtral produces answers as good as Llama 2's while taking about 70% less time and using only about 62.5% of the memory.
It is conceivable that LLMs will continue to develop further in 2024.

References

[1] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mixtral of Experts. arXiv:2401.04088, 2024.

[2] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed. Mistral 7B. arXiv:2310.06825, 2023.

[3] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245, 2023.

[4] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.

[5] Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, Ziwei Liu. Sparse Mixture-of-Experts are Domain Generalizable Learners. arXiv:2206.04046, 2023.

[6] Noam Shazeer. Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150, 2019.

[7] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538, 2017.

Editor: Wang Jing

