
Source: DeepHub IMBA
This article is approximately 2,300 words and takes about 10 minutes to read.
This article introduces how to run LLMs on a CPU with high performance using the llama.cpp library in Python.
Large language models (LLMs) are becoming increasingly popular, but running them requires a lot of computational resources, especially GPUs. Many researchers are working to address this limitation; HuggingFace, for example, has added support for loading models in 4-bit and 8-bit precision, but a GPU is still required. It is possible to run these LLMs directly on a CPU, but CPU performance has so far fallen short of what is needed in practice. Recently, Georgi Gerganov's work has made it possible to run LLMs on a CPU with high performance, thanks to his llama.cpp library, which provides high-speed inference for a variety of LLMs.
The original llama.cpp library focuses on running models locally from the shell, which gives users little flexibility and makes it hard to take advantage of the many Python libraries available for building applications. Recent developments, including LangChain integration, have made it possible to use llama.cpp from Python.
In this article, we will introduce the llama-cpp-python package, which lets you use the llama.cpp library from Python, and show how to run the Vicuna LLM with it.
Installing llama-cpp-python
The llama-cpp-python package can be installed with pip:
pip install llama-cpp-python
For more detailed installation instructions, please refer to the llama-cpp-python documentation: https://github.com/abetlen/llama-cpp-python#installation-from-pypi-recommended.
Using an LLM with llama-cpp-python
As long as the language model is converted to GGML format, it can be loaded and used by llama.cpp. Most popular LLMs have available GGML versions.
It is important to note that when the original LLM is converted to GGML format, it is also quantized. The benefit of quantization is that it reduces the memory needed to run these large models without significantly degrading performance: for example, a 7-billion-parameter model that takes 13 GB in its original form can be loaded in less than 4 GB of RAM.
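As a rough back-of-the-envelope check of those numbers, the sketch below compares 16-bit weights against a 4-bit quantized format. The ~4.5 bits-per-weight figure is an approximation assumed here to account for the per-block scale factors that quantized formats store; actual file sizes vary by format.

n_params = 7_000_000_000  # a 7B-parameter model

# 16-bit floating point: 2 bytes per parameter
fp16_gb = n_params * 2 / 1e9

# 4-bit quantization with per-block scales: assume ~4.5 bits per weight on average (approximation)
q4_gb = n_params * 4.5 / 8 / 1e9

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit quantized: ~{q4_gb:.1f} GB")
# fp16: ~14.0 GB, 4-bit quantized: ~3.9 GB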
In this article, we use the GGML version of Vicuna-7B, which can be downloaded from HuggingFace: https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized
Download GGML File and Load LLM
You can use the following code to download the model. This code also checks if the file already exists before attempting to download it.
import os
import urllib.request
def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")
# Downloading GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"
download_file(ggml_model_path, filename)
The next step is to load the model:
from llama_cpp import Llama
llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)
When loading the model, two important parameters should be set.
n_ctx: This parameter sets the maximum context size of the model. The default value is 512 tokens.
The context size is the total number of tokens in the input prompt and the maximum number of tokens the model can generate. Models with a smaller context size generate text much faster than those with a larger context size.
n_batch: This parameter sets the maximum number of prompt tokens to be processed during text generation. The default value is 512 tokens.
The n_batch parameter should be set carefully. Lowering n_batch helps speed up text generation on multi-threaded CPUs. However, too low a value may lead to a noticeable deterioration in text generation.
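Other constructor arguments can also matter for CPU performance; in particular, n_threads controls how many CPU threads are used for inference. Below is a minimal sketch of the same model load with an explicit thread count (the value 8 is purely illustrative):

from llama_cpp import Llama

llm = Llama(
    model_path="ggml-vicuna-7b-1.1-q4_1.bin",
    n_ctx=512,     # maximum context size (prompt + generated tokens)
    n_batch=126,   # prompt tokens processed per batch
    n_threads=8,   # CPU threads used for inference; tune to your physical core count
)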
Using LLM to Generate Text
The following code defines a simple wrapper function for generating text with the LLM.
def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text
The call to the llm object takes several important parameters:
prompt: The input prompt for the model. This text is tokenized and passed to the model.
max_tokens: The maximum number of tokens the model can generate, which controls the length of the generated text. The default value is 128 tokens.
temperature: The temperature, which ranges between 0 and 1. Higher values (e.g., 0.8) will make the output more random, while lower values (e.g., 0.2) will make the output more focused and deterministic. The default value is 1.
top_p: An alternative to temperature sampling called nucleus sampling, in which the model considers only the tokens that make up the top_p probability mass. For example, a value of 0.1 means only the tokens comprising the top 10% of probability mass are considered.
echo: Whether the model returns (echoes) the prompt at the beginning of the generated text.
stop: A list of strings used to stop text generation. If the model encounters any of these strings, text generation will stop at that token. This is used to control model hallucination and prevent the model from generating unnecessary text.
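Putting these parameters together, a direct call to the llm object could look like the sketch below; the Q:/A: prompt style and the stop strings are illustrative choices that mirror the example output shown next:

output = llm(
    "Q: Name the planets in the solar system? A:",
    max_tokens=64,        # cap the length of the completion
    temperature=0.2,      # lower temperature -> more focused, deterministic output
    top_p=0.5,            # nucleus sampling: keep only the top 50% probability mass
    echo=True,            # include the prompt at the start of the returned text
    stop=["Q:", "\n\n"],  # stop if the model starts a new question or a new section
)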
The llm object returns a dictionary object in the following format:
{ "id": "xxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", # text generation id "object": "text_completion", # object name "created": 1679561337, # time stamp "model": "./models/7B/ggml-model.bin", # model path "choices": [ { "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.", # generated text "index": 0, "logprobs": None, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 14, # Number of tokens present in the prompt "completion_tokens": 28, # Number of tokens present in the generated text "total_tokens": 42 } }
You can extract the generated text from this dictionary with output["choices"][0]["text"].
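Continuing from the call sketched above, a few lines are enough to pull out the useful fields; the key names follow the structure shown in the dictionary above:

generated_text = output["choices"][0]["text"].strip()
finish_reason = output["choices"][0]["finish_reason"]  # "stop" if a stop string was hit, "length" if max_tokens was reached
total_tokens = output["usage"]["total_tokens"]         # prompt tokens + generated tokens

print(generated_text)
print(f"finish_reason={finish_reason}, total_tokens={total_tokens}")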
Example Code to Generate Text Using Vicuna-7B
import os
import urllib.request
from llama_cpp import Llama
def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")
# Downloading GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"
download_file(ggml_model_path, filename)
llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)
def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text
generate_text(
    "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
    max_tokens=356,
)
The generated text is as follows:
Hawaii is a state located in the United States of America that is known for its beautiful beaches, lush landscapes, and rich culture. It is made up of six islands: Oahu, Maui, Kauai, Lanai, Molokai, and Hawaii (also known as the Big Island). Each island has its own unique attractions and experiences to offer visitors. One of the most interesting cultural experiences in Hawaii is visiting a traditional Hawaiian village or ahupuaa. An ahupuaa is a system of land use that was used by ancient Hawaiians to manage their resources sustainably. It consists of a coastal area, a freshwater stream, and the surrounding uplands and forests. Visitors can learn about this traditional way of life at the Polynesian Cultural Center in Oahu or by visiting a traditional Hawaiian village on one of the other islands. Another must-see attraction in Hawaii is the Pearl Harbor Memorial. This historic site commemorates the attack on Pearl Harbor on December 7, 1941, which led to the United States' entry into World War II. Visitors can see the USS Arizona Memorial, a memorial that sits above the sunken battleship USS Arizona and provides an overview of the attack. They can also visit other museums and exhibits on the site to learn more about this important event in American history. Hawaii is also known for its beautiful beaches and crystal clear waters, which are perfect for swimming, snorkeling, and sunbathing.
Summary
In this article, we introduced how to use the llama.cpp library and the llama-cpp-python package in Python. Together they enable high-performance LLM inference on the CPU.
llama.cpp is updated almost daily: inference keeps getting faster, and the community regularly adds support for new models. llama.cpp also includes a convert.py script that can help you convert your own PyTorch models to GGML format.
The llama.cpp library and the llama-cpp-python package provide a robust solution for running LLMs efficiently on a CPU. If you are interested in integrating LLMs into your applications, this package is well worth studying in depth.
Source code for this article: https://github.com/awinml/llama-cpp-python-bindings
Author of this article: Ashwin Mathur
Editor: Huang Jiyan