Ollama: A Powerful Tool for Local Large Model Building

1. What is Ollama

Ollama is a lightweight, easy-to-use framework for running large models locally, letting users get a large model up and running on their own computer with very little effort. Most of its code is written in Go (Golang).

Project address: https://github.com/ollama/ollama

Official project: https://ollama.com/

2. Why Ollama Exists

The existence of Ollama can be traced back to Llama and llama.cpp.

Llama is an open-source series of large language models released by Meta AI; the name is short for Large Language Model Meta AI. Since the word Llama itself refers to the South American llama, the community affectionately nicknamed this family of models the "llama" series.

Llama 1 comes in four sizes by parameter count: Llama1-7B, Llama1-13B, Llama1-30B, and Llama1-65B. In July 2023, Meta AI released Llama 2, which comes in three sizes: Llama2-7B, Llama2-13B, and Llama2-70B.

Here, B is an abbreviation for billion, referring to the scale of model parameters. The smallest model, 7B, contains 7 billion parameters, while the largest model, 65B, contains 65 billion parameters.

A benchmark comparison of the Llama series models can be found in the official model card[1].

Obtaining the Llama large model requires filling out an online form[2] and agreeing to Meta AI’s agreement before receiving the download link they send.

After briefly introducing Llama, let’s turn our attention to llama.cpp.

llama.cpp[3] began as a pure C/C++ implementation of Llama inference (its tagline: "Inference of Meta's LLaMA model (and others) in pure C/C++"), and it now supports many other large models as well, such as Mistral 7B and Mixtral MoE.

The benefit of llama.cpp is that it lowers the cost of running large language models (LLMs) and lets them run (and run faster) on a wide range of hardware, including devices without a GPU. In other words, a machine with only a CPU can run these models too!

llama.cpp has the following features:

  • Wide Hardware Support: llama.cpp is implemented in pure C/C++, meaning it can run on various hardware platforms. Currently supported platforms include: Mac OS, Linux, Windows (via CMake), Docker, and FreeBSD.
  • Simplified Setup and Operation: The goal of llama.cpp is to minimize setup, allowing users to perform LLM inference locally and in the cloud with state-of-the-art performance. It simplifies the installation and usage process by providing a dependency-free pure C/C++ implementation.
  • Efficient Performance: By supporting different bit-width quantization (such as 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit), llama.cpp can speed up inference and reduce memory usage. This makes it possible to run large models on resource-limited devices.
  • Architecture-Specific Optimizations: llama.cpp provides Apple Silicon with specific optimizations and supports AVX, AVX2, and AVX512 for x86 architecture, further improving model performance on these architectures.

It can be said that llama.cpp has promoted the rapid development of local large models!

💡 Supplementary Knowledge: What is Quantization?

To understand quantization, we need to start with the parameters of large models. The parameters of large models are typically stored as floating-point numbers, with most large models using 32-bit floating-point numbers, also known as single precision. However, some layers of the model may also adopt different precision levels, such as 16 bits.

Let’s do some quick arithmetic: 8 bits = 1 byte, so 32 bits = 4 bytes.

Consider a 7B model, which has 7 billion parameters. Assuming this large model uses 32-bit floating-point storage, the total memory would be: 4 bytes * 7 billion = 28 billion bytes, which is approximately 28GB.

This means that using this model requires 28 GB of memory capacity.

However, very few personal computers have 28 GB of memory; most currently ship with 8 to 16 GB.

Quantization is the process of reducing the precision of large model parameters, which can also be simply understood as compressing the model!

After quantization, the 32-bit floating-point parameters are converted to lower-precision values, for example 4-bit integers. Although this may slightly degrade the model’s quality, it greatly reduces the memory needed to store the model!

If we apply 4-bit quantization to the model above, its size becomes: 0.5 bytes (4 bits) * 7 billion = 3.5 billion bytes, which is approximately 3.5 GB!
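
If you want to reproduce this back-of-the-envelope arithmetic yourself, here is a minimal sketch using awk; the 7-billion parameter count and the bit widths are simply the assumptions used above:

# Rough memory estimate for a 7-billion-parameter model at different precisions
awk 'BEGIN {
  params = 7e9                                            # 7B parameters (assumption from the text)
  printf "32-bit (FP32): %.1f GB\n", params * 4   / 1e9   # 4 bytes per parameter  -> ~28 GB
  printf "16-bit (FP16): %.1f GB\n", params * 2   / 1e9   # 2 bytes per parameter  -> ~14 GB
  printf " 4-bit (Q4):   %.1f GB\n", params * 0.5 / 1e9   # 0.5 bytes per parameter -> ~3.5 GB
}'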

Finally, let’s turn our attention back to Ollama. The reason Ollama exists is that using llama.cpp directly can be quite troublesome: you need to obtain the model weights, clone the project code, quantize the model, set environment variables, build the executables, and so on.
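
To get a feel for what that involves, here is a rough sketch of the manual llama.cpp workflow. The exact script and binary names vary between llama.cpp versions, and the model path is only a placeholder, so treat this as an illustration rather than a copy-paste recipe:

# 1. Clone and build llama.cpp (assumes a C/C++ toolchain is installed)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# 2. Convert the downloaded Llama weights to GGUF and quantize them to 4 bits
#    (script and binary names differ across llama.cpp versions)
python3 convert.py /path/to/llama-model-dir
./quantize /path/to/llama-model-dir/ggml-model-f16.gguf ./model-q4_0.gguf q4_0

# 3. Run inference against the quantized model
./main -m ./model-q4_0.gguf -p "Why is the sky blue?"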

Ollama was created to simplify the deployment of llama.cpp and provides an API interface and chat interface similar to OpenAI, making it easy to use different models.

3. Running Ollama

3.1 Installation

If you are on macOS, download the corresponding installation package[4]; if you are on Linux, just run the following command:

curl -fsSL https://ollama.com/install.sh | sh

Note that install.sh is intended only for Linux. Running it on another system does no harm, though, because the script detects the operating system before doing anything.

Whether you use the installation package or install.sh, the result is the same: the ollama binary is installed on the system and the ollama service is started.
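
If you want to check or manage the service yourself, the following commands are useful. The systemctl line assumes a Linux machine where install.sh has registered a systemd service; on other setups you can simply start the server in the foreground:

# Check the background service on a systemd-based Linux install
systemctl status ollama

# Or run the server manually in the foreground (any platform)
ollama serve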

After installation, use the version command to verify whether the installation was successful.

$ ollama --version
ollama version is 0.1.30

If you installed with Docker and the ollama service is not running on the host, the version command prints a warning like this:

$ ollama --version
Warning: could not connect to a running Ollama instance
Warning: client version is 0.1.30

3.2 Running

Let’s take the llama2-chinese model as an example. It is a Chinese dialogue model fine-tuned from Llama 2 Chat, the open-source model released by Meta AI.

Since Llama 2’s Chinese alignment is relatively weak, the developers fine-tuned it on a Chinese instruction set, giving it strong Chinese dialogue capabilities. This Chinese fine-tuned model is currently available in 7B and 13B sizes.

$ ollama run llama2-chinese

By default, the latest tag pulls the 7B model. If you want the 13B model instead, change the command to:

$ ollama run llama2-chinese:13b

The download is quite fast.
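
Besides run, a few other subcommands are handy for managing the models you have downloaded locally, for example:

# List the models that are already downloaded
ollama list

# Download a model without immediately starting a chat session
ollama pull llama2-chinese:13b

# Remove a model you no longer need
ollama rm llama2-chinese:13b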

Other models can be found in the Ollama Library[5].

3.3 Testing

Executing the ollama run llama2-chinese command starts an interactive prompt where you can type your input.

You can also call the model through the HTTP API:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2-chinese",
  "prompt":"Why is the sky blue?"
}'

Finally, Ollama’s API capabilities are documented in the official Markdown document: https://github.com/ollama/ollama/blob/main/docs/api.md
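
For multi-turn conversations, that document also describes a chat endpoint. A minimal sketch, with streaming disabled so the reply comes back as a single JSON object, looks roughly like this:

curl -X POST http://localhost:11434/api/chat -d '{
  "model": "llama2-chinese",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": false
}'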

Various UI and integration tools built around Ollama are also listed in the official repository; I won’t go into them here.

If this article helps you, please give it a thumbs up to support me; your likes are my greatest motivation for updates!

References

[1] Benchmark test of Llama series models: https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md
[2] Online form: https://llama.meta.com/llama-downloads
[3] llama.cpp: https://github.com/ggerganov/llama.cpp
[4] Corresponding installation package: https://ollama.com/download/Ollama-darwin.zip
[5] Ollama Library: https://ollama.com/library
