1. What is Ollama
Ollama is a lightweight, easy-to-use framework for running large language models locally. It lets users quickly run large models on their own computers, and most of its code is written in Go (Golang).
Project address: https://github.com/ollama/ollama
Official website: https://ollama.com/
2. Why Ollama Exists
The existence of Ollama can be traced back to Llama and llama.cpp.
Llama is an open-source series of large language models released by Meta AI; its full name is Large Language Model Meta AI. The word "llama" itself refers to the South American llama, which is why the community nicknamed this series the "llama" models.
Llama 1 is divided into four models by parameter size: Llama1-7B, Llama1-13B, Llama1-30B, and Llama1-65B. In July 2023, Meta AI released Llama 2, which comes in three sizes: Llama2-7B, Llama2-13B, and Llama2-70B.
Here, B is an abbreviation for billion, referring to the scale of model parameters. The smallest model, 7B, contains 7 billion parameters, while the largest model, 65B, contains 65 billion parameters.
A benchmark comparison of the Llama-series models is available in the official model card[1].

Obtaining the Llama model weights requires filling out an online form[2] and agreeing to Meta AI’s license; they then send you the download link.

After this brief introduction to Llama, let’s turn our attention to llama.cpp. llama.cpp[3] started out as a pure C/C++ implementation of Llama inference (“Inference of Meta’s LLaMA model (and others) in pure C/C++”), but it now supports other large models as well, such as Mistral 7B and Mixtral MoE.
The benefit of llama.cpp is that it lowers the cost of running large language models (LLMs) and lets them run, and run faster, on a wide range of hardware, including devices without a GPU; machines with only a CPU can run them too!
llama.cpp has the following features:
- Wide Hardware Support: llama.cpp is implemented in pure C/C++, so it runs on many hardware platforms. Currently supported platforms include macOS, Linux, Windows (via CMake), Docker, and FreeBSD.
- Simplified Setup and Operation: llama.cpp aims to minimize setup so users can perform LLM inference locally and in the cloud with state-of-the-art performance. Its dependency-free pure C/C++ implementation keeps installation and usage simple.
- Efficient Performance: by supporting quantization at different bit widths (such as 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit), llama.cpp speeds up inference and reduces memory usage, making it possible to run large models on resource-limited devices.
- Architecture-Specific Optimizations: llama.cpp includes specific optimizations for Apple Silicon and supports AVX, AVX2, and AVX512 on x86, further improving model performance on these architectures.
It can be said that llama.cpp has promoted the rapid development of local large models!
💡 Supplementary Knowledge: What is Quantization?
To understand quantization, we need to start with the parameters of large models. The parameters of large models are typically stored as floating-point numbers, with most large models using 32-bit floating-point numbers, also known as single precision. However, some layers of the model may also adopt different precision levels, such as 16 bits.
Let’s do some simple arithmetic: 8 bits = 1 byte, so 32 bits = 4 bytes.
Consider a 7B model, which has 7 billion parameters. Assuming this large model uses 32-bit floating-point storage, the total memory would be: 4 bytes * 7 billion = 28 billion bytes, which is approximately 28GB.
This means that using this model requires 28 GB of memory capacity.
For a personal computer, however, 28 GB of memory is hard to come by: most machines today have 8-16 GB.
Quantization is the process of reducing the precision of large model parameters, which can also be simply understood as compressing the model!
For example, 4-bit quantization converts the 32-bit floating-point weights into 4-bit integers. This may slightly affect the model’s quality, but it greatly reduces the memory needed to store the model!
If we apply 4-bit quantization to the model above, the quantized model takes: 4/8 = 0.5 bytes * 7 billion = 3.5 billion bytes, which is approximately 3.5 GB!
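To make the arithmetic concrete, here is a small back-of-the-envelope sketch in Python (weights only; it ignores runtime overhead such as activations and the KV cache, and the function name is just for illustration):

def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return num_params * bytes_per_param / 1e9

params_7b = 7e9  # a "7B" model has roughly 7 billion parameters

print(f"32-bit (FP32): {weight_memory_gb(params_7b, 32):.1f} GB")  # -> 28.0 GB
print(f"16-bit (FP16): {weight_memory_gb(params_7b, 16):.1f} GB")  # -> 14.0 GB
print(f" 4-bit (INT4): {weight_memory_gb(params_7b, 4):.1f} GB")   # -> 3.5 GB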
Let’s finally turn our attention back to Ollama. Ollama exists because using llama.cpp directly can be quite troublesome: you need to obtain the model weights, clone the project code, quantize the model, set environment variables, build the executables, and so on. Ollama was created to simplify the deployment of llama.cpp, and it provides an API and a chat interface similar to OpenAI’s, making it easy to use different models.
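As a side note, recent Ollama versions also expose an OpenAI-compatible endpoint at /v1, so existing OpenAI client code can often be pointed at a local model. Here is a minimal sketch under that assumption (the openai Python package, the placeholder api_key, and the llama2-chinese model used later in this article are all illustrative, not prescribed by Ollama):

from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at Ollama's local OpenAI-compatible API.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # the client requires a key, but Ollama ignores it
)

reply = client.chat.completions.create(
    model="llama2-chinese",  # any model you have already pulled locally
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(reply.choices[0].message.content)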
3. Running Ollama
3.1 Installation
On a Mac, download the corresponding installation package[4]; on Linux, just execute the following command:
curl -fsSL https://ollama.com/install.sh | sh
Note that install.sh is intended for Linux only; running it elsewhere does no harm, though, since the script checks the operating system.
Whether you use the installation package or install.sh, the installer essentially places the ollama binary on your system and starts the ollama service.

After installation, use the version command to verify whether the installation was successful.
$ ollama --version
ollama version is 0.1.30
If you installed via Docker, the ollama service may not be running on the host, in which case the version command prints the following:
$ ollama --version
Warning: could not connect to a running Ollama instance
Warning: client version is 0.1.30
3.2 Running
Let’s take the llama2-chinese model as an example. It is a Chinese fine-tune of Llama 2 Chat, the open-source dialogue model released by Meta AI. Because Llama 2’s Chinese alignment is relatively weak, the developers fine-tuned it on a Chinese instruction set, giving it strong Chinese dialogue capabilities. This fine-tuned model is currently available in 7B and 13B sizes.
$ ollama run llama2-chinese
The default latest tag downloads the 7B model. To download the 13B model instead, change the command to:
$ ollama run llama2-chinese:13b
The download is quite fast.

Other models can be found in the Ollama Library[5].
3.3 Testing
Running the ollama run llama2-chinese command starts an interactive prompt where you can type your input.

You can also call it via the API.
curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2-chinese",
  "prompt": "Why is the sky blue?"
}'

Finally, for Ollama’s full API capabilities, the official documentation is available here: https://github.com/ollama/ollama/blob/main/docs/api.md
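If you prefer calling the API from code rather than curl, the following is a minimal Python sketch of the same generate call (it assumes the requests package is installed and the ollama service is running on the default port 11434; by default the endpoint streams its answer as one JSON object per line):

import json
import requests  # pip install requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2-chinese", "prompt": "Why is the sky blue?"},
    stream=True,  # read the line-by-line JSON stream as it arrives
)
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk.get("response", ""), end="", flush=True)  # partial text
    if chunk.get("done"):  # the final object has "done": true
        print()
        break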
Various interface tools built on Ollama can also be found on its project page; I won’t elaborate further here.
If this article helps you, please give it a thumbs up to support me; your likes are my greatest motivation for updates!
[1] Benchmark test of llama series models: https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md
[2] Online form: https://llama.meta.com/llama-downloads
[3] llama.cpp: https://github.com/ggerganov/llama.cpp
[4] Corresponding installation package: https://ollama.com/download/Ollama-darwin.zip
[5] Ollama Library: https://ollama.com/library