1. Introduction
In today’s field of artificial intelligence, multimodal large models are being applied to robotics ever more widely. This article introduces how to convert a multimodal large model to the gguf format and quantize it for efficient deployment on the ollama platform. Through this process, we obtain faster model execution and lower resource consumption.
2. Principle Introduction
The conversion and quantization of multimodal large models aim to optimize model performance and resource usage. Mainstream approaches convert models from the safetensors format to the .bin or gguf format. We choose the gguf format because it offers better compatibility and performance on the ollama platform.
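To make the starting point concrete, the following is a minimal sketch that opens a safetensors checkpoint with the safetensors library and lists the tensors it contains; the shard file name is a hypothetical placeholder, not a file from this project.
import torch
from safetensors import safe_open

# Inspect one shard of a safetensors checkpoint (file name is a placeholder)
checkpoint = "model-00001-of-00002.safetensors"
with safe_open(checkpoint, framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)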
3. Hardware Introduction for Implementation and Deployment
1. Hardware for Model Format Conversion and Quantization
During the implementation, we used high-performance server hardware, such as dual RTX 4090 graphics cards, to ensure sufficient computing power and memory during the conversion and quantization processes. In addition, support from the ollama platform ensures that the model runs efficiently.
2. Hardware Deployment Environment: Qinglong Humanoid Robot
The concept of embodied intelligence can be traced back to 1950, when Turing, in his paper “Computing Machinery and Intelligence”, envisioned machines that interact with their environment like humans: autonomously planning, making decisions, acting, and executing tasks, which is often regarded as the ultimate form of AI. What sets large language models apart from traditional machine learning is their strong generalization ability, with breakthroughs in complex task understanding, continuous dialogue, and zero-shot reasoning. This provides new solutions for robots’ understanding, sequential decision-making, and human-machine interaction. In the era of large models, the large model acts as the robot’s “brain”, while the multimodal model acts as its “cerebellum”. Combined with cloud services, models can be trained and tested end to end in real time in virtual simulation scenarios on the cloud, so edge-side iteration and development are completed rapidly, greatly accelerating the evolution of embodied intelligent agents.
Currently, many large companies have made moves in the field of embodied intelligence, targeting areas such as industrial manufacturing, warehousing and logistics, surveillance and detection, environmental exploration, emergency rescue, and swarm operations. Google has released PaLM-E; Microsoft has extended ChatGPT into robotics; Alibaba’s Qianwen large model has been integrated into industrial robots; and so on.
Boston Dynamics: Robots such as Spot, Stretch, and Atlas are examples of embodied large-model robots. Boston Dynamics is researching how to combine robotic functionality with expressiveness across four areas (cognitive AI, motion AI, advanced hardware design, and ethics). Its latest research directions include: (1) Observe-Understand-Practice projects, which combine motion and cognitive abilities, e.g., robots observing human behavior on production lines, imitating human dancing, and practicing without manual programming; scene recognition at work, navigation, understanding human tasks, and using tools; perceiving cars, parts, human behavior, and so on; a consumer application is teaching a robot to cook a favorite dish. (2) Inspection-Diagnosis-Repair projects (medical, nuclear-radiation repair, oil drilling platforms), e.g., checking whether equipment is functioning normally, intelligent diagnosis, and repair, including fixing household appliances. (3) Agile mobile manipulation projects. (4) Ethics projects, covering robot laws and policies, technology and design, social norms, and market forces.
ABB: (1) Stereoscopic warehouses; (2) Robotic vision applications: ① Robot guidance: item picking, box picking, stacking and unstacking, assembly guidance, picking and placing; ② Quality inspection: assembly testing, geometric analysis, defect detection; ③ Navigation and mapping: human-machine safety, AGV navigation, collision avoidance; programming; (3) RAPID programming (owned by ABB, system code and application code modules), Wizard (visual programming), guided programming (robot teaching), autonomous path planning; (4) Industrial robots: material handling, high-precision assembly, spot welding, laser welding, glue dispensing, machining.
The Qinglong Full-Size General Humanoid Robot is an outstanding representative in this field. It features a highly biomimetic torso configuration and human-like motion control, with capabilities such as fast walking, agile obstacle avoidance, robust uphill and downhill movement, and resistance to impact disturbances. It has up to 43 degrees of freedom, with a maximum joint peak torque of 396 N·m, approaching human flexibility and strength.
The Qinglong robot’s dexterous hands use a modular structure, allowing them to be quickly detached from and remounted on the humanoid robotic arms. Each fingertip and the palm integrate array-type tactile sensors for precise contact perception. The overall appearance is modeled on the human hand and on real grasping needs, giving a simple, aesthetic, and highly human-like design. The hand offers a high grasp-to-weight ratio, adaptive grasping, fast response, and agile movement; a quick-change mechanism makes assembly and disassembly with the robotic arm easy; and the fingers, designed on modular principles, provide tactile perception. The key technical parameters of the dexterous hand are listed below.

| Serial Number | Technical Parameters of the Dexterous Hand |
| --- | --- |
| 1 | The main body has 19 degrees of freedom, 6 of them active. The thumb has 3 degrees of freedom (2 active), allowing active bending/stretching and lateral swinging. Each of the other four fingers has 4 degrees of freedom (1 active), allowing active bending/stretching and passive lateral swinging. |
| 2 | Total weight of the hand: no more than 600 g |
| 3 | Finger movement speed: 90°/s |
| 4 | Response time: 10 ms |
| 5 | Single-finger load: not less than 15 N |
| 6 | Total hand load (four-finger fist): not less than 5 kg |
| 7 | Rated voltage: 8.4 V |
4. Deployment Process Introduction
1. Preparation Work: First, clone the ollama and llama.cpp code repositories and initialize the submodules.
2. Install Dependencies: Install the necessary Python packages in a virtual environment.
3. Build Quantization Tools: Use the make command to build the quantization tools.
4. Convert Model Format: Convert the multimodal large model from safetensors format to gguf format.
5. Quantize Model: Use the llama-quantize tool to quantize the model.
6. Configure the Model and Import into Ollama: Create a Modelfile, then create, run, and upload the ollama model locally.
5. Deployment Steps
1. Simple Language Model Conversion Method
Reference: https://blog.csdn.net/spiderwower/article/details/138506271
2. Download Ollama and Llama.cpp
git clone https://gitcode.com/gh_mirrors/oll/ollama.git
git clone https://gitcode.com/gh_mirrors/ll/llama.cpp.git
git clone https://gitee.com/Zyi-opts/llama.cpp-zh.git
3. Model Conversion Format and Quantization
Place the llama.cpp folder into ollama/llm (replace the original empty llama.cpp folder).
cd llama.cpp
pip install -r requirements.txt
python convert.py D:\huggingface\robollava-merged --outtype f16 --outfile D:\huggingface\robollava-convert/con
Since llama.cpp’s standard conversion script does not support this multimodal large model, use the following method instead (on an AutoDL server running Linux).
Reference:
https://blog.csdn.net/weixin_53162188/article/details/137754362
https://ollama.fan/getting-started/import/#setup
① Preparation Work:
git clone https://github.com/ollama/ollama.git
cd ollama
git submodule init
git submodule update llm/llama.cpp
② Install Dependencies:
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
③ Build Quantization Tools:
make -C llm/llama.cpp quantize
④ Convert Model Format: (run on an RTX 4090 server, otherwise memory may be insufficient; the goal is a gguf model file that can then be quantized to q4_0 precision)
a. Convert the multimodal large model file from safetensors format to .bin format (kept for reference only).
Use the tool: https://github.com/davidtorcivia/convert-safetensor-to-bin/tree/main
After configuring the environment according to the link, modify safetensor-to-bin.py as follows:
import torch
import os
import argparse
from safetensors.torch import load_file

def convert_safetensors_to_bin(input_path, output_path):
    if os.path.isdir(input_path):
        # If the input is a directory, merge all .safetensors files it contains
        tensors = {}
        for file in os.listdir(input_path):
            if file.endswith('.safetensors'):
                file_path = os.path.join(input_path, file)
                tensors.update(load_file(file_path))
    else:
        # If the input is a single .safetensors file
        tensors = load_file(input_path)

    # Save as .bin format
    torch.save(tensors, output_path)
    print(f"Model converted successfully and saved to {output_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert .safetensors model to .bin format")
    parser.add_argument('input_path', type=str, help='Path to the input .safetensors file or directory')
    parser.add_argument('output_path', type=str, help='Path to save the output .bin file')
    args = parser.parse_args()
    convert_safetensors_to_bin(args.input_path, args.output_path)
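The script is invoked as python safetensor-to-bin.py <input_path> <output_path>. As a quick sanity check of the result, the .bin file can be reloaded with torch.load and its tensor names and shapes printed; a minimal sketch, with a placeholder output file name:
import torch

# Reload the converted checkpoint (path is a placeholder for your own output file)
state_dict = torch.load("robollava.bin", map_location="cpu")
print(f"{len(state_dict)} tensors loaded")

# Print the first few tensor names and shapes
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)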
b. Convert the multimodal large model file from safetensors format to gguf format.
pip install torch safetensors transformers gguf llama-cpp-python
# Create st2gguf.py in the previously created /autodl-tmp/LLAVA folder, with the following code:
import os
import sys
import torch
from gguf import GGUFWriter, MODEL_ARCH_NAMES

# Add LLaVA directory to Python path
sys.path.append('/root/autodl-tmp/LLaVA')
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM as LlavaForConditionalGeneration

# Set paths
model_dir = '/root/autodl-tmp/robollava-merged'
output_path = '/root/autodl-tmp/robollava.gguf'

# Load model
model = LlavaForConditionalGeneration.from_pretrained(model_dir, torch_dtype=torch.float16)

# Convert to GGUF format
def convert_llava_to_gguf(model, output_path):
    config = model.config
    gguf_writer = GGUFWriter(output_path, "llama")

    # Add model parameters to GGUF
    for name, param in model.state_dict().items():
        gguf_writer.add_tensor(name, param.detach().cpu().numpy())

    # Add architecture information and other metadata
    metadata = {
        "llama.architecture": "llama",
        "llama.vocab_size": config.vocab_size,
        "llama.dim": config.hidden_size,
        "llama.multiple_of": 256,
        "llama.n_heads": config.num_attention_heads,
        "llama.n_layers": config.num_hidden_layers,
        "llama.n_kv_heads": getattr(config, 'num_key_value_heads', config.num_attention_heads),
        "llama.norm_eps": config.rms_norm_eps,
        "llama.max_seq_len": config.max_position_embeddings,
        "llama.context_length": config.max_position_embeddings,
        "llama.tensor_data_type": str(config.torch_dtype),
        "llama.rope.scaling_type": "none",
        "llama.rope.freq_base": getattr(config, 'rope_theta', 10000.0),
    }

    # Add metadata
    for key, value in metadata.items():
        if hasattr(gguf_writer, 'add_string'):
            gguf_writer.add_string(key, str(value))
        elif hasattr(gguf_writer, 'add_data'):
            gguf_writer.add_data(key, str(value))
        else:
            print(f"Warning: Unable to add metadata '{key}': {value}")

    # Complete and save GGUF file
    gguf_writer.write_header_to_file()
    gguf_writer.write_kv_data_to_file()
    gguf_writer.write_tensors_to_file()
    gguf_writer.close()

print("Starting conversion...")
convert_llava_to_gguf(model, output_path)
print("GGUF file has been created successfully.")
⑤ Quantize Model: (the following method can only quantize large language models)
#cd /path/to/llama.cpp
make llama-quantize
./bin/llama-quantize /root/autodl-tmp/robollava.gguf /root/autodl-tmp/robollava-q4_0.gguf q4_0
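As a quick check that quantization took effect, compare the file sizes before and after; q4_0 typically shrinks an f16 model to roughly a quarter of its size. A minimal sketch using the same paths as above:
import os

# Paths used in the quantization step above
src = '/root/autodl-tmp/robollava.gguf'
dst = '/root/autodl-tmp/robollava-q4_0.gguf'

for path in (src, dst):
    size_gib = os.path.getsize(path) / 1024**3
    print(f"{path}: {size_gib:.2f} GiB")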
⑥ Configure the model and import it into ollama
# Create Modelfile
FROM /home/yyang/Matcha-agent/robollava/robollava.gguf
TEMPLATE """{{- if .System }}<|system|>{{ .System }}</s>{{- end }}<|user|>{{ .Prompt }}</s><|assistant|>"""
PARAMETER stop "<|system|>"
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "</s>"
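In this Modelfile, FROM points to the local gguf file, TEMPLATE describes how the system and user prompts are wrapped into the model's chat format (using ollama's Go-template syntax with .System and .Prompt), and each PARAMETER stop entry registers a stop sequence so that generation is cut off at the role markers.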
4. Create Local Ollama Model
ollama create robollava -f /home/yyang/Matcha-agent/robollava/Modelfile
5. Run Local Ollama Model
ollama run robollava
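Besides the interactive CLI, the locally created model can also be queried through ollama's REST API, which listens on port 11434 by default. The following is a minimal sketch, assuming the ollama service is running and the requests package is installed; the prompt text is only an example:
import requests

# Send a single non-streaming generation request to the local ollama service
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "robollava",
        "prompt": "Describe what a robot should do to pick up a cup.",
        "stream": False,
    },
    timeout=300,
)
print(response.json()["response"])
For multimodal prompts, the same endpoint also accepts an optional "images" field containing base64-encoded images, provided the imported model actually supports image input.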
6. Upload Ollama Model
cat /usr/share/ollama/.ollama/id_ed25519.pub    # print the public key and add it to your ollama.com account before pushing
ollama cp robollava XXX/robollava    # replace XXX with your username, according to your ollama.com homepage
ollama push XXX/robollava
6. Effect Description
Through the above process, we successfully converted the multimodal large model to the gguf format and quantized it. The model can run on the ollama platform, significantly reducing resource consumption and improving performance.
The step-by-step explanation is as follows: