Fraud Text Classification Detection: llama.cpp + CPU Inference

1. Introduction: After fine-tuning our personalized model with LoRA, the first issue we faced was how to run the model on an ordinary machine. After all, the model was fine-tuned on dedicated GPUs with dozens of gigabytes of memory, and moving to a regular computer with only a CPU could lead to the awkward … Read more

LlamaFactory Model Export Quantization

1. Each large-model framework has its own format requirements for fine-tuning data; for the formats LlamaFactory supports, refer to the documentation: https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/data_preparation.html 2. Convert the Ruozhiba data into the LlamaFactory data format: import json # Conversion function def convert_format(original_data): converted_data = [] for item in original_data: converted_item = { "instruction": item["query"], "input": "", … Read more
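The conversion snippet in the teaser is cut off mid-function. A complete, runnable sketch of the same idea, assuming the target is LlamaFactory's Alpaca-style format (`instruction`/`input`/`output` keys) and that each source record carries `query` and `response` fields (the `response` key is an assumption; the teaser truncates before the output field):

```python
import json

def convert_format(original_data):
    """Convert records with 'query'/'response' keys into
    LlamaFactory's Alpaca-style instruction format."""
    converted_data = []
    for item in original_data:
        converted_data.append({
            "instruction": item["query"],  # the user question becomes the instruction
            "input": "",                   # no separate input field in this dataset
            "output": item["response"],    # assumed name of the answer field
        })
    return converted_data

# Example usage with a single Ruozhiba-style record.
original = [{"query": "Why is the sky blue?", "response": "Rayleigh scattering."}]
converted = convert_format(original)
print(json.dumps(converted, ensure_ascii=False, indent=2))
```

The resulting list can be dumped to a JSON file and registered in LlamaFactory's `dataset_info.json` for training.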

Ollama: A Powerful Tool for Local Large Model Building

1. What is Ollama: Ollama is a simple, easy-to-use framework for running large models locally, allowing users to quickly run large models on their own computers; most of its code is written in Go. Project address: https://github.com/ollama/ollama Official site: https://ollama.com/ 2. Why Ollama Exists: The existence of Ollama can be traced back to Llama … Read more

Neural Network Model Compression Techniques

Baidu NLP Column Author: Baidu NLP. Introduction: In recent years, we have been deeply engaged in applying neural network models to NLP tasks, achieving significant progress in areas such as syntactic analysis, semantic similarity computation, and chat generation. In search engines, semantic similarity features have also become one of the most important … Read more

TensorFlow Model Optimization Toolkit – Quantization Aware Training

Written by the TensorFlow Model Optimization Team. We are pleased to announce the release of the Quantization Aware Training (QAT) API, part of the TensorFlow Model Optimization Toolkit. With QAT, you can gain the performance and size benefits of quantization while keeping accuracy close to that of the original model. This work is part of … Read more

SpinQuant: LLM Quantization with Learnable Rotation Matrices

Author丨Tech Beast Editor丨Jishi Platform. Introduction: SpinQuant uses learnable rotation matrices to preserve network accuracy while quantizing weights, activations, and the KV cache to a 4-bit width. On the LLaMA-2 7B model, SpinQuant narrows the accuracy gap on Zero-Shot inference tasks to only 2.9 points compared with the full-precision … Read more
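The rotation trick SpinQuant builds on can be illustrated in a few lines: multiplying activations and weights by the same orthogonal matrix leaves the layer output unchanged, while spreading outlier magnitudes across channels so they quantize better. A toy 2-D sketch (the 45° rotation and the values are illustrative, not SpinQuant's learned matrices):

```python
import math

def matvec(M, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

# Activation vector with an outlier channel, and one weight column.
x = [10.0, 0.1]
w = [0.5, -0.3]

# Orthogonal rotation R (45 degrees): R^T R = I.
c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
R = [[c, -s], [s, c]]

# Original output: x . w
original = x[0] * w[0] + x[1] * w[1]

# Rotate activations and weights by the same R;
# the dot product is preserved because (Rx).(Rw) = x^T R^T R w = x.w.
x_rot = matvec(R, x)
w_rot = matvec(R, w)
rotated = x_rot[0] * w_rot[0] + x_rot[1] * w_rot[1]

# The outlier magnitude shrinks (10.0 -> ~7.14), so a 4-bit grid
# wastes less range on a single extreme channel.
print(max(abs(v) for v in x), max(abs(v) for v in x_rot))
print(original, rotated)  # identical up to float rounding
```

SpinQuant's contribution is learning which rotation to apply (rather than picking a random or fixed one) so that the quantized network's accuracy is maximized.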

Hugging Face Visualizes GGUF Models

Hugging Face has added a visualization feature for GGUF files, allowing users to view a model's metadata and tensor information directly from the model page. All of this runs on the client side. GGUF (GPT-Generated Unified Format) is a binary large-model file format that allows fast loading and saving of GGML models. It … Read more
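The metadata that Hugging Face renders starts with GGUF's fixed-size header. A minimal sketch of parsing just those fields, based on the GGUF specification (version 3): a 4-byte magic `GGUF`, a little-endian `uint32` version, then `uint64` tensor and metadata key-value counts. The typed metadata entries and tensor descriptors follow the header and are not parsed here:

```python
import struct

def read_gguf_header(data: bytes):
    """Parse the fixed GGUF header: 4-byte magic, uint32 version,
    uint64 tensor count, uint64 metadata KV count (all little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version,
            "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}

# Synthetic header: GGUF v3, 2 tensors, 5 metadata entries.
header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(header))
```

Running the same unpacking against the first 24 bytes of a real `.gguf` file yields the counts the Hugging Face viewer displays.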

Overview of Transformer Compression

Large models based on the Transformer architecture play an increasingly important role in artificial intelligence, especially in natural language processing (NLP) and computer vision (CV). Model compression methods reduce their memory and computational costs, a necessary step for deploying Transformer models on practical devices. Given the unique architecture of … Read more

Deploy Personal Code Assistant Using llama.cpp in 3 Minutes

Today, I will demonstrate the most popular on-device LLM deployment engine, llama.cpp, on a MacBook Pro (M3 Pro). Project address: https://github.com/ggerganov/llama.cpp. Build guide: https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md. The model used for testing is Qwen2.5-Coder-3B-Instruct. Model download address: https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct. This model … Read more
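The basic flow the article walks through can be sketched as a few shell commands, following the linked build guide. The GGUF filename below is illustrative (a quantized conversion of Qwen2.5-Coder-3B-Instruct must be obtained or produced separately):

```shell
# Clone and build llama.cpp with CMake, per the project's build docs.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Chat with a local GGUF model in interactive (conversation) mode.
# The model path is a placeholder for your quantized Qwen2.5-Coder file.
./build/bin/llama-cli -m qwen2.5-coder-3b-instruct-q4_k_m.gguf -cnv \
  -p "You are a helpful coding assistant."
```

On Apple Silicon the build enables Metal acceleration by default, which is why a MacBook Pro handles a 3B coder model comfortably.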