Why Large Models Need Quantization and How to Quantize

The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, whose members include NLP master's and doctoral students, university faculty, and industry researchers. The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners. Reproduced … Read more

TurboAttention: Efficient Attention Mechanism Optimization Reducing LLM Costs by 70%

Source: Deephub Imba. This article is approximately 6,500 words and is recommended as a 10-minute read. It examines, from a technical perspective, how TurboAttention achieves its efficiency gains and analyzes its architectural innovations. As large language models (LLMs) continue to evolve in the AI application domain, their computational costs are also showing … Read more
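
The teaser centers on the cost of the attention operation itself. For reference, here is a minimal sketch of the standard scaled dot-product attention that TurboAttention optimizes; the article's actual techniques (such as quantizing the attention path) are not reproduced here, and all shapes are illustrative.

    import torch

    def attention(q, k, v):
        # q, k, v: (batch, heads, seq_len, head_dim)
        scale = q.shape[-1] ** -0.5
        scores = (q @ k.transpose(-2, -1)) * scale  # (batch, heads, seq, seq)
        weights = torch.softmax(scores, dim=-1)     # row-wise attention weights
        return weights @ v

    q = k = v = torch.randn(1, 8, 128, 64)
    print(attention(q, k, v).shape)  # torch.Size([1, 8, 128, 64])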

Guide to Optimizing Transformer Memory Usage

The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, whose members include NLP master's and PhD students, university faculty, and industry researchers. The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, especially for beginners. Reprinted … Read more

Fraud Text Classification Detection: llama.cpp + CPU Inference

1. Introduction. After fine-tuning our personalized model with LoRA, the first issue we faced was how to run the model on an ordinary machine. After all, the model was fine-tuned on dedicated GPUs with dozens of gigabytes of memory, and switching to an ordinary CPU-only computer could lead to the awkward … Read more
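
One common answer to "how do I run it on a CPU-only machine" is the route the title names: convert the merged model to GGUF and load it with llama.cpp. Below is a minimal sketch using the llama-cpp-python bindings, assuming the conversion has already been done; the model filename and prompt are hypothetical placeholders.

    # pip install llama-cpp-python
    from llama_cpp import Llama

    # Hypothetical GGUF file produced from the LoRA-merged model.
    llm = Llama(model_path="fraud-classifier-q4_0.gguf", n_ctx=2048)

    prompt = "Classify the following text as fraud or normal: ..."
    result = llm(prompt, max_tokens=16, temperature=0.0)
    print(result["choices"][0]["text"])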

LlamaFactory Model Export Quantization

1. Each large model framework has specific format requirements for its fine-tuning data; for the formats LlamaFactory supports, refer to the documentation: https://llamafactory.readthedocs.io/zh-cn/latest/getting_started/data_preparation.html
2. Convert the Ruozhiba data into the LlamaFactory data format:

    import json

    # Conversion function
    def convert_format(original_data):
        converted_data = []
        for item in original_data:
            converted_item = {
                "instruction": item["query"],
                "input": "",
                … Read more
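
The excerpt cuts off mid-function. Here is a minimal sketch of how the conversion likely completes, assuming each Ruozhiba record carries a "query" and a "response" field (only "query" appears in the excerpt, so "response" and the file names are assumptions); the output follows LlamaFactory's alpaca-style instruction format.

    import json

    def convert_format(original_data):
        converted_data = []
        for item in original_data:
            converted_item = {
                "instruction": item["query"],
                "input": "",
                "output": item.get("response", ""),  # assumed field name
            }
            converted_data.append(converted_item)
        return converted_data

    # Hypothetical file names.
    with open("ruozhiba.json", encoding="utf-8") as f:
        original = json.load(f)
    with open("ruozhiba_llamafactory.json", "w", encoding="utf-8") as f:
        json.dump(convert_format(original), f, ensure_ascii=False, indent=2)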

Ollama: A Powerful Tool for Local Large Model Building

1. What is Ollama. Ollama is a concise, easy-to-use framework for running large models locally, letting users quickly run large models on their own computers; most of its code is written in Go. Project address: https://github.com/ollama/ollama Official site: https://ollama.com/
2. Why Ollama exists. The existence of Ollama can be traced back to Llama … Read more
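
Once the Ollama server is running (it listens on http://localhost:11434 by default), models can be queried over its REST API. A minimal sketch follows; the model name "llama3" is an example and assumes it was pulled beforehand with ollama pull.

    import json
    import urllib.request

    payload = {"model": "llama3", "prompt": "Why is the sky blue?", "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])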

Neural Network Model Compression Techniques

Baidu NLP Column. Author: Baidu NLP. Introduction: In recent years, we have been deeply engaged in integrating neural network models with NLP tasks, achieving significant progress in areas such as syntactic analysis, semantic similarity computation, and chat generation. In search engines, semantic similarity features have also become one of the most important … Read more

TensorFlow Model Optimization Toolkit – Quantization Aware Training

Written by the TensorFlow Model Optimization Team. We are pleased to announce the release of the Quantization Aware Training (QAT) API, part of the TensorFlow Model Optimization Toolkit. With QAT, you can retain the performance and size advantages of quantization while keeping accuracy close to that of the original model. This work is part of … Read more
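
In Keras terms, the announced API wraps an existing model with fake-quantization ops so that training learns quantization-friendly weights. A minimal sketch, assuming the tensorflow-model-optimization package is installed; the toy model is purely illustrative.

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])

    # Insert fake-quantization ops into the model for QAT.
    qat_model = tfmot.quantization.keras.quantize_model(model)
    qat_model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    # qat_model.fit(train_images, train_labels, epochs=1)  # then train as usual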

SpinQuant: LLM Quantization with Learnable Rotation Matrices

Author: Tech Beast. Editor: Jishi Platform. Jishi introduction: SpinQuant uses learnable rotation matrices to achieve the best network accuracy, quantizing weights, activations, and the KV cache to 4-bit width. On the LLaMA-2 7B model, SpinQuant reduces the accuracy gap on zero-shot inference tasks to only 2.9 points compared to the full-precision … Read more
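
The core idea the teaser describes can be demonstrated in a few lines: rotating a weight matrix by an orthogonal matrix before quantization spreads outliers across channels, and the rotation can be undone afterwards without changing the layer's function. A minimal sketch with a random rotation standing in for SpinQuant's learned one, and illustrative 4-bit symmetric per-tensor quantization.

    import torch

    def random_rotation(n):
        # QR decomposition of a Gaussian matrix gives an orthogonal matrix.
        q, _ = torch.linalg.qr(torch.randn(n, n))
        return q

    def quantize_4bit(w):
        scale = w.abs().max() / 7  # symmetric int4 grid: [-8, 7]
        return torch.clamp(torch.round(w / scale), -8, 7) * scale

    w = torch.randn(512, 512)
    w[:, :4] *= 20  # simulate a few outlier channels, which blow up the scale
    r = random_rotation(512)

    plain_err = (quantize_4bit(w) - w).norm()
    rotated_err = (quantize_4bit(w @ r) @ r.t() - w).norm()  # rotate, quantize, undo
    print(f"quantization error, plain: {plain_err:.2f}  rotated: {rotated_err:.2f}")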

Hugging Face Visualizes GGUF Models

Hugging Face has added a visualization feature for GGUF files, allowing users to view a model's metadata and tensor information directly on the model page. All of this parsing is performed on the client side. GGUF (GPT-Generated Unified Format) is a binary large-model file format that allows GGML models to be loaded and saved quickly. It … Read more
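
GGUF's fixed header is what makes this kind of client-side parsing straightforward. A minimal sketch of reading the leading header fields, assuming the GGUF v3 layout (4-byte magic "GGUF", then little-endian uint32 version, uint64 tensor count, and uint64 metadata key-value count); "model.gguf" is a hypothetical path.

    import struct

    with open("model.gguf", "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))

    print(f"GGUF v{version}: {tensor_count} tensors, {kv_count} metadata key-value pairs")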