DeepSeek and Kimi: Large Model Inference

Overview:Recently, with the rising popularity of large models such as DeepSeek and Kimi, large model inference and its related technologies have become a focal point. These models not only demonstrate strong capabilities in fields such as natural language processing and computer vision, but also drive the rapid development of applications such as intelligent dialogue, content generation, and knowledge question answering. Progress in large model inference technology, especially how to deploy and optimize these models efficiently and at low cost, has become a core concern for both academia and industry. This report mainly introduces the technical details of DeepSeek-V3, DeepSeek-R1, and Kimi k1.5.


Paper Title:DeepSeek-V3 Technical Report

Paper Link:https://arxiv.org/abs/2412.19437

Paper Code:https://github.com/deepseek-ai/DeepSeek-V3

Publication Date:2024.12

Motivation:DeepSeek-V3 approaches the optimization of large language model training from both methodological and engineering perspectives: the engineering optimizations focus on accelerating training, while the methodological optimizations center on improvements to the model architecture and loss functions.

Innovations:

(1) Uses Multi-Head Latent Attention (MLA), which applies low-rank joint compression to reduce the storage and computation overhead of the key and value matrices.

(2) Adopts the DeepSeekMoE architecture with an auxiliary-loss-free load balancing strategy.

(3) Implements a multi-token prediction (MTP) mechanism as an auxiliary training objective.

Key Methods:

Multi-Head Latent Attention Mechanism

DeepSeek-V3 employs the MLA architecture for its attention mechanism. The core innovation of MLA lies in the low-rank joint compression of attention keys and values to reduce the key-value (KV) cache overhead during inference:

[Equations from the paper: low-rank joint compression of keys and values in MLA]
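The paper's exact formulation also covers query compression and decoupled rotary position embeddings; below is a minimal, illustrative PyTorch sketch of just the core idea of low-rank joint KV compression, with hypothetical dimensions.

```python
# Illustrative sketch only: low-rank joint KV compression in the spirit of MLA.
# Omits query compression, RoPE decoupling, and multi-head details from the paper.
import torch
import torch.nn as nn

class LowRankKVCompression(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, d_kv=1024):
        super().__init__()
        self.w_down_kv = nn.Linear(d_model, d_latent, bias=False)  # joint down-projection
        self.w_up_k = nn.Linear(d_latent, d_kv, bias=False)        # up-projection to keys
        self.w_up_v = nn.Linear(d_latent, d_kv, bias=False)        # up-projection to values

    def forward(self, h):
        # h: [batch, seq, d_model]. During inference only the small latent c_kv
        # needs to be cached; keys and values are reconstructed from it on the fly.
        c_kv = self.w_down_kv(h)   # [batch, seq, d_latent]  <- this is what the KV cache stores
        k = self.w_up_k(c_kv)      # [batch, seq, d_kv]
        v = self.w_up_v(c_kv)      # [batch, seq, d_kv]
        return c_kv, k, v

# In this toy setup the per-token cache shrinks from 2*d_kv values to d_latent values.
x = torch.randn(1, 4, 1024)
c_kv, k, v = LowRankKVCompression()(x)
```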

DeepSeekMoE and Its Load Balancing Mechanism Without Auxiliary Loss

The base architecture of DeepSeekMoE: for the feed-forward network (FFN) layers, DeepSeek-V3 adopts the DeepSeekMoE architecture. Compared with traditional MoE architectures such as GShard, DeepSeekMoE uses a finer-grained expert allocation and additionally designates some experts as shared experts.


For MoE models, an imbalanced expert load can lead to routing collapse and reduce computational efficiency in expert-parallel scenarios. Traditional solutions rely on auxiliary losses to avoid load imbalance, but an overly large auxiliary loss can harm model performance. To strike a better balance between load balance and model performance, the research team pioneered an auxiliary-loss-free load balancing strategy.


Specifically, the research team introduced a bias term for each expert, which is added to the corresponding affinity score to determine the top-K routing. In this design, the bias term is used only for routing selection, while the gating values (which multiply the FFN outputs) are still computed from the original affinity scores. During training, the system monitors the expert load distribution over each training step. At the end of each step, the bias term of overloaded experts is decreased by a fixed amount, and the bias term of underloaded experts is increased by the same fixed amount.

Through this dynamic adjustment mechanism, DeepSeek-V3 achieves a balanced distribution of expert loads during training, outperforming traditional models that rely solely on auxiliary losses for load balancing.
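A minimal sketch of this bias-adjusted routing is shown below; the gating normalization, the top-k value, and the update step size gamma are illustrative assumptions, not the paper's exact settings.

```python
import torch

def biased_topk_routing(affinity, bias, k=8):
    """Route each token using (affinity + bias) but gate with the original affinity.
    affinity: [tokens, n_experts] affinity scores; bias: [n_experts]; k is illustrative."""
    topk_idx = torch.topk(affinity + bias, k, dim=-1).indices            # routing uses biased scores
    gates = torch.gather(torch.softmax(affinity, dim=-1), 1, topk_idx)   # gating ignores the bias
    return topk_idx, gates

def update_bias(bias, topk_idx, n_experts, gamma=1e-3):
    """End-of-step update: decrease the bias of overloaded experts, increase it for underloaded ones."""
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```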

Multi-Token Prediction Mechanism (MTP)

DeepSeek-V3 innovatively adopts an MTP objective, extending the prediction range to multiple subsequent tokens at each position. This design has two advantages: first, the MTP objective can improve data utilization efficiency by densifying the training signal; second, it allows the model to plan its representations ahead of time, enabling more accurate prediction of subsequent tokens. As shown in the figure below, this implementation differs from previous work: earlier approaches use independent output heads to predict the additional tokens in parallel, whereas DeepSeek-V3 predicts them sequentially, maintaining the complete causal chain at each prediction depth.

[Figure from the paper: the MTP implementation, with sequential prediction modules that preserve the causal chain at each depth]
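As a rough illustration of the training objective, the per-depth MTP losses are averaged over the prediction depth and scaled by a weight, roughly lambda/D times the sum of the depth-wise cross-entropy losses; the indexing convention and the weight value in the sketch below are assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def mtp_loss(depth_logits, tokens, lam=0.1):
    """Combine per-depth MTP cross-entropy losses, roughly lam/D * sum_k L_k.
    depth_logits: list of D tensors [batch, seq, vocab], where entry k predicts the
    token (k+2) positions ahead of each input position (the main model predicts t+1).
    lam is an illustrative weight, not necessarily the paper's value."""
    D = len(depth_logits)
    losses = []
    for k, logits in enumerate(depth_logits):
        offset = k + 2                      # depth-k module predicts token t + 1 + (k + 1)
        pred = logits[:, :-offset, :]
        target = tokens[:, offset:]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1)))
    return lam / D * torch.stack(losses).sum()
```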

Main Results:The resulting DeepSeek-V3 is a strong mixture-of-experts (MoE) language model with 671B total parameters, of which 37B are activated per token. Evaluation results indicate that DeepSeek-V3 outperforms other open-source models and is competitive with mainstream closed-source models.

[Tables/figures from the paper: DeepSeek-V3 benchmark comparison with open-source and closed-source models]

Paper Title:DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Paper Link:https://arxiv.org/abs/2501.12948

Paper Code:https://github.com/deepseek-ai/DeepSeek-R1

Publication Date:2025.01

Motivation:In recent years, the rapid development of LLM technology has steadily narrowed the gap toward AGI. Post-training has become a key stage of the full training pipeline: it can improve accuracy on reasoning tasks, align models with societal values, and adapt them to user preferences, while requiring far fewer computational resources than pre-training. On the reasoning side, OpenAI’s o1 series models have made significant progress on a variety of reasoning tasks, including mathematics, programming, and scientific reasoning, by extending the chain-of-thought (CoT) reasoning process. However, effective test-time scaling remains an open challenge for the research community. Prior work has explored various approaches, including process-based reward models, reinforcement learning, and search algorithms such as Monte Carlo tree search and beam search, but none has reached general reasoning performance comparable to OpenAI’s o1 series. This study applies pure RL to enhance the reasoning capabilities of language models, aiming to explore the potential of LLMs to self-evolve reasoning abilities through a pure RL process, without any supervised data.

Innovations:

(1) Directly applying RL to base models without requiring SFT as a precursor step. This approach allows the model to explore solutions to complex problems through CoT, ultimately leading to the development of the DeepSeek-R1-Zero model.

(2) Proposes a development process for DeepSeek-R1, including two RL stages for optimizing reasoning patterns and aligning with human preferences, as well as two SFT stages for building the model’s reasoning and non-reasoning foundational capabilities. This process will assist the industry in developing higher-performance models.

(3) The reasoning patterns of large models can be transferred to smaller models through knowledge distillation, outperforming direct RL training of smaller models. The open-source DeepSeek-R1 and its API will support academia in developing superior small models.

(4) Utilizing reasoning data generated by DeepSeek-R1, the research team fine-tuned several widely used dense models in academia. Evaluation results show that the small dense models, after knowledge distillation, perform excellently in benchmark tests.

Key Methods:

DeepSeek-R1-Zero: Reinforcement Learning Application on Base Models

Previous research has shown that reinforcement learning is highly effective on reasoning tasks, but those studies rely heavily on time-consuming supervised data collection. This work instead explores the potential of LLMs to self-evolve reasoning capabilities without supervised data, through prompt template design and the GRPO algorithm.

[Equations from the paper: the GRPO objective and its group-relative advantage estimate]
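The GRPO objective itself is spelled out in the paper; a minimal sketch of its group-relative advantage computation, which replaces a learned critic, might look as follows (the reward values in the example are illustrative).

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each sampled response's reward is normalized by the
    mean and standard deviation of the rewards within its own group of samples,
    so no separate value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled answers to one prompt, rewarded 1.0 if correct, 0.0 otherwise.
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```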

Reward mechanisms serve as the source of training signals and determine the optimization direction of RL. DeepSeek-R1-Zero adopts a rule-based reward system with two components:

Accuracy Reward: Evaluates the correctness of responses. For deterministic math problems, the model is required to provide the final answer in a specific format (e.g., inside a box) for reliable verification. For LeetCode problems, feedback is generated by the compiler based on preset test cases.

Format Reward: Requires the model to place its reasoning process within a specified pair of tags. The study did not adopt outcome or process neural reward models, out of concern that neural reward models can suffer from reward hacking during large-scale RL, and because retraining them requires additional resources and complicates the training pipeline.
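A minimal sketch of such a rule-based reward is shown below; the tag names, the boxed-answer convention, and the reward values are illustrative assumptions rather than the paper's exact specification.

```python
import re

def format_reward(response, open_tag="<think>", close_tag="</think>"):
    """1.0 if the reasoning is wrapped in the expected tag pair, else 0.0 (tag names assumed)."""
    return 1.0 if open_tag in response and close_tag in response else 0.0

def accuracy_reward(response, gold_answer):
    """1.0 if the final boxed answer matches the reference answer, else 0.0 (format assumed)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def total_reward(response, gold_answer):
    # For code prompts the accuracy term would instead come from compiling and
    # running the response against preset test cases, as described above.
    return accuracy_reward(response, gold_answer) + format_reward(response)
```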

[Figures from the paper: performance and output length variations of DeepSeek-R1-Zero during training]

DeepSeek-R1: Addressing Readability and User Preference Issues

Step-1: Cold Start Mechanism

Uses a small amount of long-CoT data to fine-tune the base model as the initial RL policy, avoiding instability in the early stage of training. The data is collected in several ways: few-shot prompting with long-CoT examples, directly prompting the model to generate detailed answers with reflection and verification, and gathering DeepSeek-R1-Zero outputs in a readable format and refining them through post-processing.

Step-2: Reinforcement Learning Optimization for Reasoning

Adopts a large-scale RL training process similar to that of DeepSeek-R1-Zero. Because the CoT still exhibits language mixing, a language consistency reward based on the proportion of target-language words is introduced; it slightly degrades benchmark performance but improves the user experience.
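A toy version of such a language-consistency signal is sketched below; the paper's reward is based on the proportion of target-language words, whereas this sketch only uses a crude character-level proxy (ASCII vs. non-ASCII letters).

```python
def language_consistency_reward(text, target="en"):
    """Crude proxy: fraction of alphabetic characters that belong to the target script."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    if target == "en":
        in_target = sum(c.isascii() for c in letters)
    else:  # treat non-ASCII letters as the target script (e.g., Chinese)
        in_target = sum(not c.isascii() for c in letters)
    return in_target / len(letters)
```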

Step-3: Rejection Sampling and Supervised Fine-Tuning

After the reasoning-oriented RL converges, the resulting checkpoint is used to generate data for a subsequent round of SFT. Unlike the cold-start phase, which focuses on reasoning, this phase integrates multi-domain data to enhance the model’s writing, role-playing, and other general capabilities. The specific implementation is as follows:

Reasoning Data: Reasoning trajectories are generated through rejection sampling. The evaluation mechanism is expanded: in addition to rule-based rewards, a generative reward model that uses DeepSeek-V3 as the judge is introduced. To improve output quality, responses with mixed languages, overly long paragraphs, and code blocks are filtered out. Multiple samples are generated for each prompt and only correct ones are retained, ultimately yielding approximately 600,000 reasoning training samples.

Non-Reasoning Data: For writing, factual QA, self-cognition, and translation, the DeepSeek-V3 pipeline and parts of its SFT data are reused. For complex non-reasoning tasks, DeepSeek-V3 is prompted to generate a CoT before answering; for simple queries, responses are produced directly. Approximately 200,000 non-reasoning training samples are obtained in total.

In total, about 800,000 samples are used to fine-tune DeepSeek-V3-Base for two epochs.
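A minimal sketch of this rejection-sampling data construction is given below, where `generate`, `is_correct`, and `is_readable` are hypothetical stand-ins for the RL checkpoint, the correctness judge (rules or DeepSeek-V3 as judge), and the readability filter.

```python
def build_sft_samples(prompts, generate, is_correct, is_readable, n_samples=16):
    """For each prompt, sample several completions from the converged RL checkpoint
    and keep only those that are correct and readable (no mixed languages, overly
    long paragraphs, or code blocks). n_samples is illustrative."""
    kept = []
    for prompt in prompts:
        for completion in (generate(prompt) for _ in range(n_samples)):
            if is_correct(prompt, completion) and is_readable(completion):
                kept.append({"prompt": prompt, "response": completion})
    return kept
```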

Step-4: Full-Scene Reinforcement Learning

To optimize alignment with human preferences, a second stage of RL training is implemented, focusing on enhancing the model’s helpfulness, safety, and reasoning capabilities. A variety of reward signals and diverse prompt distributions are adopted:

Reasoning Data: Continues the DeepSeek-R1-Zero method, applying rule-based rewards in mathematical logic.

General Data: Utilizes a reward model to capture human preferences in complex scenarios.

Helpfulness Assessment: Focuses on the final response summary, ensuring that outputs are useful and relevant.

Safety Assurance: Conducts comprehensive assessments of reasoning processes and summaries, identifying and mitigating potential risks.

Knowledge Distillation: Enhancing Reasoning Capabilities of Small Models

Using 800,000 training samples generated by DeepSeek-R1, direct fine-tuning is performed on open-source models such as Qwen and Llama, aiming to transfer the reasoning capabilities of DeepSeek-R1 to smaller models with higher computational efficiency. The selected base models include: Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct.

Main Results:

DeepSeek-R1-Zero approaches o1 performance:

[Table/figure from the paper: DeepSeek-R1-Zero compared with OpenAI o1 models on reasoning benchmarks]

DeepSeek-R1 performance surpasses o1:

[Tables/figures from the paper: DeepSeek-R1 benchmark comparison with OpenAI o1 and other models]

Small models trained using R1-Zero methodology are not as effective as direct distillation:

[Tables from the paper: distilled small models compared with small models trained directly with RL]

Paper Title:KIMI K1.5: SCALING REINFORCEMENT LEARNING WITH LLMS

Paper Link:https://github.com/MoonshotAI/Kimi-k1.5/blob/main/Kimi_k1.5.pdf

Paper Code:Not open source

Publication Date:2025.1.20

Motivation:Although traditional pre-training methods for language models have achieved significant results, they are limited by the scarcity of high-quality training data. The Kimi team therefore aims to scale the model’s effective training data through reinforcement learning (RL), letting the model explore autonomously under reward signals and thereby achieve more efficient training and better performance. The goal is to improve the reasoning performance of large language models (LLMs) through reinforcement learning and long-context scaling, especially for multimodal reasoning and complex tasks, and this training framework aims to break through the bottlenecks of traditional methods. Kimi k1.5 not only excels in long chain-of-thought (long-CoT) mode but also significantly outperforms existing models in short-CoT mode. This enhancement in multimodal reasoning capability opens new possibilities for applications in fields such as education, scientific research, and data analysis.

Innovations:Reinforcement learning (RL) prompt set construction, long2short methods, and policy optimization.

Key Methods:

1.The Kimi team found that a high-quality prompt set reduces the risk of reward hacking and of overfitting to superficial patterns. Three aspects are therefore considered when constructing the prompt set:

Diverse Coverage: The prompt set should cover a wide range of domains, such as STEM (science, technology, engineering, mathematics), programming, general reasoning, and multimodal data (captioning, interleaved image-text data, optical character recognition, and knowledge and general question answering), so that the model performs well across application scenarios rather than only in a specific domain.

Balanced Difficulty: The prompt set should contain a reasonable mix of easy, medium, and hard questions, which promotes progressive learning and prevents the model from overfitting to a particular difficulty level. Difficulty is estimated with the model itself: the supervised fine-tuned model generates 10 diverse answers for each prompt and the pass rate (the proportion of correct answers) is computed; a lower pass rate indicates a harder prompt (see the sketch after this list).

Accurate Evaluability: The model’s performance should be based on correct reasoning rather than superficial patterns or random guessing. Error-prone question types are excluded and easily hackable prompts are identified and removed, which prevents the model from scoring highly through shortcuts and effectively reduces false positives in verification.
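A minimal sketch of the pass-rate-based difficulty estimation described above, with `generate` and `check` as hypothetical stand-ins for the SFT model and the answer grader, and with illustrative thresholds:

```python
def pass_rate(prompt, generate, check, n=10):
    """Sample n answers from the SFT model and return the fraction judged correct."""
    return sum(check(prompt, generate(prompt)) for _ in range(n)) / n

def difficulty_label(rate):
    # Lower pass rate means a harder prompt; the cut-offs here are illustrative.
    if rate >= 0.7:
        return "easy"
    if rate >= 0.3:
        return "medium"
    return "hard"
```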

2.Policy optimization uses curriculum sampling and prioritized sampling strategies. Curriculum sampling starts training with easier tasks and gradually moves on to harder ones; since the collected data naturally carries grade and difficulty labels, difficulty-based sampling is an intuitive and effective way to improve training efficiency. Prioritized sampling focuses on problems the model currently handles poorly: by tracking the success rate s_i of each question i and sampling problems with probability proportional to 1 - s_i, questions with lower success rates are sampled more often.
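A minimal sketch of this prioritized sampling rule (the success rates are assumed to be tracked elsewhere, and the batch size and weight floor are illustrative):

```python
import random

def prioritized_sample(problems, success_rates, k=32):
    """Sample problems with probability proportional to (1 - s_i), so questions the
    model currently fails on are revisited more often. A small floor keeps already
    solved problems occasionally sampled."""
    weights = [max(1.0 - s, 1e-3) for s in success_rates]
    return random.choices(problems, weights=weights, k=k)
```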

3.Long2short: Transfers the reasoning priors of long-CoT models to short-CoT models, improving performance even under a limited test-time token budget. The methods are as follows:

Model Merging: Combines a long-CoT model with a shorter model to obtain a new model without any training, simply by averaging the weights of the two models.

Shortest Rejection Sampling: Samples the same question n times (n=8 in the paper’s experiments) and selects the shortest correct response for supervised fine-tuning.

DPO: Uses the long-CoT model to generate multiple response samples. The shortest correct solution is chosen as the positive sample, while longer responses serve as negative samples, including incorrect longer responses and correct responses that are 1.5 times longer than the chosen positive sample.
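Minimal sketches of the first two long2short methods are given below; the state dicts, `generate`, and `is_correct` are hypothetical stand-ins, and n=8 follows the setting quoted above.

```python
import torch

def merge_by_weight_averaging(long_cot_state, short_cot_state, alpha=0.5):
    """Training-free model merging: a simple (optionally weighted) average of the two
    models' parameters; alpha=0.5 corresponds to the plain averaging described above."""
    return {name: alpha * long_cot_state[name] + (1 - alpha) * short_cot_state[name]
            for name in long_cot_state}

def shortest_correct_response(prompt, generate, is_correct, n=8):
    """Shortest rejection sampling: sample n responses and keep the shortest correct one."""
    correct = [r for r in (generate(prompt) for _ in range(n)) if is_correct(prompt, r)]
    return min(correct, key=len) if correct else None
```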

Main Results:

[Tables/figures from the paper: Kimi k1.5 long-CoT and short-CoT benchmark results]

Related Work:

[1] Li M, Zhang Y, Li Z, et al. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning[J]. arXiv preprint arXiv:2308.12032, 2023.

[2] Liu W, Zeng W, He K, et al. What makes good data for alignment? A comprehensive study of automatic data selection in instruction tuning[J]. arXiv preprint arXiv:2312.15685, 2023.

[3] Cassano F, Gouwar J, Nguyen D, et al. Multipl-e: A scalable and extensible approach to benchmarking neural code generation[J]. arXiv preprint arXiv:2208.08227, 2022.

[4] Dubey A, Jauhri A, Pandey A, et al. The llama 3 herd of models[J]. arXiv preprint arXiv:2407.21783, 2024.

[5] Li J, Fang A, Smyrnis G, et al. Datacomp-lm: In search of the next generation of training sets for language models[J]. arXiv preprint arXiv:2406.11794, 2024.

This week’s meeting host:Wang Zili
