Author: Jacob (code intelligence copilot & high-performance distributed machine learning systems). Original: https://zhuanlan.zhihu.com/p/18653363414
Interpretation of MiniMax-01 Technical Report
Interpretation of DeepSeek-V3 Technical Report
Comparison of MiniMax-01 and DeepSeek-V3
| Dimension | MiniMax-01 | DeepSeek-V3 |
| --- | --- | --- |
| Architecture | Built on a linear attention mechanism with a hybrid layout (Hybrid-Lightning) that interleaves lightning-attention and softmax-attention layers, combined with an MoE architecture (a minimal sketch follows the table). | Transformer-based, using MLA (Multi-head Latent Attention) and the DeepSeekMoE architecture, and introducing an auxiliary-loss-free load-balancing strategy (also sketched after the table). |
| Parameters | 456 billion total parameters, 45.9 billion activated per token. | 671 billion total parameters, 37 billion activated per token. |
| Training data | 14.8 trillion tokens covering academic literature, books, web content, and programming code. | 14.8 trillion tokens of high-quality, diverse text, with an optimized share of mathematics and programming samples. |
| Context window | Three-stage long-context training extends the window to 1 million tokens, with extrapolation to 4 million tokens at inference. | Two-stage context-extension training grows the window from 4K to 32K, then to 128K. |
| Training cost | Not stated explicitly; the report emphasizes high training efficiency. | 2.788 million H800 GPU hours; at the report's assumed rental price of $2 per GPU hour, about 5.576 million USD in total. |
| Multimodality | MiniMax-VL-01 adds an image encoder and image adapter for multimodal understanding. | No multimodal capability is mentioned. |
| Benchmark performance | Excels at long-context processing, performing well on long-context benchmarks such as RULER and LongBench v2. | Strong across most benchmarks, especially mathematical and coding tasks, with solid long-context understanding on FRAMES and LongBench v2. |
| Strengths | – The linear attention mechanism and hybrid architecture give it an edge on ultra-long contexts.<br>– The MoE architecture and global routing strategy improve training efficiency.<br>– Variable-length ring attention and an improved LASP algorithm further strengthen long-context processing. | – MLA and DeepSeekMoE deliver strong performance while keeping training and inference efficient.<br>– The auxiliary-loss-free load-balancing strategy and the multi-token prediction (MTP) training objective improve model quality.<br>– The FP8 mixed-precision training framework reduces training cost. |
| Limitations | – Some softmax-attention layers remain in the hybrid stack, which may limit long-context efficiency.<br>– Performance on complex programming tasks still needs improvement.<br>– Long-context retrieval and reasoning have not been evaluated in depth. | – The recommended deployment unit is large, which may burden small teams.<br>– Inference speed still has room for improvement. |
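To make the architectural contrast concrete, here is a minimal sketch of the two attention styles and a hybrid layer schedule. It is not MiniMax's actual lightning-attention kernel (which tiles the computation and handles causality block by block); it uses a generic, non-causal linear attention with an `elu`-based feature map, and the function names and the 7:1 interleaving period are illustrative assumptions based on the hybrid ratio described in the report.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # A common positive feature map for linear attention (illustrative choice).
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """O(N) attention: replaces softmax(QK^T)V with phi(Q) (phi(K)^T V).
    Non-causal for brevity; a decoder needs the causal/recurrent form,
    which lightning attention computes in tiles."""
    q, k = feature_map(q), feature_map(k)                    # (B, H, N, D)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)               # sum_n phi(k_n) v_n^T
    norm = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2))    # phi(q_n) . sum_n phi(k_n)
    return torch.einsum("bhnd,bhde->bhne", q, kv) / (norm.unsqueeze(-1) + 1e-6)

def softmax_attention(q, k, v):
    """Standard O(N^2) attention, retained every few layers in the hybrid stack."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def hybrid_attention(layer_idx, q, k, v, period=8):
    """Hybrid schedule: roughly one softmax-attention layer for every seven
    linear-attention layers, as described for Hybrid-Lightning."""
    if (layer_idx + 1) % period == 0:
        return softmax_attention(q, k, v)
    return linear_attention(q, k, v)

# Toy usage: 2 sequences, 4 heads, 16 tokens, head dim 8.
q = k = v = torch.randn(2, 4, 16, 8)
print(hybrid_attention(0, q, k, v).shape)   # linear-attention layer
print(hybrid_attention(7, q, k, v).shape)   # softmax-attention layer
```

The trade-off mirrors the table: the linear-attention layers give O(N) scaling on ultra-long inputs, while the retained softmax layers are what the limitations row flags as a potential long-context cost.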
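For DeepSeek-V3's auxiliary-loss-free load balancing, the core idea is a per-expert bias that influences only which experts are selected, not the gating weights, and is nudged between training steps according to each expert's load. The sketch below is an illustrative simplification under those assumptions; the class name, argument names, sign-based bias update, and per-step load counting are hypothetical, not DeepSeek's exact implementation.

```python
import torch

class AuxFreeRouter:
    """Illustrative sketch of bias-based (auxiliary-loss-free) load balancing:
    a per-expert bias steers top-k expert selection toward underloaded experts,
    while the gating weights still come from the raw affinities."""

    def __init__(self, num_experts: int, top_k: int, bias_update_speed: float = 1e-3):
        self.num_experts = num_experts
        self.top_k = top_k
        self.gamma = bias_update_speed            # bias step size (hyperparameter)
        self.bias = torch.zeros(num_experts)      # adjusted between training steps

    def route(self, affinity: torch.Tensor):
        """affinity: (num_tokens, num_experts) token-to-expert scores, e.g. sigmoid outputs."""
        # Top-k selection uses the *biased* scores ...
        _, expert_idx = (affinity + self.bias).topk(self.top_k, dim=-1)
        # ... but the gating weights use the *unbiased* affinities of the chosen experts.
        gate = torch.gather(affinity, -1, expert_idx)
        gate = gate / gate.sum(dim=-1, keepdim=True)
        return expert_idx, gate

    def update_bias(self, expert_idx: torch.Tensor):
        """After a step: push down the bias of overloaded experts, push up underloaded ones."""
        load = torch.bincount(expert_idx.flatten(), minlength=self.num_experts).float()
        self.bias -= self.gamma * torch.sign(load - load.mean())

# Toy usage: 6 tokens routed across 8 experts, top-2 per token.
router = AuxFreeRouter(num_experts=8, top_k=2)
affinity = torch.rand(6, 8)
experts, gates = router.route(affinity)
router.update_bias(experts)
print(experts.shape, gates.shape, router.bias)
```

Because balance is enforced through the bias rather than an extra loss term, the main language-modeling objective is not perturbed, which is the motivation behind calling the strategy "auxiliary-loss-free".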
Conclusion
Both MiniMax-01 and DeepSeek-V3 are innovative models aimed at breaking through existing LLM performance bottlenecks, each with its own focus:
• MiniMax-01 focuses more on long-context processing; its linear attention mechanism and hybrid architecture give it an advantage on ultra-long sequences.
• DeepSeek-V3 excels at efficient training and inference, performs outstandingly on mathematical and coding tasks, and also demonstrates strong long-context understanding.
Both adopt MoE architecture and advanced training strategies, enhancing model performance while considering training costs and efficiency.
In the future, with continuous advancements in hardware and algorithms, both MiniMax-01 and DeepSeek-V3 are expected to achieve greater breakthroughs in their respective fields, driving the development of LLMs.
Analyzing the past and present of MTP technology from DeepSeek-V3's MTP
Analysis of training optimizations in DeepSeek-V3
Calculating the training MFU of DeepSeek-V3
Interpretation of DeepSeek-V3's pre-training strategy