| | MiniMax-01 | DeepSeek-V3 |
| --- | --- | --- |
| Architecture | Built on a linear attention mechanism with a hybrid architecture (Hybrid-Lightning), combined with an MoE architecture. | Built on the Transformer architecture with MLA and DeepSeekMoE, and introduces an auxiliary-loss-free load-balancing strategy. |
| Parameters | 456 billion total parameters, 45.9 billion active parameters. | 671 billion total parameters, 37 billion active parameters. |
| Training data | 14.8 trillion tokens, covering academic literature, books, web content, and programming code. | 14.8 trillion tokens of high-quality, diverse text, with the ratio of mathematical and programming samples optimized. |
| Context length | Employs a three-phase training method, expanding the context window to 1 million tokens and ultimately extrapolating to 4 million tokens. | Employs two-phase context-extension training, expanding the context window from 4K to 32K and then to 128K. |
| Training cost | Not specified, though high training efficiency is emphasized. | 2.788 million H800 GPU hours, for a total cost of roughly 5.576 million USD (see the cost note after the table). |
| Multimodality | MiniMax-VL-01 extends the model's multimodal understanding capability by integrating an image encoder and an image adapter. | No multimodal capability is mentioned. |
| Benchmark performance | Excels at long-context processing, performing well on long-context benchmarks such as RULER and LongBench v2. | Excels on most benchmarks, especially mathematical and coding tasks, and shows strong long-context understanding, performing well on FRAMES and LongBench v2. |
| Strengths | – The linear attention mechanism and hybrid architecture give it an edge in handling ultra-long contexts (see the attention sketch after the table).<br>– The MoE architecture and global routing strategy improve training efficiency.<br>– Variable-length ring attention and an improved LASP algorithm further enhance long-context processing. | – MLA and DeepSeekMoE achieve strong performance while keeping training and inference efficient.<br>– The auxiliary-loss-free load-balancing strategy (see the routing sketch after the table) and the multi-token prediction training objective improve model performance.<br>– The FP8 mixed-precision training framework reduces training costs. |
| Limitations | – Retains some softmax attention layers in the hybrid architecture, which may affect long-context processing efficiency.<br>– Performance on complex programming tasks needs improvement.<br>– Lacks a deeper evaluation of long-context retrieval and reasoning capabilities. | – The recommended deployment unit is large, which may burden small teams.<br>– Inference speed still has room for improvement. |
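
To make the hybrid-attention idea concrete, below is a minimal PyTorch sketch that interleaves simplified (non-causal, single-head) linear-attention layers with an occasional softmax-attention layer. The class names, the elu(x)+1 feature map, and the one-softmax-layer-per-eight-layers ratio are illustrative assumptions, not MiniMax-01's actual lightning-attention implementation.

```python
import torch
import torch.nn as nn


class SimpleLinearAttention(nn.Module):
    """Non-causal linear attention with an elu(x)+1 feature map.

    Illustrative only; this is not MiniMax's lightning-attention kernel."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = torch.nn.functional.elu(q) + 1           # positive feature map
        k = torch.nn.functional.elu(k) + 1
        kv = torch.einsum("bsd,bse->bde", k, v)       # sum_s k_s v_s^T, O(L * d^2)
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)
        return self.out(torch.einsum("bsd,bde->bse", q, kv) * z)


class HybridBlockStack(nn.Module):
    """Interleaves linear-attention blocks with an occasional softmax-attention block.

    The 1-in-8 softmax ratio is an assumption used for illustration."""

    def __init__(self, dim: int, num_layers: int, softmax_every: int = 8, heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            if (i + 1) % softmax_every == 0:
                self.layers.append(nn.MultiheadAttention(dim, heads, batch_first=True))
            else:
                self.layers.append(SimpleLinearAttention(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            if isinstance(layer, nn.MultiheadAttention):
                attn_out, _ = layer(x, x, x)
            else:
                attn_out = layer(x)
            x = x + attn_out                          # residual connection
        return x


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                        # (batch, seq_len, dim)
    print(HybridBlockStack(dim=64, num_layers=8)(x).shape)  # torch.Size([2, 16, 64])
```

The point of the hybrid stack is that most layers cost linear time in sequence length, while the occasional softmax layer retains full pairwise interaction.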
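Similarly, the auxiliary-loss-free load-balancing idea can be sketched as bias-adjusted top-k routing: a per-expert bias shifts which experts get selected, the gating weights themselves stay unbiased, and the bias is nudged toward under-loaded experts. The function names, the sign-based update, and the step size below are assumptions for illustration, not DeepSeek-V3's exact rule.

```python
import torch


def biased_topk_routing(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """Select top-k experts per token using bias-adjusted scores, but weight
    expert outputs with the original (unbiased) scores."""
    adjusted = scores + bias                       # bias influences selection only
    topk_idx = adjusted.topk(k, dim=-1).indices    # (num_tokens, k)
    gate = torch.gather(scores, -1, topk_idx)      # unbiased gating weights
    gate = gate / gate.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
    return topk_idx, gate


def update_bias(bias: torch.Tensor, topk_idx: torch.Tensor, num_experts: int,
                step: float = 1e-2) -> torch.Tensor:
    """Nudge each expert's bias down if it is overloaded and up if underloaded.

    The fixed step size is an assumption for illustration."""
    load = torch.bincount(topk_idx.reshape(-1), minlength=num_experts).float()
    return bias - step * torch.sign(load - load.mean())


if __name__ == "__main__":
    num_tokens, num_experts, k = 32, 8, 2
    scores = torch.rand(num_tokens, num_experts).softmax(dim=-1)
    bias = torch.zeros(num_experts)
    for _ in range(200):                           # the bias gradually evens out expert load
        topk_idx, gate = biased_topk_routing(scores, bias, k)
        bias = update_bias(bias, topk_idx, num_experts)
    print(torch.bincount(topk_idx.reshape(-1), minlength=num_experts))  # per-expert load
```

Because the bias enters only the selection step, balancing the load does not add an extra gradient term to the training loss, which is the point of the "auxiliary-loss-free" design.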
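As a quick check on the DeepSeek-V3 cost row: the total follows directly from the reported GPU-hour count and the rental price of $2 per H800 GPU hour assumed in the DeepSeek-V3 technical report.

$$
2.788\ \text{million GPU hours} \times \$2\ \text{per GPU hour} = \$5.576\ \text{million}
$$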