Reflections on DeepSeek-V3: Beyond Hardware, Optimize Models!

DeepSeek is backed by the quantitative investment giant High-Flyer Quant, a firm with strong capabilities in quantitative investment whose assets under management once reached the hundreds of billions. Since its founding, DeepSeek has moved quickly: it open-sourced China's first large MoE model (DeepSeek-MoE) in January 2024, released the second-generation open-source MoE model DeepSeek-V2 in May, and on September 5 announced the merger of DeepSeek-Coder-V2 and DeepSeek-V2-Chat into the new DeepSeek-V2.5 model. On December 26, the first version of DeepSeek-V3 was launched and open-sourced at the same time.

DeepSeek-V3 has 671B total parameters and was pre-trained on 14.8T high-quality tokens. Supervised fine-tuning and reinforcement learning stages then unlock more of its potential. Comprehensive evaluations show that DeepSeek-V3 surpasses other open-source models and performs comparably to leading closed-source models.

DeepSeek's results in training a super-large MoE model are indeed remarkable. Achieving this quality with only about two thousand H800 GPUs and two months of training shows that real understanding comes from hands-on practice. Reading through DeepSeek's past technical reports, one can clearly see the team's algorithm and systems capabilities improving with each release.
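As a rough sanity check on the "two thousand GPUs, two months" figure, the sketch below multiplies the 14.8T pre-training tokens by the roughly 180K H800 GPU hours per trillion tokens cited later in this article. The GPU count of 2048 is an assumption (a reading of "two thousand"), and the estimate covers pre-training only.

```python
# Back-of-the-envelope check of the pre-training budget.
TOKENS_TRILLIONS = 14.8            # pre-training tokens, from the article
GPU_HOURS_PER_TRILLION = 180_000   # H800 GPU hours per trillion tokens, from the article
NUM_GPUS = 2048                    # assumption: "two thousand" read as 2048

total_gpu_hours = TOKENS_TRILLIONS * GPU_HOURS_PER_TRILLION
wall_clock_days = total_gpu_hours / NUM_GPUS / 24

print(f"total GPU hours: {total_gpu_hours:,.0f}")
print(f"wall-clock days on {NUM_GPUS} GPUs: {wall_clock_days:.1f}")
```

Under these assumptions the estimate lands a little under two months, which is consistent with the claim above.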

This has prompted deep reflection at other AI companies. Before the release of DeepSeek-V3, many of them were focused on competing on hardware infrastructure: the class of GPU (e.g., H20, H100, H200, GB200), the number of GPUs (tens of thousands, hundreds of thousands, millions), the scale of the network cluster (TH5 or TH6?), and network bandwidth (200G, 400G, 800G, 1.6T; 900 GB/s scale-up).

DeepSeek-V3, however, was trained on only about 2K H800 GPUs and then generously open-sourced. The message to everyone is that the real contest is not hardware infrastructure but how efficiently you can optimize the model.

1. Model Structure: Adhering to the System-Algorithm Co-Design Principle

DeepSeek-V3 retains the MLA and MoE structures from V2.

  1. MLA Technology: I previously introduced this technique [1]. Similar in spirit to LoRA, it compresses the KV cache through a low-rank down-projection and folds the corresponding up-projections into the Q and O projections, so the cache never has to be decompressed repeatedly. This reduces the KV cache overhead per token. However, MLA has not received much attention, possibly because it offers no significant advantage over MQA [2] while adding system complexity.

  2. MoE Structure: Unlike Mixtral's large-expert design (where the dense model's MLP is replicated 8 times), DeepSeek-V3 uses a large number of "small experts", which significantly increases the model's sparsity (total parameter count divided by active parameter count). Compared to V2's 236B total parameters (21B active), V3 is more aggressive: it introduces 256 experts and reaches 671B total parameters while active parameters grow only to 37B, raising sparsity from about 11x to about 18x. Thanks to the sparser MoE design and system optimizations, training V3 takes only 180K GPU hours per trillion tokens (versus 172.8K for V2) despite nearly tripling the total parameter count, fully embodying the "Economical" concept from the title of V2's technical report.

  3. Other Improvements: Beyond inheriting V2's model design, V3 adopts the previously published auxiliary-loss-free strategy [3] to alleviate load imbalance among experts (which shows how much DeepSeek values innovation: results from academic exploration are quickly applied to its large models). V3 also introduces multi-token prediction (MTP), which provides additional supervision signals during training and can be combined with speculative sampling to accelerate decoding at inference time. Judging by the results in the paper, this is a worthwhile training technique.
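The auxiliary-loss-free balancing mentioned in item 3 can be illustrated with a toy sketch. The idea, as I understand it, is that each expert carries a bias that is added to its routing score only when selecting the top-k experts, while the gate values still come from the raw scores; after each step the bias is nudged down for overloaded experts and up for underloaded ones. Everything below (function names, the synthetic `skew`, the step size, the sizes) is hypothetical and chosen for illustration; it is not DeepSeek's implementation.

```python
import numpy as np

def route_with_bias(scores, bias, top_k):
    """Pick top_k experts per token using biased scores; gate with the raw scores."""
    biased = scores + bias                        # bias only influences selection
    chosen = np.argsort(-biased, axis=1)[:, :top_k]
    gates = np.take_along_axis(scores, chosen, axis=1)
    gates = gates / gates.sum(axis=1, keepdims=True)
    return chosen, gates

def update_bias(bias, chosen, num_experts, step_size):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(chosen.ravel(), minlength=num_experts)
    return bias - step_size * np.sign(load - load.mean())

rng = np.random.default_rng(0)
num_tokens, num_experts, top_k = 4096, 16, 2
skew = np.linspace(0.0, 1.0, num_experts)   # hypothetical built-in preference for later experts

def routed_batch(bias):
    """Route one synthetic batch and return (max load / mean load, chosen experts)."""
    scores = rng.random((num_tokens, num_experts)) + skew   # positive affinities
    chosen, _ = route_with_bias(scores, bias, top_k)
    load = np.bincount(chosen.ravel(), minlength=num_experts)
    return load.max() / load.mean(), chosen

bias = np.zeros(num_experts)
before, _ = routed_batch(bias)
for _ in range(200):                          # simulated training steps
    _, chosen = routed_batch(bias)
    bias = update_bias(bias, chosen, num_experts, step_size=0.01)
after, _ = routed_batch(bias)

print(f"max/mean expert load before: {before:.2f}, after: {after:.2f}")
```

In this toy setting the ratio of the busiest expert's load to the mean load falls toward 1 as the biases absorb the skew, with no extra loss term added to the training objective.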

2. Training Optimization: Innovative Application of FP8

The most notable aspect of training is the use of FP8. As far as I know, DeepSeek-V3 is the first (at least within the open-source community) to successfully train a large MoE model with FP8 mixed precision. FP8 carries a risk of numerical overflow, and MoE training is itself unstable, which is why BF16 remains the mainstream choice for training large models in practice. The difficulties with existing FP8 solutions [4] stem mainly from two sources: first, coarse-grained per-tensor E4M3 quantization suffers larger quantization errors because of individual outliers; second, the E5M2 format used in the backward pass introduces significant rounding errors.

To address these issues, DeepSeek-V3 uses the E4M3 format uniformly during training and reduces errors through fine-grained per-tile (1×128) and per-group (128×128) quantization. This design is close to the micro-scaling format [5], but current hardware does not support operations in that format, which makes FP8 matrix multiplication harder to implement (it has to be done via partial sums). Although V3 demonstrates the importance of per-tile and per-group quantization for model convergence, the paper does not report the efficiency of the resulting FP8 matrix multiplication operators, nor does it discuss how a more user-friendly quantization method would affect training stability. FP8 does, of course, save memory (especially for activations). In addition, DeepSeek-V3 stores optimizer states in BF16 and selectively recomputes certain operations (such as RMSNorm, MLA Up-Proj, and SwiGLU). These memory savings make it easier to design better parallel strategies, for example by reducing or eliminating tensor parallelism.
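To make the outlier argument concrete, here is a minimal NumPy sketch that simulates E4M3 rounding and compares one scale per tensor with one scale per 1×128 tile. It is a numerical illustration only, not DeepSeek's kernel: `fake_e4m3` is a hypothetical helper that rounds in float64, the data is synthetic, and the single outlier of 1e5 is deliberately exaggerated.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format

def fake_e4m3(x):
    """Round to the nearest E4M3-representable value (simulated in float64)."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    _, exp = np.frexp(x)              # |x| = m * 2**exp with 0.5 <= m < 1
    exp = np.maximum(exp, -5)         # below 2**-6 the format is subnormal
    step = np.exp2(exp - 4.0)         # 3 mantissa bits -> 8 steps per binade
    return np.round(x / step) * step

def quant_per_tensor(x):
    """One scale for the whole tensor, set by its largest magnitude."""
    scale = np.abs(x).max() / E4M3_MAX
    return fake_e4m3(x / scale) * scale

def quant_per_tile(x, tile=128):
    """One scale per 1 x tile slice of each row."""
    rows, cols = x.shape
    tiles = x.reshape(rows, cols // tile, tile)
    amax = np.abs(tiles).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / E4M3_MAX
    return (fake_e4m3(tiles / scale) * scale).reshape(rows, cols)

def rms_error(x, xq):
    return np.sqrt(np.mean((x - xq) ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 1024))
x[0, 0] = 1e5                          # a single hypothetical outlier

err_tensor = rms_error(x, quant_per_tensor(x))
err_tile = rms_error(x, quant_per_tile(x))
print(f"RMS error per-tensor: {err_tensor:.4f}, per-tile: {err_tile:.4f}")
```

With a per-tensor scale, the outlier pushes almost every other value into the subnormal range, where the rounding step is large relative to the data. With per-tile scales, only the one tile containing the outlier pays that price.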

For parallelism, DeepSeek-V3 combines 64-way expert parallelism, 16-way pipeline parallelism, and data parallelism (ZeRO-1). Expert parallelism introduces all-to-all communication, and because each token activates 8 experts, cross-node all-to-all overhead becomes a major system bottleneck. To reduce it, V3 works at two levels. At the algorithm level, it uses grouped routing that limits each token to experts on at most 4 nodes, halving cross-node traffic. At the system level, it pipelines inter-node and intra-node communication to make full use of both network and NVLink bandwidth. With these optimizations, V3 keeps the communication-to-computation ratio at about 1:1, which makes it possible to hide communication: computation and communication tasks from different micro-batches are scheduled concurrently across the forward and backward passes so that they overlap as much as possible. For pipeline parallelism, V3 designs a bidirectional pipeline similar to the one in Chimera [6] to reduce bubbles, rather than adopting the more common interleaved 1F1B (although the steady phase of interleaved 1F1B can also overlap forward and backward computation and communication).
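The grouped routing described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than V3's actual router: experts are laid out contiguously across 8 nodes (my reading of 64-way expert parallelism with 8 GPUs per node), each node is ranked by the sum of its two best expert scores, and the function and variable names are hypothetical.

```python
import numpy as np

def node_limited_topk(scores, n_nodes, max_nodes, top_k):
    """Select top_k experts per token, drawn from at most max_nodes nodes."""
    n_tokens, n_experts = scores.shape
    per_node = n_experts // n_nodes
    by_node = scores.reshape(n_tokens, n_nodes, per_node)
    # Rank each node by the sum of its best (top_k // max_nodes) expert scores.
    best = np.sort(by_node, axis=-1)[:, :, -(top_k // max_nodes):]
    node_score = best.sum(axis=-1)
    keep_nodes = np.argsort(-node_score, axis=1)[:, :max_nodes]
    node_mask = np.zeros((n_tokens, n_nodes), dtype=bool)
    np.put_along_axis(node_mask, keep_nodes, True, axis=1)
    # Expand the node mask to experts and exclude experts on dropped nodes.
    expert_mask = np.repeat(node_mask, per_node, axis=1)
    masked = np.where(expert_mask, scores, -np.inf)
    return np.argsort(-masked, axis=1)[:, :top_k]

def nodes_touched(chosen, per_node):
    """Number of distinct nodes each token's experts live on."""
    return np.array([len(np.unique(row // per_node)) for row in chosen])

rng = np.random.default_rng(0)
n_experts, n_nodes, top_k = 256, 8, 8
scores = rng.random((1000, n_experts))        # synthetic routing affinities

limited = node_limited_topk(scores, n_nodes, max_nodes=4, top_k=top_k)
free = np.argsort(-scores, axis=1)[:, :top_k]  # unrestricted top-k for comparison

per_node = n_experts // n_nodes
print("mean nodes per token, unrestricted:", nodes_touched(free, per_node).mean())
print("max nodes per token, node-limited: ", nodes_touched(limited, per_node).max())
```

Capping the number of destination nodes per token bounds how many cross-node transfers a token can trigger, at the cost of occasionally skipping a high-scoring expert that sits on a dropped node.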

3. Inference Optimization: Challenges and Responses

Deploying DeepSeek-V3 is quite challenging. For MoE models, most open-source frameworks still reuse dense-model inference schemes (Mixtral, for example, is still deployed with tensor parallelism), which causes MoE models to lose their advantage at inference time. The reason is that MoE's FLOP savings show up mainly in the compute-bound prefill phase, whereas in the memory-bound decode phase its massive parameter count incurs high data movement costs. Even if the memory-bound issue is resolved, the MoE parameters still occupy a large amount of expensive HBM, which is not cost-effective. To realize the value of the MoE architecture at inference time, the parallel strategy therefore has to change back to the DP + EP approach used in training. That means deploying MoE models on larger groups of machines and minimizing redundant storage in the expert layers, so that each device holds fewer parameters and the pressure on HBM capacity and bandwidth eases. Under this deployment scheme, load balancing and all-to-all communication become the core challenges.
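A rough calculation shows why the deployment unit has to grow. The sketch below assumes FP8 weights (1 byte per parameter), 80 GB of HBM per GPU, and perfectly even sharding of all 671B parameters. The last assumption overstates the savings, since attention weights are replicated under data parallelism, and the estimate ignores the KV cache and activations entirely.

```python
TOTAL_PARAMS_B = 671     # total parameters in billions, from the article
BYTES_PER_PARAM = 1      # assumption: FP8 weights, 1 byte each
HBM_GB = 80              # assumption: 80 GB of HBM per GPU

def weights_per_gpu_gb(num_gpus):
    """Weight footprint per GPU if all parameters were sharded evenly."""
    return TOTAL_PARAMS_B * BYTES_PER_PARAM / num_gpus

for n in (8, 32, 320):
    gb = weights_per_gpu_gb(n)
    print(f"{n:>3} GPUs: {gb:6.1f} GB of weights per GPU ({gb / HBM_GB:.0%} of HBM)")
```

Under these assumptions a single 8-GPU node cannot even hold the weights, while spreading them over hundreds of GPUs leaves each device with only a few GB of weights to store and to read on every decode step.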

Against this background, the inference scheme for DeepSeek-V3 is as follows. First, it separates prefill from decode (PD separation) so that each phase can be handled on its own terms. In the prefill phase, the attention module uses 4-way tensor parallelism plus 8-way data parallelism, while the MoE module uses 32-way expert parallelism; the goal is to maximize system throughput within the first-token latency budget (much like a training workload). In the decode phase, V3 uses 320-way expert parallelism (256 small experts + 64 hotspot experts), which effectively reduces decoding latency and alleviates load imbalance. Finally, to fill the device idle time during all-to-all communication, V3 adopts the dual-stream inference strategy from NanoFlow [7], executing computation and communication tasks from different micro-batches concurrently to improve device utilization.
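The effect of the 64 hotspot experts on load imbalance can be illustrated with a toy model. The sketch assumes one expert copy per GPU, a static per-expert load drawn from a skewed synthetic distribution, and an even split of traffic between an expert and its replica. These are simplifications for illustration, and `replicate_hot_experts` is a hypothetical helper, not part of any DeepSeek code.

```python
import numpy as np

def replicate_hot_experts(load, num_redundant):
    """Give the num_redundant busiest experts a second copy and split their load."""
    hot = np.argsort(-load)[:num_redundant]
    per_gpu = load.astype(float).copy()
    per_gpu[hot] /= 2.0                              # the original copy keeps half
    return np.concatenate([per_gpu, per_gpu[hot]])   # replicas take the other half

rng = np.random.default_rng(0)
load = rng.lognormal(mean=0.0, sigma=1.0, size=256)  # hypothetical skewed expert load

per_gpu = replicate_hot_experts(load, num_redundant=64)

print("GPUs used:", per_gpu.size)
print(f"max/mean load without replicas: {load.max() / load.mean():.2f}")
print(f"max/mean load with replicas:    {per_gpu.max() / per_gpu.mean():.2f}")
```

Since each decode step waits for the slowest device, reducing the peak per-GPU load matters more than reducing the average.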
