Cost-Saving Techniques in DeepSeek: Unveiling the Secrets

Tencent Technology “AI Future Guide”
Special Contributor: Hao Boyang
Editor: Zheng Kejun
There is no "GPU poor," only GPUs that haven't been squeezed hard enough.
The launch of DeepSeek-V3 perfectly illustrates this statement with a set of astonishing data.
While models like o1, Claude, Gemini, and Llama 3 contend with training costs running into the hundreds of millions of dollars, DeepSeek-V3 achieved performance on par with them on a budget of $5.576 million, training on a cluster of 2,048 H800 GPUs at just 3.7 days per trillion tokens.
What does this number mean? It means that training costs only 180K H800 GPU hours per trillion tokens, or 2.78 million GPU hours in total. In contrast, Llama 3.1 used 16,384 Nvidia H100 GPUs for a total of over 21 million GPU hours, tenfold higher.
With 671B total parameters, of which a precisely controlled 37B are activated per token, DeepSeek-V3 used 14.8 trillion high-quality, diverse tokens to build an AI giant that surpasses all open-source models and closely rivals GPT-4o and Claude-3.5.
Twitter is filled with admiration.
Andrej Karpathy, an early member of OpenAI, remarked that the emergence of DeepSeek-V3 might mean that large GPU clusters are not necessary to train cutting-edge language models, indicating that there is still significant room for improvement in data and algorithms for large models.
Alexandr Wang, founder of Scale AI, further stated that the painful lesson from DeepSeek-V3 is: while the U.S. rests, China works, catching up faster and becoming stronger at lower cost.
Many believe this is magic from the East. But in reality, this magic is called engineering science.
After reviewing DeepSeek's 53-page technical report, we found that both its astonishingly low training cost and its powerful capabilities can be clearly traced.
In the pre-training phase, they chose extreme compression in areas with limited performance impact; in the post-training phase, they invested heavily in areas where the model excels.
While there has been plenty of praise and debate, no one has lifted the veil on this "magic".
Tencent Technology will help you extract the most core part and explain the technical path behind DeepSeek-V3’s “cost-effective efficiency” in simpler terms.
Cost-Saving Techniques in Training:
Squeeze Everything, No Idle Time
Traditionally, reducing costs in large model training mainly relies on three strategies: compression, parallelism, and improving hardware utilization efficiency.
The methods used in DeepSeek-V3 are essentially a vigorous application of these three strategies.
Compression: From Structure to Quantization
Compression is easy to understand; it means reducing large things into smaller ones.
For model training, after compression, the amount of computation required by the computational units (GPUs and CPUs) will decrease, inevitably increasing the computation speed. Another significant impact is that memory usage and caching will decrease, allowing for a substantial reduction in the hardware scale required to train models of the same size.
In the model training process, the highest memory consumption comes from vector data.
DeepSeek-V3 used two methods to compress vector data: one is the Multi-head Latent Attention (MLA) architecture, and the other is FP8 mixed-precision training.
Multi-head Latent Attention (MLA)
The core design of the Multi-head Latent Attention (MLA) architecture is to compress the attention mechanism's Key and Value representations into a compact latent vector. In a traditional Transformer, every layer must compute and cache full Key and Value matrices for each token, and this KV data often occupies a large amount of memory. MLA reduces that cost by storing only the compressed latent and reconstructing Keys and Values from it when needed.
In other words, MLA reduces memory usage and computation by jointly compressing K and V into a shared low-dimensional representation; only this compressed representation is kept, and the full Keys and Values are projected back from it on demand.
For example, if we compare the attention mechanism to a library retrieval system, the traditional method is like creating a complete index card (Key) and content summary (Value) for each book, while DeepSeek’s method establishes an intelligent classification system that doesn’t remember specific information but retains a simple “label” (compressed Key/Value), restoring detailed information from the label when needed, like simplifying “computer technology, third floor, second row on the right” into a code like “C2-3”.
In this process, DeepSeek used low-rank compression technology (which can be understood as compressing high-dimensional matrices into products of several low-dimensional matrices), compressing KV to 512 dimensions, far smaller than the original dimensions. The low-rank compression of Key/Value reduced training memory usage by 20-30%.
Optimizing the Query side is also very significant for training efficiency. The Query can be understood as the user’s retrieval request; traditional methods allocate a large amount of computing resources for each request. DeepSeek reduces memory usage during computation by applying low-rank compression to the Query. Although this optimization has relatively little impact during the inference phase, it plays a crucial role during training, significantly enhancing training efficiency. This is akin to optimizing the query processing mechanism of a library retrieval system, allowing it to handle a large number of concurrent requests more quickly.
DeepSeek-V3 cleverly finds a balance that allows these compression techniques to have almost no impact on the model’s performance.
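To make the low-rank idea concrete, here is a minimal PyTorch sketch of the compression pattern described above. It is purely illustrative: the class and attribute names are invented, and the dimensions (7168 hidden size, a 512-dimensional latent, 128 heads of dimension 128) follow the publicly reported configuration rather than DeepSeek's actual code.

```python
import torch
import torch.nn as nn

class LowRankKV(nn.Module):
    """Illustrative MLA-style Key/Value compression: cache a small latent, not full K/V."""
    def __init__(self, d_model=7168, d_latent=512, n_heads=128, d_head=128):
        super().__init__()
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)        # compress hidden state
        self.k_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore Keys on demand
        self.v_up = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore Values on demand

    def forward(self, hidden):            # hidden: [batch, seq, d_model]
        latent = self.kv_down(hidden)     # [batch, seq, 512]: the only thing that needs caching
        k = self.k_up(latent)             # [batch, seq, n_heads * d_head]
        v = self.v_up(latent)
        return latent, k, v

# Caching the 512-dim latent instead of full per-head K/V is where the memory saving comes from.
```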
FP8 Mixed Precision Training Framework
The MLA method has been used since DeepSeek-V2, with further optimizations made this time. In DeepSeek-V3, an FP8 mixed-precision training framework is introduced and validated for the first time on a model of this scale.
FP8 uses 8 binary bits to represent numbers, which, compared to traditional 32-bit (FP32) and 16-bit (FP16) formats, sacrifices a lot of precision but occupies less space and computes faster.
It’s like replacing “exactly 358 people” with “approximately 350 people”, sacrificing some precision for efficiency. While not perfectly accurate, it suffices in many scenarios and can significantly enhance computation speed and save memory.
When using the FP8 format, DeepSeek adopts a "mixed precision" approach. During training, most of the core computational kernels run in FP8: the forward pass, the activation gradient pass, and the weight gradient pass all take FP8 inputs and produce outputs in BF16 or FP32. This design theoretically doubles computation speed compared to the original BF16 approach. In addition, activation values are stored in FP8 format for backpropagation, significantly reducing memory consumption.
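As a rough illustration of the quantize, compute, rescale pattern, here is a toy sketch using per-tensor scaling. DeepSeek's framework uses much finer-grained tile and block scaling plus hardware FP8 GEMMs, so treat this as conceptual only; it assumes PyTorch 2.1+ for the `float8_e4m3fn` dtype.

```python
import torch

FP8_MAX = 448.0  # largest magnitude representable in the E4M3 format

def to_fp8(x: torch.Tensor):
    """Scale a tensor into FP8 range and cast it down (per-tensor scaling for simplicity)."""
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """FP8 inputs, higher-precision output, mirroring the mixed-precision GEMM pattern."""
    a_q, sa = to_fp8(a)
    b_q, sb = to_fp8(b)
    # Real kernels multiply directly in FP8 on Tensor Cores; we up-cast here to emulate the math.
    out = (a_q.to(torch.float32) @ b_q.to(torch.float32)) / (sa * sb)
    return out.to(torch.bfloat16)   # outputs are kept in BF16/FP32, as described above

x = torch.randn(16, 64)
w = torch.randn(64, 32)
print((fp8_matmul(x, w) - (x @ w).to(torch.bfloat16)).abs().max())  # small quantization error
```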
For certain operators that are sensitive to low-precision computation, as well as some low-cost ones (the embedding module, output head, MoE gating modules, normalization operators, and attention operators), BF16 or even FP32 precision is retained to ensure accuracy. To maintain numerical stability, DeepSeek also stores the main weights, weight gradients, and optimizer states in higher precision.
It’s like a meticulous chef: ordinary kitchen tools suffice for daily prep, but for critical cooking steps, the best utensils are used.
In model training, most of the heavy computation uses FP8, significantly saving memory and compute and speeding up the entire training process. But DeepSeek also knows where precision cannot be compromised: for the final seasoning and plating (corresponding to the embedding module, output head, and so on), precise tools (BF16 or FP32 precision) must be used.
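This division of labor can be pictured as a simple precision policy keyed by module type; the names below are generic placeholders, not DeepSeek's internal naming.

```python
# Which parts of the network run in FP8 vs. higher precision (illustrative policy only).
HIGH_PRECISION = {"embedding", "output_head", "moe_gate", "layer_norm", "attention_core"}

def compute_dtype(module_name: str) -> str:
    """Heavy GEMMs go to FP8; precision-sensitive or cheap modules keep BF16/FP32."""
    if module_name in HIGH_PRECISION:
        return "bf16_or_fp32"
    return "fp8_e4m3"

for name in ["mlp_up_proj", "attention_core", "embedding", "mlp_down_proj"]:
    print(name, "->", compute_dtype(name))
```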
When using the FP8 mode in the past, the biggest challenge was error accumulation. It’s like a regular calculator (Tensor Cores’ FP8) that can only display up to two decimal places, while a scientific calculator (CUDA cores’ FP32) can show up to six decimal places. When needing to add many decimals, using a regular calculator gradually accumulates errors, leading to significant differences in results.
(DeepSeek’s proposed solution for error accumulation)
DeepSeek discovered a clever solution: instead of waiting until the end to calculate the total, it transfers the current result to a scientific calculator every 128 numbers added. To ensure this process does not affect speed, they leveraged the characteristics of the H800 GPUs: just like having two cashiers, when one is checking out a shopping basket, the other can continue scanning new items. This way, while enhancing precision, the processing speed remains largely unaffected.
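A toy version of this promotion trick, using NumPy's half precision to stand in for FP8 and single precision for the high-precision accumulator, might look like this. In the real kernels, partial results from Tensor Cores are promoted to FP32 registers on CUDA cores every 128 elements; this sketch only shows the idea.

```python
import numpy as np

def blockwise_sum(values, block=128):
    """Accumulate in low precision within a block, promote to FP32 every `block` additions."""
    total = np.float32(0.0)                  # the "scientific calculator"
    for start in range(0, len(values), block):
        partial = np.float16(0.0)            # the "regular calculator"
        for v in values[start:start + block]:
            partial = np.float16(partial + np.float16(v))
        total += np.float32(partial)         # hand the partial result over before errors pile up
    return total

vals = np.random.rand(10_000).astype(np.float32) * 1e-2
print(blockwise_sum(vals), vals.sum(dtype=np.float32))  # close, despite low-precision inner sums
```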
This strategy significantly boosts model training speed, as core computations can double in speed, while memory usage decreases noticeably. Furthermore, the final model’s accuracy loss can be kept below 0.25%, making it almost lossless.
Parallelism: Extreme Utilization of Hardware
To achieve faster training speeds, the most effective method is to increase the scale of parallel computation, allowing more computational units to process different data or tasks simultaneously. In parallelism, the challenge is to utilize computational resources as effectively as possible, ensuring they are all working at high loads.
At the system architecture level, DeepSeek employs expert parallel training technology by assigning different expert modules to different computing devices for simultaneous training, enhancing computational efficiency during the training process.
However, this simple parallelism is far from sufficient. DeepSeek’s approach to computational power is extreme squeezing: if we consider the training process as a factory, their main task is to ensure no idle workers on the assembly line while optimizing processes as much as possible, allowing parts (data) to be operated (computed) as soon as they enter the assembly line.
DualPipe Inter-node Communication
The main way DeepSeek optimizes this assembly line is its innovative DualPipe method.
In terms of computation and communication overlap, DualPipe adopts a multi-task parallel processing approach.
Just as modern computers can download files while processing documents, DualPipe allows the model to prepare the next required data transfer while computing. This design ensures that communication overhead is largely hidden during computation, greatly enhancing overall efficiency.
Traditional pipeline parallelism resembles an assembly line where each workstation processes tasks sequentially. When a data packet is passed from one stage to the next, waiting times often occur, producing what are known as "pipeline bubbles." These bubbles waste computational resources: workers on the assembly line must wait for upstream processes to finish before starting work. On top of that, communication time between nodes can become a performance bottleneck, just as long transfer times between workstations drag down overall production efficiency.

DualPipe introduces the concept of dual pipelines, processing two batches of products simultaneously on the same production line. When one computation stage is waiting for data transfer, it can immediately switch to processing another batch of data, fully utilizing previously idle time.
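The scheduling intuition can be shown with a toy two-batch timeline. This is a drastic simplification: the real DualPipe schedule interleaves forward and backward chunks of many micro-batches across all pipeline ranks.

```python
def dualpipe_timeline(slots=6):
    """At every time slot, one micro-batch computes while the other communicates."""
    timeline = []
    for t in range(slots):
        timeline.append({
            "micro_batch_A": "compute" if t % 2 == 0 else "communicate",
            "micro_batch_B": "communicate" if t % 2 == 0 else "compute",
        })
    return timeline

for t, slot in enumerate(dualpipe_timeline()):
    print(t, slot)   # the device is never idle: something is always computing
```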

(DualPipe schematic, where the two cells surrounded by a shared black border have overlapping computation and communication.)
This ensures that no “idle workers” are present on the assembly line.
In addition, the time from fetching a part (data) to working on it should be kept as short as possible.
Due to DeepSeek’s special design of the assembly line, the communication and computation processes can overlap. When a node is computing the current batch of data, the system has already begun preparing the expert parameters for the next batch of data transfer. When the forward computation is completed, the next required data is already in place, virtually eliminating waiting times. Most data transfer times are “hidden” within the computation process, similar to a seamlessly connected assembly line where the time taken for parts to be transported has almost no impact on overall production efficiency.
Through precise control of this overlapping process, DualPipe achieves an ideal state of near-zero communication overhead in large-scale distributed training.
According to DeepSeek’s technical report, the DualPipe algorithm reduces computation bubbles by 50%, effectively hiding communication overhead. Cross-node communication optimization enhances bandwidth utilization and reduces communication overhead by 20%.
This effectively doubles the computational efficiency compared to traditional methods.
Auxiliary-Loss-Free Load Balancing Strategy
The auxiliary-loss-free load balancing strategy is DeepSeek-V3's way of letting every "worker" show its abilities during training.
The load balancing strategy was introduced in the V2 era but has advanced further in this generation.
In a Mixture-of-Experts (MoE) system, load balancing has always been a key challenge. With many expert models in the MoE, making sure the experts that should be working are not sitting idle is crucial for both training efficiency and model quality.
Traditional methods often require the introduction of additional auxiliary loss terms to balance expert usage, similar to artificially setting quotas in a factory to ensure load balancing across production lines. This method not only increases training complexity but can also affect the model’s local optimization goals.
DeepSeek’s innovation lies in achieving natural balancing without auxiliary loss. The system dynamically adjusts each expert’s “receiving capacity” based on historical utilization rates. When an expert is consistently overloaded, the system automatically lowers its probability of accepting new tasks; conversely, for experts with low utilization, the system increases their chances of receiving tasks. This adaptive mechanism ensures long-term load balance, more like a market economy than a planned economy.
(The top two rows show expert load with the balancing strategy applied, while the bottom row shows the situation without it. The chart shows that the expert layers under the auxiliary-loss-free strategy are more evenly loaded and more specialized.)
This improvement stabilizes the training process, ensures every expert gets the chance to train, and thus increases training efficiency.
Low-Level Communication Optimization
For model training, low-level communication is also a significant issue; poor communication between hardware often stalls parts of the training production line, leaving workers with nothing to do.
DeepSeek has also made substantial optimizations in this area, developing an efficient all-to-all communication kernel across nodes. This is akin to establishing a smarter traffic light scheduling system within a highway system, fully utilizing the bandwidth of high-speed channels like InfiniBand and NVLink. These optimizations ensure that data transmission between different computing nodes always operates at peak efficiency.
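Conceptually, expert-parallel dispatch is an all-to-all exchange. A minimal sketch with torch.distributed follows, assuming an initialized process group and equal-sized chunks per rank; DeepSeek's actual kernels are custom, overlap InfiniBand and NVLink transfers, and are far more sophisticated than this.

```python
import torch
import torch.distributed as dist

def dispatch_to_experts(send_chunks):
    """send_chunks[i]: tokens this rank routes to rank i (equal sizes assumed for simplicity)."""
    recv_chunks = [torch.empty_like(chunk) for chunk in send_chunks]
    dist.all_to_all(recv_chunks, send_chunks)  # every rank exchanges a chunk with every other rank
    return recv_chunks                         # tokens that this rank's experts must now process
```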
These are not all the efficiency-enhancing measures DeepSeek employs in training, but merely the more daring innovations. Currently, other training architectures commonly remove the bias term from LayerNorm, introduce scale factors after FFN, and use RoPE relative position encoding, all of which are also employed in DeepSeek-V3. Additionally, in training strategies, DeepSeek has adopted advanced technologies such as ALiBi position encoding pre-training, Flash Attention 2 implementation, and dynamic sequence length expansion.
DeepSeek-V3 truly leaves no stone unturned in training engineering. In summary, the most important aspects include the following.
Whether it is MLA, FP8, or the DualPipe algorithm, these are bold applications of cutting-edge techniques to reduce training costs. The broad directions were already recognized possibilities in the industry, but DeepSeek fine-tuned and optimized them until they were usable at this scale and squeezed the most out of them.
Short on GPUs? Then roll up the sleeves and do the engineering: with this "East Asian magic," DeepSeek has indeed broken the Western monopoly this time.
The Secret to Exceptional Performance: Specialty Focus
DeepSeek-V3's capabilities are indeed impressive, outperforming other top open-source models such as Llama 3.1 405B and Qwen2.5 72B, and even showing stronger results on multiple metrics against the two leading closed models, Claude 3.5 Sonnet and GPT-4o.
Especially in mathematical reasoning, code generation, and long-text processing, it has reached industry-leading levels. In the GSM8K mathematical reasoning test it scored an impressive 92.1%, in the HumanEval coding evaluation it surpassed GPT-4 with a score of 88.3%, and it can also handle long texts of 32K tokens.
However, from the benchmarks and DeepSeek’s technical report, we can also see that DeepSeek-V3 has some areas of specialization. Its creative generation is relatively weak, performance in open tasks is average, and its structured thinking capabilities are much higher than its divergent thinking capabilities. It even performs better in specialized fields than in general fields.
So why is DeepSeek-V3 so strong?
First, the foundation. DeepSeek-V3 has 671B total parameters and activates 37B per token. That total is higher than Llama 3.1's 405B and far exceeds Qwen 2.5's 72B. As long as the Scaling Law hasn't hit a wall, the advantage of sheer parameter count remains very real.
Moreover, as seen in the training process described above, DeepSeek-V3 compresses everything it can while minimizing the impact on model quality.
This is the foundation of DeepSeek. However, several other key factors elevate it further.
Data Refinement
First comes data: efficient data selection translates directly into performance gains.
DeepSeek-V3 demonstrates meticulous data handling, pushed to the extreme. Its data processing strategy encompasses the entire process from raw data collection to the final training set construction.
According to DeepSeek’s technical report, in training V3, DeepSeek used 14.8 trillion tokens for pre-training. In comparison, Llama 3.1 used 15 trillion tokens, while Qwen 2.5’s training used 18 trillion tokens.
In terms of data source selection, DeepSeek-V3 adopts a more diversified data acquisition strategy. The foundational training data comes from a rigorously selected CommonCrawl corpus, ensuring data breadth and representativeness. Furthermore, the research and development team places special emphasis on incorporating domain-specific data, including large-scale code datasets, mathematical reasoning data, scientific literature, and more.
In the data cleaning phase, DeepSeek employs proprietary data filtering algorithms, implementing multi-level quality control. This process first identifies and removes duplicate content from the raw data, ensuring data uniqueness. Subsequently, low-quality content is filtered out through intelligent algorithms, including incorrectly formatted data, incomplete text fragments, and non-compliant content. This strict data cleaning process not only improves the quality of training data but also lays a solid foundation for the model’s final performance.
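As a rough illustration of what multi-level cleaning can look like (a generic sketch, not DeepSeek's proprietary filters): exact de-duplication by content hash, followed by a couple of crude quality heuristics.

```python
import hashlib

def clean_corpus(docs):
    seen, kept = set(), []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:            # stage 1: drop exact duplicates
            continue
        seen.add(digest)
        if len(text) < 200:           # stage 2: drop incomplete fragments
            continue
        clean_ratio = sum(ch.isalnum() or ch.isspace() for ch in text) / len(text)
        if clean_ratio < 0.7:         # drop garbled or mostly-symbolic content
            continue
        kept.append(text)
    return kept
```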
In terms of technical implementation for data processing, DeepSeek-V3 adopts a series of advanced processing methods. Firstly, a unified tokenizer design ensures consistency in data processing. Secondly, a dynamic sequence length adjustment mechanism enables the model to better handle inputs of varying lengths. Through data mixed sampling strategies and curriculum learning methods, they also optimize data usage efficiency during training.
MTP Technology
Next is architectural innovation.
The Multi-Token Prediction (MTP) technology introduced by DeepSeek is a game changer. The technique was actually proposed by Meta in a paper on April 30 of this year, and DeepSeek has put it into production even faster than Meta itself.
Simply put, this is also a form of parallel optimization.
Traditional language models predict one token at a time. MTP is like moving the model from reading and writing "word by word" to "phrase by phrase." During training, the model is no longer limited to predicting the single next token in the sequence; it learns to predict multiple consecutive future token positions at once. This parallel prediction mechanism not only improves training efficiency but also helps the model capture dependencies between tokens, enhancing overall performance by 2-3% while maintaining output quality.
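A minimal sketch of such a multi-token prediction objective is shown below. The shapes and the per-depth prediction heads are illustrative; DeepSeek-V3's MTP modules keep a complete causal chain at each prediction depth rather than using independent heads.

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_per_depth, tokens):
    """logits_per_depth[d-1]: [batch, seq, vocab] logits for predicting the token d steps ahead."""
    total = 0.0
    for d, logits in enumerate(logits_per_depth, start=1):
        targets = tokens[:, d:]                 # the token d positions ahead of each position
        preds = logits[:, : targets.shape[1]]
        total = total + F.cross_entropy(
            preds.reshape(-1, preds.shape[-1]), targets.reshape(-1)
        )
    return total / len(logits_per_depth)        # averaged over prediction depths
```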
During the inference phase, MTP’s advantages become even more apparent. Traditional models generate text as if they are writing “stroke by stroke,” while MTP allows for “drafting ahead” by generating multiple tokens at once. Through an innovative speculative decoding mechanism, the model can simultaneously predict multiple possible token sequences based on the current context. Even if some predictions are inaccurate and need to revert, overall efficiency is still significantly improved. This parallel generation mechanism enhances inference speed by 1.8 times while significantly reducing computational overhead.
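The inference-time idea can be sketched as a simple draft-and-verify loop. This is purely illustrative: `verify` stands in for the main model's check of each drafted token, and real speculative decoding verifies the whole draft in a single batched forward pass.

```python
def speculative_step(drafted_tokens, verify):
    """Keep the longest prefix of the draft that the main model confirms."""
    accepted = []
    for token in drafted_tokens:      # tokens proposed in one shot by the MTP head
        if verify(accepted, token):   # main model agrees with this continuation
            accepted.append(token)
        else:
            break                     # first mismatch: fall back to normal decoding
    return accepted
```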
DeepSeek-R1 Distillation
In addition to incorporating more specialized data in its selection, it is also noteworthy that during the post-training process, DeepSeek utilized R1 distillation. This not only enhances the model’s capabilities but also leads to some specialization.
The DeepSeek-R1 series is DeepSeek's latest attempt to replicate OpenAI's o1. Its preview was released only on November 21 of this year, and it has already been used for distillation into DeepSeek-V3.
The model itself is trained with reinforcement learning, and its reasoning process includes extensive reflection and verification, with chains of thought running to thousands of words. In programming and mathematics it even surpasses o1-preview on several metrics.
By distilling reasoning capabilities from the DeepSeek-R1 series model, extracting key reasoning patterns and problem-solving strategies as data to fine-tune the main DeepSeek model, and employing advanced methods like progressive curriculum learning, DeepSeek-V3 has significantly strengthened its formalized thinking abilities. Additionally, during the distillation process, V3 has optimized its structured data processing and long-sequence calculations.
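The distillation step can be thought of as turning R1-style reasoning traces into supervised fine-tuning data for the main model. The sketch below uses a hypothetical `reasoning_model.generate` interface; it is a schematic of the idea, not DeepSeek's pipeline.

```python
def build_distillation_set(problems, reasoning_model):
    """Collect long reasoning traces from the teacher to fine-tune the main model on."""
    samples = []
    for problem in problems:
        trace = reasoning_model.generate(problem)   # reflective chain of thought plus final answer
        samples.append({"prompt": problem, "target": trace})
    return samples
```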
From a data perspective, just through R1 distillation, DeepSeek V2.5 saw substantial improvements of nearly 20% in mathematics and programming.
However, as seen with o1, this kind of reinforcement-learning enhancement is hard to generalize beyond mathematics and programming, which leads to an inevitable degree of specialization in DeepSeek-V3.
Thus, while DeepSeek-V3 is powerful, there is still considerable room for optimization.
DeepSeek-V3:
An Engineering Miracle Is Also an Important Form of Value
Amidst the praise for DeepSeek-V3 on foreign networks, there are also considerable doubts.

Sam Altman seemingly mocked DeepSeek-V3 for lacking truly innovative methods, merely replicating effective ones.

This evaluation is not entirely fair. It is true that many of the core technologies in DeepSeek-V3 are not brand new: Multi-head Latent Attention (MLA) dates back to DeepSeek-V2, MTP originates from a Meta paper released in April this year, and the distillation and exploration behind R1 were likewise inspired by OpenAI and Google.
However, in the underlying engineering and parallelism techniques, DeepSeek has made plenty of innovations of its own. The auxiliary-loss-free load balancing method comes from DeepSeek's own August paper, and DualPipe is likewise a new attempt by DeepSeek.
At least in terms of engineering, DeepSeek’s innovation is not lacking.
Another influential criticism comes from Hu Yanping, chief expert at FutureLabs.
He stated on Weibo that the current development of large models faces a dual-spiral evolution. One track is the upward performance curve, pursuing deeper understanding and reasoning abilities; the other track is the downward foundational curve, focusing on enhancing efficiency and practical applicability. In this dimension, DeepSeek-V3 seems yet to fully break through the ceiling.
However, he overlooks a fundamental fact: in the era of deep learning, scale effects are catalysts for algorithmic innovation.
A major reason AI struggles to penetrate practical applications is that costs are still too high. Especially as models enter the reinforcement-learning era, o1's costs have become prohibitive for everyday use.
This is precisely where the value of DeepSeek-V3’s attempts lies. It showcases a new possibility: finding a balance between engineering implementation and theoretical innovation. It is not following the paths of OpenAI or Anthropic but pioneering a path of technological evolution that aligns with real-world constraints.
In the AI field, overemphasizing “metaphysical” theoretical innovations while underestimating breakthroughs in engineering implementations can, to some extent, hinder the true application of AI.
The technologies mentioned earlier from Meta’s April paper and those from DeepSeek’s August paper, including the R1 model released in November, all had their capabilities integrated into this latest model released at the end of the year.
DeepSeek has at least achieved the swiftest conversion of theory into reality.
