Understanding Major Misconceptions About DeepSeek After Musk’s Grok 3

Recently (on the 18th), nearly a month after the release of DeepSeek R1, Musk announced that his company xAI has developed the next-generation AI model Grok 3, claiming it is the smartest model yet, “both good and expensive.” The move has once again drawn global attention to AI large models and to DeepSeek. On January 20, the domestically developed DeepSeek series of inference models delivered a phenomenal innovation; its open-source release made it broadly accessible, which not only boosted national confidence in technological independence and self-reliance but also rapidly “supplemented” the public's knowledge of large models and skill in using them, raising scientific literacy. Amid the flood of commentary, some viewpoints are inevitably vague or exaggerated. Wenhui Daily reporter Li Nian therefore interviewed two industry experts to clarify several misconceptions: Zhu Jiaming, economist and chairman of the Academic and Technical Committee of the Hengqin Shulian Digital Finance Research Institute, and Ni Xianhao, head of the Strategic Research Center at the Beijing Academy of Artificial Intelligence.

Misconception 1: Is low cost a criterion of success? To lead the large-model trend, cost investment still needs to increase.

“From an economic perspective, one significance of DeepSeek lies in surpassing the sunk costs of the early development of AI large models. But cost reduction does not have absolute, universal significance.” A noted economist active since the 1980s who, since the turn of the century, has been deeply engaged with frontier developments in the metaverse and AI large models, Zhu Jiaming pointed out that a cost advantage does not imply long-term sustainability or repeatability. DeepSeek is still in the middle of the competition and needs to strengthen its infrastructure and keep pace with iterations of high-performance chips. “If we want to lead the trend of AI large models, increases in cost are inevitable.” Zhu Jiaming previously wrote that the contributions of DeepSeek V3 and R1 are threefold: relatively low hardware infrastructure costs, more mature reuse of algorithms, and effective control of data costs.

*Before the “emergence” of artificial intelligence, innovation from 0 to 1 often does not consider cost.

“Reducing costs and building cost advantages are important means of promoting economic growth, but the goal of cost reduction should not be absolutized.” Zhu Jiaming pointed to a common misconception here: cost reduction is not an end in itself; its purpose is to sustain innovation through long-term competition. He then analyzed how artificial intelligence has changed the cost structure of traditional industrial products. The latter have clear notions of narrow cost, broad cost, and marginal cost, whereas the cost structure, marginal concepts, and depreciation of AI large-model products are different. Their iteration speed even breaks Moore's Law: the lifecycle of an AI product is measured in years, or even in months and weeks, which is very short.

Before this round of artificial intelligence “emergence,” innovation from 0 to 1 often did not consider cost. “In the future, it will likewise be difficult to treat cost as a fixed yardstick for the development of artificial intelligence; the costs of artificial intelligence at different stages are not the same and require specific analysis.”

In this regard, Ni Xianhao, from a well-known large-model research institution, analyzed that the value of the high-cost investment, past and future, in pre-trained models represented by GPT-4, GPT-4o, and the Claude 3 and 3.5 series cannot be overshadowed or diminished by the current cost investment of inference models represented by the OpenAI o1/o3 and DeepSeek R1 series.

On the one hand, as Zhu Jiaming mentioned, before the emergence of large models, innovation from 0 to 1 often did not consider cost. On the other hand, Ni Xianhao pointed out that pre-trained models are now in the latter half of the Scaling Law, meaning they have entered a stage of more pronounced diminishing marginal returns; comparing their costs with those of inference models, which are still in the early stage of the post-training/inference Scaling Law, is not appropriate.

*The performance ceiling for inference models is approaching ever faster, and demand for computing power will continue to rise.

Low costs combined with large performance gains will inevitably draw a surge of new entrants. Since the release of DeepSeek R1, we have already seen several inference models, such as the $50 s1 replication model from Fei-Fei Li's team at Stanford University, Grok 3 Reasoning Beta, and OpenAI o3-mini. Judging from the progress and disclosures of manufacturers at home and abroad, more inference models will be released in the coming months. Ni Xianhao summarized that, counting only the major model updates since November 2024, there are already more than ten inference models, with DeepSeek R1 the most outstanding representative among them.

Given the analysis above, low costs, large performance gains, and many players will inevitably bring the performance ceiling of inference models forward rapidly. Ni Xianhao believes that, by analogy with the diminishing marginal returns of the pre-training Scaling Law in recent years, the performance ceiling for inference models may arrive in about a year. Correspondingly, the cost of inference models will keep rising along the way.

Misconception 2: Has pre-training reached its final form? Grok 3 shows the Scaling Law is still in effect.

Although the next generation of pre-trained foundational models has been slow to arrive since GPT-4, disclosures from various manufacturers indicate that next-generation foundational models such as Grok 3 have either finished training or are in training. Companies such as xAI and Meta also continue to build clusters of more than 100,000 GPUs.

Ni Xianhao analyzed that Grok 3, the large model Musk released yesterday, was trained on a cluster of 100,000 H100 GPUs and can be further optimized and upgraded as the cluster expands to 200,000 GPUs. According to the officially released evaluation results, considering foundational models alone, Grok 3 scores more than 25% higher than Gemini-2 Pro, DeepSeek V3, Claude 3.5 Sonnet, and GPT-4o on mathematics, science question answering, and programming.

Although the cost-performance ratio of pre-training looks poor next to the performance gains of inference models, Grok 3 shows that the pre-training Scaling Law is still in effect. The scaling law was first observed in the NLP field and applied to language models: as model size increases, training loss decreases, generation quality improves, and the ability to capture global information grows. Put simply, the larger the pre-trained base, the higher the performance. Zhu Jiaming cautioned, however, that domestic observers should avoid overstating this observation.
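To make the “larger base, lower loss” intuition concrete, the widely cited scaling-law study of Kaplan et al. (2020), offered here as a reference point rather than something stated in the interview, models training loss as a power law in the number of non-embedding parameters N:

L(N) ≈ (N_c / N)^α_N, with α_N roughly 0.076 and N_c a fitted constant,

so each doubling of model size buys a small but predictable reduction in loss, which is exactly the regularity that eventually runs into diminishing returns.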

“Building on pre-trained foundational models and using reinforcement learning to strengthen reasoning” is the new paradigm for inference models, and it still places high demands on the pre-trained foundational model. Ni Xianhao further believes that the s1 model, with which Fei-Fei Li's team achieved significant performance gains through data distillation on top of Qwen2.5, illustrates the importance of pre-trained foundational models in this new paradigm.
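For readers who want a sense of what “data distillation on top of a pre-trained base” looks like in practice, the sketch below is a minimal, hypothetical illustration: it fine-tunes a Qwen2.5 base model on reasoning traces collected from a stronger teacher model. The dataset file, field names, and hyperparameters are assumptions made for illustration, not the actual s1 recipe.

```python
# Minimal sketch (not the actual s1 recipe): supervised fine-tuning of a
# pre-trained Qwen2.5 base on reasoning traces distilled from a stronger
# teacher model. File name, field names, and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "Qwen/Qwen2.5-7B-Instruct"  # pre-trained foundation model named in the article
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Assumed JSONL file: each record holds a question plus a teacher-generated
# reasoning trace and final answer, e.g. {"question": ..., "reasoning": ..., "answer": ...}.
raw = load_dataset("json", data_files="distilled_traces.jsonl", split="train")

def to_text(example):
    # Concatenate prompt, chain of thought, and answer into one training string.
    return {"text": example["question"] + "\n" + example["reasoning"] + "\n" + example["answer"]}

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

dataset = (raw.map(to_text)
              .map(tokenize, batched=True,
                   remove_columns=["question", "reasoning", "answer", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="s1-style-sft",
        num_train_epochs=3,
        per_device_train_batch_size=1,
        learning_rate=1e-5,
        bf16=True,  # assumes a GPU with bfloat16 support
    ),
    train_dataset=dataset,
    # Causal-LM collator pads batches and copies input_ids into labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the sketch is the division of labor: the pre-trained base supplies the underlying capability, while a comparatively small amount of distilled post-training data steers it toward step-by-step reasoning.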

At the same time, as the performance ceiling of inference models approaches in the near future, the capability gains delivered by the still-effective pre-training Scaling Law will remain of considerable value to overall model performance.

Misconception 3: Does distillation work miracles? HLE scores keep improving and are likely to break the 50% mark by year-end.

This round of artificial intelligence development is dated from the publication of the Transformer architecture paper on June 12, 2017. From that starting point to the release of DeepSeek R1 on January 20, 2025, and the recent s1 model from Fei-Fei Li's team, which was built by distilling Google's Gemini 2.0 Flash Thinking model, the pace of technical iteration has been remarkably fast.

“For artificial intelligence to keep advancing at the frontier, the standards used to test it must keep improving as well,” Zhu Jiaming said in an interview with Securities Times a few days ago. The HLE (Humanity's Last Exam) benchmark compiles 3,000 questions designed by over 500 institutions from 50 countries and regions, covering core abilities such as knowledge reserves, logical reasoning, and cross-domain transfer. Zhu Jiaming predicts that by the end of 2025, the overall performance of large models on the HLE benchmark, currently only around 20%, is likely to break the 50% mark. What can be confirmed is that HLE will by no means be the final benchmark for testing AI large models.

Misconception 4: Has the China-US gap narrowed to a few months? Evaluate cautiously and in specific contexts.

Regarding comments suggesting that the emergence of DeepSeek has at least narrowed the gap between China and the US in AI large models from two or three years down to a few months, Ni Xianhao believes that judgments about the size of that gap, measured in years, should be made more cautiously.

At present, DeepSeek R1 is chiefly about enhancing reasoning capability through large-scale reinforcement learning and multi-stage post-training, approaching the capability level of OpenAI o1. On that basis, the model's high degree of openness and its cost-based pricing strategy have earned DeepSeek R1 a wide reputation around the world.

It is worth noting, however, that DeepSeek R1's current capability level only approaches OpenAI o1; it is still some distance from OpenAI o3, and the recently released Grok 3 Reasoning Beta also sits above it. Before the performance ceiling of inference models nears the limit of algorithmic optimization and the competition turns to “fighting over” computing power at scale, “we should not lightly judge how long the AI gap between China and the US really is.”

Misconception 5: Is commercialization more important? A dual path: cutting-edge breakthroughs and low-cost accessibility.

On the direction of artificial intelligence development, Zhu Jiaming believes it should proceed “both top-down and bottom-up,” balancing two routes: one supports cutting-edge breakthroughs, pushing the frontier outward and exploring unknown territory, which requires high-cost investment; the other pursues low cost and commercial deployment, bringing the benefits to the public.

Of course, the former route is quite challenging. Ni Xianhao believes that both the still-effective pre-training Scaling Law and the fast-growing post-training/inference Scaling Law will, now and in the future, place ever higher demands on computing power. Given the continued pursuit of model “emergence,” high-cost investment in computing power, data, and algorithmic innovation is therefore indispensable. The continuous cost reductions brought by inference optimization, meanwhile, are crucial to a real explosion of AI applications. “How to better achieve iterative innovation in inference optimization is crucial to making large models accessible at low cost, and it is one of the trends we will see in 2025.”

Zhu Jiaming has said repeatedly that artificial intelligence can surpass all traditional tools, and this phenomenal innovation proves the point once again. He notes that the most advanced microscopes resolve structures down to the angstrom level, about a millionth of the radius of a human hair, and the most advanced astronomical telescopes can observe objects 13 billion light-years away. “AI goes beyond the most advanced electron microscopes and astronomical telescopes, simulating the macro and micro physical worlds that people cannot reach and presenting them before our eyes.” Cutting-edge breakthroughs of this kind are therefore essential. He is particularly interested in the “spatial intelligence” that Fei-Fei Li's team is pursuing; exploring higher-dimensional space is an important direction for current AI, for example how spaces of more than four dimensions exist and manifest, and how to display the quantum world.

The title image is AI generated.

Unauthorized reproduction of this article is strictly prohibited, and legal action will be taken for infringement.
