The media narrative changes so quickly it leaves one dizzy. In the morning it was praising Deepseek's low cost and high cost-performance ratio, declaring that the pre-training Scaling Law is dead, that no more machines or GPU cards are needed, that cost-effectiveness is what matters, and that NVIDIA is finished; by noon, with the release of Grok 3, reportedly trained on 100,000 NVIDIA H100 cards and outperforming OpenAI's o3 mini and Deepseek R1, the narrative had swung back: the Scaling Law still holds, huge numbers of cards are still needed, NVIDIA's stock price is saved, and brute force still works miracles…
These two viewpoints are obviously contradictory; if one is true, the other must be false. So what is the truth of the matter? Let’s analyze it.
Does the Scaling Law Hold During the Pre-training Phase?
-
Does the Scaling Law hold during the pre-training phase? Of course it does. The so-called "Scaling Law hitting a wall" refers to a problem everyone shares: data is running out. Without a large amount of new data, the pre-training Scaling Law's curve slows down. Note that it slows down but does not stop; the pre-training Scaling Law has not reached its ceiling. According to the Chinchilla Scaling Law, even without new data, model performance can still improve. The recipe is simple: just increase the base model size and performance will keep rising. However, measured as computational cost against performance gain, this is not cost-effective, which is why everyone has shifted to the RL Scaling Law and the Test Time Scaling Law: the intelligence gains from these two phases are more pronounced, so the cost-effectiveness is better.
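A minimal numerical sketch of that point, using the parametric loss from the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β (the constants below are approximately the paper's fitted values; the token and parameter counts are illustrative assumptions, not any lab's real figures): with the data D held fixed, growing the parameter count N alone still lowers the predicted loss, just with diminishing returns.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are approximately the fitted values from Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

D = 20e12  # data held fixed at ~20T tokens (assumed)
for N in (70e9, 200e9, 500e9):  # grow only the model size
    print(f"N = {N/1e9:.0f}B -> predicted loss {loss(N, D):.3f}")
```

The loss keeps dropping as N grows, but each doubling buys less, which is exactly the "works, but not cost-effective" situation described above.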
-
The current methods for improving model performance, ranked by cost-effectiveness from high to low, are: Test Time Scaling Law > RL Scaling Law > pre-training phase Scaling Law (where, due to insufficient data, the only option is to increase the base model size). If a more cost-effective way to scale exists, it should be used first; the less cost-effective options are adopted only when nothing better is available. This is like shopping: nobody reaches for the poor-value item while a better-value one is still on the shelf.
-
If one day the RL Scaling Law and Test Time Scaling Law hit their ceilings, and no new cost-effective scaling law has been found, that still does not mean model performance cannot improve. Everyone can fall back on the pre-training phase Scaling Law: it does not matter that there is no new data; just increase the model size and performance will still rise. But this is essentially the fallback, a method of last resort; as long as more cost-effective methods exist, nobody will take this path.
-
Someone might ask: by your reasoning, hoarding so many GPUs does not actually help train the best models? Following the argument above, it is indeed not strictly necessary; Deepseek's 2,000 cards can also produce top-tier models. However, having many cards does compress the time it takes to try new ideas and to train large base models. Suppose you need to explore different algorithms, parameters, or data mixes for various models: with 10 new ideas and only 2,000 cards, it might take 5 days to reach a conclusion, whereas with tens of thousands of cards you could reach the same conclusion in a single day. So a large card count greatly improves exploration efficiency, and more cards do lead to more innovation; that much is certainly true.
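A toy version of that wall-clock trade-off (the 10-ideas and card-count figures come from the example above; the per-experiment GPU budget and duration are assumptions for illustration only):

```python
# Toy throughput estimate: with a fixed GPU budget per exploratory run,
# wall-clock time for a batch of ideas scales with how many runs fit in parallel.
GPUS_PER_EXPERIMENT = 1000   # assumed size of one exploratory run
DAYS_PER_EXPERIMENT = 1      # assumed duration of one run
ideas = 10

for total_gpus in (2_000, 20_000):
    parallel = max(1, total_gpus // GPUS_PER_EXPERIMENT)
    waves = -(-ideas // parallel)  # ceiling division: how many sequential batches
    print(f"{total_gpus} GPUs -> {waves * DAYS_PER_EXPERIMENT} days for {ideas} ideas")
```

Under these assumptions, 2,000 cards take 5 days and 20,000 cards take 1 day, matching the example in the paragraph above.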
Grok 3 Base Model (Compared to Deepseek V3, not the R1 logical reasoning model)
-
Why does Grok 3, as a general-purpose base model, report evaluation metrics only on mathematics, science, and coding datasets? Omitting general-capability benchmarks such as the widely used MMLU is a rather non-standard way to make comparisons. One may infer that Grok 3's general capabilities have not improved much over OpenAI's and Deepseek's models, which is why those numbers are not shown?
-
If one wants to enhance a base model's capabilities in mathematics, science, and coding, it is not particularly difficult from either a methodological or a cost perspective. The now-standard approach is similar to Deepseek V3's: distill long CoT data for mathematical and coding problems, i.e., deep-reasoning process data, from Deepseek R1, and bring that long CoT data into the base model's post-training phase or even its pre-training phase (the so-called "left foot (Deepseek base) stepping on the right foot (Deepseek R1)" bootstrapping pattern). This can significantly enhance the base model's capabilities in mathematics and coding, which is what Grok 3 promotes as its "chain reasoning and self-correction mechanisms". The evaluation metrics look better, the total amount of distilled data does not need to be large (hundreds of GB should suffice), and the cost and compute requirements are very low.
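A rough sketch of what such a distillation pipeline can look like (all names here, teacher_generate, check_answer, and the trace format, are hypothetical placeholders, not Deepseek's or xAI's actual code): sample long chain-of-thought traces from a reasoning teacher on math/coding problems, keep only traces whose final answer verifies, and pack the survivors into ordinary SFT examples for the base model's post-training.

```python
# Hypothetical sketch of long-CoT distillation into SFT data; not any lab's real pipeline.
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    reference_answer: str

def teacher_generate(question: str) -> tuple[str, str]:
    """Placeholder for querying a reasoning teacher (an R1-style model).
    Returns (long_chain_of_thought, final_answer)."""
    raise NotImplementedError

def check_answer(predicted: str, reference: str) -> bool:
    """Placeholder verifier: exact match for math, unit tests for code, etc."""
    return predicted.strip() == reference.strip()

def build_distillation_set(problems: list[Problem], samples_per_problem: int = 4) -> list[dict]:
    sft_examples = []
    for p in problems:
        for _ in range(samples_per_problem):
            cot, answer = teacher_generate(p.question)
            if check_answer(answer, p.reference_answer):  # keep only verified traces
                sft_examples.append({
                    "prompt": p.question,
                    "response": f"<think>{cot}</think>\n{answer}",  # assumed trace format
                })
                break  # one verified trace per problem is enough for this sketch
    return sft_examples  # feed into standard SFT during the base model's post-training
```

The expensive part is generating and verifying the teacher traces, not the fine-tuning itself, which is why the paragraph above calls the compute requirement low relative to pre-training.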
-
OpenAI will soon release a non-reasoning model, GPT 4.5, which likely follows a similar approach: distilling CoT data from the o3 model to boost the GPT 4.5 base model's intelligence. This "left foot stepping on the right foot" bootstrapping will be the main means of improving base model capabilities going forward.
-
The computational cost of Grok 3 is said to be 10 times that of Grok 2. Under the Chinchilla Scaling Law, best practice would be for Grok 3's training data volume to grow roughly threefold relative to Grok 2, with the model size also growing roughly threefold. (The current trend, however, is to shrink the model while adding data, i.e., the "small model, large data" approach; this departs from the compute-optimal recipe, but a smaller model is better suited to online inference, which lowers serving costs.)
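As a quick sanity check on that "threefold" figure, a back-of-the-envelope sketch using the common approximation C ≈ 6·N·D (the 10x compute number is the launch-event claim; the rest is arithmetic):

```python
# Compute-optimal (Chinchilla-style) split of a 10x compute increase.
# With C ≈ 6 * N * D, the compute-optimal recipe scales N and D roughly equally.
compute_factor = 10                       # Grok 3 vs Grok 2, as claimed at the launch
balanced_factor = compute_factor ** 0.5   # scale parameters and tokens by the same factor
print(f"model x{balanced_factor:.2f}, data x{balanced_factor:.2f}")  # ~3.16x each
```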
-
If the launch-event claim that Grok 3's computational cost is 10 times Grok 2's is true, there are two possibilities. One is that the data volume increased substantially, which would have to mean a large increase in multimodal data, for example from 10T to 30T tokens (text models today typically use 18T to 20T of data, which is basically the limit; going much beyond that requires adding multimodal data, but multimodal data does not help much in raising a large model's intelligence, so this increment should not be too large). In that case Grok 3's model size would grow by roughly three times. The second possibility is that the training data did not grow much beyond 20T, in which case Grok 3's model must be substantially larger than Grok 2, at least 4 to 5 times larger (if little new data was added, the only way to absorb the extra compute is to grow the model). Whichever possibility is true, Grok 3 is certainly much larger than Grok 2, and Grok 2 itself is not small (its release webpage claims performance exceeding Llama 3.1 405B, so neither its data nor its size will be too small; if it is a Dense model, 70B is the minimum estimate). Therefore Grok 3's size is likely out of the ordinary (estimated between 200B and 500B).
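The same C ≈ 6·N·D arithmetic makes the two scenarios above concrete (the data-growth factors are illustrative assumptions, not disclosed figures):

```python
# With total compute fixed at ~10x, whatever growth does not go into data must go into model size.
compute_factor = 10
scenarios = {
    "A: data roughly triples (e.g. 10T -> 30T via multimodal)": 3.0,
    "B: data grows only modestly beyond ~20T":                  2.0,
}
for label, data_factor in scenarios.items():
    print(f"{label}: model size ~x{compute_factor / data_factor:.1f}")
# A -> ~3.3x larger model; B -> ~5x larger (or more if data grows even less),
# consistent with the 4-5x lower-bound estimate in the paragraph above.
```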
-
It is evident that Grok 3 still adopts the "traditional" route of enlarging the base model during pre-training, the approach analyzed in the Scaling Law section above. As argued there, this has very low cost-effectiveness; the more fashionable choice would be to pour resources into RL Scaling, which is much more cost-effective. So why make such a seemingly bad trade? A possible explanation follows below.
Grok 3 Logical Reasoning Version (Deep Thinking Version, Compared to Deepseek R1)
-
As for the deep thinking version of Grok 3, setting aside the hands-on experience and judging only by the evaluation metrics, it has reached or surpassed o3 mini, and it is indeed among the best, if not the best, currently available.
-
Returning to the earlier question: knowing that enlarging the model during pre-training is not cost-effective, why does Grok 3 still take this route? The internal reason might be (pure speculation, no evidence): the RL Scaling adopted during post-training may correlate positively with base model size, i.e., for the same compute spent in the RL phase, a larger base model scales better. Only then would it be necessary to push the model size as large as possible during pre-training. We can hypothesize that Grok 3 accepts this compute-hungry, seemingly cost-ineffective method in the hope that a larger base model will substantially lift the capability of the deep thinking version.
-
A related observation: Deepseek R1 performs well and is open-source, earning plenty of praise, yet people who actually want to use it find the base model too large, hard to deploy, and resource-hungry, which is not friendly to downstream applications. So why does Deepseek insist on pushing such an oversized model toward downstream use? (The smaller distilled models look good on the metrics but reportedly perform poorly in practice.) Could it likewise be that if the base model is not large enough, the deep thinking model built on it will not be as good?
-
If the above hypothesis holds, then the three Scaling Laws (Pre-train, RL, Test Time), ranked by cost-effectiveness in improving model intelligence, are Test Time > RL > Pre-Train, which is the conclusion drawn earlier. But it also means the ceilings are nested: Test Time Scaling hits its ceiling first and depends on how far RL-phase scaling can go, and RL-phase scaling in turn depends on how far pre-training scaling has gone. If so, then whenever the RL and Test Time ceilings are reached, we can launch another round of enlarging the base model, which raises the ceiling of RL-phase scaling, which in turn lets us push RL and Test Time scaling further and reach still higher intelligence. If that is true, does it imply the technical recipe for AGI is already complete, and no new Scaling Law is actually needed?
-
The inference above rests on the premise that Grok 3's decision to spend heavily on a larger model was the result of careful deliberation or small-scale experiments, rather than simply of the old habit of "more pre-training compute is always better". If that premise does not hold, neither do the inferences. In any case, all responsibility lies with Musk. Over.