The release of LLAMA-3 is a significant event in the open-source large model community. Riding the wave of this hype, I would like to share some personal views on LLAMA-3, open-source vs. closed-source large models, and synthetic data.
1. Basic Information on LLAMA-3
The model architecture has not changed much compared to LLAMA-2. The main change is that the token vocabulary has grown from 32K in LLAMA-2 to 128K, which improves encoding efficiency. Another change is the adoption of Grouped Query Attention (GQA) across both released model sizes (LLAMA-2 used it only for its larger variants), which shrinks the KV cache during inference and thereby improves inference efficiency. In addition, the input context length has been extended from 4K to 8K, which is still somewhat short compared to competitors. The most important change is the large expansion of the training data, from 2T tokens for LLAMA-2 to about 15T tokens, an increase of roughly 7.5x, with code data expanded 4x. This is what drives the substantial improvement in LLAMA-3's coding and logical-reasoning abilities. 15T tokens is a lot of data; GPT-4 is rumored to have used about 13T tokens.
LLAMA-3 comes in three sizes: small, medium, and large. The small model has 8B parameters and performs slightly better than, or roughly on par with, Mistral 7B and Gemma 7B. The medium model has 70B parameters and currently sits between GPT-3.5 and GPT-4. The large 400B model is still in training; the design goal is a multimodal, multilingual version expected to perform on par with GPT-4/GPT-4V, otherwise Meta would likely not release it.
Contrary to many expectations, LLAMA-3 did not adopt an MoE structure, which is quite normal. The main function of MoE is to reduce training and inference costs; in terms of capability, an MoE model of the same scale generally will not outperform a dense model. Of course, at very large model scales, more thought does have to be given to reducing inference costs.
I think Meta's decision to build an 8B model for LLAMA-3 is very much the right call. For a small model, if you fix the model size, then as long as you keep adding high-quality data, its performance will keep improving. This conclusion can actually be drawn from the Chinchilla paper published in 2022. The Chinchilla-optimal training data volume is roughly 20 tokens per model parameter; for an 8B model, that would be about 160B tokens. However, we should not interpret and apply the Scaling law mechanically. The experimental data in the Chinchilla paper point to two other ways of improving model performance, even though they are not compute-optimal for training. One is to fix the model size and keep increasing the training data; performance will keep improving as long as there is a constant stream of new data to add. The other is to fix the training data volume and keep enlarging the model; performance will likewise improve. If we call growing data and model capacity together at the prescribed ratio the "Optimal Chinchilla Law", these two practices can be called the "Sub-optimal Chinchilla Law". From this we can see that until roughly the second half of 2025 we can still follow the current Scaling-law route, which generally means increasing data and model scale simultaneously to rapidly enhance model capabilities. By the second half of 2025, it will likely become difficult to find large amounts of new data, at which point breakthroughs in "synthetic data" technology will be needed so that machines can generate new training data themselves; otherwise… will model capabilities stop improving?
Not necessarily: at that point we can still enlarge the model without adding training data, and in principle model capabilities can continue to improve. However, the improvement will not come as quickly as it does under the current approach of growing training data and model scale simultaneously.
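To make the two numbers above concrete, here is a small back-of-the-envelope script. It is an illustrative sketch, not taken from any official source: the LLAMA-3-8B-like dimensions (32 layers, 32 query heads, 8 KV heads, head dimension 128) and fp16 cache precision are assumptions used only for the arithmetic, and the Chinchilla rule of thumb is simplified to 20 training tokens per parameter.

```python
# Back-of-the-envelope arithmetic for two claims above:
# (1) GQA shrinks the KV cache relative to full multi-head attention (MHA).
# (2) Chinchilla-optimal training data is roughly 20 tokens per parameter.
# All model dimensions below are illustrative assumptions, not official specs.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the K and V caches for one sequence (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return tokens_per_param * n_params

if __name__ == "__main__":
    seq_len = 8_192  # the 8K context length mentioned above
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"MHA KV cache per 8K sequence: {mha / 2**30:.1f} GiB")
    print(f"GQA KV cache per 8K sequence: {gqa / 2**30:.1f} GiB ({mha / gqa:.0f}x smaller)")

    # Chinchilla-optimal data volume for an 8B model vs. what LLAMA-3 reportedly used.
    optimal = chinchilla_optimal_tokens(8e9)
    print(f"Chinchilla-optimal tokens for 8B params: {optimal / 1e9:.0f}B")
    print(f"15T training tokens is about {15e12 / optimal:.0f}x the Chinchilla-optimal amount")
```

The point of the comparison is exactly the one made above: an 8B model trained on 15T tokens sits far beyond the Chinchilla-optimal point, i.e. the "fix the model size and keep adding data" branch of the sub-optimal regime.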
2. Open Source vs. Closed Source
Meta is currently the backbone of the open-source large model community. Current indications are that the entire LLAMA-3 series will be open-sourced, including the 400B model, which should be released within a few months. That means we will have an open-source large language model with performance comparable to GPT-4, which is great news for many complex applications (of course, a 400B model is very large, which is a practical problem in itself).
If Meta open-sources the whole LLAMA-3 series, and LLAMA-4 continues to be open-sourced (the likelihood seems high, since Meta shows a much stronger commitment to open source than Google, which appears to prioritize commercial interests), then the domestic community should focus on how to better localize the LLAMA series: expanding the Chinese token vocabulary (a minimal vocabulary-extension sketch appears at the end of this section), continuing pre-training on Chinese data at low cost, filtering out harmful content, and so on. (For various reasons LLAMA has deliberately weakened its Chinese capabilities, but that is not a major obstacle; building a good Chinese model does not necessarily require a large amount of Chinese data, as GPT-4 shows.)
As Meta keeps releasing more powerful new versions, it is possible that a strong localized derivative of LLAMA (both language and multimodal models) will appear earlier than the strongest domestically developed models, closed-source or open-source. If, within a few months, a GPT-4-level open text and multimodal model appears on the market (a "well-localized + successfully compressed" LLAMA-3 400B), it will put pressure on domestic large-model developers, open-source and closed-source alike. Calls to ban LLAMA domestically are not out of the question, and reasons for a ban are easy to find; I hope we do not reach that point.
Currently the open-source camp is indeed weaker than the closed-source camp in model capability; that is a fact. However, judging from the technical developments of the past year and a half, the gap between open-source models (foreign and domestic) and the best closed-source models is gradually narrowing, not widening; that is also a fact supported by plenty of data. So what determines the capability gap between open-source and closed-source models? I believe the steepness of the model-capability growth curve matters a great deal. If the curve is steep (capability grows quickly per unit of time, analogous to the "acceleration" of a moving object), then large amounts of compute must be invested over a short period, and closed-source models hold the advantage, mainly because resource advantages translate into performance advantages. Conversely, if the curve is flat, the gap between open-source and closed-source models shrinks and open source catches up faster. This capability gap, determined by the steepness of the growth curve, can be called the "acceleration difference" in model capability. Let us see in a few years whether the gap between open-source and closed-source models is narrowing or widening.
This depends on our technological progress in "synthetic data". If breakthroughs in "synthetic data" are achieved within the next two years, the gap between the two could widen again; if not, open-source and closed-source capabilities will end up roughly comparable. "Synthetic data" is therefore very likely the single most critical and decisive technology for large language models over the next two years.
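As a rough illustration of what "expanding the Chinese token vocabulary" might look like in practice, here is a minimal sketch using the Hugging Face transformers API. The checkpoint name and the toy list of added tokens are assumptions for illustration; a real localization effort would derive new tokens from a tokenizer trained on a large Chinese corpus and then continue pre-training so the new embeddings are actually learned.

```python
# Minimal sketch: extend a LLaMA-style tokenizer with extra Chinese tokens
# and resize the embedding matrix accordingly. Illustrative only; the token
# list below is a toy placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy example of frequent Chinese words/phrases to add as whole tokens.
new_tokens = ["人工智能", "大模型", "开源", "合成数据"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding (and output) matrix so the new token IDs have rows;
# these rows start randomly initialized and must be trained during
# continued pre-training on Chinese data.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```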
3. Synthetic Data
Overall, "synthetic data" is an emerging research direction that is still immature; no mainstream method has yet emerged that looks likely to dominate the future technical direction, and the work remains highly exploratory and uncertain. The best products observed so far that use "synthetic data" are probably DALLE-3 and Sora, both of which rely on image and video re-captioning models; the re-captioned data is, in essence, machine-generated "synthetic data" (a minimal re-captioning sketch appears at the end of this section). Substantial resources should be invested in "synthetic data" research now, both as a precaution and to build core competitiveness.
By the second half of 2025, there will likely be no more high-quality new data available for training large language models. Linearly growing data cannot keep supporting exponentially growing model capabilities. If "synthetic data" does not make breakthrough progress within the next two years, the development of large models will slow down sharply and the current pace cannot be maintained. The rapid development of AIGC is, at bottom, still riding the data dividend; if GPT-5 does not reach AGI and there is no breakthrough in synthetic data, then there is considerable doubt about whether large models can reach AGI at all.
Relying on multimodal data to significantly enhance the key capabilities of large models, such as logical reasoning, currently looks more like wishful thinking: there are no clear data or experiments to support it, and personally I do not believe this path is feasible. We should therefore not pin our hopes for further progress toward AGI on multimodal data.
The future thus depends on our progress in "synthetic data", and it leads to two different scenarios. In the first, synthetic data cannot be applied practically at scale for a long time. In that case we will see the following: the capabilities of large models essentially plateau, doubts about the current AGI technical route grow louder, and open-source and closed-source capabilities converge. That would be a disaster for many closed-source model companies (even though capabilities could still be improved by enlarging models, the growth curve would be much flatter than it is now, i.e. the "acceleration difference" would shrink, making it easier for open-source models to catch up). In the second scenario, within the next two years we either make major progress on "synthetic data" or, even without new data, develop breakthrough techniques that greatly improve the data-utilization efficiency of large models (with the same data and model size, a better-performing model indicates higher data-utilization efficiency); at present, no mainstream technique looks capable of dominating this direction either. If this happens, development continues along the Scaling law: keep adding new data and enlarging models to keep improving capabilities. In that case AGI may well be reachable via the large-model route, but the required resource investment would be orders of magnitude higher than today's, essentially astronomical.
Under such massive investment, it is questionable whether companies like Meta would continue to support open-source efforts at that scale, and at that point open-source models may fall further and further behind closed-source models.
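For readers unfamiliar with re-captioning, here is a minimal sketch of the idea referenced above for DALLE-3 and Sora: use a captioning model to generate text descriptions of existing images and treat those machine-written captions as synthetic training data. The sketch assumes the open BLIP captioning model available through Hugging Face transformers; the actual captioners behind DALLE-3 and Sora are proprietary and far more capable, so this is only an illustration of the concept.

```python
# Minimal re-captioning sketch: turn unlabeled images into
# (image, synthetic caption) pairs. Illustration of the idea only.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_name = "Salesforce/blip-image-captioning-base"  # assumed open captioner
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

def recaption(image_paths):
    """Yield (path, machine-generated caption) pairs as synthetic training data."""
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=50)
        caption = processor.decode(output_ids[0], skip_special_tokens=True)
        yield path, caption

if __name__ == "__main__":
    for path, caption in recaption(["example.jpg"]):  # placeholder file name
        print(path, "->", caption)
```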