10 Insights on Large Models by Academician Shen Xiangyang

Reprinted from Zhejiang Digital Economy Research Institute
Introduction
This article is compiled from a speech given on September 28 by Academician Shen Xiangyang, founding director of the Guangdong-Hong Kong-Macao Greater Bay Area Digital Economy Research Institute, on the theme "How Should We Think About Large Models in the Era of General Artificial Intelligence?" In his talk, Academician Shen summarized his 10 insights on large models, in the hope that they will be helpful to readers.
Here are his 10 insights:
1. Computing Power is a Barrier: Over the past decade, large models' demand for computing power has grown enormously. Today, building a large AI model comes down to GPU cards; as the joke goes, no cards, no feelings.
2. Data About Data: If GPT-5 comes out, it might use as much as 200T of training data. However, there isn't that much good data on the internet; after cleaning, around 20T may be the ceiling. Therefore, building GPT-5 will require not only existing data but also more multimodal data and even artificially synthesized data.
3. The Next Chapter of Large Models: There is much multimodal research to be done. I believe a very important direction is the unification of multimodal understanding and generation.
4. Paradigm Shift in Artificial Intelligence: With the release of o1, the approach shifted from GPT's pre-training route to today's path of autonomous learning, applying reinforcement learning at inference time in a continuous process of self-learning. This process closely resembles how humans think through and analyze problems, and it requires significant computing power.
5. Large Models Sweep Across Industries: In the wave of large model construction in China, an increasing number are industry-specific large models. This trend is evident; the proportion of general large models will continue to decrease in the future.
6. AI Agent, From Vision to Reality: The super application was there from the beginning; this super application is a super assistant, a super agent.
7. Open Source vs Closed Source: I believe Meta's Llama is not open source in the traditional sense; only the model has been released, not the source code and data. Therefore, even when using open-source systems, we must be determined to truly understand how closed-source large models are built.
8. Emphasize AI Governance: The impact of artificial intelligence on various industries and society is significant, and we need to face this challenge together.
9. Rethink Human-Machine Relationship: In every era, only the companies that get human-machine interaction right become the leading high-tech enterprises with real commercial value. It is still too early to say that OpenAI plus Microsoft represents this era; they are in the lead, but there is still much room for imagination.
10. The Essence of Intelligence: Although large models have shocked everyone, we still lack a theory of large models and deep learning, and the emergence of intelligence has yet to be clearly explained.
Below is the full text of Academician Shen Xiangyang’s speech at this forum:
I am very pleased to have the opportunity to share some recent learning and insights in artificial intelligence with everyone today in Shenzhen.
I will continue the topic of artificial intelligence introduced by Mr. Yao Qizhi and report on some of the things we are currently doing in the era of large models, especially from the perspective of technological integration and industrial transition.
In fact, the importance of technological development is not limited to the era of artificial intelligence; the entire history of human development is a history of technological development. Without technology, there is no GDP growth. We need not go back as far as drilling wood for fire or inventing the wheel; just look at the remarkable breakthroughs in physics over the past hundred years, and in artificial intelligence and computer science over the past seventy, which have brought many development opportunities.
Today, we are discussing artificial intelligence and large models. Over the past few years, everyone has been shocked, step by step, by new experiences of artificial intelligence. Even for someone like me who has worked in artificial intelligence all my life, the situation today would have been hard to imagine a few years ago.
I want to mention three examples: the first is text-to-text generation, the second is text-to-image generation, and the third is text-to-video generation. As we just discussed, ChatGPT-style AI systems now exist both internationally and domestically. For instance, before coming here to speak, I asked ChatGPT what topic someone with my background should speak on at Tencent's Young Scientist 50² Forum. Some may find that amusing, but I found it very useful.
ChatGPT is familiar to everyone. Two years ago, OpenAI released a text-to-image system. You provide a piece of text, and it generates an image. Seven months ago, they released Sora, which generates a 60-second video from a piece of text, like a video of walking through the streets of Tokyo. This is truly astounding. (Due to time constraints, I won’t play the video.)
Let me discuss the example of text-to-image generation. I am a computer graphics researcher, and I believe I have a good sense of what makes a picture good or bad. Two years ago this image appeared: the first AI-generated magazine cover, on the American fashion magazine Cosmopolitan. A digital artist in San Francisco used OpenAI's system and supplied a piece of text that produced this result. The text was: "In the vast starry sky, a female astronaut strides confidently on Mars toward a wide-angle lens." I may not have much artistic talent, but I was deeply struck when I saw this image, and I believe you would agree with me. That artificial intelligence can produce such an image is indeed remarkable.
Today, we have these remarkable technologies and products, and we are also making great efforts domestically to develop large models, covering everything from technology to models to applications. As Academician Yao mentioned, Tsinghua University has also presented many of its latest works. Therefore, I want to share with you how we should think about large models in the era of general artificial intelligence, and I would like to discuss a few of my views.
01

First Insight: Computing Power is a Barrier

Today's general artificial intelligence, large models, and deep learning are built on the rapid growth of computing power in recent years.
Over the past decade, the computing power used by large models has grown enormously: at first roughly six- to seven-fold per year, and more recently still more than fourfold per year. Now let me ask you a question: if something grows fourfold a year, how much does it grow in ten years? Think about it; I will come back to this question shortly.
Everyone knows that the company that has benefited most from this wave of artificial intelligence is Nvidia. Its shipments have grown year after year, and it has become one of only three companies in the world with a market value of around $3 trillion (alongside Microsoft and Apple). The key reason is the ever-growing demand for computing power. Demand for Nvidia chips is still surging in 2024; Elon Musk, for instance, is building a cluster of 100,000 H100 cards. Building a 10,000-card system is already a significant challenge, and a 100,000-card system is far harder still, placing extremely high demands on networking.
Today, when we discuss computing power and large models, the most important concept is the scaling law of computing power and data: the more computing power, the greater the intelligence, and so far we have not hit a ceiling. Unfortunately, the required computing power does not grow linearly with the amount of data; it grows roughly quadratically.
As the model grows, the volume of training data must grow with it, so the compute needed grows roughly as the square. That is why the demand for computing power has grown so enormously over the past decade, and why I say that building a large AI model today comes down to GPU cards; as the joke goes, no cards, no feelings.
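The "roughly quadratic" relationship can be made concrete with a back-of-the-envelope sketch. The rule of thumb below (training compute ≈ 6 × parameters × tokens) and the example sizes are my own illustrative assumptions, not figures from the speech:

```python
# Illustrative sketch (assumption, not from the speech): a widely used
# rule of thumb for dense transformer training cost is
#     compute ~ 6 * N * D   (FLOPs, N = parameters, D = training tokens).
# Compute-optimal recipes grow N and D together, so doubling the data
# while scaling the model accordingly roughly quadruples the compute --
# the "quadratic" growth described above.

def training_flops(params: float, tokens: float) -> float:
    """Rough training cost in FLOPs for a dense transformer."""
    return 6 * params * tokens

base = training_flops(params=70e9, tokens=1.4e12)      # hypothetical 70B model, 1.4T tokens
scaled = training_flops(params=140e9, tokens=2.8e12)   # both doubled

print(f"base:   {base:.2e} FLOPs")
print(f"scaled: {scaled:.2e} FLOPs ({scaled / base:.0f}x the base)")  # ~4x
```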
I just asked a question: if something grows fourfold a year, how much does it grow in ten years? Those of us in computer science know "Moore's Law," which says computing power doubles roughly every 18 months; Intel grew that way for years. Why has Nvidia now surpassed Intel? A major reason is that their growth rates differ. Doubling every 18 months gives roughly a 100-fold increase in ten years, which is already remarkable; growing fourfold a year gives roughly a million-fold increase in ten years. That growth is astonishing. Seen this way, it is understandable why Nvidia's market value has surged so rapidly over the past decade.
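A quick back-of-the-envelope calculation, simply restating the arithmetic above, makes the gap between the two growth rates concrete:

```python
# Compare ten-year growth under Moore's law (doubling every 18 months)
# with the roughly 4x-per-year growth in AI compute demand cited above.

years = 10

moores_law_growth = 2 ** (years * 12 / 18)   # ~6.7 doublings over 10 years
ai_demand_growth = 4 ** years                 # fourfold per year over 10 years

print(f"Moore's law over {years} years:  ~{moores_law_growth:,.0f}x")   # ~100x
print(f"4x per year over {years} years: ~{ai_demand_growth:,.0f}x")     # ~1,048,576x (about a million)
```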
02

Second Insight: Data About Data

Computing power, algorithms, and data are the three crucial factors in artificial intelligence. As I mentioned, training general artificial intelligence requires enormous amounts of data. When GPT-3 was released, it was said to have used 2 trillion tokens of data; by GPT-4, the figure was around 12T; with continued training, GPT-4 has likely exceeded 20T by now. Those who follow artificial intelligence know we have been waiting for GPT-5, which keeps being delayed. If GPT-5 comes out, I personally estimate it could use as much as 200T of data. But looking back, there isn't that much good data on the internet; after cleaning, around 20T may be the ceiling. So to build GPT-5, we will need more multimodal data and even artificially synthesized data on top of what already exists.
An interesting point: for the past 30 to 40 years, we have been putting our information online. We used to think we were doing it for the search engines; it turns out that those 30 to 40 years of accumulation were preparing for this ChatGPT moment, which pulls it all together and, with enormous computing power, trains an artificial intelligence model.
03

Third Insight: The Next Chapter of Large Models

Having come this far, what should we do next? First came the language model, represented by ChatGPT, whose underlying technology is natural language processing. Today we are building multimodal models, represented by GPT-4, which draw on many techniques from computer vision. The next step is embodied intelligence. Why embodied intelligence? Essentially, we need to build a world model: even a multimodal model still has no underlying physical model of the world, so we need to create one. A world model means you not only read thousands of books but also travel thousands of miles, feeding more knowledge back into your brain. That is why we should build robots, and I believe Shenzhen should be determined to develop robots and embodied intelligence. One special track within robotics is autonomous driving, a particular kind of robot that runs on designated routes.
How should we proceed? There is a great deal of multimodal research still to be done. I believe a very important direction is unifying multimodal understanding and generation. Even with Sora, the two remain separate; multimodal generation and understanding have not yet been unified. There is much research work for us to do here.
For example, several of my students founded a large model company called Jueyue Xingchen, which excels at multimodal understanding. Show the AI a picture and it can explain why the behavior in the image is captioned "acting cute in vain": the image appears to show a child rolling around on the ground while the mother remains indifferent, looking at her phone and sipping a drink, which is why the child's behavior earns that caption. AI is becoming increasingly good at understanding images.
04

Fourth Insight: Paradigm Shift in Artificial Intelligence

Two weeks ago, OpenAI released its latest model, o1. As I mentioned, GPT has kept evolving, yet after GPT-4, GPT-5 has still not appeared. People are asking whether simply scaling up the parameters of large models has hit a ceiling. No one knows: GPT-5 has not been released, and we have not built a larger model domestically either.
However, a new dimension has emerged: scaling not at pre-training time but at inference time. The approach has shifted from the original GPT route to today's path of autonomous learning, applying reinforcement learning during inference in a continuous process of self-learning.
Previously, pre-training was mainly about predicting the next character or token. The new approach is to sketch things out and see which path is correct, much as the human brain does: there is a fast system and a slow system, just as we work math problems on scratch paper to see which route works, optimizing as we think. So far only OpenAI has released such a system, and I encourage everyone to look at some of its examples.
Most importantly, the entire process closely resembles how humans think, analyze problems, draft, validate, correct, and start over. This approach opens up vast possibilities. However, doing this also requires significant computing power.
05

Fifth Insight: Large Models Sweep Across Industries

Every company must face the opportunities brought by large models, but not every company needs to build a general-purpose large model: without at least 10,000 cards, you have no real chance of doing so.
For instance, when GPT-4 was released, its training compute was about 2×10^25 FLOPs, a volume that takes roughly 10,000 A100 cards running for a year to reach. If you cannot reach that scale, there is no chance of building a true general-purpose large model. On top of a general large model, we can then build industry-specific large models, for example in finance or insurance, where perhaps 1,000 cards plus some fine-tuning can already perform excellently. For an individual company with its own internal and customer data, dozens to hundreds of cards can produce a very good model tailored to the enterprise. So it builds up layer by layer.
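As a rough sanity check of that figure, the sketch below assumes an A100 delivers about 312 TFLOPS of peak BF16 throughput and roughly 30% sustained utilization; both numbers are my own assumptions rather than figures from the speech:

```python
# Rough sanity check: how many FLOPs do 10,000 A100s deliver in a year?
# Assumptions (mine, not from the speech): ~312 TFLOPS peak BF16 per card,
# ~30% sustained utilization of that peak.

SECONDS_PER_YEAR = 365 * 24 * 3600
A100_PEAK_FLOPS = 312e12      # BF16 dense peak, per card
UTILIZATION = 0.30            # assumed sustained fraction of peak
NUM_CARDS = 10_000

total_flops = NUM_CARDS * A100_PEAK_FLOPS * UTILIZATION * SECONDS_PER_YEAR
print(f"~{total_flops:.1e} FLOPs per year")   # ~3e25, the same order as 2x10^25
```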
Of course, there is also an important dimension that I particularly like, which is the future of personal large models. Today, we are gradually accumulating data on our PCs and phones, leading to a better understanding of ourselves. In the future, I believe there will be a super-intelligent AI that helps you by collecting relevant data to build a personal large model. This is a natural development on personal terminals; a phone is a very natural fit. On the PC side, companies like Microsoft and Lenovo are also promoting the concept of AI PCs, so there are opportunities in this area as well.
In the wave of large model construction in China, an increasing number are industry-specific large models. Here is one data point: in China, large models must be approved by the Cyberspace Administration before launch. By the end of July this year, 197 models had been approved, of which about 70% were industry-specific large models and 30% were general-purpose large models. The trend is clear: the share of general-purpose large models will continue to decline. For example, financial large models can be built on top of general-purpose ones, as a Shanghai company has done for its financial clients: when Nvidia's earnings report comes out, the model can immediately summarize its highlights and problems.
06

Sixth Insight: AI Agent, From Vision to Reality

People today ask what the biggest super application of large models will be and where the greatest opportunity lies. Many are still searching for a super application. In reality, the super application has been there from the very beginning: it is a super assistant, a super agent.
I worked with Gates at Microsoft for many years, and we pondered this question. Where does the difficulty lie? To do genuinely useful work, the system must truly understand a workflow: when you ask a question, it has to break it down step by step. Today we can handle some impactful tasks, such as customer service and personal assistance, but many tasks still cannot be done. Why not? You need to build a digital brain, and the large model at the bottom is only the first step. The capabilities of large models are not yet strong enough to walk you through these tasks step by step, because such an agent must understand what the problems are, and each component requires its own skills.
Everyone has already achieved many good examples using today’s models. For instance, you can create an AI health consultant that understands cosmetics and recommends products. In the future, you will see many applications in this area.
07

Seventh Insight: Open Source and Closed Source

In the development of technology worldwide, and of Chinese technology in particular, two things have been very important.
The first is the emergence of the internet, which allows you to find all papers and materials online.
The second is open source, which has dramatically narrowed the gap with the leaders for anyone building applications. However, open source for large models is not the same as open source in the traditional sense, such as for databases. Open-source model capabilities are now approaching closed-source ones, and many companies in China are working on open-source projects; the best-known today is Meta's Llama 3.1, which claims to be nearly on par with OpenAI. I do not agree with that view, and I do not consider it traditional open source: only the model has been released, not the source code and data. So even when using open-source systems, we must be determined to truly understand how closed-source large models are built.
08

Eighth Insight: Emphasize AI Governance

Because AI is developing so rapidly, the whole world is paying close attention to AI safety. The stakes are enormous: artificial intelligence profoundly affects every industry and society as a whole, and the whole world needs to face this challenge together.
09

Ninth Insight: Rethink Human-Machine Relationship

I just introduced text-to-text, text-to-image, and text-to-video generation. How much of this is machine intelligence, and how much is the shock brought by human-machine interaction?
About ten years ago, New York Times journalist John Markoff wrote a book I greatly admire, "Machines of Loving Grace," which traces two lines of technological development: one is AI, artificial intelligence; the other is IA, intelligence augmentation, enhancing human capability through human-machine interaction. Since the advent of computers, many tasks have been assisted this way, chess being one example.
In fact, in every generation, only the companies that get human-machine interaction right become the era's leading high-tech enterprises with real commercial value. Today the interface of artificial intelligence is very clear: it is dialogue, represented by ChatGPT. But it is still too early to say that OpenAI plus Microsoft represents this era; they are in the lead, yet there is still much room for imagination.
10

Tenth Insight: The Essence of Intelligence

Although large models have shocked everyone, we still lack a theory of large models and deep learning; today any theory would be welcome. Unlike physics, which has beautiful laws describing everything from the vast universe to tiny quantum particles, artificial intelligence has no such theory, and it lacks interpretability and robustness. The deep learning framework alone does not get us to true general artificial intelligence.
The emergence of intelligence in models has never been clearly explained. Why does intelligence emerge once a model reaches a certain size? Why does a 70B-parameter model exhibit intelligence? There is no established rationale. So we are working hard on these questions. Last summer I organized a seminar at the Hong Kong University of Science and Technology on "Mathematical Theory for Emergent Intelligence," discussing the need to clarify the scientific and mathematical principles behind emergent intelligence. We need more willing explorers to join in, and programs such as Tencent's "Scientific Exploration Award" and "New Cornerstone Researchers" help more young scientists take part, giving them the confidence and conviction to tackle the hard problems that matter for the future of artificial intelligence.
Once again, congratulations to all the award winners and young scientists. The development of technology relies on the younger generation, especially in artificial intelligence. Thank you all once again.

END

Source | Tencent Technology

Copyright Statement | Copyright belongs to the original author and source. This platform’s reprint is for readers’ learning, reference, and communication purposes only. If there is any infringement, please contact for deletion.

