Shen Xiangyang
Council Chairman of the Hong Kong University of Science and Technology; Foreign Member of the US National Academy of Engineering
On September 28, the 4th "Youth Scientist 50² Forum" was held at the Southern University of Science and Technology, where Shen Xiangyang delivered a keynote speech titled "How Should We Think About Large Models in the Era of General Artificial Intelligence" and offered ten thoughts on large models.

The ten thoughts, in brief:

1. Computing power is a threshold: The compute requirements of large models have been enormous over the past decade. Today, building a large AI model is all about the cards; without them, there is no progress.

2. Data about data: If GPT-5 comes out, it may reach a data volume of 200T. However, there is not that much good data on the internet; after cleaning, about 20T may be the limit. To build GPT-5, therefore, we will need not only existing data but also more multimodal data and even artificially synthesized data.

3. The next chapter of large models: There is much research to be done on multimodal models. A very important direction is the unification of multimodal understanding and generation.

4. A paradigm shift in artificial intelligence: After o1 was released, the approach shifted from GPT-style pre-training to today's path of autonomous learning, which applies reinforcement learning at the reasoning step in a continuous process of self-learning. The whole process resembles how humans think through and analyze problems, and it requires a great deal of computing power.

5. Large models sweeping across industries: In China's wave of large-model building, industry-specific models are becoming ever more numerous. This trend is certain; the proportion of general-purpose models will keep decreasing.

6. AI agents, from vision to implementation: The super application has been there all along; it is a super assistant, a super agent.

7. Open source vs. closed source: Meta's Llama is not open source in the traditional sense; it releases only a model, without the original code and data. When using open-source systems, we must therefore truly understand how closed-source large models work.

8. Taking AI governance seriously: The impact of artificial intelligence on every industry and on society as a whole is immense, and it must be faced collectively.

9. Rethinking the human-machine relationship: In every generation of high-tech enterprise, only those who truly understand human-machine interaction become the leaders with real commercial value. It is still too early to say that OpenAI together with Microsoft represents this era; they are in the lead, but there is still much room for imagination.

10. The essence of intelligence: Although large models have delivered many shocks, we still lack a theoretical foundation for large models and deep learning. On the emergence of intelligence, people are merely talking around the question without clear explanations.

The "Youth Scientist 50² Forum" is the academic annual meeting of the New Cornerstone Science Foundation, co-hosted by the Southern University of Science and Technology, Tencent's Sustainable Social Value Division, and the New Cornerstone Science Foundation. The New Cornerstone Science Foundation was established by Tencent with a commitment of 10 billion RMB over ten years and is one of the largest public scientific foundations in China.
Its establishment and operation are concrete actions by Tencent to practice "technology for good" and to invest in science funding for the long term.

The "Youth Scientist 50² Forum" is a cross-disciplinary academic exchange platform for winners of the "Science Exploration Award." Established in 2018, the "Science Exploration Award" is a public award funded by the New Cornerstone Science Foundation and led by scientists, and it is among the highest-funded programs for young scientific talent in China. During the five-year funding period, each awardee is required to share a BIG IDEA and their latest explorations at the forum at least once. "50²" signifies that the 50 young scientists selected each year for the "Science Exploration Award" will have a significant impact on scientific and technological breakthroughs over the next 50 years.
Below is the full text of Shen Xiangyang's speech at the forum:

I am very pleased to have the opportunity today, in Shenzhen, to share some of what I have recently learned and thought about artificial intelligence. Following Mr. Yao Qizhi's discussion of artificial intelligence, I would like to report on some of the things we are doing in the era of large models, particularly from the perspective of technological integration and industrial transition.

The history of human development is, in fact, a history of technological development; without technology there would be no GDP growth. We need not look back as far as the invention of fire or the wheel: consider the remarkable breakthroughs in physics over the past 100 years and in artificial intelligence and computer science over the past 70, and many opportunities for development become visible.

Today our topic is artificial intelligence and large models. Over the past few years, everyone has been shocked, step by step, by new experiences of artificial intelligence; even someone like me, who has worked in the field for a lifetime, could hardly have imagined the current situation a few years ago.

Let me mention three examples: text-to-text generation, text-to-image generation, and text-to-video generation. As mentioned earlier, systems like ChatGPT exist not only internationally but also domestically. For example, before coming here to speak at Tencent's Youth Scientist 50² Forum, I asked ChatGPT what topic I should discuss given my background. You might find that amusing, but I found it quite useful.

ChatGPT is well known. Two years ago, OpenAI released a text-to-image system that generates an image from a given text. Seven months ago, they released Sora, which creates a 60-second video from a text prompt, such as a video of walking through the streets of Tokyo. This is all very impressive. (Due to time constraints, I won't play the video.)

Let me dwell on the text-to-image example. I specialize in computer graphics and believe I have a good sense of what makes a picture good or bad. Two years ago this photo was released, the first AI-generated image in history to appear on the cover of a fashion magazine (Cosmopolitan). A digital artist in San Francisco used OpenAI's system to generate it from a prompt along the lines of: "In the vast starry sky, a female astronaut strides confidently on Mars toward a wide-angle lens." I do not claim great artistic talent myself, but I was shocked when I saw this image, and I believe you would agree with my assessment: the AI produced an image that truly looks like a female astronaut. Artificial intelligence has reached a remarkably capable level.

With these remarkable technologies and products in view, we are also working hard domestically on large models, making progress on everything from technology to models to applications. As mentioned earlier, Academician Yao discussed many of Tsinghua's latest works. So let me share how I think we should reason about large models in the era of general artificial intelligence, in the form of several views.
First Thought: Computing Power is a Threshold
The most important driver of today's general artificial intelligence, large models, and deep learning is the overall growth of computing power in recent years.

Over the past decade, the computing power used by large models initially grew six- to seven-fold each year, and later settled above four-fold annually. Let me pose a question: if something grows four-fold a year, how many times does it grow in ten years? Think about it for a moment; I will return to this question shortly.

Everyone knows the biggest beneficiaries of this wave of AI development are companies like NVIDIA, whose shipments increase year by year as computing power steadily improves, making it one of the three companies in the world with a market value of $3 trillion (alongside Microsoft and Apple). Most importantly, this is driven by demand for computing power that grows every year. In 2024 the number of NVIDIA chips being purchased is still rising rapidly; Elon Musk, for example, is building a cluster of 100,000 H100 cards. Even acquiring 100,000 cards is very difficult, and building a working system out of 100,000 cards is harder still, with extremely demanding networking requirements.

When discussing computing power and large models today, the most important concept is scaling laws: the expansion of computing power and data. More computing power yields more intelligence, and we have not yet hit the ceiling. Unfortunately, the required computing power does not grow linearly with data volume; it grows more like a quadratic. When the model becomes larger, the training data must grow with it, so the total compute tends to grow roughly as a quadratic function. Hence the requirements for computing power have been immense over the past decade. That is why I say: to build a large AI model today, it is all about the cards; without cards, there is no progress.

Back to my question: if it grows four-fold a year, how many times will it grow in ten years? Computer scientists know Moore's Law, which says computing power doubles approximately every 18 months; Intel developed this way for years. Why has NVIDIA now surpassed Intel? A significant reason is a different growth rate. Doubling every 18 months compounds to roughly 100 times in ten years, which is already remarkable; growing four-fold a year compounds to roughly 1 million times in ten years, which is astonishing. Seen this way, it is understandable why NVIDIA's market value has grown so quickly over the past decade.
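As a quick aside on the arithmetic behind those two growth rates, here is a minimal sketch in Python comparing Moore's-law doubling with the four-fold annual growth quoted above (the rates themselves come from the speech; the script just compounds them):

```python
# Compare two compounding rates over a decade:
# Moore's law (2x every 18 months) vs. large-model compute demand (4x per year).

years = 10

# Moore's law: 120 months / 18 months per doubling ~ 6.7 doublings in a decade.
moore_growth = 2 ** (years * 12 / 18)

# Large-model compute demand: 4x per year, compounded over a decade.
llm_growth = 4 ** years

print(f"Moore's law over {years} years: ~{moore_growth:,.0f}x")  # ~100x
print(f"4x/year over {years} years:     ~{llm_growth:,.0f}x")    # 1,048,576x, about a million
```

The gap between roughly 100x and roughly a million times is the whole story behind the "it's all about the cards" remark.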
Second Thought: Data About Data
Computing power, algorithms, and data are the three key factors in artificial intelligence. As I mentioned, training general artificial intelligence requires a great deal of data. When GPT-3 was released, still in the phase when papers were being published, it was said to require on the order of 200 billion tokens of data; by the time GPT-4 came out, that had grown to about 12T, and with continued training GPT-4 is now estimated to have exceeded 20T. Those following artificial intelligence know we have been waiting a long time for GPT-5, which has not yet been released. If GPT-5 comes out, I personally predict it may reach a data volume of 200T. Looking at the internet, however, there is not that much good data; after cleaning, about 20T may be the limit. To build GPT-5, therefore, we will need not only existing data but also more multimodal data and even artificially synthesized data.

An interesting observation: over the past thirty to forty years, everyone has been sharing their information online. We used to think we were working for search engines; it turns out that our thirty to forty years of accumulation was for a moment like ChatGPT, which integrates everything and, with massive computing power, learns an artificial intelligence model from it.
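To illustrate why raw web data shrinks so dramatically after cleaning, here is a toy back-of-envelope sketch; the stage names and retention rates below are purely illustrative assumptions, chosen only to show how a 200T-scale crawl can plausibly shrink to roughly the 20T ceiling mentioned above:

```python
# Toy estimate of usable tokens after a cleaning pipeline.
# All retention rates are illustrative assumptions, not measured values.

raw_tokens_t = 200.0  # trillions of raw crawled tokens (assumed starting point)

# Hypothetical survival rate at each cleaning stage.
stages = {
    "language / quality filtering": 0.40,
    "near-duplicate removal": 0.50,
    "boilerplate / toxicity stripping": 0.50,
}

usable = raw_tokens_t
for stage, keep_rate in stages.items():
    usable *= keep_rate
    print(f"after {stage:<33} ~{usable:.0f}T tokens")
# Ends near 20T, consistent with the ceiling quoted in the text.
```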
Third Thought: The Next Chapter of Large Models
Having come this far, what should we do next? First consider language models: models represented by ChatGPT are fundamentally based on natural language processing. Today we are building multimodal models, represented by GPT-4, which incorporate many techniques from computer vision. The next step is embodied intelligence.

What is the purpose of embodied intelligence? Essentially, we need to build a world model. Even a multimodal model does not contain an underlying physical model of the world, so we need to create one. Having a world model means you must not only read thousands of books but also travel thousands of miles, feeding more knowledge back into your brain. That is why we should focus on building robots, and I believe Shenzhen should commit to developing robots and embodied intelligence. Within robotics there is a special track, autonomous driving: a particular kind of robot that operates only on predetermined roads.

How do we get there? There is much multimodal research to be done, and I believe a very important direction is the unification of multimodal understanding and generation. Even with Sora, the two remain separate; multimodal generation and understanding have not been unified, and there is much research we can do in this area.

For example, a few of my students founded a large-model company called Jieyue Xingchen (StepFun), which excels in multimodal understanding. If you show its AI a picture, it can explain why the behavior in the image is called an "ineffective skill": the image shows a child rolling on the ground while the mother, indifferent, keeps looking at her phone and drinking, so the child's behavior is labeled an ineffective skill. AI's understanding of images is improving substantially.
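To make the "unification" idea concrete, here is a purely hypothetical sketch (every class and method name is invented for illustration, not taken from any real system) of a single sequence model serving both image-to-text understanding and text-to-image generation, rather than two separate systems:

```python
# Hypothetical sketch: one shared autoregressive backbone for both directions.
# Understanding and generation differ only in which modality is decoded next.

from dataclasses import dataclass

@dataclass
class Token:
    modality: str  # "text" or "image"
    value: int     # vocabulary id (text token or image-patch code)

class UnifiedMultimodalModel:
    """One shared backbone; direction is just a matter of prompt order."""

    def next_token(self, sequence: list[Token], target_modality: str) -> Token:
        # Placeholder for a single shared transformer decoding step.
        return Token(modality=target_modality, value=0)

    def understand(self, image_tokens: list[Token]) -> list[Token]:
        # Image tokens in, text tokens out: "understanding".
        seq, out = list(image_tokens), []
        for _ in range(32):
            tok = self.next_token(seq, target_modality="text")
            seq.append(tok)
            out.append(tok)
        return out

    def generate(self, text_tokens: list[Token]) -> list[Token]:
        # Text tokens in, image tokens out: "generation", same backbone.
        seq, out = list(text_tokens), []
        for _ in range(256):
            tok = self.next_token(seq, target_modality="image")
            seq.append(tok)
            out.append(tok)
        return out
```

The design point is that both directions share one set of weights and one token space, which is what "unifying understanding and generation" asks for.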
Fourth Thought: Paradigm Shift in Artificial Intelligence
Two weeks ago, OpenAI released its latest model, o1. As I mentioned, GPT kept developing, but after GPT-4, GPT-5 has yet to be released, and people wonder whether the growth of model parameters has peaked. No one knows; it has not been released, and we have not built a larger model domestically.

A new dimension has emerged, however: shifting from scaling at pre-training time to scaling at reasoning time. This moves from the original GPT approach to today's path of autonomous learning, which applies reinforcement learning at the reasoning step in a continuous process of self-learning.

In the past we focused primarily on pre-training, whose objective is predicting the next word or token. The new approach is to draft and test which path is correct, much as the human brain works with a fast system and a slow system. It is like solving a math problem: first sketch a draft and see which path holds up, building a chain of thought and optimizing along the way. So far only OpenAI has released such a system, and I encourage everyone to explore some examples with it.

Most importantly, the whole process resembles how humans think through and analyze problems: drafting, verifying, correcting, and redoing. The search space of thought is very large, and doing this also requires significant computing power.
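As a rough illustration of the draft-verify-correct control flow described above, here is a toy sketch; the model and verifier calls are stubs with hypothetical names, since the point is the shape of inference-time search, not any particular API:

```python
# Toy sketch of inference-time search: draft several candidate reasoning
# chains, score each with a verifier, keep the best. Both calls are stubs.

import random

def draft_chain_of_thought(question: str) -> list[str]:
    # Stub: a real system would sample reasoning steps from a language model.
    return [f"step {i} toward answering: {question}" for i in range(3)]

def verify(chain: list[str]) -> float:
    # Stub: a real verifier (possibly trained with reinforcement learning)
    # would score whether the chain actually holds up.
    return random.random()

def solve(question: str, num_drafts: int = 8) -> list[str]:
    # Draft many chains, score them, keep the best: spending extra compute
    # at inference time to buy better answers.
    drafts = [draft_chain_of_thought(question) for _ in range(num_drafts)]
    return max(drafts, key=verify)

print(solve("What is the next term of the sequence 1, 1, 2, 3, 5?"))
```

Note how the compute cost scales with `num_drafts` and chain length, which is why this paradigm, like pre-training, is hungry for cards.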
Fifth Thought: Large Models Sweeping Across Industries
Every company must face the opportunities brought by large models, but not every company needs to build a general-purpose large model. If you do not have at least 10,000 cards, you will not have the opportunity to build one.

For example, when GPT-4 was released, its total training compute was about 2×10^25 FLOPs; reaching that volume requires roughly a year of running on 10,000 A100 cards. If you cannot reach that scale, you cannot build a truly general large model. Once a general model exists, you can build industry-specific models on top of it, in fields such as finance or insurance, where perhaps a thousand cards suffice for very good results with some fine-tuning. For an individual company with its own internal and customer data, dozens or hundreds of cards can produce a very good model tailored to the enterprise. It is thus a layer-by-layer construction process.

Another dimension I particularly like is the future of personal large models. We are gradually accumulating data on PCs and mobile phones, which leads to a better understanding of our needs. I believe that in the future a super-intelligent AI will help you build a personal large model from the relevant data you have collected. This is a natural development for personal devices like smartphones; in the PC domain, companies like Microsoft and Lenovo are promoting the concept of AI PCs, creating similar opportunities.

In China's wave of large-model building, industry-specific models are becoming ever more numerous. Before a large model can launch in China, approval from the Cyberspace Administration is required; by the end of July this year, 197 models had been approved, of which about 70% were industry-specific and 30% general-purpose. The trend is certain: the proportion of general-purpose models will keep decreasing. For instance, financial models can be developed on top of general models; a company in Shanghai has built a large model tailored to its financial clients that can, when NVIDIA's financial report is released, quickly summarize its highlights and issues.
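The "10,000 A100 cards for a year" figure can be sanity-checked with simple arithmetic; in the sketch below, the peak throughput and utilization are assumed round numbers rather than vendor-audited values:

```python
# Rough sanity check of "2e25 FLOPs ~ 10,000 A100s for a year".
# Peak throughput and utilization are assumptions, not audited figures.

a100_peak_flops = 312e12   # ~312 TFLOPS bf16 peak per A100 (assumed)
utilization = 0.25         # typical sustained fraction of peak (assumed)
num_gpus = 10_000
seconds_per_year = 365 * 24 * 3600

total_flops = a100_peak_flops * utilization * num_gpus * seconds_per_year
print(f"~{total_flops:.1e} FLOPs per year")  # ~2.5e25, near the quoted 2e25
```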
Sixth Thought: AI Agents, From Vision to Implementation
Today the greatest opportunity for large models lies in finding a super application. Many people are still searching for one; in fact, the super application has been there all along: it is a super assistant, a super agent.

I worked with Gates at Microsoft for many years, and we kept pondering this problem. What makes it hard? The difficulty lies in doing genuinely useful work and understanding a workflow: when you ask a question, the system must break it down step by step. There are already some impactful applications, such as customer service and personal assistants, but many tasks still cannot be accomplished. Why? To build a digital brain, the underlying large model is only the first step; its capabilities are not yet strong enough to support everything above it. An agent that can actually perform tasks needs to understand what the problems are, and each part must have the corresponding skills.

Many good examples have already emerged with current models, such as AI health consultants that understand cosmetics and can recommend products. We will see many more applications in this area.
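As a minimal illustration of the step-by-step decomposition and per-skill dispatch described above, here is a toy agent loop; all function names are hypothetical, and a real system would use a large model for planning, plus tools and memory:

```python
# Toy agent loop: decompose a request into (skill, argument) steps,
# then route each step to a matching skill. All names are hypothetical.

from typing import Callable

# Registry of narrow "skills" the agent can invoke.
SKILLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"search results for '{q}'",
    "summarize": lambda text: f"summary of: {text[:40]}...",
    "recommend": lambda ctx: f"recommendation based on: {ctx[:40]}...",
}

def decompose(request: str) -> list[tuple[str, str]]:
    # Stub planner: a real agent would ask a large model to break the
    # request into ordered (skill, argument) steps.
    return [("search", request), ("summarize", request), ("recommend", request)]

def run_agent(request: str) -> str:
    result = ""
    for skill_name, arg in decompose(request):
        skill = SKILLS[skill_name]
        result = skill(result if result else arg)  # chain each step's output
    return result

print(run_agent("find a gentle sunscreen for sensitive skin"))
```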
Seventh Thought: Open Source and Closed Source
Over the past few decades, the development of technology in the world, and especially in China, has been shaped by two important events.

The first is the emergence of the internet, which lets you find all papers and materials online.

The second is open source, which has dramatically narrowed the gap between you and the leaders when building applications. Open source for large models, however, differs from traditional open-source software and databases. The capability of open-source models is closing in on closed-source ones, and many domestic companies are also working on open-source projects; Meta's Llama 3.1 is today considered very competitive with OpenAI's models. But I disagree that this is open source in the traditional sense: it releases only a model, without the original code and data. Therefore, even when using open-source systems, we must truly understand how closed-source large models work.
Eighth Thought: Emphasizing AI Governance
Because AI is developing so rapidly, the world is paying close attention to AI safety. The stakes are enormous: artificial intelligence has a profound impact on every industry and on society as a whole, and its development is something the whole world must face together.
Ninth Thought: Rethinking the Human-Machine Relationship
I introduced text-to-text, text-to-image, and text-to-video generation earlier. How much of that is machine intelligence, and how much is the effect of human-machine interaction?

About ten years ago, New York Times journalist John Markoff wrote a book I greatly admire, "Machines of Loving Grace," which traces two lines of technological development: one is AI, artificial intelligence; the other is IA, intelligence augmentation, the enhancement of human intelligence through human-machine interaction. Ever since the advent of computers, they have assisted humans in many tasks; chess is one example.

In fact, in every generation of high-tech enterprise, only those who truly understand human-machine interaction become the leaders with real commercial value. The interface of today's artificial intelligence is already clear: it is dialogue, represented today by ChatGPT. But it is still too early to say that OpenAI together with Microsoft represents this era; they are in the lead, yet there remains much room for imagination.
Tenth Thought: The Essence of Intelligence
Although large models have shocked everyone, we still lack a theoretical foundation for large models and deep learning. We would be thrilled to have any theory at all, unlike physics, where beautiful laws describe everything from the vastness of the universe to the minutiae of quantum mechanics. Artificial intelligence today lacks such theories; it lacks interpretability and robustness, and the framework of deep learning does not reach true general artificial intelligence.

On the emergence of intelligence, people are merely talking around the question without clear explanations. Why does intelligence emerge when models reach a certain size? Why can a 70B-parameter model exhibit intelligence? There is no clear rationale. We are working hard on these questions: last summer I organized a seminar at the Hong Kong University of Science and Technology titled "Mathematical Theory for Emergent Intelligence," on the need to clarify the scientific and mathematical principles behind emergent intelligence. We need more people willing to explore this area, and initiatives like Tencent's "Science Exploration Award" and the "New Cornerstone Researcher" program encourage more young scientists to join, instilling the confidence and conviction to tackle hard problems for the future development of artificial intelligence.

Once again, congratulations to all the award winners and young scientists. The development of technology relies on the continued efforts of each generation, especially in artificial intelligence. Thank you all once again.
About Us
Tsinghua University Institute for AI International Governance (THU I-AIIG) is a research institution established by Tsinghua University in April 2020. Building on Tsinghua's existing strengths and interdisciplinary advantages in artificial intelligence and international governance, the institute focuses on major theoretical and policy issues in the international governance of artificial intelligence, aiming to enhance Tsinghua's global academic influence and policy leadership in this field and to provide intellectual support for China's active participation in the international governance of artificial intelligence.
Source | This article is reproduced from "Tencent Technology"