Sora has once again ignited the AIGC industry, accelerating the arrival of the AGI era.
Author | 36Kr Research Institute
Source | 36Kr Research Institute (ID: kr_research)
Cover Source | Visual China
In February 2024, OpenAI released its first video generation model, Sora, which lets users generate high-definition videos with smooth scene transitions and clear details from a simple text description. Compared with AI-generated videos from a year earlier, Sora achieved qualitative improvements across the board, and this breakthrough brought AIGC back into the public eye. AIGC refers to AI systems trained on large volumes of data that can generate text, audio, images, code, and other content from personalized user instructions.

Since the launch of ChatGPT in 2022, generative AI has shown enormous potential and value across application scenarios including gaming, film, publishing, finance, and digital humans. By incomplete statistics, global AIGC industry financing exceeded 190 billion yuan in 2023, with companies in the sector securing funding almost every month. In June 2023, for example, Runway raised a new round totaling $141 million from investors including Google, NVIDIA, and Salesforce, while its strong competitor Pika completed three rounds of financing in just half a year, raising a total of $55 million.
This article will analyze the direction of commercial applications of AIGC and industry development trends based on the current status of the AIGC industry ecosystem and technological development path.
Industry Ecosystem Overview
Industry Ecosystem Map: The foundational layer represented by data services needs breakthroughs, the model layer occupies a core position, and the application layer is flourishing.
Overall, the current AIGC industry ecosystem can be divided into three parts: the upstream infrastructure layer, the midstream model layer, and the downstream application layer. The infrastructure layer covers data, computing power, and model development/training platforms, forming the algorithmic foundation of the industry; the model layer includes foundational general large models, intermediate models, and open-source communities; the application layer has developed strategy generation and cross-modal generation on top of the text, audio, image, and video modalities, achieving commercial applications in industries such as finance, data analysis, and design.
Illustration: AIGC Industry Ecosystem Map
Infrastructure Layer: Data Services Become the Industry's New Growth Area, While the Computing Power and Algorithm Ecosystem Is Already Largely Settled
AIGC places extremely high demands on training data: its volume, the industries it covers, the vertical businesses it maps to, and its granularity. For pre-trained large models, multimodal datasets are crucial. In addition, to ensure that training Q&A and model outputs meet expectations, data providers must guarantee the timeliness and validity of their data. Currently, the largest open-source cross-modal database in the world is LAION-5B, while the world's first Chinese multimodal dataset at the hundred-million-sample scale, "Wukong," was open-sourced by Huawei's Noah's Ark Lab.
Since various large models entered the public eye, their token limits have troubled many developers and users. Take GPT as an example: when a user sends a command, the program automatically packs the recent conversation history (capped at 4,096 tokens) into the final prompt sent to ChatGPT. Once the conversation exceeds 4,096 tokens, earlier content falls outside the model's context and can no longer inform its reasoning, which produces AI hallucinations on complex tasks.
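A minimal sketch of this truncation behavior, assuming a hypothetical count_tokens helper in place of a real tokenizer (an illustration of the mechanism, not OpenAI's actual implementation):

```python
# Sketch: keep only the most recent messages that fit a fixed token budget.
# `count_tokens` is a hypothetical stand-in for a real tokenizer.

MAX_CONTEXT_TOKENS = 4096  # the context limit cited above

def count_tokens(text: str) -> int:
    # Rough proxy; real systems count tokens with the model's own tokenizer.
    return len(text.split())

def build_prompt(history: list[str], new_message: str) -> list[str]:
    """Walk backward through history, keeping messages while they still fit."""
    budget = MAX_CONTEXT_TOKENS - count_tokens(new_message)
    kept = []
    for message in reversed(history):
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this point is silently dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept)) + [new_message]
```

Anything dropped by the loop above is simply invisible to the model, which is why long conversations appear to "forget" their beginnings.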
Against this backdrop, developers keep seeking new solutions, and vector databases are among the most popular. The core idea of a vector database is to convert data into vectors and store them; when a user asks a question, it is likewise converted into a vector, the database is searched for the most similar vectors and their contexts, and the resulting text is returned to the user. This significantly reduces the computational load on GPT, improving response speed, while also lowering costs, supporting multimodal data, and working around GPT's token limits. As overseas vector database players such as Weaviate and MongoDB attract capital, domestic giants such as Tencent and JD are also beginning to enter the field.
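A minimal sketch of that retrieve-by-similarity loop, with a hypothetical embed() standing in for a real embedding model; production systems such as Weaviate layer approximate-nearest-neighbor indexes (e.g., HNSW) on top so search stays fast at scale:

```python
# Sketch: store document vectors, then answer a query by cosine similarity.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: a real embedding model maps text to a
    # dense vector that captures its meaning.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

docs = ["Q3 revenue report", "refund policy", "API rate limits"]
index = np.stack([embed(d) for d in docs])             # shape: (n_docs, dim)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # normalize rows

def search(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = index @ q                 # cosine similarity against every doc
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

print(search("how do refunds work?"))
```

Only the top-k retrieved passages are then handed to the LLM as context, which is how this pattern sidesteps the token limits described above.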
Compared with the data sector, the supply side of computing power and algorithm infrastructure in China is still dominated by leading enterprises, leaving relatively few opportunities for startups. That said, AI computing centers, which build on AI computing architecture to supply the application layer with the computing power, data services, and algorithm services it needs, have become a new type of public computing infrastructure.
For example, AIDC OS is a dedicated AI operating system independently developed by JiuZhang Cloud (DataCanvas). Targeting the large-scale computing power of intelligent computing centers and the internal computing clusters of medium and large enterprises, it provides unified management and scheduling of intelligent computing resources, operational support for intelligent computing businesses, and core capabilities for AI model construction, training, and inference. AIDC OS upgrades computing power operators from maintaining bare hardware to operating AI large models, and its openness and compatibility with heterogeneous computing power and diverse AI applications effectively raise the added value of computing power assets.
Model Layer: Domestic Market Players Concentrated in Foundational General Large Models, Few Intermediate Players
AIGC foundational general large models fall into open-source and closed-source categories. Closed-source models are generally accessed through paid APIs or limited trial interfaces; foreign examples include OpenAI's GPT models and Google's PaLM-E. Domestic closed-source vendors started relatively late but have rapidly improved multimodal interaction and integration with smart hardware. For instance, the recently released WAKE-AI large model from Liweike Technology offers multimodal interaction capabilities including text generation, language understanding, image recognition, and video generation, optimized specifically for AI-powered terminals. For now, WAKE-AI is used only on Liweike Technology's own smart terminals, its AI glasses and XR glasses. Going forward, Liweike Technology plans to open the platform so that more developers can quickly and cheaply deploy or customize multimodal AI on various terminals through low-code or no-code methods.
Open-source models publish their source code and datasets, allowing anyone to view or modify the code; examples include Stability AI's Stable Diffusion, Meta's Llama, xAI's Grok-1, and Aquila from China's Zhiyuan (BAAI). By comparison, closed-source models offer lower upfront costs and stable operation, while open-source models provide stronger data privacy guarantees through private deployment and iterate faster. Currently, most large-model developers in China are committed to cross-modal large models, such as Tencent's Hunyuan and Baidu's Wenxin, both capable of cross-modal generation, but a broad open-source ecosystem has yet to form.
The intermediate model market players can be roughly divided into vertical large models and intermediate integrators. Vertical large models require a high understanding of vertical industry business and resource accumulation, while intermediate integrators are responsible for combining multiple model interfaces to form a new overall model. For example, the AI game engine company RPGGO can assist individual creators in simplifying the development process and maximizing creative output based on its self-developed game engine, Zagii Engine; for game studios, RPGGO provides API linkage to enhance game development efficiency.
In terms of strategic cooperation and product layout, domestic foundational large model vendors are working to build out the intermediate and application layers, which channel their models' capabilities outward and feed data back into their foundational large model products. Liweike Technology, for example, is proactively building a multimodal AI platform for future smart terminals.
Application Layer: Text Generation Has the Longest Development History, Cross-Modal Generation Has the Greatest Potential
The application layer of the AIGC industry builds on model capabilities and insight into user needs to serve B-end or C-end customers directly; it can be understood loosely as the counterpart of the tool apps of the mobile internet era, with significant room for startups to participate in the future.
By modality, the application layer can be divided into text generation, audio generation, image generation, video generation, cross-modal generation, and strategy generation. Because NLP technology has the longest development history, text generation is the most mature application track. In this wave of AIGC development, cross-modal generation will bring the most new application scenarios. Text-to-image, text-to-video, and image/video-to-text products have already emerged; text-to-image in particular, as with Stability AI, has already validated a global consumer (C-end) user base.
According to the Quantum Bit Research Institute’s estimates of the technological maturity, application maturity, and future market scale of different modalities and application scenarios, the text generation track has the greatest potential in the text-assisted generation sector; in cross-modal generation, the text-to-image/video track has the greatest potential.
Illustration: AIGC Industry Application Layer Development Forecast by Track (circle size indicates estimated market scale for 2030)
Data Source: Quantum Bit Research Institute, compiled by 36Kr Research Institute
By 2030, China's AIGC Market Will Exceed One Trillion Yuan
According to data from the Quantum Bit Research Institute, China's AIGC market was approximately 17 billion yuan in 2023, and growth is expected to hold at around 25% through 2025, reaching 25.7 billion yuan that year. From 2025, as foundational large models gradually open up, the intermediate and application layers will grow explosively, driving the AIGC market's compound annual growth rate above 70% and pushing China's AIGC market past 60 billion yuan by 2027. From 2028, as the industry ecosystem matures and commercial applications land across industries, the market is projected to exceed one trillion yuan by 2030.
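For reference, the compound annual growth rate implied by these figures can be checked directly, e.g., from 17 billion yuan in 2023 to 25.7 billion yuan in 2025:

$$\mathrm{CAGR} = \left(\frac{25.7}{17}\right)^{1/2} - 1 \approx 23\%,$$

consistent with the "around 25%" growth cited above.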
Illustration: 2023-2030 China AIGC Industry Market Scale Forecast
Data Source: Quantum Bit Research Institute, compiled by 36Kr Research Institute
Frontier Technology Analysis
Multimodal Development Has Become Industry Consensus; On the Text Side, the Technology Path Has Converged on LLMs
According to the number of data types they process, AI models can be divided into unimodal and multimodal: unimodal models handle only one type of data, such as text, audio, or images; multimodal models handle two or more. Multimodal large models hold clear advantages on both the input and output sides. On input, different modalities complement one another, and multi-dimensional training data rapidly expands a general large model's capabilities, while multimodal input lowers the usage threshold and loses less information, markedly improving the user experience. On output, multimodal generation removes the need to stitch multiple models together, making commercialization easier.
Currently, the industry consensus has shifted from unimodal to multimodal development in AIGC large models. Influenced by applications like ChatGPT (launched in November 2022) and image generation representatives like Midjourney V5 (launched in March 2023), text and image generation applications experienced explosive growth in 2023. On February 16, 2024, OpenAI released the text-to-video application Sora, making video generation a new industry hotspot, with expectations of heightened technological and capital attention in 2024.
Illustration: Multimodal Large Model Technology Development Status
Data Source: Southwest Securities, public market data, compiled by 36Kr Research Institute
Currently, pre-training on the Transformer architecture is the mainstream approach for multimodal large models. Google's Gemini, for instance, pre-trains on different modalities and fine-tunes with additional multimodal data to improve effectiveness. With the advancement of text generation models, the LLM has become the settled technology path. Through scaling, an LLM's performance improves markedly on quantitative metrics such as perplexity (which measures how fluently the model predicts text; lower is better): as long as diverse language patterns and structures appear during training, the LLM can mimic and reproduce them with high fidelity.
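For reference, perplexity is conventionally defined over a token sequence $x_{1:N}$ as the exponentiated average negative log-likelihood, so better next-token prediction pushes it down toward 1:

$$\mathrm{PPL}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)$$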
However, multimodal technology faces a looming data-depletion problem. Labeling costs vary by data type, and collecting visual and other non-text modalities typically costs more than collecting text, so multimodal datasets, especially high-quality ones, are far scarcer than text datasets. Epoch AI data shows that amid the rapid development of AIGC large models, high-quality language data may be exhausted before 2026, while low-quality language data may be exhausted within the next 20 years.
To address data depletion, AI-synthesized data has emerged, from structured-data companies such as Mostly AI and unstructured-data companies such as DataGen. The former generates anonymized datasets whose statistical characteristics are comparable to real data; the latter provides a self-service platform on which computer vision teams create synthetic datasets. AI-synthesized data matches the modality mix that multimodal models need and can be produced quickly, effectively increasing data volume.
Path Comparison: Diffusion Models Dominate, Auto-regressive Models Still Have Potential
AI-generated video and AI-generated image share a similar underlying technical framework, mainly including Generative Adversarial Networks (GAN), Auto-regressive Models, and Diffusion Models. Currently, diffusion models have become the mainstream model for AI-generated videos.
(1) Generative Adversarial Networks (GAN)
GAN was the early mainstream image generation model. Through adversarial training between a generator and a discriminator, it improves both generation and discrimination until the generated data approaches the real data distribution, so generated images come to resemble real ones. Compared with other models, GANs have fewer parameters and are better suited to modeling single or a small number of object classes. The downside is that GAN training is unstable and the generated images lack diversity, which is why GANs have gradually been displaced by auto-regressive and diffusion models.
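The adversarial training described above is conventionally written as a two-player minimax game, in which the discriminator $D$ learns to tell real samples $x$ from generated ones $G(z)$ while the generator $G$ learns to fool it:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

The instability noted above stems from this tug-of-war: neither player optimizes a fixed objective, so training can oscillate or collapse onto a few modes.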
(2) Auto-regressive Models
Auto-regressive models use Transformers for auto-regressive image generation. The Transformer framework has two main parts, an Encoder and a Decoder, whose multi-head self-attention mechanisms can model the spatial relationships between pixels and high-level attributes (texture, semantics, and proportions) during encoding and decoding. Compared with GANs, auto-regressive models offer explicit density modeling and stable training, and the connections between frames let them generate more coherent and natural videos. However, because auto-regressive models typically carry more parameters than diffusion models, they demand more computational resources and data, which raises their training time and data costs. Still, precisely because of that headroom for parameter scaling, auto-regressive image and video generation is expected to borrow the lessons of text-domain LLMs: large-scale cross-modal training that ultimately "works miracles through brute scale."
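Concretely, "auto-regressive" means the joint distribution over a sequence of visual tokens (image patches or video frames, say) is factorized into conditionals, each predicted from everything generated so far:

$$p(x_1, \dots, x_T) = \prod_{t=1}^{T} p\big(x_t \mid x_1, \dots, x_{t-1}\big)$$

This is the same factorization text LLMs scale with, which is why the borrowing-from-LLMs argument above is plausible.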
(3) Diffusion Models
In simple terms, a diffusion model defines a Markov chain of diffusion steps that keeps adding random noise to data until only pure Gaussian noise remains, then learns the reverse process to generate images: the data distribution is systematically perturbed and then restored step by step. Take Sora as an example. Sora consists of three main Transformer components: a Visual Encoder, a Diffusion Transformer, and a Transformer Decoder. During training, given an original video X, the Visual Encoder compresses it into a lower-dimensional latent space; training then happens in that latent space, where the Diffusion Transformer adds noise and then denoises, optimizing step by step; finally, the Transformer Decoder maps the temporally and spatially compressed latent representation back to pixel space, yielding video X1. Because diffusion models are more computationally efficient and cheaper, and can deliver high image quality while compressing and upscaling data, they have gradually become the mainstream technical path for text-to-image and text-to-video generation.
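The forward (noising) chain described above is conventionally defined, in the standard DDPM formulation, as a sequence of Gaussian perturbations governed by a variance schedule $\beta_t$, with a network $\epsilon_\theta$ trained to predict the injected noise so the chain can be reversed (the article does not specify Sora's exact parameterization, so this is the generic form):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad \mathcal{L} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta(x_t, t)\big\rVert^2\Big]$$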

Illustration: Diffusion Model vs. Auto-regressive Model
Data Source: Public market data, compiled by 36Kr Research Institute
With the emergence of products such as ChatGPT, Wenxin Yiyan, and Sora, AIGC covers an increasingly rich set of scenarios and its performance is steadily maturing. Opportunity and challenge coexist: AIGC brings industries new application scenarios and business models, but also challenges that must be addressed.
For B2B Enterprises, AIGC Can Integrate Organically with Existing Businesses to Achieve Cost Reduction and Efficiency Improvement, Bringing New Opportunities to Industries such as Digital Humans, SaaS, Digital Design, and Finance
Digital Humans. The development of virtual digital humans is closely related to breakthroughs in underlying technologies in AI, CG, virtual reality, and other fields. The integration of AIGC with digital humans endows virtual beings with more "agility" and "vitality," while achieving their application in more scenarios. On one hand, AIGC technology can transform static photos into dynamic videos, achieving video effects such as face replacement and expression changes, making virtual humans more vivid and realistic; on the other hand, AI technology enhances the multimodal interaction capabilities of virtual beings, enabling automatic interaction without human intervention, allowing virtual beings to possess intrinsic "thinking" abilities, accelerating their application in more fields. Additionally, AI technology is expected to achieve "one-stop" full-process automation from creation, driving, to content generation, reducing development costs for enterprises. For instance, Quwan Technology has preliminarily established a high-naturalness virtual digital human generation technology platform that can generate virtual digital humans with over 90% facial similarity in about 10 seconds using one or a few photos; it is time-efficient, cost-effective, and has multimodal interaction capabilities, lowering the technical threshold and economic burden for ordinary users and enabling applications in scenarios such as science popularization education, live retail, and gaming animation.
SaaS. In the face of an ever-evolving market environment, maintaining digital operations on the business side and smooth upstream and downstream connections has become an inevitable choice for more and more enterprises, meaning that the SaaS industry needs to enhance its intelligence to provide services that can quickly respond, interact, and analyze decision-making value. In customer management scenarios, AIGC's text generation model can serve as a chatbot, quickly responding based on customer communication content, providing personalized interactions and proactively offering other relevant services beyond inquiries, making SaaS software easier to access and use. In business process automation scenarios, AIGC can manage enterprise business processes comprehensively through simple commands, enhancing work efficiency. For example, in financial management, it can integrate and analyze financial data to provide comprehensive financial reports and analyses; in marketing, it can dynamically generate personalized customer emails and advertisements; in supply chain management, it can automatically process upstream and downstream documents and data entry; in human resources, it can achieve intelligent interviews and automated salary assessments.
Digital Design. As foundational technologies like multimodal pre-trained large models mature, AIGC demonstrates stronger capabilities in audio, image, and video generation, with increasingly widespread applications. On one hand, image generation rapidly applies in digital design fields such as industrial design, graphic design, illustration design, and game animation production. In the early stages of work, AIGC can assist in collecting materials and quickly generating drafts; in later stages, users can achieve functions like color adjustment, composition modification, image editing, and style adjustments through text commands, reducing the threshold for design creation while minimizing basic mechanical labor. On the other hand, video generation can provide more intuitive demonstration effects in industries such as architectural design, industrial design, and game design, significantly shortening work hours.
Illustration: AIGC Integrating into Digital Design Workflow
Finance. In the face of intense market competition, the traditional finance industry has struggled to meet personalized consumer demands. The finance industry is resource-intensive, and utilizing AIGC's analysis and generation capabilities can enhance service efficiency, optimize business processes, and provide more convenient, customer-centric products and services. Specifically, AIGC is mainly applied in areas such as risk assessment, quantitative trading, and counter business processing. In the risk assessment phase, AIGC can quickly analyze dispersed, multi-dimensional trading data and behavior patterns, accurately monitoring and identifying potential risks and detecting fraud, improving risk control accuracy. In the counter service phase, AIGC can recommend more suitable products and customized financial services based on customer needs and profiles, enhancing customer satisfaction.
For B2C Enterprises, AIGC Will Help Industries such as Gaming, Film, and Publishing Enhance Content Production Efficiency and Improve Consumer Experience
Publishing. For the content-driven publishing industry, AIGC will trigger a paradigm shift in content production. On one hand, AIGC takes over part of content production itself, rapidly improving output efficiency; on the other, it can assist with editing tasks, saving editing time and freeing up human resources. Specifically, in the content production phase, AIGC's text output assists authors in completing their work, and as the technology develops it may even create content directly, with a writing style of its own. Some novel websites have already launched AIGC-assisted creation features that let authors enter keywords to automatically generate content and spark inspiration. In the editing phase, AIGC can quickly proofread articles and, by capturing trending news and events with text recognition and deep learning models, automatically analyze and select topics, enhancing editing efficiency.
Gaming. Amid increasingly fierce industry competition and ever more segmented player preferences, integrating AIGC into gaming optimizes the player experience across content, graphics, and gameplay, making games themselves more competitive. On content and gameplay: on one hand, AIGC improves NPC dialogue logic, refines tone, expression, and body language, and builds emotional connections between the environment and NPCs, deepening player interaction and immersion; on the other, given target, scene, and character inputs, AIGC can generate gameplay scripts and suggest mechanics and storylines, balancing and enriching gameplay. In addition, AIGC can help produce more polished graphics, letting staff generate images and animations from textual descriptions, which raises art-production efficiency while improving the player experience.
Film. Film production generally involves long workflows with heavy human and time costs. AIGC will empower the entire process, from planning and filming through production to promotion, significantly lowering barriers to entry while providing creative references for content. In the planning phase, deep learning algorithms can rapidly digest large numbers of released films and, given keywords, offer screenwriters script ideas; once a script is complete, they can also help polish and translate it. In the filming phase, directors can use AIGC to assist with storyboarding and shot design, while producers save the time needed for scheduling, production coordination, and budget management. In post-production, AIGC can complete basic tasks such as subtitling, editing, and color correction, and as the technology matures it can gradually take on complex work such as visual effects and animation. For example, the 2023 Oscar-winning film "Everything Everywhere All at Once" had a visual effects team of only five, who worked with Runway's AI tools for background creation, slow-motion footage, and infinitely extending images, greatly improving VFX production efficiency.
Illustration: AIGC Empowering Various Stages of Film Production
Despite AIGC Significantly Enhancing the Intelligence Level and Operational Efficiency Across Various Industries, Its Development Still Faces Certain Limitations and Challenges at the Application End
SaaS. AIGC's application in the SaaS industry raises data privacy and information security issues. To provide personalized services and support, AIGC must ingest sensitive data on enterprises' internal operations, finances, and personal transactions. Because AIGC models can memorize training inputs, they may inadvertently surface one user's private data while generating content for another, posing serious privacy-breach risks.
Digital Design. The design industry places particular importance on copyright requirements. When AIGC is trained using large-scale data from the internet and third-party datasets, it may include unauthorized data obtained through web scraping or other methods, generating derivative works with similar styles that mix existing content and new creative elements, leading to confusion over intellectual property ownership and potential legal risks and copyright disputes. Especially in the digital design field, the application of AIGC may involve significant usage and transformation of original data, raising considerable disputes over copyright ownership of generated works.
Finance. Most transactions in the finance industry require reference to information from various parties, with high accuracy requirements. However, the accuracy of AIGC's analyses based on historical and real-time information still needs improvement, and it cannot predict unexpected events. In recent years, financial institutions have launched generative AI tools such as intelligent advisors; if investors overly rely on the predictions and suggestions provided, it may lead to irrational investment behavior, exacerbating herd effects and increasing risk concentration. Additionally, AIGC can easily generate false news or misleading information, leading investors to make erroneous decisions while potentially causing abnormal market price fluctuations.
Gaming. As a form of entertainment built on real-time human-computer interaction, gaming gains a better immersive virtual-world experience from AIGC. However, unregulated narrative control and endlessly extending human-computer dialogue create significant compliance uncertainty in interactive content; if AIGC fails to filter inappropriate language effectively, players may be offended or harmed.
Film. In an industry that depends on emotional resonance, AIGC can draw only on existing data and algorithms, producing content that feels stiff and cold next to human creations grounded in rich emotion and lived experience; its anthropomorphic emotional expression still needs improvement.
Publishing. In the literary field, there are strict requirements regarding ethical and moral issues related to content. Currently, AIGC cannot ensure the compliance of generated content; the training data used to develop AIGC models may include discriminatory or violent content, and the models may thus generate harmful content such as racial discrimination and gender bias.
Overall, AIGC, empowered by multimodal large models and deep learning algorithms, now serves a wide range of industries, including finance, gaming, and publishing, but the accompanying issues of ethics, copyright, and data security cannot be overlooked.
The Cross-Modal Generation Capabilities Demonstrated by Software like Sora Indicate an Acceleration Toward the AGI Era
Artificial General Intelligence (AGI) refers to an AI system that can think, learn, self-correct, and perform intellectual tasks like a human across any professional field, which requires common sense and human-compatible behavioral norms and values. Its defining feature is the ability to respond to the rules of the real world, such as physical states, natural laws, and chemical changes, making it one of the highest goals of AI development. The release of applications like Sora and ChatGPT marks breakthrough progress in AI: stronger spatiotemporal modeling and higher computational complexity now allow simulation of a physical world that obeys physical laws. This lays the technical foundation for understanding and simulating the real world, further accelerates multimodal AI, and thereby hastens the arrival of AGI.
Technological Innovation and Integration Will Continuously Enhance AIGC’s Generation and Application Capabilities
In the future, on one hand, as deep learning, computer vision, and other technologies mature and new techniques such as knowledge distillation keep emerging, AIGC's generation quality, speed, and efficiency will further improve. On the other hand, multimodal large models will integrate with a richer set of technologies, among them natural language processing, virtual reality, augmented reality, and digital twins, opening up more application scenarios such as autonomous driving, drug development, and security, and giving users more comprehensive solutions for their growing needs. In autonomous driving, for instance, AIGC can create synthetic data to compensate for scarce real data, accelerate the construction of simulation scenarios, and improve simulation testing efficiency.
