Yunzhong, reporting from Aofeisi
QbitAI | WeChat Official Account QbitAI
It has been a full year since ChatGPT and GPT-4 ignited a new wave of the artificial intelligence revolution. Over that year, numerous companies in China and abroad have poured into the “beast arena” of large models, accelerating the iteration and advancement of large model technology.
The unprecedented general-purpose task-handling capability of large models has opened up the possibility of unlocking more application scenarios. Industries of all kinds have begun exploring how to integrate large models with their own businesses, and demand for large models is greater than ever.
However, behind this clamor, more and more people inside and outside the industry are beginning to calmly consider the following questions:
What can large models do? When will large models become profitable?
Given China’s distinctive environment for technological innovation, these are unavoidable questions.
△Image: Generated by DALL·E 3
After years of development, China’s AI industry has made certain breakthroughs in fields such as biometrics, industrial robotics, and autonomous driving, but truly disruptive products and applications deployed at scale have yet to emerge.
Will the problems left unresolved in the small model era be readily solved now that large models have arrived?
As one of the earliest AI startups in China, Megvii has experienced the ups and downs of AI technology innovation and commercialization. How does Megvii view the new wave of AI triggered by large models, and how is it positioning itself?

Focusing on Multi-Modal Large Models
“From the perspective of technological evolution, whether it is the earlier AlphaGo or today’s large models, they are essentially continuations of deep learning. The current wave of AI technology development has only one core technological capability, which is deep learning,” said Yin Qi, co-founder and CEO of Megvii. From CNN and ResNet to Transformer, deep learning has been the central technological thread.
The explosion of large models stems from the accumulation of research results in core areas of deep learning such as NLP, vision, and speech over the past decade in both academia and industry; this is a process from quantitative change to qualitative change.
The transition from small models to large models involves changes in model scale and performance, while the core line of deep learning remains unchanged. In Yin Qi’s view, amidst the entrepreneurial wave triggered by deep learning, although many companies claim to be AI companies, most are still engaged in AI industry applications.
Since its inception, Megvii has consistently focused on computer vision, adhering to foundational research in deep learning. “Megvii has accumulated core capabilities in deep learning, which is the foundation for our continuous innovation.”
Now, with the leap of large model technology, the field of visual models is showing trends of “large” and “unified.” “Large” signifies big data, big computing power, and a large number of parameters, while “unified” reflects the integration of modalities such as NLP, vision, and speech, as well as the merging of perception, understanding, and generation capabilities.
As an AI company excelling in visual technology, Megvii is combining visual models with language models to vigorously develop multi-modal large models, achieving comprehensive understanding and analysis of multi-modal information.
Yin Qi stated, “Megvii’s goal has not changed since day one: to move towards AGI. Our path is also quite clear: combining software and hardware. Multi-modal large models are the most important link at present, and we will focus our research in this area.”
△Image: Generated by DALL·E 3
Megvii’s research team has been engaged in large model research for a long time, accumulating a wealth of foundational research results and research talent in visual technology, underlying frameworks, and data loops, laying the groundwork for the continuous iteration of multi-modal large models.
The multi-modal large model proposed by Megvii is a product of the deep integration of vision with NLP in the process of visual technology moving towards “large” and “unified”; it is a multi-modal model for language and visual understanding.
Based on long-term accumulated industry experience, Megvii positions its multi-modal large model in the range of several billion to several hundred billion parameters. Large models within this range possess strong general attributes, while also offering better solutions in terms of industry deployment costs, efficiency, and hardware compatibility.
With the emergence of OpenAI’s Sora model, multi-modal large models have recently ignited interest across various industries. Although video generation is the most intuitive highlight of Sora, what is even more astonishing is its powerful understanding capabilities for images, videos, and more.
“Sora represents an important intermediate milestone in OpenAI’s journey towards AGI. Our focus should be on understanding its underlying technological framework, rather than the Sora application itself,” said Yin Qi, who emphasized that in the field of images and videos, it is essential to separate “generation” from “understanding.”
If Sora is viewed as an independent application, it embodies generative capabilities, with core application scenarios leaning more towards the consumer side. In contrast, Megvii will focus on perceptual understanding capabilities, with its multi-modal large model serving as an engine for comprehensive perception, understanding, and reasoning across different modalities such as images, videos, and text.
Megvii will concentrate more on understanding capabilities and, on that basis, build industry applications for B2B customers. The company believes that multi-modal large models will unlock more industry application scenarios.
Integrating Multi-Modal Large Models into Industries
Despite high expectations for large models from both inside and outside the industry, there is broad consensus that current foundational large models are not widely applicable to industries with diverse needs.
In the process of transferring large model capabilities to various industries, complex scenario requirements inevitably arise. When evaluating large models, enterprise users weigh factors such as application scenarios, data security, upgrades and maintenance, and cost-effectiveness together.
For large model companies, this means there is a significant amount of “last mile” work to be done, such as matching technology to scenarios, end-to-end deployment, hardware-software adaptation, and security.
In Yin Qi’s view, with the arrival of the large model era, the efficiency of the “last mile” will improve significantly, and costs will decrease substantially. However, the problem of completing the “last mile” of industry implementation still exists. He stated that Megvii’s chosen path is to firmly pursue B2B commercialization.
△Image: Generated by DALL·E 3
For B2B customers, foundational large models alone are difficult to put into practice, and it is hard to achieve a positive ROI. Therefore, Megvii will focus on promoting the application of multi-modal large models in industries and on developing industry-specific large models in depth.
Applying large models to specific industries requires end-to-end solutions, which is not low-hanging fruit; it necessitates a comprehensive understanding of models, systems, data, and the industry.
First, from a technical perspective, it is not sufficient to simply tweak open-source models; end-to-end large model capabilities must be developed.
Second, from an industry perspective, the work must be fundamentally customer-centric, with industry large models co-created with clients. The accumulation of industry know-how remains a scarce capability in the large model era.
Over the years, Megvii has served many leading clients across various industries, accumulating specialized knowledge and experience in key sectors. Currently, Megvii is collaborating with clients in the finance, telecommunications, mobile phone, and smart automotive sectors to promote the implementation of large models in these industries.
“The financial industry is currently progressing relatively quickly,” explained Zhao Liwei, Senior Vice President of Megvii and head of the Cloud Services Division. “Since the middle of last year, some of our financial clients have begun exploring large models. Because they already have a certain reserve of foundational capabilities and are sensitive to new technologies, their demand for innovation is exceptionally urgent.”
Zhao Liwei stated that large models have great potential in data-intensive and knowledge-intensive industries like finance. In his view, from a practical effect perspective, the “efficiency increase” effect of large models in the short term will outweigh the “cost reduction” effect, making them more acceptable to clients.
However, using large models to raise industry efficiency is an exceptionally complex undertaking. Many industry clients have already established standard business processes based on traditional IT capabilities such as big data, ERP, and CRM.
If large models are simply substituted for existing IT systems, the gains will be very limited. Only by thoroughly understanding existing business needs and logic, and by overhauling traditional business processes, organizational relationships, and even decision-making systems, can the efficiency advantages of large models be realized. The implementation of large models is not merely a technical issue but a complex business problem. The current round of large model implementation can only be realized through co-creation with clients.
In the financial industry, Megvii is currently collaborating with banks, insurance companies, and other clients to explore large models in business scenarios such as financial risk control, intelligent customer service, document and code writing, image-text analysis, and marketing.
Zhao Liwei stated, “This year will definitely be a process of going from 0 to 1. The most important thing is to start with key clients, find suitable business scenarios for large models, and achieve a closed business loop. This is our top priority.”
— End —