Deep Learning Advancements in Multimodal AI Models

A full year has passed since ChatGPT, GPT-4, and related breakthroughs set off a new wave of transformation in artificial intelligence. Over that year, companies in China and abroad have poured into the large-model “arena,” accelerating the iteration of large model technology.

The unprecedented ability of large models to handle general tasks has opened the door to many more application scenarios. Industry after industry has begun exploring how to integrate large models into its business, and demand for them is stronger than ever.

Beneath the excitement, however, more and more people inside and outside the industry are beginning to ask two questions:

What can large models actually do? And when will they be monetized?

In the unique soil of technological innovation in China, these are questions that cannot be avoided.

Image: Generated by DALL·E 3

After years of development, China’s AI industry has made genuine breakthroughs in fields such as biometrics, industrial robotics, and autonomous driving, but truly disruptive, widely deployed products and applications have yet to appear.

Will the challenges left unresolved in the era of small models now be solved easily by the arrival of large models?

As one of the earliest artificial intelligence startups in China, Megvii has experienced the ups and downs of AI technology innovation and commercialization exploration. How does Megvii view and plan its approach amidst the new wave of AI sparked by large models?


Focusing on Multimodal Large Models

“From the perspective of technological evolution, whether it is the earlier AlphaGo or today’s large models, they are all essentially continuations of deep learning. The current wave of AI development is centered on one core capability: deep learning,” said Yin Qi, co-founder and CEO of Megvii. From CNNs and ResNet to the Transformer, deep learning has remained the central technological axis.

The explosion of large models grew out of a decade of accumulated research in core deep learning fields such as NLP, vision, and speech, in both academia and industry: a transformation of quantitative change into qualitative change.

The transition from small models to large models changes model scale and performance, but the core principle of deep learning stays the same. In Yin Qi’s view, amid the entrepreneurial wave triggered by deep learning, many companies call themselves AI companies, yet most are still doing industry applications of AI.

Since its founding, Megvii has focused on computer vision and committed itself to foundational research in deep learning. “Megvii has accumulated core capabilities in deep learning, and that is the foundation of our continuous innovation,” Yin Qi said.

Now, with the leap in large model technology, the field of visual models is showing two trends: “largeness” and “unification.” “Largeness” means big data, big compute, and large parameter counts; “unification” shows up in the integration of NLP, vision, and speech, and in the fusion of perception, understanding, and generation capabilities.

As an AI company whose strength lies in visual technology, Megvii is combining visual models with language models to develop multimodal large models that can comprehensively understand and analyze multimodal information.
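
Megvii has not disclosed its model’s architecture. As a rough sketch of the general pattern such vision-language systems tend to follow (a vision encoder whose features are projected into a language model’s embedding space), consider the illustrative PyTorch code below; every class name, dimension, and component is a generic assumption, not Megvii’s design.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Minimal sketch of a generic vision-language architecture:
    a vision encoder yields patch features, a projection maps them
    into the language model's embedding space, and the language
    model processes the concatenated sequence. All dimensions and
    components here are illustrative, not Megvii's actual design."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT backbone
        self.language_model = language_model   # e.g. a decoder-only LLM
        # Linear bridge from visual feature space into text token space
        self.projector = nn.Linear(vision_dim, text_dim)

    def forward(self, images: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        visual_feats = self.vision_encoder(images)    # (B, P, vision_dim)
        visual_tokens = self.projector(visual_feats)  # (B, P, text_dim)
        # Joint sequence: image tokens first, then text tokens
        sequence = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(sequence)

# Smoke test with stand-in modules (a real system would plug in
# pretrained ViT and LLM weights here).
vision = nn.Identity()              # pretend patch features are precomputed
llm = nn.Linear(2048, 32000)        # pretend LM head over a 32k vocabulary
model = VisionLanguageModel(vision, llm)
feats = torch.randn(2, 16, 1024)    # batch of 2, 16 image patches
text = torch.randn(2, 8, 2048)      # 8 text token embeddings
print(model(feats, text).shape)     # torch.Size([2, 24, 32000])
```

The key design point this pattern illustrates is that the language model attends jointly over image and text tokens, which is what enables the fused perception and understanding described above.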

Yin Qi stated, “Megvii’s goal has never changed since day one, which is to move towards AGI. Our path is also quite clear: we aim for a combination of hardware and software. Multimodal large models are the most important link at present, and we will focus our research in this area.”

Image: Generated by DALL·E 3

Megvii’s research team has been working on large models for quite some time, accumulating foundational research results and talent in visual technology, underlying frameworks, and data loops, and laying the groundwork for continuous iteration of its multimodal large models.

The multimodal large model Megvii proposes is a product of the deep integration of vision and NLP in this move toward “largeness” and “unification”: a multimodal understanding model that combines language and vision.

Based on long-accumulated industry experience, Megvii positions its multimodal large model in the range of hundreds of millions to billions of parameters. Models in this range retain strong general-purpose ability while offering the best balance of deployment cost, efficiency, and hardware compatibility for industry use.
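
A back-of-the-envelope calculation suggests why this parameter range eases deployment. The figures below are generic assumptions, not Megvii’s published numbers: raw weight memory is roughly parameters × bytes per parameter, so a model of a few billion parameters fits on a single commodity GPU.

```python
# Rough weight-memory estimate for deployment planning. Parameter
# counts and precisions are illustrative assumptions, not Megvii's
# published figures.

def weight_footprint_gb(num_params: float, bytes_per_param: int) -> float:
    """Raw weight size in GB (ignores activations, KV cache, overhead)."""
    return num_params * bytes_per_param / 1024**3

for params in (5e8, 1e9, 3e9):                    # 0.5B, 1B, 3B parameters
    for precision, nbytes in (("fp16", 2), ("int8", 1)):
        size = weight_footprint_gb(params, nbytes)
        print(f"{params / 1e9:.1f}B params @ {precision}: {size:.1f} GB")
```

By contrast, a 100-billion-parameter model at fp16 would need roughly 186 GB for its weights alone, already a multi-GPU proposition; that is the cost gap the smaller range avoids.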

With the advent of OpenAI’s Sora model, multimodal large models have recently ignited interest across industries. Although video generation is Sora’s most eye-catching feature, even more striking is its powerful ability to understand images, videos, and more.

“Sora is a crucial intermediate point on OpenAI’s path toward AGI. Our focus is on understanding its underlying technical framework rather than on the Sora application itself,” said Yin Qi. In his view, in the field of images and video, “generation” and “understanding” should be considered separately.

Viewed as a standalone application, Sora demonstrates generation capabilities, and its core application scenarios lean toward the consumer end. Megvii, by contrast, will focus on perception and understanding, with its multimodal large model serving as an engine for comprehensive perception, understanding, and reasoning across modalities such as images, videos, and text.

Megvii will concentrate on understanding capabilities and, on that basis, build industry applications aimed at B2B businesses, in the belief that multimodal large models will unlock more industry application scenarios.

Integrating Multimodal Large Models into Industries

Despite high expectations for large models inside and outside the industry, one consensus is clear: today’s foundational large models, on their own, cannot serve industries with such diverse demands.

In the process of transferring large model capabilities to various industries, complex scenario requirements will inevitably arise. When evaluating large models, enterprise users will consider factors such as application scenarios, data security, upgrade maintenance, and cost-effectiveness.

For large model companies, this means a significant amount of “last mile” work: matching technology to scenarios, end-to-end deployment, hardware-software compatibility, and security.

In Yin Qi’s view, the large model era will make this “last mile” far more efficient and far cheaper, but the problem of actually crossing it in each industry remains. Megvii’s chosen path, he said, is a firm commitment to B-end commercialization.

Image: Generated by DALL·E 3

For B-end businesses, a foundational large model alone is hard to put into practice, and the ROI is hard to turn positive. Megvii will therefore focus on pushing multimodal large models into industry, targeting industry-specific large models.

Applying large models to a specific industry requires end-to-end solutions, and those are not easy to build: the provider must understand models, systems, data, and the industry itself.

First, from a technical perspective, it is not enough to simply fine-tune open-source models; end-to-end large model capabilities are essential.

Second, from an industry perspective, the work must fundamentally be customer-centric, co-creating industry large models with customers. Accumulated industry know-how remains a scarce capability in the era of large models.

Over the years, Megvii has served numerous leading clients in various industries, accumulating specialized knowledge and experience in key sectors. Currently, Megvii is collaborating with clients in finance, telecommunications, mobile phones, and smart vehicles to promote the implementation of large models in these industries.

“Currently, the financial industry is progressing relatively quickly,” explained Zhao Liwei, Senior Vice President of Megvii and head of the Cloud Services Division. “Since the middle of last year, some of the financial clients we serve have begun exploring large models. They have solid foundational capabilities and are sensitive to new technology, so their demand for innovation is extremely urgent.”

Zhao Liwei said that large models hold great potential in data-intensive and knowledge-intensive industries like finance. In his view, the short-term “efficiency gains” from large models will outweigh their “cost reductions,” which makes them easier for clients to accept.

However, achieving efficiency gains in the industry with large models is an exceptionally complex undertaking. Many industry clients have already established standard business processes based on traditional IT capabilities such as big data, ERP, and CRM.

If large models merely replace existing IT systems, the benefits will be very limited. Only by fully understanding existing business needs and logic, and by reshaping previous business processes, organizational relationships, and even decision-making systems, can the efficiency-enhancing effects of large models be realized. Implementing large models is therefore not merely a technical issue but a complex business challenge, and this round of implementation must be achieved through co-creation with clients.

Within the financial industry, Megvii is currently working with banks, insurers, and other clients to explore large models in business scenarios such as financial risk control, intelligent customer service, document and code writing, image-text analysis, and marketing.
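
None of these client systems are public. Purely as a hypothetical sketch of what an image-text analysis step in a banking workflow might look like, the code below assumes a generic multimodal endpoint; the `analyze` method, the field names, and the prompt are all invented for illustration.

```python
import json

# Hypothetical sketch of an image-text analysis step in a bank's
# document workflow. The `analyze` interface, field names, and
# prompt are invented for illustration; they are not a real
# Megvii API.

REQUIRED_FIELDS = {"account_name", "amount", "date"}

def extract_receipt_fields(multimodal_model, image_bytes: bytes) -> dict:
    """Ask a vision-language model to read a scanned receipt and
    return structured fields, validating them before any downstream
    risk-control or bookkeeping logic consumes them."""
    prompt = (
        "Read this scanned receipt and return JSON with the keys "
        "account_name, amount, and date. Use null for unreadable fields."
    )
    raw = multimodal_model.analyze(image=image_bytes, prompt=prompt)
    fields = json.loads(raw)
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"model output missing fields: {missing}")
    return fields
```

The validation step is the point of the wrapper: in risk-control settings, model output should be checked before any downstream system acts on it.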

Zhao Liwei stated, “This year will definitely be a process from 0 to 1, and the most important thing is to start with key clients, identify suitable business scenarios for large models, and achieve a business closed loop. This is our top priority.”

Source: Quantum Bit

Reviewed by: Zhao Lixin
