Source: ZHUANZHI
This article is about 1,000 words and takes roughly 5 minutes to read.
Research on Large Multimodal Models (LMMs) has become a focal point of deep learning. LMMs process data from multiple modalities and exploit their complementary information to improve predictive performance across a wide range of tasks.
The learning process of LMMs consists of two key stages: a computation-intensive pre-training phase aimed at obtaining general representations from large-scale noisy data, and a subsequent fine-tuning phase focused on adapting the pre-trained model to specific tasks.
Traditionally, pre-training foundational LMMs has been seen as the preserve of research labs with abundant computational resources. In this paper, we propose a novel method for efficiently pre-training foundational Vision-Language Models (VLMs): we reduce data requirements by building on readily available frozen Large Language Models (LLMs) through a specialized pre-training procedure, and we introduce an efficient VLM pre-training method that reduces redundancy in the modality projection. With our approach, the amount of data required for VLM pre-training drops from 129 million instances to 4 million, training cost falls to roughly one tenth, and performance shows no significant decline.
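To make the frozen-LLM idea concrete, the sketch below shows one common way to wire a lightweight, trainable modality projection between a frozen vision encoder and a frozen LLM, so that only the projection receives gradients. The module names, dimensions, and query-token design are illustrative assumptions, not the thesis's exact architecture.

```python
# Minimal sketch (assumptions): a small trainable projection maps frozen vision
# features into the embedding space of a frozen LLM; only the projection is trained.
import torch
import torch.nn as nn

class ModalityProjection(nn.Module):
    """Trainable bridge between a frozen vision encoder and a frozen LLM (illustrative)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, num_query_tokens: int = 32):
        super().__init__()
        # Learnable query tokens cross-attend to the visual patch features.
        self.queries = nn.Parameter(torch.randn(num_query_tokens, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. from a frozen ViT.
        b = vision_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(q, vision_feats, vision_feats)   # cross-attention over patches
        return self.proj(fused)                               # (batch, num_query_tokens, llm_dim)

# Only the projection's parameters are optimized; the LLM and vision encoder stay frozen.
projection = ModalityProjection()
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

vision_feats = torch.randn(2, 256, 1024)      # stand-in for frozen vision-encoder output
visual_tokens = projection(vision_feats)      # tokens fed to the frozen LLM as a prefix
print(visual_tokens.shape)                    # torch.Size([2, 32, 4096])
```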
Furthermore, we propose a simple yet powerful temporal fusion mechanism that adapts the pre-trained image-language model to downstream video tasks. Our video description model achieves performance competitive with the state of the art without pre-training on large video-text datasets. Beyond multimodal research in computer vision and natural language processing, our work also extends to bioinformatics, exploring multimodal learning with protein-RNA models. Our findings indicate that pre-trained protein models contain biological structure information that can be shared with RNA. Given the limited number of experimentally determined RNA structures, this discovery opens new research directions for transfer learning between proteins and RNA.
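As an illustration of temporal fusion, the sketch below pools per-frame features from a pre-trained image-language encoder with a single attention-based summary token. The fusion operator, dimensions, and frame count are assumptions for illustration, not the thesis's implementation.

```python
# Minimal sketch (assumptions): per-frame features from a pre-trained image encoder
# are fused along the time axis into a single video-level feature.
import torch
import torch.nn as nn

class TemporalFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # summary token over time
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), one pooled feature per frame.
        b = frame_feats.size(0)
        cls = self.cls.expand(b, -1, -1)
        fused, _ = self.attn(cls, frame_feats, frame_feats)  # attend over the frames
        return self.norm(fused.squeeze(1))                   # (batch, dim) video feature

fusion = TemporalFusion()
frames = torch.randn(2, 16, 768)                             # 16 frames per clip (placeholder)
video_feature = fusion(frames)
print(video_feature.shape)                                   # torch.Size([2, 768])
```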
Finally, we employ physics-enhanced simulations to train T-cell-peptide models and show that integrating such simulations substantially improves training effectiveness, especially when labeled data are scarce. This highlights the potential of combining simulation with machine learning and offers valuable strategies for training LMMs in the biological domain.
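One simple way to read "physics-enhanced simulation" in training terms is as a two-stage schedule: fit the model on abundant simulation-generated examples first, then fine-tune on the scarce experimentally labeled set. The sketch below uses a toy classifier and placeholder tensors; it is not the thesis's actual T-cell-peptide pipeline.

```python
# Minimal sketch (assumptions): pre-train on abundant simulated examples, then
# fine-tune on a small labeled set. Data and model are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

def train(loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: large synthetic set produced by physics-based simulation (placeholder tensors).
simulated = DataLoader(TensorDataset(torch.randn(5000, 128), torch.randint(0, 2, (5000,))), batch_size=64)
train(simulated, epochs=3, lr=1e-3)

# Stage 2: scarce experimentally labeled data; a lower learning rate preserves the simulation prior.
labeled = DataLoader(TensorDataset(torch.randn(200, 128), torch.randint(0, 2, (200,))), batch_size=16)
train(labeled, epochs=10, lr=1e-4)
```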
Over the past decade, deep learning research has advanced rapidly and achieved remarkable results across many domains, including image classification, image segmentation, action recognition, and language modeling. Although these models perform well on specific tasks when trained on large domain-specific datasets, contemporary research has shifted towards models that can interpret information across multiple modalities, such as vision, language, and audio.

Moreover, given the potential to improve predictive capability, recent work advocates training models that seamlessly integrate information from different modalities. For instance, for an online meeting, a model given the video can produce a better summary by considering both the visual content (showing human activities) and the auditory cues (capturing conversational dynamics); integrating complementary modalities supports more accurate decisions.

Multimodal learning also aims to emulate the human ability to acquire knowledge from multiple sources. By fostering capabilities akin to human perception and cognition, these models seek to overcome the limitations of single modalities and demonstrate a holistic understanding of how information is perceived and expressed.

Rapid progress in computer vision and natural language processing has driven significant advances in multimodal learning, particularly in vision-language models. The current mainstream paradigm is typically divided into two stages:
- Pre-training phase: The model is first pre-trained on large-scale web datasets, acquiring broad knowledge covering both the visual and language domains. These pre-trained models, often referred to as “foundation models,” capture complex patterns and representations in multimodal data and serve as the basis for downstream adaptation.
- Fine-tuning phase: After pre-training, the foundation model is fine-tuned to meet the requirements of a specific task. Notably, in some cases the model can produce predictions through in-context learning without any fine-tuning. This phase plays a critical role in adjusting the model’s capabilities to task-specific demands.
In the following sections, we will delve into these two training stages. This paper introduces a novel modality projection module and proposes a new learning paradigm aimed at enhancing the efficiency of vision-language model pre-training. Furthermore, we will elaborate on the new fine-tuning module, specifically addressing the challenges of adapting the pre-trained foundation model to specific tasks in scenarios with limited training samples. Through these contributions, this research aims to advance the understanding and efficiency of multimodal learning in vision-language models.
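As a rough picture of the fine-tuning setting with limited samples, the sketch below freezes a pre-trained backbone and trains only a small task head. The backbone, head, and data are placeholders and do not correspond to the thesis's specific fine-tuning module.

```python
# Minimal sketch (assumptions): adapt a frozen pre-trained backbone to a new task
# by training only a small head on a handful of labeled examples.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())   # stands in for a pre-trained encoder
for p in backbone.parameters():
    p.requires_grad = False                                 # keep the foundation model frozen

head = nn.Linear(512, 10)                                   # small task-specific head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A few labeled examples (placeholder tensors) to mimic the low-data regime.
x, y = torch.randn(32, 512), torch.randint(0, 10, (32,))
for _ in range(20):
    optimizer.zero_grad()
    with torch.no_grad():
        feats = backbone(x)                                 # frozen feature extraction
    loss = loss_fn(head(feats), y)
    loss.backward()
    optimizer.step()
```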
About Us
Data Pie THU is a data science public account backed by the Big Data Research Center at Tsinghua University. It shares the latest research and innovations in data science and big data technology, continuously disseminates data science knowledge, and strives to build a platform that brings together data talent and to create China’s strongest big data community.
Sina Weibo: @Data Pie THU
WeChat Video Account: Data Pie THU
Today’s Headlines: Data Pie THU