Machine Heart Report
Machine Heart Editorial Team
How far have multi-modal large language models come? Here are 26 of the current best multi-modal large language models.
The focus of the AI field is shifting from large language models (LLMs) to multi-modal capabilities. As a result, multi-modal large language models (MM-LLMs), which equip LLMs with multi-modal capabilities, have become a highly regarded research topic.
Recently, a research team from Tencent AI Lab, Kyoto University, and Mohammed bin Zayed University of Artificial Intelligence published a review report that comprehensively summarizes the recent progress of MM-LLMs. The paper not only outlines the model architectures and training processes of MM-LLMs but also reviews 26 of the current best MM-LLMs. If you are considering researching or using MM-LLMs, you might want to start with this report to find the model that best suits your needs.
- Paper Title: MM-LLMs: Recent Advances in MultiModal Large Language Models
- Paper Link: https://arxiv.org/abs/2401.13601
Report Overview
In recent years, research on multi-modal (MM) pre-training has progressed rapidly, pushing the performance of many downstream tasks to new heights. However, as models and datasets continue to grow in scale, traditional multi-modal models incur excessive computational costs, especially when trained from scratch. Given that multi-modal research sits at the intersection of multiple modalities, a logical approach is to build on existing pre-trained unimodal foundation models, particularly powerful large language models (LLMs).
This strategy aims to reduce the computational cost of multi-modal pre-training and improve its efficiency, thus giving rise to a new field: MM-LLMs, or multi-modal large language models.
MM-LLMs utilize LLMs to provide cognitive functions, enabling them to handle various multi-modal tasks. LLMs contribute several necessary capabilities, such as robust language generation, zero-shot transfer ability, and in-context learning (ICL). Meanwhile, foundation models for other modalities provide high-quality representations. Given that the foundation models for different modalities are pre-trained separately, the core challenge faced by MM-LLMs is how to effectively connect the LLM with models of other modalities for collaborative reasoning.
The main focus in this field is on optimizing the alignment between modalities and aligning the models with human intentions. The primary workflow used in this area is multi-modal pre-training (MM PT) + multi-modal instruction tuning (MM IT).
GPT-4 (Vision) and Gemini, released in 2023, have shown excellent multi-modal understanding and generation capabilities, thereby igniting enthusiasm for research in MM-LLMs.
Initially, the research community primarily focused on multi-modal content understanding and text generation, with models including (Open) Flamingo, BLIP-2, Kosmos-1, LLaVA/LLaVA-1.5, MiniGPT-4, MultiModal-GPT, VideoChat, Video-LLaMA, IDEFICS, Fuyu-8B, and Qwen-Audio.
To create MM-LLMs that can support both multi-modal input and output, some research has explored specific modality generation, such as Kosmos-2 and MiniGPT-5 for image generation, while SpeechGPT focuses on speech generation.
Recently, attention has shifted to mimicking human-like arbitrary modality-to-arbitrary modality conversion, which may pave the way toward artificial general intelligence (AGI).
Some research aims to combine LLMs with external tools to achieve near-arbitrary multi-modal understanding and generation; such studies include Visual-ChatGPT, ViperGPT, MM-REACT, HuggingGPT, and AudioGPT.
In contrast, to reduce the errors propagated through cascaded systems, some research teams aim to develop end-to-end arbitrary-modality MM-LLMs; these studies include NExT-GPT and CoDi-2.
Figure 1 illustrates the timeline of MM-LLMs.
To promote the research and development of MM-LLMs, this team from Tencent AI Lab, Kyoto University, and Mohammed bin Zayed University of Artificial Intelligence has compiled this review report. Machine Heart has summarized the main parts of this report, especially the introduction to the 26 current best (SOTA) MM-LLMs.
Model Architecture
This section details the five major components of a typical model architecture, along with the implementation choices for each component, as shown in Figure 2.
MM-LLMs that focus only on multi-modal understanding include just the first three components.
During training, the modality encoder, LLM backbone, and modality generator are typically kept frozen; the main components being optimized are the input and output projectors. Because the projectors are lightweight, the proportion of trainable parameters in an MM-LLM is very small relative to the total parameter count (usually around 2%), which itself depends mainly on the scale of the core LLM. As a result, MM-LLMs can be trained for a variety of multi-modal tasks with high efficiency. (A minimal sketch following the component descriptions below illustrates this setup.)
Modality Encoder (ME): Encodes inputs from different modalities to obtain corresponding features.
Input Projector: Aligns the features of other modalities with the text feature space.
LLM Backbone: MM-LLMs use an LLM as the core agent, inheriting important LLM properties such as zero-shot generalization, few-shot in-context learning, chain-of-thought (CoT) reasoning, and instruction following. The LLM backbone processes the representations of the various modalities, performing semantic understanding, reasoning, and decision-making over the inputs. Its outputs include (1) direct text outputs and (2) signal tokens for other modalities (if any). These signal tokens act as instructions to the generators: whether to produce multi-modal content and, if so, what content to produce.
Commonly used LLMs in MM-LLMs include Flan-T5, ChatGLM, UL2, Qwen, Chinchilla, OPT, PaLM, LLaMA, LLaMA-2, and Vicuna.
Output Projector: Maps the signal token representations from the LLM backbone into features understandable by subsequent modality generators.
Modality Generator: Generates outputs corresponding to different modalities. Current research typically uses existing latent diffusion models (LDM), such as using Stable Diffusion to synthesize images, Zeroscope to synthesize videos, and AudioLDM-2 to synthesize audio.
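To make the division of labor concrete, here is a minimal PyTorch sketch of how these five components might be wired together, with the encoder, LLM backbone, and generator frozen and only the two projectors trainable. Every name, shape, and interface here (the MMLLM class, an encoder that returns a feature sequence, an LLM callable on embeddings) is an illustrative assumption, not the API of any specific model in the survey.

```python
import torch
import torch.nn as nn

class MMLLM(nn.Module):
    """Toy sketch of the five-component MM-LLM layout described above (illustrative only)."""

    def __init__(self, image_encoder, llm, image_generator, d_image, d_llm, d_gen):
        super().__init__()
        # Frozen pre-trained components: modality encoder, LLM backbone, modality generator.
        self.image_encoder = image_encoder.requires_grad_(False)
        self.llm = llm.requires_grad_(False)
        self.image_generator = image_generator.requires_grad_(False)
        # Trainable components: the lightweight input and output projectors (~2% of parameters).
        self.input_proj = nn.Linear(d_image, d_llm)   # image features -> LLM token space
        self.output_proj = nn.Linear(d_llm, d_gen)    # signal-token states -> generator condition

    def forward(self, image, text_embeds):
        # 1) Modality encoder: turn the raw image into a feature sequence.
        image_feats = self.image_encoder(image)
        # 2) Input projector: align image features with the text embedding space.
        image_tokens = self.input_proj(image_feats)
        # 3) LLM backbone: reason over the concatenated multi-modal token sequence.
        hidden_states = self.llm(torch.cat([image_tokens, text_embeds], dim=1))
        # 4) Output projector: map signal-token states into generator-readable features.
        condition = self.output_proj(hidden_states[:, -1])
        # 5) Modality generator (e.g. a latent diffusion model) consumes the condition.
        return self.image_generator(condition)
```

An understanding-only MM-LLM, as noted above, would stop after step 3 and decode text directly, dropping the output projector and the generator.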
Training Process
The training process of MM-LLMs can be divided into two main stages: MM PT (multi-modal pre-training) and MM IT (multi-modal instruction tuning).
MM PT
In the pre-training stage (typically on X-Text datasets, i.e., image-text, video-text, and audio-text pairs), the input and output projectors are trained to align the different modalities by optimizing predefined objectives. (Parameter-efficient fine-tuning (PEFT) techniques are sometimes also applied to the LLM backbone.)
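As a concrete illustration of this stage, the following sketch optimizes only a linear input projector while a stand-in encoder and frozen LLM embeddings stay fixed. The toy cosine-alignment loss is a placeholder for the real pre-training objectives (for example, next-token prediction over captions), and all modules and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Stand-ins for frozen pre-trained components (assumptions, not real checkpoints).
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024)).requires_grad_(False)
llm_embeddings = nn.Embedding(32000, 4096).requires_grad_(False)

# The only trainable piece in this sketch: a linear input projector.
input_proj = nn.Linear(1024, 4096)
optimizer = torch.optim.AdamW(input_proj.parameters(), lr=1e-4)

# One hypothetical MM PT step on a batch of image-caption pairs: pull projected
# image features toward the mean embedding of the paired caption tokens.
images = torch.randn(8, 3, 224, 224)
caption_ids = torch.randint(0, 32000, (8, 16))

image_tokens = input_proj(image_encoder(images))         # (8, 4096)
text_feats = llm_embeddings(caption_ids).mean(dim=1)     # (8, 4096)
loss = 1 - nn.functional.cosine_similarity(image_tokens, text_feats).mean()

optimizer.zero_grad()
loss.backward()      # gradients reach only the projector; everything else is frozen
optimizer.step()
```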
MM IT
MM IT fine-tunes the pre-trained MM-LLM on datasets formatted as sets of instructions. Through this fine-tuning, the MM-LLM can generalize to unseen tasks by following new instructions, thereby improving zero-shot performance.
MM IT includes supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), aiming to align with human intentions or preferences and enhance the interaction capabilities of MM-LLMs.
SFT can convert part of the data from the pre-training stage into an instruction-aware format, as in the sketch below.
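For instance, an image-caption pair from pre-training can be rewritten into an instruction-following sample roughly like this; the field names and prompt template are hypothetical and only illustrate the general idea, not the schema of any particular dataset.

```python
# An image-caption pair as it might appear in MM PT data (illustrative).
pt_sample = {"image": "example.jpg", "caption": "A cat sleeping on a windowsill."}

# The same pair rewritten into an instruction-aware SFT sample (hypothetical schema).
sft_sample = {
    "image": pt_sample["image"],
    "conversations": [
        {"role": "user", "content": "<image>\nDescribe this picture briefly."},
        {"role": "assistant", "content": pt_sample["caption"]},
    ],
}
```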
After SFT, RLHF further fine-tunes the model using feedback on the MM-LLM's responses (such as natural language feedback (NLF) labeled by humans or AI). Because NLF is non-differentiable, this process uses a reinforcement learning algorithm to incorporate it effectively, training the model to generate appropriate responses conditioned on that feedback.
Many datasets used in existing MM-LLMs during the MM PT and MM IT stages are subsets of those in Tables 3 and 4.
Current Best MM-LLMs
The team compared the architectures and training dataset scales of 26 current best (SOTA) MM-LLMs, as shown in Table 1. They also briefly summarized the core contributions and development trends of each model.
(1) Flamingo: A series of visual language (VL) models designed to handle interleaved visual data and text, capable of outputting free-form text.
(2) BLIP-2: Proposes a more resource-efficient framework, employing a lightweight Q-Former to connect different modalities while using a frozen LLM. Because it builds on an LLM, BLIP-2 can be guided with natural language prompts to perform zero-shot image-to-text generation.
(3) LLaVA: The first to transfer instruction tuning techniques to the multi-modal domain. To address data sparsity issues, LLaVA created a completely new open-source multi-modal instruction-following dataset and a multi-modal instruction-following benchmark, LLaVA-Bench, using ChatGPT/GPT-4.
(4) MiniGPT-4: Proposes a streamlined method where only a linear layer is trained to align the pre-trained visual encoder with the LLM. This efficient approach demonstrates capabilities comparable to GPT-4.
(5) mPLUG-Owl: Introduces a new modular training framework for MM-LLMs that integrates visual context. To evaluate the performance of different models on multi-modal tasks, this framework also includes an instruction evaluation dataset, OwlEval.
(6) X-LLM: Extends MM-LLMs to additional modalities, including audio, demonstrating strong scalability. It leverages the language-transfer capability of the Q-Former and has been successfully applied in the context of Chinese, a Sino-Tibetan language.
(7) VideoChat: Pioneered an efficient chat-centered MM-LLM for video understanding dialogue. This research sets standards for future studies in this field and provides protocols for academia and industry.
(8) InstructBLIP: This model is trained based on the BLIP-2 model, updating only the Q-Former during the MM IT phase. By introducing instruction-aware visual feature extraction and corresponding instructions, this model can extract flexible and diverse features.
(9) PandaGPT: A pioneering general-purpose model capable of understanding instructions from six different modalities and acting accordingly: text, image/video, audio, thermal, depth, and inertial measurement unit (IMU) data.
(10) PaLI-X: Its training process utilizes a mix of visual language objectives and unimodal objectives, including prefix completion and masked-token completion. Research shows that this method is effective for downstream tasks and reaches the Pareto frontier in fine-tuning settings.
(11) Video-LLaMA: Proposes a multi-branch cross-modal pre-training framework that allows LLMs to process the visual and audio content of a given video while conversing with humans. This framework aligns vision with language and audio with language.
(12) Video-ChatGPT: This model is specifically designed for video dialogue tasks and can generate discussions about videos by integrating spatiotemporal visual representations.
(13) Shikra: Proposes a simple yet unified pre-training MM-LLM, specifically tuned for referential dialogue tasks. Referential dialogue tasks involve discussing regions and objects in images. This model demonstrates commendable generalization abilities, effectively handling unseen scenarios.
(14) DLP: Proposes a P-Former to predict the ideal prompt, trained on a unimodal sentence dataset. This indicates that unimodal training can enhance multi-modal learning.
(15) BuboGPT: To fully understand multi-modal content, this model learns a shared semantic space during its construction. It explores the fine-grained relationships between different modalities such as images, text, and audio.
(16) ChatSpot: Proposes a simple yet effective method for fine-tuning MM-LLMs to accurately follow instructions, thereby facilitating fine-grained interactions. By integrating precise reference instructions (composed of image-level and region-level instructions), multi-granular visual language task descriptions are enhanced.
(17) Qwen-VL: A multi-language MM-LLM supporting both English and Chinese. Qwen-VL also allows for multiple images to be input during the training phase, improving its ability to understand visual context.
(18) NExT-GPT: An end-to-end, general-purpose MM-LLM that supports arbitrary modality-to-arbitrary modality interactions, allowing for free input and output of images, videos, audio, and text. It employs a lightweight alignment strategy—using LLM-centered alignment in the encoding phase and instruction-following alignment in the decoding phase.
(19) MiniGPT-5: This MM-LLM integrates a technique for converting LLM outputs into generative tokens and incorporates Stable Diffusion. It excels at multi-modal generation tasks with interleaved vision-and-language outputs. During training, it adds classifier-free guidance to improve generation quality.
(20) LLaVA-1.5: This model is based on the LLaVA framework with simple modifications, including the use of an MLP projection, introduction of VQA data adjusted for academic tasks, and use of prompts with simple response formats. These adjustments enhance the model’s multi-modal understanding capabilities.
(21) MiniGPT-v2: This MM-LLM is designed as a unified interface for diverse visual language multi-task learning. To create a single model capable of proficiently handling various visual language tasks, identifiers are integrated during both training and inference phases for clear task differentiation, ultimately improving learning efficiency.
(22) CogVLM: An open-source MM-LLM that builds bridges between different modalities through a trainable visual expert module used in attention and feedforward layers. This allows for deep fusion of multi-modal features without compromising performance on downstream NLP tasks.
(23) DRESS: Proposes a method for enhancing alignment with human preferences using natural language feedback. DRESS extends conditional reinforcement learning algorithms to integrate non-differentiable natural language feedback and trains models to generate appropriate responses based on that feedback.
(24) X-InstructBLIP: Proposes a cross-modal framework using instruction-aware representations, sufficient to assist LLMs in handling diverse tasks across multiple modalities (including images/videos, audio, and 3D). Notably, it accomplishes this without requiring pre-training for specific modalities.
(25) CoDi-2: A multi-modal generation model that excels at multi-modal fusion instruction following, in-context generation, and multi-turn dialogue between users and the model. It enhances CoDi, enabling it to handle complex interleaved modality inputs and instructions and to generate latent features autoregressively.
(26) VILA: This model performs excellently on visual tasks and shows strong reasoning capabilities while preserving its text-only abilities. VILA's strong performance is attributed to fully exploiting the LLM's learning capability, using the interleaved properties of image-text pairs, and carefully re-blending text-only data.
Current trends in MM-LLM development:
(1) Development from focusing on multi-modal understanding to specific modality generation, and further towards arbitrary modality-to-arbitrary modality transformations (e.g., MiniGPT-4 → MiniGPT-5 → NExT-GPT).
(2) Continuous optimization of training processes from MM PT to SFT and then to RLHF, striving for better alignment with human intentions and enhancing the model’s dialogue interaction capabilities (e.g., BLIP-2 → InstructBLIP → DRESS).
(3) Embracing diverse modality expansions (e.g., BLIP-2 → X-LLM and InstructBLIP → X-InstructBLIP).
(4) Integrating higher quality training datasets (e.g., LLaVA → LLaVA-1.5).
(5) Adopting more efficient model architectures, from the complex Q-Former and P-Former input projector modules in BLIP-2 and DLP to the simpler yet effective linear projector in VILA.
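To illustrate that last trend, here is a rough PyTorch sketch contrasting a Q-Former-style input projector, in which a small set of learnable queries cross-attends to the visual features, with a plain linear projector. The TinyQFormer module, its layer sizes, and all dimensions are arbitrary assumptions, not the published BLIP-2, DLP, or VILA configurations.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Very rough Q-Former-style projector: learnable queries cross-attend to image features."""

    def __init__(self, d_vision=1024, d_llm=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_vision))
        self.cross_attn = nn.MultiheadAttention(d_vision, n_heads, batch_first=True)
        self.to_llm = nn.Linear(d_vision, d_llm)

    def forward(self, image_feats):                        # (B, N_patches, d_vision)
        q = self.queries.expand(image_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_llm(fused)                          # (B, n_queries, d_llm)

# The simpler alternative: one linear map applied to every visual token.
linear_projector = nn.Linear(1024, 4096)

image_feats = torch.randn(2, 256, 1024)
print(TinyQFormer()(image_feats).shape)      # torch.Size([2, 32, 4096])
print(linear_projector(image_feats).shape)   # torch.Size([2, 256, 4096])
```

The Q-Former-style module compresses the visual sequence to a fixed number of query tokens before it reaches the LLM, whereas the linear projector keeps one token per visual patch; the trend noted above is that the simpler option is often effective enough.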
Benchmarks and Performance
To comprehensively compare the performance of various models, the team compiled a table containing data from multiple papers on major MM-LLMs, covering 18 visual language benchmarks, as shown in Table 2.
Future Directions
The team finally discussed some promising future research directions in the MM-LLM field:
- More powerful models: enhancing the capabilities of MM-LLMs mainly through four key avenues: expanding modalities, diversifying LLMs, improving the quality of multi-modal instruction-tuning datasets, and strengthening multi-modal generation capabilities.
- More challenging benchmarks
- Mobile/lightweight deployment
- Embodied intelligence
- Continuous instruction tuning