How Multimodal Large Models Reshape Computer Vision

Introduction:
This article delves into the concept of Multimodal Large Language Models (MLLMs). These models not only inherit the powerful reasoning capabilities of Large Language Models (LLMs) but also integrate the ability to process multimodal information, allowing them to handle different types of data, such as text and images, with ease.


In short, Multimodal Large Language Models (MLLMs) are models that combine the reasoning capabilities of large language models (such as GPT-3 or LLaMA-3) with the ability to receive, reason over, and output multimodal information.
The following image shows an example of a multimodal AI system in the healthcare field, which receives two inputs:
A medical image;
A text query: “Is there any pleural effusion in this image?”
The system then outputs an answer (i.e., a prediction) for the given query.
Figure 1|A multimodal medical system created by combining a vision encoder for radiology images with an LLM ©️【Deep Blue AI】

1.1 The Rise of Multimodal Technology in AI

In recent years, the field of artificial intelligence has undergone significant transformations, driven largely by the rise of Transformers in language models. Since Google introduced this architecture in 2017, its application to and impact on computer vision are no longer a new topic.
One of the earliest relevant examples is the Vision Transformer (ViT), which splits an image into multiple patches and treats them as independent visual tokens in the input representation.
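As a rough illustration of that idea, the snippet below is a minimal sketch (with an illustrative patch size and embedding dimension, not ViT's actual configuration) of how an image can be split into fixed-size patches and projected into a sequence of visual tokens.

```python
# Minimal sketch of ViT-style patch embedding: cut an image into fixed-size
# patches and project each patch to a token embedding (illustrative values only).
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768
to_tokens = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # dummy RGB image, batch of 1
tokens = to_tokens(image)                    # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 visual tokens
print(tokens.shape)                          # torch.Size([1, 196, 768])
```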
With the vigorous development of LLMs, a new type of generative model—MLLM—has emerged.
As shown in the figure below, by 2023 most major tech companies had developed at least one MLLM, and OpenAI’s release of GPT-4o in May 2024 dominated the headlines.
Figure 2|Some of the multimodal large language models (MLLMs) developed between 2022 and 2024 ©️【Deep Blue AI】
MLLMs vs VLMs vs Foundation Models:
Some people believe that MLLMs are the true foundation models. For example, Google’s Vertex AI considers multimodal large language models such as Claude 3, PaliGemma, or Gemini 1.5 as its foundation models.
On the other hand, Vision Language Models (VLMs) are a specialized category of multimodal models focused on integrating text and image inputs to generate text outputs.
The main differences between multimodal models and VLMs are:

Multimodal models can handle a wider variety of modalities, while VLMs are mainly limited to processing text and images;

Compared to multimodal models, VLMs have weaker reasoning capabilities.

1.2 MLLM Structure

As shown in the figure below, the structure of an MLLM is mainly divided into three parts:

Modal Encoder: This component compresses raw inputs such as images and audio into more compact representations. A popular strategy is to use a pre-trained encoder (such as CLIP) that is already aligned with text, avoiding the need to train one from scratch.

LLM Backbone: This is the “brain” of the MLLM, the language model that produces the text response. The encoder takes in images, audio, or video and produces features, which are then handled by the connector (or modal interface) before reaching the LLM.

Modal Interface (i.e., Connector): It serves as the intermediary between the encoder and the LLM. Since the LLM can only interpret text, it is crucial to bridge text and the other modalities effectively.

Figure 3|Multimodal understanding: the components of the first stage of a multimodal model ©️【Deep Blue AI】
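To make these three components concrete, here is a minimal, hypothetical PyTorch sketch of how a frozen vision encoder, a simple linear connector, and an LLM backbone might be wired together; the class, the dimensions, and the interfaces are illustrative assumptions, not the architecture of any specific MLLM.

```python
# Illustrative skeleton only: frozen vision encoder -> connector -> LLM backbone.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP-style ViT, kept frozen
        self.connector = nn.Linear(vision_dim, llm_dim)  # simplest possible modal interface
        self.llm = llm                                   # decoder-only language model

    def forward(self, image, text_embeds):
        with torch.no_grad():                            # the encoder is typically frozen
            vis_feats = self.vision_encoder(image)       # assumed shape: (B, N_patches, vision_dim)
        vis_tokens = self.connector(vis_feats)           # project into the LLM's token space
        # Prepend the visual tokens to the text embeddings so the LLM attends to both.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)            # assumes a HF-style causal LM interface
```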


This time, rather than enumerating the various use cases in which these models excel, the author uses several GPUs to test three of the top MLLMs with challenging queries (skipping the usual cat-and-dog examples).

GPT-4o: The most powerful multimodal model, released by OpenAI in May 2024. It is accessible through the vision features of OpenAI’s API; a minimal query sketch follows this list.

LLaVA 7b: This model combines a vision encoder with Vicuna for general visual and language understanding; its performance is impressive, at times approaching that of GPT-4.

Apple Ferret 7b: An open-source MLLM developed by Apple. It achieves spatial understanding through referring and grounding, enabling the model to recognize and describe regions of any shape in an image with precision, and it is especially strong at understanding small image regions.
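For reference, a query like the hard-hat test in the next section can be sent to GPT-4o in a few lines with the official openai Python client; this is only a minimal sketch, and the image file name and prompt text are placeholders rather than the exact ones used in the experiments.

```python
# Minimal sketch: send an image plus a challenging counting prompt to GPT-4o
# via the official openai Python client (assumes OPENAI_API_KEY is set).
# The file name and prompt text are placeholders.
import base64
from openai import OpenAI

client = OpenAI()

with open("construction_site.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the hard hats in this image and give a bounding box "
                     "for each one as [x_min, y_min, x_max, y_max] in pixels."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)  # free-form text answer
```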

2.1 Counting Objects in the Presence of Occlusion

The following figure demonstrates the performance of these three models when given an image and a challenging prompt asking them to count the number of hard hats.
Figure 4|The Apple Ferret model is the only one that correctly identifies the bounding box positions (including the occluded ones) ©️【Deep Blue AI】
Although GPT-4o provides a detailed scene description, its localization of the hard hats is off: some of the coordinates it returns exceed the actual image size, which explains why only one bounding box is visible in the lower right corner of the frame.
The open-source model LLaVA failed to identify all four hard hats, missing the occluded one on the left, and the bounding box positions it provided also contained errors.
Surprisingly, Apple’s Ferret model demonstrated outstanding detection capabilities, successfully identifying all four objects in the image, including the occluded one on the left! This is undoubtedly a remarkable result.
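Because some of the returned coordinates fall outside the frame, a small helper like the hypothetical one below is useful for sanity-checking MLLM box predictions against the real image size before drawing them; the box values in the example are made up for illustration.

```python
# Hypothetical helper: clip predicted [x_min, y_min, x_max, y_max] boxes to the
# actual image size and draw them, flagging any box that exceeds the frame.
from PIL import Image, ImageDraw

def draw_clipped_boxes(image_path, boxes, out_path="boxes.png"):
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for x_min, y_min, x_max, y_max in boxes:
        clipped = (max(0, x_min), max(0, y_min), min(w, x_max), min(h, y_max))
        if clipped != (x_min, y_min, x_max, y_max):
            print(f"Box {(x_min, y_min, x_max, y_max)} exceeds the {w}x{h} image")
        draw.rectangle(clipped, outline="red", width=3)
    img.save(out_path)

# Made-up predictions: the second box lies outside a 1280x720 frame.
draw_clipped_boxes("construction_site.jpg",
                   [(120, 80, 210, 160), (1400, 100, 1550, 220)])
```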

2.2 Autonomous Driving: Risk Perception and Planning

To explore these models’ capabilities further, the original author selected this scene from an autonomous driving dataset and deliberately raised the difficulty of the task: the models are asked to assess the risk posed by both vehicles and pedestrians from the perspective of the autonomous vehicle (see figure below).
Figure 5|Asking the models to detect objects and assess risks: the Apple Ferret model performed better than GPT-4o ©️【Deep Blue AI】
The results showed that LLaVA’s performance was quite poor: it failed to recognize the large truck in front of the autonomous vehicle, leading to a misjudgment. This raises the question: are open-source models really not up to such challenging tasks?
While GPT-4o excels at detailed, well-reasoned text responses, it again struggles to produce accurate bounding boxes. In contrast, Apple’s Ferret stood out as the only model able to detect most objects with precise bounding box coordinates, which only adds to its appeal.

2.3 Sports Analysis: Object Detection and Scene Understanding

So far, at least one model, namely Apple Ferret, has demonstrated exceptional performance in counting and detecting objects. Now, let’s turn our attention to a more challenging area: sports analysis.
Typically, single-modal, fine-tuned architectures (such as YOLO) perform well at detecting players in football matches; a baseline of this kind is sketched below. So, can MLLMs deliver equally strong performance in this domain? We shall see.
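For comparison, the single-modal baseline mentioned above can be run in a few lines with the ultralytics package; this is only a sketch, and the weights file and image path are placeholders.

```python
# Sketch of the single-modal detector (YOLO) used as a point of comparison.
# Assumes the ultralytics package is installed; paths and weights are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # small COCO-pretrained model
results = model("football_match.jpg")    # run inference on the match frame
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)   # class id, confidence, [x1, y1, x2, y2]
```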
Figure 6|Football match scene tested with the three MLLMs in this article ©️【Deep Blue AI】
Question/Prompt: As an AI system deeply engaged in the sports field, particularly focusing on football, you will be presented with a football match scene. Your tasks include:
  • Providing a detailed description of the scene;

  • Accurately counting the number of players from each team;

  • Providing the bounding box coordinates for the football and the goalkeeper;

  • Assessing the likelihood of a goal and predicting which team is more likely to score.

However, as shown in the figure below, none of the three models managed to accurately identify the two teams and their players when asked to detect the players and the ball!
Figure 7|Football match scene tested with the three MLLMs in this article ©️【Deep Blue AI】
Overall, the average performance of MLLMs is commendable, but it is evident that they still have room for improvement when faced with more complex computer vision tasks.


Here are some top MLLMs that have redefined the field of computer vision:
3.1 GPT-4o (2024, OpenAI)

Input: Text, Image, Audio (Beta), Video (Beta)

Output: Text, Image

Introduction: GPT-4o, short for “GPT-4 Omni,” owes the “Omni” to its multimodal capabilities across the text, vision, and audio modalities. It is a unified model capable of understanding and generating combinations of text, image, audio, and video inputs and outputs.

Trial link: https://chatgpt.com/

Fun fact: GPT-4o employs a “multimodal chain-of-thought” approach, first working out how to break a problem into a series of steps across different modalities and then executing those steps to arrive at a solution.

3.2 Claude 3.5 Sonnet (2024, Anthropic)

Input: Text, Image

Output: Text

Introduction: Claude 3.5 Sonnet is a multimodal AI system with a context window of 200,000 tokens that understands text and images and generates text. It excels at in-depth analysis, research, hypothesis generation, and task automation across a wide range of fields.

Trial link: https://claude.ai

Fun fact: Anthropic employs a technique called “recursive reward modeling,” which uses earlier versions of Claude to provide feedback and rewards for the model’s outputs.

3.3 LLaVA (2023, University of Wisconsin-Madison)

Input: Text, Image

Output: Text

Introduction: LLaVA (Large Language and Vision Assistant) is an open-source multimodal AI model that takes text and visual data as input and generates text as output. It is comparable to GPT-4 in chat capabilities and set a new state of the art on Science QA, demonstrating advanced visual-language understanding.

Trial link: https://llava-vl.github.io

Fun fact: LLaVA is trained with a technique called “visual instruction tuning,” in which GPT-4 generates synthetic multimodal instruction-following examples involving text and images; LLaVA learns from these GPT-4-generated examples without direct human supervision.
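Beyond the demo page, LLaVA checkpoints can also be run locally; the sketch below uses the Hugging Face transformers integration, assuming the llava-hf/llava-1.5-7b-hf checkpoint, a local image file, and a GPU with enough memory.

```python
# Minimal local-inference sketch for LLaVA via Hugging Face transformers.
# Assumes the llava-hf/llava-1.5-7b-hf checkpoint; the image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("scene.jpg")
prompt = "USER: <image>\nHow many hard hats are visible in this picture? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```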
3.4 Gemini 1.5 (2024, Google)

Input: Text, Image, Audio (Beta), Video (Beta)

Output: Text, Image

Introduction: Gemini is a family of large language models developed by Google that can understand and operate across multiple modalities, including text, image, audio (Beta), and video (Beta). It debuted in December 2023 in three optimized variants: Gemini Ultra (the largest), Gemini Pro (for scaling across a wide range of tasks), and Gemini Nano (for on-device tasks).

Trial link: https://gemini.google.com/

Fun fact: The name Gemini is derived from the Gemini constellation in Greek mythology, representing duality, which aptly reflects its powerful capabilities as both a language model and its ability to process and generate multimodal data such as images, audio, and video.

3.5 Qwen-VL (2024, Alibaba Cloud)

Input: Text, Image

Output: Text

Introduction: Qwen-VL is an open-source multimodal AI model that combines language and vision capabilities. It extends the Qwen language model and is designed to overcome the limited generalization of earlier multimodal models. Recent upgraded versions (Qwen-VL-Plus and Qwen-VL-Max) offer improved image reasoning, finer-grained analysis of image and text details, and support for high-resolution images with different aspect ratios.

Trial link: https://qwenlm.github.io/blog/qwen-vl/

Fun fact: After its launch, Qwen-VL quickly rose to the top of the OpenVLM leaderboard, but it has since been surpassed by more powerful models, most notably GPT-4o.

©️【Deep Blue AI】
Written by|Sienna
Reviewed by|Los

