VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Paper Link: https://arxiv.org/pdf/2501.13106

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric, in two senses: a vision-centric training paradigm and a vision-centric framework design.

The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for image and video understanding. Instead of preparing a large video-text dataset, we focus on building a large-scale and high-quality image-text dataset.

As shown in Figure 2, VideoLLaMA3 has four training stages:

1) Adapting the visual encoder to accept images of different resolutions as input;

2) Visual-language alignment, jointly tuning the visual encoder, projector, and LLM on a large-scale image-text dataset covering various types (including scene images, documents, and charts) as well as text-only data;

3) Multi-task fine-tuning, integrating image-text SFT data for downstream tasks and establishing a foundation for video understanding using video-text data;

4) Video-centric fine-tuning, further enhancing the model’s capability in video understanding.

In terms of framework design, to better capture fine-grained details in images, the pre-trained visual encoder is adapted to encode images of different sizes into a corresponding number of visual tokens, rather than a fixed number of tokens.

For video input, we reduce the number of visual tokens based on their similarity, making the representation of videos more precise and compact. Thanks to the vision-centric design, VideoLLaMA3 achieves remarkable performance in benchmark tests for image and video understanding.

Figure 1: Performance Comparison of VideoLLaMA3 with Previous Methods

Figure 2: Training Paradigm of VideoLLaMA3

Methods

As shown in Figure 3, in terms of model, VideoLLaMA3 includes two key technical points: Arbitrary Resolution Visual Tokenization (AVT) and Differential Frame Pruner (DiffFP).

In terms of data, since we propose enhancing video understanding capabilities based on image understanding, we have also developed a pipeline for constructing a high-quality re-annotated image dataset.

Figure 3: Architecture of VideoLLaMA3

Arbitrary Resolution Visual Tokenization

In multimodal large language models (MLLMs), visual inputs are encoded into visual tokens for multimodal understanding, typically with a pre-trained ViT-based visual encoder.

Such pre-trained visual encoders usually accept only fixed-resolution images, which leads to information loss. To mitigate this loss, the AnyRes technique was proposed to segment images into fixed-resolution blocks.

Although the AnyRes technique increases the number of visual tokens, it remains inflexible and ignores the positional relationship within the image when extracting visual tokens.

In VideoLLaMA3, we adopt the idea of Arbitrary Resolution Visual Tokenization (AVT) to dynamically process images and videos of arbitrary resolutions.

Specifically, we adjust the pre-trained visual encoder (based on ViT architecture) to handle different resolutions by replacing absolute position embeddings in ViT with 2D-RoPE.

With AVT, images and videos of different resolutions can be better represented, and the visual tokens contain more details. To make the visual encoder compatible with AVT, we fine-tune the visual encoder and projector during the visual encoder adaptation phase (i.e., the first phase in Figure 2) using scene images, document data, and scene images with text.
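
To make the 2D-RoPE adaptation concrete, below is a minimal sketch (not the released implementation; the tensor shapes, function names, and base frequency of 10000 are assumptions) of rotary embeddings indexed by each patch's row and column, which can replace fixed absolute position embeddings so that any patch-grid size is supported.

```python
# Sketch of 2D rotary position embeddings for a ViT with a variable patch grid.
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def build_2d_rope(h_patches, w_patches, head_dim, base=10000.0):
    """Return cos/sin tables indexed by the (row, col) position of each patch."""
    # Half of the head dimension encodes the row index, the other half the column index.
    dim_half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(0, dim_half, 2).float() / dim_half))
    rows = torch.arange(h_patches).float()
    cols = torch.arange(w_patches).float()
    freq_r = torch.einsum("i,j->ij", rows, inv_freq)   # (H, dim_half // 2)
    freq_c = torch.einsum("i,j->ij", cols, inv_freq)   # (W, dim_half // 2)
    # Broadcast to the full H x W grid and concatenate row/column frequencies.
    freq_r = freq_r[:, None, :].expand(h_patches, w_patches, -1)
    freq_c = freq_c[None, :, :].expand(h_patches, w_patches, -1)
    freqs = torch.cat([freq_r, freq_c], dim=-1).reshape(h_patches * w_patches, dim_half)
    emb = torch.cat([freqs, freqs], dim=-1)            # (N, head_dim)
    return emb.cos(), emb.sin()

def apply_2d_rope(q, k, cos, sin):
    """q, k: (batch, heads, N, head_dim); cos, sin: (N, head_dim)."""
    q_rot = q * cos + rotate_half(q) * sin
    k_rot = k * cos + rotate_half(k) * sin
    return q_rot, k_rot
```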

Figure 4: Calculation Process of DiffFP

Differential Frame Pruner

After tokenization, a video usually yields far more tokens than an image. To reduce the computational cost for videos, we apply 2×2 spatial downsampling to each frame via bilinear interpolation, keeping the context length within a manageable range.
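
A minimal sketch of this per-frame 2×2 downsampling (the tensor shapes and helper name are assumptions; the actual code may operate inside the encoder rather than on its output):

```python
# 2x2 spatial downsampling of one frame's token map via bilinear interpolation.
import torch
import torch.nn.functional as F

def downsample_frame_tokens(tokens, h, w):
    """tokens: (h*w, dim) visual tokens of one frame laid out on an h x w grid."""
    dim = tokens.shape[-1]
    grid = tokens.reshape(1, h, w, dim).permute(0, 3, 1, 2)      # (1, dim, h, w)
    grid = F.interpolate(grid, size=(h // 2, w // 2), mode="bilinear", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(-1, dim)             # (h//2 * w//2, dim)
```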

Additionally, considering that videos consist of frames with overlapping content, representing videos by stacking visual tokens from each frame would lead to token redundancy and repetition. To further reduce the number of tokens in videos, we propose the Differential Frame Pruner (DiffFP) to prune video tokens.

Inspired by RLT, we compute the 1-norm distance between temporally consecutive patches in pixel space. Patches whose distance to the previous frame is small are considered redundant, and the later patch can be pruned.

Specifically, as shown in Figure 4, we first calculate the 1-norm distance between consecutive frames in pixel space, then remove patches with distances below a predefined threshold. Following RLT, we set the default threshold to 0.1.
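
The pruning rule can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the patch size of 14 and pixel values in [0, 1] are assumptions, while the 0.1 threshold follows the text above.

```python
# DiffFP-style pruning: drop a patch when it barely differs from the same patch
# in the previous frame, measured by mean L1 distance in pixel space.
import torch

def diff_frame_prune(frames, patch_size=14, threshold=0.1):
    """frames: (T, C, H, W) pixel values in [0, 1]. Returns a boolean keep-mask of
    shape (T, num_patches); the first frame is always kept in full."""
    T, C, H, W = frames.shape
    # Split each frame into non-overlapping patches: (T, num_patches, C*ps*ps).
    patches = frames.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch_size * patch_size)
    keep = torch.ones(T, patches.shape[1], dtype=torch.bool)
    # Mean L1 distance between each patch and the same patch in the previous frame.
    dist = (patches[1:] - patches[:-1]).abs().mean(dim=-1)       # (T-1, num_patches)
    keep[1:] = dist > threshold
    return keep
```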

Building a High-Quality Image Re-annotation Dataset

To train our VideoLLaMA3, we constructed a high-quality image re-annotation dataset, VL3-Syn7M. All images in this dataset come from COYO-700M and were processed using the following cleaning process we proposed:

Aspect ratio filtering. We first filter images based on their aspect ratios, removing those with extreme values. This step ensures the dataset contains images with typical aspect ratios, preventing potential biases in feature extraction during the process. For example, overly long or wide images may distort the model’s interpretation due to their unusual shapes.

Aesthetic score filtering. An aesthetic scoring model is applied to assess the visual quality of images, and images with low aesthetic ratings are discarded. This step eliminates images with poor visual quality or composition, reducing noise and improving the quality of the descriptions generated by the model.

Text-image similarity calculation and rough annotation. The BLIP2 model generates an initial caption for each image, and the CLIP model then computes the text-image similarity. Images with low similarity are excluded, as they may contain content that is difficult to describe concisely. This process ensures that the remaining images are both describable and easy to interpret.

Visual feature clustering. Visual features are extracted using the CLIP visual model, and the K-nearest neighbors (KNN) algorithm is applied for clustering. This method identifies cluster centers in the visual feature space. From each cluster, we select a fixed number of images. This approach ensures diversity within the dataset while maintaining a balanced distribution of semantic categories, enhancing the model’s generalization ability across various visual content.
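
Putting the four filtering steps together, the sketch below shows one way the pipeline could be expressed over precomputed per-image statistics. The thresholds, cluster counts, and the use of scikit-learn's KMeans in place of the KNN-based clustering described above are illustrative assumptions.

```python
# Compressed sketch of the VL3-Syn7M cleaning steps over precomputed statistics.
import numpy as np
from sklearn.cluster import KMeans

def clean_and_sample(aspect_ratios, aesthetic_scores, clip_similarities,
                     clip_features, per_cluster=1000, n_clusters=100):
    """All inputs are aligned arrays over the candidate images."""
    keep = np.ones(len(aspect_ratios), dtype=bool)
    # 1) Aspect-ratio filtering: drop extremely long or wide images.
    keep &= (aspect_ratios > 1 / 3) & (aspect_ratios < 3)
    # 2) Aesthetic-score filtering: drop images with low visual quality.
    keep &= aesthetic_scores > 4.5
    # 3) Text-image similarity: drop images whose BLIP2 caption matches poorly under CLIP.
    keep &= clip_similarities > 0.25
    idx = np.where(keep)[0]
    # 4) Cluster the surviving CLIP features and take a fixed number of images per
    #    cluster to keep the semantic distribution balanced.
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(clip_features[idx])
    selected = []
    for c in range(n_clusters):
        members = idx[labels == c]
        selected.extend(members[:per_cluster].tolist())
    return selected
```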

Image re-annotation. After filtering and clustering the images, we perform detailed re-annotation. Short captions are generated with InternVL2-8B [31, 53], while detailed captions are produced by InternVL2-26B [31, 53]. These two caption sets (VL3-Syn7M-short and VL3-Syn7M-detailed) are used at different stages of training to meet different needs.

Through the above cleaning and re-annotation process, we created the VL3-Syn7M dataset, which contains 7 million image-caption pairs. This high-quality dataset is a key component in training our model, providing a rich variety of images and annotations to support strong performance across a wide range of visual tasks.

Training

As shown in Figure 3, VideoLLaMA3 includes four key components: visual encoder, video compressor, projector, and a large language model (LLM).

The visual encoder extracts visual tokens and is initialized with the pre-trained SigLIP. To reduce the number of visual tokens representing videos, a video compressor is used. The projector connects the features between the visual encoder and the LLM. For the LLM, we use the Qwen2.5 model.

Inspired by previous explorations in multimodal large language models (MLLMs), we developed video understanding capabilities based on a strong foundation of image understanding. To enable the model to have robust image and video understanding capabilities, the training of VideoLLaMA3 is divided into four stages:

1) Visual encoder adaptation,

2) Visual-language alignment,

3) Large-scale multi-task fine-tuning,

4) Video-centric fine-tuning.

While the first three stages mainly focus on enhancing image understanding capabilities, the last stage focuses on video understanding. The details of the training stages are as follows:

Visual encoder adaptation stage. In this stage, we fine-tune the visual encoder, initialized from pre-trained SigLIP, on a large-scale image dataset.

The visual encoder is trainable while the language decoder remains frozen. This fine-tuning turns the encoder into a dynamic-resolution processor, enhancing its ability to handle images of different resolutions. Meanwhile, the projector is also trained to better align the visual encoder's features with those of the LLM.

Visual-language alignment stage. This stage primarily focuses on introducing multimodal knowledge into the model. In this stage, all parameters are set to be trainable, allowing both the LLM and visual encoder to be fine-tuned to integrate multimodal knowledge.

Multi-task fine-tuning stage. In this stage, we perform instruction tuning using multimodal question-answering data containing image and video questions. This step is crucial for enhancing the model’s ability to follow natural language instructions and improve its multimodal understanding.

Additionally, this stage lays the foundation for the model’s video understanding capabilities, enabling it to handle and analyze temporal information. Furthermore, in this stage, we introduce a video compressor to reduce the number of video tokens.

Video-centric fine-tuning stage. In this stage, we focus on enhancing the model's video understanding capabilities, with all parameters unfrozen. The data used includes video-text data, image-only data, and text-only data.

Figure 5: Data Formats of Different Data Types

Data Formats

Figure 5 illustrates the data formats for images, videos, and streaming videos.

Image sequences. Images are represented as a series of tokens called image tokens. The “\n” character is used to separate tokens belonging to different images. Additionally, text tokens follow the image tokens, separated by “\n”, achieving a mixed representation of image and text data.

Video sequences. Frames in a video sequence are represented as frame tokens. A timestamp token is inserted before each frame’s token, formatted as “Time: xxs”, to indicate the time corresponding to that frame.

Frames in a video sequence are separated by commas “,”. After the video tokens, a “\n” is inserted to separate the video data from the subsequent text tokens, ensuring a clear distinction between the two modalities.

Streaming video sequences. For streaming video data, video and text tokens are interleaved in the sequence. Timestamps (i.e., “Time: xxs”) are inserted before frame tokens, similar to video sequences.

To simulate the interactive scenario of streaming videos, answer tokens (i.e., “GPT: xxx”) may appear in the sequence to indicate contextualized outputs or interactions. The interleaved format ensures seamless integration of video and text data streams.
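
As a concrete illustration, here is a minimal sketch of how these layouts could be rendered as strings. The "<frame>" placeholder for a frame's visual tokens and the separator between interleaved streaming items are assumptions, since the text above only specifies the "Time: xxs" and "GPT: xxx" markers.

```python
# Illustrative rendering of the video and streaming-video sequence layouts.

def format_video_sequence(timestamps, question):
    """timestamps: frame times in seconds; frames are comma-separated, and a newline
    separates the video tokens from the following text tokens."""
    frames = [f"Time: {t:g}s<frame>" for t in timestamps]
    return ",".join(frames) + "\n" + question

def format_streaming_sequence(events):
    """events: list of ("frame", time_in_seconds) or ("gpt", answer) in temporal order;
    the comma separator between interleaved items is an assumption."""
    parts = []
    for kind, value in events:
        if kind == "frame":
            parts.append(f"Time: {value:g}s<frame>")
        else:
            parts.append(f"GPT: {value}")
    return ",".join(parts)

# Example: format_video_sequence([0, 1, 2], "What happens in the video?")
# -> "Time: 0s<frame>,Time: 1s<frame>,Time: 2s<frame>\nWhat happens in the video?"
```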

Data Mixing

Following the principles outlined in LLaVA-OneVision, namely “quality over quantity”, we conduct rigorous cleaning procedures to ensure data quality.

In this section, we detail the data mixture at each stage, as well as the synthesis and cleaning methods applied to the different data subsets.

Visual Encoder Adaptation

Table 1: Data Mixing in Visual Encoder Adaptation Stage

The visual encoder adaptation stage aims to enhance the model's ability to understand diverse scenarios and improve its feature extraction capability, focusing in particular on capturing fine-grained information such as objects, regions, and text.

As shown in Table 1, the training data for this stage combines scene images and document recognition images, along with a small number of scene text images. Note that all data labeled as “Recap” use captions generated with InternVL2-8B.

For scene images, our data sources include VL3-Syn7M-short, LLaVA-Pretrain-558K [55], Object365, and SA-1B. Notably, Object365 and SA-1B are included to enhance data diversity, as their images primarily depict complex scenes.

Scene text images come from BLIP3-OCR. The short re-annotations and the text content in the images serve as captions, with the text content transcribed in left-to-right, top-to-bottom reading order across the image.

The document images used in this stage are a subset of pdfa-eng-wds and idl-wds. A total of 2.8 million images were selected from these two datasets, and the text content of the documents serves as the image captions, following reading order.

Visual-Language Alignment

Table 2: Data Mixing in Visual-Language Alignment Stage

In this stage, we fine-tune the model using high-quality data. As shown in Table 2, we selected five types of data to cover a wide range of everyday scenarios: scene images, scene text images, documents, charts, and fine-grained data, as well as a large amount of high-quality text-only data.

For scene images, we included COCO-2017, Object365, SA-1B, ShareGPT4o, ShareGPT4V, DenseFusion, and LLaVA-Recap (LCS-558K). For the Object365, COCO-2017, and SA-1B datasets, we combined the original image annotations with InternVL2-26B to re-annotate the images and generate detailed captions.

Scene text images include a variety of Chinese and English scene text recognition datasets. These datasets, such as BLIP3-OCR, COCO-Text, TextOCR, LSVT, and ReCTS, provide diverse examples of text in real-world environments.

Additionally, we filtered the LAION dataset for images with clear and readable text, obtaining a total of 3 million high-quality images, which we refer to as the Laion-OCR dataset. The captions for Laion-OCR include the text content together with bounding-box annotations of each text's position.

For document images, we included pdfa-eng-wds, idl-wds, UReader-TR, Vary600k, and SynthDoG. The SynthDoG dataset consists of synthetically generated document images with precise ground truth, avoiding manual annotation errors and ensuring the accuracy of model training.

Moreover, we added the handwritten document dataset FUNSD and the complex document dataset DUDE. FUNSD provides annotated handwritten samples for handwriting recognition, while DUDE contains documents with complex layouts, enhancing the model's ability to handle various types of documents.

For chart images, due to the many similarities in content presentation between charts and documents, we only included a limited amount of chart data. This data comes from the Chart-to-Text dataset.

For fine-grained images, we constructed two types of data: region caption data and localization caption data. Region caption data describes the content of specific areas of an image; it comes from the Osprey-724K, Object365, ADE20K, and MDVP-Data datasets and helps the model understand region-level details within images.

Localization caption data includes text descriptions of objects together with the corresponding bounding-box annotations, primarily constructed from the Flickr-30K and GranD datasets. These two types of data enhance the model's understanding of images, supporting more accurate object localization and recognition in complex scenes.

Multi-task Fine-tuning

Table 3: Data Mixing in Multi-task Fine-tuning Stage

In this stage, we refine the model’s ability to interpret and follow natural language instructions by using instruction-following data for instruction tuning.

This data mixing aims to cover a wide range of tasks, allowing the model to learn to execute various actions based on instructions in different contexts and modalities. Additionally, to activate the model’s video understanding capabilities, we incorporated general video data.

Similar to the visual-language alignment stage, we categorized the image data into six different groups: general, document, chart/graphic, OCR, localization, and multi-image, as shown in Table 3.

Each category targets specific aspects of visual understanding, ensuring that the model can effectively handle tasks related to different types of visual information. Besides these visual data categories, we also included a large amount of text-only data to enhance the model’s ability to handle diverse instruction-following tasks involving visual and text inputs.

General image data includes high-quality datasets such as LLaVA-SFT-665K and LLaVA-OV-SI, which serve as foundational resources to enhance the model’s scene understanding capabilities. We also cleaned and filtered the Cambrian-10M dataset.

Additionally, we integrated useful data from the Pixmo dataset, covering tasks such as document analysis, caption generation, and counting. Overall, these scene images span a wide range of tasks, including caption generation, counting, document understanding, and mathematical reasoning.

In constructing document and chart/graphic datasets, we carefully selected high-quality data sources and performed quality cleaning to ensure the reliability of the data. It should be noted that the Docmatix dataset was included, as it contains multi-page and diverse documents, which are crucial for significantly enhancing the model’s ability to understand and process long, complex document structures and contents.

For OCR data, we considered two common scenarios in the real world: development scenarios and natural scenes. For development scenarios, we used the MultiUI dataset to activate the model’s ability to understand and process text in user interfaces.

For natural scenes, we utilized the Laion-OCR dataset to build additional instruction-tuning data. The OCR instruction-tuning data covers the following five sub-tasks (a sketch of how such samples might be constructed follows the list):

1) Text presence detection: Determine whether specific text exists in the image.

2) Text localization: Locate specific text in the image and output its bounding box.

3) Text recognition within the bounding box: Given a bounding box, recognize the text it contains.

4) Text comparison between images: Given two images, determine which image contains the specified text.

5) Comprehensive text detection and recognition: Detect and recognize all text present in the image.
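
The sketch below shows how such samples might be constructed from a Laion-OCR annotation; the question and answer wording is a hypothetical template, not the paper's exact prompts.

```python
# Hypothetical templates for the OCR sub-tasks, built from one image's Laion-OCR
# annotation: a list of (text, bounding_box) pairs.

def build_ocr_samples(ocr_items):
    text, box = ocr_items[0]
    samples = [
        {"task": "presence",     "q": f'Does the text "{text}" appear in the image?', "a": "Yes."},
        {"task": "localization", "q": f'Locate the text "{text}" and give its bounding box.', "a": str(box)},
        {"task": "recognition",  "q": f"What text is inside the box {box}?", "a": text},
        {"task": "full_ocr",     "q": "Detect and recognize all text in the image.",
         "a": "; ".join(t for t, _ in ocr_items)},
    ]
    # The cross-image comparison sub-task would pair annotations from two images
    # and is omitted here for brevity.
    return samples
```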

For localization images, we selected data from mature datasets like RefCOCO and VCR, which focus on tasks that combine visual elements with specific textual descriptions.

For multi-image scenes, we utilized the Demon-Full and Contrastive-Caption datasets. The Demon-Full dataset is particularly valuable, as it includes various multi-image tasks, such as comparing the differences between two images, generating a caption for the last panel of a comic strip, completing missing text in partially occluded images, and determining whether multiple images belong to the same category.

These tasks help the model handle complex scenes involving multiple images, providing a more comprehensive understanding of how to interpret visual information across a series of related images. At the same time, this multi-image data further enhances the model’s video understanding capabilities.

The video data used in this stage includes commonly used high-quality video caption datasets, as well as a small amount of question-answer data. Additionally, we supplemented this with high-quality data from VideoLLaMA2 and internal temporal localization data.

This internal temporal localization data particularly focuses on the temporal relationships between video frames, enabling the model to grasp the order of events and understand the flow of actions over time. These combined data sources provide the model with stronger and more detailed video understanding capabilities.

Video-Centric Fine-tuning

Table 4: Data Mixing in Video-Centric Fine-tuning Stage

The video-centric fine-tuning stage aims to adjust VideoLLaMA3 to be a video expert and fully unleash its video understanding capabilities by primarily focusing on large-scale and high-quality video instruction following.

We first collected general captions, questions, and answers from multiple open-source video datasets, including LLaVA-Video, ShareGPT-4o, FineVideo, CinePile, ShareGemini, VideoGPT+, and VideoRefer.

These approximately 2.7 million video-centric dialogues ultimately form a dataset that covers various scenes and tasks, serving as examples to teach the model how to understand complex dynamic and static content in videos.

Additionally, we expanded the data scale and strengthened the model by synthesizing dense captions and questions on specific aspects. Specifically, following the process proposed in previous work, we first filtered 68K dynamic videos from the Panda-70M dataset using optical flow, and then used Qwen2-VL-72B to generate diverse dense captions and questions for each video covering temporal understanding, spatial understanding, object description, and temporal order. Ultimately, 242K question-answer pairs are used for training.

In addition to the regular video-centric dialogues, we also introduced the capabilities of streaming video understanding and temporal localization to expand our model’s application scenarios.

For streaming video understanding, we obtained data from ActivityNet, YouCook2, and Ego4D, and organized the video frames and multiple temporally dense captions in an interleaved manner, as described in the Data Formats section above, aiming to enhance the understanding of fine-grained events in videos and to support multi-turn dialogue over streaming video.

As these videos are often long, we cut them into segments of up to two minutes based on the time intervals of the dense captions, removing segments whose captions are excessively dense or sparse.

Synthetic streaming dialogues from VideoLLM-Online are also included. For temporal localization, we collected 205K samples from datasets including ActivityNet, YouCook2, ViTT, QuerYD, HiREST, Charades-STA, Moment-10M, and COIN, and directly converted the localization annotations into textual forms such as “1.0-2.0 seconds” for training.
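
A minimal sketch of this conversion is shown below; the question wording is an assumption, while the "x.x-y.y seconds" answer format follows the text above.

```python
# Convert a temporal-localization annotation into a textual QA pair.
def grounding_to_text(query, start_s, end_s):
    question = f"When does the following event happen in the video? {query}"
    answer = f"{start_s:.1f}-{end_s:.1f} seconds"
    return question, answer

# Example: grounding_to_text("The chef flips the pancake.", 1.0, 2.0)
# -> ("When does the following event happen in the video? The chef flips the pancake.",
#     "1.0-2.0 seconds")
```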

Finally, we utilized a certain amount of image-only and text-only data from LLaVA, LLaVA-OneVision, Magpie, and Tulu 3 to mitigate catastrophic forgetting of the model's existing capabilities.

Implementation Details

In this section, we briefly introduce the implementation details of each training stage. For all stages, we adopt a cosine learning rate scheduler. The warmup ratio of the learning rate is set to 0.03. The maximum token length is set to 16384, while the maximum length of visual tokens is set to 10240.
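
A sketch of such a schedule is given below, assuming PyTorch's LambdaLR and a zero learning-rate floor (the floor is not stated in the text): linear warmup over the first 3% of steps, then cosine decay.

```python
# Cosine learning-rate schedule with a 3% linear warmup.
import math
import torch

def cosine_schedule_with_warmup(optimizer, total_steps, warmup_ratio=0.03):
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```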

In the visual encoder adaptation stage, when training VideoLLaMA3-2B, we initialize the visual encoder with the pre-trained weights of SigLIP and the LLM with the pre-trained weights of Qwen2.5-1.5B. For VideoLLaMA3-7B, the visual encoder is initialized with the fine-tuned SigLIP weights from VideoLLaMA3-2B, while the LLM is initialized with Qwen2.5-7B.

The projector is implemented as a two-layer MLP with a GELU activation. In this stage, only the visual encoder and projector are trained, each with its own learning rate; in the remaining stages, the LLM, projector, and visual encoder are likewise assigned separate learning rates. The differential frame pruner is applied to video data in the multi-task fine-tuning and video-centric fine-tuning stages, with the threshold for discarding similar visual tokens set to 0.1.
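
The projector as described could look like the following sketch (using a hidden size equal to the LLM dimension, which is an assumption):

```python
# Two-layer MLP projector with GELU, mapping visual features to the LLM embedding space.
import torch.nn as nn

class Projector(nn.Module):
    def __init__(self, vision_dim, llm_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens):          # (batch, num_tokens, vision_dim)
        return self.mlp(visual_tokens)         # (batch, num_tokens, llm_dim)
```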

To limit the context length, visual tokens are downsampled by a factor of 2 in the visual encoder through bilinear interpolation. Image tokens are only downsampled in the video-centric fine-tuning stage, to align with the video data. When loading video data, we first sample frames at a rate of one frame per second using FFmpeg. If the total number of frames exceeds 180, the frames are further sampled evenly down to that cap, which accommodates most videos of no more than 3 minutes.
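
The sampling rule can be sketched as index selection (decoding with FFmpeg itself is omitted; the rounding strategy is an assumption):

```python
# Sample at 1 fps, then evenly re-sample down to max_frames if the video is too long.
import numpy as np

def select_frame_indices(duration_s, fps=1.0, max_frames=180):
    num = int(duration_s * fps)
    if num <= max_frames:
        return list(range(num))
    # Evenly spaced indices when the video exceeds roughly 3 minutes.
    return np.linspace(0, num - 1, max_frames).round().astype(int).tolist()
```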

Table 5: Evaluation Results of 2B Model on Image Benchmarks

Experiments

Image-Based Evaluation

The evaluation results of the 2B model are shown in Table 5. Compared to previous models, VideoLLaMA3 demonstrates significant improvements across a range of tasks. For example, on document understanding benchmarks such as InfoVQA, VideoLLaMA3 achieved a score of 69.4%, compared to the previous best of 65.5%.

In mathematical reasoning tasks like MathVista, our 2B model scored 59.2%, exceeding the previous state-of-the-art method by 7.9%. On multi-discipline knowledge benchmarks like MMMU-Pro, VideoLLaMA3 outperformed the previous best method by 2.6%.

In real-world knowledge question-answering tasks like RealWorldQA, VideoLLaMA3 leads with a top score of 67.3%, compared to the previous method’s 62.9%.

Table 6: Evaluation Results of 7B Model on Image Benchmarks

Similarly, we evaluated our larger 7B model on various image benchmarks, with results summarized in Table 6.

From the table, it is evident that VideoLLaMA3 consistently outperforms previous models across most benchmarks. Notably, in mathematical reasoning, our 7B model surpassed the previous best score on MathVision by 6.5%.

In chart understanding tasks, we observed a performance improvement of 1.3% over previous methods on InfoVQA. Additionally, in general reasoning tasks like RealWorldQA, VideoLLaMA3 outperformed previous models by 2.0%.

Overall, the results confirm that VideoLLaMA3 provides sustained improvements across a wide range of benchmark tests, demonstrating its effectiveness and versatility in handling complex tasks, including OCR, mathematical reasoning, and general knowledge.

These improvements make VideoLLaMA3 a powerful tool in practical applications, advancing the field of multimodal learning.

Table 7: Evaluation Results of 2B Model on Video Benchmarks

Video-Based Evaluation

Table 7 reports the video understanding performance of the 2B model. VideoLLaMA3 consistently shows competitive results and outperforms baseline methods.

In general video understanding, VideoLLaMA3 achieved top scores on VideoMME w/o sub (59.6%), VideoMME w/ sub (63.4%), ActivityNet-QA (58.2%), PerceptionTest-test (68.0%), and MMVU (37.6%).

On MVBench, it ranks second (65.5%), slightly behind InternVL2.5-2B (68.8%). For long video understanding, VideoLLaMA3 performed best across all benchmarks: MLVU-dev (65.4%), LongVideoBench-val (57.1%), and LVBench (40.4%), showcasing its ability to handle long video content.

In temporal reasoning, VideoLLaMA3 leads on TempCompass (63.4%), NextQA (81.1%), and Charades-STA (55.5%).

Compared to Apollo-2B, InternVL2.5-2B, and Qwen2-VL-2B, VideoLLaMA3 not only maintains a leading position on most benchmarks but also shows consistent advantages in tasks requiring comprehensive, long-term video understanding, reinforcing its strong capabilities across diverse video-related tasks.

Table 8: Evaluation Results of 7B Model on Video Benchmarks

Regarding the VideoLLaMA3-7B model, the results are shown in Table 8. At the 7B model size, VideoLLaMA3-7B still demonstrates competitive results.

In general video understanding, it leads in 5 out of 7 benchmark tests, including VideoMME w/o sub, VideoMME w/ sub, PerceptionTest-test, and ActivityNet-QA.

On MVBench, it also achieved results comparable to InternVL2.5-8B. For long video understanding, VideoLLaMA3-7B scored the highest on MLVU-dev and achieved the second-best results on LongVideoBench-val and LVBench.

Ablation Study of Visual Encoder

Table 9: Ablation Study Results of Different Visual Encoders

In multimodal large language models (MLLMs), the embeddings produced by the pre-trained visual encoder must align with the embedding space of the large language model (LLM). Therefore, the representation quality of the visual encoder is crucial to the final performance of MLLMs.

In this work, we investigate the impact of different visual encoders. Specifically, we compare three pre-trained transformer-based visual encoders: CLIP, DFN, and SigLIP. Due to computational constraints, we conducted the study on a subset of the entire dataset.

Additionally, to investigate the performance of the original pre-trained weights, we fixed the weights of the visual encoder and kept the visual input at a fixed resolution, which is the same as the pre-trained resolution of the visual encoder (CLIP at 336×336, DFN at 378×378, SigLIP at 384×384). The training was divided into three stages:

1) Training the projector using LLaVA-Pretrain-558K;

2) Adjusting all parameters using our re-annotated COYO data;

3) Performing SFT using LLaVA-SFT-665K.

The comparison results are shown in Table 9. SigLIP outperformed the other two visual encoders in fine-grained understanding tasks involving text. Based on this ablation study, we chose the pre-trained SigLIP as our base visual encoder and then adapted it to accept dynamic resolutions as input.
