VideoLLaMA3: Long Video Understanding with AVT and DiffFP

Researchers from Alibaba have proposed VideoLLaMA3, a multimodal foundation model that combines Any-resolution Vision Tokenization (AVT) and a Differential Frame Pruner (DiffFP) to handle both image and video understanding. It achieves strong results on image and video benchmarks, excelling in particular at long-video understanding and temporal reasoning. Future research directions include improving the quality of video-text datasets, optimizing real-time performance, and integrating audio and other modalities.

Paper Introduction

Progress in multimodal intelligence depends on processing and understanding both images and videos. Images capture static scenes, providing details about objects, text, and spatial relationships. Video understanding is considerably harder: it requires tracking changes over time while maintaining consistency across frames, which demands handling of dynamic content and temporal relationships. Moreover, collecting and annotating video-text datasets is far more difficult than building image-text datasets, which makes these tasks even more challenging.

Traditional MLLM approaches struggle with video understanding. Sparse frame sampling, basic connectors, and image-based encoders fail to effectively capture temporal dependencies and dynamic content. Token compression and extended context windows struggle with the complexity of long videos, and the integration of audio and visual inputs often lacks seamless interaction. Real-time processing and model scaling remain inefficient, and existing architectures are not optimized for long-video tasks.

To address these challenges, researchers from Alibaba Group proposed the VideoLLaMA3 framework, which combines Any-resolution Vision Tokenization (AVT) with a Differential Frame Pruner (DiffFP). AVT improves on traditional fixed-resolution tokenization: a ViT-based encoder with 2D-RoPE provides flexible positional embeddings, so the vision encoder can handle variable resolutions dynamically and lose less information. DiffFP then reduces redundancy in long video token sequences by pruning frames that differ minimally from their neighbors, measured by the 1-norm distance between patches, so that important information is retained. Together, dynamic-resolution processing and efficient token reduction improve representation quality while lowering computational cost.
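
To make the pruning step concrete, here is a minimal sketch of DiffFP-style frame pruning based on the description above: consecutive frames are compared patch by patch with a 1-norm (mean absolute) distance, and a frame is dropped when its distance to the last kept frame falls below a threshold. The function name, patch size, and threshold are illustrative assumptions, not values taken from the paper.

```python
import torch

def prune_redundant_frames(frames: torch.Tensor, patch: int = 14, threshold: float = 0.1) -> torch.Tensor:
    """Drop frames that barely differ from the last kept frame (DiffFP-style sketch).

    frames: (T, C, H, W) tensor with values in [0, 1]; H and W divisible by `patch`.
    patch: patch size used to pool the 1-norm distance (illustrative assumption).
    threshold: minimum mean patch distance needed to keep a frame (illustrative assumption).
    """
    kept = [0]  # always keep the first frame
    for t in range(1, frames.shape[0]):
        # Absolute (1-norm) difference to the most recently kept frame.
        diff = (frames[t] - frames[kept[-1]]).abs()
        # Pool the difference into non-overlapping patches: (C, H/p, W/p, p, p).
        patches = diff.unfold(1, patch, patch).unfold(2, patch, patch)
        patch_dist = patches.mean(dim=(0, 3, 4))  # mean distance per spatial patch
        if patch_dist.mean().item() > threshold:
            kept.append(t)
    return frames[kept]

# Example: a 64-frame clip at 336x336; redundant frames are removed before tokenization.
video = torch.rand(64, 3, 336, 336)
print(prune_redundant_frames(video).shape)
```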

The model consists of a vision encoder, a video compressor, a projector, and an LLM. The vision encoder, initialized from the pre-trained SigLIP model, extracts visual tokens; the video compressor reduces the number of video token representations; and the projector connects the vision encoder to the LLM, which uses Qwen2.5. Training proceeds in four stages: the first three focus on image understanding, while the last integrates temporal information to strengthen video understanding. Training data spans scene images, documents, charts, fine-grained images, and video data, ensuring comprehensive multimodal coverage.

  • Vision Encoder Adaptation: fine-tunes the SigLIP-initialized vision encoder on a large image dataset, enabling it to handle images of different resolutions.
  • Vision-Language Alignment: introduces multimodal knowledge, making both the LLM and the vision encoder trainable to integrate visual and language understanding.
  • Multi-task Fine-tuning: performs instruction fine-tuning on multimodal question-answer data, including image and video questions, improving instruction following and the handling of temporal information.
  • Video-centric Fine-tuning: unfreezes all parameters to enhance the model's video understanding capabilities.
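
For orientation, the following is a minimal, hypothetical skeleton of how the four components described above could be wired together in a forward pass. The module interfaces, tensor shapes, and class name are assumptions made for illustration; they are not the released model code.

```python
import torch
import torch.nn as nn

class VideoLLaMA3Sketch(nn.Module):
    """Illustrative composition: vision encoder -> video compressor -> projector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, video_compressor: nn.Module,
                 projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder      # SigLIP-initialized ViT with 2D-RoPE (AVT)
        self.video_compressor = video_compressor  # reduces video token count (e.g. DiffFP)
        self.projector = projector                # maps vision tokens into the LLM embedding space
        self.llm = llm                            # Qwen2.5 backbone

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W) video frames; text_embeds: (1, L, D) embedded prompt tokens.
        vision_tokens = self.vision_encoder(frames)           # (T, N, D) tokens per frame
        vision_tokens = self.video_compressor(vision_tokens)  # fewer tokens for long videos
        vision_embeds = self.projector(vision_tokens)         # align to LLM hidden size D
        # Flatten frame tokens into one sequence and prepend them to the text embeddings.
        multimodal = torch.cat([vision_embeds.flatten(0, 1).unsqueeze(0), text_embeds], dim=1)
        return self.llm(multimodal)
```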

The researchers evaluated VideoLLaMA3 on both image and video tasks. On image-based benchmarks covering document understanding, mathematical reasoning, and multi-image understanding, the model outperforms previous models and shows gains in chart understanding and real-world knowledge QA. On video benchmarks such as VideoMME and MVBench, VideoLLaMA3 performs strongly in general video understanding, long-video understanding, and temporal reasoning. Both the 2B and 7B variants are highly competitive, with the 7B model leading on most video tasks, underscoring the model's effectiveness across multimodal tasks. Other reported areas of notable improvement include OCR, mathematical reasoning, multi-image understanding, and long-video understanding.

In summary, VideoLLaMA3 advances vision-centric multimodal modeling, providing a strong framework for understanding both images and videos. By leveraging high-quality image-text datasets, it addresses the challenges of video understanding and temporal dynamics and achieves excellent results across a range of benchmarks. Challenges remain, however, in the quality of video-text datasets and in real-time processing. Future research could improve video-text datasets, optimize real-time performance, and integrate additional modalities such as audio and speech. This work can serve as a baseline for future advances in multimodal understanding, improving efficiency, generalization, and integration.

Paper Download

  • Paper Link: https://arxiv.org/abs/2501.13106
  • GitHub Link: https://github.com/DAMO-NLP-SG/VideoLLaMA3
