VideoLLaMA3: Advanced Multimodal Foundation Model

01 Introduction

VideoLLaMA3 is a more advanced multimodal foundation model for image and video understanding. Its core design philosophy is vision-centric:

  • Vision-centric training paradigm

  • Vision-centric framework design.

The key point of the vision-centric training paradigm is that high-quality image-text data is crucial for understanding both images and videos. Instead of assembling massive video-text datasets, VideoLLaMA3 focuses on building large-scale, high-quality image-text datasets.

Training is divided into four stages (a configuration sketch follows the list):

1) Vision-centric calibration stage, warming up the visual encoder and projector;

2) Vision-language pre-training stage, jointly tuning the visual encoder, projector, and LLM on a large-scale image-text dataset covering various types (including scene images, documents, and charts) together with pure text data;

3) Multi-task fine-tuning stage, incorporating image-text SFT data for downstream tasks as well as video-text data, laying the foundation for video understanding;

4) Video-centric fine-tuning, further enhancing the model’s video understanding capabilities.
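As a rough sketch under stated assumptions, this schedule could be written as a simple configuration. The stage names mirror the list above; the trainable-module choices for stages 3 and 4 are assumptions, since the text only specifies what is tuned in the first two stages.

# Illustrative four-stage schedule; the "trainable" entries for stages 3-4 are assumptions.
TRAINING_STAGES = [
    {"stage": "vision_centric_calibration",
     "trainable": ["vision_encoder", "projector"],         # warm-up
     "data": ["image_text"]},
    {"stage": "vision_language_pretraining",
     "trainable": ["vision_encoder", "projector", "llm"],  # joint tuning
     "data": ["scene_images", "documents", "charts", "pure_text"]},
    {"stage": "multi_task_finetuning",
     "trainable": ["vision_encoder", "projector", "llm"],  # assumption
     "data": ["image_text_sft", "video_text"]},
    {"stage": "video_centric_finetuning",
     "trainable": ["vision_encoder", "projector", "llm"],  # assumption
     "data": ["video_text_sft"]},
]

for cfg in TRAINING_STAGES:
    print(f"{cfg['stage']}: tune {', '.join(cfg['trainable'])} on {', '.join(cfg['data'])}")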

In the framework design, to better capture fine-grained details in images, the pre-trained visual encoder encodes images of different sizes into a correspondingly varying number of visual tokens, rather than a fixed number. For video input, the number of visual tokens is reduced based on the similarity between adjacent frames, making the video representation more precise and compact. (This is interesting: pruning tokens inevitably loses some information, yet the results are better??)

Contributions:

  • Proposed a vision-centric training paradigm that improves video understanding through large-scale image-understanding pre-training.

  • Proposed two vision-centric framework designs to better represent images and videos with the visual encoder.

02 Method
  • Framework
The language model is based on Qwen2, and the visual encoder is based on siglip-so400m-patch14-384 (VideoLLaMA3-7B).
VideoLLaMA3 has two key technical points:
Arbitrary Resolution Visual Tokenization (AVT): AVT converts images or videos of arbitrary resolution into 1-D token sequences, accommodating varying numbers of input images and videos at different resolutions and enabling more flexible visual input;
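A minimal sketch of the idea, assuming a ViT-style patch size of 14 (matching siglip-so400m-patch14-384): the token count follows from the input resolution instead of being fixed, and the patches are flattened into a single 1-D sequence. The function name, the crop-to-a-multiple-of-14 policy, and the raw-pixel "tokens" are illustrative stand-ins for the real encoder, not the official implementation.

import torch

def avt_tokenize(image: torch.Tensor, patch_size: int = 14) -> torch.Tensor:
    """Split an image of arbitrary resolution into patch_size x patch_size patches
    and flatten them into a 1-D token sequence; the token count scales with resolution."""
    c, h, w = image.shape
    h, w = h - h % patch_size, w - w % patch_size  # crop to a multiple of the patch size (assumed policy)
    patches = image[:, :h, :w].unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/p, W/p, p, p) -> (H/p * W/p, C * p * p): one flattened token per patch
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

# floor(384/14)^2 = 27 * 27 = 729 tokens, while a 644 x 448 input gives 46 * 32 = 1472 tokens.
print(avt_tokenize(torch.rand(3, 384, 384)).shape)  # torch.Size([729, 588])
print(avt_tokenize(torch.rand(3, 644, 448)).shape)  # torch.Size([1472, 588])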
Differential Frame Pruner (DiffFP): acting as a video compressor, DiffFP prunes video content where the differences between adjacent frames are minimal. This removes redundancy and improves video processing efficiency, especially for long videos.
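A minimal sketch of the pruning idea, under the assumption that redundancy is measured as the mean absolute pixel difference between a patch and the co-located patch in the previous frame; the 0.1 threshold, function name, and interface are illustrative, not the paper's exact compressor.

import torch

def diff_frame_prune(frames: torch.Tensor, patch_size: int = 14, threshold: float = 0.1):
    """Keep all patches of the first frame, then drop patches whose mean absolute
    difference from the same patch in the previous frame falls below `threshold`
    (pixel values assumed in [0, 1]; frame size assumed divisible by patch_size)."""
    t, c, _, _ = frames.shape
    # Split every frame into non-overlapping patches: (T, N, C * p * p)
    patches = frames.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(t, -1, c * patch_size * patch_size)

    kept = [patches[0]]                                          # first frame is kept in full
    for i in range(1, t):
        diff = (patches[i] - patches[i - 1]).abs().mean(dim=-1)  # per-patch difference
        kept.append(patches[i][diff >= threshold])               # keep only changed patches
    return kept

video = torch.rand(8, 3, 336, 336)   # 8 frames; 336 is divisible by 14
pruned = diff_frame_prune(video)
print([p.shape[0] for p in pruned])  # tokens kept per frame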
Compared to VideoLLaMA2, the VideoLLaMA3 framework seems to drop the audio modality input??
  • Training Stages
First Stage: Training the Visual Encoder
Second Stage: Vision-Language Alignment
Third Stage: Multi-task Fine-tuning
Fourth Stage: Video-centric Fine-tuning
03 Experimental Results
Using it is also very smooth; go try it now:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Load the model in bfloat16 with FlashAttention-2.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map={"": device},
    torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# A video (sampled at 1 fps, at most 128 frames) plus a text question.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 128}},
        {"type": "text", "text": "What is the cat doing?"},
    ]},
]

inputs = processor(conversation=conversation, add_system_prompt=True,
                   add_generation_prompt=True, return_tensors="pt")
# Move tensors to the GPU and cast pixel values to bfloat16 to match the model.
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
