Qwen2.5-VL: Alibaba’s Latest Open Source Visual Language Model

🚀 Quick Read

  1. Model Introduction: Qwen2.5-VL is the flagship open-source visual language model from Alibaba’s Tongyi Qianwen team, available in three different sizes: 3B, 7B, and 72B.
  2. Main Features: Supports visual understanding, long video processing, structured output, and device operation.
  3. Technical Principles: Utilizes a series structure of ViT and Qwen2, supports multi-modal rotary position encoding (M-ROPE), and recognizes images at any resolution.

Content (With Running Example)

What is Qwen2.5-VL


Qwen2.5-VL is the flagship open-source visual language model from Alibaba’s Tongyi Qianwen team, available in three different sizes: 3B, 7B, and 72B. This model excels in visual understanding, capable of recognizing common objects and analyzing text, charts, and other elements within images.

Qwen2.5-VL has the capability to act as a visual agent, reasoning and dynamically using tools, with initial abilities to operate computers and mobile phones. In video processing, Qwen2.5-VL can understand long videos exceeding one hour, accurately locating relevant segments to capture events. The model also supports structured output for data such as invoices and forms.

Qwen2.5-VL performs strongly across multiple benchmarks, with clear advantages in document and chart understanding; the 7B model surpasses GPT-4o-mini on several tasks. Its release gives developers a powerful tool for a wide range of application scenarios.

Main Features of Qwen2.5-VL

  • Visual Understanding: Can recognize common objects such as flowers, birds, fish, and insects, and analyze text, charts, icons, graphics, and layouts within images.
  • Visual Agent Capability: Can directly act as a visual agent, reasoning and dynamically using tools, with initial capabilities to use computers and mobile phones.
  • Understanding Long Videos and Capturing Events: Can understand videos longer than one hour, accurately locating relevant video segments to capture events.
  • Visual Localization: Can accurately locate objects in images by generating bounding boxes or points, providing stable JSON output for coordinates and attributes (see the prompt sketch after this list).
  • Structured Output: Supports structured output of content for invoices, forms, tables, etc.
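
To make visual localization and structured output concrete, here is a minimal prompt sketch that asks the model to return bounding boxes as JSON. The prompt wording is an illustrative assumption (the image URL is the demo image used later in this article); run these messages through the load-and-generate pipeline shown in the "How to Run Qwen2.5-VL" section below.

# Prompt sketch for visual localization with structured JSON output.
# The prompt text is an illustrative assumption; plug these messages into
# the inference pipeline from the "How to Run Qwen2.5-VL" section below.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {
                "type": "text",
                "text": (
                    "Locate every person in the image and return a JSON list where "
                    'each item has "bbox_2d": [x1, y1, x2, y2] and "label".'
                ),
            },
        ],
    }
]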

Technical Principles of Qwen2.5-VL

  • Model Structure: Qwen2.5-VL keeps the ViT-plus-Qwen2 architecture of the previous-generation Qwen-VL series, with all three sizes using a 600M-scale ViT and a unified input pipeline for images and videos. This lets the model integrate visual and language information more tightly and improves its understanding of multi-modal data.
  • Multi-Modal Rotary Position Encoding (M-ROPE): M-ROPE decomposes rotary position encoding into three components: temporal, height, and width. This lets the language model capture and integrate positional information from one-dimensional text, two-dimensional images, and three-dimensional video within a single scheme, giving it strong multi-modal processing and reasoning capabilities.
  • Any Resolution Image Recognition: Qwen2.5-VL can understand images of different resolutions and aspect ratios, regardless of their clarity or size. Through naive dynamic resolution support, it maps an image of any resolution into a dynamic number of visual tokens, keeping the model input consistent with the information in the image (see the processor sketch after this list).
  • Simplified Network Structure: Compared to Qwen2-VL, Qwen2.5-VL enhances the model’s perception of temporal and spatial scales, further simplifying the network structure to improve model efficiency.
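
The dynamic-resolution behavior can be steered from the processor. Below is a minimal sketch, assuming the min_pixels / max_pixels arguments described in the Qwen2.5-VL repository; the concrete values are illustrative, with each visual token corresponding to roughly a 28x28 pixel region after patch merging.

from transformers import AutoProcessor

# Bound how many visual tokens an image may consume.
# Each token covers roughly a 28x28 pixel area, so these values
# (illustrative assumptions) cap the range at ~256 to ~1280 tokens.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)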

How to Run Qwen2.5-VL

1. Install Dependencies

First, ensure that the necessary dependency libraries are installed:

pip install git+https://github.com/huggingface/transformers accelerate
pip install qwen-vl-utils[decord]

If you are not on Linux, you may not be able to install decord; in that case, run pip install qwen-vl-utils (without the decord extra) to fall back to torchvision for video processing. If you still want decord, you can install it from source.

2. Load the Model

Load the model and prepare for inference:

from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# Load the processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Perform inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
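
The same pipeline also handles video input. Below is a minimal sketch that reuses the model and processor loaded above; the local video path and fps value are placeholder assumptions.

# Video inference sketch: reuse the model and processor loaded above and
# only change the messages. The file path and fps are placeholder assumptions.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)
print(output_text)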

Resources

  • Official Project Website: https://qwenlm.github.io/blog/qwen2.5-vl
  • GitHub Repository: https://github.com/QwenLM/Qwen2.5-VL
  • Qianwen Model Online Experience: https://chat.qwenlm.ai/

❤️ If you are interested in the current state of AI development and in building AI applications, I share the latest open-source projects and applications in the large-model and AI space every day, with running examples and hands-on tutorials to help you get started with AI quickly. Welcome to follow me!
